animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both.
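As a reading aid, the sketch below parses a score tag from this listing into a pair of numbers and averages the tags that carry scores. The names `parse_tag` and `mean_scores` are hypothetical, the accepted tag formats are assumed from how the tags appear here, and the dashboard does not say whether or how it aggregates scores, so the averaging is an illustration only.

```python
import re
from statistics import mean
from typing import Iterable, Optional, Tuple

# Assumed tag shapes, taken from this listing: "A 0.92T 0.34" or
# "A 0.92, T 0.34". A bare "A T" marks a model that produced no output.
_TAG = re.compile(r"A\s*(\d+(?:\.\d+)?)\s*,?\s*T\s*(\d+(?:\.\d+)?)")


def parse_tag(tag: str) -> Optional[Tuple[float, float]]:
    """Return (appearance, temporal), or None when the tag has no scores."""
    match = _TAG.fullmatch(tag.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def mean_scores(tags: Iterable[str]) -> Optional[Tuple[float, float]]:
    """Average A and T over the scored tags; None if no tag has scores."""
    scored = [s for s in map(parse_tag, tags) if s is not None]
    if not scored:
        return None
    return mean(a for a, _ in scored), mean(t for _, t in scored)
```

Skipping unscored tags rather than counting them as zero is a design choice of this sketch; treating "no output" as a score of 0 would lower a model's average and may be the fairer comparison in some settings.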

17–24 of 214

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.92, T 0.34
Claude Sonnet 4.6: A 0.86, T 0.30
LLaMA 4 Scout: A 0.63, T 0.23

model outputs

GPT-5.4: A 0.70, T 0.24
Claude Sonnet 4.6: A 0.93, T 0.27
LLaMA 4 Scout: A 0.36, T 0.00

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.93, T 0.39
Claude Sonnet 4.6: A 0.80, T 0.22
LLaMA 4 Scout: A 0.79, T 0.19

ground truth

Rotating text

model outputs

Gemini 3 Flash Preview: no output
GPT-5.4: A 0.92, T 0.38
Claude Sonnet 4.6: no output
LLaMA 4 Scout: A 0.86, T 0.25

model outputs

GPT-5.4: A 0.76, T 0.31
Claude Sonnet 4.6: A 0.87, T 0.28
LLaMA 4 Scout: A 0.55, T 0.14

model outputs

GPT-5.4: A 0.84, T 0.31
Claude Sonnet 4.6: A 0.84, T 0.50
LLaMA 4 Scout: A 0.48, T 0.00

model outputs

GPT-5.4: A 0.90, T 0.31
Claude Sonnet 4.6: A 0.85, T 0.33
LLaMA 4 Scout: A 0.75, T 0.39

model outputs

GPT-5.4: A 0.87, T 0.31
Claude Sonnet 4.6: A 0.85, T 0.34
LLaMA 4 Scout: A 0.70, T 0.45