animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with two scores: A (appearance similarity) and T (temporal similarity). Higher is better for both.
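For readers who want to work with the scores outside the dashboard, the A/T tags under each output can be read programmatically. The following is a minimal sketch, not part of the benchmark itself: the helper name `parse_tag` and the regular expression are assumptions based only on how the tags appear on this page. It tolerates a missing separator between the two scores and returns `None` for a score that is absent, as happens when a model produced no output.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern inferred from the tags on this page:
# "A <number>" followed by "T <number>", where either number may be missing
# and the separator between the two parts is optional.
_TAG = re.compile(r"A\s*(\d*\.?\d+)?\s*,?\s*T\s*(\d*\.?\d+)?")


def parse_tag(tag: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (appearance, temporal) from a score tag, or None where absent."""
    match = _TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized score tag: {tag!r}")
    appearance, temporal = match.groups()
    return (
        float(appearance) if appearance else None,
        float(temporal) if temporal else None,
    )
```

Returning `None` rather than 0.0 for a missing score keeps "no output" distinguishable from a genuinely low similarity when averaging per model.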

ground truth

Nice spinny stuff

model outputs

GPT-5.4: A 0.89, T 0.28
Claude Sonnet 4.6: A 0.92, T 0.23
LLaMA 4 Scout: A 0.46, T 0.00

ground truth

Nice spinny stuff

model outputs

GPT-5.4: A 0.94, T 0.40
Claude Sonnet 4.6: A 0.94, T 0.27
LLaMA 4 Scout: A 0.91, T 0.13

ground truth

Nice spinny stuff

model outputs

GPT-5.4: A 0.96, T 0.28
Claude Sonnet 4.6: A 0.96, T 0.35
LLaMA 4 Scout: A 0.48, T 0.13

ground truth

Nice spinny stuff

model outputs

Qwen3-VL-8B-Instruct: no output (A and T not scored)
GPT-5.4: A 0.91, T 0.33
Claude Sonnet 4.6: A 0.77, T 0.31
LLaMA 4 Scout: A 0.54, T 0.33

model outputs

GPT-5.4: A 0.82, T 0.34
Claude Sonnet 4.6: A 0.79, T 0.33
LLaMA 4 Scout: A 0.61, T 0.27

model outputs

GPT-5.4: A 0.83, T 0.33
Claude Sonnet 4.6: A 0.80, T 0.31
LLaMA 4 Scout: A 0.44, T 0.05

model outputs

GPT-5.4: A 0.81, T 0.34
Claude Sonnet 4.6: A 0.81, T 0.33
LLaMA 4 Scout: A 0.63, T 0.37

model outputs

GPT-5.4: A 0.92, T 0.21
Claude Sonnet 4.6: A 0.81, T 0.27
LLaMA 4 Scout: A 0.55, T 0.28