animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both.

113–120 of 214

ground truth

Simple Spinner

model outputs

GPT-5.4

A 0.91, T 0.21

Claude Sonnet 4.6

A 0.95, T 0.28

LLaMA 4 Scout

A 0.80, T 0.29

ground truth

Simple Spinner

model outputs

GPT-5.4

A 0.78, T 0.26

Claude Sonnet 4.6

A 0.83, T 0.25

LLaMA 4 Scout

A 0.60, T 0.17

ground truth

Animation

model outputs

GPT-5.4

A 0.88, T 0.30

Claude Sonnet 4.6

A 0.91, T 0.36

LLaMA 4 Scout

A 0.42, T 0.18

model outputs

GPT-5.4

A 0.90, T 0.22

Claude Sonnet 4.6

A 0.78, T 0.38

LLaMA 4 Scout

A 0.65, T 0.21

model outputs

GPT-5.4

A 0.86, T 0.22

Claude Sonnet 4.6

A 0.79, T 0.23

LLaMA 4 Scout

A 0.55, T 0.34

model outputs

GPT-5.4

A 0.89, T 0.30

Claude Sonnet 4.6

A 0.65, T 0.29

LLaMA 4 Scout

A 0.33, T 0.30

model outputs

GPT-5.4

A 0.84, T 0.36

Claude Sonnet 4.6

A 0.74, T 0.37

LLaMA 4 Scout

A 0.30, T 0.23

model outputs

Qwen3-VL-8B-Instruct

no output (A and T unavailable)

GPT-5.4

A 0.90, T 0.33

Claude Sonnet 4.6

A 0.60, T 0.16

LLaMA 4 Scout

A 0.19, T 0.00
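
To compare models across the listed examples, the per-output score tags can be averaged by model. The following is a minimal sketch, not part of the benchmark itself: the helper names `parse_tag` and `summarize` are hypothetical, and it assumes each tag is a string with an A value followed by a T value, with empty tags (as for a model that produced no output) skipped rather than counted as zero.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches a score tag such as "A 0.91T 0.21" or "A 0.91, T 0.21".
# The tag format is an assumption based on how scores appear in this listing.
TAG = re.compile(r"A\s*([0-9.]+)\s*,?\s*T\s*([0-9.]+)")


def parse_tag(tag):
    """Return (appearance, temporal) as floats, or None when the tag has no scores."""
    match = TAG.search(tag)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def summarize(rows):
    """Average A and T per model over (model, tag) pairs, skipping empty tags."""
    scores = defaultdict(list)
    for model, tag in rows:
        parsed = parse_tag(tag)
        if parsed is not None:
            scores[model].append(parsed)
    return {
        model: (mean(a for a, _ in vals), mean(t for _, t in vals))
        for model, vals in scores.items()
    }


if __name__ == "__main__":
    # Two rows copied from the listing, plus one empty tag.
    rows = [
        ("GPT-5.4", "A 0.91T 0.21"),
        ("GPT-5.4", "A 0.78T 0.26"),
        ("Qwen3-VL-8B-Instruct", "A T"),
    ]
    for model, (a, t) in summarize(rows).items():
        print(f"{model}: mean A {a:.3f}, mean T {t:.3f}")
```

Skipping empty tags keeps a missing output from dragging a model's mean toward zero; whether a missing output should instead be penalized is a scoring decision this sketch does not make.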