animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both.

41-48 of 214

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.94, T 0.43
Claude Sonnet 4.6: A 0.85, T 0.52
LLaMA 4 Scout: A 0.40, T 0.30

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.61, T 0.41
Claude Sonnet 4.6: A 0.71, T 0.62
LLaMA 4 Scout: A 0.29, T 0.00

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.63, T 0.33
Claude Sonnet 4.6: A 0.78, T 0.35
LLaMA 4 Scout: A 0.39, T 0.22

model outputs

GPT-5.4: A 0.65, T 0.29
Claude Sonnet 4.6: A 0.71, T 0.30
LLaMA 4 Scout: no output

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.83, T 0.28
Claude Sonnet 4.6: A 0.77, T 0.22
LLaMA 4 Scout: A 0.49, T 0.20

model outputs

Qwen3-VL-8B-Instruct: no output
GPT-5.4: A 0.83, T 0.43
Claude Sonnet 4.6: A 0.63, T 0.33
LLaMA 4 Scout: A 0.14, T 0.00

ground truth

Only CSS: Cry Baby

model outputs

GPT-5.4: A 0.74, T 0.22
Claude Sonnet 4.6: A 0.61, T 0.21
LLaMA 4 Scout: A 0.31, T 0.00

ground truth

Only CSS: Peacock

model outputs

GPT-5.4: A 0.58, T 0.18
Claude Sonnet 4.6: A 0.48, T 0.15
LLaMA 4 Scout: A 0.42, T 0.17
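
For readers who want to tabulate the A/T tags listed above, the following is a minimal Python sketch, not the benchmark's own tooling. It assumes only the tag shapes visible on this page: an A value and a T value, with or without a separator between them, or a bare "A T" when a model produced no output. The names `parse_tag` and `mean_scores` are hypothetical helpers introduced for illustration, and averaging is just one possible way to summarize the scores.

```python
import re
from statistics import mean
from typing import Iterable, Optional, Tuple

# Assumed tag shapes: "A 0.94T 0.43", "A 0.94 T 0.43", "A 0.94, T 0.43",
# or an empty "A T" when no output was produced.
_NUM = r"(\d+(?:\.\d+)?)"
TAG = re.compile(r"A\s*" + _NUM + r"?\s*,?\s*T\s*" + _NUM + r"?")


def parse_tag(tag: str) -> Optional[Tuple[float, float]]:
    """Return (appearance, temporal), or None if either score is missing."""
    m = TAG.fullmatch(tag.strip())
    if m is None or m.group(1) is None or m.group(2) is None:
        return None
    return float(m.group(1)), float(m.group(2))


def mean_scores(tags: Iterable[str]) -> Optional[Tuple[float, float]]:
    """Average A and T over the tags that carry scores; skip empty ones."""
    scored = [s for s in (parse_tag(t) for t in tags) if s is not None]
    if not scored:
        return None
    return mean(a for a, _ in scored), mean(t for _, t in scored)


if __name__ == "__main__":
    # Tags copied from three entries on this page, plus one empty tag.
    print(mean_scores(["A 0.94T 0.43", "A 0.61T 0.41", "A 0.63T 0.33", "A T"]))
```

Skipping entries without scores, rather than counting them as zero, is a design choice in this sketch; treating "no output" as 0.00 would lower the averages for models that failed on some videos.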