animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with two scores: A (appearance similarity) and T (temporal similarity). Higher is better for both.

Examples 193 to 200 of 214

ground truth

Neon Loaders

model outputs

GPT-5.4: A 0.86, T 0.23
Claude Sonnet 4.6: A 0.72, T 0.27
LLaMA 4 Scout: A 0.40, T 0.00

model outputs

GPT-5.4: A 0.69, T 0.24
Claude Sonnet 4.6: A 0.66, T 0.17
LLaMA 4 Scout: A 0.55, T 0.23

model outputs

GPT-5.4: A 0.65, T 0.24
Claude Sonnet 4.6: A 0.76, T 0.23
LLaMA 4 Scout: A 0.62, T 0.28

model outputs

GPT-5.4: A 0.54, T 0.33
Claude Sonnet 4.6: A 0.61, T 0.24
LLaMA 4 Scout: A 0.44, T 0.26

model outputs

GPT-5.4: A 0.78, T 0.23
Claude Sonnet 4.6: A 0.84, T 0.25
LLaMA 4 Scout: A 0.48, T 0.29

model outputs

GPT-5.4: A 0.64, T 0.26
Claude Sonnet 4.6: A 0.74, T 0.26
LLaMA 4 Scout: A 0.43, T 0.11

model outputs

GPT-5.4: A 0.71, T 0.20
Claude Sonnet 4.6: A 0.59, T 0.27
LLaMA 4 Scout: A 0.64, T 0.21

model outputs

GPT-5.4: A 0.85, T 0.28
Claude Sonnet 4.6: A 0.88, T 0.24
LLaMA 4 Scout: A 0.39, T 0.24
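
To compare the models at a glance, the per-example scores can be averaged. The sketch below does this only for the eight groups of model outputs listed on this page, so the means say nothing about the full 214-example test set. The `SCORES` table and the `mean_scores` helper are hypothetical names introduced for illustration; they are not part of the benchmark's own tooling.

```python
from statistics import mean

# (A, T) pairs copied from the eight "model outputs" groups on this page,
# in the order they appear. This is a page-level sample, not the full test set.
SCORES = {
    "GPT-5.4": [
        (0.86, 0.23), (0.69, 0.24), (0.65, 0.24), (0.54, 0.33),
        (0.78, 0.23), (0.64, 0.26), (0.71, 0.20), (0.85, 0.28),
    ],
    "Claude Sonnet 4.6": [
        (0.72, 0.27), (0.66, 0.17), (0.76, 0.23), (0.61, 0.24),
        (0.84, 0.25), (0.74, 0.26), (0.59, 0.27), (0.88, 0.24),
    ],
    "LLaMA 4 Scout": [
        (0.40, 0.00), (0.55, 0.23), (0.62, 0.28), (0.44, 0.26),
        (0.48, 0.29), (0.43, 0.11), (0.64, 0.21), (0.39, 0.24),
    ],
}


def mean_scores(pairs):
    """Return (mean A, mean T) for a list of (A, T) score pairs."""
    a_vals, t_vals = zip(*pairs)
    return mean(a_vals), mean(t_vals)


if __name__ == "__main__":
    for model, pairs in SCORES.items():
        a, t = mean_scores(pairs)
        print(f"{model}: mean A = {a:.3f}, mean T = {t:.3f}")
```

With only eight examples per model, small differences between these means should be read as indicative at most.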