animation2code benchmark
For best compatibility, please view this dashboard in a Chrome browser.

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both. Click a video to inspect its code.

185-192 of 214

model outputs

GPT-5.4: A 0.83, T 0.24
Claude Sonnet 4.6: A 0.88, T 0.24
LLaMA 4 Scout: A 0.83, T 0.14

model outputs

GPT-5.4: A 0.89, T 0.19
Claude Sonnet 4.6: A 0.90, T 0.34
LLaMA 4 Scout: A 0.83, T 0.30

model outputs

Gemini 3 Flash Preview: no output
GPT-5.4: A 0.79, T 0.28
Claude Sonnet 4.6: A 0.68, T 0.31
LLaMA 4 Scout: A 0.82, T 0.38

ground truth

jiggle?

model outputs

GPT-5.4: A 0.93, T 0.21
Claude Sonnet 4.6: A 0.90, T 0.09
LLaMA 4 Scout: A 0.86, T 0.03

ground truth

Neon Loaders

model outputs

GPT-5.4: A 0.80, T 0.25
Claude Sonnet 4.6: A 0.51, T 0.30
LLaMA 4 Scout: A 0.70, T 0.23

ground truth

Neon Loaders

model outputs

GPT-5.4: A 0.82, T 0.38
Claude Sonnet 4.6: A 0.81, T 0.45
LLaMA 4 Scout: A 0.43, T 0.00

ground truth

Neon Loaders

model outputs

GPT-5.4: A 0.83, T 0.29
Claude Sonnet 4.6: A 0.89, T 0.29
LLaMA 4 Scout: A 0.72, T 0.26

ground truth

Neon Loaders

model outputs

GPT-5.4: A 0.94, T 0.33
Claude Sonnet 4.6: A 0.86, T 0.30
LLaMA 4 Scout: A 0.77, T 0.31
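
The A and T tags on this page can be tallied per model. What follows is a minimal Python sketch, not part of the benchmark's own tooling: the names `parse_score_tag` and `mean_scores` are hypothetical, and the averages cover only the rows passed in, not the full 214-item test set. It assumes a model with no output should be listed with missing scores rather than counted as zero.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches a score tag such as "A 0.83T 0.24" or "A 0.83, T 0.24".
# Either number may be absent: a model with no output shows a bare "A T".
SCORE_TAG = re.compile(r"A\s*(\d+(?:\.\d+)?)?\s*,?\s*T\s*(\d+(?:\.\d+)?)?\s*$")


def parse_score_tag(tag):
    """Return (appearance, temporal) as floats, with None for a missing score."""
    m = SCORE_TAG.match(tag.strip())
    if m is None:
        raise ValueError(f"unrecognized score tag: {tag!r}")
    a, t = m.groups()
    return (float(a) if a else None, float(t) if t else None)


def mean_scores(rows):
    """Average A and T per model over (model, tag) pairs, skipping missing scores."""
    a_vals, t_vals = defaultdict(list), defaultdict(list)
    for model, tag in rows:
        a, t = parse_score_tag(tag)
        # Touch both lists so every model appears in the result,
        # even one that never produced a scored output.
        a_list, t_list = a_vals[model], t_vals[model]
        if a is not None:
            a_list.append(a)
        if t is not None:
            t_list.append(t)
    return {
        model: (
            mean(a_vals[model]) if a_vals[model] else None,
            mean(t_vals[model]) if t_vals[model] else None,
        )
        for model in a_vals
    }


if __name__ == "__main__":
    # A few rows copied from this page; extend with the remaining cards as needed.
    sample = [
        ("GPT-5.4", "A 0.83T 0.24"),
        ("GPT-5.4", "A 0.89T 0.19"),
        ("Gemini 3 Flash Preview", "A T"),
    ]
    for model, (a, t) in mean_scores(sample).items():
        print(model, a, t)
```

Skipping missing scores keeps a failed generation from dragging a model's average toward zero; whether that matches the benchmark's own aggregation is not stated on this page.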