animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with two scores: A (appearance similarity) and T (temporal similarity). Higher is better for both.
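The score tags on this page can be read programmatically. The following is a minimal sketch, assuming each tag follows the "A <value>T <value>" pattern seen here, optionally with a comma between the two scores or with both values missing when a model produced no output. The name parse_tag is hypothetical and not part of the benchmark.

```python
import re
from typing import Optional, Tuple

# Assumed tag shapes: "A 0.96T 0.20", "A 0.96, T 0.20", or an empty "A T".
# \b keeps the pattern from starting inside a model name such as "LLaMA".
TAG_RE = re.compile(r"\bA\s*(\d*\.?\d+)?\s*,?\s*T\s*(\d*\.?\d+)?")


def parse_tag(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (appearance, temporal) scores; None marks a missing value."""
    match = TAG_RE.search(text)
    if match is None:
        raise ValueError(f"no A/T tag found in: {text!r}")
    a_raw, t_raw = match.groups()
    appearance = float(a_raw) if a_raw else None
    temporal = float(t_raw) if t_raw else None
    return appearance, temporal


if __name__ == "__main__":
    for tag in ["A 0.96T 0.20", "A T"]:
        print(tag, "->", parse_tag(tag))
```

Returning None for a missing score, rather than 0.0, keeps "no output" rows from dragging down any average computed later.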

9–16 of 214

model outputs

GPT-5.4: A 0.96, T 0.20
Claude Sonnet 4.6: A 0.84, T 0.29
LLaMA 4 Scout: A 0.44, T 0.19

ground truth

Star Burst

model outputs

GPT-5.4: A 0.81, T 0.28
Claude Sonnet 4.6: A 0.78, T 0.30
LLaMA 4 Scout: A 0.50, T 0.31

ground truth

Snow (Pure CSS)

model outputs

GPT-5.4: A 0.95, T 0.32
Claude Sonnet 4.6: A 0.89, T 0.16
LLaMA 4 Scout: A 0.53, T 0.35

ground truth

Bubble Float

model outputs

GPT-5.4: A 0.89, T 0.47
Claude Sonnet 4.6: A 0.96, T 0.63
LLaMA 4 Scout: A 0.53, T 0.31

ground truth

Spiral Tower

model outputs

GPT-5.4: A 0.65, T 0.28
Claude Sonnet 4.6: A 0.78, T 0.27
LLaMA 4 Scout: A 0.64, T 0.28

model outputs

GPT-5.4: A 0.97, T 0.22
Claude Sonnet 4.6: A 0.97, T 0.72
LLaMA 4 Scout: A 0.80, T 0.10

model outputs

Qwen3-VL-8B-Instruct: no output (no A or T score)
GPT-5.4: A 0.87, T 0.31
Claude Sonnet 4.6: A 0.79, T 0.33
LLaMA 4 Scout: A 0.54, T 0.27

model outputs

GPT-5.4: A 0.91, T 0.35
Claude Sonnet 4.6: A 0.89, T 0.32
LLaMA 4 Scout: A 0.60, T 0.12