animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with two scores: A (appearance similarity) and T (temporal similarity). Higher is better for both.
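To work with the A/T tags outside the dashboard, they can be parsed into numeric pairs. The following is only a minimal sketch: the function name `parse_tag` and the regular expression are assumptions based on how the tags appear on this page, not part of the benchmark's own tooling.

```python
import re
from typing import Optional, Tuple

# Assumed tag layout, taken from this page: "A 0.62T 0.35" (also accepts "A 0.62, T 0.35").
_TAG = re.compile(r"A\s*(\d+(?:\.\d+)?)\s*,?\s*T\s*(\d+(?:\.\d+)?)")


def parse_tag(tag: str) -> Optional[Tuple[float, float]]:
    """Return (appearance, temporal) scores, or None when the tag has no scores."""
    match = _TAG.fullmatch(tag.strip())
    if match is None:
        return None  # e.g. an empty "A T" tag for a model that produced no output
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    # Tags copied from the first item listed on this page.
    rows = {
        "GPT-5.4": "A 0.62T 0.35",
        "Claude Sonnet 4.6": "A 0.77T 0.21",
        "LLaMA 4 Scout": "A 0.22T 0.00",
    }
    for model, tag in rows.items():
        print(model, parse_tag(tag))
```

Returning `None` for an empty tag keeps missing outputs distinct from a genuine score of 0.00, which matters if the scores are later averaged.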

Showing items 121-128 of 214

model outputs

GPT-5.4: A 0.62, T 0.35
Claude Sonnet 4.6: A 0.77, T 0.21
LLaMA 4 Scout: A 0.22, T 0.00

model outputs

GPT-5.4: A 0.91, T 0.25
Claude Sonnet 4.6: A 0.78, T 0.11
LLaMA 4 Scout: A 0.70, T 0.13

model outputs

Qwen3-VL-8B-Instruct: no output (A and T not reported)
GPT-5.4: A 0.88, T 0.31
Claude Sonnet 4.6: A 0.78, T 0.22
LLaMA 4 Scout: A 0.71, T 0.14

ground truth

iOS 7 Progress Bar

model outputs

GPT-5.4: A 0.91, T 0.25
Claude Sonnet 4.6: A 0.92, T 0.54
LLaMA 4 Scout: A 0.77, T 0.41

ground truth

Blob SVG

model outputs

GPT-5.4: A 0.91, T 0.30
Claude Sonnet 4.6: A 0.89, T 0.33
LLaMA 4 Scout: A 0.55, T 0.18

ground truth

SVG Loading icons

model outputs

GPT-5.4: A 0.87, T 0.32
Claude Sonnet 4.6: A 0.72, T 0.24
LLaMA 4 Scout: A 0.65, T 0.12

ground truth

SVG Loading icons

model outputs

GPT-5.4: A 0.75, T 0.29
Claude Sonnet 4.6: A 0.84, T 0.22
LLaMA 4 Scout: A 0.56, T 0.14

ground truth

SVG Loading icons

model outputs

GPT-5.4: A 0.90, T 0.32
Claude Sonnet 4.6: A 0.88, T 0.30
LLaMA 4 Scout: A 0.40, T 0.00