animation2code benchmark

Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both.

Samples 169-176 of 214

ground truth

Loaders (WIP)

model outputs

GPT-5.4: A 0.83, T 0.28
Claude Sonnet 4.6: A 0.89, T 0.24
LLaMA 4 Scout: A 0.72, T 0.19

model outputs

Qwen3-VL-8B-Instruct: no output (A and T not scored)
GPT-5.4: A 0.75, T 0.31
Claude Sonnet 4.6: A 0.63, T 0.34
LLaMA 4 Scout: A 0.64, T 0.26

model outputs

GPT-5.4: A 0.91, T 0.30
Claude Sonnet 4.6: A 0.92, T 0.29
LLaMA 4 Scout: A 0.82, T 0.32

model outputs

GPT-5.4: A 0.76, T 0.25
Claude Sonnet 4.6: A 0.81, T 0.26
LLaMA 4 Scout: A 0.47, T 0.34

model outputs

GPT-5.4: A 0.95, T 0.25
Claude Sonnet 4.6: A 0.95, T 0.49
LLaMA 4 Scout: A 0.66, T 0.36

model outputs

GPT-5.4: A 0.74, T 0.14
Claude Sonnet 4.6: A 0.95, T 0.20
LLaMA 4 Scout: A 0.59, T 0.12

model outputs

Qwen3-VL-8B-Instruct: no output (A and T not scored)
GPT-5.4: A 0.95, T 0.24
Claude Sonnet 4.6: A 0.93, T 0.31
LLaMA 4 Scout: A 0.56, T 0.18

model outputs

GPT-5.4: A 0.95, T 0.27
Claude Sonnet 4.6: A 0.97, T 0.55
LLaMA 4 Scout: A 0.87, T 0.20
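
The per-sample scores on this page can be summarized per model. The sketch below is only an illustration: it assumes an unweighted mean over the eight samples listed here is a fair summary, which the dashboard itself does not state, and the names `SCORES` and `summarize` are hypothetical. Values are copied from the (A, T) pairs above; Qwen3-VL-8B-Instruct is left out of the averages because it produced no scored output on this page.

```python
from statistics import mean

# (A, T) pairs per model, in page order, copied from the scores listed above.
SCORES = {
    "GPT-5.4": [
        (0.83, 0.28), (0.75, 0.31), (0.91, 0.30), (0.76, 0.25),
        (0.95, 0.25), (0.74, 0.14), (0.95, 0.24), (0.95, 0.27),
    ],
    "Claude Sonnet 4.6": [
        (0.89, 0.24), (0.63, 0.34), (0.92, 0.29), (0.81, 0.26),
        (0.95, 0.49), (0.95, 0.20), (0.93, 0.31), (0.97, 0.55),
    ],
    "LLaMA 4 Scout": [
        (0.72, 0.19), (0.64, 0.26), (0.82, 0.32), (0.47, 0.34),
        (0.66, 0.36), (0.59, 0.12), (0.56, 0.18), (0.87, 0.20),
    ],
    # Listed twice on this page, both times with "no output" and no scores.
    "Qwen3-VL-8B-Instruct": [],
}


def summarize(scores):
    """Return {model: (mean_A, mean_T)} for models with at least one scored output."""
    summary = {}
    for model, pairs in scores.items():
        if not pairs:
            continue  # skip models that have nothing to average
        mean_a = mean(a for a, _ in pairs)
        mean_t = mean(t for _, t in pairs)
        summary[model] = (mean_a, mean_t)
    return summary


if __name__ == "__main__":
    for model, (a, t) in summarize(SCORES).items():
        print(f"{model:20s} A={a:.3f}  T={t:.3f}")
```

These averages cover only the eight samples on this page, not the full 214-sample test set, so they should not be read as benchmark-level results.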