animation2code benchmark
Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both.
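The dashboard does not say how A and T are computed, so the following is only a minimal sketch of how the score tags listed below could be read and compared. The `Score` type, the `parse_tag` helper, and the regular expression are hypothetical names introduced for illustration; the tag strings are copied from the first item on this page.

```python
import re
from typing import NamedTuple, Optional


class Score(NamedTuple):
    appearance: float  # A: appearance similarity, higher is better
    temporal: float    # T: temporal similarity, higher is better


# Hypothetical pattern: accepts "A 0.79T 0.30" as well as "A 0.79, T 0.30".
_TAG = re.compile(r"A\s*(\d+\.\d+)\s*,?\s*T\s*(\d+\.\d+)")


def parse_tag(tag: str) -> Optional[Score]:
    """Return the scores in a tag, or None when the tag carries no numbers."""
    match = _TAG.search(tag)
    if match is None:
        return None  # e.g. "A T" for a model that produced no output
    return Score(float(match.group(1)), float(match.group(2)))


# Tags for the first item on this page, as shown in the dashboard.
tags = {
    "Qwen3-VL-8B-Instruct": "A T",
    "GPT-5.4": "A 0.79T 0.30",
    "Claude Sonnet 4.6": "A 0.67T 0.30",
    "LLaMA 4 Scout": "A 0.40T 0.24",
}
scores = {model: parse_tag(tag) for model, tag in tags.items()}

# Models without an output are skipped when ranking by appearance similarity.
scored = {model: s for model, s in scores.items() if s is not None}
best_appearance = max(scored, key=lambda model: scored[model].appearance)
```

Returning `None` for an empty tag keeps a model with no output out of the ranking instead of counting it as a zero score.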

1-8 of 214

model outputs

Qwen3-VL-8B-Instruct: no output (no A or T score)
GPT-5.4: A 0.79, T 0.30
Claude Sonnet 4.6: A 0.67, T 0.30
LLaMA 4 Scout: A 0.40, T 0.24

ground truth

Road Block

model outputs

Qwen3-VL-8B-Instruct: no output (no A or T score)
GPT-5.4: A 0.56, T 0.23
Claude Sonnet 4.6: A 0.65, T 0.23
LLaMA 4 Scout: A 0.39, T 0.21

ground truth

SVG Draught Beer

model outputs

GPT-5.4: A 0.89, T 0.22
Claude Sonnet 4.6: A 0.80, T 0.15
LLaMA 4 Scout: A 0.64, T 0.00

ground truth

SVG Multi-Drip

model outputs

GPT-5.4: A 0.90, T 0.24
Claude Sonnet 4.6: A 0.70, T 0.16
LLaMA 4 Scout: A 0.68, T 0.24

ground truth

sting

model outputs

GPT-5.4: A 0.42, T 0.18
Claude Sonnet 4.6: A 0.75, T 0.30
LLaMA 4 Scout: A 0.62, T 0.34

model outputs

GPT-5.4: A 0.84, T 0.40
Claude Sonnet 4.6: A 0.87, T 0.53
LLaMA 4 Scout: A 0.84, T 0.55

model outputs

GPT-5.4: A 0.75, T 0.27
Claude Sonnet 4.6: A 0.45, T 0.14
LLaMA 4 Scout: A 0.58, T 0.24

ground truth

Orbit 3D

model outputs

GPT-5.4: A 0.92, T 0.19
Claude Sonnet 4.6: A 0.96, T 0.17
LLaMA 4 Scout: A 0.85, T 0.28