animation2code benchmark
Zero-shot video (or image-frame) → code results on the test set, across commercial and open-source models.

Each output is tagged with A = appearance similarity and T = temporal similarity; higher is better for both. Click a video to inspect its code.

177–184 of 214

model outputs

GPT-5.4

A 0.90, T 0.24

Claude Sonnet 4.6

A 0.85, T 0.31

LLaMA 4 Scout

A 0.54, T 0.31

model outputs

GPT-5.4

A 0.94, T 0.35

Claude Sonnet 4.6

A 0.95, T 0.32

LLaMA 4 Scout

A 0.85, T 0.35

model outputs

GPT-5.4

A 0.96, T 0.58

Claude Sonnet 4.6

A 0.94, T 0.33

LLaMA 4 Scout

A 0.75, T 0.17

model outputs

GPT-5.4

A 0.89, T 0.35

Claude Sonnet 4.6

A 0.92, T 0.15

LLaMA 4 Scout

A 0.88, T 0.14

ground truth

Worm cycle

model outputs

GPT-5.4

A 0.74, T 0.24

Claude Sonnet 4.6

A 0.74, T 0.25

LLaMA 4 Scout

A 0.56, T 0.14

model outputs

GPT-5.4

A 0.83, T 0.28

Claude Sonnet 4.6

A 0.86, T 0.26

LLaMA 4 Scout

A 0.76, T 0.17

model outputs

GPT-5.4

A 0.92, T 0.38

Claude Sonnet 4.6

A 0.90, T 0.37

LLaMA 4 Scout

A 0.78, T 0.15

model outputs

GPT-5.4

A 0.78, T 0.21

Claude Sonnet 4.6

A 0.87, T 0.33

LLaMA 4 Scout

A 0.86, T 0.24
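
The A/T tags on this page can be averaged per model for a quick comparison across the examples shown. The snippet below is a minimal sketch, not part of the benchmark's own tooling: the names `parse_tag` and `mean_scores` are hypothetical, and the sample rows are copied from the first two examples listed above. Means over a handful of examples say nothing about the full 214-example test set.

```python
import re
from collections import defaultdict
from statistics import mean

# Accepts tags written as "A 0.90T 0.24" or "A 0.90, T 0.24".
TAG = re.compile(r"A\s*(\d+(?:\.\d+)?)\s*,?\s*T\s*(\d+(?:\.\d+)?)")


def parse_tag(tag):
    """Return (appearance, temporal) similarity parsed from a score tag."""
    match = TAG.search(tag)
    if match is None:
        raise ValueError(f"unrecognized score tag: {tag!r}")
    return float(match.group(1)), float(match.group(2))


def mean_scores(rows):
    """Average A and T per model over (model, tag) pairs."""
    by_model = defaultdict(list)
    for model, tag in rows:
        by_model[model].append(parse_tag(tag))
    return {
        model: (mean(a for a, _ in scores), mean(t for _, t in scores))
        for model, scores in by_model.items()
    }


# Sample rows: the first two examples shown on this page.
rows = [
    ("GPT-5.4", "A 0.90T 0.24"),
    ("Claude Sonnet 4.6", "A 0.85T 0.31"),
    ("LLaMA 4 Scout", "A 0.54T 0.31"),
    ("GPT-5.4", "A 0.94T 0.35"),
    ("Claude Sonnet 4.6", "A 0.95T 0.32"),
    ("LLaMA 4 Scout", "A 0.85T 0.35"),
]

for model, (a, t) in mean_scores(rows).items():
    print(f"{model}: mean A {a:.2f}, mean T {t:.2f}")
```

Keeping A and T as separate means, rather than folding them into one number, matches how the page reports them: a model can score high on appearance while scoring low on temporal similarity.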