LLM Comparison

How well can different language models implement a 3D dancing stick figure using Three.js? We gave each model the same prompt and compared the results.

This is not a scientific benchmark. It's a visual, qualitative comparison meant to give a feel for how different models handle a creative coding task. Results vary with hardware, quantization, prompt wording, and sampling randomness.