Changelog
Track all updates, additions, and changes to our benchmark collection.
[2026-02-13]
Added
- CAIS Dashboard Benchmarks: Added 7 benchmarks from the Center for AI Safety dashboard (dashboard.safe.ai):
  - TextQuests: 25 classic text-based adventure games testing long-horizon reasoning (Agentic, Tier A)
  - Machiavelli: Ethical reasoning in 134 Choose-Your-Own-Adventure games (Agentic, Tier A)
  - ERQA: Embodied Reasoning QA for robotics applications (Visual Reasoning, Tier A)
  - MindCube: Spatial navigation and working memory with 21,154 questions (Visual Reasoning, Tier A)
  - IntPhys 2: Intuitive physics understanding through video clips (Video, Tier A)
  - MASK: Tests model honesty under pressure (Intelligence, Tier A)
  - VCT: Virology Capabilities Test measuring model refusal of hazardous expert-level virology queries (Intelligence, Tier A)
- Windsurf Arena: Added Codeium's Windsurf Arena benchmark for evaluating AI coding assistants (Human Preference, Tier A)
[2026-02-06]
Added
- First release: Added some benchmarks and examples
- GDPval-AA: Added Artificial Analysis' evaluation framework for OpenAI's GDPval dataset
- EQ-Bench: Added six benchmarks from eqbench.com (EQ-Bench 3, Creative Writing v3, Longform Writing, Judgemark v2.1, DiploBench, Spiral-Bench v1.2)
Changed
- Vending-Bench: Added Vending-Bench 2 and Vending-Bench Arena. Removed Vending-Bench (deprecated)
- Arena: Renamed LMArena to Arena and updated its domain (lmarena.ai → arena.ai)