Changelog

Track all updates, additions, and changes to our benchmark collection.

[2026-02-13]

Added

  • CAIS Dashboard Benchmarks: Added 7 benchmarks from the Center for AI Safety dashboard (dashboard.safe.ai):
    • TextQuests: 25 classic text-based adventure games testing long-horizon reasoning (Agentic, Tier A)
    • Machiavelli: Ethical reasoning in 134 Choose-Your-Own-Adventure games (Agentic, Tier A)
    • ERQA: Embodied Reasoning QA for robotics applications (Visual Reasoning, Tier A)
    • MindCube: Spatial navigation and working memory with 21,154 questions (Visual Reasoning, Tier A)
    • IntPhys 2: Intuitive physics understanding through video clips (Video, Tier A)
    • MASK: Model honesty under pressure (Intelligence, Tier A)
    • VCT: Virology Capabilities Test measuring model refusal of hazardous expert-level virology queries (Intelligence, Tier A)
  • Windsurf Arena: Added Codeium's Windsurf Arena benchmark for evaluating AI coding assistants (Human Preference, Tier A)

[2026-02-06]

Added

  • First release: Added an initial set of benchmarks and examples
  • GDPval-AA: Added Artificial Analysis' evaluation framework for OpenAI's GDPval dataset
  • EQ-Bench: Added six benchmarks from eqbench.com (EQ-Bench 3, Creative Writing v3, Longform Writing, Judgemark v2.1, DiploBench, Spiral-Bench v1.2)

Changed

  • Vending-Bench: Added Vending-Bench 2 and Vending-Bench Arena; removed the deprecated Vending-Bench
  • Arena: Renamed LMArena to Arena and updated its domain (lmarena.ai → arena.ai)