
AI Benchmarks

Compare leading AI models across standardized benchmarks. Last updated 2026-04-17.

How do you know if Claude is smarter than GPT-4? How does the new Llama 4 stack up against Gemini 2.5? Benchmarks provide the answer. These standardized tests measure specific AI capabilities across diverse domains and let us compare models objectively. They're imperfect (benchmarks are often gamed), but they're the only shared language we have for understanding AI progress.

MMLU measures broad knowledge through multiple-choice questions spanning chemistry, history, law, and 50+ other subjects. A score of 92 percent means the model answers 92 of every 100 questions correctly across all topics; it's the closest thing we have to a general intelligence test for AI. HumanEval tests code generation: the model writes functions to solve programming problems written by humans, and its output is checked against unit tests. GPQA (Graduate-Level Google-Proof Q&A) is deliberately hard, asking questions that require deep domain expertise and can't be answered by searching the web. MATH benchmarks raw mathematical reasoning on competition-style problems. SWE-bench tests software engineering: given a real GitHub issue and its codebase, can the model produce a patch that makes the failing tests pass?
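To make those scoring rules concrete, here is a minimal sketch (not any benchmark's official harness) of the two most common metrics: plain accuracy for a multiple-choice suite like MMLU, and the unbiased pass@k estimator used for HumanEval-style code benchmarks, where n candidate solutions are sampled per problem and c of them pass the unit tests.

```python
from math import comb

def multiple_choice_accuracy(predictions, answers):
    """Score an MMLU-style suite: fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples drawn from the n generated solutions passes the tests,
    given that c of the n solutions pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 92 of 100 questions right -> 0.92 accuracy;
# 3 of 10 code samples passing -> pass@1 of 0.3.
print(multiple_choice_accuracy(["B"] * 92 + ["A"] * 8, ["B"] * 100))  # 0.92
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```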

No single benchmark captures everything. A model that excels at MMLU might struggle with code. Benchmark questions have also leaked into training data, so some scores reflect memorization rather than capability. And real-world performance depends on your specific task, how you prompt, and how you integrate the model into your system. Use this data to narrow the field of candidates, then test the finalists on your actual workloads. We've also collected this data in our model comparison tool for side-by-side analysis.
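For that last step, a small head-to-head harness is usually enough. The sketch below assumes a hypothetical call_model stub (replace it with your provider's SDK) and a task-specific passes check; the prompts and model identifiers in the example are purely illustrative.

```python
def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical stub: replace with a real call to your provider's SDK."""
    raise NotImplementedError

def passes(output: str, case: dict) -> bool:
    """Task-specific check: exact match, a regex, unit tests, or a rubric."""
    return case["expected"] in output

def compare_models(models, test_cases):
    """Run each finalist over your own workload and report its pass rate."""
    results = {}
    for model in models:
        passed = sum(passes(call_model(model, case["prompt"]), case)
                     for case in test_cases)
        results[model] = passed / len(test_cases)
    return results

# A handful of cases drawn from your real workload tells you more than a
# one-point leaderboard gap.
cases = [
    {"prompt": "Summarize this support ticket: ...", "expected": "refund"},
    {"prompt": "Extract the invoice date from: ...", "expected": "2026-04-17"},
]
# compare_models(["claude-opus-4-7", "gpt-4.5"], cases)
```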

MMLU-Pro: a harder variant of MMLU measuring general knowledge and reasoning across a broad range of academic subjects. Max score: 100.

| Rank | Model | Provider | Score | Released |
|------|-------|----------|-------|----------|
| #1 | Claude Opus 4.7 | Anthropic | 93.8 / 100 | 2026-04 |
| #2 | Claude Opus 4.6 | Anthropic | 92.4 / 100 | 2026-03 |
| #3 | o1 | OpenAI | 91.8 / 100 | 2025-09 |
| #4 | Gemini 2.5 Pro | Google | 91.2 / 100 | 2026-01 |
| #5 | GPT-4.5 | OpenAI | 90.1 / 100 | 2025-12 |
| #6 | Llama 4 Maverick | Meta | 89.3 / 100 | 2026-03 |
| #7 | Claude Sonnet 4.6 | Anthropic | 88.7 / 100 | 2026-02 |
| #8 | DeepSeek V3 | DeepSeek | 88.1 / 100 | 2025-12 |
| #9 | GPT-4o | OpenAI | 87.2 / 100 | 2025-05 |
| #10 | Mistral Large | Mistral | 86.8 / 100 | 2025-11 |
| #11 | o3-mini | OpenAI | 86.3 / 100 | 2025-11 |
| #12 | Llama 4 Scout | Meta | 85.9 / 100 | 2026-02 |
| #13 | Gemini 2.0 Flash | Google | 84.5 / 100 | 2025-10 |
| #14 | Claude Haiku 4.5 | Anthropic | 82.1 / 100 | 2026-01 |
| #15 | Mistral Small | Mistral | 78.4 / 100 | 2025-09 |