
AI Research

Latest papers, benchmarks, and research developments

AI moves in two tracks: products and research. Products are what you use (Claude, ChatGPT, Gemini); research is where those products come from. Following the research lets you see what's coming 18 to 36 months before it ships. Right now, papers on mixture-of-experts, improved scaling laws, and multimodal reasoning are being published; six months from now, new products will implement those ideas, and a year after that they'll be commoditized.

This feed aggregates AI research from arXiv, academic conferences (NeurIPS, ICML, ACL, ICLR), and technical journals. We filter for practical relevance: papers describing techniques likely to appear in products, new benchmarks that measure important capabilities, and theoretical work that advances our understanding of how to build better AI. Purely mathematical papers with no bearing on building systems are excluded. When a major paper drops, it spreads through the community in waves: researchers read it first, engineers implement it, products ship it, and hype follows. Our job is to show you the research before the hype cycle obscures the substance.

Trends to watch: sparse models (mixture-of-experts, for efficiency), improved reasoning (chain-of-thought, tree search, planning), better calibration (understanding when models are uncertain), and mechanistic interpretability (understanding how models actually work). Below we also track benchmark scores for major models to show empirical progress on standardized tasks.
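The first trend is easiest to see in code. Below is a minimal sketch of top-k expert routing, the core mechanism of a mixture-of-experts layer: a learned gate scores each token against every expert, only the k best experts run, and their outputs are mixed by the renormalized gate weights. The shapes, names, and the renormalization step here are illustrative assumptions, not taken from any particular paper or product.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (tokens, d_model) input activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of callables, each mapping (tokens, d_model) -> (tokens, d_model)
    """
    logits = x @ gate_w                        # (tokens, n_experts) gate scores
    probs = softmax(logits)
    topk = np.argsort(probs, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Renormalize the k selected gate weights so they sum to 1,
        # then mix only those experts' outputs (the rest never run).
        w = probs[t, topk[t]]
        w = w / w.sum()
        for weight, idx in zip(w, topk[t]):
            out[t] += weight * experts[idx](x[t : t + 1])[0]
    return out
```

The efficiency win is that with, say, 64 experts and k=2, each token pays the compute cost of only 2 expert forward passes while the model's total parameter count scales with all 64.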

Latest Research

Benchmark Tracker

Benchmark    Claude Opus 4.7    GPT-4.5    Gemini 2.5 Pro    Llama 4
MMLU         93.8               90.8       91.1              86.3
HumanEval    96.2               93.7       94.2              88.9
GPQA         76.5               71.2       72.8              63.5

Scores represent published results as of April 2026. Higher is better.
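For quick comparisons, the table above can be held as plain data. Averaging across benchmarks is a crude roll-up, since MMLU, HumanEval, and GPQA measure different things on different difficulty scales; the snippet below is only a sketch of that kind of summary, with the numbers copied from the table.

```python
# Benchmark scores from the tracker above (published results as of April 2026).
scores = {
    "Claude Opus 4.7": {"MMLU": 93.8, "HumanEval": 96.2, "GPQA": 76.5},
    "GPT-4.5":         {"MMLU": 90.8, "HumanEval": 93.7, "GPQA": 71.2},
    "Gemini 2.5 Pro":  {"MMLU": 91.1, "HumanEval": 94.2, "GPQA": 72.8},
    "Llama 4":         {"MMLU": 86.3, "HumanEval": 88.9, "GPQA": 63.5},
}

def mean_score(model):
    # Unweighted mean across benchmarks -- a crude single-number summary.
    vals = scores[model].values()
    return sum(vals) / len(vals)

# Models ranked by mean score, best first.
ranked = sorted(scores, key=mean_score, reverse=True)
```

A fairer comparison would weight benchmarks by what you care about (e.g. HumanEval for coding work), but even the crude mean makes the ordering in the table explicit.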