APEX evaluates frontier AI models across four critical medical roles: General Practitioner, Radiology Expert, Pathology Specialist, and Cardiology Annotator. Our benchmark measures performance on real-world medical tasks using validated datasets from leading medical institutions.
Aggregated performance across all medical specialties
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | GPT-5 | OpenAI | 67.0% ± 2.1% |
| 2 | Gemini 3 Pro | Google | 65.4% ± 1.8% |
| 3 | Grok 4 | xAI | 64.2% ± 2.3% |
| 4 | o3 | OpenAI | 63.8% ± 1.9% |
| 5 | Opus 4.5 | Anthropic | 63.1% ± 2.0% |
| 6 | Sonnet 4.5 | Anthropic | 62.1% ± 1.7% |
| 7 | Gemini 2.5 Flash | Google | 61.5% ± 2.2% |
| 8 | GPT OSS | OpenAI | 59.8% ± 1.6% |
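The overall score in the table above aggregates each model's per-role results. As a rough illustration only, the sketch below assumes an unweighted mean across the four roles; the role keys and example values are placeholders, not APEX's published weighting.

```python
# Hypothetical illustration of aggregating per-role scores into an overall
# APEX score. Role names, scores, and the equal weighting are assumptions
# for illustration, not the benchmark's actual methodology.
from statistics import mean

# Per-role accuracy for one model (placeholder values).
role_scores = {
    "general_practitioner": 0.66,
    "radiology_expert": 0.70,
    "pathology_specialist": 0.63,
    "cardiology_annotator": 0.69,
}

# Unweighted mean across the four medical roles (assumed aggregation rule).
overall = mean(role_scores.values())
print(f"Overall APEX score: {overall:.1%}")
```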
Detailed rankings are provided for each medical role.
The Medical AI Productivity Index (APEX) is developed in collaboration with experts from the University of Pennsylvania, Northwestern University, Cornell Medical Center, Brigham and Women's Hospital, and Mount Sinai Health System.
Our benchmark evaluates AI models on authentic medical tasks including diagnosis support, medical imaging analysis, pathology review, and clinical documentation. All scores represent performance on validated test sets with error margins calculated using bootstrap sampling.
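The ± margins are described as coming from bootstrap sampling. A minimal sketch of that procedure, assuming per-item resampling of binary correctness scores and a 95% percentile interval (the exact resampling unit, interval width, and number of resamples are assumptions, not stated by APEX):

```python
# Minimal bootstrap sketch for the score margins. Assumes per-item binary
# correctness and a 95% percentile interval; the resampling scheme actually
# used by APEX is not specified here.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_margin(correct: np.ndarray, n_boot: int = 10_000) -> tuple[float, float]:
    """Return (mean score, half-width of the 95% percentile interval)."""
    n = len(correct)
    # Resample items with replacement and recompute the mean score each time.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = correct[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return correct.mean(), (hi - lo) / 2

# Placeholder data: 500 graded items, roughly 67% answered correctly.
scores = (rng.random(500) < 0.67).astype(float)
mean_score, margin = bootstrap_margin(scores)
print(f"{mean_score:.1%} ± {margin:.1%}")
```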