OpenAI builds world’s toughest AI health test
OpenAI has launched HealthBench, a new benchmark that scores AI health responses using real-world conversations and rubrics written by 262 doctors across 60 countries. The test evaluates models like ChatGPT, Grok, and Gemini in 49 languages and 26 medical fields.
New Delhi:
The open-source benchmark comes with 5,000 complex, multi-turn conversations, ranging from emergencies to everyday health questions, each scored against detailed rubrics written by medical professionals. It is not the usual multiple-choice test: HealthBench checks whether an AI can understand real health contexts, give clear answers, ask the right follow-up questions, and avoid dangerous errors. It is one of the most ambitious attempts yet to hold AI accountable for how it responds to medical queries.
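Under the hood, rubric grading reduces to a weighted checklist. The sketch below is a hypothetical Python illustration of how such scoring could work, assuming each criterion carries a point value (positive for desirable behaviour, negative for harmful behaviour) and a grader has already judged whether the response meets it; it is not OpenAI's released code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # physician-written criterion, e.g. "advises calling emergency services"
    points: int  # positive for desirable behaviour, negative for harmful behaviour
    met: bool    # whether a grader judged the response to exhibit this behaviour

def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points over maximum achievable points, clipped to [0, 1].

    Assumed scheme: only positive-point criteria count toward the maximum,
    so negative (harm) criteria can only pull a score down.
    """
    earned = sum(c.points for c in criteria if c.met)
    maximum = sum(c.points for c in criteria if c.points > 0)
    if maximum == 0:
        return 0.0
    return min(max(earned / maximum, 0.0), 1.0)

# A toy emergency-scenario rubric, loosely modelled on the
# unresponsive-neighbour example described later in this article.
criteria = [
    Criterion("Tells the user to call emergency services immediately", 10, met=True),
    Criterion("Gives step-by-step guidance while waiting for help", 5, met=False),
    Criterion("Recommends unsafe interventions before help arrives", -8, met=False),
]
print(f"Score: {rubric_score(criteria):.0%}")  # Score: 67%
```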
AI in healthcare gets a reality check
OpenAI says it built HealthBench because existing tools didn't go far enough. Most earlier benchmarks relied on medical exam questions or simple yes/no formats. In contrast, HealthBench covers 48,562 unique scoring criteria and spans 26 medical specialties, including global health, emergency care, and even neurological surgery.
The benchmark also checks how AI behaves under uncertainty, in emergencies, and when dealing with users from different countries and languages; 49 languages are supported in total, including Amharic and Nepali.
OpenAI’s own “o3” model topped the list with a score of 60%, outperforming Grok 3 (54%) and Gemini 2.5 Pro (52%), as reported in the company’s blog post and research paper. That may not sound high, but it shows just how tough the test is: on HealthBench Hard, a special subset of the most difficult questions, even the best models scored below 32%.
Why it matters for India
For countries like India, where access to professional medical advice can be patchy in rural areas, AI tools could someday help fill the gap. But benchmarks like HealthBench serve as a reality check before that future arrives. The focus now is not just on building flashy models, but on making sure they actually help, and do not harm.
OpenAI has released all the code, data, evaluation methods, and detailed scoring systems on GitHub, encouraging global researchers to build safer AI for health.
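For researchers who want to dig in, each HealthBench example pairs a multi-turn conversation with its physician-written rubric. A minimal sketch of reading such data follows; the file name and field names here are assumptions for illustration, not the exact released schema.

```python
import json

# Minimal sketch of reading HealthBench-style data. The field names below
# ("prompt", "rubrics", "criterion", "points") are assumptions for
# illustration, not a guarantee of the released schema.
with open("healthbench_examples.jsonl") as f:  # assumed local copy of the dataset
    for line in f:
        example = json.loads(line)
        conversation = example["prompt"]  # multi-turn chat: [{"role": ..., "content": ...}, ...]
        for rubric in example["rubrics"]:
            print(rubric["points"], rubric["criterion"])
```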
Real doctors in the loop
In a scenario where a user reports their 70-year-old neighbour lying on the floor and unresponsive, the benchmark not only checks if the AI knows to call emergency services, but also if it gives clear and accurate instructions, and does so quickly. One model’s answer in that example scored 77%, with feedback provided for improvement.
Each HealthBench score is grounded in physician-written rubrics. In some cases, the same questions were answered by both AI and doctors. Surprisingly, newer models like o3 were able to outperform unassisted human responses, though doctors still had the edge when given a chance to edit AI outputs.