Frontier AI Models Struggle to Diagnose Medical X-Rays Accurately

Stanford researchers find AI models can 'hallucinate' detailed medical analysis without ever seeing the images.

Apr. 7, 2026 at 1:06pm

As AI models become increasingly relied upon in medical settings, the potential for 'mirage reasoning', fabricating detailed analysis of unseen imagery, raises serious concerns about the reliability and safety of these systems. (Image: Stanford Today)

A team of researchers at Stanford University found that frontier AI models readily generated 'detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided.' The researchers coined the term 'mirage reasoning' for this behavior, in which a model constructs a false epistemic frame and describes a multimodal input it was never actually shown. The finding raises serious concerns about the reliability of using AI models to analyze medical scans, since they may deliver confident but entirely fabricated diagnoses.

Why it matters

As hospitals and healthcare providers increasingly look to deploy AI systems to assist or even replace radiologists, this research highlights major flaws in the current state of medical AI. If these models cannot reliably recognize when no image has been provided, and instead fabricate findings, they could produce dangerous false positives and misdiagnoses with severe consequences for patient health and safety.

The details

The Stanford researchers tested frontier AI models like OpenAI's GPT-5, Google's Gemini 3 Pro, and Anthropic's Claude Opus 4.5 on a new benchmark they created that included visual questions across medical, scientific, technical, and general topics, but with the images removed. They found that all of the models confidently provided 'descriptions of visual details' and 'pathology-biased clinical findings' for the missing images. In one experiment, a model even achieved the top rank on a standard chest X-ray question-answering benchmark without ever seeing the actual X-ray images.

  • The research paper has not yet been peer-reviewed.
  • The experiments were conducted by a team at Stanford University in 2026.
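
For readers who want a concrete picture of the experimental setup, the blind probe can be sketched in a few lines of code. This is a minimal illustration, not the researchers' actual harness: the client library, model name, and question below are assumptions for demonstration only.

```python
# A minimal sketch of the blind-probe setup described above: a visual
# question is sent to a vision-language model with the image omitted.
# Assumes the OpenAI Python client; the question text is illustrative,
# not taken from the paper's benchmark.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "You are shown a frontal chest X-ray. "
    "Does the image show evidence of cardiomegaly? "
    "Describe the relevant visual findings."
)

# Note: no image is attached to the message, only the text prompt.
response = client.chat.completions.create(
    model="gpt-5",  # one of the frontier models named in the article
    messages=[{"role": "user", "content": question}],
)

# If the reply contains confident visual detail ("the cardiac silhouette
# appears enlarged...") instead of flagging the missing image, that reply
# is an instance of the 'mirage reasoning' the paper describes.
print(response.choices[0].message.content)
```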

The players

Mohammad Asadi

A Stanford PhD student and co-author of the research paper.

OpenAI

The company behind the GPT-5 language model tested in the experiments.

Google

The company behind the Gemini 3 Pro language model tested in the experiments.

Anthropic

The company behind the Claude Opus 4.5 language model tested in the experiments.


What they’re saying

“What we try to show is that even on the best benchmarks, although a question would seem unsolvable for a human, the LLMs might still be able to leverage question-level and dataset-level patterns behind it and use general statistics and prevalence data to answer them right, while also learning to talk 'as if' they were seeing the image.”

— Mohammad Asadi, Stanford PhD student

“To conclude, we believe that the AI models are able to use their super-human memory and language skills to hide their weaknesses in multimodal understanding (and by talking like [they] are actually doing multi-modal reasoning).”

— Mohammad Asadi, Stanford PhD student
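
Asadi's point about prevalence statistics is easy to make concrete. In the toy sketch below (the dataset and numbers are hypothetical, not from the paper), a text-only guesser that has learned the majority answer of an imbalanced yes/no X-ray benchmark scores well without ever seeing an image.

```python
# Hypothetical illustration of answering from prevalence alone: if 70% of
# questions in a binary X-ray benchmark have the answer "no", a text-only
# model that learned that prior scores 70% with zero visual understanding.
from collections import Counter

# Hypothetical benchmark answer key (in a real benchmark, each comes with an image)
answers = ["no"] * 700 + ["yes"] * 300

majority_label, majority_count = Counter(answers).most_common(1)[0]
blind_accuracy = majority_count / len(answers)

print(f"Always answering '{majority_label}' scores {blind_accuracy:.0%} "
      "with the images never seen.")  # -> 70%
```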

What’s next

The researchers are calling for an overhaul of existing benchmarks to avoid negative consequences, particularly 'in medical contexts where miscalibrated AI carries the greatest consequence.' They have proposed a new framework called 'B-Clean' to identify and remove compromised questions that AI models could answer without visual input.
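
The article does not detail the exact B-Clean procedure, but the underlying idea, probing each question without its image and discarding those a blind model still answers correctly, can be sketched as follows. The function names and data structures here are illustrative assumptions, not the paper's implementation.

```python
# A plausible minimal sketch of blind-filtering a visual benchmark:
# keep only questions a text-only model fails, so the surviving items
# genuinely require looking at the image.

def clean_benchmark(benchmark, blind_answer, n_trials=5, threshold=0.5):
    """Drop questions answerable without vision.

    blind_answer: callable(question_text) -> answer string, with no image access
                  (e.g. a wrapper around an LLM call as in the earlier sketch).
    """
    kept = []
    for item in benchmark:  # item: {"question": str, "answer": str, "image_id": str}
        correct = sum(blind_answer(item["question"]) == item["answer"]
                      for _ in range(n_trials))
        if correct / n_trials < threshold:  # blind model can't solve it reliably
            kept.append(item)
    return kept

# Demo with a stub guesser that always answers "no" (a prevalence-prior baseline):
toy = [
    {"question": "Cardiomegaly present?", "answer": "no", "image_id": "xr1"},
    {"question": "Pleural effusion present?", "answer": "yes", "image_id": "xr2"},
]
print(clean_benchmark(toy, blind_answer=lambda q: "no"))
# -> only the "yes" item survives; the "no" item was answerable without the image
```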

The takeaway

Language models can confidently fabricate detailed medical analysis without ever seeing the underlying images, a flaw that strikes at the foundations of current medical AI. Until benchmarks can verify that models are genuinely reasoning over the imagery rather than exploiting statistical patterns in the questions, relying on them for high-stakes medical diagnosis carries serious reliability and safety risks.