Automated Radiology Report Generation Using Multimodal Foundation Models: A Systematic Review Of Clinical Accuracy And Safety
DOI: https://doi.org/10.70082/vdzvtc32

Abstract
Background: Diagnostic radiology stands at a precarious inflection point. As imaging volumes surge due to aging populations and expanded screening protocols, the workforce remains critically constrained. Recent global surveys indicate that over 53% of radiologists report burnout, with workforce shortages cited as a primary concern by nearly half of the profession. This systemic strain exacerbates the risk of diagnostic error, a phenomenon already estimated to affect approximately 40 million patients annually worldwide. Against this backdrop, Artificial Intelligence (AI), and specifically Multimodal Foundation Models (MFMs), has emerged as a potential remedy. These models, capable of processing both visual and textual data, promise to automate the labor-intensive process of radiology report generation (RRG), thereby potentially alleviating clinician workload and standardizing diagnostic quality. However, the transition from experimental architectures to clinical deployment is fraught with challenges related to factual consistency, safety, and trust.
Objective: This systematic review aims to provide a comprehensive, critical evaluation of the current state of automated radiology report generation using MFMs, with a focus on clinical accuracy and safety.
Methods: A comprehensive systematic literature search was conducted across major medical and technical databases covering the period from 2020 to 2025. The review adhered to PRISMA guidelines where applicable. Inclusion criteria prioritized studies that evaluated MFMs on standard benchmarks (MIMIC-CXR, CheXpert) or through direct comparison with board-certified radiologists.
Results: The review reveals a distinct dichotomy in MFM performance: while linguistic fluency has reached near-human levels, factual reliability remains volatile. A Fact-Aware Multimodal Retrieval-Augmented Generation (FactMM-RAG) pipeline was shown to significantly outperform standard foundation models. By grounding generation in retrieved, factually similar report pairs mined via RadGraph, FactMM-RAG achieved a 6.5% improvement in F1-CheXbert and a 2% improvement in F1-RadGraph on the MIMIC-CXR dataset relative to state-of-the-art retrievers. Conversely, large-scale comparative studies indicate that generalist models such as GPT-4V and Gemini Pro Vision still trail human radiologists in diagnostic accuracy (49% for the models vs. 61% for radiologists in complex cases), although they show promise as "second readers" in specific subspecialties such as chest radiology.
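The fact-aware retrieval step described above can be illustrated with a minimal, self-contained sketch. All names, the toy fact triples, and the scoring here are illustrative assumptions, not the published implementation: the actual FactMM-RAG pipeline mines fact-annotated report pairs with RadGraph and conditions a multimodal generator on the retrieved text, whereas this sketch only shows the core idea of ranking candidate reports by F1 overlap of their extracted fact triples.

```python
def fact_f1(query_facts, candidate_facts):
    """F1 overlap between two sets of (entity, relation, entity) fact triples."""
    if not query_facts or not candidate_facts:
        return 0.0
    tp = len(query_facts & candidate_facts)  # facts shared by both reports
    precision = tp / len(candidate_facts)
    recall = tp / len(query_facts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def retrieve_grounding_report(query_facts, corpus):
    """Return the corpus report whose extracted facts best match the query."""
    return max(corpus, key=lambda report: fact_f1(query_facts, report["facts"]))


# Toy corpus: each report carries hand-written, RadGraph-style fact triples.
corpus = [
    {"text": "No acute cardiopulmonary abnormality.",
     "facts": {("lungs", "absent", "opacity")}},
    {"text": "Right lower lobe opacity concerning for pneumonia.",
     "facts": {("right lower lobe", "present", "opacity"),
               ("opacity", "suggestive_of", "pneumonia")}},
]

# Facts extracted (hypothetically) from the image encoder's preliminary read.
query = {("right lower lobe", "present", "opacity")}
best = retrieve_grounding_report(query, corpus)
print(best["text"])  # the pneumonia report has the highest fact overlap
```

In a full pipeline, the retrieved report would be appended to the generator's prompt so that the decoded text is anchored to verified clinical facts rather than to the model's unconstrained priors.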
Safety analysis presents the most concerning findings. A multimodal evaluation reported a 74.4% overall hallucination rate across leading vision-language models, dominated by "fabricated imaging findings" that are statistically plausible but visually absent. The review identifies a "Plausibility Paradox": the most advanced models (e.g., Gemini 2.0) generate the most convincing, yet factually hallucinated, reports, posing a high risk of automation bias. However, specialized models such as MAIRA-X have reduced critical error rates to 4.6%, approaching the human baseline of 3.0%.
Conclusion: The era of autonomous radiology reporting has not yet arrived, but the era of AI-augmented reporting is imminent. Retrieval-augmented architectures represent a critical leap forward, offering a mechanism to constrain the stochastic nature of generative AI with verified clinical facts. While current hallucination rates preclude independent use, the potential for these systems to democratize access to high-quality diagnostics—particularly in underserved regions—is profound. Future implementation must prioritize "human-in-the-loop" workflows, robust uncertainty quantification, and the development of safety-critical metrics that penalize plausible fabrications.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
