This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
From Proof-of-Concept to Clinical Practice: Reproducibility in Medical Imaging AI
Citations: 0
Authors: 2
Year: 2026
Abstract
Artificial intelligence (AI) has rapidly transitioned from experimental research to regulated clinical deployment in medicine. As of 2026, the U.S. Food and Drug Administration (FDA) has authorised more than 1300 AI-enabled medical devices [1], the majority within radiology [2], reflecting an unprecedented expansion of commercially available clinical AI systems (Figure 1). Despite this growth, AI has not been widely embedded into routine workflows [3], and robust evidence supporting clinical generalisability remains limited [4]. Regulatory authorisation does not substitute for transparent, reproducible evidence of real-world clinical impact, and approval alone may be insufficient to establish durable clinical trust. Bridging the gap between innovation and clinical application requires a reproducible evidence base, and improving reporting and reproducibility practices is one of the most immediate steps toward accelerating responsible clinical translation. Clinical adoption depends not only on algorithmic performance but also on scientific reliability.

Over the past decade, numerous reporting frameworks have emerged to support transparent development and evaluation of AI systems [5-11], with the recently proposed FUTURE-AI guidelines [12] covering the entire lifecycle of healthcare AI. These initiatives collectively aim to ensure that AI studies describe datasets, model development, validation strategies, testing procedures, and clinical applicability with sufficient clarity to enable meaningful interpretation of results. For radiology researchers, these frameworks provide a practical structure during study design, guiding methodological rigour and encouraging deliberate consideration of clinical context and translational relevance.

Despite the availability of these frameworks, systematic reviews across AI in medical imaging have consistently identified reporting deficiencies in published studies [13-19]. Essential methodological components are reported incompletely or not at all. Consequently, many published AI studies cannot be independently validated or reliably compared. Beyond limiting clinical translation, poor reporting leads to inefficient use of research resources: when methodological detail is insufficient, subsequent investigators may unknowingly duplicate efforts, misinterpret findings, or expend time evaluating systems that cannot be meaningfully benchmarked. In a rapidly expanding field supported by substantial investment, this fragmentation risks slowing progress.

AI systems are increasingly positioned to influence diagnostic and procedural decisions, yet inadequate reporting limits interpretability and trust. Reported metrics often fail to characterise how systems behave under real clinical conditions. For clinicians, what matters is whether reported performance can be expected to hold in their own setting, given differences in patient mix, scanners, acquisition protocols, and workflow (one way of reporting such setting-specific performance is sketched below). The field therefore risks producing an expanding literature of increasingly complex models that inflate expectations but remain difficult to translate into practice. Further technical innovation, without methodological transparency, will not drive adoption at scale.
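To make the reporting gap concrete, the following is a minimal, hypothetical sketch of external test performance stratified by site and scanner vendor, the kind of breakdown that lets a reader judge transferability to their own environment. The file name and column names are illustrative assumptions, not details from any study discussed here:

```python
# Hypothetical sketch: stratified external-test reporting. The CSV file and
# column names ("site", "scanner_vendor", "label", "score") are illustrative
# assumptions; nothing here comes from the article itself.
import pandas as pd
from sklearn.metrics import roc_auc_score

# One row per case: ground-truth label, model output score, and the
# acquisition context that reporting guidelines ask authors to state.
results = pd.read_csv("external_test_predictions.csv")

# Overall discrimination on the external test set.
overall = roc_auc_score(results["label"], results["score"])
print(f"Overall AUC: {overall:.3f}")

# Per-site, per-vendor performance: the detail a clinician needs to judge
# whether reported results are likely to hold on their own scanners.
for (site, vendor), group in results.groupby(["site", "scanner_vendor"]):
    if group["label"].nunique() < 2:
        continue  # AUC is undefined when a subgroup contains only one class
    auc = roc_auc_score(group["label"], group["score"])
    print(f"site={site}, vendor={vendor}, n={len(group)}: AUC={auc:.3f}")
```

Publishing subgroup results in a form like this, ideally with confidence intervals, would let readers see at a glance where performance degrades rather than relying on a single headline metric.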
The proliferation of reporting guidelines indicates growing recognition of the need for transparent AI research. However, adoption remains inconsistent. Checklists are often not required by journals, submitted without verification of compliance, or completed retrospectively at manuscript submission rather than integrated into study design [20-22]. More fundamentally, academic incentives often reward novelty and incremental performance gains over replication studies, negative results, and rigorous external validation. Until incentives better align with scientific robustness, guidelines alone are unlikely to resolve reporting deficiencies.

The consequences are practical. Clinicians evaluating AI tools for departmental use are often left interpreting studies that provide insufficient information to determine whether reported performance will generalise to their own practice environment. Key details that influence clinical reliability, such as scanner vendor and model, acquisition protocols, patient population characteristics, and external validation across institutions, are crucial. Without this information, it becomes difficult to judge whether a model trained or validated in a single centre will behave similarly when deployed elsewhere. For radiology departments considering procurement, this is not an abstract concern; it directly affects whether a tool can be responsibly deployed or meaningfully compared against alternatives. Reporting checklists are not administrative exercises; they are the mechanism by which published evidence becomes clinically actionable. Whether improved adherence is sufficient, however, depends on whether existing frameworks reflect the evolving scope of medical imaging AI.

As AI applications diversify, the adequacy of existing reporting frameworks warrants reconsideration. Many current guidelines were developed primarily for predictive or classification models, while emerging applications introduce distinct challenges. For example, automated report generation using vision–language models raises different reproducibility concerns: because investigators frequently deploy pretrained or proprietary systems, reproducibility depends on clearly describing prompting strategies, system configuration, and how generated outputs are clinically evaluated (the sketch following this passage illustrates one way such a configuration could be recorded). Similarly, image-generative models introduce risks including hallucinated structures, degradation of image fidelity, and unintended modification of clinically relevant features.

Analyses of existing checklists further highlight broader limitations, including variable applicability across study designs and subjective interpretation of compliance [18]. The development of application-specific reporting extensions may therefore represent a logical next step. Just as clinical trial reporting evolved through specialised extensions [23], AI reporting standards may need to adapt to emerging methodological paradigms. This evolution should be viewed not as regulatory expansion but as the maturation of a rapidly advancing field.
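As one illustration of the kind of configuration record such an extension might ask for, the sketch below logs the parameters that determine a vision–language system's output. Every identifier, parameter value, and evaluation description is hypothetical:

```python
# Hypothetical sketch: a machine-readable record of the generation settings
# behind an AI-drafted radiology report. All values below are illustrative.
import hashlib
import json

generation_record = {
    "model_id": "example-vlm-radiology",      # hypothetical system name
    "model_version": "2026-01-15",            # exact checkpoint or API version
    "prompt_template": (
        "You are drafting a chest radiograph report. "
        "List findings, then give an impression."
    ),
    "decoding": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 512},
    "postprocessing": "none",
    "clinical_evaluation": "two radiologists rated factual consistency",
}

# Publishing a content hash alongside the record lets reviewers verify that
# the configuration reported in the paper matches the one actually used.
record_json = json.dumps(generation_record, sort_keys=True)
print(hashlib.sha256(record_json.encode("utf-8")).hexdigest())
```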
Data availability statements have become standard components of AI publications, intended to clarify whether the datasets supporting reported findings can be accessed by other investigators. In practice, however, the commonly used statement ‘data available upon reasonable request’ often fails to translate into genuine accessibility [24, 25]. Within medical imaging, the challenges are particularly complex. While publicly curated benchmark datasets have advanced algorithm development, many clinically embedded studies rely on institutional cohorts subject to privacy protections, ethics approvals, institutional governance, proprietary considerations, and substantial investment in data curation and annotation. Even when investigators are willing to share data, these institutional or legal constraints may preclude dissemination. Mandating universal open data as a prerequisite for publication would therefore risk excluding clinically embedded investigators and concentrating innovation within highly resourced research environments.

A more realistic approach is greater honesty and specificity in data availability statements. Authors should clearly describe the level of accessibility achievable, the barriers to sharing, and any alternative resources that can be provided. While clinical data sharing remains challenging, code sharing is often considerably more feasible: publication of preprocessing pipelines, model architectures, training procedures, and inference workflows can substantially enhance external verifiability even when patient datasets cannot be released (a minimal example of such a shareable inference entry point is sketched below).

Reproducibility should therefore be viewed along a spectrum rather than as a binary state. Fully open code and datasets represent one end of this spectrum, but meaningful transparency can also be achieved through controlled-access repositories, pretrained model weights, release of synthetic data, executable containers, or sufficiently detailed methodological reporting that allows independent reconstruction and benchmarking. A pragmatic approach to openness acknowledges real-world clinical constraints while still advancing the science. Journals and reviewers can facilitate this shift by encouraging layered transparency requirements and expanding data availability statements beyond simple declarations of access to clearly describe which components of a study can be independently evaluated.
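As a minimal sketch of what a shareable inference entry point could look like, assuming released weights and an illustrative single-channel ResNet-18 standing in for a study's actual architecture (file names and preprocessing constants are likewise assumptions):

```python
# Hypothetical sketch of a shareable inference entry point. Even when patient
# data cannot be released, publishing code like this, together with released
# weights, lets others reconstruct and benchmark the pipeline. The
# architecture, file names, and preprocessing constants are illustrative.
import torch
import torchvision
import torchvision.transforms as T
from PIL import Image

# Preprocessing documented in code rather than prose: input size and
# normalisation constants are exactly what the model expects.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.25]),  # illustrative values
])

def load_model(weights_path: str = "model_weights.pt") -> torch.nn.Module:
    # A single-channel ResNet-18 stands in for the study's architecture.
    model = torchvision.models.resnet18(weights=None)
    model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                  padding=3, bias=False)
    model.fc = torch.nn.Linear(model.fc.in_features, 1)
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    return model

def predict(model: torch.nn.Module, image_path: str) -> float:
    """Return the model's output probability for one greyscale image."""
    image = Image.open(image_path).convert("L")
    batch = preprocess(image).unsqueeze(0)  # shape: (1, 1, 224, 224)
    with torch.no_grad():
        return torch.sigmoid(model(batch)).item()
```

Even this small amount of code pins down the input size, normalisation, and architecture precisely enough for independent benchmarking, details that prose descriptions often leave ambiguous.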
Improving reporting quality requires coordinated action across the research ecosystem. Radiology researchers should engage with reporting frameworks at study design, not only at manuscript submission. Reviewers should assess whether reported methods are sufficiently transparent to judge whether findings would hold beyond the study cohort. For clinicians, transparent reporting underpins informed procurement decisions: understanding where a model was validated, in what patient population, and under what acquisition conditions is essential to assessing whether it will perform safely in a different clinical environment, and where it might fail.

Journals play a central role. Moving beyond guideline endorsement toward active implementation, requiring structured checklist submission and integrating reporting assessment into peer review, would meaningfully raise the standard of published AI evidence without additional scientific burden. Higher reporting standards should be framed as enablers of clinical adoption, not administrative overhead.

Clinicians need literature they can act on. For radiologists considering AI tools for clinical use, the questions are practical: does this tool reduce radiologist burnout, improve diagnostic accuracy, or deliver measurable value in my department? If the evidence base cannot answer such questions reliably [26-30], the evidentiary bar for trust must be higher. Reporting quality is what makes these claims verifiable; it is the foundation for responsible procurement, fair comparison between competing tools, and confident clinical deployment.

If medical imaging AI is to move beyond proof-of-concept publications and toward sustained clinical integration, rigorous reporting standards must become the norm rather than the exception. For the clinician reading a study, this means being able to determine not just whether a model can be trained, but whether it will hold up under conditions that reflect their own practice. That is the standard the field must meet.

Author contributions: Stanley A. Norris: conceptualization, writing – review and editing, writing – original draft. Mohamed K. Badawy: conceptualization, writing – review and editing, supervision. Conflict of interest: Mohamed K. Badawy is an Editorial Board member of JMIRO and a co-author of this article. To minimise bias, he was excluded from all editorial decision-making related to this article. For all other declarations, the authors have nothing to report.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,496 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,386 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,848 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,562 citations