This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Auto-METRICS: LLM-assisted scientific quality control for radiomics research
Citations: 3
Authors: 2
Year: 2025
Abstract
PURPOSE: The quality of radiomics research is critical for reliable clinical translation, yet methodological flaws remain prevalent. This study evaluates whether large language models (LLMs) can reliably assess the methodological quality of radiomics studies using the METhodological RadiomICs Score (METRICS).
METHODS: We compared the METRICS assessments of a commercial cloud-based LLM (Gemini Flash 2.0) for 46 articles with those of radiologists from two reproducibility studies (ADA2025 and K2025, with 6 radiologist groups and 3 radiologists, respectively, with varying degrees of experience). Cohen's kappa (κ), Pearson's correlation (PC) of METRICS scores, and error rates between LLMs and human raters were evaluated. Prompt clarifications to METRICS were suggested to improve human-LLM agreement. Twenty-four privacy-preserving open LLMs were compared with Gemini Flash 2.0.
RESULTS: In ADA2025, the commercial LLM achieved inter-rater agreement with human raters comparable to that between human raters (average κ = 0.48 vs. average κ = 0.48, respectively, Wilcoxon rank-sum test p = 0.41), leading to similar correlation values in METRICS scoring (average PC = 0.62 vs. average PC = 0.56, Wilcoxon rank-sum test p = 0.11). This was confirmed with K2025 (mean human-LLM κ = 0.58 vs. human-human κ = 0.57, Wilcoxon rank-sum test p = 0.28), with no evidence of a difference in correlation (PC = 0.68 vs. PC = 0.51, respectively, Wilcoxon rank-sum test p = 0.55). Phi4-Reasoning, an open model that can be run locally, performed comparably to Gemini Flash 2.0 (median ranking = 1 vs. median ranking = 3, respectively, across all raters).
CONCLUSION: LLMs can assist in standardized radiomics quality assessment. Open privacy-preserving models can offer performance comparable to commercial cloud-based LLMs, suggesting their utility in supporting human raters when evaluating radiomics research integrity.
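The two agreement measures named in the abstract, Cohen's kappa on item-level answers and Pearson's correlation on overall scores, can be illustrated with a minimal sketch. The example ratings, the function names `cohens_kappa` and `pearson_correlation`, and the yes/no encoding of the items are assumptions made for illustration only; the abstract does not describe how the authors encoded METRICS items or which software they used.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical item-level answers for one article (not data from the study).
human_items = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
llm_items = ["yes", "no", "no", "no", "yes", "yes", "yes", "yes"]

# Hypothetical overall quality scores for four articles (not study data).
human_scores = [62.0, 45.5, 80.0, 51.0]
llm_scores = [58.0, 50.0, 77.5, 49.0]

kappa = cohens_kappa(human_items, llm_items)
pc = pearson_correlation(human_scores, llm_scores)
print(f"kappa = {kappa:.2f}, PC = {pc:.2f}")
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is usually preferred over simple percent agreement when comparing an LLM rater against human raters on categorical checklist items.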
Similar works
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 14,092 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,912 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,149 citations
UNet++: A Nested U-Net Architecture for Medical Image Segmentation
2018 · 8,831 citations