OpenAlex · Updated hourly · Last updated: Apr 17, 2026, 04:41

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Leveraging Large Language Models for Automated Extraction of Abdominal Aortic Aneurysm Features from Radiology Reports

2026 · 0 citations · 10 authors · Diagnostics · Open Access

Abstract

<b>Background/Objectives</b>. Abdominal computed tomography (CT) radiology reports contain critical information for abdominal aortic aneurysm (AAA) management, including aneurysm presence, size, rupture status, and prior repair. However, this information is often embedded within lengthy, heterogeneous reports, making manual extraction inefficient. We evaluated the performance of multiple large language models (LLMs) for automated extraction of AAA-related findings from radiology reports. <b>Methods</b>. We retrospectively analyzed 500 abdominal CT reports mentioning AAA from an urban academic health system (2020–2024). Ground-truth labels were established by manual review. Four open-source LLMs (Qwen2.5-7B-Instruct, Llama3-Med42-8B, GPT-OSS-20B, and MedGemma-27B-text-it) were evaluated for extraction of aneurysm presence, size, morphology, rupture status, impending rupture, and prior aortic repair. Model outputs were compared with ground truth using exact-match accuracy, and inter-model agreement was assessed using Fleiss' kappa. Reasoning traces were examined to characterize correct and incorrect model behavior. <b>Results</b>. Accuracy for identifying AAA presence ranged from 0.90 to 0.95 (κ = 0.851), and for prior aortic repair from 0.90 to 0.97 (κ = 0.793). Accuracy for aneurysm size ranged from 0.67 to 0.88 (κ = 0.340), with the low κ values attributable to class imbalance and dimension misselection. Rupture and impending rupture were identified with accuracies exceeding 0.90 across models, though agreement was lower (κ = 0.485 and 0.589, respectively), reflecting low event prevalence. Larger models (GPT-OSS-20B, MedGemma-27B) generally outperformed smaller models. Reasoning analysis revealed strengths in measurement prioritization but also recurrent errors, including dimension misselection, over-inference of prior repair, and conservative classification of rupture-related findings. <b>Conclusions</b>. LLMs can accurately extract clinically relevant AAA information from radiology reports with interpretable reasoning, with larger and medically trained models outperforming smaller or general-purpose models. Performance varies by task and model, underscoring the need for careful validation and human-in-the-loop deployment in clinical settings.
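The abstract describes two evaluation metrics: exact-match accuracy of each model's outputs against manually established ground truth, and Fleiss' kappa for agreement across the four models. As a minimal sketch (not the authors' code), both can be computed in pure Python; the labels and item counts below are illustrative only:

```python
from collections import Counter

def exact_match_accuracy(preds, truth):
    """Fraction of model outputs that exactly match the ground-truth label."""
    assert len(preds) == len(truth) and truth
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, where each item is a list of
    categorical labels, one per rater (here: one per model).
    Every item must be labeled by the same number of raters."""
    n = len(ratings[0])                       # raters (models) per item
    p_bar = 0.0                               # mean per-item agreement
    totals = Counter()                        # overall category counts
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        p_bar += (sum(v * v for v in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(ratings)
    total_ratings = n * len(ratings)
    p_e = sum((v / total_ratings) ** 2 for v in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Illustrative example: 3 reports, one model's predictions vs. ground truth,
# then agreement of 4 models that all give the same labels (kappa = 1.0).
truth = ["yes", "no", "yes"]
model_a = ["yes", "no", "no"]
print(exact_match_accuracy(model_a, truth))          # 2 of 3 correct
print(fleiss_kappa([["yes"] * 4, ["no"] * 4, ["yes"] * 4]))
```

Note that with rare labels such as rupture, chance agreement `p_e` is high, so kappa can be low even when raw accuracy exceeds 0.90, which matches the pattern reported in the abstract.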
