This is an overview page with metadata for this scientific work. The full article is available from the publisher.
A Comparative Benchmark of 19 Large Language Models for Structured Data Extraction from Neurosurgical Clinical Records (Preprint)
Citations: 0
Authors: 16
Year: 2026
Abstract
BACKGROUND: Large language models (LLMs) are increasingly used to extract information from electronic health records (EHRs). Given the rapid pace of LLM development, robust scenario-specific benchmarks are essential to evaluate clinical usefulness and support safe deployment.

OBJECTIVE: To compare contemporary LLMs on structured data extraction from real neurosurgical EHRs written in the Czech language.

METHODS: In a prospective single-center cohort, 172 hospitalized patients provided informed consent for the use of their anonymized EHRs. For each patient, predefined records were collected and concatenated. Ground truth for 35 data points was established by dual extraction with consensus. A standardized prompt requesting JSON output was submitted to 19 LLMs. The primary outcome was overall accuracy; secondary outcomes were category-level accuracy and the proportion of complete machine-readable outputs.

RESULTS: A total of 6,264 documents were collected (median 33 per patient). Ground truth was established with 92.6% initial inter-rater agreement before consensus seeking. Several models produced complete JSON outputs for 100% of cases (Claude 4.1 Opus, Grok 4, Gemini 2.5 Flash); GPT-4.1 (DeepSearch) and GPT-5 completed 99.4%. The highest accuracy was achieved by GPT-4.1 (87.6%), followed by GPT-4.5 (85.6%), Claude 4.1 (84.8%), and Grok 4 (84.2%). Accuracy varied by data type: binary (up to 95%), numeric (~89%), short text (~78%), and multiple-choice (~75%).

CONCLUSIONS: Currently available LLMs can reliably extract structured clinical information from full, non-English EHRs, while older or smaller models show major limitations. A hybrid workflow of automated extraction with targeted validation appears practical for research use.
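The two outcome measures described in the methods (overall field-level accuracy and the proportion of complete machine-readable JSON outputs) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the field names and scoring rules are assumptions for the sketch.

```python
import json

def parse_output(raw: str):
    """Return the parsed dict, or None if the output is not machine-readable JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

def score(outputs, ground_truth):
    """Score one model's raw outputs against ground-truth dicts.

    outputs: list of raw model output strings, one per patient.
    ground_truth: list of dicts keyed by the predefined data points.
    """
    parsed_all = [parse_output(o) for o in outputs]
    n_complete = sum(p is not None for p in parsed_all)
    correct = total = 0
    for parsed, truth in zip(parsed_all, ground_truth):
        if parsed is None:
            continue  # unparseable outputs contribute no scored fields
        for field, true_value in truth.items():
            total += 1
            correct += parsed.get(field) == true_value
    return {
        # proportion of cases with a complete machine-readable output
        "complete_rate": n_complete / len(outputs),
        # overall accuracy across all scored fields
        "accuracy": correct / total if total else 0.0,
    }
```

In a real evaluation the comparison would likely need per-type handling (binary, numeric, multiple-choice, free text) rather than strict equality, which is simplified here.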
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,485 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,371 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,827 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,549 citations