This is an overview page with metadata for this scholarly article. The full text is available from the publisher.
Not Ready for Prime Time: Limitations of a Retrieval-Augmented Generation Large Language Model in Assessing Risk of Bias in Observational Studies
Citations: 1
Authors: 5
Year: 2025
Abstract
Background: Current research has focused on the use of large language models (LLMs) to augment systematic reviews. LLMs are limited by their vulnerability to "hallucinations"; retrieval-augmented generation (RAG) reduces these by limiting the model's source knowledge to user-provided material. The purpose of this study was to evaluate the accuracy and reliability of a RAG-LLM in quality assessment of observational studies in the pediatric orthopaedic literature as compared to manual review.

Methods: Previously published systematic reviews of observational studies in pediatric orthopaedics from our group containing reported Newcastle-Ottawa Scale (NOS) scores were included. After uploading the observational study source files, NotebookLM (Google, Mountain View, CA) evaluated each of the included studies using the NOS scoring sheet. Agreement among scores across all NotebookLM trials was determined using a two-way random, average measures, absolute agreement intraclass correlation coefficient [ICC(2,k)]. Agreement between individual scores generated by each NotebookLM instance (LM1, LM2, LM3, and LM4) and ground truth (the published manual review score) was calculated using a two-way random, single measures, absolute agreement intraclass correlation coefficient [ICC(2,1)].

Results: Two systematic reviews comprising a total of 27 observational studies were included. The ICC across all measurements [ICC(2,k)-Reviewer-LM1,2,3,4] was 0.69 (95% CI: 0.46-0.84), indicating moderate agreement. ICCs comparing individual NotebookLM scores to ground truth demonstrated poor agreement [ICC(2,1) LM1-Reviewer = 0.27 (95% CI: -0.064 to 0.57), LM2-Reviewer = 0.18 (95% CI: -0.12 to 0.48), LM3-Reviewer = 0.081 (95% CI: -0.24 to 0.41), and LM4-Reviewer = 0.23 (95% CI: -0.14 to 0.55)]. Percent agreement ranged from 14.8% to 29.6%. Single measures ICCs comparing individual NotebookLM scores across multiple trials demonstrated moderate-to-poor agreement.
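The two ICC forms named in the Methods, ICC(2,1) and ICC(2,k), both derive from a two-way random-effects ANOVA over a subjects-by-raters score matrix. The following is a minimal sketch of that computation, not the authors' analysis code; the function name `icc2` and the toy rating matrices are illustrative assumptions.

```python
import numpy as np

def icc2(ratings, average=False):
    """Two-way random-effects, absolute-agreement ICC (illustrative sketch).

    ratings: (n_subjects, k_raters) array of scores.
    average=False -> ICC(2,1) single measures;
    average=True  -> ICC(2,k) average measures.
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)   # per-subject means
    col_means = Y.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares and mean squares
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subjects
    msc = ss_cols / (k - 1)               # between-raters
    mse = ss_err / ((n - 1) * (k - 1))    # residual

    if average:  # ICC(2,k)
        return (msr - mse) / (msr + (msc - mse) / n)
    # ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between two raters yields ICC(2,1) = 1.0,
# while a constant offset between raters lowers absolute agreement.
print(icc2([[1, 1], [2, 2], [3, 3]]))                  # exact agreement
print(icc2([[1, 2], [2, 3], [3, 4]]))                  # rater offset, ICC(2,1)
print(icc2([[1, 2], [2, 3], [3, 4]], average=True))    # rater offset, ICC(2,k)
```

Because absolute-agreement ICC penalizes systematic rater offsets, a model that consistently scores studies one NOS point high would show poor ICC(2,1) even if its rank ordering of studies matched the manual reviewer.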
Conclusions: NotebookLM demonstrated low reliability and accuracy in performing quality assessment of observational studies. Caution should be taken when implementing LLMs to augment research efforts in pediatric orthopaedics.

Key Concepts:
(1) NotebookLM (Google, Mountain View, CA) demonstrated low reliability and accuracy in performing quality assessment of observational studies.
(2) Caution should be taken when implementing artificial intelligence tools such as large language models (LLMs) to augment research efforts, even retrieval-augmented generation (RAG)-LLM models that reduce hallucinations.
(3) Until emerging artificial intelligence technologies are further validated, it remains essential that researchers and clinicians continue to critically appraise new studies independently.

Level of Evidence: IV.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations