OpenAlex · Updated hourly · Last updated: 13.05.2026, 17:57

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Not Ready for Prime Time: Limitations of a Retrieval-Augmented Generation Large Language Model in Assessing Risk of Bias in Observational Studies

2025 · 1 citation · Journal of the Pediatric Orthopaedic Society of North America · Open Access
Open full text at publisher

Citations: 1 · Authors: 5 · Year: 2025

Abstract

Background: Current research has focused on the use of large language models (LLMs) to augment systematic reviews. LLMs are limited by their vulnerability to "hallucinations"; retrieval-augmented generation (RAG) reduces these by limiting the model's source knowledge to user-provided material. The purpose of this study was to evaluate the accuracy and reliability of a RAG-LLM in quality assessment of observational studies in the pediatric orthopaedic literature as compared to manual review.

Methods: Previously published systematic reviews from our group of observational studies in pediatric orthopaedics with reported Newcastle-Ottawa Scale (NOS) scores were included. After the observational study source files were uploaded, NotebookLM (Google, Mountain View, CA) evaluated each included study using the NOS scoring sheet. Agreement among scores across all NotebookLM trials was determined using a two-way random, average measures, absolute agreement intraclass correlation coefficient [ICC(2,k)]. Agreement between the individual scores generated by each NotebookLM instance (LM1, LM2, LM3, and LM4) and ground truth (the published manual review score) was calculated using a two-way random, single measures, absolute agreement intraclass correlation coefficient [ICC(2,1)].

Results: Two systematic reviews comprising a total of 27 observational studies were included. The ICC across all measurements [ICC(2,k), Reviewer-LM1,2,3,4] was 0.69 (95% CI: 0.46-0.84), indicating moderate agreement. ICCs comparing individual NotebookLM scores to ground truth demonstrated poor agreement [ICC(2,1): LM1-Reviewer = 0.27 (95% CI: -0.064 to 0.57), LM2-Reviewer = 0.18 (95% CI: -0.12 to 0.48), LM3-Reviewer = 0.081 (95% CI: -0.24 to 0.41), and LM4-Reviewer = 0.23 (95% CI: -0.14 to 0.55)]. Percent agreement ranged from 14.8% to 29.6%. Single measures ICCs comparing individual NotebookLM scores across multiple trials demonstrated moderate-to-poor agreement.

Conclusions: NotebookLM demonstrated low reliability and accuracy in performing quality assessment of observational studies. Caution should be taken when implementing LLMs to augment research efforts in pediatric orthopaedics.

Key Concepts:
(1) NotebookLM (Google, Mountain View, CA) demonstrated low reliability and accuracy in performing quality assessment of observational studies.
(2) Caution should be taken when implementing artificial intelligence tools such as large language models (LLMs) to augment research efforts, even retrieval-augmented generation (RAG) LLMs that reduce hallucinations.
(3) Until emerging artificial intelligence technologies are further validated, it remains essential that researchers and clinicians continue to critically appraise new studies independently.

Level of Evidence: IV.
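The two-way random-effects ICC statistics reported in the Results can be computed from a subjects-by-raters score matrix using the standard Shrout–Fleiss formulas. The following is a minimal pure-Python sketch of that calculation; the example data are illustrative, not the study's actual NOS scores, and this is not the authors' code.

```python
# Sketch of two-way random-effects intraclass correlation (Shrout & Fleiss),
# absolute-agreement definition, for a subjects-by-raters ratings matrix.
# Illustrative only; not the study's actual data or analysis code.

def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for `ratings`: one row per subject
    (e.g., study being scored), one column per rater (e.g., reviewer or
    a NotebookLM instance)."""
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA without replication: partition total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)              # mean square, rows (subjects)
    msc = ss_cols / (k - 1)              # mean square, columns (raters)
    mse = ss_err / ((n - 1) * (k - 1))   # mean square, error

    # ICC(2,1): reliability of a single rater's score.
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    # ICC(2,k): reliability of the k raters' averaged score.
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average
```

As in the abstract, the single-measures form ICC(2,1) is the right comparison between one model instance and the ground-truth reviewer, while the average-measures form ICC(2,k) describes agreement of the pooled ratings and is always at least as large when agreement is positive.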

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Meta-analysis and systematic reviews