This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Validation of Generative AI Models to Expedite Title and Abstract Screening in Systematic Reviews
Citations: 0
Authors: 7
Year: 2025
Abstract
Introduction: Systematic literature reviews (SLRs) are a vital part of evidence-based research, informing healthcare decisions and policymaking. However, the traditional process of conducting them can be lengthy, labor-intensive, and costly. This increases the need for more efficient strategies, such as automation using generative artificial intelligence (AI), which could help researchers reduce their workload and streamline the SLR process. The current study investigates the relative efficiency of three generative AI models (Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4) in the title and abstract screening phase of SLRs.
Methods: Key biomedical databases, including Embase®, Medline®, and Cochrane, were searched to identify relevant randomised controlled trials in patients with schizophrenia. The study used a hybrid approach to systematic review screening in which one reviewer is a human expert and the other role is filled by three large language models (LLMs). A subject matter expert in conducting SLRs optimized and fine-tuned the final prompt, which was delivered through a Python application programming interface, to identify evidence meeting key inclusion and exclusion criteria. The screening results from the human reviewer and the three AI models were reviewed by a subject matter expert (SME). The AI models’ performance was evaluated using metrics such as accuracy, sensitivity, specificity, and precision to assess their success in identifying publications included in the final SLR.
Results: All three AI models performed well in title and abstract screening. Although accuracy did not differ significantly between the models, Gemini Flash 1.5 had the highest accuracy (96.02%), followed by GPT-4 (95.00%) and Claude Sonnet 3.5 (94.69%). GPT-4 had the highest sensitivity (95.97%), followed by Gemini Flash 1.5 (94.63%) and Claude Sonnet 3.5 (88.59%).
Among the AI models evaluated, GPT-4 showed the highest concordance with the human reviewer (88.77%), followed closely by Gemini Flash 1.5 (86.63%) and Claude Sonnet 3.5 (85.81%), indicating consistently high agreement across all models.
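The abstract names accuracy, sensitivity, specificity, precision, and concordance but does not say how they were calculated. The following is a minimal sketch using the standard confusion-matrix definitions, assuming each record carries a boolean include/exclude decision. The function names (`screening_metrics`, `percent_agreement`) and the toy decisions are hypothetical illustrations, not the authors' code or the study's data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    precision: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(reference: Sequence[bool], predicted: Sequence[bool]) -> ScreeningMetrics:
    """Compare model decisions against reference decisions (True = include)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # correctly included
    tn = sum(1 for r, p in pairs if not r and not p)  # correctly excluded
    fp = sum(1 for r, p in pairs if not r and p)      # wrongly included
    fn = sum(1 for r, p in pairs if r and not p)      # wrongly excluded
    return ScreeningMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        precision=_ratio(tp, tp + fp),
    )


def percent_agreement(first: Sequence[bool], second: Sequence[bool]) -> float:
    """Share of records on which two reviewers made the same decision."""
    if len(first) != len(second):
        raise ValueError("both decision lists must have the same length")
    return _ratio(sum(1 for a, b in zip(first, second) if a == b), len(first))


# Hypothetical toy decisions for eight records (NOT data from the study).
reference = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, True, False]

metrics = screening_metrics(reference, predicted)
agreement = percent_agreement(reference, predicted)
print(metrics)
print(f"agreement: {agreement:.2%}")
```

In a screening setting, sensitivity is usually the metric to watch most closely, because a false exclusion removes a relevant study from the review, whereas a false inclusion only adds work at the full-text stage.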