This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Benchmarking Human-AI Collaboration for Common Evidence Appraisal Tools

2024 · 0 citations · Open Access CRIS of the University of Bern · Open Access

Abstract

Background and objective: It is unknown whether large language models (LLMs) can facilitate the time- and resource-intensive text-related processes of evidence appraisal. The objective was to quantify the agreement of LLMs with human consensus in appraising the scientific reporting (Preferred Reporting Items for Systematic reviews and Meta-Analyses [PRISMA]) and methodological rigor (A MeaSurement Tool to Assess systematic Reviews [AMSTAR]) of systematic reviews and the design of clinical trials (PRagmatic Explanatory Continuum Indicator Summary 2 [PRECIS-2]), and to identify areas where collaboration between humans and artificial intelligence (AI) would outperform the traditional consensus process of human raters in efficiency.

Study design and setting: Five LLMs (Claude-3-Opus, Claude-2, GPT-4, GPT-3.5, Mixtral-8x22B) assessed 112 systematic reviews applying the PRISMA and AMSTAR criteria and 56 randomized controlled trials applying PRECIS-2. We quantified the agreement between human consensus and (1) individual human raters; (2) individual LLMs; (3) a combined-LLM approach; and (4) human-AI collaboration. Ratings were marked as deferred (undecided) when the combined LLMs, or the human rater and the LLM, gave inconsistent ratings.

Results: Individual human rater accuracy was 89% for PRISMA and AMSTAR, and 75% for PRECIS-2. Individual LLM accuracy ranged from 63% (GPT-3.5) to 70% (Claude-3-Opus) for PRISMA, 53% (GPT-3.5) to 74% (Claude-3-Opus) for AMSTAR, and 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings led to accuracies of 75%-88% for PRISMA (4%-74% deferred), 74%-89% for AMSTAR (6%-84% deferred), and 64%-79% for PRECIS-2 (29%-88% deferred). Human-AI collaboration resulted in the best accuracies: 89%-96% for PRISMA (25/35% deferred), 91%-95% for AMSTAR (27/30% deferred), and 80%-86% for PRECIS-2 (76/71% deferred).

Conclusion: Current LLMs alone appraised evidence worse than humans. Human-AI collaboration may reduce workload for the second human rater for the assessment of reporting (PRISMA) and methodological rigor (AMSTAR) but not for complex tasks such as PRECIS-2.
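
The deferral rule described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the toy ratings, and the assumption that accuracy is computed only over non-deferred ratings are all hypothetical and are not stated in the abstract.

```python
from typing import List, Optional, Sequence, Tuple

DEFERRED = None  # marker for an undecided (deferred) rating


def combine(rating_a: str, rating_b: str) -> Optional[str]:
    """Return the shared rating, or DEFERRED if the two raters disagree."""
    return rating_a if rating_a == rating_b else DEFERRED


def summarize(
    combined: Sequence[Optional[str]], consensus: Sequence[str]
) -> Tuple[float, float]:
    """Return (accuracy among decided ratings, fraction of ratings deferred).

    Assumption: accuracy is measured only on ratings that were not deferred.
    """
    if not combined or len(combined) != len(consensus):
        raise ValueError("need two non-empty sequences of equal length")
    decided = [(c, g) for c, g in zip(combined, consensus) if c is not DEFERRED]
    deferred_fraction = 1 - len(decided) / len(combined)
    if not decided:
        return float("nan"), deferred_fraction
    accuracy = sum(c == g for c, g in decided) / len(decided)
    return accuracy, deferred_fraction


# Toy data (made up for illustration): one human rater, one LLM,
# and the human consensus used as the reference standard.
human = ["yes", "no", "yes", "yes"]
llm = ["yes", "no", "no", "yes"]
consensus = ["yes", "no", "yes", "no"]

combined: List[Optional[str]] = [combine(h, m) for h, m in zip(human, llm)]
accuracy, deferred = summarize(combined, consensus)
print(f"accuracy among decided: {accuracy:.2f}, deferred: {deferred:.0%}")
```

In this toy case the human and the LLM disagree on one of four items, so that item is deferred, and the remaining three are compared against the consensus. The sketch shows why a higher deferral rate can coincide with higher accuracy: disagreements are removed from the decided set and left for further human review.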

Institutions

MRC Clinical Trials Unit at UCL (GB)

Topics

Meta-analysis and systematic reviews; Artificial Intelligence in Healthcare and Education; Delphi Technique in Research
