This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Comparing the performance of four mainstream large language models on medical literature review generation: a human expert evaluation in SMILE surgery

2026 · 0 citations · Graefe's Archive for Clinical and Experimental Ophthalmology · Open Access

Abstract

PURPOSE: To systematically evaluate and compare the performance of four leading large language models (LLMs) in generating medical literature reviews across topics of varying research maturity, thereby providing insights for their effective and responsible application in academic writing.

METHODS: In this comparative study, we used standardized prompts to instruct four leading LLMs (GPT-4, Gemini 2.5 Pro, Grok-3, and DeepSeek R1) to generate literature reviews on nine topics related to small incision lenticule extraction (SMILE) surgery. The topics were categorized into three groups by research maturity: well-researched, controversial, and open. Seven ophthalmology experts evaluated the generated content across four dimensions (quality, accuracy, bias, and relevance), and all references were verified for authenticity. Performance differences among models were evaluated using group comparison tests followed by post-hoc analysis.

RESULTS: Significant performance differences were identified among the four models in all four dimensions (p < 0.001). Gemini ranked highest in content quality, accuracy, and bias control. In contrast, DeepSeek received the lowest relevance score despite its high quality score. Grok-3 demonstrated the highest reference authenticity (p < 0.001), whereas GPT-4 had the lowest (p < 0.001). All models showed diminished performance on open topics and exhibited severe reference fabrication ("hallucinations").

CONCLUSION: Rather than excelling universally, LLMs exhibit distinct, task-specific strengths that call for a task-driven, hybrid strategy in tool selection. Reference fabrication was pervasive across all models regardless of topic, elevating human verification from a best practice to an essential safeguard for academic integrity.
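The abstract mentions "group comparison tests followed by post-hoc analysis" without naming the tests. As a minimal sketch of what such a workflow can look like, the Python below assumes a rank-based omnibus comparison (the Kruskal-Wallis H statistic with a permutation p-value) followed by pairwise comparisons with a Bonferroni correction. This is an assumption for illustration, not the authors' stated method. The model labels and scores are hypothetical placeholders, not data from the study, and the function names are invented for this example.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p(groups, n_perm=2000, seed=0):
    """Permutation p-value: share of label shuffles with H >= observed H."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(pooled[start:start + s])
            start += s
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def posthoc_pairwise(scores, n_perm=2000, seed=0):
    """Pairwise permutation tests with Bonferroni-adjusted p-values."""
    pairs = list(combinations(scores, 2))
    return {
        (a, b): min(1.0, permutation_p([scores[a], scores[b]], n_perm, seed) * len(pairs))
        for a, b in pairs
    }


# Hypothetical expert ratings (one score per rater); NOT data from the study.
scores = {
    "model_a": [4, 5, 4, 5, 4, 5, 4],
    "model_b": [3, 3, 4, 3, 2, 3, 3],
    "model_c": [4, 4, 3, 4, 4, 3, 4],
    "model_d": [2, 3, 2, 2, 3, 2, 2],
}
omnibus_p = permutation_p(list(scores.values()))
adjusted = posthoc_pairwise(scores)
print(f"omnibus p = {omnibus_p:.4f}")
for pair, p in adjusted.items():
    print(pair, f"adjusted p = {p:.4f}")
```

A permutation p-value is used here only to keep the sketch free of third-party dependencies. With a statistics library available, a chi-square approximation for H and a dedicated post-hoc procedure such as Dunn's test would be the more conventional choices.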
