This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparing the performance of four mainstream large language models on medical literature review generation: a human expert evaluation in SMILE surgery
Citations: 0
Authors: 16
Year: 2026
Abstract
PURPOSE: To systematically evaluate and compare the performance of four leading large language models (LLMs) in generating medical literature reviews on topics of varying research maturity, and thereby inform their effective and responsible use in academic writing.
METHODS: In this comparative study, we used standardized prompts to instruct four leading LLMs (GPT-4, Gemini 2.5 Pro, Grok-3, and DeepSeek R1) to generate literature reviews on nine topics related to small incision lenticule extraction (SMILE) surgery. The topics were grouped into three categories by research maturity: well-researched, controversial, and open. Seven ophthalmology experts evaluated the generated content on four dimensions (quality, accuracy, bias, and relevance), and all references were verified for authenticity. Performance differences among models were assessed using group comparison tests followed by post-hoc analysis.
RESULTS: Performance differed significantly among the four models on every dimension (p < 0.001). Gemini 2.5 Pro ranked highest in content quality, accuracy, and bias control. DeepSeek R1, despite a high quality score, received the lowest relevance score. Grok-3 showed the highest reference authenticity (p < 0.001) and GPT-4 the lowest (p < 0.001). All models performed worse on open topics and exhibited severe reference fabrication ("hallucinations").
CONCLUSION: Rather than excelling universally, LLMs have distinct, task-specific strengths, which calls for a task-driven, hybrid strategy in tool selection. Reference fabrication was pervasive across all models regardless of topic, which elevates human verification from a best practice to an essential safeguard for academic integrity.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,774 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,685 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,244 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,898 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Institutions
- Nanchang University (CN)
- Sun Yat-sen University (CN)
- Second Affiliated Hospital of Nanchang University (CN)
- Third Hospital of Nanchang (CN)
- Shanghai Eye Disease Prevention & Treatment Center (CN)
- Eye & ENT Hospital of Fudan University (CN)
- He Eye Hospital (CN)
- First Affiliated Hospital of Gannan Medical University (CN)
- The Central Hospital of Xiaogan (CN)
- Xiaogan First People's Hospital (CN)
- Shenzhen People’s Hospital (CN)
- First Affiliated Hospital of Xi'an Jiaotong University (CN)
- First Hospital of Xi'an (CN)