This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking Large Language Models Against Web of Science
Citations: 0
Authors: 7
Year: 2026
Abstract
BACKGROUND: As large language models (LLMs) grow in sophistication, their potential role in scientific writing is being explored with growing interest and caution. However, LLMs vary in their performance, contextual accuracy, and reliability. This study compares the outputs of 3 leading LLMs (ChatGPT-4o, Deepseek, and Claude 3.7) against a manually curated bibliometric analysis of the most highly cited panniculectomy articles.
METHODS: The 50 most highly cited panniculectomy publications were manually extracted from Web of Science (WoS) to serve as a reference data set. ChatGPT-4o, Deepseek, and Claude 3.7 Sonnet were each prompted to generate their own list of the 50 most cited panniculectomy articles. Outputs were compared across citation totals and averages, publication year trends, journal distribution, author co-occurrence, and article authenticity.
RESULTS: The manual data set totaled 2494 citations (density: 49.8). ChatGPT-4o, Deepseek, and Claude 3.7 produced 2111 (42.2), 4736 (94.7), and 8592 (171.8) citations, respectively. Overlap with the manual list was limited: ChatGPT-4o (14.00%), Claude 3.7 (4.00%), Deepseek (0.00%) (P<0.001). "Plastic and Reconstructive Surgery" was the most cited journal across all outputs. Unique authors: manual (241), ChatGPT-4o (114), Deepseek (72), and Claude 3.7 (129). Article accuracy: ChatGPT-4o had 34.00% accurate, 26.00% confabulated, and 40.00% hallucinated articles. Claude 3.7: 4.00% accurate, 26.00% confabulated, and 70.00% hallucinated. Deepseek: 100.00% hallucinated (P<0.001). Year trends and journal representation varied notably from the manual set.
CONCLUSIONS: Current LLMs struggle to replicate accurate bibliometric data. ChatGPT-4o performed best but still showed major limitations. WoS remains the gold standard, and LLM-generated outputs should be treated cautiously in bibliometric analyses.
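The two headline metrics in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that "density" means total citations divided by the 50 articles in each list, and that overlap is the share of reference titles that also appear in a model's list after simple case and whitespace normalization. The function names and the placeholder titles are hypothetical. Under that assumption, 2494 / 50 equals 49.88, so the reported 49.8 would correspond to truncation rather than rounding.

```python
# Assumption: every list contains exactly 50 articles, as stated in the abstract.
ARTICLES_PER_LIST = 50

# Citation totals as reported in the abstract.
reported_totals = {
    "Manual (WoS)": 2494,
    "ChatGPT-4o": 2111,
    "Deepseek": 4736,
    "Claude 3.7": 8592,
}


def citation_density(total_citations, n_articles=ARTICLES_PER_LIST):
    """Mean citations per article (assumed definition of 'density')."""
    return total_citations / n_articles


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.lower().split())


def overlap_percent(candidate_titles, reference_titles):
    """Percentage of reference titles that also appear in the candidate list."""
    reference = {normalize_title(t) for t in reference_titles}
    candidate = {normalize_title(t) for t in candidate_titles}
    return 100.0 * len(candidate & reference) / len(reference)


densities = {name: citation_density(total) for name, total in reported_totals.items()}

# Hypothetical placeholder titles: 7 of 50 shared entries, i.e. 14% overlap.
reference_list = [f"Reference article {i}" for i in range(50)]
candidate_list = reference_list[:7] + [f"Unmatched article {i}" for i in range(43)]
example_overlap = overlap_percent(candidate_list, reference_list)
```

Exact title matching is the simplest possible criterion. A real comparison would more likely match on DOI or verify each record by hand, which is what separating accurate, confabulated, and hallucinated articles requires.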
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,611 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,504 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,025 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,835 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations