This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Detection of Medical Misinformation in Hemangioma Patient Education: Comparative Study of ChatGPT-4o and DeepSeek-R1 Large Language Models
Citations: 0
Authors: 8
Year: 2025
Abstract
Background: This study examines the capability of large language models (LLMs) in detecting medical rumors, using hemangioma-related information as an example. It compares the performances of ChatGPT-4o and DeepSeek-R1.
Objective: This study aimed to evaluate and compare the accuracy, stability, and expert-rated reliability of 2 LLMs, ChatGPT-4o and DeepSeek-R1, in classifying medical information related to hemangiomas as either "rumors" or "accurate information."
Methods: We collected 82 publicly available texts from social media platforms, medical education websites, international guidelines, and journals. Of the 82 items, 47/82 (57%) were labeled as "rumors," and 35/82 (43%) were labeled as "accurate information." Three vascular anomaly specialists with extensive clinical experience independently annotated the texts in a double-blinded manner, and disagreements were resolved by arbitration to ensure labeling reliability. Subsequently, these texts were input into ChatGPT-4o and DeepSeek-R1, with each model generating 2 rounds of results under identical instructions. Output stability was assessed using bidirectional encoder representations from transformers-based semantic similarity scores. Classification accuracy, precision, recall, and F1-score were calculated to evaluate the performance. Additionally, 2 medical experts independently rated the model outputs using a 5-point scale based on clinical guidelines. Statistical analyses included paired t tests, Wilcoxon signed-rank tests, and bootstrap resampling to compute confidence intervals.
Results: In terms of semantic stability, the similarity distributions for the 2 models largely overlapped, with no statistically significant difference observed (mean difference=-0.003, 95% CI -0.011 to 0.005; P=.30). Regarding classification performance, DeepSeek-R1 achieved higher accuracy (0.963) compared to ChatGPT-4o (0.910), and also performed better in terms of precision (0.978 vs 0.940), recall (0.957 vs 0.894), and F1-score (0.967 vs 0.916). Expert evaluations revealed that DeepSeek-R1 significantly outperformed ChatGPT-4o on both "rumor" items (mean difference=0.431; P<.001; Cohen dz=0.594) and "accurate information" items (mean difference=0.264; P=.045; Cohen dz=0.352), with a particularly pronounced advantage in rumor detection.
Conclusions: DeepSeek-R1 demonstrated greater accuracy and rationale in detecting medical rumors compared with ChatGPT-4o. This study provides empirical support for the application of LLMs and recommends optimizing accuracy and incorporating real-time verification mechanisms to mitigate the harmful impact of misleading information on patient health.
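The F1-scores and Cohen dz values quoted in the abstract follow standard definitions, which can be illustrated with a minimal sketch. It assumes the usual formulas: F1 as the harmonic mean of precision and recall, and dz as the mean paired difference divided by the sample standard deviation of the differences. The function names `f1_score` and `cohen_dz` and the rating lists are hypothetical and not taken from the study. Only the precision and recall values come from the abstract, and the article's own analysis code is not shown on this page.

```python
from statistics import mean, stdev


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_dz(scores_a: list, scores_b: list) -> float:
    """Paired effect size: mean difference / sample SD of the differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)


# Precision and recall exactly as reported in the abstract.
reported = {
    "DeepSeek-R1": (0.978, 0.957),
    "ChatGPT-4o": (0.940, 0.894),
}
f1_by_model = {
    model: round(f1_score(p, r), 3) for model, (p, r) in reported.items()
}

# Hypothetical 5-point expert ratings for six items (illustration only,
# NOT data from the study).
ratings_model_a = [5, 4, 5, 4, 5, 3]
ratings_model_b = [4, 4, 4, 3, 5, 3]
example_dz = cohen_dz(ratings_model_a, ratings_model_b)

if __name__ == "__main__":
    print(f1_by_model)
    print(round(example_dz, 3))
```

Recomputing F1 from the rounded precision and recall gives 0.967 and 0.916, which agrees with the values reported above.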
Similar works
The spread of true and false news online
2018 · 8,275 citations
What is Twitter, a social network or a news media?
2010 · 6,679 citations
Social Media and Fake News in the 2016 Election
2017 · 6,509 citations
Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception
1983 · 6,291 citations
The Matthew Effect in Science
1968 · 6,216 citations