OpenAlex · Updated hourly · Last updated: 10.04.2026, 06:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative performance of Chinese and international large language models on the Chinese radiology attending physician qualification examination

2025 · 1 citation · Scientific Reports · Open Access

Citations: 1
Authors: 8
Year: 2025

Abstract

This study evaluates the accuracy and reliability of six large language models (LLMs) in radiology, three Chinese (Doubao, Kimi, DeepSeek) and three international (ChatGPT-4o, Gemini 2.0 Pro, Grok3), using simulated questions from the 2025 Chinese Radiology Attending Physician Qualification Examination (CRAPQE). The analysis covered 400 CRAPQE-simulated questions spanning several formats (A1, A2-A4, B, and C-type) and two modalities (text-only and image-based). Expert radiologists scored the responses against official answer keys. Comparisons within and between the Chinese and international LLM groups assessed overall, unit-specific, question-type-specific, and modality-specific accuracy. All LLMs passed the CRAPQE simulation, showing proficiency comparable to that of a radiology attending physician. Chinese LLMs achieved a higher mean accuracy (87.2%) than international LLMs (80.4%, P < 0.05) and excelled in text-only and A1-type questions (P < 0.05). Within the Chinese group, DeepSeek (91.6%) and Doubao (89.5%) outperformed Kimi (80.5%, P < 0.0167), while the international LLMs showed no significant differences among themselves (P > 0.05). All models surpassed the passing threshold on image-based questions but performed worse on them than on text-only questions, with no difference between groups (P > 0.05). This pioneering comparison highlights the potential of LLMs in radiology, with Chinese models outperforming their international counterparts, likely due to localized training, and provides evidence to guide the development of medical AI.

Similar works