This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmarking large language models for congenital cataract parent counseling: safety, readability, and knowledge translation of developmental and genetic information
0 citations · 7 authors · published 2026
Abstract
Background: Congenital cataract (CC) is a time-critical cause of preventable childhood visual impairment. After diagnosis, parents frequently experience uncertainty and increasingly seek guidance online. The safety, readability, and counseling quality of large language model (LLM) responses for CC remain insufficiently benchmarked, particularly for explanations involving lens development, etiology, and genetic risk.

Methods: We performed a cross-sectional comparative evaluation of five publicly accessible Chinese-language conversational LLMs (ChatGPT-5.2, Gemini 3 Pro, DeepSeek-V3.1, Doubao, and Kimi K2). Thirty standardized parent-facing CC questions were developed by senior ophthalmologists and mapped to five domains, specifically incorporating scenarios that require translating lens developmental pathology and genetic counseling knowledge. Two researchers independently performed standardized zero-shot querying and response recording under identical conditions. Output efficiency and textual structure were extracted. Two blinded ophthalmologists rated each response on a 5-point Likert scale across Accuracy, Logic, Coherence, Safety, and Content Accessibility; inter-rater agreement was assessed using quadratic weighted Cohen's kappa. Group differences were tested using ANOVA or Kruskal–Wallis H tests with Bonferroni-corrected pairwise comparisons.

Results: Significant between-model differences were observed in output efficiency and text characteristics (all P < 0.001). ChatGPT-5.2 was fastest (17.94 ± 5.11), whereas DeepSeek-V3.1 and Kimi K2 were slowest (41.46 ± 3.22 and 40.02 ± 4.67, respectively). DeepSeek-V3.1 generated the longest responses (1,456.93 ± 224.99 words) and Kimi K2 the shortest (640.83 ± 252.95 words). ChatGPT-5.2 showed the strongest tendency toward structured/tabular output [2.00 (1.00, 2.00)], followed by Gemini 3 Pro [1.00 (1.00, 1.25)], while the other models rarely produced tables. Quadratic weighted Cohen's kappa indicated good inter-rater reliability (0.686–0.767). Content quality differed significantly across models (Accuracy H = 41.15, Logic H = 32.95, Content Accessibility H = 41.33; all P < 0.001). ChatGPT-5.2 and Gemini 3 Pro achieved higher overall profiles and did not differ significantly from each other, whereas Kimi K2 scored lower on multiple dimensions.

Conclusion: LLM performance in translating lens developmental pathology and genetics for CC parent counseling is model-dependent. Longer outputs did not necessarily translate into higher quality; structured presentation was more closely associated with better safety and accessibility. These findings provide quantitative benchmarks for safer, parent-centered deployment of LLMs in pediatric ophthalmology education and support more reliable translation of complex disease-related knowledge into actionable parent guidance.
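The two core statistics named in the Methods, quadratic weighted Cohen's kappa for inter-rater agreement and the Kruskal–Wallis H test for between-model score differences, can be illustrated with a minimal sketch. The ratings below are hypothetical 5-point Likert scores invented for demonstration (the paper's raw data are not shown here), and scikit-learn/SciPy are an assumed toolchain; the paper does not state which software it used.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical 5-point Likert ratings from two blinded raters on ten responses.
rater1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

# Quadratic weighting penalizes large disagreements (e.g. 5 vs 2)
# more heavily than near-misses (5 vs 4).
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Hypothetical accuracy scores for three models; the Kruskal-Wallis H test
# checks whether at least one model's score distribution differs.
model_a = [5, 4, 5, 4, 5]
model_b = [4, 4, 3, 4, 4]
model_c = [2, 3, 2, 3, 2]
h_stat, p_value = kruskal(model_a, model_b, model_c)

print(f"kappa={kappa:.3f}, H={h_stat:.2f}, p={p_value:.4f}")
```

A significant omnibus H would then be followed by pairwise comparisons with a Bonferroni-adjusted alpha (here 0.05/3 for three model pairs), mirroring the correction described in the Methods.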
Related works
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
2015 · 31,241 citations
A global reference for human genetic variation
2015 · 19,579 citations
The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data
2012 · 18,149 citations
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
2010 · 15,357 citations
A method and server for predicting damaging missense mutations
2010 · 13,473 citations