This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmarking large language models for congenital cataract parent counseling: safety, readability, and knowledge translation of developmental and genetic information
0 citations · 7 authors · published 2026
Abstract
Background: Congenital cataract (CC) is a time-critical cause of preventable childhood visual impairment. After diagnosis, parents frequently experience uncertainty and increasingly seek guidance online. The safety, readability, and counseling quality of large language model (LLM) responses for CC remain insufficiently benchmarked, particularly for explanations involving lens development, etiology, and genetic risk.

Methods: We performed a cross-sectional comparative evaluation of five publicly accessible Chinese-language conversational LLMs (ChatGPT-5.2, Gemini 3 Pro, DeepSeek-V3.1, Doubao, and Kimi K2). Thirty standardized parent-facing CC questions were developed by senior ophthalmologists and mapped to five domains, specifically incorporating scenarios that require translating lens developmental pathology and genetic counseling knowledge. Two researchers independently performed standardized zero-shot querying and response recording under identical conditions. Output efficiency and textual structure were extracted. Two blinded ophthalmologists rated each response on a 5-point Likert scale across Accuracy, Logic, Coherence, Safety, and Content Accessibility; inter-rater agreement was assessed using quadratic weighted Cohen's kappa. Group differences were tested using ANOVA or Kruskal–Wallis H tests with Bonferroni-corrected pairwise comparisons.

Results: Significant between-model differences were observed in output efficiency and text characteristics (all P < 0.001). ChatGPT-5.2 was fastest (17.94 ± 5.11), whereas DeepSeek-V3.1 and Kimi K2 were slowest (41.46 ± 3.22 and 40.02 ± 4.67, respectively). DeepSeek-V3.1 generated the longest responses (1,456.93 ± 224.99 words) and Kimi K2 the shortest (640.83 ± 252.95 words). ChatGPT-5.2 showed the strongest tendency toward structured/tabular output [2.00 (1.00, 2.00)], followed by Gemini 3 Pro [1.00 (1.00, 1.25)], while the other models rarely produced tables. Quadratic weighted Cohen's kappa indicated good inter-rater reliability (0.686–0.767). Content quality differed significantly across models (Accuracy H = 41.15, Logic H = 32.95, Content Accessibility H = 41.33; all P < 0.001). ChatGPT-5.2 and Gemini 3 Pro achieved higher overall profiles and did not differ significantly from each other, whereas Kimi K2 scored lower on multiple dimensions.

Conclusion: LLM performance in translating lens developmental pathology and genetics for CC parent counseling is model-dependent. Longer outputs did not necessarily translate into higher quality; structured presentation was more closely associated with better safety and accessibility. These findings provide quantitative benchmarks for safer, parent-centered deployment of LLMs in pediatric ophthalmology education and support more reliable translation of complex disease-related knowledge into actionable parent guidance.
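The two core statistics named in the Methods, quadratic weighted Cohen's kappa for inter-rater agreement and the Kruskal–Wallis H test for between-model score differences, can be illustrated with a minimal sketch. The ratings below are hypothetical 5-point Likert scores invented for demonstration (the paper's raw data are not shown here), and scikit-learn/SciPy are an assumed toolchain; the paper does not state which software it used.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical 5-point Likert ratings from two blinded raters on ten responses.
rater1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

# Quadratic weighting penalizes large disagreements (e.g. 5 vs 2)
# more heavily than near-misses (5 vs 4).
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Hypothetical accuracy scores for three models; the Kruskal-Wallis H test
# checks whether at least one model's score distribution differs.
model_a = [5, 4, 5, 4, 5]
model_b = [4, 4, 3, 4, 4]
model_c = [2, 3, 2, 3, 2]
h_stat, p_value = kruskal(model_a, model_b, model_c)

print(f"kappa={kappa:.3f}, H={h_stat:.2f}, p={p_value:.4f}")
```

A significant omnibus H would then be followed by pairwise comparisons with a Bonferroni-adjusted alpha (here 0.05/3 for three model pairs), mirroring the correction described in the Methods.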
Related works
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
2015 · 31,241 citations
A global reference for human genetic variation
2015 · 19,579 citations
The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data
2012 · 18,149 citations
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
2010 · 15,357 citations
A method and server for predicting damaging missense mutations
2010 · 13,473 citations