OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 27.05.2026, 06:33

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

2025·2 Zitationen·Digital HealthOpen Access
Volltext beim Verlag öffnen

2

Zitationen

20

Autoren

2025

Jahr

Abstract

Importance: Compares the responses of four AI models to common nephrology-related questions encountered in clinical settings. Objective: To evaluate generative AI models in enhancing nephrology patient communication and education. Design: Generative AI in Nephrology. Setting: In a study conducted from December 8-12, 2023, and October 21-23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency. Interventions for clinical trials or Exposures for observational studies: None. Main Outcomes and Measures: tests with Holm correction, along with TF-IDF, BertScore, and ROUGE were used. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions. Results: GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness.For Kidney-Related Queries, GPT-4o excelled in relevance metrics, achieving the highest BertScore (0.57) and ROUGE for one-word metrics (0.54). Gemini 1.0 Ultra led in generating coherent responses for Renal Biopsy Reports with the highest TF-IDF (0.56) and ROUGE for longest similar sentences (0.47). All 101 references provided by GPT-4 were 100% accurate. Conclusions and Relevance: GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.

Ähnliche Arbeiten