OpenAlex · Updated hourly · Last updated: 27 May 2026, 12:03

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination

2026 · 0 citations · BMJ Open · Open Access

Citations: 0 · Authors: 5 · Year: 2026

Abstract

OBJECTIVES: To assess and compare the performance of four contemporary frontier large language models (LLMs), namely GPT-5.2 (OpenAI), Gemini 3 Pro (Google DeepMind), Claude Sonnet 4.6 (Anthropic) and Grok 4.1 (xAI), on a simulated Fellowship of The Royal College of Surgeons Urology (FRCS (Urol)) Part A examination, evaluating overall accuracy, subspecialty-level performance, output consistency and response time.

DESIGN: Controlled comparative evaluation study using a standardised simulation framework with repeated independent testing runs per model.

SETTING: All models were accessed via their respective consumer-facing interfaces. No clinical setting or patient data were involved. Testing was conducted under uniform conditions with conversational memory disabled across all sessions.

PARTICIPANTS: Four large language models were evaluated. No human participants were involved. Models were selected to represent the current frontier of publicly accessible LLMs from four distinct commercial developers. No models were excluded following selection.

INTERVENTIONS: Each model was presented with 240 FRCS (Urol) Part A single best answer questions, mapped to the Joint Committee on Intercollegiate Examinations' Urology Syllabus Blueprint (2023). A standardised prompt was delivered at the start of each session. Each model completed five independent examination runs. No fine-tuning or system-level modification was applied to any model.

PRIMARY AND SECONDARY OUTCOME MEASURES: The primary outcome was overall examination accuracy for each model, benchmarked against an indicative pass threshold for the FRCS (Urol) Part A examination. Secondary outcomes were performance across 18 individual urology subspecialty topics; response time, reported as mean total and per-question elapsed time; and consistency of performance, quantified by SD and 95% CIs derived from a sequential Monte Carlo sampling procedure. All outcomes were prospectively planned and fully measured as specified.

RESULTS: Three of four models exceeded the indicative 74% pass threshold: Gemini 3 Pro (82.4% ± 0.9%; 95% CI 81.3 to 83.6%), Claude Sonnet 4.6 (79.3% ± 1.1%; 95% CI 77.9 to 80.6%) and GPT-5.2 (76.1% ± 2.4%; 95% CI 73.1 to 79.1%). Grok 4.1 failed (70.4% ± 0.6%; 95% CI 69.6 to 71.2%), with its entire CI below 74%. All models completed the assessment in under 3 min. Strong performance was observed in research methodology (90 to 98%) and andrology (92 to 98%), with the weakest results in paediatric urology (38.7 to 54.7%) and testicular cancer (48.2 to 67.3%). Substantial within-model output instability was identified across several domains, most notably GPT-5.2 in female urology (SD ± 22.8%) and anatomy (SD ± 14.2%).

CONCLUSIONS: Three of four frontier LLMs achieved scores consistent with passing the FRCS (Urol) Part A examination, representing a substantial advance since ChatGPT-3.5. Aggregate accuracy alone, however, obscures important subspecialty weaknesses and output instability. LLMs should be regarded as adjunctive revision aids rather than authoritative knowledge sources and always used alongside expert-led teaching. Future work should evaluate performance on Part B and viva-style assessments.
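The reported figures can also be read by asking where each model's 95% CI sits relative to the 74% threshold, rather than looking at the mean alone. The following is a minimal illustrative sketch, not part of the study: the numbers are copied from the results above, while the function names and category labels are assumptions introduced here for illustration.

```python
# Values copied from the RESULTS paragraph of the abstract (all in percent).
PASS_THRESHOLD = 74.0  # indicative pass threshold stated in the abstract

# model -> (mean accuracy, 95% CI lower bound, 95% CI upper bound)
REPORTED = {
    "Gemini 3 Pro": (82.4, 81.3, 83.6),
    "Claude Sonnet 4.6": (79.3, 77.9, 80.6),
    "GPT-5.2": (76.1, 73.1, 79.1),
    "Grok 4.1": (70.4, 69.6, 71.2),
}


def classify(mean, ci_low, ci_high, threshold=PASS_THRESHOLD):
    """Label a result by where its 95% CI sits relative to the threshold.

    The labels are hypothetical categories chosen for this sketch,
    not terminology used by the authors.
    """
    if ci_low > threshold:
        return "CI entirely above threshold"
    if ci_high < threshold:
        return "CI entirely below threshold"
    if mean > threshold:
        return "CI spans threshold (mean above)"
    return "CI spans threshold (mean at or below)"


def summarise(results=REPORTED):
    """Return a mapping from model name to its CI-based label."""
    return {model: classify(*values) for model, values in results.items()}


if __name__ == "__main__":
    for model, label in summarise().items():
        print(f"{model}: {label}")
```

Read this way, Gemini 3 Pro and Claude Sonnet 4.6 have intervals entirely above 74% and Grok 4.1 has an interval entirely below it, while the reported lower bound for GPT-5.2 (73.1%) falls just under the threshold even though its mean exceeds it.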

Similar works