This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing

2026 · 0 citations · Journal of the American Medical Informatics Association · Open Access

Abstract

Background: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for trial-success classification across temperature and reasoning/thinking settings and determine whether single-run reporting suffices.

Materials and Methods: We utilized 250 trial abstracts labeled based on primary endpoint success. We evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0 and GPT-5.2 across reasoning-effort levels (none to x-high) with an additional temperature sweep when reasoning was disabled. Each setting was run 3 times.

Results: Reproducibility was high for Gemini (κ = 0.942-1.000; invalid outputs 0%-1.5%) and GPT-5.2 (κ = 0.984-0.995; no invalid outputs). F1 remained stable (mean/majority vote 0.955-0.971), with marginal gains from majority voting.

Conclusion: For binary biomedical classification with tightly constrained outputs, both models were reproducible across decoding and reasoning settings, suggesting single runs are often sufficient, with minimal replication as a practical stability check.
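The reproducibility measures named in the abstract (agreement κ between repeated runs, majority voting, F1) can be illustrated with a minimal sketch. The abstract does not say which κ variant was used or how it was aggregated over the 3 runs, so this sketch assumes pairwise Cohen's κ. The function names and the toy labels below are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the marginal label frequencies of each run.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs emit the same single label everywhere
    return (observed - expected) / (1 - expected)


def majority_vote(runs):
    """Per-item most frequent label; use an odd number of runs to avoid ties."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*runs)]


def f1_score(truth, pred, positive=1):
    """F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data invented for illustration: 1 = primary endpoint met, 0 = not met.
truth = [1, 1, 0, 0, 1, 0]
run1 = [1, 1, 0, 0, 1, 0]
run2 = [1, 1, 0, 0, 0, 0]
run3 = [1, 1, 0, 1, 1, 0]
runs = [run1, run2, run3]

pairwise_kappa = [cohen_kappa(a, b) for a, b in combinations(runs, 2)]
voted = majority_vote(runs)

print("pairwise kappa:", [round(k, 3) for k in pairwise_kappa])
print("per-run F1:", [round(f1_score(truth, r), 3) for r in runs])
print("majority-vote F1:", round(f1_score(truth, voted), 3))
```

In this toy case each disagreement is confined to a single run, so the majority vote recovers the reference labels even though individual runs differ. That is the kind of marginal gain from voting the abstract reports.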

Topics

Artificial Intelligence in Healthcare and Education; Biomedical Text Mining and Ontologies; Genomics and Rare Diseases
