This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing
Citations: 0
Authors: 7
Year: 2026
Abstract
Background: This study quantifies the run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for trial-success classification across temperature and reasoning/thinking settings, and determines whether single-run reporting suffices.
Materials and Methods: We used 250 trial abstracts labeled by primary endpoint success. We evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures from 0.0 to 2.0, and GPT-5.2 across reasoning-effort levels (none to x-high), with an additional temperature sweep when reasoning was disabled. Each setting was run 3 times.
Results: Reproducibility was high for Gemini (κ = 0.942-1.000; invalid outputs 0%-1.5%) and GPT-5.2 (κ = 0.984-0.995; no invalid outputs). F1 remained stable (mean and majority-vote F1, 0.955-0.971), with only marginal gains from majority voting.
Conclusion: For binary biomedical classification with tightly constrained outputs, both models were reproducible across decoding and reasoning settings. This suggests that single runs are often sufficient, with minimal replication serving as a practical stability check.
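The agreement and voting metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which κ variant was used or how it was aggregated over the 3 runs, so pairwise Cohen's κ is an assumption here. The function names and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from the two runs' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs emitted one and the same label throughout
    return (observed - expected) / (1 - expected)


def majority_vote(runs):
    """Per-item majority label; use an odd number of runs to avoid ties."""
    return [Counter(item).most_common(1)[0][0] for item in zip(*runs)]


def f1_score(truth, pred, positive=1):
    """F1 for the positive class of a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): 8 abstracts, 3 runs of one setting.
truth = [1, 1, 0, 0, 1, 0, 1, 0]
runs = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 0],
]

pairwise_kappa = [cohen_kappa(a, b) for a, b in combinations(runs, 2)]
mean_f1 = sum(f1_score(truth, r) for r in runs) / len(runs)
voted_f1 = f1_score(truth, majority_vote(runs))
print(pairwise_kappa, mean_f1, voted_f1)
```

In this toy case the runs disagree on two items, yet the majority vote recovers every label. That is the mechanism by which voting can yield the kind of small F1 gain the abstract reports.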
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,786 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,700 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,270 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,908 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations