This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Vision-Capable LLMs in Microsurgery: A Blinded Comparison of Two AI Models with Expert Microsurgeons in the Appraisal of 200 Experimental Anastomoses
Citations: 0 · Authors: 16 · Year: 2026
Abstract
Background/Objectives: Objective end-product assessment of microsurgical anastomoses is labor-intensive and partly subjective. Vision-capable large language models (LLMs) may enable standardized image-based scoring, but their agreement with expert assessment remains uncertain. Methods: We studied 200 end-to-end femoral artery anastomoses performed on chicken legs by novice, intermediate, and experienced microsurgeons. Images were scored independently by two blinded expert panels; disagreements were adjudicated by a third senior reviewer to establish expert consensus. Two LLMs, ChatGPT 5.2 Thinking Extended and Gemini 3.1 Pro, were evaluated using the same prompt and rubric. Each image was analyzed three times per model. Final scores were aggregated by median for numeric items and by majority vote for categorical items. The primary endpoint was exact-match agreement with expert consensus; agreement within ±1 was also assessed for numeric items. Agreement was measured using simple percentage agreement, Light's kappa, and Krippendorff's alpha; Bland–Altman analysis was used for numeric count items. Results: LLM 1 achieved higher overall exact-match agreement than LLM 2 (0.659 vs. 0.539). Both models performed better on categorical than on numeric items (0.713 vs. 0.610 for LLM 1 and 0.651 vs. 0.445 for LLM 2). LLM 1 showed the greatest advantages for gaps, knots, oblique stitches, and wide bites. Krippendorff's alpha was positive for most endpoints with LLM 1, whereas LLM 2 showed negative values throughout. Allowing a ±1 tolerance for numeric items greatly improved agreement, from 0.610 to 0.900 for LLM 1 and from 0.445 to 0.826 for LLM 2, suggesting only minor counting discrepancies. Conclusions: Under a constrained scoring workflow, LLMs partially approximated intraluminal microsurgical end-product scoring. LLM 1 outperformed LLM 2, but agreement remained insufficient to replace expert assessment entirely. These models may serve as assistive tools within a human-in-the-loop framework.
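The aggregation and agreement scheme described in the abstract (median over three runs per image for numeric items, majority vote for categorical items, then exact-match and ±1 agreement against expert consensus) can be sketched as follows. All function names and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from statistics import median


def aggregate_numeric(runs):
    """Median of the three per-image model runs (numeric items)."""
    return median(runs)


def aggregate_categorical(runs):
    """Majority vote across the three per-image model runs."""
    return Counter(runs).most_common(1)[0][0]


def exact_match_agreement(model_scores, consensus):
    """Fraction of images where model and expert consensus agree exactly."""
    hits = sum(m == c for m, c in zip(model_scores, consensus))
    return hits / len(consensus)


def within_one_agreement(model_scores, consensus):
    """Fraction of numeric items where the model is within +/-1 of consensus."""
    hits = sum(abs(m - c) <= 1 for m, c in zip(model_scores, consensus))
    return hits / len(consensus)


# Toy example: three model runs per image for one numeric item (e.g. gap count)
runs_per_image = [[2, 3, 3], [0, 1, 0], [4, 4, 5]]
consensus = [3, 0, 4]
model = [aggregate_numeric(r) for r in runs_per_image]  # -> [3, 0, 4]
print(exact_match_agreement(model, consensus))  # 1.0
print(within_one_agreement(model, consensus))   # 1.0
```

Chance-corrected statistics such as Light's kappa and Krippendorff's alpha require a fuller implementation (or a library such as `krippendorff` on PyPI) and are omitted here for brevity.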
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations