OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 13.04.2026, 10:49

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures

2026·0 Zitationen·npj Digital MedicineOpen Access
Volltext beim Verlag öffnen

0

Zitationen

10

Autoren

2026

Jahr

Abstract

Exploring large language models (LLMs) performance in the specific medical domain can help understand their generalizability in real-world application. We assessed the predictive and decision-support value of two state-of-the-art LLMs in predicting bone cement leakage (BCL) and new vertebral fractures (NVF) after percutaneous kyphoplasty (PKP) and to compare them with those of traditional machine learning (TML) and spine surgeon. This study utilized combined retrospective and prospective data at a single tertiary hospital. Two LLMs (GPT-5 and DeepSeek R1) with zero- and few-shot strategy, five TML models, and two spine surgeons with/without exposure to LLM responses, were asked to predict complications based on demographic, perioperative baseline, and radiographic data. We also tested LLMs' ability to predict complication subtype. For BCL prediction, both LLMs demonstrated acceptable performance (F1-score, 0.857-0.871; MCC, 0.164-0.332) under zero-shot conditions, comparable to TML models (F1-score, 0.758-0.867; MCC, 0.265-0.416), and slightly superior to surgeons alone (F1-score, 0.675-0.684; MCC, 0.074-0.185). Few-shot prompting enhanced specificity but yielded uncertain overall gains. For NVF prediction, the zero-shot LLM performance was poor (F1-score, 0.309; MCC, 0.044) but improved with few-shot learning. The RBF-SVM model showed the best performance for NVF prediction (F1-score, 0.536; MCC, 0.414). LLM explanations enhanced surgeon performance in BCL prediction but not in NVF. LLMs showed poor prediction of complication subtypes. The findings suggest that current LLMs hold diverse predictive performances for different complications after PKP, they are still immature for real clinical applicability and need further improvement.

Ähnliche Arbeiten