This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines
Citations: 12
Authors: 5
Year: 2025
Abstract
OBJECTIVES: Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results. This exploratory study aimed to determine how accurately a large language model generative artificial intelligence system (AI-LLM) could assess reporting guideline compliance in a sample of sports medicine clinical trial reports.
DESIGN: This study was an exploratory retrospective data analysis. The OpenAI GPT-4 and Meta Llama 2 AI-LLMs were evaluated for their ability to determine reporting guideline adherence in a sample of sports medicine and exercise science clinical trial reports.
SETTING: Academic research institution.
PARTICIPANTS: The study sample included 113 published sports medicine and exercise science clinical trial papers. For each paper, the GPT-4 Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4 Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into TRAIN and TEST datasets. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama 2 model was fine-tuned using the data from the GPT-4 Turbo analysis of the TRAIN dataset.
PRIMARY AND SECONDARY OUTCOME MEASURES: The primary outcome was the F1-score, a measure of model performance, on the TEST dataset. The secondary outcome was the model's classification accuracy (%).
RESULTS: Across all questions about the article text, the GPT-4 Turbo AI-LLM demonstrated acceptable performance (F1-score=0.89, accuracy (95% CI) = 90% (85% to 94%)). Accuracy for all reporting guidelines was >80%. The Llama 2 model accuracy was initially poor (F1-score=0.63, accuracy (95% CI) = 64% (57% to 71%)) and improved with fine-tuning (F1-score=0.84, accuracy (95% CI) = 83% (77% to 88%)). The GPT-4 Vision model accurately identified all participant flow diagrams (accuracy (95% CI) = 100% (89% to 100%)) but was less accurate at identifying when details were missing from the flow diagram (accuracy (95% CI) = 57% (39% to 73%)).
CONCLUSIONS: Both the GPT-4 and fine-tuned Llama 2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.
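The evaluation quantities named in the abstract (a random 80/20 split, the F1-score, and accuracy with a 95% CI) can be illustrated with a minimal Python sketch. The abstract does not say how F1 was averaged or which interval method was used, so the binary F1 and the Wilson score interval below are assumptions, as are all function names, the rounding rule for the split, and the fixed seed. This is not the authors' code.

```python
import math
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (train, test).

    Assumption: the test size is rounded to the nearest integer.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def f1_and_accuracy(y_true, y_pred):
    """Binary F1-score and accuracy for boolean labels (True = adherent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    correct = sum(1 for t, p in pairs if bool(t) == bool(p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, correct / len(pairs)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

In use, one would score a model's answers against human ratings on the TEST portion only, for example `f1, acc = f1_and_accuracy(human_labels, model_labels)` followed by `wilson_interval(n_correct, n_total)`, where `human_labels`, `model_labels`, `n_correct`, and `n_total` are hypothetical names for the reference ratings, the model outputs, and the counts derived from them.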