This is an overview page with metadata for this scientific article. The full text is available from the publisher.
Evaluation of Prompt Design and Internal Reasoning in Chatbot-Based Medical History Taking (Preprint)
Citations: 0
Authors: 6
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> A persistent discrepancy exists between patient-reported information and physician documentation. While conversational agents have been developed to collect medical histories prior to consultation, existing evaluations have largely focused on diagnostic accuracy or user satisfaction rather than the completeness and clinical usefulness of the information collected. There remains a need to assess the extent of clinically relevant information captured through chatbot-based interviews and to understand how model configurations and instructional strategies influence this coverage. </sec>
<sec> <title>OBJECTIVE</title> This study aimed to evaluate the extent to which a chatbot can obtain clinically useful patient history information and to examine how prompt detail and internal reasoning influence information coverage during chatbot-based medical interviews. </sec>
<sec> <title>METHODS</title> We developed a medical history-taking chatbot using the Qwen3-14B-Instruct model and evaluated four configurations in a 2×2 factorial design: Detailed/Thinking (DT), Detailed/Non-thinking (DN), Minimal/Thinking (MT), and Minimal/Non-thinking (MN). These configurations were compared against a rule-based system baseline (choice-based mode) using 66 standardized primary care clinical cases, with simulated patients interacting with the chatbot according to predefined case scripts. Information coverage (%) was assessed using a checklist inspired by Objective Structured Clinical Examination (OSCE) frameworks. Three physicians independently evaluated transcript coverage, with inter-rater agreement assessed using full agreement rates and Fleiss' κ. Coverage percentages were compared across configurations using repeated-measures analysis of variance with post hoc testing. </sec>
<sec> <title>RESULTS</title> Inter-rater agreement was substantial (Fleiss' κ = 0.75). Across all 66 simulated cases, information coverage differed significantly among configurations (p < .001), with the Detailed/Thinking (DT) mode achieving the highest mean coverage (72.3%), compared with moderate coverage in configurations using either thinking or detailed prompts alone (approximately 60%) and lower coverage in the Minimal/Non-thinking and rule-based configurations (approximately 51%-54%). Differences were most pronounced for the past medical and family history domains. Symptom-level analyses revealed substantial variability, with higher coverage for symptoms associated with well-defined diagnostic frameworks and lower coverage for multi-system presentations. </sec>
<sec> <title>CONCLUSIONS</title> The combination of clinically detailed prompt instructions and internal reasoning significantly enhances the clinical usefulness of AI-driven history-taking by ensuring more comprehensive data collection. This approach allows for a more systematic and robust foundation for automated clinical documentation, facilitating better integration into healthcare workflows. </sec>
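The METHODS section reports inter-rater agreement as Fleiss' κ, which generalizes Cohen's κ to three or more raters over a subjects × categories count matrix. The sketch below shows the standard computation; the rating matrices are illustrative examples, not the study's data, and the study may have used a library routine rather than this hand-rolled version.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a subjects x categories count matrix.

    ratings[i][j] = number of raters who assigned subject i to
    category j; every row must sum to the same rater count n.
    """
    N = len(ratings)            # number of subjects (checklist items)
    n = sum(ratings[0])         # raters per subject (e.g., 3 physicians)
    k = len(ratings[0])         # number of categories

    # Mean observed per-subject agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)


# Illustrative only: 3 raters, binary "covered / not covered" judgments
matrix = [[3, 0], [2, 1], [0, 3], [3, 0], [1, 2]]
print(round(fleiss_kappa(matrix), 2))  # ≈ 0.44
```

Values above roughly 0.6 are conventionally read as "substantial" agreement, which is the band the reported κ = 0.75 falls into.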
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,626 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,532 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,046 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,843 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations