This is an overview page with metadata for this scientific article. The full article is available from the publisher.
STELLA: Safety Testing Engine for Large Language Assistants
Citations: 0 · Authors: 3 · Year: 2025
Abstract

Background: Assistants incorporating large language models are increasingly applied in health care, where they represent a promising means of expanding access to care. However, there is growing recognition of the risk that these chatbots may fail to respond appropriately to individuals in crisis and may adversely affect mental health in some circumstances.

Methods: We developed and implemented an automated system for assessing voice or text AI assistant responses to users across a range of health scenarios. This set of tools incorporates simulated users with specified characteristics; scenarios in which they interact with a chatbot over multiple rounds; and designs that allow multiple cohorts to be compared. Study designs, including simulated randomized trials, can be generated via natural-language prompts. Chatbot session transcripts are then quantified in terms of safety, efficacy, and user engagement according to prespecified rubrics and exemplars with an ensemble of judging language models, allowing specific exchanges to be flagged for manual review. To illustrate this approach, we assessed 10 safety scenarios in 11 frontier language model chatbots, including Claude Opus 4.5, ChatGPT-5.2, and Gemini 3, using 5 personas, each followed over 10 exchanges, with a subset assessed for an additional 5 personas.

Results: The total proportion of responses flagged for possibly harmful content ranged from 3.2% (95% CI 2.0-5.1%) for GPT 5.2 to 34.0% (95% CI 30.0-38.3%) for Grok-4.1-fast-non-reasoning. The total proportion of responses flagged for failing to provide beneficial content ranged from 19.6% (95% CI 16.4-23.3%) for GPT 5.2 to 66.0% (95% CI 61.7-70.0%) for Grok-4.1-fast-reasoning. In aggregate, the proportion of unsafe content increased across turns; for failure to provide beneficial content, by 0.7% per turn (95% CI 0.3%-1.1%).

Conclusion: A simulation-based test harness can facilitate the rapid characterization and comparison of large language model assistant performance according to standardized rubrics. Existing frontier models vary substantially on these metrics. Simulation strategies such as this one may accelerate efforts to ensure that chatbots benefit rather than harm users who apply them to mental health and well-being.
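The harness described in the Methods can be pictured as a loop: each simulated persona exchanges multiple turns with a chatbot, every reply is scored against a prespecified rubric, and flagged proportions are aggregated across transcripts. The sketch below illustrates that loop only; the `Persona`, `judge`, and keyword rubric here are hypothetical stand-ins invented for illustration, not the paper's actual models, personas, or ensemble of judging LLMs.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    scenario: str  # the situation the simulated user presents each turn

# Keyword stand-in for the paper's rubric-and-exemplar LLM judging:
# a reply is flagged "no_benefit" if it contains no beneficial phrase.
BENEFICIAL_EXEMPLARS = ("helpline", "professional help", "you are not alone")

def judge(reply: str) -> dict:
    """Score one chatbot reply; the real system uses judge LLMs."""
    beneficial = any(p in reply.lower() for p in BENEFICIAL_EXEMPLARS)
    return {"no_benefit": not beneficial}

def run_scenario(persona: Persona, chatbot, n_turns: int = 10) -> list:
    """Drive one persona through n_turns exchanges, judging each reply."""
    transcript = []
    for turn in range(n_turns):
        user_msg = f"{persona.scenario} (turn {turn})"
        reply = chatbot(user_msg)
        transcript.append({"turn": turn, "reply": reply, "flags": judge(reply)})
    return transcript

def flagged_proportion(transcripts: list, flag: str) -> float:
    """Fraction of all judged replies carrying the given flag."""
    rows = [row for t in transcripts for row in t]
    return sum(row["flags"][flag] for row in rows) / len(rows)

def toy_bot(user_msg: str) -> str:
    """Deterministic placeholder chatbot: helpful only on even turns."""
    turn = int(user_msg.split("turn ")[1].rstrip(")"))
    return "Please consider calling a helpline." if turn % 2 == 0 else "I see."

if __name__ == "__main__":
    personas = [Persona("A", "I feel hopeless"), Persona("B", "I am in crisis")]
    transcripts = [run_scenario(p, toy_bot, n_turns=10) for p in personas]
    print(flagged_proportion(transcripts, "no_benefit"))  # 0.5 for this toy bot
```

A real deployment would replace `toy_bot` with calls to the chatbot under test, replace `judge` with the ensemble of judging models, and stratify the flagged proportions by model, persona, and turn number, as in the Results.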
Related Works
Amazon's Mechanical Turk
2011 · 10,034 citations
The Transtheoretical Model of Health Behavior Change
1997 · 7,707 citations
COVID-19 and mental health: A review of the existing literature
2020 · 3,710 citations
Cognitive Therapy and the Emotional Disorders
1977 · 2,931 citations
Mental health problems and social media exposure during COVID-19 outbreak
2020 · 2,793 citations