This is an overview page with metadata for this scientific article. The full article is available from the publisher.
STELLA: Safety Testing Engine for Large Language Assistants
Citations: 0 · Authors: 3 · Year: 2025
Abstract

Background: Assistants incorporating large language models are increasingly applied in health care, where they represent a promising means of expanding access to care. However, there is growing recognition of the risk that these chatbots may fail to respond appropriately to individuals in crisis and may adversely affect mental health in some circumstances.

Methods: We developed and implemented an automated system for assessing voice or text AI assistant responses to users across a range of health scenarios. This set of tools incorporates simulated users with specified characteristics; scenarios in which they interact with a chatbot over multiple rounds; and designs that allow multiple cohorts to be compared. Study designs, including simulated randomized trials, can be generated via natural-language prompts. Chatbot session transcripts are then quantified in terms of safety, efficacy, and user engagement according to prespecified rubrics and exemplars with an ensemble of judging language models, allowing specific exchanges to be flagged for manual review. To illustrate this approach, we assessed 10 safety scenarios in 11 frontier language model chatbots, including Claude Opus 4.5, ChatGPT-5.2, and Gemini 3, using 5 personas, each followed over 10 exchanges, with a subset assessed for an additional 5 personas.

Results: The total proportion of responses flagged for possibly harmful content ranged from 3.2% (95% CI 2.0-5.1%) for GPT 5.2 to 34.0% (95% CI 30.0-38.3%) for Grok-4.1-fast-non-reasoning. The total proportion of responses flagged for failing to provide beneficial content ranged from 19.6% (95% CI 16.4-23.3%) for GPT 5.2 to 66.0% (95% CI 61.7-70.0%) for Grok-4.1-fast-reasoning. In aggregate, the proportion of unsafe content increased across turns; for failure to provide beneficial content, by 0.7% per turn (95% CI 0.3%-1.1%).

Conclusion: A simulation-based test harness can facilitate the rapid characterization and comparison of large language model assistant performance according to standardized rubrics. Existing frontier models vary substantially on these metrics. Simulation strategies such as this one may accelerate efforts to ensure that chatbots benefit rather than harm users who apply them to mental health and well-being.
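The harness described in the Methods can be pictured as a loop: each simulated persona exchanges multiple turns with a chatbot, every reply is scored against a prespecified rubric, and flagged proportions are aggregated across transcripts. The sketch below illustrates that loop only; the `Persona`, `judge`, and keyword rubric here are hypothetical stand-ins invented for illustration, not the paper's actual models, personas, or ensemble of judging LLMs.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    scenario: str  # the situation the simulated user presents each turn

# Keyword stand-in for the paper's rubric-and-exemplar LLM judging:
# a reply is flagged "no_benefit" if it contains no beneficial phrase.
BENEFICIAL_EXEMPLARS = ("helpline", "professional help", "you are not alone")

def judge(reply: str) -> dict:
    """Score one chatbot reply; the real system uses judge LLMs."""
    beneficial = any(p in reply.lower() for p in BENEFICIAL_EXEMPLARS)
    return {"no_benefit": not beneficial}

def run_scenario(persona: Persona, chatbot, n_turns: int = 10) -> list:
    """Drive one persona through n_turns exchanges, judging each reply."""
    transcript = []
    for turn in range(n_turns):
        user_msg = f"{persona.scenario} (turn {turn})"
        reply = chatbot(user_msg)
        transcript.append({"turn": turn, "reply": reply, "flags": judge(reply)})
    return transcript

def flagged_proportion(transcripts: list, flag: str) -> float:
    """Fraction of all judged replies carrying the given flag."""
    rows = [row for t in transcripts for row in t]
    return sum(row["flags"][flag] for row in rows) / len(rows)

def toy_bot(user_msg: str) -> str:
    """Deterministic placeholder chatbot: helpful only on even turns."""
    turn = int(user_msg.split("turn ")[1].rstrip(")"))
    return "Please consider calling a helpline." if turn % 2 == 0 else "I see."

if __name__ == "__main__":
    personas = [Persona("A", "I feel hopeless"), Persona("B", "I am in crisis")]
    transcripts = [run_scenario(p, toy_bot, n_turns=10) for p in personas]
    print(flagged_proportion(transcripts, "no_benefit"))  # 0.5 for this toy bot
```

A real deployment would replace `toy_bot` with calls to the chatbot under test, replace `judge` with the ensemble of judging models, and stratify the flagged proportions by model, persona, and turn number, as in the Results.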
Related Works
Amazon's Mechanical Turk
2011 · 10,034 citations
The Transtheoretical Model of Health Behavior Change
1997 · 7,707 citations
COVID-19 and mental health: A review of the existing literature
2020 · 3,710 citations
Cognitive Therapy and the Emotional Disorders
1977 · 2,931 citations
Mental health problems and social media exposure during COVID-19 outbreak
2020 · 2,793 citations