This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Training large language models on narrow tasks can lead to broad misalignment
Citations: 9
Authors: 9
Year: 2026
Abstract
[...] For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.
Similar works
Rethinking the Inception Architecture for Computer Vision
2016 · 30,692 citations
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24,980 citations
CBAM: Convolutional Block Attention Module
2018 · 21,794 citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21,499 citations
Xception: Deep Learning with Depthwise Separable Convolutions
2017 · 18,701 citations