This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Detection of LLM Deceptive Behaviour Triggered by the Poisonous Context Injection: The Problem Demonstration
Citations: 0
Authors: 2
Year: 2025
Abstract
This paper presents a focused demonstration of deceptive behaviour in Large Language Models (LLMs) arising under poisonous context injection. The case study is constructed around a Japanese haiku, selected for its inherent ambiguity, which serves as a probe for LLM alignment with the human real-world model. When presented with a poisonous context, ChatGPT generated a translation, interpretation, and literary criticism that were not only incorrect but also internally inconsistent. This experiment highlights a fundamental risk: LLMs can produce outputs that are both linguistically convincing and semantically deceptive. The novelty of this work lies in framing LLM deception as a measurable phenomenon and in articulating the feasibility of automated detection through cross-verification with independent models. Its contribution is to establish the problem space by demonstrating how subtle poisoning can systematically induce deceptive generations. By formalising the problem and identifying a methodological direction, this study positions itself as an initial step in an ongoing research program on trustworthy and self-aware AI. Proof-of-concept experiments showed that a committee of five major LLMs estimates the trustworthiness of poisoned-context haiku interpretations at 0.57±0.33, while non-poisoned haiku interpretations are estimated at 0.86±0.15.
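The committee-based cross-verification described in the abstract reduces to aggregating independent trustworthiness scores into a mean and spread. A minimal sketch, assuming each of five models returns a score in [0, 1] (the score values below are illustrative, not the paper's data):

```python
from statistics import mean, stdev

def committee_trustworthiness(scores):
    """Aggregate per-model trustworthiness scores (0..1) into (mean, sample std)."""
    return mean(scores), stdev(scores)

# Hypothetical scores from a committee of five independent LLMs,
# chosen so the means match the reported 0.57 and 0.86 averages.
poisoned_scores = [0.20, 0.40, 0.60, 0.85, 0.80]
clean_scores = [0.90, 0.85, 0.70, 0.95, 0.90]

p_mean, p_std = committee_trustworthiness(poisoned_scores)
c_mean, c_std = committee_trustworthiness(clean_scores)
print(f"poisoned: {p_mean:.2f}±{p_std:.2f}")
print(f"clean:    {c_mean:.2f}±{c_std:.2f}")
```

The wide spread on poisoned interpretations (relative to clean ones) is itself a signal: disagreement across independent models flags a candidate deceptive generation for review.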
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,436 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,311 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,753 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,523 citations