OpenAlex · Updated hourly · Last updated: 12 April 2026, 12:49

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Detection of LLM Deceptive Behaviour Triggered by the Poisonous Context Injection: The Problem Demonstration

2025 · 0 citations
Open full text at the publisher

Citations: 0
Authors: 2
Year: 2025

Abstract

This paper presents a focused demonstration of deceptive behaviour in Large Language Models (LLMs) arising under poisonous context injection. The case study is built around a Japanese haiku, selected for its inherent ambiguity, which serves as a probe for LLM alignment with the human real-world model. When presented with a poisonous context, ChatGPT generated a translation, an interpretation, and literary criticism that were not only incorrect but also internally inconsistent. This experiment highlights a fundamental risk: LLMs can produce outputs that are both linguistically convincing and semantically deceptive. The novelty of this work lies in framing LLM deception as a measurable phenomenon and in articulating the feasibility of automated detection through cross-verification with independent models. Its contribution is to establish the problem space by demonstrating how subtle poisoning can systematically induce deceptive generations. By formalising the problem and identifying a methodological direction, this study positions itself as an initial step in an ongoing research programme on trustworthy and self-aware AI. Proof-of-concept experiments showed that a committee of five major LLMs estimates the trustworthiness of interpretations of the poisoned haiku at 0.57±0.33, while interpretations of the non-poisoned haiku are estimated at 0.86±0.15.
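
The committee-based cross-verification the abstract describes reduces to a simple aggregation: each independent model assigns the interpretation a trustworthiness score in [0, 1], and the committee reports the mean and standard deviation of those scores. The following minimal Python sketch shows only that aggregation step; the five example scores are hypothetical illustrations, not the paper's actual models or data, and in practice each score would come from prompting an independent LLM.

from statistics import mean, stdev

def committee_trustworthiness(scores: list[float]) -> tuple[float, float]:
    """Aggregate per-model trustworthiness scores (each in [0, 1])
    into a (mean, standard deviation) summary for the committee."""
    return mean(scores), stdev(scores)

# Illustrative scores for a five-model committee (hypothetical values,
# not the paper's data); each entry stands in for one independent LLM's
# rating of a single haiku interpretation.
poisoned_scores = [0.20, 0.45, 0.70, 0.90, 0.60]
clean_scores = [0.80, 0.95, 0.90, 0.85, 0.80]

for label, scores in (("poisoned", poisoned_scores), ("non-poisoned", clean_scores)):
    m, sd = committee_trustworthiness(scores)
    print(f"{label}: trustworthiness {m:.2f} ± {sd:.2f}")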


Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)