OpenAlex · Updated hourly · Last updated: 10.05.2026, 07:26

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A Behavioural Evaluation Framework for AI Judgement Systems

2026 · 0 citations · Zenodo (CERN European Organization for Nuclear Research) · Open Access
Open full text at the publisher

Citations: 0 · Authors: 1 · Year: 2026

Abstract

Beyond the Average Research Series – Working Paper

Description: This working paper introduces a conceptual framework for evaluating the behavioural reliability of AI judgement systems. The framework emerged from the Agents at Work research series (Hull, 2025–2026), which examined how large language models interpret age-coded language in recruitment text and how stable those judgements remain when evaluation tasks are repeated.

Large language models are increasingly used to perform evaluative or judgement-based tasks, including classification, moderation, and analytical assessment. In such contexts, reliability cannot be assessed solely through single outputs, as language models may produce varying interpretations across repeated executions of the same task. This paper proposes a behavioural evaluation framework for examining how AI judgement systems behave under repeated evaluation. The framework focuses on three complementary analytical perspectives: repeated execution of the same evaluative task, observation of internal signals such as confidence or agreement indicators, and independent system comparison across multiple AI models. Together, these perspectives allow researchers to observe patterns of behavioural stability, convergence, drift, or fragmentation in AI judgement processes. Rather than focusing solely on output accuracy, the framework emphasises behavioural observation as a means of understanding how AI systems interpret complex text and how consistently those interpretations are maintained. The proposed structure is intended to support more systematic analysis of reliability in AI judgement systems and to provide a conceptual foundation for future empirical evaluation studies.

Note: This paper is released as a working paper to present the conceptual framework. Future work will extend the framework through larger-scale empirical experiments and behavioural stress testing of AI judgement systems.
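Two of the three perspectives described above lend themselves to simple quantitative summaries. The sketch below is a minimal illustration, not the paper's method: `repeat_stability` measures agreement with the modal label across repeated runs of the same task (perspective 1), and `pairwise_agreement` measures agreement across independent systems (perspective 3). The judge labels and the label values (`"age-coded"`, `"neutral"`) are hypothetical; the second perspective (internal confidence signals) would attach a per-run confidence score to each label, which is omitted here.

```python
from collections import Counter
from itertools import combinations

def repeat_stability(labels):
    """Fraction of repeated runs that agree with the modal label.

    1.0 means fully stable judgements; values near 1/len(set(labels))
    indicate fragmentation across repeated executions.
    """
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def pairwise_agreement(system_labels):
    """Mean pairwise agreement between independent judge systems.

    Each element is one system's label for the same input text.
    """
    pairs = list(combinations(system_labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Illustrative: five repeated runs of one hypothetical judge on one text
runs = ["age-coded", "age-coded", "neutral", "age-coded", "age-coded"]
print(repeat_stability(runs))  # 0.8

# Illustrative: one label each from three hypothetical judge systems
systems = ["age-coded", "age-coded", "neutral"]
print(pairwise_agreement(systems))  # 1 of 3 pairs agree -> 0.333...
```

In practice both metrics would be aggregated over a corpus of evaluation texts, so that stability and cross-system convergence can be compared per item and overall.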


Topics

Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education