This is an overview page with metadata for this scientific work. The full article is available from the publisher.
A Behavioural Evaluation Framework for AI Judgement Systems
Citations: 0 · Authors: 1 · Year: 2026
Abstract
Beyond the Average Research Series – Working Paper Description

This working paper introduces a conceptual framework for evaluating the behavioural reliability of AI judgement systems. The framework emerged from the Agents at Work research series (Hull, 2025–2026), which examined how large language models interpret age-coded language in recruitment text and how stable those judgements remain when evaluation tasks are repeated.

Large language models are increasingly used to perform evaluative or judgement-based tasks, including classification, moderation, and analytical assessment. In such contexts, reliability cannot be assessed through single outputs alone, as language models may produce varying interpretations across repeated executions of the same task. This paper proposes a behavioural evaluation framework for examining how AI judgement systems behave under repeated evaluation. The framework rests on three complementary analytical perspectives: repeated execution of the same evaluative task, observation of internal signals such as confidence or agreement indicators, and independent comparison across multiple AI models. Together, these perspectives allow researchers to observe patterns of behavioural stability, convergence, drift, or fragmentation in AI judgement processes. Rather than focusing solely on output accuracy, the framework emphasises behavioural observation as a means of understanding how AI systems interpret complex text and how consistently those interpretations are maintained. The proposed structure is intended to support more systematic analysis of reliability in AI judgement systems and to provide a conceptual foundation for future empirical evaluation studies.

Note: This paper is released as a working paper to present the conceptual framework. Future work will extend the framework through larger-scale empirical experiments and behavioural stress testing of AI judgement systems.
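The first analytical perspective, repeated execution of the same evaluative task, can be illustrated with a minimal sketch. The helper names (`agreement_rate`, `evaluate_stability`, `toy_judge`) are hypothetical and not taken from the paper; a real study would replace the deterministic stand-in judge with calls to a language model API and would likely use richer stability statistics than modal agreement.

```python
from collections import Counter

def agreement_rate(judgements):
    """Fraction of repeated judgements that match the modal (most common) label.

    1.0 means perfectly stable behaviour; values near 1/k (for k labels)
    indicate fragmentation across repeated runs.
    """
    counts = Counter(judgements)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(judgements)

def evaluate_stability(judge, item, runs=10):
    """Execute the same evaluative task repeatedly and measure stability."""
    outputs = [judge(item) for _ in range(runs)]
    return agreement_rate(outputs), outputs

# Hypothetical deterministic stand-in for an LLM judge classifying
# recruitment text; a real experiment would call a model here.
def toy_judge(text):
    return "age-coded" if "digital native" in text else "neutral"

rate, outputs = evaluate_stability(toy_judge, "seeking a digital native team player")
print(rate)  # 1.0 for this deterministic stand-in
```

The same harness extends naturally to the third perspective, independent system comparison, by running several `judge` functions (one per model) over the same items and computing agreement across systems rather than across repetitions.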
Related Works
The global landscape of AI ethics guidelines
2019 · 4,782 citations
The Limitations of Deep Learning in Adversarial Settings
2016 · 3,893 citations
Trust in Automation: Designing for Appropriate Reliance
2004 · 3,541 citations
Fairness through awareness
2012 · 3,311 citations
AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations
2018 · 3,255 citations