This is an overview page with metadata for this scientific work. The full article is available from the publisher.
When Judgement Does Not Stay the Same
0
Citations
1
Author
2026
Year
Beyond the Average Research Series – Working Paper

Description
This working paper examines judgement stability in AI systems under repeated evaluation. It builds on the Behavioural Evaluation Framework (Hull, 2026), extending the conceptual approach through empirical observation. The analysis draws on the Phase 4 behavioural evaluation study within the Agents at Work research series (Hull, 2025–2026), which examined how large language models interpret age-coded language in recruitment text and how those judgements behave when the same evaluative task is repeated. The paper focuses on how classification outcomes vary under identical conditions, with particular attention to the structure and distribution of variation across repeated executions.

Abstract
Large language models are increasingly used to perform evaluative or judgement-based tasks, including classification, moderation, and analytical assessment. While existing approaches to evaluation often focus on individual outputs, such observations provide limited insight into how systems behave when the same task is repeated. Building on the Behavioural Evaluation Framework, this paper examines judgement stability under repeated execution. Using a series of repeated evaluations of recruitment text, the analysis explores how classification outcomes vary under identical conditions. The findings indicate that variation is not random but concentrated at decision boundaries, particularly between adjacent categories such as “Potentially Biased” and “Unclear”. In these cases, the system often identifies similar cues and produces comparable reasoning, while the final classification varies. These observations suggest that instability in AI judgement reflects structured sensitivity to interpretation rather than isolated error. The paper argues that reliability in AI judgement systems is better understood through patterns of behaviour across repeated evaluations than through the inspection of individual outputs.
Note
This paper is released as a working paper to present empirical findings on judgement stability within the Behavioural Evaluation Framework. It extends earlier conceptual work by examining how variation emerges under repeated execution. Future work will explore additional behavioural properties of AI judgement systems, including confidence behaviour, explanation stability, and sensitivity to input variation, as part of the ongoing Agents at Work research series.
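The repeated-evaluation approach described above can be illustrated with a minimal sketch. The code below is not the paper's actual methodology: `classify` is a hypothetical stand-in for an LLM judgement call (here a noisy score over invented age-coded cue words), and `stability_profile` simply re-runs the same input and tallies how often each label occurs, which is enough to show variation concentrating near a decision boundary between adjacent categories.

```python
import random
from collections import Counter

LABELS = ["Clearly Biased", "Potentially Biased", "Unclear", "Not Biased"]

def classify(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM judgement call.

    Counts invented age-coded cue words and adds small noise to the score,
    so inputs whose score lands near a threshold flip between adjacent
    labels across repeated runs.
    """
    cues = sum(text.lower().count(w) for w in ("young", "energetic", "digital native"))
    score = cues + rng.gauss(0, 0.4)  # noisy judgement around the cue count
    if score >= 2.5:
        return "Clearly Biased"
    if score >= 1.5:
        return "Potentially Biased"
    if score >= 0.5:
        return "Unclear"
    return "Not Biased"

def stability_profile(text: str, runs: int = 100, seed: int = 0) -> dict:
    """Repeat the same evaluation and summarise how the outcomes vary."""
    rng = random.Random(seed)
    counts = Counter(classify(text, rng) for _ in range(runs))
    modal_label, modal_n = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "modal_label": modal_label,
        "agreement": modal_n / runs,  # share of runs matching the modal label
    }

# A text whose cue score sits between two thresholds shows lower agreement
# than a text far from any boundary.
boundary = stability_profile("We want a young, energetic hire")
clear = stability_profile("Seeking an experienced accountant")
```

In this toy setup, reliability is read off the `agreement` rate and the shape of `counts` across runs, rather than from any single classification, mirroring the paper's argument that behaviour over repeated evaluations is the more informative unit of analysis.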
Related Works
The global landscape of AI ethics guidelines
2019 · 4,726 citations
The Limitations of Deep Learning in Adversarial Settings
2016 · 3,886 citations
Trust in Automation: Designing for Appropriate Reliance
2004 · 3,513 citations
Fairness through awareness
2012 · 3,302 citations
AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations
2018 · 3,203 citations