This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A systematic approach to quality scoring of AI-generated legal texts
Citations: 0 · Authors: 4 · Year: 2026
Abstract
The use of generative artificial intelligence (GenAI) and AI-generated content is reshaping work routines across multiple professional domains. While GenAI can improve business process efficiency in routine legal tasks, it remains unclear whether similar efficiency can be achieved for higher-order legal tasks, such as analyzing laws. Hence, this study evaluates the legal analysis capabilities of twelve large language models (LLMs) across three use cases. For each use case, the LLMs were required to answer six judicial questions based on a provided law. The 216 generated answers were then evaluated against a legal reference answer using LLM-as-a-Judge across all twelve models, resulting in 2,592 judgments. In addition, two authors independently assessed the same 216 AI-generated answers as Human-as-a-Judge, resulting in 432 assessments. We evaluated both the accuracy of the LLM results and the agreement between the LLM judges and the human judges. Our results show that LLMs are able to generate structured legal content that addresses the six questions. Furthermore, GPT-5-High achieved the highest average score from the LLM judges on four questions, whereas Grok 4 Fast performed strongest in law summarization and GLM-4.6 in organizational effect. However, Qwen3-Max achieved the highest overall score from the human judges. The more critical assessments by the human judges are also reflected in the Cohen's Kappa results, which indicate disagreement between LLM and human judges. This underlines the need for responsible AI practices in legal contexts. We show that the reliability of GenAI-based legal evaluation varies and does not consistently align with human judgment.
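The inter-rater agreement statistic named in the abstract, Cohen's Kappa, corrects observed agreement for agreement expected by chance. A minimal sketch of the standard two-rater computation follows; the rating labels and example data are hypothetical illustrations, not values from the study:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who assign categorical labels to the same set of items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-answer quality labels from an LLM judge and a human judge
llm_judge   = ["good", "good", "bad", "good", "bad", "bad"]
human_judge = ["good", "bad",  "bad", "good", "good", "bad"]
print(round(cohens_kappa(llm_judge, human_judge), 3))  # 0.333
```

A kappa near 0 means agreement is barely better than chance, while values near 1 indicate strong agreement, which is why kappa, rather than raw percent agreement, is used to compare LLM judges against human judges.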