OpenAlex · Updated hourly · Last updated: 16 Apr 2026, 14:01

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models

2025 · 0 citations · ACM Transactions on Software Engineering and Methodology · Open Access

Citations: 0 · Authors: 5 · Year: 2025

Abstract

Large Language Models (LLMs) excel at code-related tasks such as code generation, but benchmark evaluations often overlook task characteristics such as difficulty. Moreover, benchmarks are usually built from tasks described with a single prompt, even though the formulation of the prompt has a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework that uses diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse tasks across 17 and 21 topics within the two benchmarks, respectively. We also cross-analyse task characteristics with the programming constructs (e.g., variable assignments, conditions, etc.) used by LLMs, highlighting patterns related to task difficulty. Finally, we compare the difficulty assessments of tasks made by human annotators and by LLMs. Orthogonal to current benchmarking efforts, TaskEval can assist researchers and practitioners in making better assessments of LLMs. The task characteristics can be used to identify shortcomings in existing benchmarks or to improve the evaluation of LLMs.
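
To make the IRT step concrete, the sketch below shows one way that binary pass/fail results from many LLM-prompt combinations could be turned into per-task difficulty estimates using a two-parameter logistic (2PL) model fitted by joint maximum likelihood. The model variant, parameter names, fitting procedure, and the fit_2pl helper are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def fit_2pl(responses, n_iters=3000, lr=0.1):
    # responses: binary matrix of shape (n_respondents, n_tasks), where each
    # respondent is one LLM-prompt combination and each column is one task.
    n_resp, n_tasks = responses.shape
    theta = np.zeros(n_resp)   # ability of each LLM-prompt combination
    b = np.zeros(n_tasks)      # task difficulty (higher = harder)
    a = np.ones(n_tasks)       # task discrimination

    for _ in range(n_iters):
        z = a * (theta[:, None] - b)        # logits, shape (n_resp, n_tasks)
        p = 1.0 / (1.0 + np.exp(-z))        # predicted pass probabilities
        err = responses - p                 # d(log-likelihood) / d(logit)

        g_theta = (err * a).mean(axis=1)
        g_b = -(err * a).mean(axis=0)
        g_a = (err * (theta[:, None] - b)).mean(axis=0)

        theta += lr * g_theta
        b += lr * g_b
        a += lr * g_a

        # Crude identification constraint: anchor abilities to mean 0, sd 1.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)

    return theta, b, a

# Toy usage with simulated data: 8 LLMs x 10 prompts = 80 respondents, 30 tasks.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=80)
true_b = rng.normal(size=30)
pass_prob = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b)))
X = (rng.random((80, 30)) < pass_prob).astype(float)

theta_hat, difficulty, discrimination = fit_2pl(X)
print("Estimated hardest task:", int(np.argmax(difficulty)))

In a setting like the one the abstract describes, the response matrix would come from running each LLM on multiple prompt formulations of every benchmark task and recording functional correctness; the recovered difficulty parameters then characterise the tasks independently of any single prompt.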
