This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking LLM Agent Efficiency in Production Systems: An Observational Prospective Methodology
Citations: 0
Authors: 2
Year: 2026
Abstract
Existing large language model (LLM) benchmarks measure model capability on synthetic tasks, but none address the operational efficiency of multi-agent LLM systems executing real production work over sustained sessions. This paper introduces an observational prospective methodology for benchmarking production LLM agent efficiency and applies it to a complete, instrumented production session (2026-04-03, 4.1 hours, Gallora ecosystem). We report the first end-to-end token accounting of a multi-agent session: 64,853,375 effective tokens processed at a 94.2% cache hit rate, producing 19 artifacts at a cost of $36.74 USD ($1.93 per artifact). We propose a standardized suite of six operational metrics — Cache Hit Rate (CHR), Output Density (OD), Agent Cost Multiplier (ACM), Cost Per Artifact (CPA), Tool Execution Ratio (TER), and Turns Per Hour (TPH) — as a reproducible benchmark framework for production agentic systems. Key finding: in production multi-agent systems, cost is dominated by context complexity (93.7% cache reads), not task complexity — a result with significant architectural implications for system design and cost governance.
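The abstract reports per-session totals alongside derived metrics such as Cost Per Artifact. As a minimal sketch, the formulas below are assumptions inferred from the metric names (the paper's exact definitions may differ), applied to the reported session figures:

```python
def cost_per_artifact(total_cost_usd: float, artifacts: int) -> float:
    """Cost Per Artifact (CPA): total session cost divided by artifact count."""
    return total_cost_usd / artifacts

def cache_hit_rate(cached_tokens: int, total_tokens: int) -> float:
    """Cache Hit Rate (CHR): fraction of processed tokens served from cache."""
    return cached_tokens / total_tokens

# Reported session figures (2026-04-03, 4.1 hours, Gallora ecosystem):
total_cost = 36.74   # USD
artifacts = 19       # artifacts produced

print(f"CPA = ${cost_per_artifact(total_cost, artifacts):.2f}")  # → CPA = $1.93
```

Under these assumed definitions, the reported $1.93 per artifact follows directly from the $36.74 session total and 19 artifacts.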
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,561 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,452 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,948 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations