OpenAlex · Updated hourly · Last updated: 23.04.2026, 00:58

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking Knowledge and Capability of Large Language Models in Building Science Domain

2025 · 0 citations · Energy Use · Open Access
Open full text at the publisher

0

Citations

6

Authors

2025

Year

Abstract

<p>Large language models (LLMs) are increasingly adopted across scientific and engineering fields. However, applying general-purpose LLMs to specialized engineering domains imposes stringent requirements for structured knowledge, rigorous reasoning, and technical precision, so the suitability of current general-purpose LLMs for practical engineering applications remains questionable. To gauge the mastery of LLMs in building science as one broad but specific engineering domain, this paper presents a comprehensive benchmark analysis (with a benchmark dataset of 1,487 questions) evaluating the abilities of 15 state-of-the-art (SOTA) LLMs across 12 core subject topics in the building science domain. To enable scalable and robust evaluation, we propose and validate an AI-Judger for assessment across five dimensions of ability, including knowledge and concepts, logic and consistency, clarity of expression, and reflection and exploratory thinking. Overall, SOTA general-purpose LLMs achieve only ~50% accuracy on average across question types. Their capabilities decrease progressively from linguistic expression and factual knowledge to logical reasoning, and then to reflection and exploratory thinking. Across task types, LLMs exhibit notably low accuracy on calculation (~13%), short-answer (~23%), and cloze (~30%) tasks, in contrast to stronger performance on single-choice (74%) and multiple-choice (63%) questions. Finally, LLM performance varies markedly across topics, with relatively low accuracy on physics fundamentals and HVAC&R-related questions (median of 20%-40%) compared to ~80% on building standards and codes. These gaps highlight the limitations of general-purpose LLMs in engineering contexts and point clearly to the need for domain-specific LLMs tailored to engineering applications.</p>
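The per-task accuracy figures in the abstract are simple ratios of correct answers to attempts within each question type. A minimal sketch of that aggregation step is below; the function name, task labels, and toy data are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
from collections import defaultdict

def accuracy_by_task(results):
    """Aggregate graded answers into per-task accuracy.

    `results` is a list of (task_type, is_correct) pairs; the task
    labels mirror the question formats named in the abstract.
    """
    totals = defaultdict(lambda: [0, 0])  # task -> [correct, attempted]
    for task, correct in results:
        totals[task][0] += int(correct)
        totals[task][1] += 1
    return {task: c / n for task, (c, n) in totals.items()}

# Toy grades for illustration only (not the paper's data):
sample = [
    ("single-choice", True), ("single-choice", True), ("single-choice", False),
    ("calculation", False), ("calculation", False), ("calculation", True),
]
print(accuracy_by_task(sample))
```

The same tally, grouped by subject topic instead of task type, would yield the per-topic medians the abstract reports.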

Similar works

Topics

Topic Modeling · Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education