Digital repository of Slovenian research organisations


Title: User-defined trade-offs in LLM benchmarking: balancing accuracy, scale, and sustainability
Authors: Gjorgjevikj, Ana, Institut "Jožef Stefan" (Author)
Nikolikj, Ana, Institut "Jožef Stefan" (Author)
Koroušić-Seljak, Barbara, Institut "Jožef Stefan" (Author)
Eftimov, Tome, Institut "Jožef Stefan" (Author)
Files: Source URL: https://www.sciencedirect.com/science/article/pii/S0950705125014443
PDF - Presentation file, download (33.40 MB)
MD5: 1A7A62F9E1CE24D517D98A3072D7E171
 
Language: English
Typology: 1.01 - Original Scientific Article
Organization: IJS - Jožef Stefan Institute
Abstract: This paper presents xLLMBench, a transparent, decision-centric benchmarking framework that empowers decision-makers to rank large language models (LLMs) based on their preferences across diverse, potentially conflicting performance and non-performance criteria, e.g., domain accuracy, model size, energy consumption, and CO₂ emissions. Existing LLM benchmarking methods often rely on individual performance criteria (metrics) or human feedback, so methods that systematically combine multiple criteria into a single interpretable ranking are lacking. Methods that consider human preferences typically rely on direct human feedback to determine rankings, which can be resource-intensive and may not be fully aligned with application-specific requirements. Motivated by these limitations of LLM benchmarking, xLLMBench leverages multi-criteria decision-making methods to give decision-makers the flexibility to tailor the benchmarking process to their requirements. It focuses on the final step of the benchmarking process (robust analysis of benchmarking results), which, in the case of LLMs, often involves ranking them. The framework assumes that the selection of datasets, metrics, and LLMs involved in the experiment is conducted following established best practices. We demonstrate xLLMBench's usefulness in two scenarios: combining LLM results for one metric across different datasets, and combining results for multiple metrics within one dataset. Our results show that while some LLMs maintain stable rankings, others exhibit significant changes when correlated datasets are removed, or when the focus shifts to contamination-free datasets or fairness metrics. This highlights that LLMs have distinct strengths and weaknesses that go beyond overall performance. Our sensitivity analysis reveals robust rankings, while diverse visualizations enhance transparency. xLLMBench can be used with existing platforms to support transparent, reproducible, and contextually meaningful LLM benchmarking.
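The abstract describes combining conflicting criteria (accuracy to maximize; size and energy to minimize) into one interpretable, preference-weighted ranking. The sketch below is not the xLLMBench implementation (the record does not specify its internals); it is a generic weighted-sum multi-criteria ranking with hypothetical model names and numbers, only to illustrate how user-defined weights can flip a ranking:

```python
# Illustrative multi-criteria ranking of LLMs (hypothetical data).
# NOT the xLLMBench implementation -- a minimal weighted-sum MCDM sketch.

def rank_models(scores, weights, maximize):
    """Rank alternatives by a weighted sum of min-max normalised criteria.

    scores:   {model: [criterion values]}
    weights:  per-criterion importance (should sum to 1)
    maximize: per-criterion flag; False means lower is better (e.g. energy)
    """
    names = list(scores)
    cols = list(zip(*scores.values()))  # values of each criterion across models
    totals = {}
    for m in names:
        total = 0.0
        for j, v in enumerate(scores[m]):
            lo, hi = min(cols[j]), max(cols[j])
            s = 0.5 if hi == lo else (v - lo) / (hi - lo)  # min-max normalise
            if not maximize[j]:
                s = 1.0 - s  # invert cost criteria so higher is always better
            total += weights[j] * s
        totals[m] = total
    return sorted(names, key=lambda m: totals[m], reverse=True)

# Hypothetical criteria: accuracy (max), parameters in B (min), energy in kWh (min)
scores = {
    "model_a": [0.82, 70, 12.0],
    "model_b": [0.79, 7, 1.5],
    "model_c": [0.85, 180, 30.0],
}
# Accuracy-leaning preferences:
ranking = rank_models(scores, [0.45, 0.3, 0.25], [True, False, False])
# Sustainability-leaning preferences:
ranking2 = rank_models(scores, [0.2, 0.4, 0.4], [True, False, False])
print(ranking)   # -> ['model_a', 'model_b', 'model_c']
print(ranking2)  # -> ['model_b', 'model_a', 'model_c']
```

Shifting weight from accuracy toward size and energy moves the small, efficient model to the top, which mirrors the abstract's point that rankings depend on the decision-maker's trade-offs rather than on a single overall score.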
Keywords: large language models, benchmarking, multi-criteria decision-making
Publication status: Published
Publication version: Version of Record
Submitted for review: 27.05.2025
Article acceptance date: 01.09.2025
Publication date: 10.09.2025
Publisher: Elsevier
Year of publishing: 2025
Number of pages: pp. 1-30
Numbering: Vol. 330, pt. A, [article no.] 114405
Source: Netherlands
PID: 20.500.12556/DiRROS-23765
UDC: 004.8
ISSN on article: 1872-7409
DOI: 10.1016/j.knosys.2025.114405
COBISS.SI-ID: 251254531
Copyright: © 2025 The Author(s).
Note: Title from the title screen; co-authors: Ana Nikolikj, Barbara Koroušić Seljak, Tome Eftimov; source described on 1 Sep 2025.
Publication date in DiRROS: 01.10.2025
Views: 328
Downloads: 143



This document is financed by the following projects

Funder: ARIS - Slovenian Research and Innovation Agency
Project number: P2-0098
Name: Računalniške strukture in sistemi (Computer Structures and Systems)

Funder: ARIS - Slovenian Research and Innovation Agency
Project number: GC-0001
Name: Umetna inteligenca za znanost (Artificial Intelligence for Science)

Funder: ARIS - Slovenian Research and Innovation Agency
Funding programme: Young Researchers Grant
Project number: PR-12897

Funder: EC - European Commission
Funding programme: HE
Project number: 101211695
Name: AutoLLMSelect: Framework for Robust and Explainable Automated Large Language Model Selection
Acronym: AutoLLMSelect

Funder: EC - European Commission
Funding programme: HE
Project number: 101187010
Name: Leveraging Benchmarking Data for Automated Machine Learning and Optimization
Acronym: AutoLearn-SI

Licences

License: CC BY 4.0, Creative Commons Attribution 4.0 International
Link: http://creativecommons.org/licenses/by/4.0/
Description: This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.
Licensing start date: 10.09.2025
Applies to: VoR

Secondary language

Language: Slovenian
Title: User-defined trade-offs in LLM benchmarking: balancing accuracy, scale, and sustainability
Keywords: veliki jezikovni modeli (large language models), večkriterijsko odločanje (multi-criteria decision-making)

