| Title: | User-defined trade-offs in LLM benchmarking: balancing accuracy, scale, and sustainability |
|---|
| Authors: | Gjorgjevikj, Ana, Institut "Jožef Stefan" (Author); Nikolikj, Ana, Institut "Jožef Stefan" (Author); Koroušić-Seljak, Barbara, Institut "Jožef Stefan" (Author); Eftimov, Tome, Institut "Jožef Stefan" (Author) |
| Files: | URL - Source URL, available at https://www.sciencedirect.com/science/article/pii/S0950705125014443
PDF - Presentation file, download (33.40 MB) MD5: 1A7A62F9E1CE24D517D98A3072D7E171
|
|---|
| Language: | English |
|---|
| Typology: | 1.01 - Original scientific article |
|---|
| Organization: | IJS - Institut Jožef Stefan |
|---|
| Abstract: | This paper presents xLLMBench, a transparent, decision-centric benchmarking framework that empowers decision-makers to rank large language models (LLMs) according to their preferences across diverse, potentially conflicting performance and non-performance criteria, e.g., domain accuracy, model size, energy consumption, and CO₂ emissions. Existing LLM benchmarking methods often rely on individual performance criteria (metrics) or on human feedback, so methods that systematically combine multiple criteria into a single interpretable ranking are lacking. Methods that consider human preferences typically rely on direct human feedback to determine rankings, which can be resource-intensive and may not be fully aligned with application-specific requirements. Motivated by these limitations of LLM benchmarking, xLLMBench leverages multi-criteria decision-making methods to give decision-makers the flexibility to tailor the benchmarking process to their requirements. It focuses on the final step of the benchmarking process (robust analysis of benchmarking results), which in the case of LLMs often involves ranking them. The framework assumes that the selection of datasets, metrics, and LLMs involved in the experiment follows established best practices. We demonstrate xLLMBench's usefulness in two scenarios: combining LLM results for one metric across different datasets, and combining results for multiple metrics within one dataset. Our results show that while some LLMs maintain stable rankings, others exhibit significant changes when correlated datasets are removed or when the focus shifts to contamination-free datasets or fairness metrics. This highlights that LLMs have distinct strengths and weaknesses that go beyond overall performance. Our sensitivity analysis reveals robust rankings, while diverse visualizations enhance transparency. xLLMBench can be used with existing platforms to support transparent, reproducible, and contextually meaningful LLM benchmarking. |
|---|
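The abstract's core idea, aggregating conflicting criteria into a single interpretable ranking via multi-criteria decision-making, can be illustrated with a minimal weighted-sum sketch. All model names, criterion values, and weights below are hypothetical, and xLLMBench's actual MCDM methods may differ:

```python
# Illustrative weighted-sum multi-criteria ranking of LLMs.
# Models, scores, and weights are made up for this sketch; they are not
# taken from the paper, and xLLMBench's actual methods may differ.

# Per model: (accuracy [benefit], size in B params [cost], energy in kWh [cost])
models = {
    "model_a": (0.82, 70.0, 12.0),
    "model_b": (0.78, 7.0, 1.5),
    "model_c": (0.85, 180.0, 30.0),
}
weights = (0.6, 0.2, 0.2)        # decision-maker preferences, summing to 1
benefit = (True, False, False)   # True: higher is better; False: lower is better

def normalize(values, is_benefit):
    """Min-max normalize one criterion to [0, 1], flipping cost criteria."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) if is_benefit else (hi - v) / (hi - lo)
            for v in values]

names = list(models)
columns = list(zip(*models.values()))                 # one tuple per criterion
norm = [normalize(col, b) for col, b in zip(columns, benefit)]
scores = {n: sum(w * norm[j][i] for j, w in enumerate(weights))
          for i, n in enumerate(names)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # → ['model_c', 'model_a', 'model_b']
```

With these weights the most accurate model wins despite its size and energy cost; shifting weight toward the cost criteria would favor the small model, which is the kind of user-defined trade-off the framework targets.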
| Keywords: | large language models, benchmarking, multi-criteria decision-making |
|---|
| Publication status: | Published |
|---|
| Publication version: | Published version |
|---|
| Submitted for review: | 27.05.2025 |
|---|
| Article acceptance date: | 01.09.2025 |
|---|
| Publication date: | 10.09.2025 |
|---|
| Publisher: | Elsevier |
|---|
| Year of publication: | 2025 |
|---|
| Number of pages: | pp. 1-30 |
|---|
| Numbering: | Vol. 330, pt. A, [article no.] 114405 |
|---|
| Country of origin: | Netherlands |
|---|
| PID: | 20.500.12556/DiRROS-23765  |
|---|
| UDC: | 004.8 |
|---|
| Article ISSN: | 1872-7409 |
|---|
| DOI: | 10.1016/j.knosys.2025.114405  |
|---|
| COBISS.SI-ID: | 251254531  |
|---|
| Copyright: | © 2025 The Author(s). |
|---|
| Note: | Title from title screen;
Co-authors: Ana Nikolikj, Barbara Koroušić Seljak, Tome Eftimov;
Source description dated 1 Sep 2025;
|
|---|
| Date of publication in DiRROS: | 01.10.2025 |
|---|
| Number of views: | 326 |
|---|
| Number of downloads: | 140 |
|---|