| Title: | Mono- and cross-lingual evaluation of representation language models on less-resourced languages |
|---|
| Authors: | Ulčar, Matej (Author); Žagar, Aleš (Author); Armendariz, Carlos S. (Author); Repar, Andraž, Institut "Jožef Stefan" (Author); Pollak, Senja, Institut "Jožef Stefan" (Author); Purver, Matthew, Institut "Jožef Stefan" (Author); Robnik Šikonja, Marko (Author) |
| Files: | URL - Source URL: https://www.sciencedirect.com/science/article/pii/S0885230825000774?via%3Dihub
PDF - Full-text file (2.28 MB), MD5: B4BE1F405393A7D919CA1A369BFD46BC |
|---|
| Language: | English |
|---|
| Typology: | 1.01 - Original Scientific Article |
|---|
| Organization: | IJS - Jožef Stefan Institute |
|---|
| Abstract: | The current dominance of large language models in natural language processing is based on their contextual awareness. For text classification, text representation models, such as ELMo, BERT, and BERT derivatives, are typically fine-tuned for a specific problem. Most existing work focuses on English; in contrast, we present a large-scale multilingual empirical comparison of several monolingual and multilingual ELMo and BERT models using 14 classification tasks in nine languages. The results show that the choice of the best model largely depends on the task and language, especially in a cross-lingual setting. In monolingual settings, monolingual BERT models tend to perform best among BERT models. Among ELMo models, those trained on large corpora dominate. Cross-lingual knowledge transfer is feasible on most tasks already in a zero-shot setting without losing much performance. |
|---|
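The evaluation setup described in the abstract, fine-tuning a BERT-style encoder on a classification task in one language and evaluating it zero-shot in another, can be sketched as follows. This is an illustrative sketch only, not the authors' code: the `bert-base-multilingual-cased` checkpoint, the Hugging Face `transformers` API, the two-label setup, and the toy English/Slovene sentences are assumptions standing in for the paper's 14 tasks and nine languages.

```python
# Illustrative sketch: fine-tune a multilingual BERT-style encoder on a
# source-language classification task, then evaluate zero-shot on a target
# language. Model name, labels, and sentences are placeholders, not the
# datasets used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"   # assumption: any mBERT-like checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def encode(texts, labels):
    # Tokenize a list of sentences and attach integer class labels.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    batch["labels"] = torch.tensor(labels)
    return batch

# Hypothetical data: fine-tune on English, evaluate zero-shot on Slovene.
train_batch = encode(["great movie", "terrible plot"], [1, 0])
test_batch = encode(["odličen film", "grozen zaplet"], [1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                             # a few passes over the toy batch
    out = model(**train_batch)                 # loss is computed from "labels"
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Zero-shot transfer: no target-language examples were seen during fine-tuning.
model.eval()
with torch.no_grad():
    logits = model(**test_batch).logits
preds = logits.argmax(dim=-1)
accuracy = (preds == test_batch["labels"]).float().mean().item()
print(f"zero-shot target-language accuracy: {accuracy:.2f}")
```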
| Keywords: | monolingual models, multilingual models, ELMo, BERT, corpus, cross-lingual datasets |
|---|
| Publication status: | Published |
|---|
| Publication version: | Version of Record |
|---|
| Submitted for review: | 03.09.2023 |
|---|
| Article acceptance date: | 03.06.2025 |
|---|
| Publication date: | 27.06.2025 |
|---|
| Publisher: | Elsevier |
|---|
| Year of publishing: | 2026 |
|---|
| Number of pages: | pp. 1-29 |
|---|
| Numbering: | Vol. 95, [article no.] 101852 |
|---|
| Source: | Netherlands |
|---|
| PID: | 20.500.12556/DiRROS-22874  |
|---|
| UDC: | 004.8 |
|---|
| ISSN on article: | 1095-8363 |
|---|
| DOI: | 10.1016/j.csl.2025.101852  |
|---|
| COBISS.SI-ID: | 241622275  |
|---|
| Copyright: | © 2025 The Authors. |
|---|
| Note: | Title taken from the title screen; source described as of 7 Jul 2025 |
|---|
| Publication date in DiRROS: | 07.07.2025 |
|---|