Digital repository of Slovenian research organisations

Title:Fake news detection through LLM-driven text augmentation across media and languages
Authors:Sittar, Abdul, Institut "Jožef Stefan" (Author)
Smiljanić, Mateja (Author)
Guček, Alenka, Institut "Jožef Stefan" (Author)
Grobelnik, Marko, Institut "Jožef Stefan" (Author)
Files:URL - Source URL: https://www.mdpi.com/2504-4990/8/4/103
PDF - Presentation file (1.16 MB)
MD5: E39200DAAD502418FED92E1A52628BBC
Language:English
Typology:1.01 - Original Scientific Article
Organization:Logo IJS - Jožef Stefan Institute
Abstract:The proliferation of fake news across social media, headlines, and news articles poses major challenges for automated detection, particularly in multilingual and cross-media settings affected by data imbalance. We propose a fake news detection framework based on LLM-driven, feature-guided text augmentation. The method generates realistic synthetic samples across languages, media types, and text granularities while preserving meaning and stylistic coherence. Experiments with classical and transformer-based models (Random Forest, Logistic Regression, BERT, XLM-R) across social media, headlines, and multilingual news datasets show consistent improvements in performance. For inherently balanced datasets (e.g., social media), synthetic augmentation yields negligible but stable performance changes. Across imbalanced scenarios, synthetic augmentation substantially improves minority-class recall and F1-score (e.g., fake news recall from 0.57 to 0.86) while preserving majority-class performance, leading to more balanced and reliable classifiers. In contrast, oversampling significantly degrades results due to overfitting on duplicated language patterns. Overall, a hybrid semantic- and style-based model proves to be the most robust strategy, outperforming oversampling and matching or exceeding baseline performance across datasets.
Keywords:fake news detection, low-resource languages, data imbalance, synthetic data generation, prompt engineering, style-based features, semantic features
Publication status:Published
Publication version:Version of Record
Submitted for review:02.03.2026
Article acceptance date:09.04.2026
Publication date:15.04.2026
Publisher:MDPI
Year of publishing:2026
Number of pages:pp. 1-32
Numbering:Vol. 8, iss. 4, [article no.] 103
Source:Switzerland
PID:20.500.12556/DiRROS-29227
UDC:004.8
ISSN on article:2504-4990
DOI:10.3390/make8040103
COBISS.SI-ID:276627715
Copyright:© 2026 by the authors.
Note:Title from title screen; Co-authors: Mateja Smiljanić, Alenka Guček, Marko Grobelnik; Description based on the source as of 28.04.2026;
Publication date in DiRROS:28.04.2026
Views:46
Downloads:21

Record is a part of a journal

Title:Machine learning and knowledge extraction
Publisher:MDPI
ISSN:2504-4990
COBISS.SI-ID:1537706179

Document is financed by a project

Funder:EC - European Commission
Project number:101095095
Name:TWin of Online Social Networks
Acronym:TWON

Funder:EC - European Commission
Project number:101252405
Name:PERISCOPE project

Licences

License:CC BY 4.0, Creative Commons Attribution 4.0 International
Link:http://creativecommons.org/licenses/by/4.0/
Description:This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.
Licensing start date:15.04.2026
Applies to:VoR

Secondary language

Language:Slovenian
Keywords:prepoznavanje lažnih novic, jeziki z omejenimi viri, neuravnoteženost podatkov, generiranje sintetičnih podatkov, promptno inženirstvo, stilske značilke, semantične značilke

