| Title: | Fake news detection through LLM-driven text augmentation across media and languages |
|---|
| Authors: | Sittar, Abdul, Institut "Jožef Stefan" (Author); Smiljanić, Mateja (Author); Guček, Alenka, Institut "Jožef Stefan" (Author); Grobelnik, Marko, Institut "Jožef Stefan" (Author) |
| Files: | URL - Source URL, visit https://www.mdpi.com/2504-4990/8/4/103
PDF - Presentation file, download (1.16 MB) MD5: E39200DAAD502418FED92E1A52628BBC
|
|---|
| Language: | English |
|---|
| Typology: | 1.01 - Original Scientific Article |
|---|
| Organization: | IJS - Jožef Stefan Institute
|
|---|
| Abstract: | The proliferation of fake news across social media, headlines, and news articles poses major challenges for automated detection, particularly in multilingual and cross-media settings affected by data imbalance. We propose a fake news detection framework based on LLM-driven, feature-guided text augmentation. The method generates realistic synthetic samples across languages, media types, and text granularities while preserving meaning and stylistic coherence. Experiments with classical and transformer-based models (Random Forest, Logistic Regression, BERT, XLM-R) across social media, headlines, and multilingual news datasets show consistent improvements in performance. For inherently balanced datasets (e.g., social media), synthetic augmentation yields negligible but stable performance changes. Across imbalanced scenarios, synthetic augmentation substantially improves minority-class recall and F1-score (e.g., fake news recall from 0.57 to 0.86), while preserving majority-class performance, leading to more balanced and reliable classifiers, whereas oversampling significantly degrades results due to overfitting on duplicated language patterns. Overall, a hybrid semantic- and style-based model proves to be the most robust strategy, outperforming oversampling and matching or exceeding baseline performance across datasets. |
|---|
| Keywords: | fake news detection, low-resource languages, data imbalance, synthetic data generation, prompt engineering, style-based features, semantic features |
|---|
| Publication status: | Published |
|---|
| Publication version: | Version of Record |
|---|
| Submitted for review: | 02.03.2026 |
|---|
| Article acceptance date: | 09.04.2026 |
|---|
| Publication date: | 15.04.2026 |
|---|
| Publisher: | MDPI |
|---|
| Year of publishing: | 2026 |
|---|
| Number of pages: | pp. 1-32 |
|---|
| Numbering: | Vol. 8, iss. 4, [article no.] 103 |
|---|
| Source: | Switzerland |
|---|
| PID: | 20.500.12556/DiRROS-29227  |
|---|
| UDC: | 004.8 |
|---|
| ISSN on article: | 2504-4990 |
|---|
| DOI: | 10.3390/make8040103  |
|---|
| COBISS.SI-ID: | 276627715  |
|---|
| Copyright: | © 2026 by the authors. |
|---|
| Note: | Title from title screen;
Co-authors: Mateja Smiljanić, Alenka Guček, Marko Grobelnik;
Description of the resource as of 28 April 2026;
|
|---|
| Publication date in DiRROS: | 28.04.2026 |
|---|
| Views: | 46 |
|---|
| Downloads: | 21 |
|---|