
Addressing data scarcity in multilingual fake news detection: an LLM-based dataset augmentation approach

Research output: Contribution to journal › Article › peer-review

Abstract

The rise in online news consumption, especially during critical events, coupled with rapid advances in generative artificial intelligence (AI), has accelerated the spread of misinformation, underscoring the urgent need for fast and effective fake news detection approaches. However, the scarcity and imbalance of high-quality labeled datasets pose significant challenges to training accurate and reliable detection models. In this study, we tackle this issue by leveraging Large Language Models (LLMs) for data augmentation. Expanding upon our prior work, we employ Llama 3 to generate synthetic news samples under zero-shot and few-shot settings, enriching existing fake news datasets to improve the performance of detection models. To optimize augmentation effectiveness, we explore several strategies, including varying augmentation rates, random versus similarity-based subsampling, and class-specific augmentation. Our experiments, using BERT-based classifiers on two real-world multilingual datasets, reveal that selectively augmenting only the fake news class at lower rates typically yields the most consistent improvements, with similarity-based subsampling slightly outperforming random selection. The augmentation approach led to F1 score improvements of up to 7.7 points in some languages. Additionally, while few-shot-generated samples generally exhibit greater similarity to the original ones, their impact on classification remains inconsistent. These findings highlight the potential of LLM-driven data augmentation, when carefully tuned, to enhance fake news detection.
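The subsampling strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the bag-of-words cosine similarity is an assumed stand-in, since the abstract does not say which similarity measure or text representation the study uses. The sketch augments only the fake news class, picking a fraction (`rate`) of the original class size from a pool of LLM-generated samples, either at random or by similarity to the originals.

```python
import math
import random
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def subsample_synthetic(originals, synthetic, rate, strategy="similarity", seed=0):
    """Select int(rate * len(originals)) synthetic texts (capped at pool size).

    strategy="random": uniform sample without replacement.
    strategy="similarity": keep the synthetic texts whose best match among
    the originals has the highest cosine similarity (assumed metric).
    """
    k = min(len(synthetic), int(rate * len(originals)))
    if strategy == "random":
        return random.Random(seed).sample(synthetic, k)
    orig_vecs = [Counter(t.lower().split()) for t in originals]

    def score(text):
        vec = Counter(text.lower().split())
        return max((cosine(vec, o) for o in orig_vecs), default=0.0)

    return sorted(synthetic, key=score, reverse=True)[:k]


# Toy data (invented for illustration): augment the fake class only.
fake_originals = [
    "miracle cure hidden by doctors",
    "aliens rigged the election results",
]
fake_synthetic = [
    "doctors hide miracle cure from public",
    "local council approves new bike lanes",
    "election results rigged by aliens, sources say",
    "weather stays mild this weekend",
]
selected = subsample_synthetic(fake_originals, fake_synthetic, rate=1.0)
augmented_fake = fake_originals + selected
```

With `rate=1.0` the fake class doubles in size; lower rates, which the abstract reports as more consistently helpful, select proportionally fewer synthetic samples. In practice a sentence-embedding model would likely replace the token-count vectors, but that choice is also an assumption here.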

Original language: English
Article number: 92
Journal: Social Network Analysis and Mining
Volume: 15
Issue number: 1
Publication status: Published - 1 Dec 2025

Keywords

  • Data augmentation
  • Fake news detection
  • Few-shot and zero-shot prompting
  • Large language models (LLMs)
  • Misinformation detection
