How Dataset Diversity Affects Generalization in ML-Based NIDS

  • Telecom Sudparis
  • Institut Polytechnique de Paris

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Machine Learning-based Network Intrusion Detection Systems (ML-based NIDS) rely heavily on the quality of the datasets used for training and evaluation. However, widely used NIDS benchmarks often suffer from poor data diversity, which limits model generalization and undermines the reliability of evaluation protocols. While prior work has acknowledged this limitation, a systematic framework to quantify dataset diversity and analyze its relationship with performance is still missing. To address this gap, we introduce a structured approach for characterizing dataset diversity in ML-based NIDS, grounded in measurement theory. We distinguish three types of diversity—intra-class, inter-class, and domain-shift—and operationalize their measurement using established metrics such as the Vendi Score and the Jensen-Shannon divergence. Our empirical analysis on the CIC-IDS2018 dataset, spanning sixty diversity-controlled train–test experiments, provides new insights into the relationship between diversity and generalization and demonstrates the value of diversity-aware data sampling for improving evaluation reliability.
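The two metrics named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the RBF similarity kernel, the `gamma` parameter, the function names `vendi_score` and `js_divergence`, and the toy inputs are all assumptions made for illustration, and the abstract does not state which metric is paired with which diversity type. The Vendi Score is computed here as the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix, and the Jensen-Shannon divergence is computed in base 2 between two discrete distributions.

```python
import numpy as np


def vendi_score(X, gamma=1.0):
    """Vendi Score of the rows of X under an RBF kernel (kernel choice is an assumption)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Similarity matrix with ones on the diagonal.
    K = np.exp(-gamma * sq_dist)
    # Eigenvalues of K / n are non-negative and sum to 1.
    eig = np.linalg.eigvalsh(K / n)
    eig = eig[eig > 1e-12]  # drop numerical zeros before taking logs
    entropy = -np.sum(eig * np.log(eig))
    return float(np.exp(entropy))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


if __name__ == "__main__":
    # Hypothetical toy data: four identical flows vs. four well-separated flows.
    identical = np.zeros((4, 3))
    separated = np.diag([100.0, 200.0, 300.0, 400.0])
    print(vendi_score(identical), vendi_score(separated))
    # Hypothetical class histograms for a train split and a test split.
    print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

Under these assumptions, a set of identical samples scores 1 and a set of n mutually dissimilar samples scores n, so the Vendi Score reads as an effective number of distinct samples. The base-2 Jensen-Shannon divergence is bounded between 0 (identical distributions) and 1 (disjoint support).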

Original language: English
Title of host publication: Computer Security – ESORICS 2025 - 30th European Symposium on Research in Computer Security, Proceedings
Editors: Vincent Nicomette, Abdelmalek Benzekri, Nora Boulahia-Cuppens, Jaideep Vaidya
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 269-288
Number of pages: 20
ISBN (Print): 9783032078834
DOIs
Publication status: Published - 1 Jan 2026
Event: 30th European Symposium on Research in Computer Security, ESORICS 2025 - Toulouse, France
Duration: 22 Sept 2025 – 24 Sept 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16053 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th European Symposium on Research in Computer Security, ESORICS 2025
Country/Territory: France
City: Toulouse
Period: 22/09/25 – 24/09/25

Keywords

  • Diversity
  • Generalization
  • Machine Learning
  • Measurement Theory
  • NIDS Datasets
  • Performance Evaluation
