Fingerprinting and Building Large Reproducible Datasets

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their reproducibility. We leveraged all the sources feeding the Software Heritage append-only archive which are accessible through a unified programming interface to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.

Original language: English
Title of host publication: Proceedings of the 1st ACM Conference on Reproducibility and Replicability, REP 2023
Publisher: Association for Computing Machinery, Inc
Pages: 27-36
Number of pages: 10
ISBN (Electronic): 9798400701764
Publication status: Published - 27 Jun 2023
Event: 1st ACM Conference on Reproducibility and Replicability, REP 2023 - Santa Cruz, United States
Duration: 27 Jun 2023 to 29 Jun 2023

Publication series

Name: Proceedings of the 1st ACM Conference on Reproducibility and Replicability, REP 2023

Conference

Conference: 1st ACM Conference on Reproducibility and Replicability, REP 2023
Country/Territory: United States
City: Santa Cruz
Period: 27/06/23 to 29/06/23

Keywords

  • dataset
  • empirical studies
  • open science
  • reproducibility
