TY - GEN
T1 - Fingerprinting and Building Large Reproducible Datasets
AU - Lefeuvre, Romain
AU - Galasso, Jessie
AU - Combemale, Benoit
AU - Sahraoui, Houari
AU - Zacchiroli, Stefano
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/6/27
Y1 - 2023/6/27
AB - Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their reproducibility. We leveraged all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.
KW - dataset
KW - empirical studies
KW - open science
KW - reproducibility
UR - https://www.scopus.com/pages/publications/85165955535
U2 - 10.1145/3589806.3600043
DO - 10.1145/3589806.3600043
M3 - Conference contribution
AN - SCOPUS:85165955535
T3 - Proceedings of the 1st ACM Conference on Reproducibility and Replicability, REP 2023
SP - 27
EP - 36
BT - Proceedings of the 1st ACM Conference on Reproducibility and Replicability, REP 2023
PB - Association for Computing Machinery, Inc
T2 - 1st ACM Conference on Reproducibility and Replicability, REP 2023
Y2 - 27 June 2023 through 29 June 2023
ER -