
The software heritage graph dataset: Public software development under one roof

  • INRIA Institut National de Recherche en Informatique et en Automatique
  • Athens Univ. of Econ. and Business
  • Laboratoire de Probabilités et Modèles Aléatoires

Research output: Chapter in a book, report, anthology, or collection; Conference contribution; Peer-reviewed

Abstract

Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
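To make the "fully-deduplicated Merkle DAG" idea concrete, here is a minimal sketch of how content-derived identifiers cause identical files and directories to collapse into a single node. The hashing scheme, function names, and sample repositories are assumptions made for illustration only; they are not the identifier format actually used by the Software Heritage archive.

```python
import hashlib
from typing import Dict, List


def node_id(kind: str, payload: bytes) -> str:
    # Illustrative scheme (an assumption): hash a type tag, the payload length, and the payload.
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def content_id(data: bytes) -> str:
    # Leaf: the identifier depends only on the file bytes, not on its name or location.
    return node_id("content", data)


def directory_id(entries: Dict[str, str]) -> str:
    # Inner node: hash the sorted (name, child identifier) pairs.
    payload = "\n".join(f"{name} {child}" for name, child in sorted(entries.items()))
    return node_id("directory", payload.encode())


def commit_id(root: str, parents: List[str], message: str) -> str:
    # Commit: covers the root directory, the parent commits, and the message.
    lines = [f"directory {root}"] + [f"parent {p}" for p in parents] + ["", message]
    return node_id("commit", "\n".join(lines).encode())


# Two hypothetical repositories that ship the same license file.
license_id = content_id(b"Example license text\n")
repo_a = directory_id({
    "LICENSE": license_id,
    "main.c": content_id(b"int main(void) { return 0; }\n"),
})
repo_b = directory_id({
    "LICENSE": license_id,
    "app.py": content_id(b"print('hi')\n"),
})

# Both directories reference the same leaf identifier, so the file is one node in the graph.
first = commit_id(repo_a, [], "initial import")
second = commit_id(repo_a, [first], "second commit reusing the same tree")
```

Because every identifier is computed from the identifiers beneath it, any change to a file propagates upward into new directory and commit identifiers, while unchanged subtrees keep their identifiers and are shared across projects.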

Original language: English
Title: Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019
Publisher: IEEE Computer Society
Pages: 138-142
Number of pages: 5
ISBN (Electronic): 9781728134123
Status: Published - 1 May 2019
Externally published: Yes
Event: 16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019 - Montreal, Canada
Duration: 26 May 2019 to 27 May 2019

Publication series

Name: IEEE International Working Conference on Mining Software Repositories
Volume: 2019-May
ISSN (Print): 2160-1852
ISSN (Electronic): 2160-1860

Conference

Conference: 16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019
Country/Territory: Canada
City: Montreal
Period: 26/05/19 to 27/05/19
