Ultra-Large-Scale Repository Analysis via Graph Compression

Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly sized corpora. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects and encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to those of random memory access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.

Original language: English
Title of host publication: SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering
Editors: Kostas Kontogiannis, Foutse Khomh, Alexander Chatzigeorgiou, Marios-Eleftherios Fokaefs, Minghui Zhou
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 184-194
Number of pages: 11
ISBN (Electronic): 9781728151434
DOIs
Publication status: Published - 1 Feb 2020
Externally published: Yes
Event: 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020 - London, Canada
Duration: 18 Feb 2020 to 21 Feb 2020

Publication series

Name: SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering

Conference

Conference: 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020
Country/Territory: Canada
City: London
Period: 18/02/20 to 21/02/20

Keywords

  • development history
  • graph compression
  • mining software repositories
  • software evolution
  • source code
  • version control systems
