TY - GEN
T1 - Ultra-Large-Scale Repository Analysis via Graph Compression
AU - Boldi, Paolo
AU - Pietri, Antoine
AU - Vigna, Sebastiano
AU - Zacchiroli, Stefano
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/2/1
Y1 - 2020/2/1
N2 - We consider the problem of mining the development history - as captured by modern version control systems - of ultra-large-scale software archives (e.g., tens of millions software repositories corresponding). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly-sized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects - encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performances, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
AB - We consider the problem of mining the development history - as captured by modern version control systems - of ultra-large-scale software archives (e.g., tens of millions software repositories corresponding). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly-sized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects - encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performances, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
KW - development history
KW - graph compression
KW - mining software repositories
KW - software evolution
KW - source code
KW - version control systems
U2 - 10.1109/SANER48275.2020.9054827
DO - 10.1109/SANER48275.2020.9054827
M3 - Conference contribution
AN - SCOPUS:85083579373
T3 - SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering
SP - 184
EP - 194
BT - SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering
A2 - Kontogiannis, Kostas
A2 - Khomh, Foutse
A2 - Chatzigeorgiou, Alexander
A2 - Fokaefs, Marios-Eleftherios
A2 - Zhou, Minghui
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020
Y2 - 18 February 2020 through 21 February 2020
ER -