TY - JOUR
T1 - Determining the intrinsic structure of public software development history
T2 - an exploratory study
AU - Pietri, Antoine
AU - Rousseau, Guillaume
AU - Zacchiroli, Stefano
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2026/2/1
Y1 - 2026/2/1
AB - Collaborative software development has produced a wealth of software source code artifacts (source files and directories, commits, releases, etc.) that have been studied for decades by researchers in empirical software engineering. Due to code reuse and the fork-based development model, those artifacts form a globally interconnected graph of a size comparable to the graph of the Web. Little is known yet about the network structure of this graph; such knowledge is useful to determine the best practical approaches to efficiently analyze very large subsets of it (if not all of it) in a methodologically sound manner. In this paper we determine the most salient network topology properties of the global public software development history as captured by state-of-the-art version control systems (VCS). As our corpus we use Software Heritage, one of the largest and most diverse publicly available archives of VCS data—encompassing 9 billion unique source code files and 2 billion unique commits coming from about 150 million projects or, as a graph, 19 billion nodes and 221 billion edges. We explore topology characteristics such as: degree distributions; distribution of connected component sizes; and distribution of shortest path lengths. We characterize these topology aspects for both the entire graph and relevant subgraphs.
KW - Complex network
KW - Graph structure
KW - Open source
KW - Source code
KW - Statistical mechanics
KW - Version control system
UR - https://www.scopus.com/pages/publications/105020374121
U2 - 10.1007/s10664-025-10741-y
DO - 10.1007/s10664-025-10741-y
M3 - Article
AN - SCOPUS:105020374121
SN - 1382-3256
VL - 31
JO - Empirical Software Engineering
JF - Empirical Software Engineering
IS - 1
M1 - 5
ER -