Using the uniqueness of global identifiers to determine the provenance of Python software source code

Research output: Contribution to journal › Article › peer-review

Abstract

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to the problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are highly distinctive. Across PyPI's 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The validation procedure uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
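The identifier-based narrowing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_freq` threshold used to approximate "non-frequent", the inclusion of nested definitions, and the choice to intersect the package sets of the sampled identifiers are all assumptions made here for illustration. The popularity ranking and the repeated trials are omitted.

```python
import ast
import random


def global_identifiers(source: str) -> set[str]:
    """Collect class names and function/method names defined in a module.

    Assumption: nested definitions are included as well, for simplicity.
    """
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def build_index(packages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of packages that define it."""
    index: dict[str, set[str]] = {}
    for package, sources in packages.items():
        for source in sources:
            for name in global_identifiers(source):
                index.setdefault(name, set()).add(package)
    return index


def candidate_origins(source, index, k=3, max_freq=10, rng=None):
    """Sample up to k non-frequent identifiers and intersect their packages.

    `max_freq` is a hypothetical cut-off: identifiers found in more than
    `max_freq` packages are treated as too common to be informative.
    """
    if rng is None:
        rng = random.Random()
    # Keep identifiers that are known to the index and not too common.
    names = sorted(
        name
        for name in global_identifiers(source)
        if 0 < len(index.get(name, ())) <= max_freq
    )
    if not names:
        return set()
    sample = rng.sample(names, min(k, len(names)))
    return set.intersection(*(index[name] for name in sample))


# Toy corpus standing in for a package index (hypothetical data).
corpus = {
    "alpha": ["class AlphaParser:\n    def parse_alpha(self):\n        pass\n"],
    "beta": ["def beta_transform(x):\n    return x\n\ndef helper():\n    pass\n"],
    "gamma": ["def helper():\n    pass\n"],
}
index = build_index(corpus)
copied_file = "class AlphaParser:\n    def parse_alpha(self):\n        pass\n"
print(candidate_origins(copied_file, index))
```

In this toy corpus the copied file defines only identifiers that occur in a single package, so the candidate set contains that one package. An identifier such as `helper`, defined in two packages, would leave both as candidates for further inspection.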

Original language: English
Article number: 107
Journal: Empirical Software Engineering
Volume: 28
Issue number: 5
Publication status: Published - 1 Oct 2023

Keywords

  • Identifiers
  • Open source software
  • Python
  • Software provenance
  • Source code tracking
