
A Large-scale Dataset of (Open Source) License Text Variants

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, to train automated license classifiers, to perform natural language processing (NLP) analyses of legal texts, and to carry out historical and phylogenetic studies on FOSS licensing. Additional metadata about the shipped license files are also provided, making the dataset ready to use in various contexts; they include file length measures, detected MIME type, detected SPDX license (using ScanCode), an example origin (e.g., a GitHub repository), and the oldest public commit in which the license appeared. The dataset is released as open data in the form of an archive file containing all deduplicated license files, plus several portable CSV files for metadata, which reference files via cryptographic checksums.
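As a rough illustration of how metadata rows can be tied to license files through a checksum, here is a minimal Python sketch. It is not the dataset's actual schema: the column names (`sha1`, `mime_type`, `line_count`), the choice of SHA-1 as the checksum, and the toy in-memory data are all assumptions made for the example.

```python
import csv
import hashlib
import io


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of raw file content."""
    return hashlib.sha1(data).hexdigest()


def index_by_checksum(csv_text: str, key: str = "sha1") -> dict:
    """Parse CSV text and index its rows by the checksum column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


# Toy stand-ins (hypothetical): one license blob and one metadata CSV.
license_blob = b"example license text\n"
checksum = sha1_hex(license_blob)
metadata_csv = "sha1,mime_type,line_count\n" + f"{checksum},text/plain,1\n"

# Look up the metadata row that describes the blob.
index = index_by_checksum(metadata_csv)
record = index.get(checksum)
print(record["mime_type"], record["line_count"])
```

Keying every metadata table on the same content checksum means the tables can be joined to each other, and to the deduplicated files, without relying on file names or paths.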

Original language: English
Title of host publication: Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 757-761
Number of pages: 5
ISBN (Electronic): 9781450393034
Publication status: Published - 17 Oct 2022
Event: 2022 Mining Software Repositories Conference, MSR 2022 - Hybrid, Pittsburgh, United States
Duration: 23 May 2022 to 24 May 2022

Publication series

Name: Proceedings - 2022 Mining Software Repositories Conference, MSR 2022

Conference

Conference: 2022 Mining Software Repositories Conference, MSR 2022
Country/Territory: United States
City: Hybrid, Pittsburgh
Period: 23/05/22 to 24/05/22

Keywords

  • Dataset
  • copyright
  • intellectual property
  • natural language processing
  • open source
  • software engineering
  • software license
