TY - GEN
T1 - A Large-scale Dataset of (Open Source) License Text Variants
AU - Zacchiroli, Stefano
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/10/17
Y1 - 2022/10/17
N2 - We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.
AB - We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.
KW - Dataset
KW - copyright
KW - intellectual property
KW - natural language processing
KW - open source
KW - software engineering
KW - software license
UR - https://www.scopus.com/pages/publications/85134015668
U2 - 10.1145/3524842.3528491
DO - 10.1145/3524842.3528491
M3 - Conference contribution
AN - SCOPUS:85134015668
T3 - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
SP - 757
EP - 761
BT - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 Mining Software Repositories Conference, MSR 2022
Y2 - 23 May 2022 through 24 May 2022
ER -