TY - GEN
T1 - Building a Commit-level Dataset of Real-world Vulnerabilities
AU - Challande, Alexis
AU - David, Robin
AU - Renault, Guénaël
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/4/14
Y1 - 2022/4/14
AB - While CVEs have become a de facto standard for publishing advisories on vulnerabilities, the state of current CVE databases is lackluster. Moreover, CVE advisories are insufficient to bridge the gap with the vulnerability artifacts in the impacted program. The community therefore lacks a public dataset of real-world vulnerabilities that provides this association. In this paper, we present a method that restores this missing link by analyzing the vulnerabilities from the AOSP, an aggregate of more than 1,800 projects. It is the perfect target for building a representative dataset of vulnerabilities, as it covers the full spectrum that may be encountered in a modern system where a variety of low-level and higher-level components interact. More specifically, our main contribution is a dataset of more than 1,900 vulnerabilities, associating generic metadata (e.g. vulnerability type, impact level) with their respective patches at the commit granularity (e.g. fix commit-id, affected files, source code language). Finally, we also augment this dataset by providing precompiled binaries for a subset of the vulnerabilities. These binaries open up various uses of the data, both for binary-only analysis and at the interface between source and binary. In addition to providing a common baseline benchmark, our dataset release supports the community in data-driven software security research.
KW - binary matching
KW - dataset
KW - patch detection
KW - security vulnerabilities
KW - vulnerability research
U2 - 10.1145/3508398.3511495
DO - 10.1145/3508398.3511495
M3 - Conference contribution
AN - SCOPUS:85130623179
T3 - CODASPY 2022 - Proceedings of the 12th ACM Conference on Data and Application Security and Privacy
SP - 101
EP - 106
BT - CODASPY 2022 - Proceedings of the 12th ACM Conference on Data and Application Security and Privacy
PB - Association for Computing Machinery, Inc
T2 - 12th ACM Conference on Data and Application Security and Privacy, CODASPY 2022
Y2 - 24 April 2022 through 27 April 2022
ER -