IBEX: Harvesting entities from the web using unique identifiers

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73-96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
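As an illustration of how a property of an identifier can be used to filter noise, the sketch below extracts ISBN-13 candidates from text and keeps only those whose check digit is valid. This is an assumption made for illustration, not necessarily the filtering used in the paper: the abstract does not say which properties are exploited. The regular expression and the function names `has_valid_gtin13_check_digit` and `harvest_isbn13` are hypothetical; the check-digit rule itself (alternating weights 1 and 3, sum divisible by 10) is the standard one for ISBN-13 and GTIN-13.

```python
import re

# Hypothetical candidate pattern: 13 digits starting with 978 or 979,
# optionally separated by single hyphens or whitespace characters.
ISBN13_CANDIDATE = re.compile(r"\b97[89](?:[-\s]?\d){10}\b")


def has_valid_gtin13_check_digit(digits: str) -> bool:
    """Return True if a 13-digit string passes the ISBN-13/GTIN-13 checksum."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    # Weights alternate 1, 3, 1, 3, ... from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def harvest_isbn13(text: str) -> list[str]:
    """Extract distinct, checksum-valid ISBN-13 strings in order of appearance."""
    found: list[str] = []
    for match in ISBN13_CANDIDATE.finditer(text):
        digits = re.sub(r"[-\s]", "", match.group())
        if has_valid_gtin13_check_digit(digits) and digits not in found:
            found.append(digits)
    return found


if __name__ == "__main__":
    page = "ISBN (Electronic) 978-1-4503-3627-7, misprinted as 9781450336278"
    print(harvest_isbn13(page))
```

A checksum test of this kind only rejects malformed strings. It says nothing about whether the name found next to an identifier is the right one, which is the harder part of the task the abstract describes.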

Original language: English
Title of host publication: 18th International Workshop on the Web and Databases, WebDB 2015
Subtitle of host publication: Freshness, Correctness, Quality of Information and Knowledge on the Web - Proceedings
Editors: Julia Stoyanovich, Fabian M. Suchanek
Publisher: Association for Computing Machinery
Pages: 13-19
Number of pages: 7
ISBN (Electronic): 9781450336277
Publication status: Published - 31 May 2015
Event: 18th International Workshop on the Web and Databases, WebDB 2015, co-located with ACM SIGMOD - Melbourne, Australia
Duration: 31 May 2015 to 31 May 2015

Publication series

Name: 18th International Workshop on the Web and Databases, WebDB 2015: Freshness, Correctness, Quality of Information and Knowledge on the Web - Proceedings

Conference

Conference: 18th International Workshop on the Web and Databases, WebDB 2015, co-located with ACM SIGMOD
Country/Territory: Australia
City: Melbourne
Period: 31/05/15 to 31/05/15
