GLADIS: A General and Large Acronym Disambiguation Benchmark

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.

Original language: English
Title of host publication: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 2065-2080
Number of pages: 16
ISBN (Electronic): 9781959429449
Publication status: Published - 1 Jan 2023
Event: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 - 6 May 2023

Publication series

Name: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory: Croatia
City: Dubrovnik
Period: 2/05/23 - 6/05/23
