Unsupervised word polysemy quantification with multiresolution grids of contextual embeddings

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available.

Original languageEnglish
Title of host publicationEACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PublisherAssociation for Computational Linguistics (ACL)
Pages3391-3401
Number of pages11
ISBN (Electronic)9781954085022
DOIs
Publication statusPublished - 1 Jan 2021
Externally publishedYes
Event16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021 - Virtual, Online
Duration: 19 Apr 202123 Apr 2021

Publication series

NameEACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021
CityVirtual, Online
Period19/04/2123/04/21

Fingerprint

Dive into the research topics of 'Unsupervised word polysemy quantification with multiresolution grids of contextual embeddings'. Together they form a unique fingerprint.

Cite this