TY - GEN
T1 - Unsupervised word polysemy quantification with multiresolution grids of contextual embeddings
AU - Xypolopoulos, Christos
AU - Tixier, Antoine J.P.
AU - Vazirgiannis, Michalis
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021/1/1
N2 - The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available.
UR - https://www.scopus.com/pages/publications/85107264909
U2 - 10.18653/v1/2021.eacl-main.297
DO - 10.18653/v1/2021.eacl-main.297
M3 - Conference contribution
AN - SCOPUS:85107264909
T3 - EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
SP - 3391
EP - 3401
BT - EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021
Y2 - 19 April 2021 through 23 April 2021
ER -