Isomorphic Cross-lingual Embeddings for Low-Resource Languages

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, they crucially depend on the assumption of embedding spaces being approximately isomorphic, i.e., sharing similar geometric structure, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: {Nepali, Finnish, Romanian, Gujarati, Hungarian}-English. Lastly, our analysis also points to the relatedness as well as the amount of related language data available as being key factors in determining the quality of embeddings achieved.
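
As a rough illustration of how bilingual lexicon induction (BLI) can score a set of CLWEs, the sketch below retrieves, for each source word, the nearest target word by cosine similarity in a shared space and reports precision@1 against a gold dictionary. The toy vectors, the example word pairs, and the function names (`cosine`, `induce_lexicon`, `precision_at_1`) are illustrative assumptions and are not taken from the paper, whose actual evaluation setup (for example the retrieval criterion) may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_lexicon(src_emb, tgt_emb):
    """Map each source word to its nearest target word by cosine similarity."""
    return {
        s_word: max(tgt_emb, key=lambda t_word: cosine(s_vec, tgt_emb[t_word]))
        for s_word, s_vec in src_emb.items()
    }


def precision_at_1(induced, gold):
    """Fraction of gold source words whose induced translation is correct."""
    hits = sum(1 for s, t in gold.items() if induced.get(s) == t)
    return hits / len(gold)


# Toy, hand-made vectors (assumption): both sides already live in one shared space.
src_emb = {"kukur": [0.9, 0.1], "pani": [0.1, 0.95], "ghar": [0.7, 0.7]}
tgt_emb = {"dog": [1.0, 0.0], "water": [0.0, 1.0], "house": [0.6, 0.8]}
gold = {"kukur": "dog", "pani": "water", "ghar": "house"}

induced = induce_lexicon(src_emb, tgt_emb)
score = precision_at_1(induced, gold)
```

In this sketch, a higher `score` means more source words retrieve their gold translation as the nearest neighbour, which is the sense in which BLI serves as a proxy for embedding quality.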

Original language: English
Title of host publication: ACL 2022 - 7th Workshop on Representation Learning for NLP, RepL4NLP 2022 - Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 133-142
Number of pages: 10
ISBN (Electronic): 9781955917483
Publication status: Published - 1 Jan 2022
Externally published: Yes
Event: 7th Workshop on Representation Learning for NLP, RepL4NLP 2022 at ACL 2022 - Dublin, Ireland
Duration: 26 May 2022 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 7th Workshop on Representation Learning for NLP, RepL4NLP 2022 at ACL 2022
Country/Territory: Ireland
City: Dublin
Period: 26/05/22 → …
