TY - GEN
T1 - Large-scale, diverse, paraphrastic bitexts via sampling and clustering
AU - Hu, J. Edward
AU - Singh, Abhinav
AU - Holzenberger, Nils
AU - Post, Matt
AU - Van Durme, Benjamin
N1 - Publisher Copyright: © 2019 Association for Computational Linguistics.
PY - 2019/1/1
Y1 - 2019/1/1
N2 - Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe PARABANK 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that PARABANK 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.
UR - https://www.scopus.com/pages/publications/85084331385
U2 - 10.18653/v1/K19-1005
DO - 10.18653/v1/K19-1005
M3 - Conference contribution
AN - SCOPUS:85084331385
T3 - CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference
SP - 44
EP - 54
BT - CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference
PB - Association for Computational Linguistics
T2 - 23rd Conference on Computational Natural Language Learning, CoNLL 2019
Y2 - 3 November 2019 through 4 November 2019
ER -