Large-scale, diverse, paraphrastic bitexts via sampling and clustering

J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, Benjamin Van Durme

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe PARABANK 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that PARABANK 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

Original language: English
Title of host publication: CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference
Publisher: Association for Computational Linguistics
Pages: 44-54
Number of pages: 11
ISBN (Electronic): 9781950737727
Publication status: Published - 1 Jan 2019
Externally published: Yes
Event: 23rd Conference on Computational Natural Language Learning, CoNLL 2019 - Hong Kong, China
Duration: 3 Nov 2019 - 4 Nov 2019

Publication series

Name: CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference

Conference

Conference: 23rd Conference on Computational Natural Language Learning, CoNLL 2019
Country/Territory: China
City: Hong Kong
Period: 3/11/19 - 4/11/19
