TY - GEN
T1 - The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
T2 - 2024 Findings of the Association for Computational Linguistics: NAACL 2024
AU - Guo, Yanzhu
AU - Shang, Guokan
AU - Vazirgiannis, Michalis
AU - Clavel, Chloé
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
AB - This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
UR - https://www.scopus.com/pages/publications/85197872224
U2 - 10.18653/v1/2024.findings-naacl.228
DO - 10.18653/v1/2024.findings-naacl.228
M3 - Conference contribution
AN - SCOPUS:85197872224
T3 - Findings of the Association for Computational Linguistics: NAACL 2024
SP - 3589
EP - 3604
BT - Findings of the Association for Computational Linguistics: NAACL 2024
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 16 June 2024 through 21 June 2024
ER -