
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Research output: Chapter in a book, report, anthology or collection › Conference contribution › Peer-reviewed

Abstract

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, which is particularly pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
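To give a concrete sense of what a lexical-diversity metric looks like, the sketch below implements distinct-n, a common measure of the fraction of unique n-grams among all n-grams in a set of generated outputs. This is a generic illustration of the family of metrics the abstract refers to, not necessarily the exact formulation used in the paper.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of generated texts.

    Lower values indicate more repetitive (less lexically diverse) output.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Slide a window of size n over the token sequence.
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# A highly repetitive output scores low; varied outputs score high.
print(distinct_n(["a a a a"], n=2))                      # 1 unique bigram of 3
print(distinct_n(["the cat sat", "the dog ran"], n=1))   # 5 unique unigrams of 6
```

Applied at each iteration of recursive finetuning, a metric like this makes a declining-diversity trend directly measurable on the model's outputs.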

Original language: English
Title: Findings of the Association for Computational Linguistics
Subtitle: NAACL 2024 - Findings
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics (ACL)
Pages: 3589-3604
Number of pages: 16
ISBN (Electronic): 9798891761193
DOIs
Status: Published - 1 Jan 2024
Event: 2024 Findings of the Association for Computational Linguistics: NAACL 2024 - Hybrid, Mexico City, Mexico
Duration: 16 Jun 2024 – 21 Jun 2024

Publication series

Name: Findings of the Association for Computational Linguistics: NAACL 2024 - Findings

Conference

Conference: 2024 Findings of the Association for Computational Linguistics: NAACL 2024
Country/Territory: Mexico
City: Hybrid, Mexico City
Period: 16/06/24 – 21/06/24
