TY - GEN
T1 - Character and subword-based word representation for neural language modeling prediction
AU - Labeau, Matthieu
AU - Allauzen, Alexandre
N1 - Publisher Copyright:
© EMNLP 2017. All rights reserved.
PY - 2017/1/1
Y1 - 2017/1/1
N2 - Most neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated with each word in the vocabulary or derived from characters as well as from a factored morphological decomposition, these word representations are mainly used to parametrize the input, i.e. the context of prediction. This work investigates the effect of using subword units (characters and a factored morphological decomposition) to build output representations for neural language modeling. We present a case study on Czech, a morphologically rich language, experimenting with different input and output representations. When working with the full training vocabulary, despite unstable training, our experiments show that augmenting the output word representations with character-based embeddings can significantly improve the performance of the model. Moreover, reducing the size of the output look-up table, so that the character-based embeddings represent rare words, brings further improvement.
M3 - Conference contribution
AN - SCOPUS:85093564563
T3 - EMNLP 2017 - 1st Workshop on Subword and Character Level Models in NLP, SCLeM 2017 - Proceedings of the Workshop
SP - 1
EP - 13
BT - EMNLP 2017 - 1st Workshop on Subword and Character Level Models in NLP, SCLeM 2017 - Proceedings of the Workshop
A2 - Faruqui, Manaal
A2 - Schütze, Hinrich
A2 - Trancoso, Isabel
A2 - Yaghoobzadeh, Yadollah
PB - Association for Computational Linguistics (ACL)
T2 - EMNLP 2017 1st Workshop on Subword and Character Level Models in NLP, SCLeM 2017
Y2 - 7 September 2017
ER -