Abstract
Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high-resource language, and Xitsonga, a low-resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method supplemented with length information can be competitive with the variable-length input feature representation on both evaluations. Recurrent autoencoders trained without supervision can yield even better results at the expense of increased computational complexity.
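The downsampling idea mentioned in the abstract can be sketched as follows: sample a fixed number of frames from a variable-length feature sequence, flatten them, and append the original sequence length as an extra dimension. This is a minimal illustrative sketch, not the paper's exact implementation; the function name, the choice of linear interpolation, and the number of sampled frames are assumptions.

```python
import numpy as np

def downsample_embed(features: np.ndarray, n: int = 10) -> np.ndarray:
    """Map a (T, d) feature sequence to a fixed-length vector.

    Hypothetical sketch: take n evenly spaced frames via linear
    interpolation along the time axis, flatten them, and append the
    original frame count T as length information.
    """
    T, d = features.shape
    # Evenly spaced (fractional) positions along the time axis.
    positions = np.linspace(0, T - 1, n)
    # Linearly interpolate each feature dimension at those positions.
    sampled = np.stack(
        [np.interp(positions, np.arange(T), features[:, j]) for j in range(d)],
        axis=1,
    )  # shape (n, d)
    # Flatten and append the sequence length as the final dimension.
    return np.concatenate([sampled.ravel(), [float(T)]])

# Sequences of any length map to the same n*d + 1 dimensions, so a
# 23-frame, 13-dimensional input yields a 131-dimensional embedding.
emb = downsample_embed(np.random.randn(23, 13), n=10)
```

Because every input yields a vector of identical size, embeddings of different-length words can be compared directly with cosine or Euclidean distance, as required by the ABX and same-different evaluations.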
| Original language | English |
|---|---|
| Pages (from-to) | 2683-2687 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2018-September |
| DOIs | |
| Publication status | Published - 1 Jan 2018 |
| Externally published | Yes |
| Event | 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India Duration: 2 Sept 2018 → 6 Sept 2018 |
Keywords
- ABX discrimination
- Audio word embeddings
- Representation learning
- Same-different classification
- Unsupervised speech processing