TY - GEN
T1 - Learning from Multiview Correlations in Open-domain Videos
AU - Holzenberger, Nils
AU - Palaskar, Shruti
AU - Madhyastha, Pranava
AU - Metze, Florian
AU - Arora, Raman
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5/1
Y1 - 2019/5/1
AB - An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further complicated by the existence of a latent alignment between views, such as between speech and its transcription, and by the multitude of choices for the learning objective. We explore an advanced, correlation-based representation learning method on a 4-way parallel, multimodal dataset, and assess the quality of the learned representations on retrieval-based tasks. We show that the proposed approach produces rich representations that capture most of the information shared across views. Our best models for speech and textual modalities achieve retrieval rates from 70.7% to 96.9% on open-domain, user-generated instructional videos. This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.
KW - Canonical Correlation Analysis
KW - Multiview Learning
KW - Representation Learning
UR - https://www.scopus.com/pages/publications/85068971629
U2 - 10.1109/ICASSP.2019.8683540
DO - 10.1109/ICASSP.2019.8683540
M3 - Conference contribution
AN - SCOPUS:85068971629
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8628
EP - 8632
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -