Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses self-supervised learning of general-purpose audio representations. We explore Joint-Embedding Predictive Architectures (JEPA) for this task: an input mel-spectrogram is split into two parts (context and target), neural representations are computed for each, and the network is trained to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments, evaluating our models on various audio classification benchmarks that cover environmental sound, speech, and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that this choice significantly impacts the model's quality. In particular, we observe that some design choices that are effective in the image domain lead to poor performance on audio, highlighting major differences between the two modalities.
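The training objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the patch size, the unstructured random context/target split, the single linear-plus-tanh "encoders", the pooled linear predictor, and all function names are assumptions chosen only to keep the example short. The momentum (EMA) update of the target encoder follows the "Momentum Encoder" keyword; its coefficient here is likewise an assumed value.

```python
import numpy as np

def patchify(spec, patch_f, patch_t):
    """Cut an (n_mels, n_frames) spectrogram into flattened, non-overlapping patches."""
    n_mels, n_frames = spec.shape
    gf, gt = n_mels // patch_f, n_frames // patch_t
    spec = spec[: gf * patch_f, : gt * patch_t]  # drop any remainder
    patches = spec.reshape(gf, patch_f, gt, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(gf * gt, patch_f * patch_t)

def split_context_target(n_patches, target_ratio, rng):
    """Randomly assign patch indices to context or target (one possible strategy)."""
    perm = rng.permutation(n_patches)
    n_target = int(round(target_ratio * n_patches))
    return np.sort(perm[n_target:]), np.sort(perm[:n_target])

def ema_update(target_w, context_w, momentum):
    """Momentum update: target weights slowly track the context weights."""
    return momentum * target_w + (1.0 - momentum) * context_w

def jepa_loss(patches, ctx_idx, tgt_idx, w_ctx, w_tgt, w_pred):
    """Mean squared error between predicted and actual target representations."""
    ctx_repr = np.tanh(patches[ctx_idx] @ w_ctx)  # (n_ctx, d)
    tgt_repr = np.tanh(patches[tgt_idx] @ w_tgt)  # (n_tgt, d), no gradient in practice
    # Toy predictor: pool the context and emit one prediction shared by all targets.
    # A realistic predictor would also condition on each target's position.
    pred = ctx_repr.mean(axis=0) @ w_pred  # (d,)
    return float(np.mean((pred - tgt_repr) ** 2))

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 208))        # stand-in for a mel-spectrogram
patches = patchify(spec, 16, 16)             # 5 x 13 grid of 16 x 16 patches
ctx_idx, tgt_idx = split_context_target(len(patches), 0.25, rng)

dim = 32                                     # assumed representation size
w_ctx = rng.standard_normal((patches.shape[1], dim)) * 0.05
w_tgt = w_ctx.copy()                         # target encoder starts as a copy
w_pred = rng.standard_normal((dim, dim)) * 0.05

loss = jepa_loss(patches, ctx_idx, tgt_idx, w_ctx, w_tgt, w_pred)
w_tgt = ema_update(w_tgt, w_ctx, momentum=0.99)
print(f"{len(ctx_idx)} context patches, {len(tgt_idx)} target patches, loss={loss:.4f}")
```

In an actual training loop the loss would be minimised with respect to the context encoder and predictor only, with the target encoder updated through `ema_update` after each step. Swapping `split_context_target` for another strategy is where the context/target design choice studied in the paper would enter.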

Original language: English
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 680-684
Number of pages: 5
ISBN (Electronic): 9798350374513
Publication status: Published - 1 Jan 2024
Event: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Publication series

Name: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

Conference

Conference: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 14/04/24 to 19/04/24

Keywords

  • Audio Representation Learning
  • Joint-Embedding Predictive Architecture
  • Masked Image Modeling
  • Momentum Encoder
  • Self-supervised learning
