Learning visual voice activity detection with an automatically annotated dataset

  • Sylvain Guy
  • Stéphane Lathuilière
  • Pablo Mesejo
  • Radu Horaud

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the available datasets used for learning and testing V-VAD lack content variability. We introduce a novel methodology to automatically create and annotate very large in-the-wild datasets (WildVVAD) by combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
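The dataset-creation step described above can be made concrete. Below is a minimal Python sketch of the labelling idea: a short clip is kept only if exactly one face is visible throughout, and it is then labelled "speaking" or "not speaking" by running an audio VAD on the soundtrack. The choice of WebRTC VAD and OpenCV's Haar-cascade detector, the thresholds, and the helper names (`audio_is_speech`, `label_segment`) are illustrative assumptions, not the WildVVAD pipeline itself (which additionally uses face tracking).

```python
# Hypothetical sketch of automatic V-VAD annotation: combine an audio
# VAD with per-frame face detection to label single-face clips.
# Library choices and thresholds are assumptions, not the authors' code.
import cv2
import webrtcvad


def audio_is_speech(pcm_frames, sample_rate=16000, aggressiveness=3):
    """Fraction of PCM frames flagged as speech by WebRTC VAD.

    Each frame must be 10, 20, or 30 ms of 16-bit mono PCM audio,
    as required by webrtcvad.
    """
    vad = webrtcvad.Vad(aggressiveness)
    hits = [vad.is_speech(frame, sample_rate) for frame in pcm_frames]
    return sum(hits) / max(len(hits), 1)


def label_segment(video_path, pcm_frames):
    """Return 'speaking', 'not_speaking', or None for one short clip."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    face_counts = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        face_counts.append(len(detector.detectMultiScale(gray)))
    cap.release()

    # Keep only clips where a single face is visible in every frame,
    # so the audio label can be attributed to that face.
    if not face_counts or any(count != 1 for count in face_counts):
        return None

    speech_ratio = audio_is_speech(pcm_frames)
    if speech_ratio > 0.9:   # confident speech -> positive sample
        return "speaking"
    if speech_ratio < 0.1:   # confident silence -> negative sample
        return "not_speaking"
    return None              # ambiguous clip; discard
```

Discarding ambiguous clips (the `None` branch) trades dataset size for label reliability, which is the point of automatic annotation at scale: with enough raw video, only confidently labelled segments need to be kept.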

Original language: English
Title of host publication: Proceedings of ICPR 2020 - 25th International Conference on Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4851-4856
Number of pages: 6
ISBN (Electronic): 9781728188089
Publication status: Published - 1 Jan 2020
Event: 25th International Conference on Pattern Recognition, ICPR 2020 - Virtual, Milan, Italy
Duration: 10 Jan 2021 - 15 Jan 2021

Publication series

Name: Proceedings - International Conference on Pattern Recognition
ISSN (Print): 1051-4651

Conference

Conference: 25th International Conference on Pattern Recognition, ICPR 2020
Country/Territory: Italy
City: Virtual, Milan
Period: 10/01/21 - 15/01/21
