TY - GEN
T1 - FunnyNet
T2 - 16th Asian Conference on Computer Vision, ACCV 2022
AU - Liu, Zhi Song
AU - Courant, Robin
AU - Kalogeiton, Vicky
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as facial expression, body language, dialogues and culture. In this paper, we propose FunnyNet, a model that relies on cross- and self-attention for both visual and audio data to predict funny moments in videos. Unlike most methods that focus on text with or without visual data to identify funny moments, in this work in addition to visual cues, we exploit audio. Audio comes naturally with videos, and moreover it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory cues to identify funny moments, while our findings corroborate our claim that audio is more suitable than text for funny moment prediction. FunnyNet sets the new state of the art for laughter detection with audiovisual or multimodal cues on all datasets.
AB - Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as facial expression, body language, dialogues and culture. In this paper, we propose FunnyNet, a model that relies on cross- and self-attention for both visual and audio data to predict funny moments in videos. Unlike most methods that focus on text with or without visual data to identify funny moments, in this work in addition to visual cues, we exploit audio. Audio comes naturally with videos, and moreover it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory cues to identify funny moments, while our findings corroborate our claim that audio is more suitable than text for funny moment prediction. FunnyNet sets the new state of the art for laughter detection with audiovisual or multimodal cues on all datasets.
UR - https://www.scopus.com/pages/publications/85151056331
U2 - 10.1007/978-3-031-26316-3_26
DO - 10.1007/978-3-031-26316-3_26
M3 - Conference contribution
AN - SCOPUS:85151056331
SN - 9783031263156
T3 - Lecture Notes in Computer Science
SP - 433
EP - 450
BT - Computer Vision – ACCV 2022 - 16th Asian Conference on Computer Vision, Proceedings
A2 - Wang, Lei
A2 - Gall, Juergen
A2 - Chin, Tat-Jun
A2 - Sato, Imari
A2 - Chellappa, Rama
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 4 December 2022 through 8 December 2022
ER -