Conditional Cross Correlation Network for Video Question Answering

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Video question answering (VideoQA) is the task of answering questions expressed in natural language according to the semantic content of a given video. VideoQA is highly challenging and demands a comprehensive understanding of the video document, including the recognition of the various objects, actions and activities involved, together with the spatial, temporal and causal relations between them. To tackle this challenge, most methods propose efficient techniques to fuse the representations of the visual and textual modalities. In this paper, we introduce a novel framework based on a conditional cross-correlation network that learns multimodal contextualization with reduced computational and memory requirements. At the core of our approach is a cross-correlation module designed to learn reciprocally constrained visual/textual features, combined with a lightweight transformer that fuses the intermodal contextualization between the visual and textual modalities. We test the vulnerability of the components of our pipeline using black-box attacks. For this purpose, we automatically generate semantics-preserving rephrased questions. The ablation study confirms the importance of each module in the framework. The experimental evaluation, carried out on the MSVD-QA benchmark, validates the proposed methodology with an average accuracy of 43.58%. Compared with state-of-the-art methods, the proposed method yields accuracy gains of more than 4%.

Original language: English
Title of host publication: Proceedings - 17th IEEE International Conference on Semantic Computing, ICSC 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 25-32
Number of pages: 8
ISBN (Electronic): 9781665482639
Publication status: Published - 1 Jan 2023
Event: 17th IEEE International Conference on Semantic Computing, ICSC 2023 - Virtual, Online, United States
Duration: 1 Feb 2023 to 3 Feb 2023

Publication series

Name: Proceedings - 17th IEEE International Conference on Semantic Computing, ICSC 2023

Conference

Conference: 17th IEEE International Conference on Semantic Computing, ICSC 2023
Country/Territory: United States
City: Virtual, Online
Period: 1/02/23 to 3/02/23

Keywords

  • cross-correlation
  • multimodal learning
  • video question answering
