TY - GEN
T1 - Conditional Cross Correlation Network for Video Question Answering
AU - Ouenniche, Kaouther
AU - Tapu, Ruxandra
AU - Zaharia, Titus
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Video question answering (VideoQA) is the task of responding to questions expressed in natural language according to the semantic content of a given video. VideoQA is highly challenging and demands a comprehensive understanding of the video document, including the recognition of the various objects, actions and activities involved, together with the spatial, temporal and causal relations between them. To tackle this challenge, most methods propose efficient techniques to fuse the representations of the visual and textual modalities. In this paper, we introduce a novel framework based on a conditional cross-correlation network that learns multimodal contextualization with reduced computational and memory requirements. At the core of our approach, we consider a cross-correlation module designed to learn reciprocally constrained visual/textual features, combined with a lightweight transformer that fuses the intermodal contextualization between the visual and textual modalities. We test the vulnerability of the components of our pipeline using black-box attacks; to this end, we automatically generate semantics-preserving rephrased questions. An ablation study confirms the importance of each module in the framework. The experimental evaluation, carried out on the MSVD-QA benchmark, validates the proposed methodology: our method achieves an average accuracy of 43.58%, outperforming state-of-the-art methods by more than 4%.
AB - Video question answering (VideoQA) is the task of responding to questions expressed in natural language according to the semantic content of a given video. VideoQA is highly challenging and demands a comprehensive understanding of the video document, including the recognition of the various objects, actions and activities involved, together with the spatial, temporal and causal relations between them. To tackle this challenge, most methods propose efficient techniques to fuse the representations of the visual and textual modalities. In this paper, we introduce a novel framework based on a conditional cross-correlation network that learns multimodal contextualization with reduced computational and memory requirements. At the core of our approach, we consider a cross-correlation module designed to learn reciprocally constrained visual/textual features, combined with a lightweight transformer that fuses the intermodal contextualization between the visual and textual modalities. We test the vulnerability of the components of our pipeline using black-box attacks; to this end, we automatically generate semantics-preserving rephrased questions. An ablation study confirms the importance of each module in the framework. The experimental evaluation, carried out on the MSVD-QA benchmark, validates the proposed methodology: our method achieves an average accuracy of 43.58%, outperforming state-of-the-art methods by more than 4%.
KW - cross-correlation
KW - multimodal learning
KW - video question answering
U2 - 10.1109/ICSC56153.2023.00011
DO - 10.1109/ICSC56153.2023.00011
M3 - Conference contribution
AN - SCOPUS:85151545660
T3 - Proceedings - 17th IEEE International Conference on Semantic Computing, ICSC 2023
SP - 25
EP - 32
BT - Proceedings - 17th IEEE International Conference on Semantic Computing, ICSC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on Semantic Computing, ICSC 2023
Y2 - 1 February 2023 through 3 February 2023
ER -