TY - GEN
T1 - DeViL: Decoding Vision features into Language
T2 - 45th Annual Conference of the German Association for Pattern Recognition, DAGM-GCPR 2023
AU - Dani, Meghal
AU - Rio-Torto, Isabel
AU - Alaniz, Stephan
AU - Akata, Zeynep
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we provide natural language descriptions for what different layers of a vision backbone have learned. Our DeViL method generates textual descriptions of visual features at different layers of the network as well as highlights the attribution locations of learned concepts. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train and can be applied to any vision backbone. Moreover, DeViL can create open-vocabulary attribution maps corresponding to words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models, and produces attribution maps that uncover the learned concepts of the vision backbone. Further, we analyze fine-grained descriptions of layers as well as specific spatial locations and show that DeViL outperforms the current state of the art on neuron-wise descriptions of the MILANNOTATIONS dataset.
AB - Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we provide natural language descriptions for what different layers of a vision backbone have learned. Our DeViL method generates textual descriptions of visual features at different layers of the network as well as highlights the attribution locations of learned concepts. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train and can be applied to any vision backbone. Moreover, DeViL can create open-vocabulary attribution maps corresponding to words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models, and produces attribution maps that uncover the learned concepts of the vision backbone. Further, we analyze fine-grained descriptions of layers as well as specific spatial locations and show that DeViL outperforms the current state of the art on neuron-wise descriptions of the MILANNOTATIONS dataset.
KW - Explainable AI
KW - Natural Language Explanations
KW - Open-vocabulary Saliency
KW - Vision-Language Models
UR - https://www.scopus.com/pages/publications/85189509347
U2 - 10.1007/978-3-031-54605-1_24
DO - 10.1007/978-3-031-54605-1_24
M3 - Conference contribution
AN - SCOPUS:85189509347
SN - 9783031546044
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 363
EP - 377
BT - Pattern Recognition - 45th DAGM German Conference, DAGM GCPR 2023, Proceedings
A2 - Köthe, Ullrich
A2 - Rother, Carsten
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 19 September 2023 through 22 September 2023
ER -