TY - JOUR
T1 - Leveraging ensemble deep models and LLM for visual polysemy and word sense disambiguation
AU - Setitra, Insaf
AU - Rajapaksha, Praboda
AU - Myat, Aung Kaung
AU - Crespi, Noel
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/9/1
Y1 - 2025/9/1
N2 - Visual Polysemy Disambiguation (VPD) and Visual Word Sense Disambiguation (VWSD) are challenging tasks for both computer vision and NLP, since an image can have diverse contextual interpretations, ranging from concrete visual representations to abstract concepts. In this paper, we propose a novel approach to the challenges of VPD and VWSD, leveraging ensemble deep models from computer vision to alleviate the problem of VPD and the strength of LLMs to mitigate the problem of VWSD. We first generate visually representative images from textual descriptions through a zero-shot text-to-image generation framework using image scraping and Google Search. We then employ an ensemble of classifiers and a deep network to learn feature representations, classify images into contexts, and find the best match. Similarly, we perform the reverse process of generating textual descriptions from images using a Vision Transformer model and calculate the cosine distance to the actual text. Experimental evaluation on benchmark datasets demonstrates the effectiveness of our combined approach in strengthening both text-to-image and image-to-text generation; it improves disambiguation accuracy and provides a robust solution for VWSD, with an MRR of 95.77% and a hit rate of 92.00%, surpassing state-of-the-art methods.
AB - Visual Polysemy Disambiguation (VPD) and Visual Word Sense Disambiguation (VWSD) are challenging tasks for both computer vision and NLP, since an image can have diverse contextual interpretations, ranging from concrete visual representations to abstract concepts. In this paper, we propose a novel approach to the challenges of VPD and VWSD, leveraging ensemble deep models from computer vision to alleviate the problem of VPD and the strength of LLMs to mitigate the problem of VWSD. We first generate visually representative images from textual descriptions through a zero-shot text-to-image generation framework using image scraping and Google Search. We then employ an ensemble of classifiers and a deep network to learn feature representations, classify images into contexts, and find the best match. Similarly, we perform the reverse process of generating textual descriptions from images using a Vision Transformer model and calculate the cosine distance to the actual text. Experimental evaluation on benchmark datasets demonstrates the effectiveness of our combined approach in strengthening both text-to-image and image-to-text generation; it improves disambiguation accuracy and provides a robust solution for VWSD, with an MRR of 95.77% and a hit rate of 92.00%, surpassing state-of-the-art methods.
KW - Classification
KW - Multimodal representation
KW - Text-to-image generation
KW - ViT
KW - Visual polysemy
KW - Visual word sense disambiguation
UR - https://www.scopus.com/pages/publications/85217156353
U2 - 10.1007/s11042-024-20235-6
DO - 10.1007/s11042-024-20235-6
M3 - Article
AN - SCOPUS:85217156353
SN - 1380-7501
VL - 84
SP - 35727
EP - 35759
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 29
ER -