Leveraging ensemble deep models and LLM for visual polysemy and word sense disambiguation

  • Insaf Setitra
  • Praboda Rajapaksha
  • Aung Kaung Myat
  • Noel Crespi

Research output: Contribution to journal › Article › peer-review

Abstract

Visual Polysemy Disambiguation (VPD) and Visual Word Sense Disambiguation (VWSD) are challenging tasks for both computer vision and NLP, since an image can have diverse contextual interpretations, ranging from visual representations to abstract concepts. In this paper, we propose a novel approach to address the challenges of VPD and VWSD by leveraging ensemble deep models from computer vision to alleviate the problem of VPD and using the strength of LLMs to mitigate the problem of WSD. We first generate visually representative images from textual descriptions through a zero-shot text-to-image generation framework using image scraping and Google search. We then employ an ensemble of classifiers and a deep network to learn feature representations, classify images into contexts, and find the best match. Similarly, we perform the reverse process of generating textual descriptions from images using a Vision Transformer model and calculate the cosine distance with the actual text. Experimental evaluation on benchmark datasets demonstrates the effectiveness of our combined approach in strengthening both text-to-image and image-to-text generation. Our approach improves disambiguation accuracy, providing a robust solution for VWSD with an MRR of 95.77% and a hit rate of 92.00%, surpassing state-of-the-art methods.
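The caption-matching and evaluation steps mentioned in the abstract (cosine distance between generated and actual text, scored with MRR and hit rate) can be sketched as follows. This is a minimal illustration with toy vectors; the embedding models, datasets, and ensemble components of the paper are not reproduced here, and all function names are hypothetical.

```python
# Hypothetical sketch: cosine-similarity matching plus MRR / hit-rate scoring,
# as generically described in the abstract. Toy embeddings only.
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(query_vec, candidates):
    # candidates: {label: embedding}; return labels sorted by similarity, best first.
    return sorted(candidates, key=lambda c: cosine_similarity(query_vec, candidates[c]),
                  reverse=True)

def mean_reciprocal_rank(rankings, golds):
    # rankings: one ranked label list per query; golds: the correct label per query.
    return sum(1.0 / (ranking.index(gold) + 1)
               for ranking, gold in zip(rankings, golds)) / len(rankings)

def hit_rate(rankings, golds, k=1):
    # Fraction of queries whose gold label appears in the top-k of the ranking.
    return sum(1 for ranking, gold in zip(rankings, golds)
               if gold in ranking[:k]) / len(rankings)
```

For example, ranking two sense candidates against a caption embedding and scoring the result:

```python
caption = [0.9, 0.1]
senses = {"fruit": [1.0, 0.0], "company": [0.0, 1.0]}
ranking = rank_candidates(caption, senses)   # ["fruit", "company"]
mrr = mean_reciprocal_rank([ranking], ["fruit"])  # 1.0
```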

Original language: English
Pages (from-to): 35727-35759
Number of pages: 33
Journal: Multimedia Tools and Applications
Volume: 84
Issue number: 29
DOIs
Publication status: Published - 1 Sept 2025

Keywords

  • Classification
  • Multimodal representation
  • Text-to-image generation
  • ViT
  • Visual polysemy
  • Visual word sense disambiguation

