
SA-CLIP: Language Guided Image Spatial and Action Feature Learning

Research output: Chapter in a book, report, anthology or collection › Conference contribution › Peer-reviewed

Abstract

We observed that Contrastive Language-Image Pretraining (CLIP) models struggle with real-world downstream tasks such as road traffic anomaly detection, because they fail to effectively capture spatial and action relationships between objects within images. To address this, we propose a dependency-parsing-based method to compile and curate a dataset of 1M image samples, using the language supervision provided by a common image-caption dataset, in which each image is paired with subject-relationship-object descriptions emphasizing spatial and action interactions, and we train a Spatial and Action relationship aware CLIP (SA-CLIP) model on it. We evaluated the proposed model on the Visual Spatial Reasoning (VSR) dataset and further verified its effectiveness on the Detection-of-Traffic-Anomaly (DoTA) dataset. Experiment results show that the proposed SA-CLIP demonstrates strong abilities in understanding spatial relationships while achieving good zero-shot performance on the traffic anomaly detection task.
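The abstract describes extracting subject-relationship-object triples from captions via dependency parsing. The paper's actual pipeline is not reproduced here; the following is a minimal illustrative sketch under assumed conventions: tokens are hand-coded `(index, word, dep_label, head_index)` tuples mimicking a dependency parser's output (Universal-Dependencies-style labels), and the set of spatial prepositions is an invented placeholder, not the authors' list.

```python
# Illustrative sketch only: extract subject-relation-object triples from a
# dependency parse. The token format and the preposition filter below are
# assumptions for this example, not the paper's actual method.

SPATIAL_PREPS = {"on", "under", "behind", "in", "near", "above", "beside"}

def extract_triples(tokens):
    """tokens: list of (index, word, dep_label, head_index) tuples,
    mimicking a dependency parser's output (ROOT points to itself)."""
    by_head = {}
    for i, word, dep, head in tokens:
        by_head.setdefault(head, []).append((i, word, dep))

    triples = []
    for i, word, dep, head in tokens:
        if dep == "ROOT":
            # Action triple: main verb with its nominal subject and object.
            subj = next((w for _, w, d in by_head.get(i, []) if d == "nsubj"), None)
            obj = next((w for _, w, d in by_head.get(i, []) if d in ("dobj", "obj")), None)
            if subj and obj:
                triples.append((subj, word, obj))
        elif dep == "prep" and word in SPATIAL_PREPS:
            # Spatial triple: subject of the governing verb + preposition + its object.
            pobj = next((w for _, w, d in by_head.get(i, []) if d == "pobj"), None)
            subj = next((w for _, w, d in by_head.get(head, []) if d == "nsubj"), None)
            if subj and pobj:
                triples.append((subj, word, pobj))
    return triples

# "A car stops behind the truck" (parse hand-coded for illustration)
tokens = [
    (0, "car", "nsubj", 1),
    (1, "stops", "ROOT", 1),
    (2, "behind", "prep", 1),
    (3, "truck", "pobj", 2),
]
print(extract_triples(tokens))  # [('car', 'behind', 'truck')]
```

In a real pipeline the hand-coded tuples would come from an off-the-shelf parser, and each extracted triple would be rendered as a short caption ("car behind truck") to pair with its image for contrastive training.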

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 20808-20814
Number of pages: 7
ISBN (electronic): 9798891763357
DOIs
Publication status: Published - 1 Jan. 2025
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov. 2025 - 9 Nov. 2025

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/Territory: China
City: Suzhou
Period: 4/11/25 - 9/11/25
