
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

  • Sanghwan Kim
  • Rui Xiao
  • Mariana Iuliana Georgescu
  • Stephan Alaniz
  • Zeynep Akata

Research output: Contribution to journal › Conference article › Peer-reviewed

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
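
The abstract outlines the method at a high level: global and local views of images and texts, a cross-attention module, and a cross-modality self-distillation loss. As a rough illustration of that idea, here is a minimal PyTorch sketch assuming a student/teacher (EMA) setup; all module names, hyperparameters, and design details below are illustrative assumptions and are not taken from the official COSMOS code at the linked repository.

```python
# Minimal sketch of cross-modality self-distillation over global/local views.
# Assumption: a DINO-style student/teacher pair; names are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """Attend from one modality's tokens into the other's, then pool."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query, context, context)  # (B, Lq, D)
        return self.proj(out.mean(dim=1))            # pooled cross-modal embedding (B, D)

def distill_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Soft cross-entropy between teacher (detached) and student distributions."""
    targets = F.softmax(teacher_logits.detach() / t_t, dim=-1)
    log_probs = F.log_softmax(student_logits / t_s, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996):
    """Teacher parameters track an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Toy usage: random token features stand in for ViT / text-encoder outputs.
B, D = 4, 256
student_head = CrossAttentionHead(D)
teacher_head = copy.deepcopy(student_head)
for p in teacher_head.parameters():
    p.requires_grad_(False)

img_global, txt_global = torch.randn(B, 49, D), torch.randn(B, 16, D)  # global views
img_local,  txt_local  = torch.randn(B, 25, D), torch.randn(B, 8, D)   # cropped views

# Student embeds local (cropped) views, teacher embeds global views; the loss
# pulls the student's local cross-modal embeddings toward the teacher's global ones.
s_emb = student_head(img_local, txt_local)
t_emb = teacher_head(img_global, txt_global)
loss = distill_loss(s_emb, t_emb)
loss.backward()
ema_update(teacher_head, student_head)
print(f"cross-modality self-distillation loss: {loss.item():.4f}")
```

The design choice sketched here is that only the student receives gradients while the teacher is updated by EMA, which is the standard way self-distillation avoids collapse; how COSMOS combines this with its text-cropping strategy and contrastive objectives is detailed in the paper and repository.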

Original language: English
Pages (from-to): 14690-14700
Number of pages: 11
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOIs
Publication status: Published - 1 Jan 2025
Externally published: Yes
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: 11 June 2025 → 15 June 2025
