The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

  • Giacomo Zara
  • , Alessandro Conti
  • , Subhankar Roy
  • , Stéphane Lathuilière
  • , Paolo Rota
  • , Elisa Ricci

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision"from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V 1 achieves significant improvement over state-of-the-art SFVUDA methods.

Original languageEnglish
Title of host publicationProceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages10273-10283
Number of pages11
ISBN (Electronic)9798350307184
DOIs
Publication statusPublished - 1 Jan 2023
Event2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 20236 Oct 2023

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499

Conference

Conference2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/TerritoryFrance
CityParis
Period2/10/236/10/23

Fingerprint

Dive into the research topics of 'The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation'. Together they form a unique fingerprint.

Cite this