Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Research output: Contribution to journal › Conference article › peer-review

Abstract

Learning meaningful representations of complex objects that can be seen through multiple (k ≥ 3) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to k views, either by instantiating ½k(k−1) loss-pairs, or by using reduced embeddings, following a one vs. average-of-rest strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all k views. Given a batch of n points, each seen as a k-tuple of views subsequently transformed into k embeddings, our loss contrasts the cost of matching these n ground-truth k-tuples with the MM-OT polymatching cost, which seeks n optimally arranged k-tuples chosen within these n×k vectors. While the exponential complexity O(n^k) of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., k = 3 ∼ 6 views using mini-batches of size 64 ∼ 128. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.
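The abstract refers to a generalization of the Sinkhorn algorithm to the multi-marginal setting. Below is a minimal log-domain sketch of such an entropic multi-marginal Sinkhorn solver with uniform marginals, plus an illustrative "gap" that contrasts the ground-truth diagonal matching cost with the transport cost of the optimal coupling. This is not the authors' implementation; the names `mm_sinkhorn` and `m3g_gap`, the choice of `eps`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp over the given axes."""
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis)

def mm_sinkhorn(C, eps=0.5, n_iters=300):
    """Entropic multi-marginal Sinkhorn with uniform marginals (a sketch).

    C is a cost tensor of shape (n,) * k. Returns the k dual potentials
    and the coupling tensor P of the same shape as C. Updates are done
    in the log domain for numerical stability.
    """
    n, k = C.shape[0], C.ndim
    log_a = np.full(n, -np.log(n))        # log of the uniform marginal 1/n
    f = [np.zeros(n) for _ in range(k)]   # one dual potential per marginal

    def scores():
        # (f_1 (+) ... (+) f_k - C) / eps, broadcast over the full tensor
        S = -C.astype(float)
        for j in range(k):
            shape = [1] * k
            shape[j] = n
            S = S + f[j].reshape(shape)
        return S / eps

    for _ in range(n_iters):
        for i in range(k):                # cyclic coordinate-wise updates
            axes = tuple(j for j in range(k) if j != i)
            f[i] += eps * (log_a - logsumexp(scores(), axis=axes))

    P = np.exp(scores())                  # optimal entropic coupling
    return f, P

def m3g_gap(C, P):
    """Illustrative gap: mean cost of the ground-truth (diagonal) k-tuples
    minus the transport cost of the optimal coupling."""
    n, k = C.shape[0], C.ndim
    idx = (np.arange(n),) * k
    return C[idx].mean() - (P * C).sum()
```

Note the O(n^k) memory and per-iteration cost of holding the full score tensor, which is why the paper reports scaling only to k = 3 ∼ 6 with mini-batches of 64 ∼ 128.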

Original language: English
Pages (from-to): 40827-40842
Number of pages: 16
Journal: Proceedings of Machine Learning Research
Volume: 235
Publication status: Published - 1 Jan 2024
Externally published: Yes
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 2024 – 27 Jul 2024

