Differentially Private Representation Learning via Image Captioning

  • Tom Sander
  • Yaodong Yu
  • Maziar Sanjabi
  • Alain Durmus
  • Yi Ma
  • Kamalika Chaudhuri
  • Chuan Guo

Research output: Contribution to journal › Conference article › peer-review

Abstract

Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, and obtain unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of ε = 8 for the LAION dataset, a linear classifier trained on top of learned DP-Cap features attains 65.8% accuracy on ImageNet-1K, considerably improving on the previous SOTA of 56.5%. Our work challenges the prevailing sentiment that high-utility DP representation learning cannot be achieved by training from scratch. Code is available at https://github.com/facebookresearch/dpcap.
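DP training of this kind is built on DP-SGD: each example's gradient is clipped to a fixed L2 norm and Gaussian noise is added before the update, which bounds any single example's influence. The sketch below illustrates that mechanism on a toy linear model with synthetic data; it is a minimal illustration of the general DP-SGD recipe, not the paper's DP-Cap training code, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One illustrative DP-SGD step on a squared loss (hypothetical toy setup).

    Each per-example gradient is clipped to L2 norm `clip`; the clipped
    gradients are summed, Gaussian noise with std `noise_mult * clip` is
    added, and the noisy sum is averaged into the update.
    """
    rng = rng or np.random.default_rng(0)
    grads = []
    for xi, yi in zip(X, y):
        # Per-example gradient of the loss 0.5 * (w . x - y)^2
        g = (w @ xi - yi) * xi
        norm = np.linalg.norm(g)
        if norm > 0:
            g = g * min(1.0, clip / norm)  # clip to L2 norm `clip`
        grads.append(g)
    g_sum = np.sum(grads, axis=0)
    g_sum = g_sum + rng.normal(0.0, noise_mult * clip, size=g_sum.shape)
    return w - lr * g_sum / len(X)

# Toy usage: privately fit w toward [1, -1] on synthetic data.
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 2))
y = X @ np.array([1.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
```

The privacy budget ε quoted in the abstract would be obtained by accounting over all such noisy steps; this sketch omits the accountant and any captioning-specific architecture.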

Original language: English
Pages (from-to): 43255-43275
Number of pages: 21
Journal: Proceedings of Machine Learning Research
Volume: 235
Publication status: Published - 1 Jan 2024
Externally published: Yes
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 2024 - 27 Jul 2024
