Vision-Language Pretraining: Current Trends and the Future

An ACL 2022 tutorial by Aishwarya Agrawal (DeepMind, University of Montreal, Mila), Damien Teney (Idiap Research Institute), and Aida Nematzadeh (DeepMind).

• Part 1: Vision-language landscape before the pretraining era.
• Part 2: Modern vision-language pretraining.
• Part 3: Beyond statistical learning.

The goal of this tutorial is to give an overview of the ingredients needed for working on multimodal problems, particularly vision and language. We also discuss some of the open problems and promising future directions in this area.

In the last few years, there has been increased interest in building multimodal (vision-language) models that are pretrained on larger but noisier datasets in which the two modalities (e.g., image and text) only loosely correspond to each other (e.g., ViLBERT and CLIP). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets. In addition to the larger pretraining datasets, the transformer architecture, and in particular self-attention applied to two modalities, is also responsible for the impressive performance of recent pretrained models on downstream tasks. This approach is appealing for a few reasons. First, the pretraining datasets are often automatically curated from the Web, providing huge datasets with negligible collection costs. Second, we can train large models once and reuse them for various tasks. Finally, these pretraining approaches perform better than or on par with previous task-specific models. An interesting question is whether these pretrained models, in addition to their good task performance, learn representations that are better at capturing the alignments between the two modalities.

In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image-language datasets, benchmarks, and modeling innovations before the multimodal pretraining era. Next, we discuss the different families of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal modeling.

Recordings of the tutorial will soon be available through ACL.