CLIP

19 Oct 2022

CLIP: Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy and others, ICML 2021

Summary:

From my personal experience, the most misleading thing about the CLIP result demo is that it looks like a caption generation model, which it is not, even though it has the potential to be extended that way. The task CLIP is trained for is image classification. The goal of CLIP is to build a pre-trained model that generalizes well to all kinds of image datasets. The biggest strength of CLIP compared with other image classification models at the time lies in its ability to generalize to unseen datasets (zero-shot transfer), while others perform poorly on out-of-distribution images.

The intuition behind this success is that the model is no longer asked to blindly find a decision boundary over the image distribution and assign clusters to arbitrary one-hot text labels. Instead, the model is asked to find the correlation between image and text pairs, so that similar image classes end up with similar text embeddings as labels. A novel image from the test set can then still be assigned to a semantically close label, which has a better chance of being correct. For example, a zebra image looks similar to a horse image. Suppose we use arbitrary one-hot encodings, representing the zebra label as $[1,0,0,0,0,0]$, the turtle label as $[0,1,0,0,0,0]$, and the horse label as $[0,0,0,0,0,1]$. If the model makes only a small mistake on a novel zebra image and predicts $[0,1,0,0,0,0]$, it gets a completely unrelated result, turtle, because one-hot labels carry no semantic structure. With the co-trained image-text representation, however, the zebra and horse labels are likely to be close in embedding space (e.g. something like $[0.9, 0.1, 0.2]$ and $[0.8, 0.2, 0.1]$). Even if the model does not predict the zebra label exactly, it may still predict 'horse', which is at least somewhat reasonable. A tiny illustration of this point is sketched below.
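The snippet below is only an illustration of the label-representation argument above; the "embedding" vectors are made-up numbers, meant to show that one-hot labels are all equally far apart while semantically related labels can be close under a learned text embedding.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot labels: zebra, turtle, horse (positions 0, 1, 5 of a 6-way encoding)
zebra_1h, turtle_1h, horse_1h = np.eye(6)[[0, 1, 5]]
print(cosine(zebra_1h, horse_1h), cosine(zebra_1h, turtle_1h))   # 0.0 0.0

# Hypothetical text embeddings: zebra and horse end up close, turtle far away
zebra_e  = np.array([0.9, 0.1, 0.2])
horse_e  = np.array([0.8, 0.2, 0.1])
turtle_e = np.array([0.1, 0.9, 0.3])
print(cosine(zebra_e, horse_e), cosine(zebra_e, turtle_e))       # high vs. low
```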

The CLIP architecture contains several parts: an image encoder, a text encoder, and a contrastive image-text pair matrix. First, sample N image-text pairs from the dataset, then use the image encoder and text encoder to get N embeddings for the images and the captions separately. With N image embeddings and N corresponding text embeddings, we can form an N*N inner-product matrix, where element (i, j) is the similarity between the i-th image encoding and the j-th text encoding. The diagonal elements correspond to the correct image-text pairs, so we maximize them and minimize all the other pairings using a cross-entropy loss. With a large enough set of image-text pairs, both the image encoder and the text encoder converge to produce similar representations for similar semantics.
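Here is a minimal sketch of that symmetric contrastive objective in PyTorch, in the spirit of the pseudocode in the paper; the temperature value and the assumption that the two encoders already produce (N, d) feature tensors are placeholders, not the exact setup from the paper.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: (N, d) outputs of the two encoders
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of cosine similarities, scaled by a temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```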

During the inference phase, all the possible class labels of a test dataset are turned into photo-description sentences (e.g. "a photo of a {label}") and converted into text embeddings by the text encoder. A new image is then encoded by the image encoder, and its embedding is dot-producted with all the text label embeddings. The text label with the highest similarity is chosen as the classification label.
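A rough sketch of this zero-shot classification procedure, assuming the released openai/CLIP package (https://github.com/openai/CLIP) is installed; the prompt template, class names, and image path are just examples.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["zebra", "horse", "turtle"]
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image and every candidate label sentence
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)

print(class_names[similarity.argmax().item()])
```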

Strength:

Besides the better generalization ability on the results side, in my opinion this work also has significant meaning for the machine learning paradigm. It touches on an important fact about human learning: multi-modality. It is multi-modality that gives an agent embodied experience and an understanding of what the 'meaning' of an entity is. Those entity notions cannot be learned from a single modality, since a single modality only provides 'correlation' rather than 'interpretation'. The latter, however, is what the most common CV tasks ask machines to do.

Critique:

It is obvious to me how the CLIP model could be generalized to the image captioning task: add a text decoder on top of the image representation so that we can directly generate captions. Even though the model is a good start, I still believe more advanced reasoning tasks require an embodied self-model. The text-image co-learning binding is still too shallow for such tasks.