DALL-E

01 Nov 2022

DALL-E: Zero-Shot Text-to-Image Generation

Ramesh, Aditya; Pavlov, Mikhail; Goh, Gabriel; Gray, Scott; and others. PMLR 2021

Summary:

DALL-E is a 12-billion-parameter model for generating images from a text description. It is trained on 250 million image-text pairs collected from the internet. The question this paper is trying to explore is: ‘Could dataset size and model size be the limiting factor of current approaches?’ The answer seems to be ‘yes’ based on their results, since DALL-E achieves a significant improvement compared with previous methods. As I understand it, the design intuition of DALL-E mainly comes from two parts:

  1. Transformer/BERT/GPT-3 have already shown that, given a fixed-size input sequence, a model can return a semantically related output sequence. This generalizes to the text-to-image task by taking a text sequence as input and generating a sequence of image pixels (or image tokens) conditioned on the text semantics.

  2. The self-attention mechanism used in the Transformer family has O(n^2) space complexity, where n is the length of the input sequence. For image data, the sequence length easily eats up all available memory, since an image has a much larger footprint than text. To make image data tractable, the goal is to compress the image into a shorter sequence of image tokens in a way that preserves as much of the original image information as possible, rather than doing naive downsampling. With this token representation, the image can be compressed without too much information loss (see the back-of-the-envelope comparison after this list).
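
To make the O(n^2) concern concrete, here is a rough back-of-the-envelope comparison (a sketch, not from the paper; the 256×256 input and 32×32 token grid come from the blog post, while fp32 and a single attention head are my simplifying assumptions):

```python
# Rough size of the self-attention score matrix (one head, fp32)
# if every raw pixel were a token vs. the 32x32 dVAE token grid.

def attention_matrix_bytes(seq_len: int, bytes_per_elem: int = 4) -> int:
    # Self-attention materializes a (seq_len x seq_len) score matrix.
    return seq_len * seq_len * bytes_per_elem

raw_pixels = 256 * 256    # 65,536 positions if every pixel were a token
dvae_tokens = 32 * 32     # 1,024 image tokens after dVAE compression

print(f"raw pixels : {attention_matrix_bytes(raw_pixels) / 1e9:.1f} GB per layer/head")
print(f"dVAE tokens: {attention_matrix_bytes(dvae_tokens) / 1e6:.1f} MB per layer/head")
# ~17.2 GB vs. ~4.2 MB: a 4096x reduction, which is why the image must be
# tokenized before the Transformer ever sees it.
```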

To implement these goals, the paper splits the training into two stages:

  1. Train a dVAE to compress the original image into a much smaller encoding. In the blog they mention it is pretrained to compress the original 256×256 image into a 32×32 grid of image tokens. The VAE they developed here is called a discrete VAE (dVAE). The idea is similar to VQ-VAE, where the encoding space is a set of discrete variables rather than an arbitrary continuous space. The difference is that the dVAE uses a weighted sum of the codebook vectors rather than finding the closest neighbor as VQ-VAE does. Since the source code for this part is unreleased, based on the released Transformer part and some hints in the blog, my guess is that the dVAE is not involved in the DALL-E training process; it is more likely reused from another pre-trained model (see the code sketch after this list).

  2. A GPT-3-style Transformer takes the image caption as input and autoregressively predicts the image tokens, which are concatenated after the text tokens. The output of the Transformer at each step is a distribution over the codebook entries. After all the image tokens of the inferred image have been predicted autoregressively, the token grid is decoded by the dVAE's CNN decoder to recover a full-resolution image.
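
Since the dVAE code is not released, the following PyTorch sketch only shows how I understand the two stages fitting together. The 8192-entry image codebook, the 256×256 → 32×32 compression, and the text/image vocabulary split follow the paper and blog; everything else (layer sizes, module names such as `DVAE` and `DallePrior`, the use of `F.gumbel_softmax` for the weighted sum) is my own toy assumption, nowhere near the real 12-billion-parameter model with its deep conv stacks and sparse attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_VOCAB, TXT_VOCAB, D = 8192, 16384, 512   # vocab sizes per the paper; toy width

# ---- Stage 1: dVAE compresses a 256x256 image into a 32x32 grid of tokens ----
class DVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy single-layer encoder/decoder; the real model uses deep conv stacks.
        self.encoder = nn.Conv2d(3, IMG_VOCAB, kernel_size=8, stride=8)    # 256 -> 32
        self.codebook = nn.Embedding(IMG_VOCAB, D)
        self.decoder = nn.ConvTranspose2d(D, 3, kernel_size=8, stride=8)   # 32 -> 256

    def forward(self, img, tau=1.0):
        logits = self.encoder(img)                          # (B, 8192, 32, 32)
        # Gumbel-softmax relaxation: a *weighted sum* over all codebook vectors,
        # instead of VQ-VAE's hard nearest-neighbour lookup.
        w = F.gumbel_softmax(logits, tau=tau, dim=1)        # soft one-hot weights
        z = torch.einsum("bkhw,kd->bdhw", w, self.codebook.weight)
        return self.decoder(z)                              # reconstruction target

    def tokenize(self, img):
        # For stage 2 the relaxation is dropped: tokens are just argmax ids.
        return self.encoder(img).argmax(dim=1).flatten(1)   # (B, 1024)

# ---- Stage 2: autoregressive Transformer over [text tokens ; image tokens] ----
class DallePrior(nn.Module):
    def __init__(self, n_layer=2, n_head=8, max_len=256 + 1024):
        super().__init__()
        self.tok = nn.Embedding(TXT_VOCAB + IMG_VOCAB, D)   # shared token space
        self.pos = nn.Embedding(max_len, D)
        layer = nn.TransformerEncoderLayer(D, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(D, TXT_VOCAB + IMG_VOCAB)

    def forward(self, seq):
        # seq = caption ids followed by image token ids (offset by TXT_VOCAB)
        T = seq.shape[1]
        x = self.tok(seq) + self.pos(torch.arange(T, device=seq.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=seq.device), 1)
        return self.head(self.blocks(x, mask=causal))       # next-token logits
```

Training the prior amounts to cross-entropy between `logits[:, :-1]` and `seq[:, 1:]`, so the model learns to continue the caption with the 1024 image tokens; at generation time the caption is fed in, the image tokens are sampled one by one, and the dVAE decoder maps them back to pixels.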

Strengths:

The codebook could be reused to compress images into a lower-dimensional token sequence, which is crucial for making self-attention tractable. Besides, the idea of tokenizing text and images together and co-training on the joint sequence is insightful.

Critiques:

Presumably for business reasons, there are not many details in either the paper or the code about the dVAE part, which makes it difficult to reuse the image token codebook. However, this made me rethink the purpose of the codebook here. In the VQ-VAE work, the codebook is directly tied to the encoding space of the VAE. In their dVAE, the final encoding is a weighted combination over the codebook, so the encoding space is still continuous, which makes me doubt the purpose of the codebook. In their post, they also mention that ‘training using the continuous relaxation obviates the need for an explicit codebook’. In that case, any autoencoder structure could fit this problem, and the question becomes why this particular design is better. I strongly suspect this part is reused from other work without considering the task as a global optimization problem.
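
To make the concern concrete, here is a minimal, purely illustrative contrast between the two quantization schemes (sizes and tensors are made up): VQ-VAE commits to exactly one codebook vector per position, while the dVAE's relaxation mixes all of them, so the resulting latent is still a point in continuous space.

```python
import torch
import torch.nn.functional as F

codebook = torch.randn(8192, 512)        # (K, D), illustrative sizes
z_e = torch.randn(1024, 512)             # VQ-VAE encoder output, one row per position

# VQ-VAE: hard nearest-neighbour lookup -> exactly one codebook vector per position
nearest = torch.cdist(z_e, codebook).argmin(dim=-1)    # (1024,) discrete ids
z_vq = codebook[nearest]                               # truly discrete latent

# dVAE: the encoder emits logits over the codebook; Gumbel-softmax gives weights
logits = torch.randn(1024, 8192)
w = F.gumbel_softmax(logits, tau=1.0, dim=-1)          # rows sum to 1
z_dvae = w @ codebook                                  # convex mix of *all* vectors
# z_dvae only collapses onto a single codebook entry as tau -> 0 (or with hard=True),
# which is exactly why the codebook feels optional under the continuous relaxation.
```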

Also, based on the DALL-E structure, the Transformer part seems to map directly onto the CNN encoder block they got from the dVAE. I have enough reason to suspect that the goal of the Transformer is to reproduce, given only the caption, the same autoregressive token output that the dVAE's CNN encoder produces given the original image. If so, the key to DALL-E does not lie in the Transformer but in the dVAE pre-training process, which is totally opaque in the paper, and the actual training process of the Transformer may not behave as they argue in the paper.

Important resources for understanding DALL-E

Original Blog Link:

https://openai.com/blog/dall-e/

Very Nice Walkthrough Video by Yannic Kilcher:

https://www.youtube.com/watch?v=j4xgkjWlfL4&ab_channel=YannicKilcher

Code: Official OpenAI partial code (tells you almost nothing):

https://github.com/openai/DALL-E

Community attempts to read the paper and reproduce it independently (useful):

https://github.com/lucidrains/DALLE-pytorch

Details: a well-illustrated paper for understanding the ‘Gumbel-Softmax relaxation’:

https://arxiv.org/pdf/1611.00712.pdf