ImageGPT

17 Oct 2022

Image GPT: Generative Pretraining from Pixels

Mark Chen, Alec Radford, Ilya Sutskever, and others. ICML 2020

Summary:

This paper presents experiments that apply Transformer-based sequence models (BERT, GPT) to the image domain and demonstrates state-of-the-art results on several image benchmarks. The pre-training objective of ImageGPT is to predict the next pixel (GPT-style) or masked pixels (BERT-style) given the other pixels of the image. Since previous reviews have already covered the architectural details of the Transformer, BERT, and the GPTs, I will mainly focus on the manipulations and design choices needed for the transition to images. Compared with a language model, the main differences are the following:
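As a rough illustration, here is a minimal sketch (not the paper's released code) of the GPT-style autoregressive objective on a color-quantized pixel sequence; `model` is assumed to be any decoder-only Transformer that returns per-position logits over the 512 color clusters.

```python
import torch
import torch.nn.functional as F

def next_pixel_loss(model, pixel_ids):
    """GPT-style objective: predict each pixel token from the ones before it.

    pixel_ids: LongTensor of shape (batch, seq_len), each entry a color-cluster
    index in [0, 512). `model` is assumed to map (batch, L) token IDs to
    (batch, L, 512) logits.
    """
    inputs, targets = pixel_ids[:, :-1], pixel_ids[:, 1:]  # shift by one position
    logits = model(inputs)                                  # (batch, L-1, 512)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

The BERT-style variant is analogous: mask a random subset of pixel tokens and compute the cross-entropy only at the masked positions.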

Recall that for the Transformer, the compute and memory cost of a self-attention layer scales as \(O(n^2 d)\), where n is the number of tokens in the sequence. Even for a low-resolution 300×300 RGB image, the attention matrix over the 300×300 = 90,000 pixels alone has about 8.1×10^9 entries, and the three color channels bring the total to roughly 24.3 billion. For comparison, all of BERT Large has only about 340 million parameters. To cope with this quadratic blow-up, ImageGPT first downsamples the input RGB image (e.g., 224×224) to 32×32 pixels. To further shrink the per-pixel RGB values, the algorithm runs k-means clustering with k = 512 over (R, G, B) triples, so that each pixel's three channels are represented by a single token from a 512-color palette. The final sequence length is therefore about 1,000 (32×32 = 1,024) with a color vocabulary of 512, which is tractable. Notice that the paper says little about how positional information is encoded for the input. In ImageGPT, the input sequence is the image's pixels unrolled row by row in raster order.
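Below is a minimal preprocessing sketch under the assumptions above (32×32 target resolution, a 512-entry color palette fit with k-means). The helper names and library choices are my own illustrative assumptions, not the paper's code.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def fit_color_palette(images, k=512):
    """Fit a k-means palette over (R, G, B) values pooled from training images.

    images: iterable of HxWx3 uint8 arrays. Returns a (k, 3) array of centroids.
    """
    pixels = np.concatenate([im.reshape(-1, 3) for im in images], axis=0)
    return KMeans(n_clusters=k, n_init=4).fit(pixels).cluster_centers_

def image_to_sequence(image, palette, size=32):
    """Downsample to size x size, snap each pixel to its nearest palette color,
    and unroll in raster (row-major) order into a 1D token sequence."""
    small = np.asarray(Image.fromarray(image).resize((size, size), Image.BILINEAR))
    flat = small.reshape(-1, 3).astype(np.float32)                  # (size*size, 3)
    dists = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)  # to each centroid
    return dists.argmin(axis=1)                                      # token IDs in [0, k)
```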

Strengths:

The paper generalizes the Transformer and demonstrates that it is a viable general-purpose architecture for image sequences. It also conducts a linear probe study to further analyze the quality of the learned representations (a minimal linear-probe sketch follows the list below). The findings are insightful for understanding the ImageGPT structure:

  1. Generative performance correlates with the quality of the model's learned representation.

  2. Representation quality first improves with network depth and then degrades, suggesting that the earlier layers focus on learning an image representation while the later layers focus on generation.
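To make the linear-probe idea concrete, here is a minimal sketch: freeze the pretrained model, average-pool the activations of one intermediate layer over the sequence, and fit a linear classifier on top. The hook `features_at_layer` is a hypothetical accessor for intermediate activations, not an API from the paper.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, pixel_ids, layer):
    """Average-pool the (batch, seq_len, d) activations of one frozen layer."""
    feats = model.features_at_layer(pixel_ids, layer)  # hypothetical hook
    return feats.mean(dim=1).cpu().numpy()             # (batch, d)

def linear_probe_accuracy(model, layer, train_ids, train_y, test_ids, test_y):
    """Representation quality = accuracy of a linear classifier on frozen features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(extract_features(model, train_ids, layer), train_y)
    return clf.score(extract_features(model, test_ids, layer), test_y)
```

Sweeping `layer` over the network depth gives the kind of accuracy-versus-depth curve behind finding 2.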

Critiques:

  1. The goal of an image pre-training model is to learn a representation of the image data. However, as analyzed above, the model's cost grows quadratically with the input sequence length, so the images must be heavily down-sampled to keep the model tractable. Learning from low-resolution images gives no guarantee of generalizing well to real-world high-resolution images. By contrast, CNN architectures do not require such down-sampling and can still achieve better performance with half or a third of the parameters.

  2. Image data inherently contains 2D local spatial structure. Flattening the input to 1D at the very first layer undoubtedly complicates the task, making it harder for the model to synthesize information across rows. A more direct encoding should represent the data better.

  3. The position of a pixel is an important part of its semantics. A red pixel inside the apple region of an apple-tree image represents the apple, but the same pixel value appearing on a leaf might represent a bug or noise. Encoding positional information into the pixel representation is therefore meaningful for image generation, and the positional encoding here should be 2D rather than the 1D encoding used in language models (see the sketch after this list).

  4. Because the image is fed in as a 1D pixel sequence, a pixel token inevitably has to attend to many distant positions in the sequence. A better 2D representation could restrict self-attention to a local pixel neighborhood, allowing local attention and further shrinking the computational cost (also sketched after this list).

  5. In addition, self-attention is quite expensive for long sequences; is there an \(O(n)\) attention mechanism that could lessen the burden?
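To illustrate critiques 3 and 4, here are two small sketches (my own assumptions, not from the paper): a learned 2D (row + column) positional embedding that could replace a 1D one, and a boolean mask that restricts each pixel's attention to a local spatial window.

```python
import torch
import torch.nn as nn

class PositionalEmbedding2D(nn.Module):
    """Learned row + column embeddings for a height x width pixel grid."""
    def __init__(self, height, width, d_model):
        super().__init__()
        self.rows = nn.Embedding(height, d_model)
        self.cols = nn.Embedding(width, d_model)
        self.height, self.width = height, width

    def forward(self, token_emb):
        # token_emb: (batch, height*width, d_model), pixels in raster order
        idx = torch.arange(self.height * self.width, device=token_emb.device)
        r, c = idx // self.width, idx % self.width
        return token_emb + self.rows(r) + self.cols(c)

def local_attention_mask(height, width, window=3):
    """Boolean (H*W, H*W) mask that is True where a query pixel may attend to a
    key pixel, i.e. the two pixels lie within a window x window neighborhood."""
    idx = torch.arange(height * width)
    r, c = idx // width, idx % width
    near_rows = (r[:, None] - r[None, :]).abs() <= window // 2
    near_cols = (c[:, None] - c[None, :]).abs() <= window // 2
    return near_rows & near_cols
```

In a custom attention implementation, such a mask would be combined with the usual causal mask before the softmax, so each pixel only attends to nearby, already-generated pixels.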