GLIDE: Text Guided Photorealistic Image Generation and Editing using Diffusion

13 Nov 2022

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, et al.

OpenAI arXiv Preprint

Summary:

GLIDE is a model for photorealistic image generation and editing with text guidance. Its backbone is a diffusion model, which we have introduced before. From the application angle, it achieved state-of-the-art text-to-image generation in 2021 and made interactive, human-guided editing possible; from the academic perspective, it was the first to apply and evaluate classifier-free guidance and CLIP guidance (both proposed in earlier papers) for text-conditional image generation on a large dataset, demonstrating their validity.

To embed text guidance into the diffusion model, the paper explores two approaches: classifier-free guidance and CLIP guidance.

To understand classifier-free guidance, we first need to understand class-conditional, classifier-guided diffusion. During the denoising sampling process, the mean of each reverse step is shifted not only based on the image from the previous timestep, but also by a scaled gradient of the log classifier probability. An image the classifier already assigns to the correct class is left almost untouched by this term, while a noisy image far from the predicted class is pushed toward it. On the U-Net side, just as the timestep embedding is injected in DDPM, the class label embedding y is added to the input of every layer. Overall, the image is denoised under classifier guidance while the forward noising process stays the same; the perturbed mean is written out below.
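
Concretely, if an unguided reverse step samples from $\mathcal{N}(\mu_\theta(x_t \mid y), \Sigma_\theta(x_t \mid y))$, classifier guidance with a classifier $p_\phi(y \mid x_t)$ and scale $s$ (as introduced in the classifier-guidance paper and restated in GLIDE) shifts the mean to:

$$
\hat{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t \mid y) + s \, \Sigma_\theta(x_t \mid y) \, \nabla_{x_t} \log p_\phi(y \mid x_t)
$$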

In the classifier-free approach, the label y is replaced with a null label with some fixed probability during training, so that part of the images are denoised without the condition and part of them with it. At sampling time, the final noise prediction is a scaled extrapolation from the unconditioned prediction toward the conditioned one: the larger the scale s, the further the prediction moves away from the unconditioned output and toward (and beyond) the conditioned one, trading diversity for closer agreement with the condition. Intuitively, this up-weights an implicit classifier probability of the condition given the image, without ever training an explicit classifier. Because the classifier is gone, the condition is no longer limited to a class label; in GLIDE it is the text caption tokens. The extrapolation formula is shown below.
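
Written out, GLIDE's classifier-free guidance combines the two predictions as follows, with $\emptyset$ denoting the null condition; $s = 1$ recovers the plain conditional model, and $s > 1$ amplifies the effect of the caption $c$:

$$
\hat{\epsilon}_\theta(x_t \mid c) = \epsilon_\theta(x_t \mid \emptyset) + s \, \big( \epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \emptyset) \big)
$$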

CLIP guidance is similar to classifier guidance. The only difference is that the traditional classifier is replaced with a CLIP model, whose contrastively learned similarity score measures how well the image matches the natural-language text. The denoising process also takes 'maximizing the image-text similarity' into consideration: each reverse step both removes noise as usual and nudges the sample in the direction that increases CLIP's similarity between the partially denoised image and the caption. The corresponding perturbed mean is given below.
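
With $f$ and $g$ denoting the CLIP image and caption encoders, GLIDE perturbs the reverse-step mean by the gradient of their dot product with respect to the noisy image:

$$
\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c) + s \, \Sigma_\theta(x_t \mid c) \, \nabla_{x_t} \big( f(x_t) \cdot g(c) \big)
$$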

For training, the base diffusion model is first trained at 64×64 resolution to reduce the training cost. A separate upsampling diffusion model is then trained to raise the resolution to 256×256. For CLIP guidance, they also train a noised version of CLIP so that it gives reasonable similarity scores between a caption and an image with Gaussian noise added. The inpainting model is fine-tuned from the base model, with extra input channels added to the U-Net: a second set of RGB channels for the masked image plus a mask channel (a rough sketch of this input follows).
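
As a rough illustration (not the official glide-text2im code; the function name, channel ordering, and mask convention are assumptions based on the paper's description of the extra RGB-plus-mask input channels), the inpainting U-Net input could be assembled like this:

```python
import torch

def build_inpainting_input(x_t, image, mask):
    """Hypothetical sketch: concatenate, along the channel dimension, the
    noisy image x_t, the known (unmasked) region of the original image,
    and the binary mask, giving the 7-channel input of the inpainting U-Net.

    x_t:   (B, 3, H, W) noisy image at the current timestep
    image: (B, 3, H, W) original image to be edited
    mask:  (B, 1, H, W) 1 = keep original pixel, 0 = region to inpaint
    """
    masked_image = image * mask                          # hide the region to be filled in
    return torch.cat([x_t, masked_image, mask], dim=1)   # (B, 3 + 3 + 1, H, W)
```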

Strength:

The paper proposes an effective way to make diffusion models generate 'human-guided' images, rather than purely sampling from the training-data distribution with no controllability. It achieves better image generation results than DALL-E.

Critique:

Again, sampling is very slow: I tried their downsized released version, and it took an ordinary GPU nearly 20 minutes to generate one 256×256 image. Besides, this work mainly focuses on the application side; the core ideas were almost all proposed in previous works. The results are good, but they still rely heavily on the huge training dataset, as the released cut-down version clearly shows. Is this text-to-image modeling approach optimal? Hard to say; there is not even a diagram showing the architecture.

Useful resources to understand GLIDE:

Videos:

By Yannic Kilcher

An overall clear explanation, but with some flaws in its understanding of classifier-free guidance

https://www.youtube.com/watch?v=gwI6g1pBD84&ab_channel=YannicKilcher

By Aleksa Gordić

A less clear explanation than Yannic's, but with a correct understanding of the classifier-free guidance part; see the segment around 38:01.

https://www.youtube.com/watch?v=lvv4N2nf-HU

By Letitia

Sharp commentary with more illustrative diagrams

https://www.youtube.com/watch?v=lvv4N2nf-HU&ab_channel=AleksaGordi%C4%87-TheAIEpiphany

Interactive Demo:

https://huggingface.co/spaces/valhalla/glide-text2im

Code:

OpenAI Version:

https://github.com/openai/glide-text2im

Other Important References:

Class-Conditional (Classifier-Guided) Diffusion Model:

https://arxiv.org/pdf/2105.05233.pdf

Classifier-Free Diffusion Guidance:

https://openreview.net/pdf?id=qw8AKxfYbI