DDPM: Diffusion Model for Image Generation

12 Nov 2022

DDPM: Denoising Diffusion Probabilistic Models

Ho, Jonathan and Jain, Ajay and Abbeel, Pieter

NeurIPS 2020

Summary:

The denoising diffusion probabilistic model extends the diffusion approach of the 2015 paper (Sohl-Dickstein et al.) to image generation. The model is inspired by considerations from nonequilibrium thermodynamics, where an original distribution can be gradually destroyed by an iterative diffusion process. Applied to an image dataset, we can gradually add small amounts of Gaussian noise to the original image; with enough time steps T, we end up with pure Gaussian noise. If we can train a model to recover the distribution at the previous timestep given the noisier distribution at the current timestep, we can denoise the image by one step. The goal of the model is therefore to learn how much noise was added at each timestep for a given dataset. Furthermore, if we repeatedly apply that trained model to the pure noise image at timestep T, denoising it T times, we should recover an image like the one at timestep 0. In that way, we can generate an image from pure Gaussian noise.
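For reference, the forward (noising) and reverse (denoising) Markov steps sketched above are written in the paper as:

```latex
% Forward process: add Gaussian noise with a fixed variance schedule beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse process: a learned Gaussian that undoes one noising step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```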

Coming down to the implementation level, a diffusion model needs several modules: the noise scheduler, which sequentially adds noise; the trained NN (a U-Net), which learns to denoise the image given a timestep; and a timestep encoding, which adds sequential information to the image input.
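As a quick sketch of what the timestep encoding can look like: the paper conditions the network on the Transformer-style sinusoidal position embedding of t. The function name and embedding size below are my own choices, so treat this as an illustration rather than the paper's exact code.

```python
import math
import torch

def sinusoidal_timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Map integer timesteps t (shape [batch]) to [batch, dim] sinusoidal embeddings."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # [batch, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # [batch, dim]

# Example: embed timesteps 0, 250, 999 for a batch of three images
emb = sinusoidal_timestep_embedding(torch.tensor([0, 250, 999]))
print(emb.shape)  # torch.Size([3, 128])
```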

The noise scheduler models the noise-adding process as a Markov chain, in which the image at the current timestep depends only on the image at the previous timestep. Given the previous image, the next image is drawn from a Gaussian whose mean is the previous image scaled by sqrt(1 - beta_t) and whose variance is beta_t I. The beta_t here can be thought of as a measurement of the noise level: if beta_t is 0, the next image is the same as the current one; if beta_t is 1, the next image is pure Gaussian noise. It can be shown mathematically that if we linearly increase beta_t from a small value (the paper uses 1e-4 up to 0.02) over about 1000 time steps, the final image is approximately a standard Gaussian. Also, if we precompute the cumulative product of the (1 - beta_t) terms over all timesteps, we can generate the noisy image at an arbitrary timestep directly in closed form, without running the sequential noising process again, which makes retrieving the noisy image at any timestep really fast.
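A minimal PyTorch sketch of that scheduler, assuming the paper's linear schedule (1e-4 to 0.02 over T = 1000); the helper name q_sample is my own:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # precomputed cumulative products of (1 - beta_t)

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Jump straight to the noisy image x_t from the clean image x_0 (closed form)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over [batch, C, H, W]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Example: noise a batch of clean images at random timesteps
x0 = torch.randn(8, 3, 32, 32)                 # stand-in for real images
t = torch.randint(0, T, (8,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```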

The U-Net architecture is chosen as the denoising network since the predicted noise has the same dimensions as the input image. Besides, because the U-Net is not recurrent, the same weights are shared across all timesteps, so a time embedding must be added to the input to tell the U-Net which timestep it is handling, since the noise level depends on the timestep. During training, a timestep t is sampled uniformly at random, the noisy image at that timestep is produced directly with the closed-form forward process explained above, and the U-Net, conditioned on the timestep encoding, predicts the noise that was added. The paper then applies an L2 loss between the predicted noise and the true noise and backpropagates the error, so the network learns to denoise better next time.
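Put together with the schedule sketch above, one training step could look like the following. This is only a sketch of my reading of the simplified objective: unet(x_t, t) stands for any noise-predicting U-Net, and the optimizer is assumed to be set up elsewhere.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0):
    """One DDPM training step: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))        # random timestep per image
    noise = torch.randn_like(x0)                   # the noise the net must recover
    x_t = q_sample(x0, t, noise)                   # closed-form noisy image (see sketch above)
    pred = unet(x_t, t)                            # U-Net predicts the added noise
    loss = F.mse_loss(pred, noise)                 # simplified L2 objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```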

Strength:

The method achieved SOTA image generation quality. The generated images have the same dimensions as the training images, and I guess the magic might lie in this lossless dimensionality. Not compressing into a latent space may preserve as much detail as possible, since the network's task is simply denoising toward the given dataset distribution; it does not have to handle a separate compression and generation stage. In the training stage, the model essentially remembers what a good next denoising step looks like for images it has seen in the training dataset. In the inference stage, based on that memory, it tries to denoise a similar image out of pure Gaussian noise. There is no need for the model to compress into a latent space the way a VAE or GAN does.

Critique:

The inference phase needs to run through about 1000 time steps, which is time-consuming. Also, controllability is only demonstrated through latent interpolation; a more elegant, semantics-level way of steering generation is still missing.

Useful Resources:

Video:

https://www.youtube.com/watch?v=a4Yfz2FxXiY&ab_channel=DeepFindr

Blogs:

https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/#forward-diffusion-process

Code:

https://colab.research.google.com/drive/1sjy9odlSSy0RBVgMTgP7s99NXsqglsUL?usp=sharing