Review of the Guidance in Diffusion Models

26 Nov 2022

Summary:

I plan to use the NeurIPS 2021 paper Diffusion Models Beat GANs on Image Synthesis as a foundation and review several diffusion guidance methods in this article: the score-based view of diffusion, classifier guidance for DDPM and DDIM, and classifier-free guidance.

Suppose p(x) is the probability distribution of the dataset X. The goal of generative methods is to implicitly or explicitly learn p(x) and sample from it to produce new results. If we somehow knew the gradient of log p(x) (the score), we could guide a model toward regions of high data likelihood, and score-based methods aim to estimate exactly this gradient. However, the dataset we have is usually very sparse in the space spanned by X, so the estimated gradients are near zero in most of the space, which makes the naive idea impractical. Viewed from this perspective, the diffusion model perturbs the dataset with Gaussian noise so that the perturbed data provides a non-zero gradient signal in the otherwise empty regions of the space. As sampling proceeds, the noise scale of each step shrinks, leading the model into regions where the gradient estimate provided by the dataset X is more and more accurate.
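Concretely, the quantities involved can be written as follows (a sketch of the standard score-based formulation; the notation s_θ and the noise schedule σ_1 > … > σ_L follow the usual conventions rather than the reviewed paper):

```latex
% The score of the data distribution, which score-based methods estimate:
s_\theta(x) \approx \nabla_x \log p(x)

% Perturbing the data with Gaussian noise at scale \sigma gives a smoothed
% density whose score is well defined everywhere in the ambient space:
p_\sigma(\tilde{x}) = \int p(x)\, \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\, dx

% Annealed sampling then follows a decreasing sequence of noise levels
% \sigma_1 > \sigma_2 > \dots > \sigma_L, so each step uses a score estimate
% in a region increasingly well covered by the dataset X.
```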

The paper provides a comprehensive review of how the classifier-guided diffusion model is derived. The classifier-guided reverse process answers the question: given the current noised input x_t and the label y, what is the previous, less-noised input x_{t-1}? By Bayes' rule, the conditional probability p(x_{t-1}|x_t, y) is proportional to the unconditional diffusion reverse probability p(x_{t-1}|x_t) multiplied by the probability of the label given the noised image, p(y|x_{t-1}); the latter is simply the output of a classifier trained on noised inputs. To simplify the classifier term, a Taylor expansion is applied around x_{t-1} = μ, yielding a (x_{t-1} − μ)ᵀg term, where g is the classifier gradient at μ. After replacing the classifier probability with this approximation, the log of the conditional reverse step reduces to the log of a Gaussian plus a constant, where the Gaussian is the original one of the diffusion model with μ shifted by Σg, the product of the covariance and the classifier gradient. The result means that classifier-guided diffusion can be implemented as an unguided diffusion process whose mean μ is updated by a classifier-gradient term.
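Written out, the derivation sketched above is (following the paper's notation, with Σ the covariance of the unconditional reverse step):

```latex
% Bayes' rule for the conditional reverse step (Z is a normalizing constant):
p(x_{t-1} \mid x_t, y) = Z \, p(x_{t-1} \mid x_t) \, p(y \mid x_{t-1})

% The unconditional reverse step is Gaussian:
p(x_{t-1} \mid x_t) = \mathcal{N}(\mu, \Sigma)

% Taylor expansion of the classifier log-probability around x_{t-1} = \mu:
\log p(y \mid x_{t-1}) \approx (x_{t-1} - \mu)^\top g + C,
\qquad g = \nabla_{x_{t-1}} \log p(y \mid x_{t-1}) \Big|_{x_{t-1} = \mu}

% Completing the square turns the product into a shifted Gaussian:
\log \big[ p(x_{t-1} \mid x_t) \, p(y \mid x_{t-1}) \big]
  \approx \log \mathcal{N}(\mu + \Sigma g, \, \Sigma) + \text{const}
```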

For DDIM, the sampling process is deterministic rather than stochastic as in DDPM, so the conditional-sampling derivation above does not hold. To add a condition to this process, the score-based trick can instead be applied to the score of the marginal p(x_t). If we have a noise-prediction model ε(x_t) for the unconditional diffusion model, the score ∇ log p(x_t) is a scaled ε(x_t), namely −ε(x_t)/√(1−ᾱ_t). Then, to apply the classifier's influence, the score of the joint distribution yields a modified noise prediction: the unconditional noise prediction minus a scaled classifier gradient, ε̂(x_t) = ε(x_t) − √(1−ᾱ_t) ∇ log p(y|x_t).
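As a minimal PyTorch-style sketch of this modified noise prediction (the names eps_model and classifier and their call signatures are my assumptions for illustration, not an API from the paper):

```python
import torch

def classifier_score(classifier, x_t, t, y):
    """Gradient of log p(y | x_t) w.r.t. x_t, from a classifier trained on
    noised inputs (assumed signature: classifier(x_t, t) -> logits)."""
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        return torch.autograd.grad(selected, x)[0]

def guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, scale=1.0):
    """DDIM classifier guidance: shift the predicted noise by the scaled
    classifier gradient, eps_hat = eps - sqrt(1 - alpha_bar) * s * grad."""
    eps = eps_model(x_t, t)
    grad = classifier_score(classifier, x_t, t, y)
    return eps - (1 - alpha_bar_t) ** 0.5 * scale * grad
```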

The motivation of classifier-free guidance comes from re-examining the classifier term p(y|x) in classifier-guided DDPM from a Bayes angle. By Bayes' rule, p(y|x) can be written as p(x|y)·p(y)/p(x). Taking the gradient of the log with respect to x gives ∇ log p(x|y) − ∇ log p(x), which gets rid of the p(y) term since it does not depend on x. If we feed the model an empty label value, its output can be viewed as (the score of) p(x|∅), so the guidance term becomes the difference between the model's score given the label and its score given the empty value. This entirely removes the need to train a separate classifier, since there is no p(y|x) term anymore. Training a classifier-free guidance diffusion model is done by replacing the label input with the empty value with a fixed probability, and at sampling time the two outputs (one for the class input, one for the empty input) are combined to form the final noise prediction.
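A minimal sketch of both sides, training-time label dropping and sampling-time combination (eps_model, null_token, and the 10% drop rate are illustrative assumptions, and the exact (1 + w) vs. w parameterization of the guidance weight varies between papers):

```python
import torch

def maybe_drop_label(y, null_token, p_uncond=0.1):
    """During training, replace the label with the empty token with a fixed
    probability so a single network learns both the conditional and the
    unconditional distribution."""
    return null_token if torch.rand(()).item() < p_uncond else y

def cfg_eps(eps_model, x_t, t, y, null_token, w=3.0):
    """Classifier-free guidance at sampling time: run the model twice and
    extrapolate from the unconditional prediction toward the conditional one,
    eps_hat = eps(x_t, null) + w * (eps(x_t, y) - eps(x_t, null))."""
    eps_cond = eps_model(x_t, t, y)             # label input
    eps_uncond = eps_model(x_t, t, null_token)  # empty-value input
    return eps_uncond + w * (eps_cond - eps_uncond)
```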

Strengths:

Guidance gives diffusion models more controllability over their generation results, which matters for wider applications. Classifier-free guidance avoids classifier training altogether, and the GLIDE paper also shows its advantage over the classifier-based methods. To me, the classifier-based methods seem, in hindsight, to have been a detour in the exploration of guided diffusion.

Critique:

What are the shortcomings of classifier-free diffusion models? And what is the right balance between conditional and unconditional examples in training, i.e., how often should the label be replaced with the empty value?