Diffusion Model
Diffusion models, specifically Denoising Diffusion Probabilistic Models (DDPMs), are generative models that learn to create data by reversing a gradual noise-addition process.
Mathematically, diffusion can be broken down into two Markov chains: a forward process that systematically corrupts data into pure Gaussian noise, and a learned reverse process that iteratively recovers the original data from the noise.
Here is the mathematical foundation of how this works.
1. The Forward Process (Adding Noise)
Let \(\mathbf{x}_0 \sim q(\mathbf{x})\) be a data point sampled from the real data distribution. The forward process adds small amounts of Gaussian noise to the data over \(T\) timesteps, producing a sequence of increasingly noisy variables \(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\).
This is defined as a Markov chain with a fixed variance schedule \(\beta_1, \dots, \beta_T\):
\[q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\]
Instead of applying this iteratively \(t\) times, we can sample \(\mathbf{x}_t\) at any arbitrary timestep \(t\) directly from \(\mathbf{x}_0\) in a closed form.
Let \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\). The distribution of \(\mathbf{x}_t\) given \(\mathbf{x}_0\) is:
\[q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})\]
Using the reparameterization trick, we can express \(\mathbf{x}_t\) as:
\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}\]
where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is the noise added to the image. As \(T \to \infty\), \(\bar{\alpha}_T \to 0\), making \(\mathbf{x}_T\) nearly an isotropic Gaussian.
2. The Reverse Process (Removing Noise)
If we could calculate \(q(\mathbf{x}_{t-1} | \mathbf{x}_t)\), we could start with pure noise \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and run the process backwards to generate a new sample from the data distribution. However, this is intractable because it requires knowing the entire data distribution.
Instead, we use a neural network (typically a U-Net) to approximate these conditional probabilities with a parameterized model \(p_\theta\):
\[p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))\]
In practice, the variance \(\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\) is often fixed to time-dependent constants (like \(\beta_t \mathbf{I}\)), meaning the network only needs to learn the mean \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\).
3. The Training Objective
To train the model, we want to maximize the likelihood of the training data. Because exact likelihood computation is intractable, we optimize the variational lower bound (ELBO), similar to Variational Autoencoders (VAEs).
Through algebraic simplification, the objective reduces to teaching the network to predict the specific noise vector \(\boldsymbol{\epsilon}\) that was added to \(\mathbf{x}_0\) to create \(\mathbf{x}_t\).
The highly effective, simplified loss function formulated by Ho et al. is:
\[L_{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]\]
- \(\boldsymbol{\epsilon}\) is the actual noise injected during the forward process.
- \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) is the neural network's prediction of that noise.
4. Sampling (Generation)
Once the network \(\boldsymbol{\epsilon}_\theta\) is trained, we can generate new data.
- We sample pure noise: \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
- We iteratively denoise it using Langevin dynamics. At each step from \(T\) down to 1, we compute \(\mathbf{x}_{t-1}\) using the network's noise prediction:
\[\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}\]
where \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) for \(t > 1\) (adding back a small amount of noise to maintain the Markov chain stochasticity), and \(\mathbf{z} = \mathbf{0}\) for the final step where \(t = 1\).
How the loss is derived from ELBO
\[\log p(x) \ge \mathbb{E}_{z \sim q} \left[ \log \frac{p(x, z)}{q(z|x)} \right]\]
To make the ELBO useful for optimization (as in Variational Autoencoders), we rewrite the joint distribution \(p(x, z)\) as \(p(x|z)p(z)\):
\[\mathcal{L} = \mathbb{E}_{z \sim q} \left[ \log \frac{p(x|z)p(z)}{q(z|x)} \right]\]
\[\mathcal{L} = \mathbb{E}_{z \sim q} [\log p(x|z)] - \mathbb{E}_{z \sim q} \left[ \log \frac{q(z|x)}{p(z)} \right]\]
\[\mathcal{L} = \underbrace{\mathbb{E}_{z \sim q} [\log p(x|z)]}_{\text{Reconstruction Term}} - \underbrace{D_{KL}(q(z|x) \parallel p(z))}_{\text{Regularization Term}}\]
- Reconstruction Term: Encourages the model to assign high probability to the observed data given the latent codes.
- Regularization Term: Forces the approximate posterior \(q(z|x)\) to stay close to the prior \(p(z)\), preventing the model from collapsing into a simple lookup table.
1. Starting with the ELBO
Like Variational Autoencoders (VAEs), we want to maximize the log-likelihood of our data, \(\log{p_\theta(\mathbf{x}_0)}\). Because integrating over all possible latent trajectories \(\mathbf{x}_{1:T}\) is intractable, we optimize the variational lower bound (the negative ELBO), which we want to minimize:
\[L_{VLB}=\mathbb{E}_{q}\left[-\log{p_\theta(\mathbf{x}_{0:T})}+\log{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]\]
2. Decomposing the Objective
Using the Markov property of both the forward and reverse chains, we can expand the joint probabilities:
\[p_\theta(\mathbf{x}_{0:T})=p_\theta(\mathbf{x}_T)\prod_{t=1}^T{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}\]
\[q(\mathbf{x}_{1:T}|\mathbf{x}_0)=\prod_{t=1}^T{q(\mathbf{x}_t|\mathbf{x}_{t-1})}\]
By applying Bayes' rule and carefully grouping terms conditioned on \(\mathbf{x}_0\), the negative ELBO can be rewritten as a sum of three distinct Kullback-Leibler (KL) divergences:
\[L_{VLB}=L_T+L_{t-1}+L_0\]
Where:
- \(L_T=D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)\parallel{p_\theta(\mathbf{x}_T)})\): The prior matching term. Since \(q(\mathbf{x}_T|\mathbf{x}_0)\) is essentially pure Gaussian noise and \(p_\theta(\mathbf{x}_T)\) is defined as a standard Gaussian, this term has no learnable parameters and can be ignored during training.
- \(L_0=-\log{p_\theta(\mathbf{x}_0|\mathbf{x}_1)}\): The reconstruction term, estimated using a discrete decoder.
- \(L_{t-1}=\sum_{t=2}^T{D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\parallel{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)})}\): The denoising matching term. This is where the core learning happens.
3. Analyzing the Denoising Term (\(L_{t-1}\))
The beauty of the diffusion framework is that the forward posterior, \(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\), is mathematically tractable. It is a Gaussian distribution:
\[q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1};\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0),\tilde{\beta}_t\mathbf{I})\]
Our neural network models the reverse step as another Gaussian:
\[p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\Sigma_\theta\mathbf{I})\]
Because the KL divergence between two Gaussians with fixed, matching variances (assuming \(\Sigma_\theta=\tilde{\beta}_t\)) reduces to the squared difference between their means, optimizing \(L_{t-1}\) is simply a matter of minimizing the distance between the true mean \(\tilde{\boldsymbol{\mu}}_t\) and the predicted mean \(\boldsymbol{\mu}_\theta\):
\[L_{t-1}\propto\|\tilde{\boldsymbol{\mu}}_t-\boldsymbol{\mu}_\theta\|^2\]
4. Parameterizing the Mean
Through Gaussian conditioning, the true mean of the forward posterior is:
\[\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0)=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t\]
We know from the forward process reparameterization that \(\mathbf{x}_0=\frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon})\). Substituting this \(\mathbf{x}_0\) into the equation above yields a formulation of the mean based entirely on \(\mathbf{x}_t\) and the noise \(\boldsymbol{\epsilon}\):
\[\tilde{\boldsymbol{\mu}}_t=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)\]
Since \(\mathbf{x}_t\) is available as an input, we can design our neural network to predict the noise \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\) rather than predicting the mean directly. The network's predicted mean becomes:
\[\boldsymbol{\mu}_\theta(\mathbf{x}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right)\]
5. Arriving at the Simplified Loss
When we plug both \(\tilde{\boldsymbol{\mu}}_t\) and \(\boldsymbol{\mu}_\theta\) back into the KL divergence from Step 3, the \(\mathbf{x}_t\) terms cancel out, leaving us with an objective that scales the error of the noise prediction by a complex, time-dependent weighting factor:
\[L_{t-1}\propto\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\|^2\]
Ho et al. (2020) discovered that retaining this exact weighting heavily prioritizes denoising at very small \(t\) values (where the image is almost clean) and ignores the harder task of denoising at large \(t\) values.
By empirically dropping this complex weighting factor and instead uniformly weighting the expected squared error of the noise prediction across all timesteps, the model achieves significantly better sample quality. This leaves us with the final, simplified loss:
\[L_{simple}(\theta)=\mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\|^2\right]\]