Diffusion Models from a Stochastic Differential Equation (SDE) Perspective

From a mathematical perspective, diffusion models are fundamentally about defining a trajectory between a complex, intractable data distribution and a simple, tractable noise distribution, and then learning the vector field that reverses this trajectory.

For readers with a background in computational imaging and computer vision, the most elegant and unifying way to look at this is through the lens of continuous-time Stochastic Differential Equations (SDEs), a framework that encompasses discrete models such as Denoising Diffusion Probabilistic Models (DDPMs) and score-based generative models (SGMs).

Here is the mathematical breakdown of the diffusion framework.

1. The Forward Process: Destroying Information

We start with data \(\mathbf{x}_0 \sim p_0(\mathbf{x})\), where \(p_0\) is our target data distribution (e.g., natural images). The goal of the forward process is to progressively add Gaussian noise until the distribution at time \(T\), denoted \(p_T(\mathbf{x})\), is indistinguishable from a standard Gaussian prior \(\mathcal{N}(\mathbf{0}, \mathbf{I})\).

This continuous perturbation is modeled by an Itô SDE:

\[ d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w} \]

Where:

\(\mathbf{f}(\mathbf{x}, t)\) is the drift coefficient (a vector function pulling the data, often scaling it down).
\(g(t)\) is the diffusion coefficient (a scalar function controlling the variance of the injected noise).
\(\mathbf{w}\) is the standard Brownian motion (Wiener process).

When the drift is affine in \(\mathbf{x}\) (e.g., \(\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta(t)\mathbf{x}\)), the transition kernel \(q(\mathbf{x}_t | \mathbf{x}_0)\) remains Gaussian, so \(\mathbf{x}_t\) can be sampled in closed form at any time \(t\) without simulating the SDE step by step.
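The closed-form sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. It assumes the variance-preserving choice \(g(t) = \sqrt{\beta(t)}\) paired with the drift above, a linear schedule \(\beta(t)\) on \(t \in [0, 1]\), and the endpoint values `BETA_MIN` and `BETA_MAX`; none of these are fixed by the text, and the function names `vp_marginal` and `sample_xt` are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule endpoints, t in [0, 1]

def vp_marginal(t):
    """Return (alpha(t), sigma(t)) such that x_t = alpha * x_0 + sigma * eps."""
    t = np.asarray(t, dtype=float)
    # Integral of beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN) from 0 to t.
    int_beta = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2
    alpha = np.exp(-0.5 * int_beta)            # mean scaling of x_0
    sigma = np.sqrt(1.0 - np.exp(-int_beta))   # std of the injected noise
    return alpha, sigma

def sample_xt(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; also return the noise that was used."""
    alpha, sigma = vp_marginal(t)
    eps = rng.standard_normal(x0.shape)
    return alpha * x0 + sigma * eps, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8))  # stand-in for a batch of data
xt, eps = sample_xt(x0, 0.5, rng)
```

Under these assumptions \(\alpha(t)^2 + \sigma(t)^2 = 1\) for every \(t\), which is why this family is called variance preserving: unit-variance data keeps unit variance throughout the forward process.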

2. The Reverse Process: Generating Data

To generate data, we need to simulate the SDE backward in time, starting from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and moving to \(t=0\).

According to Anderson's theorem, a forward SDE of this form has, under mild regularity conditions, a corresponding reverse-time SDE:

\[ d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt + g(t)\,d\mathbf{\bar{w}} \]

Where:

\(\mathbf{\bar{w}}\) is a standard Brownian motion when time flows backward from \(T\) to \(0\).
\(dt\) is an infinitesimal negative time step.
\(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) is the score function of the perturbed data distribution.

The drift and diffusion coefficients (\(\mathbf{f}\) and \(g\)) are analytically known because we designed the forward process. The only unknown term required to reverse the SDE is the score function.
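To make the reverse-time SDE concrete, the sketch below integrates it with a plain Euler-Maruyama scheme on a toy problem where the score is known exactly. Everything specific here is an assumption made for illustration: the data distribution is a one-dimensional Gaussian (so \(p_t\) stays Gaussian and its score has a closed form), the forward SDE uses \(\mathbf{f} = -\frac{1}{2}\beta(t)\mathbf{x}\) and \(g(t) = \sqrt{\beta(t)}\) with a linear \(\beta(t)\) on \([0, 1]\), and the helper names are hypothetical. In a real model, `true_score` would be replaced by a trained network.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear beta(t) on t in [0, 1]
DATA_MEAN, DATA_STD = 2.0, 0.5   # toy data distribution p_0 = N(2, 0.5^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def marginal(t):
    """alpha(t), sigma(t) of the Gaussian kernel q(x_t | x_0)."""
    int_beta = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2
    return np.exp(-0.5 * int_beta), np.sqrt(1.0 - np.exp(-int_beta))

def true_score(x, t):
    """Exact score of p_t, available only because p_0 is Gaussian here."""
    alpha, sigma = marginal(t)
    var_t = (alpha * DATA_STD) ** 2 + sigma**2
    return -(x - alpha * DATA_MEAN) / var_t

def reverse_sde_sample(n_samples, n_steps, rng):
    """Euler-Maruyama integration of the reverse-time SDE from t=1 to t=0."""
    x = rng.standard_normal(n_samples)   # x_T ~ N(0, 1)
    dt = -1.0 / n_steps                  # negative time step
    for i in range(n_steps):
        t = 1.0 + i * dt
        f = -0.5 * beta(t) * x           # forward drift f(x, t)
        g2 = beta(t)                     # g(t)^2
        drift = f - g2 * true_score(x, t)
        noise = rng.standard_normal(n_samples)
        x = x + drift * dt + np.sqrt(g2 * -dt) * noise
    return x

samples = reverse_sde_sample(20000, 1000, np.random.default_rng(0))
```

If the discretization is fine enough, the mean and standard deviation of `samples` should land close to `DATA_MEAN` and `DATA_STD`, which is a quick sanity check that the drift and noise terms have the right signs.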

3. Score Matching: Learning the Vector Field

Since the true marginal distribution \(p_t(\mathbf{x})\) is intractable (it requires integrating over all possible original images \(\mathbf{x}_0\)), we cannot compute \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) directly.

Instead, we train a neural network, the score model \(s_\theta(\mathbf{x}, t)\), to approximate it. We optimize it with Denoising Score Matching, which relies on the fact that regressing onto the tractable conditional score \(\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0)\) yields the same objective as regressing onto the intractable marginal score, up to an additive constant that does not depend on \(\theta\).

The objective function is:

\[ \mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,T),\ \mathbf{x}_0 \sim p_0,\ \mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0)} \left[ \lambda(t) \left\| s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) \right\|_2^2 \right] \]

Where:

\(\lambda(t)\) is a positive weighting function (often chosen to balance the variance of the score magnitude across time).
\(\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0)\) is analytically available because the perturbation kernel \(q(\mathbf{x}_t | \mathbf{x}_0)\) is Gaussian. Specifically, writing \(\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \sigma(t)\boldsymbol{\epsilon}\), it simplifies to \(-\frac{\boldsymbol{\epsilon}}{\sigma(t)}\), where \(\sigma(t)\) is the standard deviation of the kernel and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is the noise injected.

This is why, in practice, DDPMs are often implemented to predict the injected noise \(\boldsymbol{\epsilon}\) rather than the score directly; they are mathematically equivalent up to a scaling factor.
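The equivalence between score prediction and noise prediction can be checked numerically. The sketch below evaluates the denoising score matching loss at a single noise level with the weighting \(\lambda(t) = \sigma(t)^2\), and compares it with a plain squared error on the noise. The weighting choice, the fixed values of `alpha` and `sigma`, and the linear `eps_fn` (an arbitrary stand-in for a trained network) are all assumptions made for illustration.

```python
import numpy as np

def perturb(x0, alpha, sigma, rng):
    """Sample x_t = alpha * x_0 + sigma * eps and return the noise as well."""
    eps = rng.standard_normal(x0.shape)
    return alpha * x0 + sigma * eps, eps

def dsm_loss(score_fn, x0, alpha, sigma, rng):
    """Denoising score matching loss at one noise level, lambda = sigma^2."""
    xt, eps = perturb(x0, alpha, sigma, rng)
    target = -eps / sigma                          # conditional score
    sq_err = np.sum((score_fn(xt) - target) ** 2, axis=-1)
    return np.mean(sigma**2 * sq_err)

def noise_loss(eps_fn, x0, alpha, sigma, rng):
    """Squared error between predicted and injected noise."""
    xt, eps = perturb(x0, alpha, sigma, rng)
    return np.mean(np.sum((eps_fn(xt) - eps) ** 2, axis=-1))

alpha, sigma = 0.8, 0.6
x0 = np.random.default_rng(1).standard_normal((1024, 16))  # toy data batch

def eps_fn(xt):
    return sigma * xt                # hypothetical noise predictor

def score_fn(xt):
    return -eps_fn(xt) / sigma       # the same model, read as a score

# Identical seeds so both losses see the same noise draws.
loss_score = dsm_loss(score_fn, x0, alpha, sigma, np.random.default_rng(0))
loss_noise = noise_loss(eps_fn, x0, alpha, sigma, np.random.default_rng(0))
```

With the same noise draws the two losses agree up to floating-point error, since \(\sigma^2 \| -\boldsymbol{\epsilon}_\theta/\sigma + \boldsymbol{\epsilon}/\sigma \|^2 = \| \boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon} \|^2\). Other choices of \(\lambda(t)\) change only the relative weight given to each noise level.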

4. Connection to Discrete Models (DDPM)

While the SDE framework is continuous, the foundational Denoising Diffusion Probabilistic Models (DDPMs) formulate the same process as a discrete-time Markov chain.

In the discrete case, the forward transitions are:

\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \]

The training objective is derived by maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of the data, similar to Variational Autoencoders (VAEs). The SDE formulation is recovered as the continuous-time limit of this chain, as the number of discrete steps \(N \to \infty\).
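The limit \(N \to \infty\) can be illustrated numerically. In the discrete chain, composing the Gaussian transitions gives \(q(\mathbf{x}_N | \mathbf{x}_0)\) with mean scaling \(\sqrt{\bar{\alpha}_N}\), where \(\bar{\alpha}_N = \prod_{i=1}^{N} (1 - \beta_i)\). The sketch below sets \(\beta_i = \beta(i/N)/N\) for an assumed linear continuous schedule \(\beta(t)\) on \([0, 1]\) and compares \(\log \bar{\alpha}_N\) with its continuous-time counterpart \(-\int_0^1 \beta(s)\,ds\). The schedule endpoints and the function name are assumptions made for illustration.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed continuous schedule beta(t), t in [0, 1]

def discrete_log_alpha_bar(n_steps):
    """log of prod_i (1 - beta_i), with beta_i = beta(i / N) / N."""
    t = np.arange(1, n_steps + 1) / n_steps
    betas = (BETA_MIN + t * (BETA_MAX - BETA_MIN)) / n_steps
    return np.sum(np.log1p(-betas))

# Continuous-time counterpart: minus the integral of beta(s) over [0, 1].
continuous = -0.5 * (BETA_MIN + BETA_MAX)

# Gap between the discrete chain and the SDE for increasing step counts.
gaps = {n: abs(discrete_log_alpha_bar(n) - continuous) for n in (100, 1000, 10000)}
for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks as the number of steps grows, which is the sense in which the discrete chain converges to the SDE. With \(N = 1000\), these assumed endpoints give per-step values \(\beta_i\) ranging from roughly \(10^{-4}\) to \(0.02\).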

Because the learned score implicitly encodes a prior over natural signals, diffusion models have become effective tools for solving ill-posed inverse problems (such as super-resolution, phase retrieval, or denoising).

Jipeng Sun
