VAE vs. Diffusion from the ELBO Perspective

It is striking that two very different generative paradigms, Variational Autoencoders (VAEs) and diffusion models, are both trained by optimizing the same Evidence Lower Bound (ELBO).

The general ELBO formula is:

\[
\log p(\mathbf{x}) \ge \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))
\]
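
For context, here is a short sketch of where the bound comes from. It uses only the quantities above plus the true posterior \(p_\theta(\mathbf{z}|\mathbf{x})\), which is notation introduced here for the derivation:

\[
\begin{aligned}
\log p(\mathbf{x})
&= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right] + D_{KL}(q(\mathbf{z}|\mathbf{x}) \parallel p_\theta(\mathbf{z}|\mathbf{x})) \\
&= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z})) + D_{KL}(q(\mathbf{z}|\mathbf{x}) \parallel p_\theta(\mathbf{z}|\mathbf{x}))
\end{aligned}
\]

The last KL term is non-negative, so dropping it gives the inequality. The bound is tight exactly when \(q(\mathbf{z}|\mathbf{x})\) matches the true posterior.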

The radical divergence in how they function and the results they produce comes down to fundamental design choices regarding the latent space, the forward process (encoding), and the reverse process (decoding).

Here is a breakdown of the key mathematical and structural differences that force VAEs and Diffusion models down different paths.

1. The Forward Process (The "Encoder")

In the ELBO framework, we have an approximate posterior, \(q(\mathbf{z}|\mathbf{x})\), which maps data to a latent space.

VAE (Learned & Single-Step): In a VAE, \(q_\phi(\mathbf{z}|\mathbf{x})\) is a neural network (the encoder) with learned parameters \(\phi\). It maps each data point to a distribution over latents, typically a diagonal Gaussian, in a single non-linear step.
Diffusion (Fixed & Multi-Step): In a diffusion model, the forward process \(q(\mathbf{x}_{1:T}|\mathbf{x}_0)\) is fixed in advance. In the standard formulation it has no learnable parameters: it is a predefined Markov chain, \(q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})\), that adds a small amount of Gaussian noise at each of \(T\) steps.
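
To make the contrast concrete, here is a minimal NumPy sketch of the two "encoders". It is an illustration under stated assumptions, not a reference implementation: the linear maps `W_mu` and `W_logvar` are hypothetical stand-ins for a trained encoder network, and the linear \(\beta_t\) schedule and \(T = 1000\) are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(x, W_mu, W_logvar):
    """Single-step learned encoder: draw one sample z ~ q_phi(z | x).

    W_mu and W_logvar are hypothetical linear maps standing in for a trained network.
    """
    mu = x @ W_mu
    logvar = x @ W_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps  # reparameterization trick

def diffusion_forward_chain(x0, betas):
    """Fixed multi-step encoder: apply q(x_t | x_{t-1}) for t = 1..T.

    q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I); nothing is learned.
    """
    xs = [x0]
    for beta in betas:
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(1.0 - beta) * xs[-1] + np.sqrt(beta) * noise)
    return xs  # [x_0, x_1, ..., x_T]

x = rng.standard_normal(16)                    # toy "image" with 16 pixels
W_mu = 0.1 * rng.standard_normal((16, 4))      # hypothetical encoder weights
W_logvar = 0.1 * rng.standard_normal((16, 4))
z = vae_encode(x, W_mu, W_logvar)

betas = np.linspace(1e-4, 0.02, 1000)          # linear schedule (an assumption)
trajectory = diffusion_forward_chain(x, betas)

print(z.shape)               # (4,)
print(len(trajectory))       # 1001
print(trajectory[-1].shape)  # (16,)
```

The VAE produces one low-dimensional latent through weights that would be trained, while the diffusion chain produces \(T\) latents of the input's size using nothing but a noise schedule.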

2. Dimensionality and the Information Bottleneck

The structure of the latent variable \(\mathbf{z}\) defines what the model is forced to learn.

VAE (Information Compression): VAEs use a strict information bottleneck. The latent vector \(\mathbf{z}\) typically has a much lower dimension than the input image \(\mathbf{x}\), so the model is forced to compress the image into a dense, semantic representation. This is why VAE latent spaces tend to be smooth and well suited to interpolating between concepts (e.g., blending two faces).
Diffusion (Dimensionality Preservation): Diffusion models do not compress the data spatially. Every latent variable \(\mathbf{x}_1, \dots, \mathbf{x}_T\) has exactly the same dimensionality as the original image \(\mathbf{x}_0\). Instead of compressing data into a smaller vector, diffusion gradually destroys the mutual information between the original image and the latent state by adding noise, until \(\mathbf{x}_T\) is nearly indistinguishable from pure Gaussian noise.
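
A small numerical sketch of this information loss, assuming the standard Gaussian forward process with a linear \(\beta_t\) schedule (the schedule values and the toy data are assumptions). It samples \(\mathbf{x}_t\) directly from \(\mathbf{x}_0\) using the cumulative product \(\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)\) and tracks how correlated \(\mathbf{x}_t\) remains with \(\mathbf{x}_0\):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule (an assumption)
alpha_bars = np.cumprod(1.0 - betas)       # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(10_000)           # toy data vector with roughly unit variance

for t in (1, 250, 500, 1000):
    xt = q_sample(x0, t, rng)
    corr = np.corrcoef(x0, xt)[0, 1]       # how much of x0 is still visible in x_t
    # Columns: timestep, shape of x_t (never changes), alpha_bar_t, correlation with x0
    print(t, xt.shape, float(alpha_bars[t - 1]), float(corr))
```

The shape column stays fixed while the correlation falls toward zero: the "bottleneck" in diffusion is noise, not dimensionality.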

3. The Reverse Process (The "Decoder")

The generative step involves mapping from the latent space back to the data space, represented as \(p_\theta(\mathbf{x}|\mathbf{z})\).

VAE (One Massive Leap): The VAE decoder \(p_\theta(\mathbf{x}|\mathbf{z})\) must map a low-dimensional, highly compressed vector all the way back to a high-dimensional image in a single pass. This mapping is very hard to learn well, and it is one commonly cited reason VAEs tend to produce blurry images: they capture global structure but struggle to reconstruct high-frequency, fine-grained details in one shot.
Diffusion (Many Tiny Steps): Diffusion models break generation into hundreds or thousands of small, iterative steps: \(p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\). Because each step only has to remove a small amount of Gaussian noise, the neural network's job at any individual timestep is far simpler. Composing many simple Gaussian steps yields a highly expressive data distribution, which is why diffusion samples are typically much sharper and more detailed.
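
The iterative reverse process can be sketched as DDPM-style ancestral sampling. This is a minimal illustration, assuming the reverse variance is fixed to \(\beta_t\) and a linear schedule; `make_oracle` is a hypothetical stand-in for a trained noise-prediction network, built for a dataset containing a single point so that the exact noise is known.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from noise, apply p_theta(x_{t-1} | x_t) for t = T..1."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for i in reversed(range(len(betas))):               # i = t - 1
        eps_hat = eps_model(x, i)
        # Mean of the Gaussian reverse step, written in terms of the predicted noise.
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        if i > 0:
            # Assumed reverse variance: sigma_t^2 = beta_t.
            x = mean + np.sqrt(betas[i]) * rng.standard_normal(shape)
        else:
            x = mean                                    # no noise on the final step
    return x

def make_oracle(x0, betas):
    """Hypothetical 'perfect network': the exact noise when the dataset is the single point x0."""
    alpha_bars = np.cumprod(1.0 - betas)
    def eps_model(x_t, i):
        return (x_t - np.sqrt(alpha_bars[i]) * x0) / np.sqrt(1.0 - alpha_bars[i])
    return eps_model

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)                    # assumed schedule, T = 200
x0 = np.array([1.0, -2.0, 0.5])
sample = ddpm_sample(make_oracle(x0, betas), x0.shape, betas, rng)
print(np.allclose(sample, x0))  # True: with exact noise, the last step lands on x0
```

Each loop iteration is a single small Gaussian update. In a real model, `eps_model` would be a trained network and the loop would produce a new sample rather than recover a known point.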

4. How the ELBO Manifests in the Loss

Both losses start from the general ELBO stated at the top of this note. How its two terms turn into a training objective differs entirely:

The VAE Trade-off: The VAE loss is an explicit tug-of-war. The first term of the ELBO is the reconstruction term (make the output look like the input). The second term is the KL penalty (pull the learned latent distribution toward a standard normal prior \(\mathcal{N}(\mathbf{0}, \mathbf{I})\)). If training weights reconstruction too heavily, the latent space becomes disjointed and samples drawn from the prior decode poorly. If it weights the prior too heavily, the latent carries little information about the input and the images become blurry.
The Diffusion Decomposition: Because the diffusion forward process is a fixed sequence of Gaussians, and the reverse process is also modeled with Gaussians, the ELBO decomposes into a sum of KL divergences, one per timestep \(t\), plus a reconstruction term at the first step and a prior-matching term at \(\mathbf{x}_T\) that has no trainable parameters. With the variances held fixed, the KL divergence between two Gaussians reduces to a weighted squared distance between their means. Reparameterizing those means in terms of the added noise and dropping the per-timestep weights gives the familiar step-wise noise-prediction objective: \(\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\). The VAE's "reconstruction vs. prior" tug-of-war largely disappears, because there is no learned encoder to pull toward the prior.
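
The two objectives can be sketched side by side. This is a simplified illustration, not a full training loop: the VAE loss assumes a unit-variance Gaussian decoder (so the reconstruction term is half the squared error up to a constant) and a diagonal Gaussian encoder, and the diffusion loss is the unweighted noise-prediction form with an assumed linear schedule. `zero_model` is a hypothetical placeholder for a network.

```python
import numpy as np

def vae_negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for one example: reconstruction term + KL(q_phi(z|x) || N(0, I)).

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) is 0.5 * squared error
    up to an additive constant.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)  # closed-form Gaussian KL
    return recon + kl, recon, kl

def diffusion_simple_loss(eps_model, x0, alpha_bars, rng):
    """Unweighted noise-prediction loss at one randomly chosen timestep."""
    i = rng.integers(len(alpha_bars))                   # random timestep index
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[i]) * x0 + np.sqrt(1.0 - alpha_bars[i]) * eps
    return np.mean((eps - eps_model(x_t, i)) ** 2)

rng = np.random.default_rng(0)
x = np.array([0.2, -1.0, 0.7])

# VAE: perfect reconstruction, but the posterior mean sits away from the prior.
total, recon, kl = vae_negative_elbo(x, x_recon=x, mu=np.ones(2), logvar=np.zeros(2))
print(recon, kl)   # 0.0 1.0

# Diffusion: a single regression target, with no separate prior term to balance.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
zero_model = lambda x_t, i: np.zeros_like(x_t)          # hypothetical untrained predictor
print(diffusion_simple_loss(zero_model, x, alpha_bars, rng))
```

In the VAE case the two returned terms can move in opposite directions as the encoder changes, which is the tug-of-war described above. The diffusion loss has one term, and it reaches zero only when the predicted noise matches the noise that was actually added.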

In summary, VAEs ask a network to map between complex domains in a single compressed leap, which makes them fast but prone to blurry outputs. Diffusion removes the bottleneck and breaks generation into a long chain of simple Gaussian steps, trading sampling speed for a tractable training objective and fine detail.

A natural next step is to look at how these two contrasting paradigms are merged in modern architectures such as Latent Diffusion Models (e.g., Stable Diffusion).

Jipeng Sun
