Proximal Algorithms

Proximal algorithms are a class of optimization methods designed to handle objective functions that are non-smooth, constrained, or split into multiple terms. They are exceptionally useful in solving ill-posed inverse problems where you need to balance a smooth data fidelity term with a non-smooth regularization term.

Here is the mathematical breakdown of how they work.

1. The Proximal Operator: Definition

Let \(f: \mathbb{R}^n \to \mathbb{R} \cup {+\infty}\) be a closed proper convex function. The proximal operator of \(f\) with parameter \(\lambda > 0\), denoted as \(\text{prox}_{\lambda f}(v)\), is defined as:

\text{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)

Intuition:

This operator takes a point \(v\) and returns a new point \(x\) that balances two competing goals:

Minimize \(f(x)\): We want the new point to reduce the value of our target function.
Stay close to \(v\): The penalty term \(\frac{1}{2\lambda} \|x - v\|_2^2\) forces the new point \(x\) to not stray too far from the input \(v\).

The parameter \(\lambda\) acts as a trade-off parameter or "step size." As \(\lambda \to 0\), the penalty dominates, and \(\text{prox}_{\lambda f}(v) \to v\). As \(\lambda \to \infty\), the operator simply returns the global minimum of \(f(x)\).

2. Connection to Implicit Gradient Descent

To understand why this is a generalized form of gradient descent, assume for a moment that \(f\) is smooth and differentiable. To find the minimum of the proximal objective, we take the gradient with respect to \(x\) and set it to zero:

\nabla f(x) + \frac{1}{\lambda}(x - v) = 0

Rearranging this gives:

x = v - \lambda \nabla f(x)

Notice that the gradient is evaluated at the *future* point \(x\), not the current point \(v\). This means the proximal operator is equivalent to a backward (implicit) Euler step. Because it is implicit, it is inherently more stable than standard forward gradient descent, allowing it to easily handle non-smooth functions where standard gradients are undefined.

3. A Classic Example: \(L_1\) Regularization

In computational imaging, \(L_1\) sparsity priors are foundational for recovering signals. If we set \(f(x) = \|x\|_1\), the proximal operator has a beautifully simple closed-form solution known as the soft-thresholding operator:

\text{prox}_{\lambda \| \cdot \|_1}(v)_i = \text{sign}(v_i) \max(|v_i| - \lambda, 0)

Instead of just scaling values down, this operation cleanly shrinks small values (representing noise or negligible artifacts) exactly to zero, actively promoting sparsity.

4. The Proximal Gradient Method (Forward-Backward Splitting)

In practice, we rarely minimize just \(f(x)\). We usually want to minimize a composite objective function typical of inverse problems:

\min_x g(x) + h(x)

Where:

\(g(x)\) is convex and differentiable (e.g., a data fidelity term like \(\|Ax - y\|_2^2\)).
\(h(x)\) is convex but non-differentiable (e.g., a regularization term like Total Variation or an \(L_1\) norm).

We cannot use standard gradient descent because \(h\) is non-smooth. We cannot easily use the proximal operator on the whole \(g + h\) function because it might lack a closed-form solution.

The Proximal Gradient Method solves this by splitting the problem. It updates \(x\) iteratively using the following rule:

x^{(k+1)} = \text{prox}_{\lambda h} \left( x^{(k)} - \lambda \nabla g(x^{(k)}) \right)

This is a two-step process:

Forward Step (Gradient): Take a standard explicit gradient descent step on the smooth data fidelity term \(g\) to get an intermediate point \(z = x^{(k)} - \lambda \nabla g(x^{(k)})\).
Backward Step (Proximal): Apply the proximal operator of the non-smooth prior \(h\) to that intermediate point \(z\).

Modern Context: Generative Priors

When bridging classical optimization with modern generative AI, this mathematical framework remains critical. In architectures like Plug-and-Play (PnP) networks, the formal proximal operator \(\text{prox}_{\lambda h}\) is entirely swapped out for an off-the-shelf denoiser, such as a neural network or a step in a diffusion model. The proximal framework provides the theoretical scaffolding that allows these complex, learned models to act as mathematically sound regularizers for noisy image reconstruction.

Jipeng Sun

Home

Research

Knowledge Share

Teaching