Proximal Algorithms

← Back to Knowledge Share

Proximal algorithms are a class of optimization methods designed to handle objective functions that are non-smooth, constrained, or split into multiple terms. They are exceptionally useful in solving ill-posed inverse problems where you need to balance a smooth data fidelity term with a non-smooth regularization term.

Here is the mathematical breakdown of how they work.

1. The Proximal Operator: Definition

Let \(f: \mathbb{R}^n \to \mathbb{R} \cup {+\infty}\) be a closed proper convex function. The proximal operator of \(f\) with parameter \(\lambda > 0\), denoted as \(\text{prox}_{\lambda f}(v)\), is defined as:

\[\text{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)\]

Intuition:

This operator takes a point \(v\) and returns a new point \(x\) that balances two competing goals:

  1. Minimize \(f(x)\): We want the new point to reduce the value of our target function.
  2. Stay close to \(v\): The penalty term \(\frac{1}{2\lambda} \|x - v\|_2^2\) forces the new point \(x\) to not stray too far from the input \(v\).

The parameter \(\lambda\) acts as a trade-off parameter or "step size." As \(\lambda \to 0\), the penalty dominates, and \(\text{prox}_{\lambda f}(v) \to v\). As \(\lambda \to \infty\), the operator simply returns the global minimum of \(f(x)\).

2. Connection to Implicit Gradient Descent

To understand why this is a generalized form of gradient descent, assume for a moment that \(f\) is smooth and differentiable. To find the minimum of the proximal objective, we take the gradient with respect to \(x\) and set it to zero:

\[\nabla f(x) + \frac{1}{\lambda}(x - v) = 0\]

Rearranging this gives:

\[x = v - \lambda \nabla f(x)\]

Notice that the gradient is evaluated at the *future* point \(x\), not the current point \(v\). This means the proximal operator is equivalent to a backward (implicit) Euler step. Because it is implicit, it is inherently more stable than standard forward gradient descent, allowing it to easily handle non-smooth functions where standard gradients are undefined.

3. A Classic Example: \(L_1\) Regularization

In computational imaging, \(L_1\) sparsity priors are foundational for recovering signals. If we set \(f(x) = \|x\|_1\), the proximal operator has a beautifully simple closed-form solution known as the soft-thresholding operator:

\[\text{prox}_{\lambda \| \cdot \|_1}(v)_i = \text{sign}(v_i) \max(|v_i| - \lambda, 0)\]

Instead of just scaling values down, this operation cleanly shrinks small values (representing noise or negligible artifacts) exactly to zero, actively promoting sparsity.

4. The Proximal Gradient Method (Forward-Backward Splitting)

In practice, we rarely minimize just \(f(x)\). We usually want to minimize a composite objective function typical of inverse problems:

\[\min_x g(x) + h(x)\]

Where:

  • \(g(x)\) is convex and differentiable (e.g., a data fidelity term like \(\|Ax - y\|_2^2\)).
  • \(h(x)\) is convex but non-differentiable (e.g., a regularization term like Total Variation or an \(L_1\) norm).

We cannot use standard gradient descent because \(h\) is non-smooth. We cannot easily use the proximal operator on the whole \(g + h\) function because it might lack a closed-form solution.

The Proximal Gradient Method solves this by splitting the problem. It updates \(x\) iteratively using the following rule:

\[x^{(k+1)} = \text{prox}_{\lambda h} \left( x^{(k)} - \lambda \nabla g(x^{(k)}) \right)\]

This is a two-step process:

  1. Forward Step (Gradient): Take a standard explicit gradient descent step on the smooth data fidelity term \(g\) to get an intermediate point \(z = x^{(k)} - \lambda \nabla g(x^{(k)})\).
  2. Backward Step (Proximal): Apply the proximal operator of the non-smooth prior \(h\) to that intermediate point \(z\).

Modern Context: Generative Priors

When bridging classical optimization with modern generative AI, this mathematical framework remains critical. In architectures like Plug-and-Play (PnP) networks, the formal proximal operator \(\text{prox}_{\lambda h}\) is entirely swapped out for an off-the-shelf denoiser, such as a neural network or a step in a diffusion model. The proximal framework provides the theoretical scaffolding that allows these complex, learned models to act as mathematically sound regularizers for noisy image reconstruction.