Analytical Proximal Operators
As a quick refresher, the proximal operator of a scaled convex function \(\lambda f(x)\) evaluated at a point \(v\) is defined as:
\[\text{prox}_{\lambda f}(v) = \underset{x}{\text{argmin}} \left( f(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)\]
Here are the derivations for three of the most fundamental proximal operators used in inverse problems and imaging: the L2 norm (Ridge), the L1 norm (Lasso/Sparsity), and the Indicator function (Physical Constraints).
1. The L2 Norm Squared (Tikhonov / Ridge Regularization)
This is the easiest derivation because the objective function is smooth and strictly convex everywhere, so we can just use standard vector calculus.
Let \(f(x) = \frac{1}{2}\|x\|_2^2\). Our goal is to find the minimum of the objective function \(J(x)\):
\[J(x) = \frac{1}{2}\|x\|_2^2 + \frac{1}{2\lambda}\|x - v\|_2^2\]
Derivation:
Because \(J(x)\) is differentiable, the minimum occurs exactly where the gradient with respect to \(x\) is zero.
\[\nabla J(x) = x + \frac{1}{\lambda}(x - v) = 0\]
Now, we just solve for \(x\):
\[x + \frac{1}{\lambda}x - \frac{1}{\lambda}v = 0\]
\[x \left(1 + \frac{1}{\lambda}\right) = \frac{1}{\lambda}v\]
\[x \left(\frac{\lambda + 1}{\lambda}\right) = \frac{1}{\lambda}v\]
\[x = \frac{1}{\lambda + 1} v\]
Result:
\[\text{prox}_{\lambda (\frac{1}{2}\|\cdot\|_2^2)}(v) = \frac{1}{\lambda + 1} v\]
*Intuition:* The L2 proximal operator simply scales the input vector \(v\) down toward zero. It shrinks everything proportionally without ever forcing any value to become exactly zero.
2. The L1 Norm (Soft Thresholding for Sparsity)
This is the workhorse of sparse optimization. Let \(f(x) = \|x\|_1\).
Because the L1 norm (\(\sum |x_i|\)) and the L2 distance squared (\(\sum (x_i - v_i)^2\)) are both separable, we can break this massive vector problem down and solve it for a single scalar coordinate \(x_i\) at a time.
\[J(x_i) = |x_i| + \frac{1}{2\lambda}(x_i - v_i)^2\]
Derivation:
The absolute value function \(|x_i|\) is not differentiable at \(x_i = 0\). Instead of a gradient, we must use the subdifferential (\(\partial\)). The optimality condition dictates that \(0\) must be in the subdifferential of \(J(x_i)\):
\[0 \in \partial |x_i| + \frac{1}{\lambda}(x_i - v_i)\]
The subdifferential of the absolute value is:
- \(\partial |x_i| = {1}\) if \(x_i > 0\)
- \(\partial |x_i| = {-1}\) if \(x_i < 0\)
- \(\partial |x_i| = [-1, 1]\) if \(x_i = 0\)
We analyze this in three cases to solve for \(x_i\):
Case 1: Assume \(x_i > 0\)
\[1 + \frac{1}{\lambda}(x_i - v_i) = 0 \implies x_i = v_i - \lambda\]
For our assumption (\(x_i > 0\)) to be true, it must be that \(v_i - \lambda > 0\), meaning \(v_i > \lambda\).
Case 2: Assume \(x_i < 0\)
\[-1 + \frac{1}{\lambda}(x_i - v_i) = 0 \implies x_i = v_i + \lambda\]
For this assumption to be true, \(v_i + \lambda < 0\), meaning \(v_i < -\lambda\).
Case 3: Assume \(x_i = 0\)
\[0 \in [-1, 1] + \frac{1}{\lambda}(0 - v_i) \implies \frac{v_i}{\lambda} \in [-1, 1] \implies v_i \in [-\lambda, \lambda]\]
This means if the incoming data \(v_i\) is trapped between \(-\lambda\) and \(\lambda\), the optimal choice is to snap it exactly to zero.
Result (The Soft Thresholding Operator):
\[\text{prox}_{\lambda \|\cdot\|_1}(v_i) = \text{sign}(v_i) \max(|v_i| - \lambda, 0)\]
*Intuition:* Unlike L2 which shrinks proportionally, L1 lops off a flat amount (\(\lambda\)) from the magnitude of the signal. If the signal is too weak (below \(\lambda\)), it gets completely crushed to zero, inducing exact sparsity.
3. The Indicator Function (Euclidean Projection)
Often in physical systems (like measuring light intensity in optics), we know our signal \(x\) must belong to a strictly defined convex set \(C\) (e.g., all values must be non-negative).
We enforce this using an indicator function:
- \(I_C(x) = 0\) if \(x \in C\)
- \(I_C(x) = \infty\) if \(x \notin C\)
Let \(f(x) = I_C(x)\).
Derivation:
Plug the indicator function into the proximal definition:
\[\text{prox}_{I_C}(v) = \underset{x}{\text{argmin}} \left( I_C(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right)\]
Because \(I_C(x)\) acts as an infinite penalty outside of \(C\) and contributes nothing (\(0\)) inside of \(C\), the minimization problem is strictly restricted to the domain of \(C\). The \(\frac{1}{2\lambda}\) scaling factor becomes irrelevant since we are just minimizing distance.
The equation simplifies entirely to:
\[\text{prox}_{I_C}(v) = \underset{x \in C}{\text{argmin}} \|x - v\|_2^2\]
Result:
\[\text{prox}_{I_C}(v) = \text{P}_C(v)\]
*Intuition:* The proximal operator of an indicator function is simply the Euclidean projection onto that set. For example, if \(C\) is the non-negative orthant (representing physical light), the proximal operator simply takes all negative values in \(v\) and sets them to \(0\), leaving positive values untouched.