ADMM (Alternating Direction Method of Multipliers)
The Alternating Direction Method of Multipliers (ADMM) is a powerful algorithm that solves convex optimization problems by breaking them into smaller, more manageable pieces. It combines the decomposability of dual ascent with the robust convergence properties of the method of multipliers.
It is particularly useful for solving large-scale inverse problems, such as those found in computational imaging, where you need to balance a complex forward model with a non-smooth regularization prior.
Here is the mathematical breakdown of how it works.
1. The Standard Problem Formulation
ADMM tackles problems that can be written in the following structural form:
\[\min_{x, z} f(x) + g(z)\]
Subject to the linear constraint:
\[Ax + Bz = c\]
- \(x\) and \(z\): The variables we want to optimize. We split the problem into two distinct variables to handle different parts of the objective function separately.
- \(f(x)\): Typically represents the empirical loss or data fidelity term (e.g., how well an image reconstructs from sensor measurements).
- \(g(z)\): Typically represents a regularization term or prior (e.g., sparsity, total variation, or an implicitly defined generative prior).
- \(A, B, c\): Matrices and vectors that define the linear relationship enforcing consistency between \(x\) and \(z\). Often, \(A = I\), \(B = -I\), and \(c = 0\), enforcing \(x = z\).
2. The Augmented Lagrangian
To solve constrained optimization problems, we typically use a Lagrangian. ADMM uses an Augmented Lagrangian, which adds a quadratic penalty term to the standard Lagrangian to enforce strict convexity and ensure the algorithm converges even if \(f\) or \(g\) are not strictly convex.
The Augmented Lagrangian is defined as:
\[L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2\]
- \(y\): The dual variable (or Lagrange multiplier).
- \(\rho > 0\): The penalty parameter. A higher \(\rho\) enforces the constraint \(Ax + Bz = c\) more aggressively.
3. The ADMM Updates
Instead of minimizing \(x\) and \(z\) jointly (which is often computationally intractable or mathematically impossible due to non-smoothness), ADMM alternates the minimization. At iteration \(k\), the algorithm performs three distinct steps:
Step 1: The \(x\)-update (Fidelity Step)
We minimize the augmented Lagrangian with respect to \(x\), keeping \(z\) and \(y\) fixed at their values from the previous iteration:
\[x^{k+1} = \underset{x}{\text{argmin}} \ L_\rho(x, z^k, y^k)\]
Step 2: The \(z\)-update (Prior/Regularization Step)
We then minimize with respect to \(z\), using the *freshly updated* \(x^{k+1}\), and keeping \(y\) fixed:
\[z^{k+1} = \underset{z}{\text{argmin}} \ L_\rho(x^{k+1}, z, y^k)\]
Step 3: The \(y\)-update (Dual Update)
Finally, we perform a gradient ascent step on the dual variable \(y\) to penalize the residual error of the constraint:
\[y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)\]
4. The Scaled Form (Practical Implementation)
In practice, the equations are usually simplified by combining the linear and quadratic terms into a single squared L2 norm. We define a scaled dual variable \(u = \frac{y}{\rho}\).
The residual is \(r = Ax + Bz - c\). By completing the square, the ADMM updates become much cleaner to implement:
- \(x\)-update:
\[x^{k+1} = \underset{x}{\text{argmin}} \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right)\]
- \(z\)-update:
\[z^{k+1} = \underset{z}{\text{argmin}} \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right)\]
- \(u\)-update:
\[u^{k+1} = u^k + (Ax^{k+1} + Bz^{k+1} - c)\]
Why ADMM is Elegant
The beauty of ADMM lies in the \(z\)-update. Because of the quadratic penalty \(\frac{\rho}{2}\| \cdot \|_2^2\), the \(z\)-update frequently takes the form of a Proximal Operator:
\[z^{k+1} = \text{prox}_{g/\rho}(v) = \underset{z}{\text{argmin}} \left( g(z) + \frac{\rho}{2}\|z - v\|_2^2 \right)\]
If \(g(z)\) is an L1 norm (for sparsity), this step reduces to simple, computationally cheap soft-thresholding. In modern frameworks like Plug-and-Play (PnP) ADMM, the proximal operator for the prior \(g(z)\) doesn't even need to be explicitly defined analytically; it can be entirely replaced by an off-the-shelf denoiser, such as a diffusion model or a flow matching network.
Where does \(y^T\) and \(u\) come from?
1. Where does \(y^T\) come from? (\(\lambda\) vs. \(y\))
In the simple 1D example, the multiplier is just a single number, \(\lambda\).
However, in ADMM, we are usually solving large-scale problems where \(x\) and \(z\) are vectors (for example, \(x\) might be an image flattened into a 1 million-pixel column vector). Because \(x\) and \(z\) are vectors, the constraint \(Ax + Bz = c\) actually represents thousands or millions of individual equations.
Because there are multiple constraints, we need a separate Lagrange multiplier for *each* equation.
- Instead of a single number \(\lambda\), we use a column vector of multipliers, which is conventionally named \(y\) (though some textbooks still write it as a bold \(\boldsymbol{\lambda}\)).
- To multiply two vectors together to calculate the total penalty (a single scalar number), we take their dot product.
- In linear algebra, the dot product of a vector \(y\) and the constraint residual \((Ax + Bz - c)\) is written by transposing \(y\) into a row vector.
Therefore, \(\lambda \cdot h(x)\) in 1D becomes \(y^T(Ax + Bz - c)\) in multiple dimensions. They mean the exact same thing: "price \(\times\) constraint violation."
2. Why is \(u\) written that way? (Completing the Square)
The scaled dual variable \(u = \frac{y}{\rho}\) is created using an algebraic trick called completing the square.
When we minimize \(x\) and \(z\) in the standard Augmented Lagrangian, we have to deal with two separate constraint terms: the linear term (\(y^T\)) and the squared penalty term (\(\frac{\rho}{2}\)). Dealing with both is annoying when writing algorithms. We want to combine them into a single, clean squared term.
Let's define our constraint residual as \(r = Ax + Bz - c\).
Here are the two terms from the Augmented Lagrangian that we want to combine:
\[y^T r + \frac{\rho}{2}\|r\|_2^2\]
Step 1: Factor out \(\frac{\rho}{2}\)
\[\frac{\rho}{2} \left( \frac{2}{\rho} y^T r + \|r\|_2^2 \right)\]
Step 2: Define \(u\)
Let's define a new variable \(u = \frac{y}{\rho}\). This means \(y = \rho u\).
Substitute this into our equation:
\[\frac{\rho}{2} \left( 2u^T r + \|r\|_2^2 \right)\]
Step 3: Complete the Square
To turn this into a perfect square, we add and subtract \(\|u\|_2^2\) inside the parentheses:
\[\frac{\rho}{2} \left( \|r\|_2^2 + 2u^T r + \|u\|_2^2 - \|u\|_2^2 \right)\]
The first three terms now perfectly factor into \(\|r + u\|_2^2\):
\[\frac{\rho}{2} \|r + u\|_2^2 - \frac{\rho}{2}\|u\|_2^2\]
Why this is mathematically brilliant
Look at that final expression. The second term, \(-\frac{\rho}{2}\|u\|_2^2\), does not contain \(x\) or \(z\) at all.
Because the ADMM \(x\)-update and \(z\)-update steps are argmin operations (meaning "find the \(x\) that minimizes this equation"), we can completely throw away any terms that act as constants.
So, by defining \(u = \frac{y}{\rho}\), we can replace the messy linear and quadratic constraint terms with a single, elegant squared norm:
\[\frac{\rho}{2} \|Ax + Bz - c + u\|_2^2\]
This makes calculating the gradients and proximal operators in your code significantly easier.
Why ADMM works?
1. The "Impossible" Joint Update
If we wanted to be mathematically perfect, we wouldn't use ADMM. We would use the standard Method of Multipliers. In that method, you update \(x\) and \(z\) *jointly* at the exact same time:
\[(x^{k+1}, z^{k+1}) = \underset{x, z}{\text{argmin}} \ L_\rho(x, z, y^k)\]
The Problem: The squared penalty term in the Augmented Lagrangian is \(\frac{\rho}{2}\|Ax + Bz - c\|_2^2\). If you expand this square, you get cross-terms multiplying \(x\) and \(z\) together. Because \(x\) and \(z\) are mathematically tangled up in this quadratic term, you cannot easily split the function. For large problems, finding a joint minimum for \(f(x)\) and \(g(z)\) simultaneously is computationally impossible or hopelessly slow.
2. The ADMM "Tug-of-War" (Gauss-Seidel Splitting)
ADMM abandons the impossible joint update and instead uses a concept similar to Gauss-Seidel iteration (or coordinate descent).
Instead of moving \(x\) and \(z\) at the same time, we freeze \(z\) and take a step for \(x\). Then, we freeze our new \(x\) and take a step for \(z\).
Here is why this specific three-step sequence guarantees convergence:
Step 1: The \(x\)-update (Listen to the Data)
\[x^{k+1} = \underset{x}{\text{argmin}} \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right)\]
- What it does: We freeze the prior (\(z\)) and the multiplier (\(u\)). We then find the \(x\) that minimizes our main objective \(f(x)\) (usually a data fidelity term, like matching an image to sensor measurements), while trying not to stray too far from our last known prior \(z^k\).
- Why it helps: It pulls the solution toward physical reality (the measured data).
Step 2: The \(z\)-update (Listen to the Prior)
\[z^{k+1} = \underset{z}{\text{argmin}} \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right)\]
- What it does: We take our freshly calculated data-driven solution \(x^{k+1}\), freeze it, and now find the \(z\) that satisfies our regularization \(g(z)\) (like forcing the image to be sharp or sparse).
- Why it helps: It cleans up the result from Step 1. It acts as a projection or denoising step, pulling the solution toward our mathematical ideal.
Step 3: The \(u\)-update (The Feedback Controller)
\[u^{k+1} = u^k + (Ax^{k+1} + Bz^{k+1} - c)\]
- What it does: This is a gradient ascent step on the dual variable. The term \((Ax^{k+1} + Bz^{k+1} - c)\) is the exact error (residual) between our data-step (\(x\)) and our prior-step (\(z\)). We add this error directly to \(u\).
- Why it leads to the solution: This acts like an Integral (I) controller in a PID feedback loop. If \(x\) and \(z\) disagree, the error accumulates in \(u\). In the next iteration, this large \(u\) acts as a massive penalty in the \(x\) and \(z\) update equations, violently forcing them to move closer together.
Why It Converges (The Saddle Point)
For convex problems, optimization theory (specifically, KKT conditions) guarantees that a global optimum is reached when two things happen simultaneously:
- Primal Feasibility: The constraint is perfectly satisfied (\(Ax + Bz - c = 0\)).
- Dual Feasibility: We have found the absolute minimum of the objective functions \(f(x) + g(z)\).
ADMM mathematically guarantees this will happen. The alternating \(x\) and \(z\) steps constantly push the objective value down (minimizing \(f\) and \(g\)), while the \(u\) step constantly pushes the constraint error to zero.
Eventually, \(x\) and \(z\) are forced to perfectly agree, the residual drops to zero, \(u\) stops changing, and the algorithm locks into the exact optimal saddle point.
Primal-Dual Explanation
In optimization, every problem actually exists in two parallel universes: the Primal and the Dual.
- The Primal Variables (\(x\) and \(z\)): These represent the actual, physical quantities you are trying to solve for. In computational imaging, for instance, \(x\) and \(z\) are the literal pixels of the image you are trying to reconstruct. The "Primal Problem" is your original goal: *minimize the error of the image.*
- The Dual Variable (\(y\) or \(u\)): This is the "shadow" variable. It does not represent pixels or physical data. It represents the mathematical force (or price) required to enforce your constraints. The "Dual Problem" is the perspective of the rule-maker: *maximize the penalty for breaking the rules.*
Because the Primal wants to minimize (roll to the bottom of the valley) and the Dual wants to maximize (climb to the top of the hill), the perfect solution exists at a saddle point—the exact coordinate where the primal forces and dual forces perfectly balance each other out.