COLMAP

COLMAP is an end-to-end pipeline for Structure-from-Motion (SfM) and Multi-View Stereo (MVS). It takes a collection of 2D images and mathematically reconstructs the 3D geometry of the scene alongside the exact positions and intrinsic properties of the cameras used to capture it.

Here is the mathematical engine driving each stage of the COLMAP pipeline.

1. The Camera Projection Model

The foundation of COLMAP is mapping a 3D world coordinate \(X = [X, Y, Z, 1]^T\) to a 2D pixel coordinate \(x = [u, v, 1]^T\) on an image plane. This is governed by the standard pinhole camera projection model:

x \sim K [R | t] X

Extrinsics (\(R, t\)): The rotation matrix \(R \in SO(3)\) and translation vector \(t \in \mathbb{R}^3\) map the point from world coordinates to the local camera coordinate system.
Intrinsics (\(K\)): The calibration matrix containing the focal length and principal point, which projects the 3D camera coordinates onto the 2D image plane.

Because physical lenses are imperfect, COLMAP rigorously models complex non-linear lens distortions (both radial and tangential) to map these ideal projected points to the actual observed pixel coordinates.

2. Two-View Geometry (Initialization)

Before building a global map, COLMAP must determine the relative mathematical relationship between pairs of images. By extracting and matching 2D features (like SIFT) across two images, it establishes correspondences.

For two calibrated cameras viewing the exact same 3D point, their normalized image coordinates \(x_1\) and \(x_2\) are strictly bound by the Epipolar Constraint:

\[x_2^T E x_1 = 0\]

Here, \(E\) is the Essential Matrix, mathematically defined as \(E = [t]_\times R\), where \([t]_\times\) is the skew-symmetric matrix of the translation vector. COLMAP robustly solves for \(E\) using algorithms like the 5-point algorithm within a RANSAC loop. Decomposing the Essential Matrix yields the relative rotation and translation between those two specific views.

3. Triangulation

Once the relative camera projection matrices \(P_1 = K_1[I | 0]\) and \(P_2 = K_2[R | t]\) are known, the depth of the matched features can be recovered. Since the projected point \(x_i\) must lie on the line created by \(P_i X\), their cross product is zero:

x_i \times (P_i X) = 0

Expanding this across multiple views yields a system of linear equations in the form \(AX = 0\). COLMAP solves this using Singular Value Decomposition (SVD) via the Direct Linear Transform (DLT) method to estimate the initial 3D position of point \(X\).

4. Bundle Adjustment (The SfM Core)

As COLMAP incrementally registers more cameras and triangulates more points, tiny numerical errors accumulate into massive drift. Bundle Adjustment (BA) is the massive non-linear optimization engine that solves this. It jointly refines all camera parameters \(\Theta\) (both extrinsics and intrinsics) and 3D point coordinates \(X\) by minimizing the total Reprojection Error:

\min_{\Theta, X} \sum_{i=1}^{N} \sum_{j=1}^{M} v_{ij} \| x_{ij} - \pi(\Theta_i, X_j) \|^2

\(v_{ij}\) is an indicator variable (1 if point \(j\) is visible in camera \(i\), 0 otherwise).
\(x_{ij}\) is the actual 2D pixel observation.
\(\pi(\Theta_i, X_j)\) is the mathematical projection of our estimated 3D point into the estimated camera model.

This is a massive, highly non-convex least-squares problem, which COLMAP solves efficiently using the Levenberg-Marquardt algorithm alongside Schur complement tricks to exploit the sparse structure of the problem.

5. Multi-View Stereo (Dense Reconstruction)

SfM outputs a sparse point cloud and the camera poses. To generate dense, continuous surfaces, COLMAP utilizes Multi-View Stereo (MVS), specifically a technique called PatchMatch Stereo.

For every single pixel in a reference image, MVS estimates a depth \(d\) and a surface normal \(n\). It tests these hypotheses by warping a small patch of pixels from a reference image to a neighboring source image using a homography matrix \(H\):

H = K_s \left( R_{sr} - \frac{t_{sr} n^T}{d} \right) K_r^{-1}

Where \(R_{sr}\) and \(t_{sr}\) represent the relative transformation between the reference (\(r\)) and source (\(s\)) cameras. The algorithm iteratively optimizes \(d\) and \(n\) to maximize the photometric consistency—usually measured by the Normalized Cross-Correlation (NCC)—between the reference patch and the homography-warped source patch.

Epipolar Constraint

At its core, the equation \(x_2^T E x_1 = 0\) is a purely geometric constraint expressed algebraically. It states that an observed 3D point and the optical centers of the two cameras viewing it must all lie on the exact same 2D plane in 3D space. This is known as the coplanarity constraint.

Here is the step-by-step mathematical derivation of why this holds true.

1. The Geometric Setup

Imagine a 3D point \(X\). It is observed by Camera 1 and Camera 2.

Let Camera 1 be the origin of our coordinate system, with its optical center at \(O_1 = [0, 0, 0]^T\).
Let Camera 2 have an optical center at \(O_2\).
The coordinate system of Camera 2 is related to Camera 1 by a rigid body transformation: a rotation matrix \(R\) and a translation vector \(t\). Note that \(t\) is effectively the baseline vector pointing from \(O_1\) to \(O_2\).

The three points—\(O_1\), \(O_2\), and \(X\)—form a triangle in 3D space. The plane containing this triangle is called the epipolar plane.

2. The Algebraic Derivation

Let \(X_1\) be the 3D coordinates of point \(X\) relative to Camera 1.

Let \(X_2\) be the 3D coordinates of the same point \(X\) relative to Camera 2.

Because both vectors point to the same physical location in space, we can map \(X_1\) into Camera 2's coordinate system using our rigid transformation:

\[X_2 = R X_1 + t\]

To isolate the geometric relationship, we want to eliminate the addition of \(t\). We can do this by taking the cross product of \(t\) on both sides of the equation.

t \times X_2 = t \times (R X_1 + t)

Since the cross product distributes over addition, and the cross product of any vector with itself is zero (\(t \times t = 0\)), this simplifies to:

t \times X_2 = t \times R X_1

Geometrically, the vector \((t \times R X_1)\) is perpendicular to both the translation vector \(t\) and the rotated ray \(R X_1\). Therefore, this cross product represents the normal vector to the epipolar plane.

Because \(X_2\) lies *inside* that epipolar plane, it must be perpendicular to the plane's normal vector. We express this by taking the dot product of \(X_2\) with both sides of our equation:

X_2^T (t \times X_2) = X_2^T (t \times R X_1)

The left side of the equation is the dot product of \(X_2\) and a vector orthogonal to \(X_2\). This strictly equals zero:

0 = X_2^T (t \times R X_1)

This is the scalar triple product, formally proving that \(X_2\), \(t\), and \(R X_1\) are coplanar.

3. Introducing the Essential Matrix

To write this cleanly as a matrix multiplication, we replace the cross product operation \(t \times (\cdot)\) with its equivalent matrix multiplication using the skew-symmetric matrix of \(t\), denoted as \([t]_\times\):

0 = X_2^T [t]_\times R X_1

We group the camera extrinsics into a single matrix. This is the Essential Matrix (\(E\)):

E = [t]_\times R

Substituting \(E\) back into our coplanarity equation gives:

\[X_2^T E X_1 = 0\]

4. From 3D Points to 2D Pixels

The final step is relating the 3D coordinates (\(X_1, X_2\)) to the 2D normalized image coordinates (\(x_1, x_2\)) captured by the camera sensor.

Under the pinhole camera model, the 3D point is just a scaled version of the 2D image point along the ray defined by depth \(Z\):

\(X_1 = Z_1 x_1\) and \(X_2 = Z_2 x_2\).

Substituting these into our equation:

\[(Z_2 x_2)^T E (Z_1 x_1) = 0\]

\[Z_1 Z_2 (x_2^T E x_1) = 0\]

Since the depths \(Z_1\) and \(Z_2\) are strictly positive non-zero scalars for any point visible in front of the camera, we can divide them out. This leaves us with the fundamental Epipolar Constraint:

\[x_2^T E x_1 = 0\]

The Geometric Intuition in 2D

If you bracket the equation as \(x_2^T (E x_1) = 0\), it reveals a powerful tool for search algorithms in computer vision.

The term \((E x_1)\) is actually a 3-vector that defines the coefficients of a 2D line in the second image—specifically, the epipolar line \(l_2\). The equation then becomes \(x_2^T l_2 = 0\), which is the standard algebraic condition for a point \(x_2\) to lie on a line \(l_2\).

Instead of searching the entire 2D pixel grid of image 2 to find the match for \(x_1\), the Epipolar Constraint proves you only need to search along the 1D epipolar line \(E x_1\).

Jipeng Sun

Home

Research

Knowledge Share

Teaching