Dissecting the Camera Matrix


So, you’ve been playing around with a new computer vision library, and you’ve managed to calibrate your camera… now what do you do with it? It would be a lot more useful if you could get at the camera’s position or find out its field of view. You crack open your trusty copy of Hartley and Zisserman, which tells you how to decompose your camera into an intrinsic and extrinsic matrix. Great! But when you look at the results, something isn’t quite right. Maybe your rotation matrix has a determinant of -1, causing your matrix-to-quaternion function to barf. Maybe your focal length is negative, and you can’t understand why. Maybe your translation vector mistakenly claims that the world origin is behind the camera. Or worst of all, everything looks fine, but when you plug it into OpenGL, you just don’t see anything.

Today we’ll cover the process of decomposing a camera matrix into intrinsic and extrinsic matrices, and we’ll try to untangle the issues that can crop up with different coordinate conventions. In later articles, we’ll study the intrinsic and extrinsic matrices in more detail, and I’ll cover how to convert them into a form usable by OpenGL.

Prologue: Getting a Camera Matrix

I’ll assume you’ve already obtained your camera matrix beforehand, but if you’re looking for help with camera calibration, I recommend looking into the Camera Calibration Toolbox for Matlab. OpenCV also seems to have some useful routines for automatic camera calibration from a sequence of chessboard images, although I haven’t personally used them. As usual, Hartley and Zisserman’s book has a nice treatment of the topic.

Cut ‘em Up: Camera Decomposition

To start, we’ll assume your camera matrix is 3x4, which transforms homogeneous 3D world coordinates to homogeneous 2D image coordinates. Following Hartley and Zisserman, we’ll denote the matrix as P, and occasionally it will be useful to use the block-form:

\[ P = [M \mid -MC] \]

where M is an invertible 3x3 matrix, and C is a column-vector representing the camera’s position in world coordinates. Some calibration software provides a 4x4 matrix, which adds an extra row to preserve the z-coordinate. In this case, just drop the third row to get a 3x4 matrix.
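
To make the projection concrete, here is a minimal sketch that applies a 3x4 camera matrix to one homogeneous world point. The values of P and X are made up for illustration (P is simply an intrinsic matrix times [I | 0]); they are not from any real calibration.

```matlab
% Hypothetical 3x4 camera matrix: focal length 800, principal point (320, 240),
% camera at the world origin looking down +z.
P = [800 0 320 0; 0 800 240 0; 0 0 1 0];

% Homogeneous 3D world point (x, y, z, 1).
X = [0.5; -0.25; 2; 1];

% Project, then divide by the third coordinate to get pixel coordinates.
x = P * X;
pixel = x(1:2) / x(3);
```

The division by x(3) is what makes the overall scale of P irrelevant, a fact that matters later when the sign of the whole matrix is flipped.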

The camera matrix by itself is useful for projecting 3D points into 2D, but it has several drawbacks:
- It doesn’t tell you the camera’s pose.
- It doesn’t tell you about the camera’s internal geometry.
- Specular lighting isn’t possible, since you can’t get surface normals in camera coordinates.

To address these drawbacks, a camera matrix can be decomposed into the product of two matrices: an intrinsic matrix, K, and an extrinsic matrix, \([R \mid -RC]\):

\[ P = K[R \mid -RC] \]

The matrix K is a 3x3 upper-triangular matrix that describes the camera’s internal parameters like focal length. R is a 3x3 rotation matrix whose columns are the directions of the world axes in the camera’s reference frame. The vector C is the camera center in world coordinates; the vector \(t = -RC\) gives the position of the world origin in camera coordinates. We’ll study each of these matrices in more detail in later articles; today we’ll just discuss how to get them from P.

Recovering the camera center, C, is straightforward. Note that the last column of P is \(-MC\), so just left-multiply it by \(-M^{-1}\).
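
As a rough check of this step, the sketch below builds a camera matrix from known parts and recovers the center from it. The intrinsics, rotation, and center are hypothetical values chosen only so the answer can be verified.

```matlab
% Hypothetical ground truth, used only to construct a test camera matrix.
K = [800 0 320; 0 800 240; 0 0 1];
R = [0 -1 0; 1 0 0; 0 0 1];          % 90-degree rotation about z
C = [1; 2; -5];                      % camera center in world coordinates
P = K * [R, -R * C];

% Split P into its left 3x3 block and its last column.
M  = P(:, 1:3);
p4 = P(:, 4);

% The last column is -M*C, so C = -inv(M) * p4 (solved without forming inv).
C_recovered = -(M \ p4);

% The camera center is also the point that P maps to zero.
null_check = norm(P * [C_recovered; 1]);
```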


Before You RQ-ze Me…

To recover R and K, we note that R is orthogonal by virtue of being a rotation matrix, and K is upper-triangular. Any full-rank matrix can be decomposed into the product of an upper-triangular matrix and an orthogonal matrix by using RQ-decomposition. Unfortunately, RQ-decomposition isn’t available in many libraries including Matlab, but luckily, its friend QR-decomposition usually is. Solem’s vision blog has a nice article implementing the missing function using a few matrix flips; here’s a Matlab version (thanks to Solem for letting me repost this!):

function [R, Q] = rq(M)
    % RQ-decomposition built from QR: M = R * Q,
    % with R upper-triangular and Q orthogonal.
    [Q, R] = qr(flipud(M)');
    R = flipud(R');
    R = fliplr(R);

    Q = Q';
    Q = flipud(Q);
end
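
As a sanity check, the same flips can be run inline on a matrix built from known factors. This is only a sketch: the factors K_true and R_true are hypothetical, and the variable U stands in for the triangular output of qr so it isn't confused with the rotation.

```matlab
% Hypothetical upper-triangular and orthogonal factors.
K_true = [800 2 320; 0 780 240; 0 0 1];
c = cos(0.3); s = sin(0.3);
R_true = [c -s 0; s c 0; 0 0 1];
M = K_true * R_true;

% Same steps as rq(M), written inline.
[Q, U] = qr(flipud(M)');
K = fliplr(flipud(U'));     % upper-triangular factor
R = flipud(Q');             % orthogonal factor

recon_err  = norm(K * R - M);          % should be near zero
lower_part = norm(tril(K, -1));        % K has nothing below the diagonal
orth_err   = norm(R * R' - eye(3));    % R is orthogonal
```

Note that K and R may differ from K_true and R_true by sign flips, which is exactly the ambiguity discussed next.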

I’m seeing double… FOUR decompositions!

There’s only one problem: the result of RQ-decomposition isn’t unique. To see this, try negating any column of K and the corresponding row of R: the resulting camera matrix is unchanged. Most people simply force the diagonal elements of K to be positive, which is the correct approach if two conditions are true:

  1. your image’s X/Y axes point in the same direction as your camera’s X/Y axes.
  2. your camera looks in the positive-z direction.

Solem’s blog elegantly gives us positive diagonal entries in three lines of code:

% make diagonal of K positive
T = diag(sign(diag(K)));

K = K * T;
R = T * R; % (T is its own inverse)
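
To see what this does, here is a small sketch on a hypothetical decomposition in which the first diagonal entry of K came out negative. The numbers are invented for illustration.

```matlab
% Hypothetical RQ output with a negative first diagonal entry in K.
K = [-800 0 320; 0 780 240; 0 0 1];
R = [-1 0 0; 0 1 0; 0 0 1];        % orthogonal, but det(R) = -1
M = K * R;

T = diag(sign(diag(K)));           % diag(-1, 1, 1) for this K
K_pos = K * T;                     % negates the first column of K
R_pos = T * R;                     % negates the first row of R

% T * T is the identity, so the product K * R is unchanged.
product_err = norm(K_pos * R_pos - M);
```

In this particular example the flip also happens to repair the determinant of R, but that is not guaranteed in general.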

In practice, the camera and image axes often won’t agree, in which case the diagonal elements of K shouldn’t all be positive. Forcing them to be positive can result in nasty side effects, including:

- The objects appear on the wrong side of the camera.
- The rotation matrix has a determinant of -1 instead of 1.
- Incorrect specular lighting.
- Visible geometry won’t render due to having a negative w coordinate.


In this case, you’ve got some fixing to do. Start by making sure that your camera and world coordinates both have the same handedness. Then take note of the axis conventions you used when you calibrated your camera. What direction did the image y-axis point, up or down? The x-axis? Now consider your camera’s coordinate axes. Does your camera look down the negative-z axis (OpenGL-style)? Positive-z (like Hartley and Zisserman)? Does the x-axis point left or right? The y-axis? Okay, okay, you get the idea.

[Figure: Hartley and Zisserman’s coordinate conventions. Note that camera and image x-axes point left when viewed from the camera’s POV.]

Starting from an all-positive diagonal, follow these four steps:

1. If the image x-axis and camera x-axis point in opposite directions, negate the first column of K and the first row of R.
2. If the image y-axis and camera y-axis point in opposite directions, negate the second column of K and the second row of R.
3. If the camera looks down the negative-z axis, negate the third column of K and the third row of R.
4. If the determinant of R is -1, negate it.

Note that each of these steps leaves the combined camera matrix unchanged. The last step is equivalent to multiplying the entire camera matrix, P, by -1. Since P operates on homogeneous coordinates, multiplying it by any constant has no effect.
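
The four steps can be sketched as follows. The starting K and R and the three boolean flags are hypothetical; set the flags to match your own conventions. One assumption to flag loudly: step 3 is implemented here by negating the third column of K together with the third row of R, which is what keeps the product K * R intact.

```matlab
% Assumed starting point: K has an all-positive diagonal.
K = [800 0 320; 0 800 240; 0 0 1];
R = eye(3);
M = K * R;                          % left 3x3 block of the camera matrix

% Hypothetical convention flags.
flip_x = false;                     % image x-axis opposes camera x-axis
flip_y = true;                      % image y-axis opposes camera y-axis
looks_down_neg_z = true;            % OpenGL-style camera

% Steps 1-3: collect the sign flips in one diagonal matrix.
signs = [1 1 1];
if flip_x, signs(1) = -1; end
if flip_y, signs(2) = -1; end
if looks_down_neg_z, signs(3) = -1; end
T = diag(signs);
K = K * T;                          % negates the selected columns of K
R = T * R;                          % negates the matching rows of R

% Step 4: make R a proper rotation; this scales the camera matrix by -1.
scale = 1;
if det(R) < 0
    R = -R;
    scale = -1;
end
```

After this runs, K * R equals scale * M, so the projection is the same up to the homogeneous scale factor.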

Regarding step 3, Hartley and Zisserman’s camera looks down the positive-z direction, but in some real-world systems (e.g. OpenGL) the camera looks down the negative-z axis. This allows the x and y axes to point right and up, resulting in a coordinate system that feels natural while still being right-handed. Step 3 above corrects for this, by causing w to be positive when z is negative. You may balk at the fact that \(K_{3,3}\) is negative, but OpenGL requires this for proper clipping. We’ll discuss OpenGL more in a future article.

You can double-check the result by inspecting the vector \(t = -RC\), which is the location of the world origin in camera coordinates. If everything is correct, the signs of \(t_x, t_y, t_z\) should reflect where the world origin appears in the camera (left/right of center, above/below center, in front/behind camera, respectively).
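
Here is a tiny worked example of that check, assuming Hartley and Zisserman’s convention (camera looks down positive z) and a hypothetical camera whose axes are aligned with the world axes.

```matlab
R = eye(3);          % hypothetical: camera axes aligned with world axes
C = [2; 0; -5];      % hypothetical camera center in world coordinates

% Location of the world origin in camera coordinates.
t = -R * C;
```

Here t is (-2, 0, 5): the positive z-component says the world origin is in front of this camera, and the negative x-component says it lies toward the camera’s negative x direction. If you know the origin should be visible and the z-component comes out with the wrong sign for your convention, one of the flips above was probably missed.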

Who Flipped my Axes?

Until now, our discussion of 2D coordinate conventions has referred to the coordinates used during calibration. If your application uses a different 2D coordinate convention, you’ll need to transform K using 2D translation and reflection.

For example, consider a camera matrix that was calibrated with the origin in the top-left and the y-axis pointing downward, but you prefer a bottom-left origin with the y-axis pointing upward. To convert, you’ll first negate the image y-coordinate and then translate upward by the image height, h. The resulting intrinsic matrix K' is given by:

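\[ K' = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & h \\ 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix} K \]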
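
The conversion can be sketched in a few lines. The intrinsic matrix, image height, and test point are hypothetical; the point of the example is that the new y-coordinate comes out as h minus the old one while x is untouched.

```matlab
% Hypothetical intrinsics calibrated with a top-left origin and y pointing down.
K = [800 0 320; 0 800 240; 0 0 1];
h = 480;                                   % image height in pixels

flip_y  = [1 0 0; 0 -1 0; 0 0 1];          % negate the image y-coordinate
shift_y = [1 0 0; 0 1 h; 0 0 1];           % translate upward by h
K_new = shift_y * flip_y * K;

% Project the same camera-space point with both matrices.
p_cam  = [0.1; 0.2; 1];
old_px = K * p_cam;      old_px = old_px / old_px(3);
new_px = K_new * p_cam;  new_px = new_px / new_px(3);
```

The order matters: applying the translation before the reflection would move the origin in the wrong direction.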

Summary

The procedure above should give you a correct camera decomposition regardless of the coordinate conventions you use. I’ve tested it in a handful of scenarios in my own research, and it has worked so far.


The Extrinsic Camera Matrix

The camera’s extrinsic matrix describes the camera’s location in the world, and what direction it’s pointing. Those familiar with OpenGL know this as the “view matrix” (or rolled into the “modelview matrix”). It has two components: a rotation matrix, R, and a translation vector t, but as we’ll soon see, these don’t exactly correspond to the camera’s rotation and translation. First we’ll examine the parts of the extrinsic matrix, and later we’ll look at alternative ways of describing the camera’s pose that are more intuitive.

The extrinsic matrix takes the form of a rigid transformation matrix: a 3x3 rotation matrix in the left block, and a 3x1 translation column-vector in the right:

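\[ [R \mid t] = \left[ \begin{array}{ccc|c} r_{1,1} & r_{1,2} & r_{1,3} & t_1 \\ r_{2,1} & r_{2,2} & r_{2,3} & t_2 \\ r_{3,1} & r_{3,2} & r_{3,3} & t_3 \end{array} \right] \]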


It’s common to see a version of this matrix with an extra row of (0,0,0,1) added to the bottom. This makes the matrix square, which allows us to further decompose this matrix into a rotation followed by translation:

\[ \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} I & t \\ 0 & 1 \end{bmatrix} \times \begin{bmatrix} R & 0 \\ 0 & 1 \end{bmatrix} \]


This matrix describes how to transform points in world coordinates to camera coordinates. The vector t can be interpreted as the position of the world origin in camera coordinates, and the columns of R represent the directions of the world axes in camera coordinates.
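
The sketch below builds the square form of the extrinsic matrix from a hypothetical R and t, checks the rotation-then-translation factorization, and confirms the two interpretations just given.

```matlab
% Hypothetical rotation (30 degrees about y) and translation.
c = cos(pi/6); s = sin(pi/6);
R = [c 0 s; 0 1 0; -s 0 c];
t = [0.5; -1; 4];

% Square extrinsic matrix and its two factors.
E     = [R, t; 0 0 0 1];
T_mat = [eye(3), t; 0 0 0 1];              % pure translation
R_mat = [R, zeros(3, 1); 0 0 0 1];         % pure rotation
decomp_err = norm(E - T_mat * R_mat);      % rotation is applied first

% World origin (a point, w = 1) and world x-axis (a direction, w = 0).
origin_cam = E * [0; 0; 0; 1];
x_axis_cam = E * [1; 0; 0; 0];
```

The origin maps to t and the x-axis direction maps to the first column of R, as claimed.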


The important thing to remember about the extrinsic matrix is that it describes how the world is transformed relative to the camera. This is often counter-intuitive, because we usually want to specify how the camera is transformed relative to the world. Next, we’ll examine two alternative ways to describe the camera’s extrinsic parameters that are more intuitive and how to convert them into the form of an extrinsic matrix.


Building the Extrinsic Matrix from Camera Pose

It’s often more natural to specify the camera’s pose directly rather than specifying how world points should transform to camera coordinates. Luckily, building an extrinsic camera matrix this way is easy: just build a rigid transformation matrix that describes the camera’s pose and then take its inverse.


Let C be a column vector describing the location of the camera center in world coordinates, and let \(R_c\) be the rotation matrix describing the camera’s orientation with respect to the world coordinate axes. The transformation matrix that describes the camera’s pose is then \([R_c \mid C]\). Like before, we make the matrix square by adding an extra row of (0,0,0,1). Then the extrinsic matrix is obtained by inverting the camera’s pose matrix:

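\[
\begin{aligned}
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
&= \begin{bmatrix} R_c & C \\ 0 & 1 \end{bmatrix}^{-1} \\
&= \left( \begin{bmatrix} I & C \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R_c & 0 \\ 0 & 1 \end{bmatrix} \right)^{-1} \\
&= \begin{bmatrix} R_c & 0 \\ 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} I & C \\ 0 & 1 \end{bmatrix}^{-1} \\
&= \begin{bmatrix} R_c^T & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} I & -C \\ 0 & 1 \end{bmatrix} \\
&= \begin{bmatrix} R_c^T & -R_c^T C \\ 0 & 1 \end{bmatrix}
\end{aligned}
\]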


When applying the inverse, we use the fact that the inverse of a rotation matrix is its transpose, and inverting a translation matrix simply negates the translation vector. Thus, we see that the relationship between the extrinsic matrix parameters and the camera’s pose is straightforward:

\[ R = R_c^T, \qquad t = -RC \]

Some texts write the extrinsic matrix substituting -RC for t, which mixes a world transform (R) and camera transform notation (C).
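
To check this relationship numerically, the sketch below inverts a hypothetical pose matrix and compares the result with the extrinsic matrix built from the transpose of the camera rotation and from -RC.

```matlab
% Hypothetical camera pose: orientation Rc and center C in world coordinates.
Rc = [0 0 1; 0 1 0; -1 0 0];       % a proper rotation (det = 1)
C  = [3; 1; -2];

% Route 1: invert the 4x4 pose matrix directly.
pose = [Rc, C; 0 0 0 1];
E = inv(pose);

% Route 2: use the closed-form relationship.
R = Rc';
t = -R * C;
closed_form_err = norm(E - [R, t; 0 0 0 1]);

% The camera center should land at the origin of camera coordinates.
center_cam = E * [C; 1];
```

In real code the closed form is preferable to calling inv, since it is cheaper and keeps R exactly orthogonal.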


The “Look-At” Camera

Readers familiar with OpenGL might prefer a third way of specifying the camera’s pose using (a) the camera’s position, (b) what it’s looking at, and (c) the “up” direction. In legacy OpenGL, this is accomplished by the gluLookAt() function, so we’ll call this the “look-at” camera. Let C be the camera center, p be the target point, and u be the up direction. The algorithm for computing the rotation matrix is (paraphrased from the OpenGL documentation):

1. Compute L = p - C.
2. Normalize L.
3. Compute s = L x u. (cross product)
4. Normalize s.
5. Compute u’ = s x L.


The extrinsic rotation matrix is then given by:

\[ R = \begin{bmatrix} s_1 & s_2 & s_3 \\ u'_1 & u'_2 & u'_3 \\ -L_1 & -L_2 & -L_3 \end{bmatrix} \]

You can get the translation vector the same way as before, \(t = -RC\).
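
The look-at steps can be sketched as below. The camera center, target, and up vector are hypothetical, and the rows of R are taken to be s, u', and -L, following the gluLookAt convention in which the camera looks down its negative z-axis.

```matlab
% Hypothetical inputs.
C = [3; 2; 5];                     % camera center
p = [0; 0; 0];                     % target point
u = [0; 1; 0];                     % up direction (must not be parallel to p - C)

% Steps 1-5 from the list above.
L = p - C;
L = L / norm(L);
s = cross(L, u);
s = s / norm(s);
u2 = cross(s, L);                  % the corrected up vector u'

% Rows of the extrinsic rotation: s, u', -L.
R = [s'; u2'; -L'];
t = -R * C;

% The target should sit straight ahead, on the camera's negative z-axis.
target_cam = R * p + t;
dist = norm(p - C);
```

The snippet does not guard against an up vector parallel to the viewing direction, which would make s zero-length; real code should handle that case.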

Conclusion

We’ve just explored three different ways of parameterizing a camera’s extrinsic state. Which parameterization you prefer to use will depend on your application. If you’re writing a Wolfenstein-style FPS, you might like the world-centric parameterization, because moving along \(t_z\) always corresponds to walking forward. Or you might be interpolating a camera through waypoints in your scene, in which case, the camera-centric parameterization is preferred, since you can specify the position of your camera directly. If you aren’t sure which you prefer, experiment with each and decide which approach feels the most natural.

Originally Posted by Kyle Simek
