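Source: the publicly available English book and slides (PPT) corresponding to 《斯坦福数据挖掘教程·第三版》 (Stanford Data Mining Tutorial, 3rd edition).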
Let M be a square matrix. Let λ be a constant and e a nonzero column vector with the same number of rows as M. Then λ is an eigenvalue of M and e is the corresponding eigenvector of M if $Me = \lambda e$.
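As a quick numerical check of this definition, here is a minimal NumPy sketch. The matrix and the candidate eigenpair are illustrative values assumed for the example; they do not come from the text above.

```python
import numpy as np

# Illustrative symmetric matrix and candidate eigenpair (assumed example values).
M = np.array([[3.0, 2.0],
              [2.0, 6.0]])
lam = 7.0
e = np.array([1.0, 2.0]) / np.sqrt(5.0)   # nonzero column vector, scaled to unit length

# (lam, e) is an eigenpair exactly when M e equals lam * e.
is_eigenpair = np.allclose(M @ e, lam * e)
```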
Start with any unit vector v of the appropriate length and compute $M^iv$ iteratively until it converges. When M is a stochastic matrix, the limiting vector is the principal eigenvector (the eigenvector with the largest eigenvalue), and its corresponding eigenvalue is 1. This method for finding the principal eigenvector, called power iteration, works quite generally, although if the principal eigenvalue (the eigenvalue associated with the principal eigenvector) is not 1, then as i grows, the ratio of $M^{i+1}v$ to $M^iv$ approaches the principal eigenvalue while $M^iv$ approaches a vector (probably not a unit vector) with the same direction as the principal eigenvector.
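Power iteration can be sketched in a few lines of NumPy. This is a minimal sketch under some assumptions: the function name `power_iteration` and its parameters `num_iters` and `tol` are hypothetical, the vector is renormalized at every step (a common variant that keeps the numbers from overflowing), and the eigenvalue is read off as $x^TMx$ for the final unit vector x rather than as a ratio of successive iterates.

```python
import numpy as np

def power_iteration(M, num_iters=1000, tol=1e-12):
    """Estimate the principal eigenpair of a square matrix M (hypothetical helper)."""
    n = M.shape[0]
    v = np.ones(n) / np.sqrt(n)           # a unit start vector
    for _ in range(num_iters):
        w = M @ v                         # one multiplication by M
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:                 # v was mapped to zero; give up
            break
        w = w / w_norm                    # rescale to unit length
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    eigenvalue = v @ M @ v                # x^T M x for the unit vector x
    return eigenvalue, v

# Illustrative matrix (assumed example values).
M = np.array([[3.0, 2.0],
              [2.0, 6.0]])
lam, x = power_iteration(M)
```

For this matrix the estimate settles on the eigenvalue 7 with eigenvector proportional to (1, 2). The start vector must have some component along the principal eigenvector; the all-equal start used here is only one convenient choice.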
To find the second eigenpair we create a new matrix $M^* = M - \lambda_1 xx^T$, where x is the principal eigenvector (a unit vector) and $\lambda_1$ is its eigenvalue. Then, use power iteration on $M^*$ to compute its largest eigenvalue. The resulting $x^*$ and $\lambda^*$ are the eigenvector and eigenvalue of the second eigenpair of M, that is, the pair with the second-largest eigenvalue. Intuitively, what we have done is eliminate the influence of a given eigenvector by setting its associated eigenvalue to zero. The formal justification is the following two observations. If $M^* = M - \lambda xx^T$, where x and λ are the eigenpair with the largest eigenvalue, then:
x is also an eigenvector of $M^*$, and its corresponding eigenvalue is 0. In proof, observe that
$$M^*x = (M - \lambda xx^T)x = Mx - \lambda xx^Tx = Mx - \lambda x = 0$$
At the next-to-last step we use the fact that $x^Tx = 1$ because x is a unit vector.
Conversely, if v and $\lambda_v$ are an eigenpair of a symmetric matrix M other than the first eigenpair (x, λ), then they are also an eigenpair of $M^*$.
Proof:
$$M^*v = (M^*)^Tv = (M - \lambda xx^T)^Tv = M^Tv - \lambda x(x^Tv) = M^Tv = \lambda_v v$$
This sequence of equalities needs the following justifications:
(a) If M is symmetric, then $M = M^T$.
(b) The eigenvectors of a symmetric matrix are orthogonal. That is, the dot product of two eigenvectors of a symmetric matrix that belong to different eigenvalues is 0, which is why the term $x^Tv$ vanishes. We do not prove this statement here.
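The deflation step can be checked numerically. The sketch below is a minimal illustration assuming a symmetric example matrix; the helper `principal_eigenpair` is a hypothetical name for a plain fixed-length power iteration.

```python
import numpy as np

def principal_eigenpair(A, num_iters=500):
    """Plain power iteration with a fixed number of steps (hypothetical helper)."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    return v @ A @ v, v

# Illustrative symmetric matrix (assumed example values).
M = np.array([[3.0, 2.0],
              [2.0, 6.0]])

lam1, x1 = principal_eigenpair(M)            # first eigenpair of M
M_star = M - lam1 * np.outer(x1, x1)         # M* = M - lambda_1 x x^T
lam2, x2 = principal_eigenpair(M_star)       # principal eigenpair of M*
```

Here `M_star @ x1` is numerically the zero vector, and `(lam2, x2)` is also an eigenpair of the original `M`, in line with the two observations above.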
Principal-component analysis, or PCA, is a technique for taking a dataset consisting of a set of tuples representing points in a high-dimensional space and finding the directions along which the tuples line up best. The idea is to treat the set of tuples as a matrix M and find the eigenvectors for $MM^T$ or $M^TM$. The matrix of these eigenvectors can be thought of as a rigid rotation in a high-dimensional space. When you apply this transformation to the original data, the axis corresponding to the principal eigenvector is the one along which the points are most “spread out.” More precisely, this axis is the one along which the variance of the data is maximized. Put another way, the points can best be viewed as lying along this axis, with small deviations from this axis. Likewise, the axis corresponding to the second eigenvector (the eigenvector corresponding to the second-largest eigenvalue) is the axis along which the variance of distances from the first axis is greatest, and so on.
Any matrix of orthonormal vectors (unit vectors that are orthogonal to one another) represents a rotation and/or reflection of the axes of a Euclidean space.
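A small sketch of this procedure, assuming four illustrative 2-dimensional points stored as the rows of M. It follows the description above in using $M^TM$ directly; many PCA implementations subtract the column means first, which this sketch deliberately does not do.

```python
import numpy as np

# Four illustrative points, one per row (assumed example values).
M = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

gram = M.T @ M                                   # 2 x 2 symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(gram) # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]            # reorder: largest eigenvalue first
eigenvalues = eigenvalues[order]
E = eigenvectors[:, order]                       # orthonormal columns

rotated = M @ E                                  # points expressed in the new axes
spread = np.sum(rotated**2, axis=0)              # sum of squares along each new axis
```

The entries of `spread` equal the eigenvalues (58 and 2 for these points), so almost all of the spread lies along the first new axis.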
We conclude that the eigenvalues of $MM^T$ are the eigenvalues of $M^TM$ plus additional 0’s. If the dimension of $MM^T$ were less than the dimension of $M^TM$, then the opposite would be true; the eigenvalues of $M^TM$ would be those of $MM^T$ plus additional 0’s.
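This relationship is easy to observe numerically. The following sketch assumes an illustrative 4 × 2 matrix, so that $MM^T$ is 4 × 4 and $M^TM$ is 2 × 2.

```python
import numpy as np

# Illustrative 4 x 2 matrix (assumed example values).
M = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

# eigvalsh handles symmetric matrices and returns ascending eigenvalues;
# reverse them so the largest comes first.
small = np.linalg.eigvalsh(M.T @ M)[::-1]   # 2 eigenvalues
large = np.linalg.eigvalsh(M @ M.T)[::-1]   # 4 eigenvalues
```

The first two entries of `large` match `small`, and the remaining two are zero up to rounding error.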
Let M be an $m \times n$ matrix, and let the rank of M be r. Recall that the rank of a matrix is the largest number of rows (or equivalently columns) we can choose for which no nonzero linear combination of the rows is the all-zero vector 0 (we say a set of such rows or columns is independent). Then we can find matrices U, Σ, and V as shown in Fig. 11.5 with the following properties:
Suppose we want to represent a very large matrix M by its SVD components U, Σ, and V, but these matrices are also too large to store conveniently. The best way to reduce the dimensionality of the three matrices is to set the smallest of the singular values to zero. If we set the s smallest singular values to 0, then we can also eliminate the corresponding s columns of U and V.
How Many Singular Values Should We Retain?
A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ. That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.
The choice of the lowest singular values to drop when we reduce the number of dimensions can be shown to minimize the root-mean-square error between the original matrix M and its approximation.
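The 90% rule and the resulting approximation error can be sketched as follows. The matrix is an illustrative example assumed for this sketch, and `np.linalg.svd` stands in for whatever SVD routine is actually used.

```python
import numpy as np

# Illustrative 7 x 5 matrix (assumed example values).
M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s is sorted in descending order

energy = s**2
cumulative = np.cumsum(energy) / np.sum(energy)    # fraction of energy retained
# Smallest k whose retained energy reaches 90%.
k = min(int(np.searchsorted(cumulative, 0.90)) + 1, len(s))

# Keep the k largest singular values and the matching columns of U and V.
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
M_approx = U_k @ np.diag(s_k) @ Vt_k

# Squared Frobenius error of the approximation.
squared_error = np.linalg.norm(M - M_approx, "fro") ** 2
```

The squared error equals the sum of the squares of the dropped singular values, `np.sum(s[k:]**2)`, which is why dropping the smallest ones gives the smallest error.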
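Since $M = U\Sigma V^T$ and the columns of U are orthonormal, $M^TM = V\Sigma^2V^T$, and therefore $M^TMV = V\Sigma^2$. This equation says that V is the matrix of eigenvectors of $M^TM$ and $\Sigma^2$ is the diagonal matrix whose entries are the corresponding eigenvalues.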
Thus, the same algorithm that computes the eigenpairs for $M^TM$ gives us the matrix V for the SVD of M itself. It also gives us the singular values for this SVD; just take the square roots of the eigenvalues for $M^TM$. U is the matrix of eigenvectors of $MM^T$.
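A minimal sketch of this construction, assuming a small illustrative matrix with no zero singular values. One design choice here is an assumption of the sketch rather than something stated above: instead of running a second eigen-decomposition on $MM^T$, each column of U is computed as $Mv_i/\sigma_i$, which keeps the signs of U and V consistent with each other.

```python
import numpy as np

# Illustrative 4 x 2 matrix of full column rank (assumed example values).
M = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

eigvals, V = np.linalg.eigh(M.T @ M)     # eigenpairs of M^T M, ascending order
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                 # singular values of M
U = (M @ V) / sigma                      # column i is M v_i / sigma_i (needs sigma_i > 0)

M_rebuilt = U @ np.diag(sigma) @ V.T
```

The columns of `U` obtained this way are orthonormal eigenvectors of $MM^T$ with eigenvalues `eigvals`, and `M_rebuilt` reproduces `M`.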
Definition of CUR
Let M be a matrix of m rows and n columns. Pick a target number of “concepts” r to be used in the decomposition. A CUR-decomposition of M is a randomly chosen set of r columns of M, which form the $m \times r$ matrix C, and a randomly chosen set of r rows of M, which form the $r \times n$ matrix R. There is also an $r \times r$ matrix U that is constructed from C and R as follows:
Having selected each of the columns of M, we scale each column by dividing its elements by the square root of the expected number of times this column would be picked. That is, we divide the elements of the jth column of M, if it is selected, by $\sqrt{rq_j}$, where $q_j$ is the probability of picking column j. The scaled column of M becomes a column of C.
Rows of M are selected for R in the analogous way. For each row of R we select from the rows of M, choosing row i with probability $p_i$. Recall $p_i$ is the sum of the squares of the elements of the ith row divided by the sum of the squares of all the elements of M. We then scale each chosen row by dividing by $\sqrt{rp_i}$ if it is the ith row of M that was chosen.
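The selection and scaling of columns and rows can be sketched as below. The function names, the example matrix, and the fixed random seed are assumptions made for illustration; the column probabilities $q_j$ are computed in the same way as the row probabilities $p_i$, using columns instead of rows.

```python
import numpy as np

def sample_scaled_columns(M, r, rng):
    """Pick r columns with probability q_j and divide each by sqrt(r * q_j)."""
    q = np.sum(M**2, axis=0) / np.sum(M**2)          # column probabilities q_j
    picks = rng.choice(M.shape[1], size=r, replace=True, p=q)
    C = M[:, picks] / np.sqrt(r * q[picks])          # scale each chosen column
    return C, picks

def sample_scaled_rows(M, r, rng):
    """Pick r rows with probability p_i and divide each by sqrt(r * p_i)."""
    p = np.sum(M**2, axis=1) / np.sum(M**2)          # row probabilities p_i
    picks = rng.choice(M.shape[0], size=r, replace=True, p=p)
    R = M[picks, :] / np.sqrt(r * p[picks])[:, None] # scale each chosen row
    return R, picks

# Illustrative matrix and seed (assumed example values).
M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)
rng = np.random.default_rng(0)
r = 2
C, col_picks = sample_scaled_columns(M, r, rng)
R, row_picks = sample_scaled_rows(M, r, rng)
```

A side effect of this scaling is that every column of `C` and every row of `R` has the same squared length, namely the sum of the squares of all elements of M divided by r, regardless of which columns and rows were drawn.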
It is quite possible that a single row or column is selected more than once. However, it is also possible to combine k rows of R that are each the same row of the matrix M into a single row of R, thus leaving R with fewer rows. Likewise, k columns of C that each come from the same column of M can be combined into one column of C. However, for either rows or columns, the remaining vector should have each of its elements multiplied by $\sqrt{k}$.
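Merging duplicate columns might look like the sketch below. The helper name `merge_duplicates` and the small example are hypothetical. One way to see why $\sqrt{k}$ is a natural factor, offered here as an observation of this sketch rather than a claim from the text, is that it leaves the product $CC^T$ unchanged.

```python
import numpy as np

def merge_duplicates(C, picks):
    """Collapse repeated columns of C; picks[j] is the column of M behind C[:, j]."""
    picks = np.asarray(picks)
    # unique source columns, position of their first occurrence, and multiplicity k
    unique, first_pos, counts = np.unique(picks, return_index=True, return_counts=True)
    merged = C[:, first_pos] * np.sqrt(counts)   # multiply each survivor by sqrt(k)
    return merged, unique

# Hypothetical already-scaled columns, with source column 2 drawn three times.
base = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
picks = [2, 0, 2, 2]
C = base[:, picks]                               # 2 x 4, three identical columns
C_merged, sources = merge_duplicates(C, picks)   # 2 x 2
```

Rows of R can be merged in the same way by applying the helper to the transpose of R and transposing the result back.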
When we merge some rows and/or columns, it is possible that R has fewer rows than C has columns, or vice versa. As a consequence, W will not be a square matrix. However, we can still take its pseudoinverse by decomposing it into $W = X\Sigma Y^T$, where Σ is now a diagonal matrix with some all-0 rows or columns, whichever it has more of. To take the pseudoinverse of such a diagonal matrix, we treat each element on the diagonal as usual (invert nonzero elements and leave 0 as it is), but then we must transpose the result.
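The rule for a non-square diagonal matrix can be written out directly. In this sketch the helper name `pinv_rect_diagonal`, the tolerance `tol`, and the example W are assumptions; the pseudoinverse of W is then assembled as $Y\Sigma^+X^T$, which is the standard way to combine the three factors.

```python
import numpy as np

def pinv_rect_diagonal(Sigma, tol=1e-12):
    """Invert nonzero diagonal entries, leave zeros alone, then transpose."""
    inv = np.zeros_like(Sigma, dtype=float)
    idx = np.arange(min(Sigma.shape))
    diag = Sigma[idx, idx]
    nonzero = np.abs(diag) > tol                  # entries treated as nonzero
    inv[idx[nonzero], idx[nonzero]] = 1.0 / diag[nonzero]
    return inv.T                                  # shape flips from m x n to n x m

# Illustrative non-square W (assumed example values).
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
X, s, Yt = np.linalg.svd(W, full_matrices=True)   # X is 2 x 2, Yt is 3 x 3
Sigma = np.zeros(W.shape)                         # 2 x 3 with an all-0 column
idx = np.arange(len(s))
Sigma[idx, idx] = s

W_pinv = Yt.T @ pinv_rect_diagonal(Sigma) @ X.T   # 3 x 2 pseudoinverse of W
```

For this example `W_pinv` agrees with NumPy's built-in `np.linalg.pinv(W)`.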
END