PCA (Principal Component Analysis) is an unsupervised dimensionality-reduction method. Its goal is to keep the most valuable information in the data, namely the directions along which the variance is largest.
After PCA, the reduced dimensions no longer carry the original features' meaning. This property can be useful when the raw data has to be kept confidential, using the PCA transform as a crude form of data obfuscation ("encryption").
The core idea of the method: find the best axes so that the projected data is as spread out as possible.
Inner product and projection
The inner product of two vectors $A$ and $B$ is defined as
$A \cdot B = |A|\,|B|\cos(a)$
where $a$ is the angle between them. If $B$ has unit length, then $A \cdot B$ equals the (signed) length of the projection of $A$ onto the direction of $B$.
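A tiny numpy check of this relationship (the vectors here are just example values of our own):

import numpy as np

# Sketch (our own example values): the inner product with a unit vector
# equals the signed projection length |A| cos(a).
A = np.array([3.0, 2.0])
B = np.array([1.0, 1.0])
b_unit = B / np.linalg.norm(B)            # make B a unit vector

proj_len = A.dot(b_unit)                  # inner product with the unit vector
cos_a = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(proj_len)                           # 5/sqrt(2) ≈ 3.5355
print(np.linalg.norm(A) * cos_a)          # same value: |A| cos(a)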
Change of basis
In the standard X–Y coordinate system, the point $(3,2)$ can be written as $3(1,0)^{\top} + 2(0,1)^{\top}$, so $(1,0)$ and $(0,1)$ form a basis.
A basis only has to be linearly independent; here (and in PCA) we use orthonormal bases, whose vectors are pairwise orthogonal (inner product 0, i.e., mutually perpendicular) and have unit length.
A change of basis expresses a vector in terms of the new basis: take the inner product of the vector with the first basis vector to get the first coordinate in the new system, then with the second basis vector to get the second coordinate, and so on.
For example, mapping the point $(3,2)$ onto the basis given by the rows of $\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$, its new representation is
$$\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 5/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$$
A change of basis is thus a projection of the vector onto each basis vector, which is also why the basis vectors are required to be unit vectors.
PCA essentially amounts to finding such a set of basis vectors $p_1,\dots,p_R$ (as rows) and taking inner products with the original data columns $a_1,\dots,a_M$:
$$\begin{pmatrix} p_{1} \\ p_{2} \\ \vdots \\ p_{R} \end{pmatrix} \begin{pmatrix} a_{1} & a_{2} & \dots & a_{M} \end{pmatrix} = \begin{pmatrix} p_{1} a_{1} & p_{1} a_{2} & \cdots & p_{1} a_{M} \\ p_{2} a_{1} & p_{2} a_{2} & \cdots & p_{2} a_{M} \\ \vdots & \vdots & \ddots & \vdots \\ p_{R} a_{1} & p_{R} a_{2} & \cdots & p_{R} a_{M} \end{pmatrix}$$
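As a small sketch of this matrix form (our own variable names), the change-of-basis example above becomes a product of the basis matrix P, whose rows are the new basis vectors, with the data column:

import numpy as np

# Change of basis as a matrix product; rows of P are the new orthonormal basis.
P = np.array([[ 1/np.sqrt(2), 1/np.sqrt(2)],
              [-1/np.sqrt(2), 1/np.sqrt(2)]])
X = np.array([[3.0],
              [2.0]])                     # data point as a column vector

print(P @ X)       # [[ 5/sqrt(2)], [-1/sqrt(2)]] ≈ [[3.5355], [-0.7071]]
print(P @ P.T)     # identity matrix: the basis is orthonormal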
Solving for the basis
Choosing the direction: how do we pick a direction (i.e., a basis vector) so that as much of the original information as possible is preserved? The most intuitive criterion is that the projected values should be as spread out as possible, i.e., have maximum variance.
So we look for a one-dimensional basis such that, after all data points are expressed as coordinates on this basis, the variance of those coordinates is maximal.
Moreover, if we simply kept picking the direction of largest variance, subsequent directions would nearly coincide with the first one. We therefore require the basis directions to have no linear correlation with each other, i.e., the covariance between any two of them must be 0. Consequently, the second basis vector can only be chosen among the directions orthogonal to the first (and, among those, the one with the largest variance), and so on.
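A rough numerical illustration of this criterion (toy 2-D data and variable names of our own): scan unit directions, measure the variance of the projections onto each, and compare the best direction with the leading eigenvector of the covariance matrix.

import numpy as np

# Sketch: the direction with maximal projected variance coincides (up to sign)
# with the top eigenvector of the covariance matrix. Toy data, our own names.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=1000)
X = X - X.mean(axis=0)                                            # center the data

angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)         # candidate unit directions
proj_var = ((X @ dirs.T) ** 2).mean(axis=0)                       # variance of projections per direction

best = dirs[np.argmax(proj_var)]
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
print(best, eigvecs[:, -1])            # nearly parallel (up to sign)
print(proj_var.max(), eigvals[-1])     # nearly equal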
To reduce a set of N-dimensional vectors to k dimensions ($0 < k < N$), the optimization target is that the covariance matrix of the projected data $w^{\top}x$ should be diagonal: maximal variances on the diagonal, zero covariances off the diagonal.
The covariance matrix of $w^{\top}x$ is $\frac{1}{m} w^{\top} x x^{\top} w$, and $\frac{1}{m} x x^{\top}$ is exactly the covariance matrix $C$ of the original data (the data has already been centered). So all we need is a set of basis vectors $w$ that diagonalizes $C$, and the PCA objective is met.
$C$ is a real symmetric matrix, and a real symmetric matrix always has $n$ orthonormal eigenvectors $E=\begin{pmatrix} e_{1} & e_{2} & \cdots & e_{n} \end{pmatrix}$ that diagonalize it:
$$E^{\top} C E = \Lambda = \begin{pmatrix} \lambda_{1} & & & \\ & \lambda_{2} & & \\ & & \ddots & \\ & & & \lambda_{n} \end{pmatrix}$$
So $E$ is exactly the orthonormal basis we are looking for. Sort the eigenvectors by decreasing eigenvalue and stack them as rows from top to bottom; multiplying the matrix formed by the first k rows with the original data $X$ yields the reduced data matrix $Y$.
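A minimal numpy sketch of this fact (toy data, our own variable names): the eigenvectors of the covariance matrix diagonalize it, and keeping the top-k of them as rows gives the PCA projection.

import numpy as np

# Sketch: for centered data, the eigenvectors of the covariance matrix form
# an orthonormal basis E with E^T C E diagonal. Toy data, our own names.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 200))            # 5 features, 200 samples (columns are samples)
X = X - X.mean(axis=1, keepdims=True)    # center each feature

C = X @ X.T / X.shape[1]                 # covariance matrix (dividing by m, as in the text)
lam, E = np.linalg.eigh(C)               # eigenvalues ascending, eigenvectors as columns

print(np.round(E.T @ C @ E, 6))          # diagonal matrix Lambda
k = 2
P = E[:, ::-1][:, :k].T                  # top-k eigenvectors as rows
Y = P @ X                                # reduced (k x 200) data matrix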
A worked example
The data after centering is
$$\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}$$
Its covariance matrix is
$$C = \frac{1}{5}\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} -1 & -2 \\ -1 & 0 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5} \end{pmatrix}$$
Its eigenvalues and eigenvectors are
$\lambda_{1}=2,\ c_{1}=\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ and $\lambda_{2}=2/5,\ c_{2}=\begin{pmatrix} -1 \\ 1 \end{pmatrix}$
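This can be verified quickly with numpy (our own check, which also covers the 1-D projection computed next):

import numpy as np

# Verify the eigen-decomposition of C and the 1-D projection of the worked example.
C = np.array([[6/5, 4/5],
              [4/5, 6/5]])
lam, vecs = np.linalg.eigh(C)
print(lam)                    # [0.4, 2.0], i.e. 2/5 and 2
print(vecs)                   # columns proportional to (-1, 1) and (1, 1)

X = np.array([[-1, -1, 0, 2, 0],
              [-2,  0, 0, 1, 1]], dtype=float)
w = vecs[:, -1]               # unit eigenvector for the largest eigenvalue, (1/sqrt(2), 1/sqrt(2))
print(w @ X)                  # [-3/sqrt(2), -1/sqrt(2), 0, 3/sqrt(2), 1/sqrt(2)]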
So, to reduce to one dimension, we normalize the leading eigenvector and project:
$$Y=\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}=\begin{pmatrix} -3/\sqrt{2} & -1/\sqrt{2} & 0 & 3/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
import numpy as np
import pandas as pd
# Read the data ----------------------------------------------------------------------------------------------------------
df = pd.read_csv("iris.data", header=None)
df.columns = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
X = df[["sepal_len", "sepal_wid", "petal_len", "petal_wid"]].values
y = df["class"].values
# Plot, for each feature, its distribution under the different labels ----------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt
label_dict = {1: "Iris-setosa", 2: "Iris-versicolor", 3: "Iris-virginica"}
feature_dict = {0: "sepal length[cm]", 1: "sepal width[cm]", 2: "petal length[cm]", 3: "petal width[cm]"}
plt.figure(figsize=(8, 6))
for cnt in range(4):
    plt.subplot(2, 2, cnt + 1)
    for lab in ("Iris-setosa", "Iris-versicolor", "Iris-virginica"):
        plt.hist(X[y == lab, cnt], label=lab, bins=10, alpha=0.3)  # note the boolean-mask indexing X[y == lab, cnt]
    plt.xlabel(feature_dict[cnt])
    plt.legend(loc="best", fancybox=True, fontsize=8)  # fancybox: draw the legend frame with rounded corners
plt.tight_layout()
plt.show()
# Standardize the data ----------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
print(X_std)
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0] - 1)
# cov_mat = np.cov(X_std.T) gives the same result
# Compute eigenvalues and eigenvectors ----------------------------------------------------------------------------------------------------------
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print(eig_vecs)
print(eig_vals)
# Plot the eigenvalue magnitudes and the share of variance they explain ----------------------------------------------------------------------------------------------------------
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)  # sort pairs by their first element, i.e. the eigenvalue
print(eig_pairs)
for i in eig_pairs:
    print(i[0])
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.bar(range(1, 5), var_exp, label='individual explained variance', align='center')
plt.step(range(1, 5), cum_var_exp, label='cumulative explained variance', where='mid')
plt.legend(loc='best')
plt.show()
# Horizontally stack the top-2 eigenvectors to form the projection matrix w ----------------------------------------------------------------------------------------------------------
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1), eig_pairs[1][1].reshape(4, 1)))
# Multiply the standardized data by w to get the reduced data ----------------------------------------------------------------------------------------------------------
Y = X_std.dot(matrix_w)  # after the projection, the rows of Y still line up with the labels in y
# Scatter plot of the two new features (principal components), colored by label ----------------------------------------------------------------------------------------------------------
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y == lab, 0],
                Y[y == lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
# The same pipeline with sklearn ----------------------------------------------------------------------------------------------------------
import pandas as pd
df = pd.read_csv("iris.data", header=None)
df.columns = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
X = df[["sepal_len", "sepal_wid", "petal_len", "petal_wid"]].values
y = df["class"].values
# Standardize ----------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
# PCA dimensionality reduction ----------------------------------------------------------------------------------------------------------
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
Y = pca.fit_transform(X_std)
# Plot ----------------------------------------------------------------------------------------------------------
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y == lab, 0],
                Y[y == lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
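As a quick cross-check against the hand-computed var_exp above, the fitted sklearn PCA object exposes the explained-variance ratios (a short sketch; the numeric values are approximate):

# The fitted PCA object reports how much variance each component explains;
# these ratios should match var_exp computed by hand above.
print(pca.explained_variance_ratio_)            # roughly [0.73, 0.23] for the first two components
print(pca.explained_variance_ratio_.cumsum())   # cumulative share of variance retained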