PCA (Principal Component Analysis): Two Implementations, with scikit-learn and with NumPy

'''
PCA with the Iris dataset: a manual example. Principal component analysis is demonstrated on the Iris data in two ways, by hand with NumPy and with the PCA class in scikit-learn.
'''

# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
# load the Iris dataset
iris = load_iris()
# create X and y variables to hold the feature columns and the response column
iris_X, iris_y = iris.data, iris.target
# the names of the flower species we are trying to predict
iris.target_names
# Names of the features
iris.feature_names

# Build the covariance matrix: C = (X - mean)^T (X - mean) / (n - 1)
# import numpy
import numpy as np
# calculate the mean vector
mean_vector = iris_X.mean(axis=0)
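print(mean_vector)
# [ 5.84333333  3.054       3.75866667  1.19866667]  (values may differ slightly with the Iris file shipped by other scikit-learn versions)
# calculate the covariance matrix. It is symmetric, with one row and one column per feature.
# np.cov centers the data itself, so subtracting the mean first is harmless but not required.
cov_mat = np.cov((iris_X-mean_vector).T)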
print(cov_mat.shape)
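# A minimal sketch, not part of the original walkthrough, that spells out what
# np.cov computes: the sample covariance C = Xc^T Xc / (n - 1) on mean-centered
# data. The names X, X_centered, cov_manual and cov_numpy are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                       # shape (150, 4)
n_samples = X.shape[0]

# center each feature (column) on its own mean
X_centered = X - X.mean(axis=0)

# sample covariance written out by hand
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

# np.cov treats rows as variables by default, hence rowvar=False
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual.shape)                        # one row and one column per feature
print(np.allclose(cov_manual, cov_numpy))      # do the two computations agree?
print(np.allclose(cov_manual, cov_manual.T))   # is the matrix symmetric?
```

# Passing rowvar=False is equivalent to transposing the data first, which is
# what the listing above does with .T.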

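# Compute the eigenvalues and eigenvectors of the covariance matrix
# calculate the eigenvectors and eigenvalues of our covariance matrix of the iris dataset
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)
# np.linalg.eig does not guarantee any ordering, so sort the eigenpairs by descending eigenvalue
order = np.argsort(eig_val_cov)[::-1]
eig_val_cov, eig_vec_cov = eig_val_cov[order], eig_vec_cov[:, order]
# Print the eigenvectors and corresponding eigenvalues
# in order of descending eigenvalues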
for i in range(len(eig_val_cov)):
	eigvec_cov = eig_vec_cov[:,i]
	print ('Eigenvector {}: \n{}'.format(i+1, eigvec_cov))
	print ('Eigenvalue {} from covariance matrix: {}'.format(i+1,eig_val_cov[i]))
	print (30 * '-')
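# A hedged alternative sketch: because a covariance matrix is symmetric,
# np.linalg.eigh can be used instead of np.linalg.eig. It returns real
# eigenvalues in ascending order, so they are reversed here. This is a swapped-in
# routine for illustration, not the call used in the listing above; the variable
# names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric (Hermitian) matrices
eig_vals, eig_vecs = np.linalg.eigh(cov)

# reverse so that column 0 belongs to the largest eigenvalue
eig_vals = eig_vals[::-1]
eig_vecs = eig_vecs[:, ::-1]

# every column v should satisfy cov @ v = lambda * v
residual = cov @ eig_vecs - eig_vecs * eig_vals
print(np.abs(residual).max())

# the eigenvectors should form an orthonormal set
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(4)))

# the eigenvalues should add up to the total variance (trace of cov)
print(np.isclose(eig_vals.sum(), np.trace(cov)))
```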

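# Rank the eigenvectors by eigenvalue and keep the top k: each eigenvalue's share of the total is the fraction of variance its component explains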
explained_variance_ratio = eig_val_cov/eig_val_cov.sum()
explained_variance_ratio

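# Scree plot: visualize the importance of the eigenvalues/eigenvectors (cumulative explained variance)
ks = np.arange(1, len(explained_variance_ratio) + 1)  # component counts 1..4 for the x-axis
plt.plot(ks, np.cumsum(explained_variance_ratio))
plt.xticks(ks)
plt.title('Scree Plot')
plt.xlabel('Principal Component (k)')
plt.ylabel('Fraction of Variance Explained <= k')
plt.show()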
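# The curve can also be read off numerically. A minimal sketch, assuming a
# cutoff of 95% cumulative variance; that threshold is a hypothetical value
# picked for illustration and does not come from the original text.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# eigenvalues of the covariance matrix, largest first
eig_vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

ratio = eig_vals / eig_vals.sum()
cumulative = np.cumsum(ratio)

threshold = 0.95   # hypothetical cutoff, for illustration only

# index of the first cumulative value >= threshold, turned into a 1-based count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(k)
```

# On Iris this cutoff should select two components, which is the number kept in
# the projection step.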

# Transform the original data with the retained eigenvectors to produce a new data matrix
# store the top two eigenvectors in a variable (here we assume the first two were selected)
top_2_eigenvectors = eig_vec_cov[:,:2].T
# show the transpose so that each row is a principal component, we have two rows == two components
top_2_eigenvectors

# to transform our data from having shape (150, 4) to (150, 2)
# we will multiply the matrices of our data and our eigenvectors together
# (iris_X is used as-is here, without subtracting mean_vector)
np.dot(iris_X, top_2_eigenvectors.T)
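# A small sketch of what mean-centering changes in this projection. It assumes
# the eigenvectors come from np.linalg.eigh (a swapped-in routine) and uses the
# illustrative names W, uncentered and centered.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
mean_vector = X.mean(axis=0)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(X, rowvar=False))
eig_vals = eig_vals[::-1]                  # largest eigenvalue first
W = eig_vecs[:, ::-1][:, :2]               # (4, 2): top two eigenvectors as columns

uncentered = X @ W                         # plain dot product with the raw data
centered = (X - mean_vector) @ W           # projection of the mean-centered data

# the two results differ by the same constant row for every sample
offset = uncentered - centered
print(np.allclose(offset, mean_vector @ W))

# centered scores have zero mean ...
print(np.allclose(centered.mean(axis=0), 0))
# ... and their sample variances equal the two largest eigenvalues
print(np.allclose(centered.var(axis=0, ddof=1), eig_vals[:2]))
```

# Because the difference is only a constant shift along each component, the
# relative positions of the samples are identical in both projections.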



# The code above implements PCA by hand with NumPy. scikit-learn also provides a PCA implementation.
# scikit-learn's version of PCA
from sklearn.decomposition import PCA
# instantiate the class
pca = PCA(n_components=2)

# fit the PCA to our data
pca.fit(iris_X)
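# inspect the resulting principal components
pca.components_
# Transform the data. scikit-learn's PCA centers the data automatically, so the result differs from the
# manual projection above (by a constant offset per component, and possibly in sign); this should not affect model predictions.
pca.transform(iris_X)[:5]
# variance explained by each component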
pca.explained_variance_ratio_
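# A hedged cross-check between the fitted scikit-learn model and a covariance
# eigendecomposition. It uses the documented attributes mean_, components_ and
# explained_variance_; the eigh-based decomposition and the variable names are
# assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(X, rowvar=False))
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]   # largest first

# mean_ is the per-feature mean that transform() subtracts
print(np.allclose(pca.mean_, X.mean(axis=0)))

# explained_variance_ holds the largest covariance eigenvalues
print(np.allclose(pca.explained_variance_, eig_vals[:2]))

# rows of components_ match the eigenvectors, up to a sign flip per component
print(np.allclose(np.abs(pca.components_), np.abs(eig_vecs[:, :2].T)))

# transform() amounts to centering followed by projection
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(pca.transform(X), manual))
```

# Absolute values are compared because an eigenvector and its negation describe
# the same direction, so the two approaches may disagree on sign.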

# Plot the original and projected data
label_dict = {i: k for i, k in enumerate(iris.target_names)}
def plot(X, y, title, x_label, y_label):
    ax = plt.subplot(111)
    # one scatter series per class, each with its own marker and color
    for label, marker, color in zip(range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        ax.scatter(x=X[:, 0].real[y == label], y=X[:, 1].real[y == label],
                   marker=marker, color=color, alpha=0.5, label=label_dict[label])
    # axis labels only need to be set once, after the loop
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    leg = ax.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    ax.set_title(title)
    plt.show()

plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)","sepal width (cm)")
plot(pca.transform(iris_X), iris_y, "Iris: Data projected onto first two PCA components", "PCA1", "PCA2")

