Original:
Classification / Example 2: Normal and Shrinkage Linear Discriminant Analysis for classification
"""
总结:
1.通过score方法拿到模型对当前特征数量的样本判断准确度
2.对比有无shrinkage,部分方法才可以使用特征压缩
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
"""
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# LinearDiscriminantAnalysis is a classic pattern-recognition algorithm and feature-extraction
# method: it seeks projections that minimize within-class scatter and maximize between-class scatter.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
def generate_data(n_samples, n_features):
    # Generate two blobs of data around the given centers; only the first
    # feature is discriminative.
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    # add non-discriminative (pure noise) features
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y
X, y = generate_data(10, 5)
import pandas as pd
pd.set_option('display.precision', 2)
df = pd.DataFrame(np.hstack([y.reshape(10, 1), X]))
df.columns = ['y', 'X0', 'X1', 'X2', 'X3', 'X4']
print(df)
# Vary the number of features and test the effect of shrinkage
n_train = 20 # samples for training
n_test = 200 # samples for testing
n_averages = 50 # how often to repeat classification
n_features_max = 75 # maximum number of features
step = 4 # step size for the calculation
acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)
for n_features in n_features_range:
    score_clf1, score_clf2 = 0, 0
    for _ in range(n_averages):
        X, y = generate_data(n_train, n_features)
        # Two linear discriminant classifiers: the 'lsqr' solver uses a
        # least-squares solution; compare fits with and without shrinkage.
        clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
        clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)
        X, y = generate_data(n_test, n_features)
        score_clf1 += clf1.score(X, y)
        score_clf2 += clf2.score(X, y)
    acc_clf1.append(score_clf1 / n_averages)
    acc_clf2.append(score_clf2 / n_averages)
# Plot the LDA comparison
# Use the feature-to-sample ratio on the x-axis, which makes the comparison more meaningful
features_samples_ratio = np.array(n_features_range) / n_train
# figsize sets the figure's width and height (in inches)
fig = plt.figure(figsize=(4, 3), dpi=150)
plt.plot(features_samples_ratio, acc_clf1, linewidth=2,
         label="Linear Discriminant Analysis with shrinkage", color='r')
plt.plot(features_samples_ratio, acc_clf2, linewidth=2,
         label="Linear Discriminant Analysis", color='g')
plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')
plt.legend(loc=1, prop={'size': 5})
plt.show()
**The figure below shows the comparison: using shrinkage greatly improves prediction accuracy, especially as the number of features approaches the number of training samples.**
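With `solver='lsqr'`, `shrinkage='auto'` regularizes the covariance estimate using the Ledoit-Wolf formula, which blends the empirical covariance with a scaled identity matrix. A minimal sketch of that estimator on its own (data here is illustrative noise, not the blobs above):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.RandomState(0)
# Few samples, many features: the empirical covariance is poorly conditioned
X = rng.randn(20, 50)

lw = LedoitWolf().fit(X)
# The estimate is (1 - s) * S_empirical + s * mu * I, with the shrinkage
# intensity s chosen analytically from the data
print(lw.shrinkage_)  # shrinkage intensity, between 0 and 1
```

The closer `n_features` is to `n_samples`, the larger the shrinkage intensity tends to be, which matches the accuracy gap seen in the plot.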
The matrix form of least squares is Ax = b, where A is an n×k matrix, x is a k×1 column vector, and b is an n×1 column vector. If n > k (more equations than unknowns), the system is called over-determined; if n < k (fewer equations than unknowns), it is under-determined.
QR decomposition is one of the standard matrix factorizations: it writes a matrix as the product of an orthogonal matrix and an upper triangular matrix.
QR decomposition is often used to solve linear least-squares problems, and it is also the basis of the QR algorithm, a well-known eigenvalue algorithm.
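For an over-determined system, the least-squares solution minimizes ||Ax - b||. A small NumPy sketch (the matrix and vector here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 2)           # n=6 equations, k=2 unknowns: over-determined
x_true = np.array([1.0, -2.0])
b = A @ x_true                # a consistent system, so the residual is zero

# lstsq minimizes ||A x - b|| in the least-squares sense
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)                      # recovers x_true up to floating-point error
```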
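The QR route to the same least-squares solution: factor A = QR, then solve the triangular system Rx = Qᵀb. This is a sketch of the general technique, not a claim about how scikit-learn's 'lsqr' solver is implemented internally:

```python
import numpy as np

rng = np.random.RandomState(1)
A = rng.randn(8, 3)           # over-determined: 8 equations, 3 unknowns
b = rng.randn(8)

# Reduced QR: Q is 8x3 with orthonormal columns, R is 3x3 upper triangular
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# The direct least-squares solver gives the same answer
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ls))
```

Because Q has orthonormal columns, minimizing ||Ax - b|| reduces to the small triangular solve Rx = Qᵀb, which is numerically more stable than forming the normal equations AᵀAx = Aᵀb.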