Machine Learning in Practice: Training on sklearn's Built-in Datasets (Linear Discriminant Analysis)

Original: Classification / Example 2: Normal and Shrinkage Linear Discriminant Analysis for classification

"""
总结:
1.通过score方法拿到模型对当前特征数量的样本判断准确度
2.对比有无shrinkage,部分方法才可以使用特征压缩
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
"""

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# LinearDiscriminantAnalysis: a classic pattern-recognition / feature-extraction
# algorithm that minimizes within-class scatter while maximizing between-class scatter
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis



def generate_data(n_samples, n_features):
    # Generate two clusters of samples; only the first feature is
    # discriminative (cluster centers at -2 and +2)
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    # Pad with non-discriminative (pure noise) features
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y

X, y = generate_data(10, 5)
import pandas as pd
pd.set_option('display.precision', 2)
df = pd.DataFrame(np.hstack([y.reshape(10, 1), X]))
df.columns = ['y', 'X0', 'X1', 'X2', 'X3', 'X4']
print(df)

# Vary the number of features and test the effect of shrinkage
n_train = 20  # samples for training
n_test = 200  # samples for testing
n_averages = 50  # how often to repeat classification
n_features_max = 75  # maximum number of features
step = 4  # step size for the calculation
acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)

for n_features in n_features_range:
    score_clf1, score_clf2 = 0, 0
    for _ in range(n_averages):
        X, y = generate_data(n_train, n_features)
        # Linear discriminant classifiers:
        # the 'lsqr' solver fits via least squares (solvable e.g. by QR decomposition)
        # clf1 uses shrinkage, clf2 does not, so the two can be compared
        clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
        clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)

        X, y = generate_data(n_test, n_features)
        score_clf1 += clf1.score(X, y)
        score_clf2 += clf2.score(X, y)

    acc_clf1.append(score_clf1 / n_averages)
    acc_clf2.append(score_clf2 / n_averages)

# Plot the LDA results
# Use the features/samples ratio on the x-axis for a more meaningful comparison
features_samples_ratio = np.array(n_features_range) / n_train
# figsize sets the figure's width and height (in inches)
fig = plt.figure(figsize=(4,3), dpi=150)
plt.plot(features_samples_ratio, acc_clf1, linewidth=2,
         label="Linear Discriminant Analysis with shrinkage", color='r')
plt.plot(features_samples_ratio, acc_clf2, linewidth=2,
         label="Linear Discriminant Analysis", color='g')
plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')

plt.legend(loc=1, prop={'size': 5})
plt.show()

**The comparison below shows that shrinkage substantially improves prediction accuracy once the number of features approaches or exceeds the number of training samples.**

(Figure 1: classification accuracy vs. n_features / n_samples, for LDA with and without shrinkage)
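With the 'lsqr' solver, `shrinkage='auto'` estimates the shrinkage intensity with the Ledoit-Wolf formula, which pulls the empirical covariance toward a scaled identity matrix. A minimal sketch (illustrative data, not from the example above) of why this helps when features outnumber samples:

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

rng = np.random.RandomState(0)
X = rng.randn(20, 75)  # far fewer samples than features

emp = empirical_covariance(X)   # rank-deficient: at most rank 20
lw = LedoitWolf().fit(X)        # shrunk toward a scaled identity

# The shrunk estimate stays well-conditioned (hence invertible),
# which is exactly what the LDA decision rule needs.
print(np.linalg.cond(emp))
print(np.linalg.cond(lw.covariance_))
print(lw.shrinkage_)  # estimated shrinkage intensity in (0, 1]
```

The condition number of the empirical covariance blows up (the matrix is singular), while the Ledoit-Wolf estimate remains numerically stable.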

The matrix form of least squares is Ax = b, where A is an n×k matrix, x is a k×1 column vector, and b is an n×1 column vector. If n > k (more equations than unknowns), the system is called overdetermined; if n < k (fewer equations than unknowns), it is underdetermined.
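For the overdetermined case, NumPy solves the least-squares problem directly; a small example fitting a line to four points (toy data chosen for illustration):

```python
import numpy as np

# Overdetermined system: n = 4 equations, k = 2 unknowns
# Columns of A: intercept term and the x-coordinate of each point
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Least-squares solution: the x minimizing ||Ax - b||_2
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)  # [3.5 1.4], i.e. the line y = 3.5 + 1.4 * t
```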

QR decomposition is one of the standard ways to factor a matrix: it expresses a matrix as the product of an orthogonal matrix and an upper-triangular matrix.
QR decomposition is often used to solve linear least-squares problems, and it is also the basis of the QR algorithm, a well-known eigenvalue algorithm.
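As an illustration, the same kind of small least-squares problem can be solved through QR factorization with NumPy (toy data; equivalent to `np.linalg.lstsq`):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Reduced QR: Q has orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(A)

# Minimizing ||Ax - b|| reduces to the triangular system R x = Q^T b
x = np.linalg.solve(R, Q.T @ b)
print(x)  # [3.5 1.4], same as np.linalg.lstsq
```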
