The idea of Linear Discriminant Analysis (LDA): given a training set, project the samples onto a line (or, more generally, a hyperplane) such that the projections of same-class samples lie as close together as possible while the projections of different-class samples lie as far apart as possible. To classify a new sample, project it onto that same line (or hyperplane) and determine its class from the position of its projection.
Given a dataset $D=\{(x_i,y_i)\}_{i=1}^m$ with $y_i\in\{0,1\}$, let $X_i$ denote the set of samples of class $i$, let $\mu_i$ denote the class mean vector (e.g. $\mu_1$ is the mean of the samples $x\in X_1$), and let $\Sigma_i$ denote the class covariance matrix (e.g. $\Sigma_0=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T$), where $i\in\{0,1\}$. If the samples are projected onto a line with direction $w$, the projection of the center of class $i$ is $w^T\mu_i$, and the covariance of the projected points of class $i$ is $w^T\Sigma_i w$.
Derivation of the projected covariance:
From $\Sigma_i=\sum_{x\in X_i}(x-\mu_i)(x-\mu_i)^T$, the covariance after projection is $\Sigma_i'=\sum_{x\in X_i}(w^Tx-w^T\mu_i)(w^Tx-w^T\mu_i)^T$. Moreover,

$$
\begin{aligned}
(w^Tx-w^T\mu_i)(w^Tx-w^T\mu_i)^T &= w^T(x-\mu_i)\,[w^T(x-\mu_i)]^T \\
&= w^T(x-\mu_i)(x-\mu_i)^T w,
\end{aligned}
$$

so summing over $x\in X_i$ gives the stated result $w^T\Sigma_i w$.
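As a sanity check on this identity, here is a minimal NumPy sketch on made-up data (all variable names are illustrative): the scatter of the one-dimensional projections equals the quadratic form $w^T\Sigma_0 w$.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(size=(50, 3))        # made-up samples of one class
w = rng.normal(size=3)               # an arbitrary projection direction

mu0 = X0.mean(axis=0)                # class mean
# scatter matrix Sigma_0 = sum_x (x - mu_0)(x - mu_0)^T
Sigma0 = (X0 - mu0).T @ (X0 - mu0)

# scatter of the 1-D projections w^T x ...
proj = X0 @ w
lhs = ((proj - proj.mean()) ** 2).sum()
# ... equals the quadratic form w^T Sigma_0 w
rhs = w @ Sigma0 @ w
print(np.isclose(lhs, rhs))          # True
```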
The goal is to make $w^T\Sigma_0 w + w^T\Sigma_1 w$ as small as possible while making $\|w^T\mu_0-w^T\mu_1\|_2^2$ as large as possible. Combining the two yields the objective to be maximized:
$$
\begin{aligned}
J &= \frac{\|w^T\mu_0-w^T\mu_1\|_2^2}{w^T\Sigma_0 w + w^T\Sigma_1 w} \\
&= \frac{w^T(\mu_0-\mu_1)(\mu_0-\mu_1)^T w}{w^T(\Sigma_0+\Sigma_1)\,w} \\
&= \frac{w^T S_b w}{w^T S_w w}.
\end{aligned}
$$
This is the objective LDA seeks to maximize: the generalized Rayleigh quotient of $S_b$ and $S_w$. Here the within-class scatter matrix is $S_w=\Sigma_0+\Sigma_1=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T+\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T$, and the between-class scatter matrix is $S_b=(\mu_0-\mu_1)(\mu_0-\mu_1)^T$.
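Both scatter matrices are easy to form directly from data. A minimal NumPy sketch for the binary case (the helper names `scatter_matrices` and `rayleigh_quotient` are made up for illustration):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class scatter S_w and between-class scatter S_b for labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    d = (mu0 - mu1).reshape(-1, 1)
    Sb = d @ d.T                     # rank-1 between-class scatter
    return Sw, Sb

def rayleigh_quotient(w, Sw, Sb):
    """J(w) = (w^T S_b w) / (w^T S_w w), the quantity LDA maximizes."""
    return (w @ Sb @ w) / (w @ Sw @ w)
```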
Because $S_b$ and $S_w$ are computed from the given dataset and hence fixed, both the numerator and the denominator of $J$ are quadratic forms in $w$, so $J$ depends only on the direction of $w$: if $w$ is a solution, then so is $\alpha w$ for any nonzero constant $\alpha$. Without loss of generality we may fix $w^TS_ww=1$, and maximizing the objective becomes equivalent to
$$
\begin{aligned}
\min_w\;& -w^TS_bw \\
\text{s.t.}\;& w^TS_ww=1.
\end{aligned}
$$
By the method of Lagrange multipliers,
$$L(w)=-w^TS_bw+\lambda\,(w^TS_ww-1).$$
Setting $\dfrac{\partial L}{\partial w}=-(S_b+S_b^T)w+\lambda(S_w+S_w^T)w=-2S_bw+2\lambda S_ww=0$ gives $S_bw=\lambda S_ww$. (By their definitions, $S_b=S_b^T$ and $S_w=S_w^T$.)
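In general, $S_bw=\lambda S_ww$ is a generalized eigenvalue problem that a standard solver can handle directly. A sketch, assuming `Sw` and `Sb` as built by the helper above and assuming $S_w$ is positive definite (a requirement of `scipy.linalg.eigh`); the two-class case, however, admits the closed form derived next:

```python
from scipy.linalg import eigh

# Generalized symmetric eigenproblem S_b w = lambda S_w w.
# eigh(Sb, Sw) returns eigenvalues in ascending order, so the last
# eigenvector is the direction maximizing the Rayleigh quotient.
eigvals, eigvecs = eigh(Sb, Sw)
w_opt = eigvecs[:, -1]
```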
Since $S_bw=(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw$ always points in the direction of $\mu_0-\mu_1$, we may set $S_bw=\lambda(\mu_0-\mu_1)$, which yields
$$w=S_w^{-1}(\mu_0-\mu_1).$$
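A minimal self-contained sketch of this closed form (the function name `fisher_direction` is illustrative):

```python
import numpy as np

def fisher_direction(X, y):
    """Closed-form two-class LDA direction w = S_w^{-1} (mu_0 - mu_1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # solve S_w w = mu_0 - mu_1 rather than forming the inverse explicitly
    return np.linalg.solve(Sw, mu0 - mu1)
```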
In practice $S_w^{-1}$ is usually computed via singular value decomposition: write $S_w=U\Sigma V^T$, where $\Sigma$ is a diagonal matrix whose diagonal entries are the singular values of $S_w$; then $S_w^{-1}=V\Sigma^{-1}U^T$.
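A sketch of that SVD route, numerically safer when $S_w$ is ill-conditioned (`numpy.linalg.pinv` performs essentially the same computation; the helper and tolerance here are illustrative):

```python
import numpy as np

def svd_inverse(Sw, tol=1e-10):
    """S_w = U diag(s) V^T  =>  S_w^{-1} = V diag(1/s) U^T.
    Singular values below tol are zeroed, giving a pseudo-inverse."""
    U, s, Vt = np.linalg.svd(Sw)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T
```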
Now suppose there are $N$ classes and the $i$-th class contains $m_i$ samples. Define the total scatter matrix $S_t=S_b+S_w=\sum_{i=1}^{m}(x_i-\mu)(x_i-\mu)^T$, where $\mu$ is the mean of all samples; the within-class scatter matrix $S_w=\sum_{i=1}^{N}\sum_{x\in X_i}(x-\mu_i)(x-\mu_i)^T$; and the between-class scatter matrix $S_b=\sum_{i=1}^{N}m_i(\mu_i-\mu)(\mu_i-\mu)^T$. Analogously to the two-class case, one maximizes a ratio of between-class to within-class scatter, and the projection matrix $W$ is formed from the eigenvectors of $S_w^{-1}S_b$ with the largest eigenvalues; since $\mathrm{rank}(S_b)\le N-1$, at most $N-1$ useful projection directions exist.
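A sketch that builds the three multiclass scatter matrices on random made-up data and verifies the identity $S_t=S_w+S_b$:

```python
import numpy as np

def multiclass_scatter(X, y):
    """Total scatter S_t, within-class S_w, and between-class S_b."""
    mu = X.mean(axis=0)
    St = (X - mu).T @ (X - mu)
    Sw = np.zeros_like(St)
    Sb = np.zeros_like(St)
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        d = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)
    return St, Sw, Sb

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)      # three made-up classes
St, Sw, Sb = multiclass_scatter(X, y)
print(np.allclose(St, Sw + Sb))      # True
```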
The sklearn class implementing LDA is LinearDiscriminantAnalysis; official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
The constructor signature is shown below; detailed descriptions of the class's parameters, attributes, and methods can be found in the official documentation linked above:
sklearn.discriminant_analysis.LinearDiscriminantAnalysis(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001)
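As a brief illustration of these parameters (not an exhaustive guide; see the documentation above for exact semantics): the default 'svd' solver works without computing the covariance matrix, the 'lsqr' and 'eigen' solvers support shrinkage regularization, and n_components controls the dimensionality of the transform.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_default = LinearDiscriminantAnalysis()                 # solver='svd'
lda_shrunk = LinearDiscriminantAnalysis(solver='lsqr',
                                        shrinkage='auto')  # Ledoit-Wolf shrinkage
lda_2d = LinearDiscriminantAnalysis(n_components=2)        # for a 2-D projection
```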
The example data come from Kaggle, a machine-learning competition platform: a mobile-phone price classification dataset, available at https://www.kaggle.com/iabhishekofficial/mobile-price-classification
The dataset contains 2,000 samples and 20 predictor fields; the target field, price_range, divides the samples into 4 classes. There are no missing or erroneous values. Descriptions of the individual fields can be found on the Kaggle page above.
Since common sense suggests that body thickness (m_dep) and body weight (mobile_wt) usually have no direct influence on price, these two fields are dropped before modeling.
```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import model_selection

# Read the data
data = pd.read_csv(r'C:\Users\Administrator\Desktop\train.csv')
# Drop the two fields judged irrelevant above
data.drop(['m_dep', 'mobile_wt'], axis=1, inplace=True)
# Features are all columns except the last; the target is price_range
X = data[data.columns[:-1]]
Y = data[data.columns[-1]]
# Split into training and test sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    X, Y, test_size=0.25, random_state=1234)
# Fit the LDA classifier
lda = LinearDiscriminantAnalysis()
lda.fit(x_train, y_train)
# Evaluate on the held-out test set
print('Model accuracy:\n', lda.score(x_test, y_test))
```
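Beyond classification, the fitted estimator can also project the data onto its discriminant axes via transform; with 4 classes, at most 3 components are available. A short follow-up sketch reusing the variables from the script above:

```python
# Project the training data onto the first two discriminant axes
lda_2d = LinearDiscriminantAnalysis(n_components=2)
z = lda_2d.fit_transform(x_train, y_train)
print(z.shape)   # (n_train_samples, 2)
```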
References:
[1] Zhou Zhihua. Machine Learning. Tsinghua University Press, 2016.
[2] scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis