Steps:
With bootstrap sampling, only about 63% of the original samples end up in each training set.
The probability that a given sample is not chosen in a single draw is $(1-\frac{1}{n})$, so the probability that it is never chosen in $n$ draws with replacement is

$$
P_{not\_choose}=\left(1-\frac{1}{n}\right)^n \xrightarrow{\;n\to\infty\;} \frac{1}{e}\approx 37\%
$$
In other words, roughly 37% of the samples (including any noisy or erroneous ones among them) are left out of each bootstrap training set, which helps improve the quality and diversity of the individual training sets.
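A quick numerical sketch of this limit (illustrative only, not part of the post's example code):

import math
for n in (10, 100, 1000, 100000):
    print(n, (1 - 1 / n) ** n)   # approaches 1/e as n grows
print(1 / math.e)                # ≈ 0.3679, i.e. about 37%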
The probability that the ensemble classifier (majority vote over $m$ base classifiers, each with error rate $r$) predicts incorrectly is

$$
p(error)=\sum_{i=(m+1)/2}^{m}\binom{m}{i}\,r^{i}(1-r)^{m-i}
$$
Improving the overall generalization ability through ensemble learning only works under certain conditions:
The base classifiers should be diverse, i.e. only weakly correlated with one another
The accuracy of every individual base classifier must be greater than 0.5
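A small sketch of the error formula above, showing why each base classifier needs an error rate below 0.5 (the m classifiers are assumed independent; m and r values are illustrative):

from math import comb

def ensemble_error(m, r):
    # Probability that a majority of m independent classifiers, each with error rate r, are wrong
    return sum(comb(m, i) * r**i * (1 - r)**(m - i) for i in range((m + 1) // 2, m + 1))

print(ensemble_error(11, 0.3))   # well below 0.3: the ensemble helps
print(ensemble_error(11, 0.6))   # above 0.6: weak base classifiers make things worse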
Bagging in scikit-learn, with KNN as the base learner on the iris data:

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load the iris training data from a local CSV file
data_url = 'iris_train.csv'
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]   # feature columns
y = df.iloc[:, 5]     # label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Each base learner sees 50% of the samples and 50% of the features
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
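As a follow-up sketch (reusing X_train and y_train from above): BaggingClassifier can report the out-of-bag score, which uses exactly the ~37% of samples left out of each bootstrap sample as a built-in validation set.

# Out-of-bag scoring requires bootstrap=True (the default)
oob_clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5,
                            bootstrap=True, oob_score=True)
oob_clf.fit(X_train, y_train)
print(oob_clf.oob_score_)   # accuracy estimated on the out-of-bag samples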
Basic idea:
Row sampling: resample the data with replacement (bootstrap)
Column sampling: randomly sample a subset of the features, so each tree learns from only part of the features rather than all of them
The randomness in both the data and the feature selection of each decision tree makes the individual classifiers "agree on the big picture while differing in the details"
Key factor: the number of features m selected for each tree
The classification performance (error rate) of a random forest depends on two factors:
The correlation between any two trees in the forest: the higher the correlation, the higher the error rate
[i.e. the less diversity there is among the base classifiers]
The classification strength of each tree: the stronger each individual tree, the lower the forest's overall error rate
[i.e. base classifiers with higher accuracy]
Decreasing the number of selected features m lowers both the correlation between trees and the strength of each tree;
increasing m raises both, so the key question is how to choose the optimal m.
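In scikit-learn, m corresponds to RandomForestClassifier's max_features parameter. A minimal sketch of choosing it by cross-validation, reusing X_train and y_train from the iris example above (the grid values are illustrative; iris has 4 features):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search over the number of features considered at each split (the "m" discussed above)
param_grid = {'max_features': [1, 2, 3, 4]}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # the m with the best cross-validated accuracy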
Advantages:
Random forest in scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data_url = 'iris_train.csv'
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]   # feature columns
y = df.iloc[:, 5]     # label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Random forest with 100 trees
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
Boosting classifiers
——In Bagging, every classifier gets the same weight
——Boosting gives a higher weight to classifiers with stronger classification ability
$$
\begin{aligned}
&\textbf{for } k = 1 \textbf{ to } iterations:\\
&\quad classifier_k = \text{learn a weak classifier based on the current weights}\\
&\quad \text{weighted classification error: } \varepsilon_k=\sum_{i=1}^{n}\omega_i\cdot 1\big[label_i\neq classifier_k(x_i)\big]\\
&\quad \text{classifier coefficient ("score"): } \alpha_k=\frac{1}{2}\log\Big(\frac{1-\varepsilon_k}{\varepsilon_k}\Big)\\
&\quad \text{weight update: } w_i \leftarrow \frac{1}{Z}\,w_i\exp\big(-\alpha_k\cdot label_i\cdot classifier_k(x_i)\big)
\end{aligned}
$$

where $Z$ is a normalization constant so that the weights sum to 1.
$$
\begin{aligned}
&\bullet\ \text{Ensemble model: } f(x)=\sum_{m=1}^{M}\alpha_m G_m(x)\\
&\bullet\ \text{Exponential loss: } L(y,f(x))=\exp[-y\,f(x)]\\
&\bullet\ \text{Derivation:}\\
&\quad 1.\ \text{By construction, } f_m(x)=f_{m-1}(x)+\alpha_m G_m(x)\\
&\quad 2.\ \text{Substituting into the loss: } L=\sum_{i=1}^{N}\exp\big(-y_i\,(f_{m-1}(x_i)+\alpha_m G_m(x_i))\big)\\
&\quad 3.\ \text{Setting the derivative with respect to } \alpha_m \text{ to zero gives: } \alpha_m=\frac{1}{2}\log\Big(\frac{1-\varepsilon_m}{\varepsilon_m}\Big)
\end{aligned}
$$
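A minimal numpy sketch of one boosting round under these formulas (the toy labels and predictions below are made up for illustration; labels are in {-1, +1}):

import numpy as np

# Toy example: 5 samples, one weak classifier that misclassifies the second sample
labels = np.array([1, 1, -1, -1, 1])
preds  = np.array([1, -1, -1, -1, 1])
w = np.full(5, 1 / 5)                     # initial uniform weights

eps = np.sum(w * (labels != preds))       # weighted classification error
alpha = 0.5 * np.log((1 - eps) / eps)     # classifier coefficient
w = w * np.exp(-alpha * labels * preds)   # weight update
w = w / w.sum()                           # normalize (the 1/Z factor)
print(eps, alpha, w)                      # the misclassified sample now carries more weight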
AdaBoost in scikit-learn:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data_url = 'iris_train.csv'
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]
y = df.iloc[:, 5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# AdaBoost with 100 weak learners (decision stumps by default)
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
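To watch boosting at work, one can track the test accuracy as weak learners are added; a short sketch using the fitted clf and the test split above:

# Accuracy after each boosting round (predictions of the ensemble built so far)
for i, y_pred in enumerate(clf.staged_predict(X_test), start=1):
    if i % 20 == 0:   # print every 20 rounds to keep the output short
        print(i, accuracy_score(y_test, y_pred))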
GBDT ≈ random forest + AdaBoost: it combines an ensemble of decision trees (as in a random forest) with the stage-wise, boosting-style training of AdaBoost, producing an additive model of trees:
$$
F(X)=F_0+\beta_1T_1(X)+\beta_2T_2(X)+\cdots+\beta_MT_M(X)
$$
Learning process:
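The notes do not spell the learning process out here, but the standard procedure behind the formula above is stage-wise fitting to residuals. A minimal regression sketch (toy data, squared-error loss and a fixed shrinkage factor are assumed):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=100)

F = np.full_like(y_toy, y_toy.mean())   # F_0: start from a constant prediction
beta = 0.1                              # shrinkage (learning rate)
trees = []
for m in range(50):
    residual = y_toy - F                             # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    F = F + beta * tree.predict(X_toy)               # F_m = F_{m-1} + beta * T_m(X)
    trees.append(tree)
print(np.mean((y_toy - F) ** 2))                     # training MSE shrinks as trees are added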
GBDT's classification algorithm is conceptually no different from its regression algorithm, but because the sample outputs are discrete class labels rather than continuous values, we cannot directly fit the error of the predicted class outputs.
There are two main ways to deal with this:
$$
\begin{aligned}
&1.\ \text{Exponential loss (GBDT then reduces to AdaBoost): } L(y,f(x))=\exp[-y\,f(x)]\\
&2.\ \text{Log loss (cross-entropy): } L(\theta)=-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)
\end{aligned}
$$
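In scikit-learn these two choices map onto GradientBoostingClassifier's loss parameter (a sketch; note that 'exponential' only supports binary problems, and older sklearn versions spell the log loss 'deviance'):

from sklearn.ensemble import GradientBoostingClassifier

clf_log = GradientBoostingClassifier(loss='log_loss')      # log loss, the default
clf_exp = GradientBoostingClassifier(loss='exponential')   # exponential loss, AdaBoost-like (binary only)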
——GBDT Classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data_url = 'iris_train.csv'
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]
y = df.iloc[:, 5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 100 boosting stages of depth-1 trees (stumps)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
——GBDT Regression
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the regression data and drop identifier/time columns
train_url = 'trainOX.csv'
train_data = pd.read_csv(train_url)
train_data.drop(['ID', 'date', 'hour'], axis=1, inplace=True)
X = train_data.iloc[:, 0:10]
y = train_data.iloc[:, 10]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Gradient boosting regressor (tree-based, so no feature scaling is needed)
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
reg.fit(X_train, y_train)
y_val_pre = reg.predict(X_val)
print(mean_squared_error(y_val, y_val_pre))
——XGBoost Classification
from xgboost import XGBClassifier   # XGBClassifier lives in the xgboost package, not sklearn.ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data_url = 'iris_train.csv'
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]
y = df.iloc[:, 5]   # note: newer xgboost versions expect integer-encoded class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
——XGBoost Regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
import pandas as pd
train_url = 'trainOX.csv'
train_data = pd.read_csv(train_url)
train_data.drop(['ID', 'date', 'hour'], axis=1, inplace=True)
X = train_data.iloc[:, 0:10]
y = train_data.iloc[:, 10]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# 'reg:squarederror' replaces the deprecated 'reg:linear' objective
reg = XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, objective='reg:squarederror')
reg.fit(X_train, y_train)   # fit on features and targets (not X_train, X_val)
y_val_pre = reg.predict(X_val)
print(mean_squared_error(y_val, y_val_pre))