Contents
***Feature engineering***
1. Prepare the tools
2. Read the data
3. Prepare the data
4. Random forest with default parameters
5. Random forest hyperparameter tuning
Determine n_estimators=200 by experiment, then tune the other parameters
At max_features=26 performance rises with some fluctuation, so no fine-tuning is needed
Performance rises monotonically with max_depth, so keep increasing max_depth
Tuning shows max_depth=40 is best; next, tune min_samples_leaf
Other CART model-complexity parameters:
Train the model with the best parameter combination
Feature importances
Probability calibration for random forests
***Testing***
1. Read the data
2. Data
Data source: the data from the Otto Group Product Classification Challenge hosted on Kaggle in 2015. We run RandomForest with default parameters as a baseline and then tune its hyperparameters (CART + GridSearchCV was used for tuning in the same way).
The Otto dataset is a multi-class product classification problem: 9 classes, 93 original features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
#Read the data
dpath='./data/'
#Use the original features + tf-idf features
train1=pd.read_csv(dpath+"Otto_FE_train_org.csv")
train2=pd.read_csv(dpath+"Otto_FE_train_tfidf.csv")
#Drop the redundant id and target columns (train1 keeps them)
train2=train2.drop(["id","target"],axis=1)
train=pd.concat([train1,train2],axis=1,ignore_index=False)
train.head()
del train1
del train2
y_train=train['target']
X_train=train.drop(["id","target"],axis=1)
#Save the feature names for later use
feat_names=X_train.columns
#Convert the data to a sparse matrix
from scipy.sparse import csr_matrix
X_train=csr_matrix(X_train)
from sklearn.ensemble import RandomForestClassifier
RF1=RandomForestClassifier()
#Cross-validation is used to evaluate model performance and tune parameters
#For classification tasks, cross-validation defaults to StratifiedKFold
#The dataset is fairly large, so use 3-fold cross-validation
from sklearn.model_selection import cross_val_score
loss=cross_val_score(RF1,X_train,y_train,cv=3,scoring='neg_log_loss')
print('logloss of each fold is:',-loss)
print('cv logloss is:',-loss.mean())
Output:
Although the result is still not great, it is far better than a single tree with default parameters.
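For reference, that single-tree baseline can be reproduced under the same 3-fold setup; a minimal sketch, assuming a default-parameter DecisionTreeClassifier stands in for the single tree:
#Baseline for comparison: a single CART with default parameters,
#evaluated with the same 3-fold cross-validation (sketch)
from sklearn.tree import DecisionTreeClassifier
tree_loss=cross_val_score(DecisionTreeClassifier(),X_train,y_train,
                          cv=3,scoring='neg_log_loss')
print('cv logloss of a single default tree:',-tree_loss.mean())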
Random forests have quite a few hyperparameters:
Bagging parameters: n_estimators, oob_score, etc.
Hyperparameters shared with a single decision tree (CART): max_depth, max_features, min_samples_leaf, etc.
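GridSearchCV, imported earlier, could wrap this whole search in one call. Below is a minimal sketch of that alternative; the grid values are illustrative rather than the exact ranges swept in the OOB loops that follow, and a full grid of this size is slow on a dataset this large:
#Alternative: exhaustive search with GridSearchCV (illustrative grid)
param_grid={'max_features':[13,20,26],
            'max_depth':[10,20,40],
            'min_samples_leaf':[1,10,30]}
grid=GridSearchCV(RandomForestClassifier(n_estimators=200,random_state=33),
                  param_grid,scoring='neg_log_loss',cv=3)
grid.fit(X_train,y_train)
print(grid.best_params_,grid.best_score_)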
#Parameter to tune: the number of trees
tuned_n_estimators=range(10,200,10)
accuracy_s=np.zeros(len(tuned_n_estimators))
#Initial max_depth is set to the single tree's max_depth; max_features is set
#slightly above the recommended value sqrt(D)=sqrt(163)≈13
#min_samples_leaf is set slightly smaller than the single tree's min_samples_leaf
for j,one_n_estimators in enumerate(tuned_n_estimators):
    RF2=RandomForestClassifier(n_estimators=one_n_estimators,max_depth=10,
                               max_features=20,min_samples_leaf=30,oob_score=True,
                               random_state=33)
    RF2.fit(X_train,y_train)
    accuracy_s[j]=RF2.oob_score_
plt.plot(tuned_n_estimators,accuracy_s)
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
#Parameter to tune: max_features
tuned_parameters=range(10,40,2)
accuracy_s=np.zeros(len(tuned_parameters))
for j,one_parameter in enumerate(tuned_parameters):
    RF2=RandomForestClassifier(n_estimators=200,max_features=one_parameter,
                               min_samples_leaf=30,oob_score=True,random_state=33)
    RF2.fit(X_train,y_train)
    accuracy_s[j]=RF2.oob_score_
plt.plot(tuned_parameters,accuracy_s)
print(accuracy_s)
Output:
Tune max_depth
#Parameter to tune: max_depth
tuned_parameters=range(10,50,10)
accuracy_s=np.zeros(len(tuned_parameters))
for j,one_parameter in enumerate(tuned_parameters):
    RF2=RandomForestClassifier(n_estimators=200,max_features=26,
                               max_depth=one_parameter,min_samples_leaf=30,
                               oob_score=True,random_state=33)
    RF2.fit(X_train,y_train)
    accuracy_s[j]=RF2.oob_score_
plt.plot(tuned_parameters,accuracy_s)
plt.show()
print(accuracy_s)
Output plot:
#Parameter to tune: larger max_depth values
tuned_parameters=range(50,100,10)
accuracy_s=np.zeros(len(tuned_parameters))
for j,one_parameter in enumerate(tuned_parameters):
    RF2=RandomForestClassifier(n_estimators=200,max_features=26,
                               max_depth=one_parameter,min_samples_leaf=30,
                               oob_score=True,random_state=33)
    RF2.fit(X_train,y_train)
    accuracy_s[j]=RF2.oob_score_
plt.plot(tuned_parameters,accuracy_s)
plt.show()
print(accuracy_s)
#Parameter to tune: min_samples_leaf
tuned_parameters=range(1,10,2)
accuracy_s=np.zeros(len(tuned_parameters))
for j,one_parameter in enumerate(tuned_parameters):
    RF2=RandomForestClassifier(n_estimators=200,max_features=26,max_depth=40,
                               min_samples_leaf=one_parameter,oob_score=True,
                               random_state=33)
    RF2.fit(X_train,y_train)
    accuracy_s[j]=RF2.oob_score_
    print(accuracy_s[j])
plt.plot(tuned_parameters,accuracy_s)
plt.show()
Output:
Comparing the accuracies, min_samples_leaf=1 gives the highest.
Best parameter combination:
RF2=RandomForestClassifier(n_estimators=200,max_features=26,max_depth=40,
                           min_samples_leaf=1,oob_score=True,random_state=40)
RF2.fit(X_train,y_train)
#Save the model for later testing
import pickle
pickle.dump(RF2,open("Otto_RF_org_tfidf.pkl",'wb'))
df=pd.DataFrame({"columns":list(feat_names),
                 "importance":RF2.feature_importances_})
df=df.sort_values(by=["importance"],ascending=False)
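To visualize the ranking, here is a short sketch of a bar chart of the top features (the cutoff of 20 is arbitrary):
#Plot the 20 most important features (top-20 cutoff is illustrative)
top=df.head(20)
plt.figure(figsize=(10,6))
plt.barh(top["columns"][::-1],top["importance"][::-1])
plt.xlabel('importance')
plt.show()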
Methods such as bagging and random forests, which average the predictions of base models, can have difficulty pushing predictions close to 0 and 1, so calibrating their outputs yields better per-class probability estimates.
from sklearn.calibration import CalibratedClassifierCV
CalibratedRF=CalibratedClassifierCV(base_estimator=RF2,cv=5)
CalibratedRF.fit(X_train,y_train)
#Save the model for later testing
pickle.dump(CalibratedRF,open("Otto_CalibratedRF_org_tfidf.pkl",'wb'))
import pandas as pd
import numpy as np
import pickle
#Read the data
dpath='./data/'
#Use the original features + tf-idf features
test1=pd.read_csv(dpath+"Otto_FE_test_org.csv")
test2=pd.read_csv(dpath+"Otto_FE_test_tfidf.csv")
#Drop the redundant id (the test set has no target column)
test2=test2.drop(["id"],axis=1)
test=pd.concat([test1,test2],axis=1,ignore_index=False)
test.head()
del test1
del test2
test_id=test['id']
X_test=test.drop(['id'],axis=1)
#Save the feature names
feat_names=X_test.columns
#Convert the data to a sparse matrix
from scipy.sparse import csr_matrix
X_test=csr_matrix(X_test)
#Load the model trained above
model=pickle.load(open("Otto_RF_org_tfidf.pkl",'rb'))
#Output per-class probabilities
y_test_pred=model.predict_proba(X_test)
#Generate the submission file
out_df=pd.DataFrame(y_test_pred)
columns=np.empty(9,dtype=object)
for i in range(9):
    columns[i]='Class_'+str(i+1)
out_df.columns=columns
out_df=pd.concat([test_id,out_df],axis=1)
out_df.to_csv("RF_org_tfidf.csv",index=False)
Submission score using both feature sets: original features + tf-idf features.
Next, try loading the calibrated model saved earlier, "Otto_CalibratedRF_org_tfidf.pkl", and testing its score.
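A minimal sketch of that follow-up, reusing X_test and test_id from above (the output filename is arbitrary):
#Load the calibrated model saved earlier and regenerate the submission
model=pickle.load(open("Otto_CalibratedRF_org_tfidf.pkl",'rb'))
y_test_pred=model.predict_proba(X_test)
out_df=pd.DataFrame(y_test_pred,columns=['Class_'+str(i+1) for i in range(9)])
out_df=pd.concat([test_id,out_df],axis=1)
out_df.to_csv("CalibratedRF_org_tfidf.csv",index=False)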