Contents
I. Preliminaries
II. Splitting the dataset
III. Classification performance metrics
IV. Overview of classification algorithms
1. KNN (K-Nearest Neighbors)
2. Naive Bayes: "naive" means the features are assumed mutually independent
3. Decision trees: split first on the feature giving the largest information gain
4. Support Vector Machine (SVM)
5. Ensemble methods
5.1 Bagging: models combined in parallel
5.2 Boosting: models chained in series
V. Training and predicting with classification models
Learning: inducing what is the same and what is different from the data received.
Machine learning: having a computer carry out this induction and summarization on the basis of data.
The sample data, preprocessing, and feature engineering come from an earlier post, "Python Data Analysis: Feature Engineering" (https://blog.csdn.net/weixin_45085051/article/details/126986556). Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound. Feature engineering generally covers four aspects: feature usage, feature acquisition, feature processing, and feature monitoring.
k-fold cross-validation: split the dataset into k folds; each fold takes one turn as the test set while the remaining k-1 folds form the training set.
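A minimal sketch of this rotation using scikit-learn's `KFold`; the toy data below is made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)

# 5 folds over 10 samples: each fold holds out 2 samples as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(fold, len(train_idx), len(test_idx))  # each fold: 8 train, 2 test
```

In practice one would fit a model on `X[train_idx]` inside the loop and average the score on `X[test_idx]` across folds (scikit-learn's `cross_val_score` wraps this pattern).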
# Split off the validation set
from sklearn.model_selection import train_test_split
f_v = features.values  # the data is a DataFrame; take its values as the feature matrix
f_names = features.columns.values  # feature names
l_v = label.values
x_tt, x_validation, y_tt, y_validation = train_test_split(f_v, l_v, test_size=0.2)
# Split the remaining 80% again into training and test sets
# (0.25 * 0.8 = 0.2 of the whole, giving a 60/20/20 train/test/validation split)
x_train, x_test, y_train, y_test = train_test_split(x_tt, y_tt, test_size=0.25)
# Import the evaluation metrics: accuracy, recall, F1 score
from sklearn.metrics import accuracy_score, recall_score, f1_score
Accuracy: the proportion of all samples whose class is predicted correctly.
Precision: among all samples predicted positive, the proportion that are truly positive.
Recall: among all truly positive samples, the proportion that are predicted positive.
F1 score: the harmonic mean of precision and recall.
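A quick illustration of the four metrics on made-up predictions (the labels below are arbitrary, not from the HR dataset):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # 2 TP, 2 FN, 1 FP, 3 TN

print(accuracy_score(y_true, y_pred))   # (2 + 3) / 8 = 0.625
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.666...
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5
print(f1_score(y_true, y_pred))         # 2PR / (P + R) = 4/7 = 0.571...
```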
# Import the dataset-splitting helper
from sklearn.model_selection import train_test_split
# Import the evaluation metrics
from sklearn.metrics import accuracy_score, recall_score, f1_score
# Import the classification models
from sklearn.neighbors import KNeighborsClassifier  # KNN classifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB  # Gaussian and Bernoulli naive Bayes
from sklearn.tree import DecisionTreeClassifier, export_graphviz  # decision tree and tree export
from sklearn.svm import SVC  # the SVC classifier from the SVM module
from sklearn.ensemble import RandomForestClassifier  # random forest: a bagging ensemble
from sklearn.ensemble import AdaBoostClassifier  # AdaBoost: a boosting ensemble
def hr_modeling(features, label):
    # Split off the validation set
    f_v = features.values  # the data is a DataFrame; take its values as the feature matrix
    f_names = features.columns.values  # feature names
    l_v = label.values
    x_tt, x_validation, y_tt, y_validation = train_test_split(f_v, l_v, test_size=0.2)
    # Split the remaining data again into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x_tt, y_tt, test_size=0.25)
    # Store (name, model) tuples in a list
    models = []
    models.append(("KNN", KNeighborsClassifier(n_neighbors=3)))  # number of neighbors
    models.append(("GaussianNB", GaussianNB()))    # Gaussian naive Bayes
    models.append(("BernoulliNB", BernoulliNB()))  # Bernoulli naive Bayes
    # criterion defaults to Gini impurity; min_impurity_decrease pre-prunes the tree
    # (the older min_impurity_split parameter was removed in scikit-learn 1.0)
    models.append(("DecisionTreeGini", DecisionTreeClassifier(min_impurity_decrease=0.01)))
    models.append(("DecisionTreeEntropy", DecisionTreeClassifier(criterion="entropy")))  # information gain
    # C is the penalty on misclassified points: the larger C, the tighter the fit
    models.append(("SVM Classifier", SVC(C=1000)))
    models.append(("RandomForest", RandomForestClassifier()))  # tunable parameters:
    # n_estimators: number of trees
    # max_features=None: every tree sees all features
    # bootstrap=True: sample with replacement; False: without replacement
    # oob_score=True: use the out-of-bag samples as a validation set
    models.append(("Adaboost", AdaBoostClassifier(estimator=SVC(), algorithm="SAMME")))
    # estimator: the weak base classifier; it must expose classes_ and n_classes_
    #   (this parameter was named base_estimator before scikit-learn 1.2)
    # n_estimators: number of boosting stages
    # learning_rate: weight shrinkage applied to each stage
    # algorithm: use 'SAMME.R' if the base estimator can output class probabilities
    # Train and evaluate every model in a loop
    for clf_name, clf in models:
        clf.fit(x_train, y_train)
        xy_lst = [(x_train, y_train), (x_validation, y_validation), (x_test, y_test)]
        for i in range(len(xy_lst)):
            x_part = xy_lst[i][0]  # features of the i-th subset
            y_part = xy_lst[i][1]  # labels of the i-th subset
            y_pred = clf.predict(x_part)
            print(i)  # 0: training set, 1: validation set, 2: test set
            print(clf_name, "ACC:", accuracy_score(y_part, y_pred))
            print(clf_name, "REC:", recall_score(y_part, y_pred))
            print(clf_name, "F-score:", f1_score(y_part, y_pred))

# Run all models
def main():
    features, label = hr_preprocessing()  # its flag defaults to False and can be set to True
    hr_modeling(features, label)

if __name__ == "__main__":
    main()
KNN results:
0 (training set)
KNN ACC: 0.975108345371708
KNN REC: 0.9596287703016241
KNN F-score: 0.9486238532110092
1 (validation set)
KNN ACC: 0.959
KNN REC: 0.9314868804664723
KNN F-score: 0.9122055674518201
2 (test set)
KNN ACC: 0.954
KNN REC: 0.9273972602739726
KNN F-score: 0.9075067024128687
GaussianNB results:
0
GaussianNB ACC: 0.7848649849983331
GaussianNB REC: 0.7614849187935034
GaussianNB F-score: 0.628976619394404
1
GaussianNB ACC: 0.7893333333333333
GaussianNB REC: 0.7842565597667639
GaussianNB F-score: 0.629976580796253
2
GaussianNB ACC: 0.779
GaussianNB REC: 0.7753424657534247
GaussianNB F-score: 0.630640668523677
BernoulliNB results:
0
BernoulliNB ACC: 0.8400933437048561
BernoulliNB REC: 0.46635730858468677
BernoulliNB F-score: 0.5827776167004929
1
BernoulliNB ACC: 0.8493333333333334
BernoulliNB REC: 0.48833819241982507
BernoulliNB F-score: 0.5971479500891266
2
BernoulliNB ACC: 0.8386666666666667
BernoulliNB REC: 0.45616438356164385
BernoulliNB F-score: 0.5791304347826087
Results for the Gini-impurity decision tree:
0
DecisionTreeGini ACC: 0.8169796644071563
DecisionTreeGini REC: 0.694199535962877
DecisionTreeGini F-score: 0.6449665876266436
1
DecisionTreeGini ACC: 0.82
DecisionTreeGini REC: 0.7376093294460642
DecisionTreeGini F-score: 0.652061855670103
2
DecisionTreeGini ACC: 0.8316666666666667
DecisionTreeGini REC: 0.7246575342465753
DecisionTreeGini F-score: 0.6769033909149073
Results for the information-gain decision tree:
0
DecisionTreeEntropy ACC: 1.0
DecisionTreeEntropy REC: 1.0
DecisionTreeEntropy F-score: 1.0
1
DecisionTreeEntropy ACC: 0.9746666666666667
DecisionTreeEntropy REC: 0.9723032069970845
DecisionTreeEntropy F-score: 0.9460992907801418
2
DecisionTreeEntropy ACC: 0.9753333333333334
DecisionTreeEntropy REC: 0.9684931506849315
DecisionTreeEntropy F-score: 0.950268817204301
SVM classifier results:
0
SVM Classifier ACC: 0.9852205800644516
SVM Classifier REC: 0.9549883990719258
SVM Classifier F-score: 0.9686985172981878
1
SVM Classifier ACC: 0.9736666666666667
SVM Classifier REC: 0.9387755102040817
SVM Classifier F-score: 0.942209217264082
2
SVM Classifier ACC: 0.9683333333333334
SVM Classifier REC: 0.9356164383561644
SVM Classifier F-score: 0.9349760438056125
Random forest results:
0
RandomForest ACC: 1.0
RandomForest REC: 1.0
RandomForest F-score: 1.0
1
RandomForest ACC: 0.9896666666666667
RandomForest REC: 0.9664723032069971
RandomForest F-score: 0.9771554900515843
2
RandomForest ACC: 0.9896666666666667
RandomForest REC: 0.9657534246575342
RandomForest F-score: 0.9784871616932685
AdaBoost results:
0
Adaboost ACC: 0.7605289476608512
Adaboost REC: 0.0
Adaboost F-score: 0.0
1
Adaboost ACC: 0.7713333333333333
Adaboost REC: 0.0
Adaboost F-score: 0.0
2
Adaboost ACC: 0.7566666666666667
Adaboost REC: 0.0
Adaboost F-score: 0.0