Machine Learning Day 4: Decision Tree Application, Validation, and Tuning; Comparing Several Regressors

Contents

  • 1. Applying decision trees: Kaggle Titanic survivor prediction
    • Importing and inspecting the data
    • Cleaning the data
    • Selecting and encoding features
    • Splitting the dataset
    • Fitting the model
    • Validation (cross-validation)
    • Tuning: grid search with GridSearchCV
    • Making predictions with the classifier
  • 2. Random forests
    • Testing the random forest at different depths: cross-validation
    • Tuning with grid search (slow; takes roughly 10 minutes)
  • 3. Comparing several regressors (Boston dataset); to improve: data standardization and normalization
    • Regression tree
    • Random forest regression
    • Ridge regression
    • Polynomial regression

1. Applying decision trees: Kaggle Titanic survivor prediction

The dataset's features:
PassengerId: a sequential ID that uniquely identifies a passenger. Unrelated to survival, so we do not use it.

Survived: 1 = survived, 0 = died. This is the label.

Pclass: ticket class, a very important feature. Passengers in the higher classes could reach the deck faster and were more likely to be rescued.

Name: the passenger's name. Unrelated to survival; dropped.

Sex: the passenger's sex. Women and children were given priority for the lifeboats, so this is a very important feature.

Age: the passenger's age. Children boarded first.

SibSp: the number of siblings/spouses also aboard.

Parch: the number of parents/children also aboard.

Ticket: the ticket number; not used.

Fare: the fare the passenger paid for the ticket.

Cabin: the passenger's cabin number. This feature does correlate with survival (for example, passengers in the cabins that flooded first had lower odds of surviving), but it has too many missing values, so we drop it.

Embarked: the port of embarkation; needs to be converted to numeric values.

Some of these features are useless to the model, so we remove them to reduce the amount of computation.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Importing and inspecting the data

data = pd.read_csv('data.csv')
data
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

Inspect the descriptive statistics of the numeric features, mainly the spread (standard deviation), the extremes, and the means:

data.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

A handy library for visualizing missing data:

import missingno
missingno.matrix(data)

[Figure 1: missingno matrix of the raw data]
Age and Cabin are missing many values, and Embarked also has a few gaps.
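
To quantify this, a quick per-column count of missing values (the numbers follow from the data.info() output above):

data.isnull().sum()  # Age: 177, Cabin: 687, Embarked: 2 missing; the other columns are complete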

Cleaning the data

Drop the mostly-empty Cabin column:

del data['Cabin']

Look at the Age distribution:

plt.hist(data['Age'])

[Figure 2: histogram of Age]
The mean and median are close, so either would do for imputation (I use the median, which is a whole number here):

data.Age.mean() # 29.69911764705882
data.Age.median() # 28.0

Fill in the missing ages:

data['Age'].fillna(data['Age'].median(),inplace=True)

Fill in the missing Embarked values (forward fill):

data['Embarked'].fillna(method='ffill',inplace=True)
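
Forward fill just copies the previous passenger's port, which is essentially arbitrary for these two rows. A common alternative, sketched here rather than run in the original, is to fill with the most frequent port; note too that fillna(method='ffill') is deprecated in pandas 2.x in favor of data['Embarked'].ffill().

data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)  # mode() returns a Series; [0] picks the top port ('S')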

The data is now clean:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Embarked 891 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB

missingno.matrix(data)

[Figure 3: missingno matrix after cleaning; no gaps remain]

Selecting and encoding features

data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked'],
dtype='object')

X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()  # .copy() avoids SettingWithCopyWarning when we encode columns below
X.head()
Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 22.0 1 0 7.2500 S
1 1 female 38.0 1 0 71.2833 C
2 3 female 26.0 0 0 7.9250 S
3 1 female 35.0 1 0 53.1000 S
4 3 male 35.0 0 0 8.0500 S

Encode Sex (male = 1, female = 0):

X['Sex'] = 1*(X['Sex']=='male')
X.head()
Pclass Sex Age SibSp Parch Fare Embarked
0 3 1 22.0 1 0 7.2500 S
1 1 0 38.0 1 0 71.2833 C
2 3 0 26.0 0 0 7.9250 S
3 1 0 35.0 1 0 53.1000 S
4 3 1 35.0 0 0 8.0500 S

Encode Embarked:

unique = data.Embarked.unique().tolist()
unique # ['S', 'C', 'Q']
X['Embarked']=data['Embarked'].apply(lambda x:unique.index(x))
X
Pclass Sex Age SibSp Parch Fare Embarked
0 3 1 22.0 1 0 7.2500 0
1 1 0 38.0 1 0 71.2833 1
2 3 0 26.0 0 0 7.9250 0
3 1 0 35.0 1 0 53.1000 0
4 3 1 35.0 0 0 8.0500 0
886 2 1 27.0 0 0 13.0000 0
887 1 0 19.0 0 0 30.0000 0
888 3 0 28.0 1 2 23.4500 0
889 1 1 26.0 0 0 30.0000 1
890 3 1 32.0 0 0 7.7500 2
891 rows × 7 columns
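
The unique()/index() trick depends on the order in which unique() happens to meet the ports. An equivalent, more explicit sketch uses a fixed mapping, which gives the same codes here because unique() returned ['S', 'C', 'Q']:

X['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})  # explicit port-to-code mapping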

Splitting the dataset

from sklearn.model_selection import train_test_split
y = data['Survived']
xtrain,xtest,ytrain,ytest = train_test_split(X,y,random_state =60)
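
train_test_split defaults to test_size=0.25, so the 891 rows are split into 668 training rows and 223 test rows:

xtrain.shape, xtest.shape  # ((668, 7), (223, 7))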

Fitting the model

from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state=120).fit(xtrain,ytrain)

DT.score(xtest,ytest) # 0.7713004484304933

Validation (cross-validation)

from sklearn.model_selection import cross_val_score
cross_val_score(DT,xtrain,ytrain,cv=10)

array([0.8358209 , 0.85074627, 0.71641791, 0.67164179, 0.86567164,
0.7761194 , 0.86567164, 0.76119403, 0.8030303 , 0.81818182])
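
The usual single-number summary is the mean of the fold scores, with the standard deviation showing how stable the estimate is across folds; a quick sketch (the mean of the ten scores above is about 0.796):

scores = cross_val_score(DT, xtrain, ytrain, cv=10)
scores.mean(), scores.std()  # mean ≈ 0.796 for the fold scores listed above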

Compare the training-set, test-set, and cross-validation accuracy of decision trees at different depths:

cross = []
score = []
train = []
for i in np.arange(1,20):
    DT1 = DecisionTreeClassifier(random_state=20,max_depth = i).fit(xtrain,ytrain)
    c = cross_val_score(DT1,xtrain,ytrain,cv=5).mean()
    cross.append(c)
    score.append(DT1.score(xtest,ytest))
    train.append(DT1.score(xtrain,ytrain))
plt.plot(np.arange(1,20),cross ,label = 'cross')
plt.plot(np.arange(1,20),score,label = 'test')
plt.plot(np.arange(1,20),train,label = 'train')
plt.legend()
plt.xticks(np.arange(1,20))

[Figure 4: decision tree accuracy vs. max_depth for train, test, and cross-validation]
Use the plot to find the best parameter:

DT = DecisionTreeClassifier(random_state = 20, max_depth = 6)
cross_val_score(DT, xtrain,ytrain,cv=5).mean() # 0.8278083267871171

Cross-validation accuracy peaks at a depth of 6, as the full listing confirms:

[*zip(np.arange(1,20),cross)]

[(1, 0.7978341375827629),
(2, 0.7783525979126922),
(3, 0.8023678599483783),
(4, 0.8248008079901246),
(5, 0.8203119739647626),
(6, 0.8278083267871171),
(7, 0.8158455841095276),
(8, 0.8218045112781954),
(9, 0.8158904724497813),
(10, 0.8038603972618112),
(11, 0.8039052856020648),
(12, 0.8054090450005612),
(13, 0.8039165076871282),
(14, 0.8009650993154528),
(15, 0.7979351363483336),
(16, 0.8009426551453259),
(17, 0.7979351363483336),
(18, 0.7964425990349007),
(19, 0.7994276736617663)]

Tuning: grid search with GridSearchCV

from sklearn.model_selection import GridSearchCV

Set up the parameter grid to search (a quick size check follows the grid below):

paras = {
    "max_depth": np.arange(1, 20),
    "min_samples_leaf": np.arange(1, 20),
    "criterion": ['gini', 'entropy']
}
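
The size of this grid: 19 depths × 19 leaf sizes × 2 criteria gives 722 candidate combinations, each refit on every fold:

n_candidates = 19 * 19 * 2  # 722 parameter combinations
n_fits = n_candidates * 8   # 5776 model fits with cv=8, which is why grid search is slow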

Instantiate the model (no need to fit the data yourself):

DT = DecisionTreeClassifier()

Define the grid search and fit the data:

GS = GridSearchCV(DT,param_grid=paras,cv = 8).fit(xtrain,ytrain)

Best parameters:

GS.best_params_

Result: {'criterion': 'entropy', 'max_depth': 14, 'min_samples_leaf': 9}

Best score:

GS.best_score_

Result: 0.8441802925989673

Best estimator:

GS.best_estimator_

Result: DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9)

Based on this we can configure the tuned decision tree:

DT = DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9).fit(xtrain,ytrain)
DT.score(xtest,ytest) # 0.8071748878923767

Making predictions with the classifier:

DT.predict([[1,0,30,1,2,58,0]]) # array([1], dtype=int64)
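
On newer scikit-learn versions, passing a bare list triggers a "feature names" warning because the model was fitted on a DataFrame. A sketch that avoids the warning by wrapping the row in a DataFrame with the same columns (sample is just an illustrative name; the prediction is identical):

sample = pd.DataFrame([[1, 0, 30, 1, 2, 58, 0]], columns=X.columns)
DT.predict(sample)  # array([1], dtype=int64)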

Visualize the decision tree:

import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'  # point PATH at your local Graphviz install (Windows example)
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(DT
                                ,out_file = None
                                ,feature_names= X.columns
                                ,class_names=['Died','Survived']  # labels in sorted class order: 0 = died, 1 = survived
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data) 
graph
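
graphviz.Source can also write the tree to a file instead of displaying it inline; a sketch with an arbitrary filename:

graph.render('titanic_tree', format='png', cleanup=True)  # writes titanic_tree.png and removes the intermediate .dot file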

Compute the feature importances:

[*zip(DT.feature_importances_,X.columns)]

[(0.17948345431191473, 'Pclass'),
(0.4272386937323802, 'Sex'),
(0.12579044257521602, 'Age'),
(0.060547878544091265, 'SibSp'),
(0.0, 'Parch'),
(0.1842363885730208, 'Fare'),
(0.0227031422633769, 'Embarked')]

Predicted class probabilities:

DT.predict_proba(xtest)

[Figure 5: predict_proba output, one row of class-0/class-1 probabilities per test sample]

2. Random forests

from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth = 4).fit(xtrain,ytrain)
RF.score(xtest,ytest)

Result: 0.8026905829596412

Testing the random forest at different depths: cross-validation

from sklearn.model_selection import cross_val_score
cross_val_score(RF,xtrain,ytrain,cv=10)  # fold scores shown in the notebook; the result is not stored
cross = []
score = []
train = []
for i in np.arange(1,20):
    RF1 = RandomForestClassifier(random_state=20,max_depth = i).fit(xtrain,ytrain)
    c = cross_val_score(RF1,xtrain,ytrain,cv=5).mean()
    cross.append(c)
    score.append(RF1.score(xtest,ytest))
    train.append(RF1.score(xtrain,ytrain))

plt.plot(np.arange(1,20),cross ,label = 'cross')
plt.plot(np.arange(1,20),score,label = 'test')
plt.plot(np.arange(1,20),train,label = 'train')
plt.legend()
plt.xticks(np.arange(1,20))

[Figure 6: random forest accuracy vs. max_depth for train, test, and cross-validation]

RF = RandomForestClassifier(random_state = 20, max_depth = 5).fit(xtrain,ytrain)
cross_val_score(RF, xtrain,ytrain,cv=5).mean() # 0.8367523285826506

Result: 0.8367523285826506

RF.score(xtest,ytest)

Result: 0.8071748878923767

[*zip(np.arange(1,20),cross)]

[(1, 0.7858489507350466),
(2, 0.7888452474469756),
(3, 0.8053192683200538),
(4, 0.8158006957692739),
(5, 0.8367523285826506),
(6, 0.8232970485916283),
(7, 0.8188306587363933),
(8, 0.82933453035574),
(9, 0.8203456402199528),
(10, 0.8128492873975984),
(11, 0.8113791942542925),
(12, 0.802412748288632),
(13, 0.802412748288632),
(14, 0.7964089327797105),
(15, 0.7994276736617664),
(16, 0.7979239142632701),
(17, 0.7934238581528448),
(18, 0.7949163954662776),
(19, 0.7949163954662776)]

Tuning with grid search (slow; takes roughly 10 minutes):

from sklearn.model_selection import GridSearchCV
paras = {
    "max_depth": np.arange(1, 20),
    "min_samples_leaf": np.arange(1, 20),
    "criterion": ['gini', 'entropy']
}
RF = RandomForestClassifier()
GS = GridSearchCV(RF,param_grid=paras).fit(xtrain,ytrain)
print(GS.best_params_) # {'criterion': 'entropy', 'max_depth': 8, 'min_samples_leaf': 3}
print(GS.best_score_) # 0.8382785321512737
print(GS.best_estimator_) # RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3)

Rebuild the random forest with the best parameters found:

RF = RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3).fit(xtrain,ytrain)
RF.score(xtest,ytest) # 0.8026905829596412

3. Comparing several regressors (Boston dataset); to improve: data standardization and normalization

Prepare the data:

from sklearn.datasets import load_boston
boston = load_boston()
X = pd.DataFrame(boston.data,columns=boston.feature_names)
y = boston.target
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

xtrain,xtest,ytrain,ytest =train_test_split(X,y,random_state = 20)
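
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of an ethical problem with the dataset. On recent versions you need a different dataset; a minimal sketch using the California housing data (X_alt and y_alt are illustrative names):

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)  # 20640 rows, 8 numeric features
X_alt, y_alt = housing.data, housing.target        # target: median house value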

Regression tree

DTR = DecisionTreeRegressor(max_depth = 8,random_state = 20).fit(xtrain,ytrain)
DTR.score(xtest,ytest),mean_squared_error(ytest,DTR.predict(xtest)) # (0.603193016561408, 31.984825561881298)
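
MSE is in squared target units; taking the square root gives an error in the target's own units (median home value in thousands of dollars for this dataset):

np.sqrt(mean_squared_error(ytest, DTR.predict(xtest)))  # RMSE ≈ 5.66, i.e. roughly $5,660 off on average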

Random forest regression

# Random forest

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(random_state = 20).fit(xtrain,ytrain)
RFR.score(xtest,ytest),mean_squared_error(ytest,RFR.predict(xtest)) # (0.8051659689805782, 15.704694614173214)

Ridge regression

from sklearn.linear_model import Ridge
LR = Ridge().fit(xtrain,ytrain)
LR.score(xtest,ytest),mean_squared_error(ytest,LR.predict(xtest)) # (0.7214294743488996, 22.45431668671955)
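
Ridge applies the same penalty to every coefficient, so features on wildly different scales distort the regularization; this is the standardization improvement flagged in the section title. A minimal sketch, not run in the original:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_ridge = make_pipeline(StandardScaler(), Ridge()).fit(xtrain, ytrain)  # scale, then fit Ridge
scaled_ridge.score(xtest, ytest)  # the score will differ somewhat from the unscaled run above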

Polynomial regression

from sklearn.preprocessing import PolynomialFeatures
PF = PolynomialFeatures(degree=2).fit(xtrain)
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out there
xtrain_poly = pd.DataFrame(PF.transform(xtrain),columns=PF.get_feature_names(input_features=X.columns))
xtest_poly = pd.DataFrame(PF.transform(xtest),columns=PF.get_feature_names(input_features=X.columns))
LR2 = Ridge().fit(xtrain_poly,ytrain)
LR2.score(xtest_poly,ytest),mean_squared_error(ytest,LR2.predict(xtest_poly)) #(0.7348578552292532, 21.37191532292632)
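
The degree-2 expansion of 13 features yields 1 bias + 13 linear + 91 square/pairwise-product columns = 105 features, which is why the regularized Ridge (rather than plain least squares) is a sensible choice here:

xtrain_poly.shape  # (379, 105), assuming the default 75/25 split of the 506 rows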
