Dataset features:
PassengerId: a sequential ID that uniquely identifies each passenger. It has no bearing on survival, so we do not use it.
Survived: 1 means the passenger survived, 0 means they died. This is the label.
Pclass: cabin class, a very important feature. Passengers in higher classes could reach the deck faster and were more likely to be rescued.
Name: the passenger's name; unrelated to survival, so it is dropped.
Sex: the passenger's sex. Women and children were allowed into the lifeboats first, so this is a very important feature.
Age: the passenger's age; children boarded the lifeboats first.
SibSp: number of siblings/spouses aboard.
Parch: number of parents/children aboard.
Ticket: the ticket number; not used.
Fare: the fare the passenger paid for the ticket.
Cabin: the passenger's cabin number. This feature does have some relationship to survival; for example, passengers in the cabins that flooded first had a lower survival probability. But it has a large amount of missing data, so it is dropped.
Embarked: the port where the passenger boarded; these port values need to be converted to numeric form.
Some of these features are useless, so we delete them to reduce the amount of computation (a drop sketch follows).
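For reference, a minimal pandas sketch of deleting such columns once the data is loaded; the code below instead keeps the full frame and selects the useful columns into an X matrix later:

```python
# Sketch only: remove the columns flagged above as unused
# (assumes `data` is the DataFrame loaded from data.csv in the next cell)
data = data.drop(columns=['PassengerId', 'Name', 'Ticket'])
```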
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Inspect the summary statistics of the numeric features, paying attention to the standard deviation (the spread of each feature), the extremes, and the mean:
data.describe()
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
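Note that describe() only covers the numeric columns by default; the categorical columns can be summarized too by passing include='object' (standard pandas behavior):

```python
# count / unique / top / freq for the object (string) columns
data.describe(include='object')
```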
data.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
missingno is a library for visualizing missing data:
import missingno
missingno.matrix(data)
We can see that Age and Cabin are missing many values, and Embarked also has a few gaps.
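The same counts are available directly from pandas without any extra library:

```python
# Missing values per column: Age has 177, Cabin 687, Embarked 2
data.isnull().sum()
```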
Drop the Cabin column, which is mostly missing:
del data['Cabin']
Examine the Age column:
plt.hist(data['Age'])
The mean and median are close, so either could be used for imputation (I chose the median, which is a whole number here):
data.Age.mean() # 29.69911764705882
data.Age.median() # 28.0
Fill the missing ages:
data['Age'].fillna(data['Age'].median(),inplace=True)
Fill the missing Embarked values:
data['Embarked'].fillna(method='ffill',inplace=True)
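Forward fill copies the previous passenger's port, which is essentially arbitrary; a common alternative (not what this post uses) is to fill with the most frequent port:

```python
# Alternative imputation: use the mode of Embarked ('S', the most common port)
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
```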
Data cleaning is complete:
data.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
missingno.matrix(data)
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()  # .copy() avoids SettingWithCopyWarning when encoding below
X.head()
|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
Encode Sex as an integer:
X['Sex'] = 1*(X['Sex']=='male')  # male -> 1, female -> 0
X.head()
|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | S |
Encode the embarkation port (an equivalent LabelEncoder sketch follows the table below):
unique = data.Embarked.unique().tolist()
unique # ['S', 'C', 'Q']
# map each port to its index in first-seen order: S -> 0, C -> 1, Q -> 2
X['Embarked'] = data['Embarked'].apply(lambda x: unique.index(x))
X
|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 |
| 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 |
| 3 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 0 |
| 4 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 |
| … | … | … | … | … | … | … | … |
| 886 | 2 | 1 | 27.0 | 0 | 0 | 13.0000 | 0 |
| 887 | 1 | 0 | 19.0 | 0 | 0 | 30.0000 | 0 |
| 888 | 3 | 0 | 28.0 | 1 | 2 | 23.4500 | 0 |
| 889 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 | 1 |
| 890 | 3 | 1 | 32.0 | 0 | 0 | 7.7500 | 2 |

891 rows × 7 columns
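The hand-rolled unique()/index() mapping above is equivalent to scikit-learn's LabelEncoder; a minimal sketch (the integer codes differ because LabelEncoder sorts the categories alphabetically):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes in sorted order: C -> 0, Q -> 1, S -> 2
le = LabelEncoder()
embarked_codes = le.fit_transform(data['Embarked'])
```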
from sklearn.model_selection import train_test_split
y = data['Survived']
xtrain,xtest,ytrain,ytest = train_test_split(X,y,random_state =60)
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state=120).fit(xtrain,ytrain)
DT.score(xtest,ytest) # 0.7713004484304933
from sklearn.model_selection import cross_val_score
cross_val_score(DT,xtrain,ytrain,cv=10)
array([0.8358209 , 0.85074627, 0.71641791, 0.67164179, 0.86567164,
0.7761194 , 0.86567164, 0.76119403, 0.8030303 , 0.81818182])
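The ten fold scores fluctuate quite a bit, so their mean is the more useful single number:

```python
# Mean 10-fold cross-validation accuracy for the fitted tree
cross_val_score(DT, xtrain, ytrain, cv=10).mean()
```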
Compare cross-validation, test-set, and training-set accuracy for decision trees of different depths:
cross = []
score = []
train = []
for i in np.arange(1,20):
    DT1 = DecisionTreeClassifier(random_state=20, max_depth=i).fit(xtrain, ytrain)
    c = cross_val_score(DT1, xtrain, ytrain, cv=5).mean()
    cross.append(c)
    score.append(DT1.score(xtest, ytest))
    train.append(DT1.score(xtrain, ytrain))
plt.plot(np.arange(1,20), cross, label='cross')
plt.plot(np.arange(1,20), score, label='test')
plt.plot(np.arange(1,20), train, label='train')
plt.legend()
plt.xticks(np.arange(1,20))
DT = DecisionTreeClassifier(random_state = 20, max_depth = 6)
cross_val_score(DT, xtrain,ytrain,cv=5).mean() # 0.8278083267871171
Cross-validated accuracy peaks at a depth of 6.
[*zip(np.arange(1,20),cross)]
[(1, 0.7978341375827629),
(2, 0.7783525979126922),
(3, 0.8023678599483783),
(4, 0.8248008079901246),
(5, 0.8203119739647626),
(6, 0.8278083267871171),
(7, 0.8158455841095276),
(8, 0.8218045112781954),
(9, 0.8158904724497813),
(10, 0.8038603972618112),
(11, 0.8039052856020648),
(12, 0.8054090450005612),
(13, 0.8039165076871282),
(14, 0.8009650993154528),
(15, 0.7979351363483336),
(16, 0.8009426551453259),
(17, 0.7979351363483336),
(18, 0.7964425990349007),
(19, 0.7994276736617663)]
from sklearn.model_selection import GridSearchCV
Define the parameter grid to search:
paras = {
    "max_depth": np.arange(1,20),
    "min_samples_leaf": np.arange(1,20),
    "criterion": ['gini','entropy']
}
Instantiate the model (no need to fit it yet):
DT = DecisionTreeClassifier()
Define the grid search and fit the data:
GS = GridSearchCV(DT,param_grid=paras,cv = 8).fit(xtrain,ytrain)
Best parameters:
GS.best_params_
Output: {'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 9}
Best score:
GS.best_score_
Output: 0.8441802925989673
Best estimator:
GS.best_estimator_
Output: DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9)
Based on these results we can configure the tuned decision tree classifier:
DT = DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9).fit(xtrain,ytrain)
DT.score(xtest,ytest) # 0.8071748878923767
DT.predict([[1,0,30,1,2,58,0]]) # array([1], dtype=int64)
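Recent scikit-learn versions warn when a model fitted on a DataFrame is given a bare list to predict on; wrapping the sample in a DataFrame with the same columns avoids that (a usage sketch with a hypothetical passenger):

```python
# Hypothetical passenger: 1st class, female (0), age 30, 1 sibling/spouse,
# 2 parents/children, fare 58, embarked at port code 0 ('S')
sample = pd.DataFrame([[1, 0, 30, 1, 2, 58, 0]], columns=X.columns)
DT.predict(sample)
```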
Visualize the decision tree:
import os
# make the Graphviz binaries visible on Windows
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(DT
                                ,out_file=None
                                ,feature_names=X.columns
                                ,class_names=['died','survived']
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph
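graphviz can also write the tree to disk; Source.render() is the standard API (the file name here is just an example):

```python
# Writes titanic_tree.pdf to the working directory
graph.render('titanic_tree')
```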
Compute the feature importances (a bar-chart sketch follows the list):
[*zip(DT.feature_importances_,X.columns)]
[(0.17948345431191473, 'Pclass'),
 (0.4272386937323802, 'Sex'),
 (0.12579044257521602, 'Age'),
 (0.060547878544091265, 'SibSp'),
 (0.0, 'Parch'),
 (0.1842363885730208, 'Fare'),
 (0.0227031422633769, 'Embarked')]
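Sorting and plotting the importances makes the ranking easier to read; a small matplotlib sketch:

```python
# Horizontal bar chart of the importances, largest at the top
imp = sorted(zip(DT.feature_importances_, X.columns))
plt.barh([name for _, name in imp], [value for value, _ in imp])
plt.xlabel('feature importance')
```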
Predicted class probabilities:
DT.predict_proba(xtest)
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth = 4).fit(xtrain,ytrain)
RF.score(xtest,ytest)
Output: 0.8026905829596412
from sklearn.model_selection import cross_val_score
cross_val_score(RF,xtrain,ytrain,cv=10)
cross = []
score = []
train = []
for i in np.arange(1,20):
    RF1 = RandomForestClassifier(random_state=20, max_depth=i).fit(xtrain, ytrain)
    c = cross_val_score(RF1, xtrain, ytrain, cv=5).mean()
    cross.append(c)
    score.append(RF1.score(xtest, ytest))
    train.append(RF1.score(xtrain, ytrain))
plt.plot(np.arange(1,20), cross, label='cross')
plt.plot(np.arange(1,20), score, label='test')
plt.plot(np.arange(1,20), train, label='train')
plt.legend()
plt.xticks(np.arange(1,20))
RF = RandomForestClassifier(random_state = 20, max_depth = 5).fit(xtrain,ytrain)
cross_val_score(RF, xtrain,ytrain,cv=5).mean() # 0.8367523285826506
RF.score(xtest,ytest)
Output: 0.8071748878923767
[*zip(np.arange(1,20),cross)]
[(1, 0.7858489507350466),
(2, 0.7888452474469756),
(3, 0.8053192683200538),
(4, 0.8158006957692739),
(5, 0.8367523285826506),
(6, 0.8232970485916283),
(7, 0.8188306587363933),
(8, 0.82933453035574),
(9, 0.8203456402199528),
(10, 0.8128492873975984),
(11, 0.8113791942542925),
(12, 0.802412748288632),
(13, 0.802412748288632),
(14, 0.7964089327797105),
(15, 0.7994276736617664),
(16, 0.7979239142632701),
(17, 0.7934238581528448),
(18, 0.7949163954662776),
(19, 0.7949163954662776)]
from sklearn.model_selection import GridSearchCV
paras = {
    "max_depth": np.arange(1,20),
    "min_samples_leaf": np.arange(1,20),
    "criterion": ['gini','entropy']
}
RF = RandomForestClassifier()
GS = GridSearchCV(RF,param_grid=paras).fit(xtrain,ytrain)
print(GS.best_params_) # {'criterion': 'entropy', 'max_depth': 8, 'min_samples_leaf': 3}
print(GS.best_score_) # 0.8382785321512737
print(GS.best_estimator_) # RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3)
Reconfigure the random forest classifier with the best parameters:
RF = RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3).fit(xtrain,ytrain)
RF.score(xtest,ytest) # 0.8026905829596412
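A random forest also has an ensemble size worth tuning; n_estimators could be added to the grid, at the cost of a much larger search (a sketch, not run in this post):

```python
# Hypothetical extended grid: also search over the number of trees
paras = {
    "n_estimators": [50, 100, 200],
    "max_depth": np.arange(1, 20),
    "min_samples_leaf": np.arange(1, 20),
    "criterion": ['gini', 'entropy'],
}
GS = GridSearchCV(RandomForestClassifier(), param_grid=paras, cv=5)
```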
Data preparation (regression on the Boston housing dataset):
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
boston = load_boston()
X = pd.DataFrame(boston.data,columns=boston.feature_names)
y = boston.target
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
xtrain,xtest,ytrain,ytest =train_test_split(X,y,random_state = 20)
DTR = DecisionTreeRegressor(max_depth = 8,random_state = 20).fit(xtrain,ytrain)
DTR.score(xtest,ytest),mean_squared_error(ytest,DTR.predict(xtest)) # (0.603193016561408, 31.984825561881298)
# Random forest
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(random_state = 20).fit(xtrain,ytrain)
RFR.score(xtest,ytest),mean_squared_error(ytest,RFR.predict(xtest)) # (0.8051659689805782, 15.704694614173214)
from sklearn.linear_model import Ridge
LR = Ridge().fit(xtrain,ytrain)
LR.score(xtest,ytest),mean_squared_error(ytest,LR.predict(xtest)) # (0.7214294743488996, 22.45431668671955)
from sklearn.preprocessing import PolynomialFeatures
PF = PolynomialFeatures(degree=2).fit(xtrain)
# note: in scikit-learn >= 1.0, get_feature_names is replaced by get_feature_names_out
xtrain_poly = pd.DataFrame(PF.transform(xtrain),columns=PF.get_feature_names(input_features=X.columns))
xtest_poly = pd.DataFrame(PF.transform(xtest),columns=PF.get_feature_names(input_features=X.columns))
LR2 = Ridge().fit(xtrain_poly,ytrain)
LR2.score(xtest_poly,ytest),mean_squared_error(ytest,LR2.predict(xtest_poly)) #(0.7348578552292532, 21.37191532292632)
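To compare the four regressors at a glance, the scores above can be collected in one loop (a summary sketch reusing the fitted models):

```python
# R^2 and MSE on the test set for each model fitted above
for name, model in [('tree', DTR), ('forest', RFR), ('ridge', LR)]:
    print(name, model.score(xtest, ytest),
          mean_squared_error(ytest, model.predict(xtest)))
print('ridge+poly', LR2.score(xtest_poly, ytest),
      mean_squared_error(ytest, LR2.predict(xtest_poly)))
```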