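[A Xu Machine Learning in Practice] [13] Decision Tree Classification in Practice: Titanic Survival Prediction

The [A Xu Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn together.

This post uses a decision tree classifier to predict passenger survival on the Titanic.

For a detailed introduction to decision trees and how they work, see the previous post, [A Xu Machine Learning in Practice] [12] Decision Trees: Basic Principles, Construction, and Usage.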

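Contents

  • Decision Tree Classification in Practice: Titanic Survival Prediction
    • Loading the dataset and viewing basic information
    • Selecting features and preprocessing them
      • Filling in missing values
    • Feature processing: vectorizing the features
    • Creating the decision tree model, training, and predicting
    • Performance evaluation report
    • Metrics in the performance evaluation report

Decision Tree Classification in Practice: Titanic Survival Prediction

Loading the dataset and viewing basic information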

import pandas as pd
titanic = pd.read_csv("../data/titanic.txt")
titanic.head()
   row.names  pclass  survived  name                                             age      embarked     home.dest                        room  ticket      boat   sex
0  1          1st     1         Allen, Miss Elisabeth Walton                     29.0000  Southampton  St Louis, MO                     B-5   24160 L221  2      female
1  2          1st     0         Allison, Miss Helen Loraine                      2.0000   Southampton  Montreal, PQ / Chesterville, ON  C26   NaN         NaN    female
2  3          1st     0         Allison, Mr Hudson Joshua Creighton              30.0000  Southampton  Montreal, PQ / Chesterville, ON  C26   NaN         (135)  male
3  4          1st     0         Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton  Montreal, PQ / Chesterville, ON  C26   NaN         NaN    female
4  5          1st     1         Allison, Master Hudson Trevor                    0.9167   Southampton  Montreal, PQ / Chesterville, ON  C22   NaN         11     male
# Print the dataset's column names
titanic.columns
Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked',
       'home.dest', 'room', 'ticket', 'boat', 'sex'],
      dtype='object')

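Meaning of the data fields:

The dataset loaded above has 11 columns. Their names and meanings are as follows:
row.names: row number
pclass: passenger class (1st, 2nd, or 3rd)
survived: whether the passenger survived (1 = survived, 0 = died)
name: passenger name
age: age
embarked: port of embarkation
home.dest: home town / destination
room: cabin number
ticket: ticket number
boat: lifeboat
sex: sex

The Kaggle version of the Titanic dataset uses a different 12-field layout (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked); SibSp, Parch, and Fare have no counterpart in this file.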

Feature selection: analysis suggests that some attributes (for example, name) have no bearing on whether a passenger survived.
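
One simple way to check whether a column is related to survival is to compare the survival rate across its values. The sketch below is only an illustration: the rows in `toy` are made up (they are not real passenger records) and merely reuse the column names of the dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (not real passenger data)
toy = pd.DataFrame({
    "pclass":   ["1st", "1st", "2nd", "3rd", "3rd", "3rd"],
    "sex":      ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate within each group
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_class = toy.groupby("pclass")["survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

A column whose groups show very different rates is a candidate feature; a column such as name, which is almost unique per row, cannot be summarized this way and is usually dropped.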

Selecting features and preprocessing them

# Use "pclass", "age", and "sex" as the three main features for training
x = titanic[["pclass","age","sex"]]
y = titanic[["survived"]]

Filling in missing values

x.isnull().any()
pclass    False
age        True
sex       False
dtype: bool
# Inspect the missing values
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
# About half of the age values are missing; dropping those rows would lose too much data,
# so fill the missing ages with the mean of the known ages instead.
# x is a slice of titanic, so take an explicit copy first; assigning into the slice
# directly triggers pandas' SettingWithCopyWarning.
x = x.copy()
x["age"] = x["age"].fillna(x["age"].mean())
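
To make the mean-filling step concrete, here is a minimal sketch on a made-up `ages` Series (the values are hypothetical). It relies on the fact that `Series.mean()` skips NaN by default, so the mean is taken over the known ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 30.0], name="age")

# mean() ignores NaN by default, so this is the mean of 20, 40 and 30
mean_age = ages.mean()

# fillna returns a new Series; the original is left unchanged
filled = ages.fillna(mean_age)

print(mean_age)
print(filled.tolist())
```

Strictly speaking, computing the mean on the full dataset before splitting lets a little information from the test rows leak into training; a stricter variant would compute the mean on the training rows only and reuse it for the test rows.
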
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows as the test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)
x_train[:10]
     pclass        age     sex
1220    3rd  31.194181  female
174     1st  46.000000    male
144     1st  35.000000  female
151     1st  46.000000    male
391     2nd  18.000000    male
563     2nd  25.000000    male
1260    3rd  31.194181    male
428     2nd   6.000000  female
580     2nd  36.000000  female
344     2nd  26.000000    male
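
The split above is random, so the rows and the final score change from run to run. As a hedged sketch (the arrays `X` and `y` are hypothetical), passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both parts. Neither option is used in the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # hypothetical features, 10 rows
y = np.array([0, 1] * 5)           # hypothetical labels, balanced

# Same random_state -> same split every time
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

The test size is rounded up, which is why 1313 rows with test_size=0.25 give 329 test rows and 984 training rows in this post.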

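Feature processing: vectorizing the features

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)  # sparse=False returns a dense array instead of a sparse matrix
# Vectorize the non-numeric features (one-hot encoding); numeric features are kept as they are
x_train = vec.fit_transform(x_train.to_dict(orient="records"))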
x_train[:5]
array([[31.19418104,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        ],
       [46.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [35.        ,  1.        ,  0.        ,  0.        ,  1.        ,
         0.        ],
       [46.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [18.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ]])
x_train.shape
(984, 6)
# Reuse the mapping learned from the training set; do not refit on the test set
x_test = vec.transform(x_test.to_dict(orient="records"))
x_test.shape
(329, 6)
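
The six columns come from one numeric feature (age) plus one-hot columns for each value of pclass and sex. The sketch below shows this on a few made-up records (hypothetical values) and also illustrates why the test set should go through `transform` rather than `fit_transform`: the column layout learned from the training data is reused as-is.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, in the shape produced by DataFrame.to_dict(orient="records")
train_records = [
    {"pclass": "1st", "age": 46.0, "sex": "male"},
    {"pclass": "2nd", "age": 18.0, "sex": "male"},
    {"pclass": "3rd", "age": 31.0, "sex": "female"},
]
test_records = [{"pclass": "1st", "age": 35.0, "sex": "female"}]

vec = DictVectorizer(sparse=False)
train_matrix = vec.fit_transform(train_records)  # learn the columns from the training data
test_matrix = vec.transform(test_records)        # reuse the same columns for the test data

print(vec.feature_names_)  # column order of the output matrix
print(test_matrix)
```

If the test set were refit separately and happened to lack one category, its matrix would have fewer or differently ordered columns than the one the model was trained on.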

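Creating the decision tree model, training, and predicting

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
# Fit the tree on the vectorized training data
dt.fit(x_train,y_train)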
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
y_pre = dt.predict(x_test)
y_pre[:10],y_test[:10]
(array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], dtype=int64),       survived
 908          0
 822          0
 657          1
 856          0
 212          1
 641          1
 305          0
 778          1
 818          1
 1179         0)
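
To see what a fitted tree has actually learned, scikit-learn can print its rules as text. This is a small sketch on made-up data (the rows and the two column names are hypothetical), not the tree trained in this post.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [age, sex=female]
X = [[30.0, 1.0], [40.0, 1.0], [25.0, 0.0], [50.0, 0.0]]
y = [1, 1, 0, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the learned splits as indented if/else rules
rules = export_text(clf, feature_names=["age", "sex=female"])
print(rules)
```

In this toy data no single age threshold separates the classes, so the tree splits on the sex=female column alone.
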
# score is also called accuracy; it only gives an overall view of how accurate the model is
dt.score(x_test,y_test)
0.7872340425531915
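
For a classifier, `score` is simply the fraction of test samples whose predicted label equals the true label. The following sketch checks this by hand on made-up data (all arrays are hypothetical; one test label deliberately breaks the pattern so that the accuracy is not 1).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, label is 1 for the larger values
X_train = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.0], [4.0], [6.0], [10.0]])
y_test = np.array([0, 0, 1, 0])  # the last label deliberately disagrees with the pattern

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy computed by hand: share of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_test)

print(manual_accuracy, clf.score(X_test, y_test))
```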

Performance evaluation report

from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test,y_pre,target_names=["died","survived"]))
             precision    recall  f1-score   support

       died       0.78      0.92      0.84       206
   survived       0.81      0.56      0.66       123

avg / total       0.79      0.79      0.78       329

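Metrics in the performance evaluation report

With two classes, A and B, each prediction falls into one of four cases: True A (predicted A, actually A), True B (predicted B, actually B), False A (predicted A, actually B), and False B (predicted B, actually A).
1. Accuracy (score): the fraction of all predictions that are correct: score = (True A + True B) / (True A + True B + False A + False B)
2. Precision: for each class, the number of correct predictions of that class divided by the number of samples predicted as that class: precision_a = True A / (True A + False A)
3. Recall: for each class, the number of correct predictions of that class divided by the number of samples that truly belong to that class: recall_a = True A / (True A + False B)
4. F1 score: the harmonic mean of precision and recall: F1_a = 2 / (1/precision_a + 1/recall_a) = 2 * (precision_a * recall_a) / (precision_a + recall_a). Unlike a plain average, the harmonic mean gives a higher score to models whose precision and recall are close to each other.
[Note] If precision and recall for a class are far apart, the model's predictions for that class should be treated with caution.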
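
The formulas above can be checked by hand against scikit-learn. The sketch below uses made-up labels (hypothetical, with class A encoded as 0 and class B as 1), counts the four cases, and compares the results with `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for two classes: A = 0, B = 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
true_a = sum(t == 0 and p == 0 for t, p in pairs)   # predicted A, actually A
true_b = sum(t == 1 and p == 1 for t, p in pairs)   # predicted B, actually B
false_a = sum(t == 1 and p == 0 for t, p in pairs)  # predicted A, actually B
false_b = sum(t == 0 and p == 1 for t, p in pairs)  # predicted B, actually A

accuracy = (true_a + true_b) / len(pairs)
precision_a = true_a / (true_a + false_a)
recall_a = true_a / (true_a + false_b)
f1_a = 2 * precision_a * recall_a / (precision_a + recall_a)

# Per-class values from scikit-learn, in the order given by labels
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1])

print(accuracy, precision_a, recall_a, f1_a)
print(precision[0], recall[0], f1[0], support[0])
```

Here precision and recall for class A differ (0.6 versus 0.75), and the F1 score lands between them, closer to the lower value.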

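If you found this post helpful, a like and a follow are much appreciated!

You are welcome to follow my WeChat official account, 阿旭算法与机器学习 (A Xu Algorithms and Machine Learning), so we can learn together.
More practical content is on the way.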
