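The [A Xu Machine Learning in Practice] series introduces machine learning algorithms and models through hands-on case studies. Likes and follows are welcome; let's learn together.
This article uses a decision tree classifier to predict passenger survival on the Titanic.
For a detailed introduction to decision trees and how they work, see the previous post: [A Xu Machine Learning in Practice] [12] Decision Trees: Basic Principles, Construction, and Usage.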
import pandas as pd
titanic = pd.read_csv("../data/titanic.txt")
titanic.head()
| | row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1st | 1 | Allen, Miss Elisabeth Walton | 29.0000 | Southampton | St Louis, MO | B-5 | 24160 L221 | 2 | female |
| 1 | 2 | 1st | 0 | Allison, Miss Helen Loraine | 2.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
| 2 | 3 | 1st | 0 | Allison, Mr Hudson Joshua Creighton | 30.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | (135) | male |
| 3 | 4 | 1st | 0 | Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | 25.0000 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
| 4 | 5 | 1st | 1 | Allison, Master Hudson Trevor | 0.9167 | Southampton | Montreal, PQ / Chesterville, ON | C22 | NaN | 11 | male |
# Print the dataset's column names
titanic.columns
Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked',
'home.dest', 'room', 'ticket', 'boat', 'sex'],
dtype='object')
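Field meanings:
The Kaggle version of the Titanic dataset has 12 fields, listed below with their meanings. The file loaded here has 11 columns with slightly different names (for example row.names, home.dest, room, boat) and no SibSp, Parch, or Fare column.
PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: passenger class
Name: passenger name
Sex: sex
Age: age
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
Feature selection: analysis shows that some attributes (such as name) have no bearing on whether a passenger survived.
# We select "pclass", "age", and "sex" as the three main features for training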
x = titanic[["pclass","age","sex"]]
y = titanic[["survived"]]
x.isnull().any()
pclass False
age True
sex False
dtype: bool
# Check for missing values
x.info()
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass 1313 non-null object
age 633 non-null float64
sex 1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
# About half of the age values are missing; dropping those rows would lose too much data,
# so fill the gaps with the mean of the known ages instead.
# Work on an explicit copy and assign the result back, which avoids pandas' SettingWithCopyWarning.
x = x.copy()
x["age"] = x["age"].fillna(x["age"].mean())
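To make the mean-fill step concrete, here is a minimal sketch on a tiny made-up frame. The `toy` data and the names `features` and `mean_age` are illustrative assumptions, not part of the original notebook. It shows that the mean is computed from the known ages only, and that every filled row ends up with the same value (which is why 31.194181 shows up repeatedly in the training sample).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real Titanic data.
toy = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [20.0, None, 40.0, None],
    "sex": ["female", "male", "male", "female"],
})

# Take an explicit copy so later assignments modify an independent frame.
features = toy[["pclass", "age", "sex"]].copy()

# mean() skips NaN by default, so only the known ages contribute.
mean_age = features["age"].mean()

# Assign the filled column back instead of calling fillna(inplace=True) on a slice.
features["age"] = features["age"].fillna(mean_age)

print(mean_age)
print(features["age"].tolist())
```

Mean imputation is simple, but it flattens the age distribution; whether that is acceptable depends on how much the model relies on age.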
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)
x_train[:10]
| | pclass | age | sex |
|---|---|---|---|
| 1220 | 3rd | 31.194181 | female |
| 174 | 1st | 46.000000 | male |
| 144 | 1st | 35.000000 | female |
| 151 | 1st | 46.000000 | male |
| 391 | 2nd | 18.000000 | male |
| 563 | 2nd | 25.000000 | male |
| 1260 | 3rd | 31.194181 | male |
| 428 | 2nd | 6.000000 | female |
| 580 | 2nd | 36.000000 | female |
| 344 | 2nd | 26.000000 | male |
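Because the split is random, the rows shown here (and every score later on) change from run to run. The sketch below shows one way to make a split repeatable. Note that `random_state=0` and `stratify=y` are assumptions added for illustration; the original call uses neither, and the toy `X` and `y` are made up.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, two balanced classes.
X = [[i] for i in range(8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# A fixed random_state makes the shuffle repeatable;
# stratify=y keeps the class ratio the same in both parts.
first = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
second = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

X_train, X_test, y_train, y_test = first
print(len(X_train), len(X_test))
print(sorted(y_test))
print(first == second)
```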
from sklearn.feature_extraction import DictVectorizer
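vec = DictVectorizer(sparse=False)  # sparse=False returns a dense array instead of a sparse matrix
# Vectorize the non-numeric features (one-hot encoding); numeric features pass through unchanged
x_train = vec.fit_transform(x_train.to_dict(orient="records"))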
x_train[:5]
array([[31.19418104, 0. , 0. , 1. , 1. ,
0. ],
[46. , 1. , 0. , 0. , 0. ,
1. ],
[35. , 1. , 0. , 0. , 1. ,
0. ],
[46. , 1. , 0. , 0. , 0. ,
1. ],
[18. , 0. , 1. , 0. , 0. ,
1. ]])
x_train.shape
(984, 6)
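# Reuse the vectorizer fitted on the training set so the test columns line up with the training columns
x_test = vec.transform(x_test.to_dict(orient="records"))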
x_test.shape
(329, 6)
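The six columns come from one numeric feature (age) plus one column per category value of pclass and sex. As a small sketch with made-up records (the `train_records` and `test_records` names are illustrative), the vectorizer is fitted on the training records only and then applied to the test records, so the test matrix keeps the same six columns even when a category is absent from the test data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the same shape as x_train.to_dict(orient="records").
train_records = [
    {"pclass": "1st", "age": 46.0, "sex": "male"},
    {"pclass": "2nd", "age": 18.0, "sex": "male"},
    {"pclass": "3rd", "age": 31.0, "sex": "female"},
]
# The test records contain no 2nd or 3rd class passenger and no male.
test_records = [
    {"pclass": "1st", "age": 35.0, "sex": "female"},
]

vec = DictVectorizer(sparse=False)
train_matrix = vec.fit_transform(train_records)  # learns the column layout
test_matrix = vec.transform(test_records)        # reuses that layout

print(vec.feature_names_)
print(test_matrix)
```

Calling `fit_transform` on the test records instead would learn a new layout from the test data alone, which in this toy case would produce only three columns.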
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
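The default parameters above let the tree grow without a depth limit. A fitted tree can also be printed as text to see which splits it learned. The following is a rough sketch on made-up data; the `max_depth=2` and `random_state=0` settings and the toy feature columns are assumptions for illustration, not the settings used in this article.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [age, sex=female], label is survived.
X = [[30, 1], [25, 1], [40, 0], [35, 0], [8, 1], [50, 0]]
y = [1, 1, 0, 0, 1, 0]

# max_depth caps how many levels of splits the tree may use.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the learned splits as indented rules.
print(export_text(tree, feature_names=["age", "sex=female"]))
print(tree.get_depth())
print(tree.predict([[20, 1], [45, 0]]))
```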
y_pre = dt.predict(x_test)
y_pre[:10],y_test[:10]
(array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], dtype=int64), survived
908 0
822 0
657 1
856 0
212 1
641 1
305 0
778 1
818 1
1179 0)
# score is also called accuracy; it only gives a high-level view of how accurate the model is
dt.score(x_test,y_test)
0.7872340425531915
from sklearn.metrics import classification_report
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test,y_pre,target_names=["died","survived"]))
precision recall f1-score support
died 0.78 0.92 0.84 206
survived 0.81 0.56 0.66 123
avg / total 0.79 0.79 0.78 329
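Argument order matters when reading a report like this: scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses made-up labels to show that swapping the two arguments exchanges precision and recall, so a report produced with the arguments reversed has to be read with those two columns switched.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 4 true positives in reality, 6 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

# Intended order: (y_true, y_pred).
correct_precision = precision_score(y_true, y_pred)
correct_recall = recall_score(y_true, y_pred)

# Reversed order: the two metrics trade places.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(correct_precision, correct_recall)
print(swapped_precision, swapped_recall)
```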
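Take two classes, A and B. A prediction falls into one of four cases: True A, True B, False A, False B (for example, False A means the sample was predicted as A but actually belongs to B).
1. Accuracy (score): the proportion of predictions that are correct: score = (True A + True B) / (True A + True B + False A + False B)
2. Precision: for each class, the number of correct predictions of that class divided by the number of samples predicted as that class: precision_a = True A / (True A + False A)
3. Recall: for each class, the number of correct predictions of that class divided by the number of samples that actually belong to it: recall_a = True A / (True A + False B)
4. F1 score: F1_a = 2 / (1/precision_a + 1/recall_a) = 2 * (precision_a * recall_a) / (precision_a + recall_a). F1 is the harmonic mean of precision and recall. Beyond averaging the two, it gives a higher score to models whose precision and recall are close to each other.
[Note] If precision and recall differ too widely, the model's results have little reference value.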
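The four definitions above can be checked by hand. Here is a minimal pure-Python sketch; the helper name `binary_metrics` and the sample labels are made up for illustration. It counts True A, False A, and False B directly and applies the formulas as written.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, plus precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    # True A: predicted A and actually A.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False A: predicted A but actually another class.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False B: predicted another class but actually A.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical labels: 4 samples of class A, 6 of class B.
y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "A", "B", "A", "A", "A", "B", "B", "B"]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred, positive="A")
print(accuracy, precision, recall, f1)
```

With these labels, precision for A is 0.5 and recall is 0.75; the F1 of 0.6 sits below their arithmetic mean of 0.625, which is how the harmonic mean penalizes a gap between the two.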