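This example analyzes the features in the Titanic dataset to predict whether a passenger survived. I won't go over the background; it is covered all over the web.
It is one of the most discussed, most followed, and most active competitions on Kaggle so far. It covers the common data preparation techniques (handling missing values, encoding categorical values, discretizing values) as well as the algorithms most commonly applied to this kind of problem.
The dataset for this example is here: https://download.csdn.net/download/rh8866/12507192
The code is commented in some detail, so let's go straight to it: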
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# The candidate algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# Load the data
train_df = pd.read_csv("./dataset/titanic/train.csv")
test_df = pd.read_csv("./dataset/titanic/test.csv")
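# See which features the dataset contains
print(train_df.columns.values)
# PassengerId: passenger ID
# Survived: whether the passenger survived (0 = did not survive, 1 = survived)
# Pclass: ticket class (1st, 2nd, or 3rd)
# Name: passenger name
# Sex: sex
# Age: age
# SibSp: number of siblings/spouses aboard
# Parch: number of parents/children aboard
# Ticket: ticket number
# Fare: ticket fare
# Cabin: cabin number
# Embarked: port of embarkation
# Take a quick look at the data
train_df.head(8)
# The preview shows numerical features, categorical features, and missing values
# Look at the columns in more detail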
train_df.info()
print('*'*50)
test_df.info()
# The output shows that both the training set and the test set contain missing values
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
**************************************************
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train_df.describe()
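# The training set has 891 samples; Survived is a categorical feature that is either 0 or 1
# Ages span a wide range, from infants a few months old to an 80-year-old
# Fares vary a lot too; the minimum is 0.000000, so some passengers apparently travelled for free (relatives of the captain, perhaps)
# Summary statistics for the object (string) columns
train_df.describe(include=['O'])
# Name: 891 unique values, so no two passengers share a full name
# Sex: two values, male and female
# Cabin: 204 non-null values but only 147 unique ones, so some passengers shared a cabin
# Embarked: three ports of embarkation
# Now we start the feature engineering
# Take two columns, group by Pclass, compute the mean of each group, and sort by the mean of Survived in descending order. as_index=False keeps Pclass as a regular column instead of using it as the row index of the result.
train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean().sort_values(by='Survived',ascending=False)
# The higher the ticket class, the higher the survival rate (the wealthy seem to command the most resources in any era)
train_df[['Sex','Survived']].groupby(['Sex'],as_index=False).mean().sort_values(by='Survived',ascending=False)
# The relationship between sex and survival is reassuring: "ladies first" was clearly taken seriously (lives could be lost, but not gentlemanly conduct)
train_df[['SibSp','Survived']].groupby(['SibSp'],as_index=False).mean().sort_values(by='Survived',ascending=False)
# SibSp (siblings/spouses aboard) shows no obvious direct relationship with survival
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Parch (parents/children aboard) shows no obvious direct relationship with survival either, although one would expect parents to do everything to save their children, and a family could just as well slow someone down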
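The group-and-average pattern used for Pclass, Sex, SibSp and Parch can be tried on a small table. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, not taken from the Titanic files. Because Survived is 0 or 1, the mean within each group is the share of survivors in that group.

```python
import pandas as pd

# Made-up data for illustration only (not from train.csv)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate of that group.
# as_index=False keeps Pclass as an ordinary column in the result.
rate = (
    toy[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(rate)
```

With as_index=True (the default) Pclass would become the row index instead, which makes the result harder to sort and display alongside other columns.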
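# Next, plot how individual features relate to survival
g = sns.FacetGrid(train_df,col='Survived')
g.map(plt.hist,'Age',bins=20)
# Age distribution of survivors and non-survivors within each ticket class
# (seaborn 0.9 renamed FacetGrid's size parameter to height)
grid = sns.FacetGrid(train_df,col='Survived',row='Pclass',height=2.2,aspect=1.8)
grid.map(plt.hist,'Age',bins=20)
# First class: survivors span a wide age range, with many survivors in every age group
# Second class: survivors skew younger
# Third class: survivors are clearly younger still
# In short: women and children first, then the wealthy
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
# In this plot, men who embarked at C appear to have a higher survival rate than women. The reason is unclear (a local custom favouring men, perhaps, or simply more young boys boarding at this port)
grid = sns.FacetGrid(train_df,row='Embarked',col='Survived',height=2.2,aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare')
grid.add_legend()
# Survival rates differ by port of embarkation, and higher fares go with higher survival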
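# Drop the features that we judge to have little or no relationship with survival: 'Ticket' and 'Cabin'
# Neither the ticket number nor the cabin number seems to say much about whether a passenger survived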
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df,test_df]
# The dataset also has a Name feature, and it is not obvious how it relates to Survived (many people would assume a name has nothing to do with survival)
# But a name can carry information (compare Ergouzi, clearly a casually chosen name, with Wang Runze: the first may hint at a poorer family background.
# In India, to take another example, the surname is one of the clearest markers of social standing)
# So we have good reason to examine the Name feature here
# Extract the title (Mr, Mrs, Miss, ...) from Name into a new Title feature
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train_df['Title'], train_df['Sex'])
# Group the infrequent titles into a single class, and fold the less familiar variants into their closest equivalent
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
# Convert Title to an ordinal code
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
train_df.head()
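To see what the title extraction does step by step, here is a small sketch on a handful of names. The names are made up in the "Surname, Title. Given names" style of the Name column, and the variable names are hypothetical. The regular expression captures the run of letters that sits between a space and a full stop.

```python
import pandas as pd

# Made-up names in the style of the Name column (illustration only)
names = pd.Series([
    "Smith, Mr. John",
    "Brown, Miss. Anna",
    "Dupont, Mlle. Claire",
    "Grey, the Countess. of (Mary Grey)",
])

# A space, then letters, then a literal dot: the letters are the title
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Fold variants and rare titles together, then map to ordinal codes;
# anything not in the mapping would become NaN and is filled with 0
titles = titles.replace(["Countess"], "Rare").replace({"Mlle": "Miss"})
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
codes = titles.map(title_mapping).fillna(0).astype(int)
print(pd.DataFrame({"name": names, "title": titles, "code": codes}))
```

Writing the pattern as a raw string (the r prefix) avoids the invalid escape sequence warning that newer Python versions emit for "\." in an ordinary string.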
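# We can now safely drop the Name feature from both the training and the test set
# PassengerId is not needed as a feature either, so it is dropped from both sets as well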
train_df = train_df.drop(['Name','PassengerId'],axis=1)
test_df = test_df.drop(['Name','PassengerId'],axis=1)
combine = [train_df,test_df]
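# Check the shapes of the two datasets
train_df.shape,test_df.shape
#((891, 9), (418, 8))
# Next, convert the categorical features to numbers, because most of these algorithms require numeric input
# Start with Sex
# Sex has no missing values, so it can be mapped directly
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female':1,'male':0}).astype(int)
train_df.head()
# The table shows that Embarked still needs to be converted to numbers,
# but train_df.info() reported Embarked as 889 non-null, so it has missing values.
# Missing values have to be handled first
# Start with the missing ages
# Approach: infer each missing age from Sex and Pclass, i.e. fill it with the median age of the passengers that share the same Sex and Pclass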
guess_ages = np.zeros((2,3))
guess_ages
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                               (dataset['Pclass'] == j+1)]['Age'].dropna()
            # Median age for this Sex/Pclass combination
            age_guess = guess_df.median()
            # Round the guessed age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                        'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
train_df.head()
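The same idea, filling a missing age with the median age of passengers of the same Sex and Pclass, can also be written without explicit loops. The sketch below is an alternative formulation rather than the code used in this walkthrough, and the `toy` table is made up. It relies on `groupby(...).transform("median")`, which returns one group median per row so that it lines up with the original column.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: two Sex/Pclass groups, one missing age in each
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 23.0, np.nan],
})

# Median age of each row's (Sex, Pclass) group; NaN values are skipped
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Round to the nearest 0.5, mirroring int(x / 0.5 + 0.5) * 0.5
group_guess = np.floor(group_median / 0.5 + 0.5) * 0.5

# Fill only the missing ages, then truncate to whole years
toy["Age"] = toy["Age"].fillna(group_guess).astype(int)
print(toy)
```

Like the loop version, this computes the medians from whichever table it is given, so applying it to the training and test sets separately gives each set its own guesses.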
# Age takes many distinct values, so to speed up computation we group it into intervals
# First look at the survival rate when Age is cut into 5 equal-width bands
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
# Replace Age with an ordinal code, using bands 16 years wide
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
train_df.head()
# Drop AgeBand
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
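The chain of `.loc` assignments that turns Age into a band code can also be expressed with a single `pd.cut` call using explicit bin edges. This is a sketch of an alternative, not the code used in this walkthrough; the sample ages are made up, and the outer edges -1 and 120 are assumed bounds chosen only so that every age falls inside some bin.

```python
import pandas as pd

# Made-up ages that sit on and around the band boundaries
ages = pd.Series([2, 16, 17, 32, 33, 48, 64, 65, 80])

# Bins are right-closed: (-1, 16], (16, 32], (32, 48], (48, 64], (64, 120]
bins = [-1, 16, 32, 48, 64, 120]

# labels=False returns the integer index of the bin instead of an Interval
age_code = pd.cut(ages, bins=bins, labels=False)
print(age_code.tolist())
```

A single call like this cannot leave a band unassigned, which is the kind of slip that is easy to make when each band is written out by hand.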
# SibSp and Parch are closely related, so we combine them into a single feature, FamilySize
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Also create a feature for whether the passenger travelled alone (someone on their own may have had less drive and confidence to survive)
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
# Drop the features that have been replaced
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]
train_df.head()
# Embarked still has two missing values; since there are only two, we simply fill them with the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
#'S'
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Then map Embarked to numbers
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_df.head()
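The two Embarked steps, filling gaps with the most frequent port and then mapping ports to integers, are shown here on a tiny made-up Series so the intermediate values are easy to follow. The variable names are hypothetical.

```python
import pandas as pd

# Made-up port codes with one missing entry
ports = pd.Series(["S", "C", None, "S", "Q"])

# mode() returns the most frequent value(s); [0] takes the first one
most_common = ports.dropna().mode()[0]

# Fill the gap, then map each port to an integer code
port_codes = ports.fillna(most_common).map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(most_common, port_codes.tolist())
```

The order matters: mapping before filling would turn the missing entry into NaN, and the final astype(int) would then fail.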
# Fare has one missing value in the test set; fill it with the median
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())
test_df.head()
# Split Fare into quartile bands (FareBand), then map the bands to numbers
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
# Map Fare to an ordinal code using the band edges
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
train_df.head(10)
test_df.head(10)
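Unlike `pd.cut`, which splits the value range into equal-width intervals, `pd.qcut` chooses the edges so that each band holds roughly the same number of rows; that is where thresholds such as 7.91, 14.454 and 31 come from. The sketch below uses made-up fares to show how the edges and the band codes can be read off directly. It illustrates the function and is not a replacement for the code in this walkthrough.

```python
import pandas as pd

# Made-up fares for illustration (heavily skewed, like real ticket prices)
fares = pd.Series([0, 5, 7, 8, 10, 14, 20, 31, 50, 100, 200, 500], dtype=float)

# retbins=True also returns the quartile edges that qcut computed
bands, edges = pd.qcut(fares, 4, retbins=True)

# labels=False returns the band index (0..3) instead of an Interval
fare_code = pd.qcut(fares, 4, labels=False)

print(edges)
print(fare_code.tolist())
```

If the thresholds are hard-coded, as they are for Fare in this walkthrough, they should be taken from the training set only and then applied unchanged to the test set.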
# Finally, inspect the data once more
# Neither dataset has missing values now, and every feature is numeric
train_df.info()
print('*'*50)
test_df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 Age 891 non-null int64
4 Fare 891 non-null int64
5 Embarked 891 non-null int64
6 Title 891 non-null int64
7 IsAlone 891 non-null int64
dtypes: int64(8)
memory usage: 55.8 KB
**************************************************
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Sex 418 non-null int64
2 Age 418 non-null int64
3 Fare 418 non-null int64
4 Embarked 418 non-null int64
5 Title 418 non-null int64
6 IsAlone 418 non-null int64
dtypes: int64(7)
memory usage: 23.0 KB
# All of the preparation so far was for the training and prediction that follow
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df
X_train.shape, Y_train.shape, X_test.shape
#((891, 7), (891,), (418, 7))
# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train,Y_train)
Y_pred = logreg.predict(X_test)
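# round(x, 2) rounds x to two decimal places. score() is evaluated on the training data here, so this is training accuracy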
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
#78.56
#SVM
svc = SVC()
svc.fit(X_train,Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
#78.34
#KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train,Y_train) * 100,2)
acc_knn
#84.62
# Gaussian naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train,Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train,Y_train) * 100,2)
acc_gaussian
#76.99
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train,Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train,Y_train) * 100,2)
acc_perceptron
#76.77
#LinearSVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
#78.45
# Stochastic gradient descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
#69.36
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train,Y_train)*100,2)
acc_decision_tree
#86.76
# Random forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
#86.76
# Put the models and their scores together and sort them
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
# The decision tree and the random forest score highest (these are training-set accuracies). We choose the random forest, because averaging many trees corrects for a single decision tree's tendency to overfit
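Because every score in the comparison is measured on the same data the model was trained on, a model that memorises the training set looks better than it really is. One common way to get a fairer comparison is k-fold cross-validation. The sketch below is an illustration under stated assumptions, not part of the original walkthrough: it uses synthetic data from `make_classification` in place of the Titanic features, and the sample size, seeds and fold count are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X_train, Y_train); not the Titanic data
X, y = make_classification(n_samples=300, n_features=7,
                           n_informative=4, random_state=0)

candidates = [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

results = {}
for name, model in candidates:
    # Accuracy on the very data the model was fitted on
    train_acc = model.fit(X, y).score(X, y)
    # Mean accuracy over 5 held-out folds (the model is refitted per fold)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    results[name] = (train_acc, cv_acc)
    print(f"{name}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```

On the real features, the same two lines (`score` on the training data and `cross_val_score`) can be applied to `X_train` and `Y_train`; the gap between the two numbers gives a rough sense of how much a model overfits.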
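# Y_pred holds the random forest predictions (the last model fitted).
# PassengerId was dropped from test_df earlier, so it is re-read from the test file here;
# no rows were removed, so the order still matches. Kaggle expects this column in a submission.
submission = pd.DataFrame({
    "PassengerId": pd.read_csv("./dataset/titanic/test.csv")["PassengerId"],
    "Survived": Y_pred
})
# Save the predictions
submission.to_csv('./submission.csv', index=False)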
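That completes the example. As you can see, most of the work went into feature engineering. That is probably the norm in practice: the actual algorithm research and parameter tuning are only a small part of the job.
If you have any questions, feel free to leave a comment.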