Titanic Survival Prediction: Analysis Report

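Contents

  1. Define the Problem (Business Understanding)
  2. Understand the Data (Data Understanding)
    • Collecting the data
    • Importing the data
    • Inspecting the dataset
  3. Clean the Data (Data Preparation)
    • Data preprocessing
    • Feature Engineering
  4. Build the Model (Modeling)
  5. Evaluate the Model (Evaluation)
  6. Deploy the Solution (Deployment)
    • Submitting the results to Kaggle
    • Writing the report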

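1. Define the Problem

What kinds of passengers were more likely to survive the Titanic?

2. Understand the Data

2.1 Collecting the data

Download the dataset from the Titanic project page on Kaggle.

2.2 Importing the data

Read the data with pd.read_csv(), then merge the training and test sets into a single dataset so that both can be cleaned together.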
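# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import the data-processing packages
import numpy as np
import pandas as pd
train=pd.read_csv('E:\\titanic\\train.csv')
test=pd.read_csv('E:\\titanic\\test.csv')
print('Training set:',train.shape,'Test set:',test.shape)
Training set: (891, 12) Test set: (418, 11)
# Merge the two datasets so both can be cleaned at the same time
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
full = pd.concat( [train, test] , ignore_index = True )

print ('Merged dataset:',full.shape)
Merged dataset: (1309, 12)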

2.3 Inspecting the dataset

# Preview the data
full.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

To make the column names easier to understand, I looked up the field descriptions on the project's official page and summarized them in the following figure:

[Figure: summary table of field meanings (image unavailable)]

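'''
describe() only reports descriptive statistics for numeric columns; other types, such as the string columns Name and Cabin, are not shown.
This makes sense: descriptive statistics are computed from numbers, so the column must be numeric.
'''
# Descriptive statistics for the numeric columns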
full.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 1309.000000 891.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 655.000000 0.383838 2.294882 29.881138 0.498854 0.385027 33.295479
std 378.020061 0.486592 0.837836 14.413493 1.041658 0.865560 51.758668
min 1.000000 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 328.000000 0.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 655.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 982.000000 1.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1309.000000 1.000000 3.000000 80.000000 8.000000 9.000000 512.329200
# Check each column's data type and non-null count
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 97.2+ KB

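The output above shows that the dataset has 1309 rows in total.

Among the numeric columns, Age and Fare have missing values:

Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309≈20%
Fare has 1308 non-null values, so 1 value is missing

Among the string columns:

Embarked has 1307 non-null values, so only 2 are missing
Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309≈77.5%, which is very high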
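
The counts and rates above were read off the info() output by hand. A minimal sketch of how the same summary could be computed directly; the helper name missing_summary and the tiny demo frame are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and rate for every column that has gaps."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": n_missing,
        "rate": n_missing / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Tiny illustrative frame (made-up values, not the Titanic data)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
report = missing_summary(demo)
print(report)
```

On the real data, missing_summary(full) would be expected to list Cabin, Survived, Age, Embarked, and Fare; Survived appears only because the test rows carry no label.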

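Next comes data cleaning, where the missing values in these columns are handled.

3. Clean the Data

3.1 Data preprocessing

3.1.1 Handling missing values

Many machine learning algorithms require that the input features contain no null values, so missing values must be handled before training. For the numeric columns (Age and Fare), the simplest approach is to replace missing values with the column mean.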

print('Before processing:')
full.info()
full['Age']=full['Age'].fillna(full['Age'].mean())
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
print('After processing:')
full.info()
Before processing:

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 97.2+ KB
After processing:

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 97.2+ KB
# Check that the data was processed correctly
full.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

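For the string columns Embarked and Cabin, first look at what each column contains. Embarked is missing only two values, so they are filled with the most frequent value; Cabin is missing most of its values, so they are filled with U (for Unknown).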
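
As a small illustration of this fill strategy, the sketch below looks up the most frequent value with mode() rather than hard-coding it. The sample values are made up; the original notebook fills Embarked with the literal 'S' after reading it off value_counts().

```python
import numpy as np
import pandas as pd

# Made-up sample; in the notebook the real column is full['Embarked']
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN and returns a Series (ties give several values); take the first
most_common = embarked.mode().iloc[0]
embarked_filled = embarked.fillna(most_common)

# A placeholder category for a mostly-missing column such as Cabin
cabin = pd.Series([np.nan, "C85", np.nan, "C123"])
cabin_filled = cabin.fillna("U")

print(most_common)
print(cabin_filled.tolist())
```

Using a separate placeholder for Cabin keeps "unknown" as its own category instead of guessing a deck for three quarters of the passengers.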

# Embarked: look at what the data looks like
'''
Departure port: S = Southampton, England
Port of call 1: C = Cherbourg, France
Port of call 2: Q = Queenstown, Ireland
'''
full['Embarked'].head()
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object
full['Embarked'].value_counts()
S    914
C    270
Q    123
Name: Embarked, dtype: int64
'''
Only two values are missing, so fill them with the most frequent value:
S = Southampton, England
'''
full['Embarked'] = full['Embarked'].fillna( 'S' )
# Cabin: look at what the data looks like
full['Cabin'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# Most values are missing, so fill missing Cabin values with U, meaning Unknown
full['Cabin'] = full['Cabin'].fillna( 'U' )
# Check that the data was processed correctly
full.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 U S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 U S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 U S
# Check the final result of missing-value handling. Note that Survived is the label used for prediction, so this column is left as it is
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 97.2+ KB

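3.2 Feature extraction

Feature extraction methods for different data types:

① Numeric data: use directly
② Time series: split into separate year, month, and day features
③ Categorical data: use one-hot encoding to replace categories with numbers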
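
The Titanic data has no date column, so the second method is never exercised in this report. For completeness, here is a minimal sketch of what it could look like on a hypothetical frame; the column name when and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with a date column; the Titanic data has no such field
events = pd.DataFrame({"when": ["1912-04-10", "1912-04-11", "1912-04-15"]})

# Parse the strings to datetime, then expose year / month / day as numeric features
events["when"] = pd.to_datetime(events["when"])
events["year"] = events["when"].dt.year
events["month"] = events["when"].dt.month
events["day"] = events["when"].dt.day

print(events)
```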
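'''
1. Numeric:
PassengerId, Age, Fare, SibSp (number of same-generation relatives aboard: siblings and spouses), Parch (number of different-generation relatives aboard: parents and children)
2. Time series: none
3. Categorical:
1) Columns with explicit categories
Sex: male, female
Embarked: departure port S = Southampton, England; port of call 1: C = Cherbourg, France; port of call 2: Q = Queenstown, Ireland
Pclass (passenger class): 1 = 1st class, 2 = 2nd class, 3 = 3rd class
2) String columns: features may be extracted from these, so they are also treated as categorical
Name
Cabin
Ticket
'''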
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 97.2+ KB

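3.2.1 Categorical data: columns with explicit categories

① Sex: male, female
② Embarked: departure port S = Southampton, England; port of call 1: C = Cherbourg, France; port of call 2: Q = Queenstown, Ireland
③ Pclass (passenger class): 1 = 1st class, 2 = 2nd class, 3 = 3rd class

3.2.1.1 Sex

# View the Sex column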
full['Sex'].head()
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
'''
Map the sex values to numbers:
male maps to 1, female maps to 0
'''
sex_mapDict={'male':1,
            'female':0}
# map: replaces each element of the Series using the given dict (or function)
full['Sex']=full['Sex'].map(sex_mapDict)
full.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U S

3.2.1.2 Port of embarkation

# View the column contents
full['Embarked'].head()
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object
# Holds the extracted features
embarkedDf = pd.DataFrame()

'''
Use get_dummies for one-hot encoding to produce dummy variables; the column-name prefix is Embarked
'''
embarkedDf = pd.get_dummies( full['Embarked'] , prefix='Embarked' )
embarkedDf.head()
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
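
To make the encoding step concrete, here is a small sketch on made-up values. One assumption worth flagging: pandas 2.0 and later return boolean dummy columns by default, so dtype=int is passed here to get the 0/1 integers shown in the output above; the original call does not pass it.

```python
import pandas as pd

# Made-up sample of port codes
ports = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# dtype=int keeps the columns as 0/1 integers regardless of pandas version
dummies = pd.get_dummies(ports, prefix="Embarked", dtype=int)
print(dummies)

# Each row contains exactly one 1, so any one column is implied by the others
row_sums = dummies.sum(axis=1)
```

Because the columns always sum to 1 per row, they are perfectly collinear; this is harmless for tree models and is tolerated by regularized logistic regression, which is what scikit-learn uses by default.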
# Add the dummy variables produced by one-hot encoding to the Titanic dataset full
full = pd.concat([full,embarkedDf],axis=1)

'''
Embarked has already been one-hot encoded into dummy variables,
so the original Embarked column is dropped here
'''
full.drop('Embarked',axis=1,inplace=True)
full.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U 0 0 1
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 0 1
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U 0 0 1

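3.2.1.3 Passenger class

'''
Pclass (passenger class):
1 = 1st class, 2 = 2nd class, 3 = 3rd class
'''
# Holds the extracted features
pclassDf = pd.DataFrame()

# Use get_dummies for one-hot encoding; the column-name prefix is Pclass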
pclassDf = pd.get_dummies( full['Pclass'] , prefix='Pclass' )
pclassDf.head()
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
# Add the dummy variables produced by one-hot encoding to the Titanic dataset full
full = pd.concat([full,pclassDf],axis=1)

# Drop the Pclass column
full.drop('Pclass',axis=1,inplace=True)
full.head()
PassengerId Survived Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3
0 1 0.0 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U 0 0 1 0 0 1
1 2 1.0 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1 0 0 1 0 0
2 3 1.0 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 0 1 0 0 1
3 4 1.0 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 0 0 1 1 0 0
4 5 0.0 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U 0 0 1 0 0 1

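3.2.2 Categorical data: string columns

String columns may contain extractable features, so they are also treated as categorical. The columns here are:

① Name
② Cabin
③ Ticket

3.2.2.1 Extracting the title from the name

'''
Look at the Name column.
One very noticeable feature of Name is that
every name contains a form of address, or title. Extracting it gives a very useful new variable that can help with prediction.
For example:
Braund, Mr. Owen Harris
Heikkinen, Miss. Laina
Oliva y Ocana, Dona. Fermina
Peter, Master. Michael J
'''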
full[ 'Name' ].head()
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
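'''
Define a function that extracts the title from a name
'''
def getTitle(name):
    str1=name.split( ',' )[1] # e.g. ' Mr. Owen Harris'
    str2=str1.split( '.' )[0] # e.g. ' Mr'
    # strip() removes the given characters (whitespace by default) from both ends of the string
    str3=str2.strip()
    return str3
# Holds the extracted features
titleDf = pd.DataFrame()
# map: applies the given function to each element of the Series
titleDf['Title'] = full['Name'].map(getTitle)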
titleDf.head()
Title
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
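
The same extraction can be written without a Python-level function by using a regular expression on the whole Series. This is an alternative sketch, not the method used in this report; the pattern assumes every name follows the "Surname, Title. Given names" layout.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields NaN instead of raising an error, which makes malformed rows easy to find with titles.isna().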
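'''
Define the following title categories:
Officer: officers and professionals (military ranks, doctors, clergy)
Royalty: royalty and nobility
Mr: adult men
Mrs: married women
Miss: unmarried women and girls
Master: young boys
'''
# Mapping from the title strings found in names to the title categories defined above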
title_mapDict = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

# map: replaces each element of the Series using the given dict
titleDf['Title'] = titleDf['Title'].map(title_mapDict)

# Use get_dummies for one-hot encoding
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
Master Miss Mr Mrs Officer Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
# Add the dummy variables produced by one-hot encoding to the Titanic dataset full
full = pd.concat([full,titleDf],axis=1)

# Drop the Name column
full.drop('Name',axis=1,inplace=True)
full.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Cabin Embarked_C ... Embarked_S Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer Royalty
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 C85 1 ... 0 1 0 0 0 0 0 1 0 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 ... 1 0 0 1 0 1 0 0 0 0
3 4 1.0 0 35.0 1 0 113803 53.1000 C123 0 ... 1 1 0 0 0 0 0 1 0 0
4 5 0.0 1 35.0 0 0 373450 8.0500 U 0 ... 1 0 0 1 0 0 1 0 0 0

5 rows × 21 columns

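3.2.2.2 Extracting the cabin category from the cabin number

'''
The first letter of the cabin number is the cabin category
'''
# View the Cabin column contents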
full['Cabin'].head()
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object
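# Holds the cabin information
cabinDf = pd.DataFrame()

'''
The cabin category is the first letter of the cabin number, for example:
C85 maps to the category C
'''
full[ 'Cabin' ] = full[ 'Cabin' ].map( lambda c : c[0] )

# Use get_dummies for one-hot encoding; the column-name prefix is Cabin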
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )

cabinDf.head()
Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1
# Add the dummy variables produced by one-hot encoding to the Titanic dataset full
full = pd.concat([full,cabinDf],axis=1)

# Drop the Cabin column
full.drop('Cabin',axis=1,inplace=True)
full.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Embarked_C Embarked_Q ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 1 0 ... 0 0 0 1 0 0 0 0 0 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 4 1.0 0 35.0 1 0 113803 53.1000 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 5 0.0 1 35.0 0 0 373450 8.0500 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

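3.2.3 Building family size and family category

# Holds the family information
familyDf = pd.DataFrame()

'''
Family size = number of different-generation relatives (Parch) + number of same-generation relatives (SibSp) + the passenger
(the passenger is also a family member, so 1 is added)
'''
familyDf[ 'FamilySize' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1

'''
Family categories:
Single (Family_Single): family size = 1
Small family (Family_Small): 2 <= family size <= 4
Large family (Family_Large): family size >= 5
'''
# Conditional expression: returns the value before "if" when the condition is true, otherwise the value after "else" (0)
familyDf[ 'Family_Single' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if s == 1 else 0 )
familyDf[ 'Family_Small' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 2 <= s <= 4 else 0 )
familyDf[ 'Family_Large' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 5 <= s else 0 )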

familyDf.head()
FamilySize Family_Single Family_Small Family_Large
0 2 0 1 0
1 2 0 1 0
2 1 1 0 0
3 2 0 1 0
4 1 1 0 0
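
The three lambda-based columns amount to binning FamilySize into intervals. As an alternative sketch (not the approach used in this report), pd.cut can express the same thresholds in one place; the sample sizes are made up, and dtype=int is again an assumption to get 0/1 integers on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up family sizes covering every bin
family_size = pd.Series([1, 2, 4, 5, 11], name="FamilySize")

# Right-closed bins: (0, 1] -> Single, (1, 4] -> Small, (4, inf] -> Large
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, np.inf],
    labels=["Single", "Small", "Large"],
)

# One-hot encode the category; columns follow the label order given above
family_dummies = pd.get_dummies(family_type, prefix="Family", dtype=int)
print(family_dummies)
```

Keeping the thresholds in a single bins list makes it harder for the three categories to overlap or leave a gap when the cut-offs are changed later.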
# Add the family features to the Titanic dataset full
full = pd.concat([full,familyDf],axis=1)
full.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Embarked_C Embarked_Q ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize Family_Single Family_Small Family_Large
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 0 0 ... 0 0 0 0 0 1 2 0 1 0
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 1 0 ... 0 0 0 0 0 0 2 0 1 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 0 0 ... 0 0 0 0 0 1 1 1 0 0
3 4 1.0 0 35.0 1 0 113803 53.1000 0 0 ... 0 0 0 0 0 0 2 0 1 0
4 5 0.0 1 35.0 0 0 373450 8.0500 0 0 ... 0 0 0 0 0 1 1 1 0 0

5 rows × 33 columns

# This is how many columns we have so far
full.shape
(1309, 33)

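3.3 Feature selection

3.3.1 Correlation method: computing the correlation coefficients between features

# Correlation matrix
# Ticket is still a string column, so drop it first; pandas 2.0+ raises an error when corr() meets non-numeric columns
corrDf = full.drop(columns=['Ticket']).corr()
corrDf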
PassengerId Survived Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize Family_Single Family_Small Family_Large
PassengerId 1.000000 -0.005007 0.013406 0.025731 -0.055224 0.008942 0.031416 0.048101 0.011585 -0.049836 ... 0.000549 -0.008136 0.000306 -0.045949 -0.023049 0.000208 -0.031437 0.028546 0.002975 -0.063415
Survived -0.005007 1.000000 -0.543351 -0.070323 -0.035322 0.081629 0.257307 0.168240 0.003650 -0.149683 ... 0.150716 0.145321 0.057935 0.016040 -0.026456 -0.316912 0.016639 -0.203367 0.279855 -0.125147
Sex 0.013406 -0.543351 1.000000 0.057397 -0.109609 -0.213125 -0.185484 -0.066564 -0.088651 0.115193 ... -0.057396 -0.040340 -0.006655 -0.083285 0.020558 0.137396 -0.188583 0.284537 -0.255196 -0.077748
Age 0.025731 -0.070323 0.057397 1.000000 -0.190747 -0.130872 0.171521 0.076179 -0.012718 -0.059153 ... 0.132886 0.106600 -0.072644 -0.085977 0.032461 -0.271918 -0.196996 0.116675 -0.038189 -0.161210
SibSp -0.055224 -0.035322 -0.109609 -0.190747 1.000000 0.373587 0.160224 -0.048396 -0.048678 0.073709 ... -0.015727 -0.027180 -0.008619 0.006015 -0.013247 0.009064 0.861952 -0.591077 0.253590 0.699681
Parch 0.008942 0.081629 -0.213125 -0.130872 0.373587 1.000000 0.221522 -0.008635 -0.100943 0.071881 ... -0.027385 0.001084 0.020481 0.058325 -0.012304 -0.036806 0.792296 -0.549022 0.248532 0.624627
Fare 0.031416 0.257307 -0.185484 0.171521 0.160224 0.221522 1.000000 0.286241 -0.130054 -0.169894 ... 0.072737 0.073949 -0.037567 -0.022857 0.001179 -0.507197 0.226465 -0.274826 0.197281 0.170853
Embarked_C 0.048101 0.168240 -0.066564 0.076179 -0.048396 -0.008635 0.286241 1.000000 -0.164166 -0.778262 ... 0.107782 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.036553 -0.107874 0.159594 -0.092825
Embarked_Q 0.011585 0.003650 -0.088651 -0.012718 -0.048678 -0.100943 -0.130054 -0.164166 1.000000 -0.491656 ... -0.061459 -0.042877 -0.020282 -0.019941 -0.008904 0.142369 -0.087190 0.127214 -0.122491 -0.018423
Embarked_S -0.049836 -0.149683 0.115193 -0.059153 0.073709 0.071881 -0.169894 -0.778262 -0.491656 1.000000 ... -0.056023 0.002960 0.030575 0.040560 0.018111 0.137351 0.087771 0.014246 -0.062909 0.093671
Pclass_1 0.026495 0.285904 -0.107371 0.362587 -0.034256 -0.013033 0.599956 0.325722 -0.166101 -0.181800 ... 0.275698 0.242963 -0.073083 -0.035441 0.048310 -0.776987 -0.029656 -0.126551 0.165965 -0.067523
Pclass_2 0.022714 0.093349 -0.028862 -0.014193 -0.052419 -0.010057 -0.121372 -0.134675 -0.121973 0.196532 ... -0.037929 -0.050210 0.127371 -0.032081 -0.014325 0.176485 -0.039976 -0.035075 0.097270 -0.118495
Pclass_3 -0.041544 -0.322308 0.116562 -0.302093 0.072610 0.019521 -0.419616 -0.171430 0.243706 -0.003805 ... -0.207455 -0.169063 -0.041178 0.056964 -0.030057 0.527614 0.058430 0.138250 -0.223338 0.155560
Master 0.002254 0.085221 0.164375 -0.363923 0.329171 0.253482 0.011596 -0.014172 -0.009091 0.018297 ... -0.042192 0.001860 0.058311 -0.013690 -0.006113 0.041178 0.355061 -0.265355 0.120166 0.301809
Miss -0.050027 0.332795 -0.672819 -0.254146 0.077564 0.066473 0.092051 -0.014351 0.198804 -0.113886 ... -0.012516 0.008700 -0.003088 0.061881 -0.013832 -0.004364 0.087350 -0.023890 -0.018085 0.083422
Mr 0.014116 -0.549199 0.870678 0.165476 -0.243104 -0.304780 -0.192192 -0.065538 -0.080224 0.108924 ... -0.030261 -0.032953 -0.026403 -0.072514 0.023611 0.131807 -0.326487 0.386262 -0.300872 -0.194207
Mrs 0.033299 0.344935 -0.571176 0.198091 0.061643 0.213491 0.139235 0.098379 -0.100374 -0.022950 ... 0.080393 0.045538 0.013376 0.042547 -0.011742 -0.162253 0.157233 -0.354649 0.361247 0.012893
Officer 0.002231 -0.031316 0.087288 0.162818 -0.013813 -0.032631 0.028696 0.003678 -0.003212 -0.001202 ... 0.006055 -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 -0.026921 0.013303 0.003966 -0.034572
Royalty 0.004400 0.033391 -0.020408 0.059466 -0.010787 -0.030197 0.026214 0.077213 -0.021853 -0.054250 ... -0.012950 -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 -0.023600 0.008761 -0.000073 -0.017542
Cabin_A -0.002831 0.022287 0.047561 0.125177 -0.039808 -0.030707 0.020094 0.094914 -0.042105 -0.056984 ... -0.024952 -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 -0.042967 0.045227 -0.029546 -0.033799
Cabin_B 0.015895 0.175095 -0.094453 0.113458 -0.011569 0.073051 0.393743 0.161595 -0.073613 -0.095790 ... -0.043624 -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 0.032318 -0.087912 0.084268 0.013470
Cabin_C 0.006092 0.114652 -0.077473 0.167993 0.048616 0.009601 0.401370 0.158043 -0.059151 -0.101861 ... -0.053083 -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 0.037226 -0.137498 0.141925 0.001362
Cabin_D 0.000549 0.150716 -0.057396 0.132886 -0.015727 -0.027385 0.072737 0.107782 -0.061459 -0.056023 ... 1.000000 -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.025313 -0.074310 0.102432 -0.049336
Cabin_E -0.008136 0.145321 -0.040340 0.106600 -0.027180 0.001084 0.073949 0.027566 -0.042877 0.002960 ... -0.034317 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.017285 -0.042535 0.068007 -0.046485
Cabin_F 0.000306 0.057935 -0.006655 -0.072644 -0.008619 0.020481 -0.037567 -0.020010 -0.020282 0.030575 ... -0.024369 -0.022961 1.000000 -0.007907 -0.003531 -0.236733 0.005525 0.004055 0.012756 -0.033009
Cabin_G -0.045949 0.016040 -0.083285 -0.085977 0.006015 0.058325 -0.022857 -0.031566 -0.019941 0.040560 ... -0.011817 -0.011135 -0.007907 1.000000 -0.001712 -0.114803 0.035835 -0.076397 0.087471 -0.016008
Cabin_T -0.023049 -0.026456 0.020558 0.032461 -0.013247 -0.012304 0.001179 -0.014095 -0.008904 0.018111 ... -0.005277 -0.004972 -0.003531 -0.001712 1.000000 -0.051263 -0.015438 0.022411 -0.019574 -0.007148
Cabin_U 0.000208 -0.316912 0.137396 -0.271918 0.009064 -0.036806 -0.507197 -0.258257 0.142369 0.137351 ... -0.353822 -0.333381 -0.236733 -0.114803 -0.051263 1.000000 -0.014155 0.175812 -0.211367 0.056438
FamilySize -0.031437 0.016639 -0.188583 -0.196996 0.861952 0.792296 0.226465 -0.036553 -0.087190 0.087771 ... -0.025313 -0.017285 0.005525 0.035835 -0.015438 -0.014155 1.000000 -0.688864 0.302640 0.801623
Family_Single 0.028546 -0.203367 0.284537 0.116675 -0.591077 -0.549022 -0.274826 -0.107874 0.127214 0.014246 ... -0.074310 -0.042535 0.004055 -0.076397 0.022411 0.175812 -0.688864 1.000000 -0.873398 -0.318944
Family_Small 0.002975 0.279855 -0.255196 -0.038189 0.253590 0.248532 0.197281 0.159594 -0.122491 -0.062909 ... 0.102432 0.068007 0.012756 0.087471 -0.019574 -0.211367 0.302640 -0.873398 1.000000 -0.183007
Family_Large -0.063415 -0.125147 -0.077748 -0.161210 0.699681 0.624627 0.170853 -0.092825 -0.018423 0.093671 ... -0.049336 -0.046485 -0.033009 -0.016008 -0.007148 0.056438 0.801623 -0.318944 -0.183007 1.000000

32 rows × 32 columns

'''
View the correlation coefficient of each feature with Survived;
ascending=False sorts in descending order
'''
corrDf['Survived'].sort_values(ascending =False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Small     0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Single   -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
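
In the listing above, the strongest predictors sit at both ends: Mrs near the top and Mr and Sex at the bottom. A minimal sketch of ranking by absolute value so that strong negative correlations are not buried; the toy frame is made up, and on the real data the same idea would apply to corrDf['Survived'].

```python
import pandas as pd

# Made-up toy data: 'a' tracks the target, 'b' opposes it, 'c' is weakly related
toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 0, 1],
    "a":      [0, 0, 1, 1, 0, 1],
    "b":      [1, 1, 0, 0, 1, 0],
    "c":      [1, 0, 1, 0, 0, 1],
})

corr_with_target = toy.corr()["target"].drop("target")

# Rank by magnitude, but keep the signed values for interpretation
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```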

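3.3.2 Selecting features

Based on the magnitude of each feature's correlation with Survived, the following features are chosen as model inputs:

Title (the titleDf dataset built earlier), passenger class (pclassDf), family size (familyDf), fare (Fare), cabin (cabinDf), port of embarkation (embarkedDf), and sex (Sex)

# Feature selection
full_X = pd.concat( [titleDf,# title
                     pclassDf,# passenger class
                     familyDf,# family size
                     full['Fare'],# fare
                     cabinDf,# cabin
                     embarkedDf,# port of embarkation
                     full['Sex']# sex
                    ] , axis=1 )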
full_X.head()
Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 FamilySize ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Embarked_C Embarked_Q Embarked_S Sex
0 0 0 1 0 0 0 0 0 1 2 ... 0 0 0 0 0 1 0 0 1 1
1 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 1 0 0 0
2 0 1 0 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 1 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 1

5 rows × 27 columns

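4. Build the Model

Fit a machine learning model using the training data and a chosen algorithm, then evaluate the model with test data.

4.1 Building the training and test datasets

* The Titanic test dataset from Kaggle is the one we ultimately submit predictions for. It contains no survival values, so it cannot be used to evaluate the model. We call it the prediction dataset (pred, short for predict): the data whose survival outcomes the model will predict.
* The Titanic training dataset from Kaggle serves as our source dataset (source). From it we split off a training set (train, used to fit the model) and a test set (test, used to evaluate the model).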
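# The source dataset has 891 rows
sourceRow=891

'''
sourceRow is known from before the merge: the source dataset has 891 rows in total.
.loc slices by label and includes the end label, and row labels start at 0,
so the first 891 rows of full_X are labels 0 through sourceRow-1.
'''
# Source dataset: features
source_X = full_X.loc[0:sourceRow-1,:]
# Source dataset: labels
source_y = full.loc[0:sourceRow-1,'Survived']   

# Prediction dataset: features
pred_X = full_X.loc[sourceRow:,:]
'''
Make sure the source dataset is exactly the first 891 rows; otherwise the model will fail later
'''
# Number of rows in the source dataset
print('Source dataset rows:',source_X.shape[0])
# Number of rows in the prediction dataset
print('Prediction dataset rows:',pred_X.shape[0])
Source dataset rows: 891
Prediction dataset rows: 418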
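'''
Split the source dataset into a training set (train, for fitting the model) and a test set (test, for evaluating it).
train_test_split is a commonly used function that randomly splits the samples into train data and test data by a given proportion.
train_data: the sample features to split
train_target: the sample labels to split
test_size: the proportion of samples in the test set; if an integer, the number of samples (the code below sets train_size, the training-set counterpart)
'''

'''
Since scikit-learn 0.18, sklearn.cross_validation has been replaced by sklearn.model_selection,
so the course code
from sklearn.cross_validation import train_test_split 
is updated to the code below
'''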
from sklearn.model_selection import train_test_split

# Build the training and test sets used for modeling
train_X, test_X, train_y, test_y = train_test_split(source_X ,
                                                    source_y,
                                                    train_size=.8)

# Print the dataset sizes
print ('Source features:',source_X.shape, 
       'Training features:',train_X.shape ,
      'Test features:',test_X.shape)

print ('Source labels:',source_y.shape, 
       'Training labels:',train_y.shape ,
      'Test labels:',test_y.shape)
Source features: (891, 27) Training features: (712, 27) Test features: (179, 27)
Source labels: (891,) Training labels: (712,) Test labels: (179,)
# View the source labels
source_y.head()
0    0.0
1    1.0
2    1.0
3    1.0
4    0.0
Name: Survived, dtype: float64
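
The split above is random, so which rows land in the training and test sets changes from run to run. A minimal sketch, on made-up data, of how a fixed seed and stratification could make the split reproducible and keep the label ratio equal in both parts. The seed value 0 is arbitrary, and neither option is used in the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive labels
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 40 + [0] * 60, name="Survived")

# random_state fixes the shuffle; stratify=y preserves the 40/60 label ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```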

4.2 Choosing a machine learning algorithm

# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the model: logistic regression
model = LogisticRegression()
# Random Forests Model
#from sklearn.ensemble import RandomForestClassifier
#model = RandomForestClassifier(n_estimators=100)
# Support Vector Machines
#from sklearn.svm import SVC, LinearSVC
#model = SVC()
#Gradient Boosting Classifier
#from sklearn.ensemble import GradientBoostingClassifier
#model = GradientBoostingClassifier()
#K-nearest neighbors
#from sklearn.neighbors import KNeighborsClassifier
#model = KNeighborsClassifier(n_neighbors = 3)
# Gaussian Naive Bayes
#from sklearn.naive_bayes import GaussianNB
#model = GaussianNB()

4.3 Training the model

# Step 3: train the model
model.fit( train_X , train_y )
LogisticRegression()

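5. Evaluate the Model

The model is evaluated on the test data. Because this is a classification algorithm, the model's score method returns the accuracy.

score takes the test features test_X as its first argument and the test labels test_y as its second. It predicts on test_X, compares the predictions with test_y, and returns the fraction that are correct.

# For a classification problem, score returns the model's accuracy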
model.score(test_X , test_y )
0.8044692737430168
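
For reference, accuracy is simply the share of predictions that match the true labels. A tiny worked sketch with made-up labels and predictions:

```python
import numpy as np

# Made-up labels and predictions for five passengers
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)  # 4 of 5 correct
```

Since the train/test split is random and unseeded, the score of about 0.80 reported above will vary somewhat between runs.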

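6. Deploy the Solution

6.1 Generating predictions and uploading to Kaggle

Use the model to predict on the prediction dataset, save the results to a CSV file, and upload it to Kaggle to see the ranking.

# Use the machine learning model to predict survival for the prediction dataset
pred_Y = model.predict(pred_X)

'''
The predicted values are floats (0.0, 1.0),
but Kaggle requires integer results (0, 1),
so the data type must be converted
'''
pred_Y=pred_Y.astype(int)
# Passenger IDs
passenger_id = full.loc[sourceRow:,'PassengerId']
# DataFrame: passenger ID and predicted survival
predDf = pd.DataFrame( 
    { 'PassengerId': passenger_id , 
     'Survived': pred_Y } )
predDf.shape
predDf.head()
# Save the results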
predDf.to_csv( 'titanic_pred.csv' , index = False )
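
Before uploading, it can help to sanity-check the file's shape and contents. The helper below is a hypothetical sketch, not part of the original notebook or of any Kaggle tooling; it only encodes the requirements already stated in this report (two columns, one row per prediction passenger, integer 0/1 values).

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError if df does not look like a valid submission (hypothetical helper)."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    if df["PassengerId"].duplicated().any():
        raise ValueError("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")

# Made-up three-row example; passes silently
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(demo, expected_rows=3)
```

For the real file, the call would be check_submission(predDf, expected_rows=418), matching the 418 rows of the prediction dataset.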
