The Titanic divided its passengers into three classes: first, second, and third. Class determined the quality of safety provisions, entertainment, and dining, and so had some bearing on survival.
It was also an era of gentlemanly conduct: when the ship went down, many men gave up their places in the lifeboats so that women and children could escape first, going to their deaths instead. Sex and age are therefore further factors affecting survival.
From this background we can make an initial judgement that passenger class, age, and sex influenced the survival rate.
Some people were more likely to survive than others, such as women, children, and the upper class. So what sort of passenger was most likely to survive the Titanic?
The dataset can be downloaded from:
https://www.kaggle.com/competitions/titanic/data
import warnings
warnings.filterwarnings('ignore')
# Import the data-handling packages
import numpy as np
import pandas as pd
# Load the data
train_data = pd.read_csv("./titanic_data/train.csv")
test_data = pd.read_csv("./titanic_data/test.csv")
print('Training set:',train_data.shape,'Test set:',test_data.shape)
Training set: (891, 12) Test set: (418, 11)
# Concatenate the two datasets so they can be cleaned together
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
full_data = pd.concat([train_data, test_data], ignore_index=True)
print('Combined dataset:',full_data.shape)
Combined dataset: (1309, 12)
# Preview the data
full_data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Descriptive statistics for the numeric columns
full_data.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 1309.000000 | 891.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
mean | 655.000000 | 0.383838 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 |
std | 378.020061 | 0.486592 | 0.837836 | 14.413493 | 1.041658 | 0.865560 | 51.758668 |
min | 1.000000 | 0.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 328.000000 | 0.000000 | 2.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 655.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 982.000000 | 1.000000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 |
max | 1309.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
describe() only reports statistics for numeric columns; string columns such as the passenger name (Name) and cabin number (Cabin) are not shown.
That is easy to understand: descriptive statistics are computed from numbers, so a column has to be numeric to be included.
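If the string columns are of interest as well, describe() can be asked for them explicitly (an optional check, not part of the original workflow):
# Count / unique / top / freq for the object (string) columns
full_data.describe(include=['object'])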
# Check each column's data type and non-null count
full_data.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
The combined data has 1,309 rows in total.
Among the numeric columns, age (Age) and ticket fare (Fare) contain missing values:
1) Age has 1,046 non-null values, so 1309 - 1046 = 263 are missing, a missing rate of 263/1309 ≈ 20%.
2) Fare has 1,308 non-null values, so only 1 value is missing.
Among the string columns:
1) Embarked (port of embarkation) has 1,307 non-null values; only 2 are missing, which is very few.
2) Cabin (cabin number) has 295 non-null values, so 1309 - 295 = 1014 are missing, a missing rate of 1014/1309 ≈ 77.5%, which is substantial.
This points the way for the next step, data cleaning: only once we know which columns have missing values can we handle them in a targeted way.
Handling missing values:
As we saw in the data-understanding step above, the dataset has 1,309 rows and several columns with missing values. Many machine learning algorithms require that the features they are given contain no null values, so we fill the gaps before modelling.
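The missing counts and rates listed above can also be tabulated programmatically (a small optional check):
# Missing count and missing rate per column, highest first
# (Survived's 418 missing values are simply the unlabelled test rows)
missing = full_data.isnull().sum().sort_values(ascending=False)
print(pd.concat([missing, (missing / len(full_data)).round(3)],
                axis=1, keys=['missing', 'rate']))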
# 1. Fill the two numeric columns, Age and Fare, with their respective mean values
full_data['Age'] = full_data['Age'].fillna(full_data['Age'].mean())
full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].mean())
# Age and Fare no longer contain any null values
full_data.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1309 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1309 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
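Mean imputation is the simplest choice, but it is sensitive to outliers, and Fare in particular is heavily right-skewed (mean ≈ 33.3 vs. median ≈ 14.45 in the describe table above). A hedged alternative, not applied here, would be to fill with the median instead:
# Alternative (not applied above): median imputation is more robust to skew
# full_data['Age'] = full_data['Age'].fillna(full_data['Age'].median())
# full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].median())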
# 2. Fill the Embarked (port of embarkation) column
'''
Departure port: S = Southampton, England
First stop:     C = Cherbourg, France
Second stop:    Q = Queenstown, Ireland
'''
# S is by far the most common category, so the missing values are filled with the most frequent one
full_data['Embarked'].value_counts()
S 914
C 270
Q 123
Name: Embarked, dtype: int64
# Fill the missing values with the most frequent port, S
full_data['Embarked'] = full_data['Embarked'].fillna('S')
# Embarked no longer contains any null values
full_data.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1309 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1309 non-null float64
10 Cabin 295 non-null object
11 Embarked 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
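For reference, the most frequent value does not have to be hard-coded; it can be looked up with mode() (an equivalent sketch):
# Equivalent to the fill above: look up the most frequent port programmatically
most_common_port = full_data['Embarked'].mode()[0]   # 'S'
# full_data['Embarked'] = full_data['Embarked'].fillna(most_common_port)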
# 3. Fill the Cabin (cabin number) column
full_data['Cabin'].value_counts()
C23 C25 C27 6
G6 5
B57 B59 B63 B66 5
C22 C26 4
F33 4
..
A14 1
E63 1
E12 1
E38 1
C105 1
Name: Cabin, Length: 186, dtype: int64
# Cabin has a lot of missing values; fill them with 'U' for unknown
full_data['Cabin'] = full_data['Cabin'].fillna('U')
# No feature column contains null values any more; Survived is the label column, and its 418 missing values are simply the unlabelled test rows, so it is left as is
full_data.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1309 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1309 non-null float64
10 Cabin 1309 non-null object
11 Embarked 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# Sanity-check the data
full_data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | U | S |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | S |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | U | S |
Looking at the data types, the columns fall into three groups. The categorical columns will be processed by replacing categories with numbers and then one-hot encoding them.
(1) Numeric:
passenger ID (PassengerId), age (Age), ticket fare (Fare), number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch)
(2) Time series: none
(3) Categorical:
1) Columns that are already categories
passenger sex (Sex): male, female
port of embarkation (Embarked): departure port S = Southampton, England; stop 1 C = Cherbourg, France; stop 2 Q = Queenstown, Ireland
passenger class (Pclass): 1 = 1st class, 2 = 2nd class, 3 = 3rd class
2) Free-text strings from which features can be extracted, so they are also treated as categorical
passenger name (Name)
cabin number (Cabin)
ticket number (Ticket)
# 1. Map sex to a number: male -> 1, female -> 0
sex_dict = {
'male':1,
'female':0
}
full_data['Sex'] = full_data['Sex'].map(sex_dict)
full_data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | U | S |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | S |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | U | S |
# 2. One-hot encode the port of embarkation (Embarked)
'''
get_dummies performs the one-hot encoding, producing dummy variables
'''
embarkedDf = pd.get_dummies(full_data['Embarked'],prefix='Embarked')
embarkedDf.head()
Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
# Add the dummy variables produced by the one-hot encoding to the original dataset
full_data = pd.concat([full_data,embarkedDf],axis=1)
'''
Embarked has been one-hot encoded into dummy variables, so the original Embarked column is dropped.
How drop works here: drop(name, axis=1) removes the column whose label is name;
axis=1 tells pandas that the label refers to a column (axis=0 would refer to a row index).
In short, to drop one or more columns: drop([col1, col2], axis=1).
'''
full_data.drop('Embarked',axis=1,inplace=True)
full_data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | U | 0 | 0 | 1 |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 0 |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | 0 | 0 | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 1 |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | U | 0 | 0 | 1 |
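As an aside, pd.get_dummies can also be applied to the whole DataFrame: passing columns=['Embarked'] encodes that column and drops the original in a single call (an equivalent sketch, not used in this walkthrough):
# One-step equivalent of the concat + drop above
# full_data = pd.get_dummies(full_data, columns=['Embarked'])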
# 3. One-hot encode the passenger class (Pclass)
# Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
pclassDf = pd.get_dummies(full_data['Pclass'],prefix='Pclass')
pclassDf.head()
Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
# Add the dummy variables produced by the one-hot encoding to the original dataset
full_data = pd.concat([full_data,pclassDf],axis=1)
full_data.drop('Pclass',axis=1,inplace=True)
full_data.head()
PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | U | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1.0 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 4 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 1 | 1 | 0 | 0 |
4 | 5 | 0.0 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | U | 0 | 0 | 1 | 0 | 0 | 1 |
# 1. Extract the title from the Name column
'''
The passenger names (Name) have a striking property:
every name contains a title. Extracting it gives a very useful new variable
that can help with prediction.
'''
full_data['Name'].head(10)
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
'''
Define a function that extracts the title from a name,
e.g. "Braund, Mr. Owen Harris" -> "Mr"
'''
def getTitle(name):
    # the title sits between the comma and the first period
    title = name.split(',')[1].split('.')[0]
    return title.strip()
titleDf = pd.DataFrame()
titleDf['Title'] = full_data['Name'].map(getTitle)
titleDf
Title | |
---|---|
0 | Mr |
1 | Mrs |
2 | Miss |
3 | Mrs |
4 | Mr |
... | ... |
1304 | Mr |
1305 | Dona |
1306 | Mr |
1307 | Mr |
1308 | Master |
1309 rows × 1 columns
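The same extraction can also be done with pandas' vectorised string methods instead of a Python helper function (an equivalent sketch):
# Regex alternative: capture the text between the comma and the first period
titleDf['Title'] = full_data['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()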
'''
Define the following title categories:
Officer   officials, clergy and doctors
Royalty   nobility and royalty
Mr        adult man
Mrs       married woman
Miss      young unmarried woman
Master    young boy
'''
# Mapping from the raw title strings in the names to the categories defined above
title_dict = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Don": "Royalty",
"Sir": "Royalty",
"Jonkheer": "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess": "Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr": "Mr",
"Mrs": "Mrs",
"Miss": "Miss",
"Master": "Master",
"Lady": "Royalty"
}
titleDf['Title'] = titleDf['Title'].map(title_dict)
# One-hot encode the title categories
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
Master | Miss | Mr | Mrs | Officer | Royalty | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
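Before relying on this encoding, it is worth checking that every raw title is covered by title_dict; any title missing from the dictionary would become NaN after map() and silently end up as an all-zero row (a small optional check, run while the Name column still exists):
# Every extracted title should map to one of the six categories
assert full_data['Name'].map(getTitle).map(title_dict).notnull().all()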
# Add the one-hot columns to full_data and drop the Name column
full_data = pd.concat([full_data,titleDf],axis=1)
full_data.drop('Name',axis=1,inplace=True)
full_data
PassengerId | Survived | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_C | ... | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Master | Miss | Mr | Mrs | Officer | Royalty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 2 | 1.0 | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1.0 | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 4 | 1.0 | 0 | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 5 | 0.0 | 1 | 35.000000 | 0 | 0 | 373450 | 8.0500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1304 | 1305 | NaN | 1 | 29.881138 | 0 | 0 | A.5. 3236 | 8.0500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1305 | 1306 | NaN | 0 | 39.000000 | 0 | 0 | PC 17758 | 108.9000 | C105 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1306 | 1307 | NaN | 1 | 38.500000 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1307 | 1308 | NaN | 1 | 29.881138 | 0 | 0 | 359309 | 8.0500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1308 | 1309 | NaN | 1 | 29.881138 | 1 | 1 | 2668 | 22.3583 | U | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
1309 rows × 21 columns
# 2. Extract the deck letter (the first character of the cabin number) from the Cabin column
full_data['Cabin'] = full_data['Cabin'].map(lambda c:c[0])
full_data.head()
PassengerId | Survived | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_C | ... | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Master | Miss | Mr | Mrs | Officer | Royalty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 2 | 1.0 | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1.0 | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 4 | 1.0 | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 5 | 0.0 | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | U | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 21 columns
# One-hot encode the deck letter
cabinDf = pd.get_dummies(full_data['Cabin'],prefix='Cabin')
cabinDf.head()
Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
full_data = pd.concat([full_data,cabinDf],axis=1)
full_data.drop('Cabin',axis=1,inplace=True)
full_data.head()
PassengerId | Survived | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked_C | Embarked_Q | ... | Royalty | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1.0 | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | 1.0 | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1.0 | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 5 | 0.0 | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 29 columns
# 3. Build family-size and family-category features
familyDf = pd.DataFrame()
'''
Family size = number of parents/children aboard (Parch)
            + number of siblings/spouses aboard (SibSp)
            + the passenger themselves
'''
familyDf['FamilySize'] = full_data['Parch'] + full_data['SibSp'] + 1
familyDf.head()
FamilySize | |
---|---|
0 | 2 |
1 | 2 |
2 | 1 |
3 | 2 |
4 | 1 |
'''
Family categories:
Family_Small:  family size == 1 (travelling alone)
Family_Middle: 2 <= family size <= 4
Family_Large:  family size >= 5
'''
familyDf['Family_Small'] = familyDf['FamilySize'].map(lambda cnt: 1 if cnt == 1 else 0 )
familyDf['Family_Middle'] = familyDf['FamilySize'].map(lambda cnt: 1 if 2 <= cnt <= 4 else 0 )
familyDf['Family_Large'] = familyDf['FamilySize'].map(lambda cnt: 1 if cnt >= 5 else 0 )
familyDf.head()
FamilySize | Family_Small | Family_Middle | Family_Large | |
---|---|---|---|---|
0 | 2 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 |
3 | 2 | 0 | 1 | 0 |
4 | 1 | 1 | 0 | 0 |
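The three indicator columns could equivalently be produced by binning FamilySize with pd.cut and one-hot encoding the result (a sketch, not used below):
# Equivalent binning: (0, 1] -> Small, (1, 4] -> Middle, (4, inf) -> Large
family_cat = pd.cut(familyDf['FamilySize'], bins=[0, 1, 4, np.inf],
                    labels=['Family_Small', 'Family_Middle', 'Family_Large'])
# pd.get_dummies(family_cat) would reproduce the three indicator columns above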
# Concatenate onto full_data
full_data = pd.concat([full_data,familyDf],axis=1)
full_data.head()
PassengerId | Survived | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked_C | Embarked_Q | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | FamilySize | Family_Small | Family_Middle | Family_Large | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
1 | 2 | 1.0 | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
2 | 3 | 1.0 | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
3 | 4 | 1.0 | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
4 | 5 | 0.0 | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
5 rows × 33 columns
# Shape of the current feature set
full_data.shape
(1309, 33)
# Correlation matrix (numeric_only=True skips the remaining string column, Ticket;
# on pandas >= 2.0 this argument is required)
corrDf = full_data.corr(numeric_only=True)
corrDf
PassengerId | Survived | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | FamilySize | Family_Small | Family_Middle | Family_Large | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 1.000000 | -0.005007 | 0.013406 | 0.025731 | -0.055224 | 0.008942 | 0.031416 | 0.048101 | 0.011585 | -0.049836 | ... | 0.000549 | -0.008136 | 0.000306 | -0.045949 | -0.023049 | 0.000208 | -0.031437 | 0.028546 | 0.002975 | -0.063415 |
Survived | -0.005007 | 1.000000 | -0.543351 | -0.070323 | -0.035322 | 0.081629 | 0.257307 | 0.168240 | 0.003650 | -0.149683 | ... | 0.150716 | 0.145321 | 0.057935 | 0.016040 | -0.026456 | -0.316912 | 0.016639 | -0.203367 | 0.279855 | -0.125147 |
Sex | 0.013406 | -0.543351 | 1.000000 | 0.057397 | -0.109609 | -0.213125 | -0.185484 | -0.066564 | -0.088651 | 0.115193 | ... | -0.057396 | -0.040340 | -0.006655 | -0.083285 | 0.020558 | 0.137396 | -0.188583 | 0.284537 | -0.255196 | -0.077748 |
Age | 0.025731 | -0.070323 | 0.057397 | 1.000000 | -0.190747 | -0.130872 | 0.171521 | 0.076179 | -0.012718 | -0.059153 | ... | 0.132886 | 0.106600 | -0.072644 | -0.085977 | 0.032461 | -0.271918 | -0.196996 | 0.116675 | -0.038189 | -0.161210 |
SibSp | -0.055224 | -0.035322 | -0.109609 | -0.190747 | 1.000000 | 0.373587 | 0.160224 | -0.048396 | -0.048678 | 0.073709 | ... | -0.015727 | -0.027180 | -0.008619 | 0.006015 | -0.013247 | 0.009064 | 0.861952 | -0.591077 | 0.253590 | 0.699681 |
Parch | 0.008942 | 0.081629 | -0.213125 | -0.130872 | 0.373587 | 1.000000 | 0.221522 | -0.008635 | -0.100943 | 0.071881 | ... | -0.027385 | 0.001084 | 0.020481 | 0.058325 | -0.012304 | -0.036806 | 0.792296 | -0.549022 | 0.248532 | 0.624627 |
Fare | 0.031416 | 0.257307 | -0.185484 | 0.171521 | 0.160224 | 0.221522 | 1.000000 | 0.286241 | -0.130054 | -0.169894 | ... | 0.072737 | 0.073949 | -0.037567 | -0.022857 | 0.001179 | -0.507197 | 0.226465 | -0.274826 | 0.197281 | 0.170853 |
Embarked_C | 0.048101 | 0.168240 | -0.066564 | 0.076179 | -0.048396 | -0.008635 | 0.286241 | 1.000000 | -0.164166 | -0.778262 | ... | 0.107782 | 0.027566 | -0.020010 | -0.031566 | -0.014095 | -0.258257 | -0.036553 | -0.107874 | 0.159594 | -0.092825 |
Embarked_Q | 0.011585 | 0.003650 | -0.088651 | -0.012718 | -0.048678 | -0.100943 | -0.130054 | -0.164166 | 1.000000 | -0.491656 | ... | -0.061459 | -0.042877 | -0.020282 | -0.019941 | -0.008904 | 0.142369 | -0.087190 | 0.127214 | -0.122491 | -0.018423 |
Embarked_S | -0.049836 | -0.149683 | 0.115193 | -0.059153 | 0.073709 | 0.071881 | -0.169894 | -0.778262 | -0.491656 | 1.000000 | ... | -0.056023 | 0.002960 | 0.030575 | 0.040560 | 0.018111 | 0.137351 | 0.087771 | 0.014246 | -0.062909 | 0.093671 |
Pclass_1 | 0.026495 | 0.285904 | -0.107371 | 0.362587 | -0.034256 | -0.013033 | 0.599956 | 0.325722 | -0.166101 | -0.181800 | ... | 0.275698 | 0.242963 | -0.073083 | -0.035441 | 0.048310 | -0.776987 | -0.029656 | -0.126551 | 0.165965 | -0.067523 |
Pclass_2 | 0.022714 | 0.093349 | -0.028862 | -0.014193 | -0.052419 | -0.010057 | -0.121372 | -0.134675 | -0.121973 | 0.196532 | ... | -0.037929 | -0.050210 | 0.127371 | -0.032081 | -0.014325 | 0.176485 | -0.039976 | -0.035075 | 0.097270 | -0.118495 |
Pclass_3 | -0.041544 | -0.322308 | 0.116562 | -0.302093 | 0.072610 | 0.019521 | -0.419616 | -0.171430 | 0.243706 | -0.003805 | ... | -0.207455 | -0.169063 | -0.041178 | 0.056964 | -0.030057 | 0.527614 | 0.058430 | 0.138250 | -0.223338 | 0.155560 |
Master | 0.002254 | 0.085221 | 0.164375 | -0.363923 | 0.329171 | 0.253482 | 0.011596 | -0.014172 | -0.009091 | 0.018297 | ... | -0.042192 | 0.001860 | 0.058311 | -0.013690 | -0.006113 | 0.041178 | 0.355061 | -0.265355 | 0.120166 | 0.301809 |
Miss | -0.050027 | 0.332795 | -0.672819 | -0.254146 | 0.077564 | 0.066473 | 0.092051 | -0.014351 | 0.198804 | -0.113886 | ... | -0.012516 | 0.008700 | -0.003088 | 0.061881 | -0.013832 | -0.004364 | 0.087350 | -0.023890 | -0.018085 | 0.083422 |
Mr | 0.014116 | -0.549199 | 0.870678 | 0.165476 | -0.243104 | -0.304780 | -0.192192 | -0.065538 | -0.080224 | 0.108924 | ... | -0.030261 | -0.032953 | -0.026403 | -0.072514 | 0.023611 | 0.131807 | -0.326487 | 0.386262 | -0.300872 | -0.194207 |
Mrs | 0.033299 | 0.344935 | -0.571176 | 0.198091 | 0.061643 | 0.213491 | 0.139235 | 0.098379 | -0.100374 | -0.022950 | ... | 0.080393 | 0.045538 | 0.013376 | 0.042547 | -0.011742 | -0.162253 | 0.157233 | -0.354649 | 0.361247 | 0.012893 |
Officer | 0.002231 | -0.031316 | 0.087288 | 0.162818 | -0.013813 | -0.032631 | 0.028696 | 0.003678 | -0.003212 | -0.001202 | ... | 0.006055 | -0.024048 | -0.017076 | -0.008281 | -0.003698 | -0.067030 | -0.026921 | 0.013303 | 0.003966 | -0.034572 |
Royalty | 0.004400 | 0.033391 | -0.020408 | 0.059466 | -0.010787 | -0.030197 | 0.026214 | 0.077213 | -0.021853 | -0.054250 | ... | -0.012950 | -0.012202 | -0.008665 | -0.004202 | -0.001876 | -0.071672 | -0.023600 | 0.008761 | -0.000073 | -0.017542 |
Cabin_A | -0.002831 | 0.022287 | 0.047561 | 0.125177 | -0.039808 | -0.030707 | 0.020094 | 0.094914 | -0.042105 | -0.056984 | ... | -0.024952 | -0.023510 | -0.016695 | -0.008096 | -0.003615 | -0.242399 | -0.042967 | 0.045227 | -0.029546 | -0.033799 |
Cabin_B | 0.015895 | 0.175095 | -0.094453 | 0.113458 | -0.011569 | 0.073051 | 0.393743 | 0.161595 | -0.073613 | -0.095790 | ... | -0.043624 | -0.041103 | -0.029188 | -0.014154 | -0.006320 | -0.423794 | 0.032318 | -0.087912 | 0.084268 | 0.013470 |
Cabin_C | 0.006092 | 0.114652 | -0.077473 | 0.167993 | 0.048616 | 0.009601 | 0.401370 | 0.158043 | -0.059151 | -0.101861 | ... | -0.053083 | -0.050016 | -0.035516 | -0.017224 | -0.007691 | -0.515684 | 0.037226 | -0.137498 | 0.141925 | 0.001362 |
Cabin_D | 0.000549 | 0.150716 | -0.057396 | 0.132886 | -0.015727 | -0.027385 | 0.072737 | 0.107782 | -0.061459 | -0.056023 | ... | 1.000000 | -0.034317 | -0.024369 | -0.011817 | -0.005277 | -0.353822 | -0.025313 | -0.074310 | 0.102432 | -0.049336 |
Cabin_E | -0.008136 | 0.145321 | -0.040340 | 0.106600 | -0.027180 | 0.001084 | 0.073949 | 0.027566 | -0.042877 | 0.002960 | ... | -0.034317 | 1.000000 | -0.022961 | -0.011135 | -0.004972 | -0.333381 | -0.017285 | -0.042535 | 0.068007 | -0.046485 |
Cabin_F | 0.000306 | 0.057935 | -0.006655 | -0.072644 | -0.008619 | 0.020481 | -0.037567 | -0.020010 | -0.020282 | 0.030575 | ... | -0.024369 | -0.022961 | 1.000000 | -0.007907 | -0.003531 | -0.236733 | 0.005525 | 0.004055 | 0.012756 | -0.033009 |
Cabin_G | -0.045949 | 0.016040 | -0.083285 | -0.085977 | 0.006015 | 0.058325 | -0.022857 | -0.031566 | -0.019941 | 0.040560 | ... | -0.011817 | -0.011135 | -0.007907 | 1.000000 | -0.001712 | -0.114803 | 0.035835 | -0.076397 | 0.087471 | -0.016008 |
Cabin_T | -0.023049 | -0.026456 | 0.020558 | 0.032461 | -0.013247 | -0.012304 | 0.001179 | -0.014095 | -0.008904 | 0.018111 | ... | -0.005277 | -0.004972 | -0.003531 | -0.001712 | 1.000000 | -0.051263 | -0.015438 | 0.022411 | -0.019574 | -0.007148 |
Cabin_U | 0.000208 | -0.316912 | 0.137396 | -0.271918 | 0.009064 | -0.036806 | -0.507197 | -0.258257 | 0.142369 | 0.137351 | ... | -0.353822 | -0.333381 | -0.236733 | -0.114803 | -0.051263 | 1.000000 | -0.014155 | 0.175812 | -0.211367 | 0.056438 |
FamilySize | -0.031437 | 0.016639 | -0.188583 | -0.196996 | 0.861952 | 0.792296 | 0.226465 | -0.036553 | -0.087190 | 0.087771 | ... | -0.025313 | -0.017285 | 0.005525 | 0.035835 | -0.015438 | -0.014155 | 1.000000 | -0.688864 | 0.302640 | 0.801623 |
Family_Small | 0.028546 | -0.203367 | 0.284537 | 0.116675 | -0.591077 | -0.549022 | -0.274826 | -0.107874 | 0.127214 | 0.014246 | ... | -0.074310 | -0.042535 | 0.004055 | -0.076397 | 0.022411 | 0.175812 | -0.688864 | 1.000000 | -0.873398 | -0.318944 |
Family_Middle | 0.002975 | 0.279855 | -0.255196 | -0.038189 | 0.253590 | 0.248532 | 0.197281 | 0.159594 | -0.122491 | -0.062909 | ... | 0.102432 | 0.068007 | 0.012756 | 0.087471 | -0.019574 | -0.211367 | 0.302640 | -0.873398 | 1.000000 | -0.183007 |
Family_Large | -0.063415 | -0.125147 | -0.077748 | -0.161210 | 0.699681 | 0.624627 | 0.170853 | -0.092825 | -0.018423 | 0.093671 | ... | -0.049336 | -0.046485 | -0.033009 | -0.016008 | -0.007148 | 0.056438 | 0.801623 | -0.318944 | -0.183007 | 1.000000 |
32 rows × 32 columns
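A heatmap makes this matrix much easier to read. The sketch below assumes matplotlib and seaborn are installed; neither is used elsewhere in this walkthrough:
# Visualise the correlation matrix (optional)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 12))
sns.heatmap(corrDf, cmap='coolwarm', center=0)   # diverging colours centred on 0
plt.show()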
'''
Correlation of each feature with Survived, sorted in descending order
'''
corrDf['Survived'].sort_values(ascending=False)
Survived 1.000000
Mrs 0.344935
Miss 0.332795
Pclass_1 0.285904
Family_Middle 0.279855
Fare 0.257307
Cabin_B 0.175095
Embarked_C 0.168240
Cabin_D 0.150716
Cabin_E 0.145321
Cabin_C 0.114652
Pclass_2 0.093349
Master 0.085221
Parch 0.081629
Cabin_F 0.057935
Royalty 0.033391
Cabin_A 0.022287
FamilySize 0.016639
Cabin_G 0.016040
Embarked_Q 0.003650
PassengerId -0.005007
Cabin_T -0.026456
Officer -0.031316
SibSp -0.035322
Age -0.070323
Family_Large -0.125147
Embarked_S -0.149683
Family_Small -0.203367
Cabin_U -0.316912
Pclass_3 -0.322308
Sex -0.543351
Mr -0.549199
Name: Survived, dtype: float64
Based on the size of each feature's correlation with Survived, we select the following features as model inputs:
title (the titleDf built above), passenger class (pclassDf), family size (familyDf), ticket fare (Fare), cabin deck (cabinDf), port of embarkation (embarkedDf), and sex (Sex).
full_X = pd.concat(
[
titleDf,
pclassDf,
familyDf,
full_data['Fare'],
cabinDf,
embarkedDf,
full_data['Sex']
],axis=1
)
full_X.head()
Master | Miss | Mr | Mrs | Officer | Royalty | Pclass_1 | Pclass_2 | Pclass_3 | FamilySize | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Embarked_C | Embarked_Q | Embarked_S | Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
5 rows × 27 columns
The Titanic test dataset is the one we eventually submit to Kaggle; it has no Survived values, so it cannot be used to evaluate the model.
Instead, we use Kaggle's training dataset as our original dataset (call it source) and split it into a training set (train, for fitting the model) and a test set (test, for evaluating it).
# The original (labelled) dataset has 891 rows
source_row = 891
# Features of the original dataset
source_X = full_X.loc[0:source_row-1,:]
# Labels of the original dataset
source_y = full_data.loc[0:source_row-1,'Survived']
# Features of the prediction (submission) dataset
pred_X = full_X.loc[source_row:,:]
print('Size of the original dataset:',source_X.shape[0])
print('Size of the prediction dataset:',pred_X.shape[0])
Size of the original dataset: 891
Size of the prediction dataset: 418
# 1. Split the original dataset into a training and a test set
# (no random_state is fixed, so the exact split, and hence the score below, varies between runs)
from sklearn.model_selection import train_test_split
train_X,test_X,train_y,test_y = train_test_split(
source_X,
source_y,
test_size=0.2,
train_size=0.8
)
# 2. Choose a machine learning algorithm; we use the most basic one, logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# 3. Train the model
lr.fit(train_X,train_y)
# 4. Evaluate the model: score() reports accuracy on the held-out test set
lr.score(test_X,test_y)
0.8156424581005587
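The score above comes from a single random split, so it is a fairly noisy estimate. For a steadier one, k-fold cross-validation on all 891 labelled rows can be used (a sketch; max_iter is raised only to avoid convergence warnings):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the same model
scores = cross_val_score(LogisticRegression(max_iter=1000), source_X, source_y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))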
# Predict on the submission dataset
pred_y = lr.predict(pred_X)
# Convert to int, as required by Kaggle
pred_y = pred_y.astype(int)
# Passenger IDs
passenger_id = full_data.loc[source_row:,'PassengerId']
predDf = pd.DataFrame(
{
'PassengerId':passenger_id,
'Survived':pred_y
}
)
predDf.head()
PassengerId | Survived | |
---|---|---|
891 | 892 | 0 |
892 | 893 | 1 |
893 | 894 | 0 |
894 | 895 | 0 |
895 | 896 | 1 |
# Save the submission file
predDf.to_csv('./titanic_data/titanic_pred.csv',index=False)