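What kind of passenger was more likely to survive the Titanic?
1) Collect the data
2) Import the data
3) Inspect the dataset
Download the Kaggle Titanic data
We merge the training and test data so that both can be cleaned at the same time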
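# Import the data-processing packages
import numpy as np
import pandas as pd
# Load the data
path='C:/Users/Titanic'
# Training dataset (read_csv accepts a path directly, so no file handle is left open)
train=pd.read_csv(path+'/train.csv')
# Test dataset
test=pd.read_csv(path+'/test.csv')
# Remember that the training set has 891 rows; this number is needed later to split the data again
print('Training set:',train.shape,'Test set:',test.shape)
Training set: (891, 12) Test set: (418, 11)
rowNum_train=train.shape[0]
rowNum_test=test.shape[0]
print('Rows in the Kaggle training set:',rowNum_train,
      'Rows in the Kaggle test set:',rowNum_test)
Rows in the Kaggle training set: 891 Rows in the Kaggle test set: 418
# Merge the datasets so that both can be cleaned at the same time.
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job,
# and sort=True keeps the alphabetical column order seen in the output below.
full=pd.concat([train,test],ignore_index=True,sort=True)
print('Merged dataset:',full.shape)
Merged dataset: (1309, 12)
# Look at the data
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |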
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
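Embarked: port of embarkation
(S = Southampton, England; C = Cherbourg, France; Q = Queenstown, Ireland)
Fare: ticket price
Parch: number of parents/children aboard (direct relatives of a different generation)
SibSp: number of siblings/spouses aboard (relatives of the same generation)
Pclass: passenger class (1 = first class, 2 = second class, 3 = third class)
# Descriptive statistics for the numeric columns
full.describe()
| | Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived |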
---|---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.881138 | 33.295479 | 0.385027 | 655.000000 | 2.294882 | 0.498854 | 0.383838 |
std | 14.413493 | 51.758668 | 0.865560 | 378.020061 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 21.000000 | 7.895800 | 0.000000 | 328.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 655.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 31.275000 | 0.000000 | 982.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 1309.000000 | 3.000000 | 8.000000 | 1.000000 |
# Data type and non-null count of each column
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
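The dataset has 1309 rows in total.
Some columns have missing values.
Numeric columns:
* Age: 1046 values present, 263 missing, a missing rate of 263/1309 ≈ 20.1%
* Fare: 1308 values present, 1 missing
String columns:
* Embarked: 1307 values present, 2 missing
* Cabin: only 295 values present, 1309-295=1014 missing, a missing rate of 1014/1309 ≈ 77.5%, which is severe
The 418 missing Survived values are simply the test-set rows, whose labels Kaggle withholds; they are not filled in.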
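The missing counts and rates listed above can also be computed rather than read off `full.info()` by hand. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative assumptions, not the Titanic data); the same two lines should apply to `full`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full`; the values are made up for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# Count missing values per column, then divide by the row count to get the rate.
missing_count = toy.isnull().sum()
missing_rate = missing_count / len(toy)

summary = pd.DataFrame({'missing': missing_count, 'rate': missing_rate})
print(summary.sort_values('rate', ascending=False))
```

Sorting by rate puts the most incomplete column first, which makes columns such as Cabin stand out immediately.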
'''
Start with the numeric columns Age and Fare.
The simplest way to handle missing values is to fill them with the column mean
'''
print('Before:')
full.info()
# Age
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
print('After:')
full.info()
Before:
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
After:
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
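Mean imputation is simple, but it has a side effect worth seeing once: the column mean stays the same while the spread shrinks, because every filled value sits exactly at the centre. A small sketch on a made-up series (the name `ages` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Series.mean() skips NaN by default, so this is the mean of 20, 30 and 40.
filled = ages.fillna(ages.mean())

print('mean before/after:', ages.mean(), filled.mean())
print('std  before/after:', ages.std(), filled.std())
```

With 20% of Age missing, this shrinkage is noticeable; the mean fill is kept here because it is the simplest option, not because it is the only one.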
# Check the data
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
'''
Handle the missing values in the string columns Embarked and Cabin
'''
# Embarked: count the values in this column
from collections import Counter
Counter(full['Embarked'])
Counter({'C': 270, 'Q': 123, 'S': 914, nan: 2})
'''
Only two values are missing, so we fill them with the most frequent value, S
'''
full['Embarked']=full['Embarked'].fillna('S')
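The most frequent value does not have to be hard-coded. As a hedged sketch on a made-up series (`ports` is an illustrative name, not the real column), `Series.mode()` returns the most frequent value or values and ignores NaN:

```python
import numpy as np
import pandas as pd

# Made-up embarkation ports with two missing entries.
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# mode() returns a Series because ties are possible; take the first entry.
most_common = ports.mode()[0]
ports_filled = ports.fillna(most_common)

print(most_common)
print(ports_filled.tolist())
```

This keeps the fill value in step with the data if the dataset changes.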
# Cabin: count the values in this column
Counter(full['Cabin'])
# Many values are missing and the cabin numbers are heterogeneous, so fill the missing values with U (unknown)
full['Cabin']=full['Cabin'].fillna('U')
# Check that the fill worked
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
# Check how the missing values were handled
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 1309 non-null object
Embarked 1309 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
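full.info() shows the data type of every column. Features are usually grouped into three kinds: numeric, time-series and categorical. String columns without obvious categories, such as Name and Cabin, are also treated as categorical here; later we consider whether features can be extracted from them.
1. Numeric:
PassengerId, Age, Fare, SibSp (relatives of the same generation), Parch (relatives of a different generation)
2. Time series: none
3. Categorical:
1) With explicit categories:
Sex, Embarked, Pclass
2) Other string columns:
Name, Cabin, Ticket
Sex
'''
Map the Sex values to numbers:
male -> 1, female -> 0
'''
sex_mapDict={'male':1,'female':0}
# Series.map looks up every element of the Series in the dict (a function also works)
full['Sex']=full['Sex'].map(sex_mapDict)
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |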
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 1 | 0 | 1 | 1.0 | PC 17599 |
2 | 26.0 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 0 | 1 | 1.0 | 113803 |
4 | 35.0 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | 1 | 0 | 0.0 | 373450 |
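One thing to watch with a dict-based `map`: any value that is not a key of the dict silently becomes NaN. The sketch below uses a made-up series containing an unexpected value (`'unknown'` is an assumption for illustration) to show how to spot that.

```python
import pandas as pd

# Made-up input with one value that the mapping does not cover.
sex = pd.Series(['male', 'female', 'female', 'unknown'])
sex_map = {'male': 1, 'female': 0}

encoded = sex.map(sex_map)

# Values that were not in the dict come back as NaN; list them for inspection.
unmapped = sex[encoded.isnull()].unique()

print(encoded.tolist())
print(unmapped)
```

Running this kind of check after each mapping step confirms that no category was dropped by accident.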
Port of embarkation (Embarked)
'''
Use get_dummies for one-hot encoding, which produces dummy variables; the column-name prefix is Embarked
'''
# Holds the extracted features
embarkedDf=pd.DataFrame()
embarkedDf=pd.get_dummies(full['Embarked'],prefix='Embarked')
embarkedDf.head()
| | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
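A note for readers on newer pandas: since pandas 2.0, `get_dummies` returns boolean columns (True/False) by default, so the output will not show 0/1 as above unless a dtype is requested. A minimal sketch on a made-up series (`ports` is an illustrative name):

```python
import pandas as pd

# Made-up port values; the Series name is only for readability.
ports = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

# dtype=int asks for 0/1 integer columns instead of booleans.
dummies = pd.get_dummies(ports, prefix='Embarked', dtype=int)

print(dummies)
```

Each row contains exactly one 1, which is what "one-hot" means.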
# Add the dummy variables to the Titanic dataset full
full=pd.concat([full,embarkedDf],axis=1)
# Drop the original Embarked column
full.drop('Embarked',axis=1,inplace=True)
full.head()
| | Age | Cabin | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 |
1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 1 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 |
2 | 26.0 | U | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 |
3 | 35.0 | C123 | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 |
4 | 35.0 | U | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 |
Passenger class (Pclass)
# One-hot encode the passenger class in the same way, with the prefix Pclass
pclassDf=pd.DataFrame()
pclassDf=pd.get_dummies(full['Pclass'],prefix='Pclass')
pclassDf.head()
| | Pclass_1 | Pclass_2 | Pclass_3 |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
# Add the passenger-class dummy variables to the original dataset
full=pd.concat([full,pclassDf],axis=1)
# Drop the original Pclass column
full.drop('Pclass',axis=1,inplace=True)
full.head()
| | Age | Cabin | Fare | Name | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 26.0 | U | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 35.0 | C123 | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | 1 | 0 | 0 |
4 | 35.0 | U | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | 0 | 0 | 1 |
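Features can also be extracted from the string columns, which we treat as categorical data. These columns are:
1. Name
2. Cabin
3. Ticket
'''
Looking at the names, every one of them contains a title or honorific, which can be extracted
'''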
full['Name'].head()
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
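'''
Each name has the form "Surname, Title. Given names",
so split can be used to cut the string and pull out the title
'''
def getTitle(name):
    str1=name.split(',')[1]   # text after the comma, e.g. ' Mr. Owen Harris'
    str2=str1.split('.')[0]   # text before the first period, e.g. ' Mr'
    str3=str2.strip()         # strip leading and trailing whitespace
    return str3
# Holds the extracted features
titleDf=pd.DataFrame()
titleDf['Title']=full['Name'].map(getTitle)
titleDf.head()
| | Title |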
---|---|
0 | Mr |
1 | Mrs |
2 | Miss |
3 | Mrs |
4 | Mr |
# Look at the extracted titles
titleDf['Title'].unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
'Jonkheer', 'Dona'], dtype=object)
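The same extraction can be written as one vectorized regular expression instead of a Python function. This is an alternative sketch, not the method used in this post; the sample names follow the dataset's "Surname, Title. Given names" format, and the pattern is an assumption that holds only while that format does.

```python
import pandas as pd

# Sample names in the dataset's format, including a multi-word title.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period that follows it.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(titles.tolist())
```

Because `[^.]+` allows spaces, multi-word titles such as "the Countess" are captured whole.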
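'''
Define the following title categories:
Officer: military officers, clergy and doctors (Rev, Dr, Major, Col, Capt)
Royalty: nobility and other honorifics (Don, Dona, Lady, Sir, the Countess, Jonkheer)
Mr: adult man
Mrs: married woman
Miss: unmarried woman or girl
Master: boy (at the time, the title for a young male)
and map every extracted title to one of them
'''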
title_mapDict={
'Mr': 'Mr',
'Mrs': 'Mrs',
'Miss': 'Miss',
'Master': 'Master',
'Don': 'Royalty',
'Rev': 'Officer',
'Dr': 'Officer',
'Mme': 'Mrs',
'Ms': 'Mrs',
'Major': 'Officer',
'Lady': 'Royalty',
'Sir': 'Royalty',
'Mlle': 'Miss',
'Col': 'Officer',
'Capt': 'Officer',
'the Countess':'Royalty',
'Jonkheer': 'Royalty',
'Dona': 'Royalty'
}
titleDf['Title']=titleDf['Title'].map(title_mapDict)
# One-hot encode
titleDf=pd.get_dummies(titleDf['Title'])
titleDf.head()
| | Master | Miss | Mr | Mrs | Officer | Royalty |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
# Likewise, add the dummy variables derived from the name to the original dataset
full=pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)
full.head()
| | Age | Cabin | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | … | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Master | Miss | Mr | Mrs | Officer | Royalty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | … | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 38.0 | C85 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 26.0 | U | 7.9250 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | … | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 35.0 | C123 | 53.1000 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | … | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 35.0 | U | 8.0500 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | … | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 21 columns
# The category of a cabin number is its first letter (the deck), so map every value to that letter
full['Cabin']=full['Cabin'].map(lambda c: c[0])
# One-hot encode with the prefix Cabin
cabinDf=pd.DataFrame()
cabinDf=pd.get_dummies(full['Cabin'],prefix='Cabin')
cabinDf.head()
| | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
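The first-letter step can also be done with the vectorized string accessor. A small sketch on made-up cabin values (`cabin` and `deck` are illustrative names), combining the 'U' fill and the letter extraction:

```python
import numpy as np
import pandas as pd

# Made-up cabin values, including a missing one and a multi-cabin entry.
cabin = pd.Series(['C85', np.nan, 'C123', 'B57 B59 B63 B66', 'E46'])

# Fill missing values with 'U' (unknown), then keep only the first character.
deck = cabin.fillna('U').str[0]

print(deck.tolist())
```

A passenger listed with several cabins is assigned the letter of the first one, which matches what the lambda above does.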
# Add to the original dataset
full=pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace=True)
full.head()
| | Age | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | … | Royalty | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 38.0 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 26.0 | 7.9250 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 35.0 | 53.1000 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 35.0 | 8.0500 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 29 columns
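familyDf=pd.DataFrame()
'''
Family size = relatives of a different generation (Parch) + relatives of the same generation (SibSp) + the passenger
'''
familyDf['FamilySize']=full['Parch']+full['SibSp']+1
'''
Family categories:
Travelling alone, Family_Single: family size = 1
Small family, Family_Small: 2 <= family size <= 4
Large family, Family_Large: family size >= 5
(dummy variables set by hand as needed)
'''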
familyDf['Family_Single']=familyDf['FamilySize'].map(lambda s: 1 if s==1 else 0)
familyDf['Family_Small']=familyDf['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
familyDf['Family_Large']=familyDf['FamilySize'].map(lambda s: 1 if 5<=s else 0)
familyDf.head()
| | FamilySize | Family_Single | Family_Small | Family_Large |
---|---|---|---|---|
0 | 2 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 |
3 | 2 | 0 | 1 | 0 |
4 | 1 | 1 | 0 | 0 |
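The three hand-written lambdas can be replaced by a single binning step. The sketch below is an alternative under the same thresholds (1, 2 to 4, 5 and above), using `pd.cut` on made-up family sizes; the label names are chosen so that the resulting columns match the ones above.

```python
import pandas as pd

# Made-up family sizes covering every category.
family_size = pd.Series([1, 2, 4, 5, 11])

# Bins are right-inclusive: (0, 1], (1, 4], (4, inf).
category = pd.cut(family_size,
                  bins=[0, 1, 4, float('inf')],
                  labels=['Single', 'Small', 'Large'])

# One-hot encode the category; dtype=int gives 0/1 columns.
family_dummies = pd.get_dummies(category, prefix='Family', dtype=int)

print(family_dummies)
```

Keeping the thresholds in one list makes them easier to change than three separate conditions.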
# Add the variables to the dataset
full=pd.concat([full,familyDf],axis=1)
full.head()
| | Age | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | … | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | FamilySize | Family_Single | Family_Small | Family_Large |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
1 | 38.0 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
2 | 26.0 | 7.9250 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
3 | 35.0 | 53.1000 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
4 | 35.0 | 8.0500 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
5 rows × 33 columns
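Correlation method: compute the correlation coefficient between every pair of features
# Correlation matrix. numeric_only=True (pandas >= 1.5) skips the string column Ticket,
# which would otherwise raise an error in pandas 2.0 and later
corrDf = full.corr(numeric_only=True)
corrDf
| | Age | Fare | Parch | PassengerId | Sex | SibSp | Survived | Embarked_C | Embarked_Q | Embarked_S | … | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | FamilySize | Family_Single | Family_Small | Family_Large |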
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 1.000000 | 0.171521 | -0.130872 | 0.025731 | 0.057397 | -0.190747 | -0.070323 | 0.076179 | -0.012718 | -0.059153 | … | 0.132886 | 0.106600 | -0.072644 | -0.085977 | 0.032461 | -0.271918 | -0.196996 | 0.116675 | -0.038189 | -0.161210 |
Fare | 0.171521 | 1.000000 | 0.221522 | 0.031416 | -0.185484 | 0.160224 | 0.257307 | 0.286241 | -0.130054 | -0.169894 | … | 0.072737 | 0.073949 | -0.037567 | -0.022857 | 0.001179 | -0.507197 | 0.226465 | -0.274826 | 0.197281 | 0.170853 |
Parch | -0.130872 | 0.221522 | 1.000000 | 0.008942 | -0.213125 | 0.373587 | 0.081629 | -0.008635 | -0.100943 | 0.071881 | … | -0.027385 | 0.001084 | 0.020481 | 0.058325 | -0.012304 | -0.036806 | 0.792296 | -0.549022 | 0.248532 | 0.624627 |
PassengerId | 0.025731 | 0.031416 | 0.008942 | 1.000000 | 0.013406 | -0.055224 | -0.005007 | 0.048101 | 0.011585 | -0.049836 | … | 0.000549 | -0.008136 | 0.000306 | -0.045949 | -0.023049 | 0.000208 | -0.031437 | 0.028546 | 0.002975 | -0.063415 |
Sex | 0.057397 | -0.185484 | -0.213125 | 0.013406 | 1.000000 | -0.109609 | -0.543351 | -0.066564 | -0.088651 | 0.115193 | … | -0.057396 | -0.040340 | -0.006655 | -0.083285 | 0.020558 | 0.137396 | -0.188583 | 0.284537 | -0.255196 | -0.077748 |
SibSp | -0.190747 | 0.160224 | 0.373587 | -0.055224 | -0.109609 | 1.000000 | -0.035322 | -0.048396 | -0.048678 | 0.073709 | … | -0.015727 | -0.027180 | -0.008619 | 0.006015 | -0.013247 | 0.009064 | 0.861952 | -0.591077 | 0.253590 | 0.699681 |
Survived | -0.070323 | 0.257307 | 0.081629 | -0.005007 | -0.543351 | -0.035322 | 1.000000 | 0.168240 | 0.003650 | -0.149683 | … | 0.150716 | 0.145321 | 0.057935 | 0.016040 | -0.026456 | -0.316912 | 0.016639 | -0.203367 | 0.279855 | -0.125147 |
Embarked_C | 0.076179 | 0.286241 | -0.008635 | 0.048101 | -0.066564 | -0.048396 | 0.168240 | 1.000000 | -0.164166 | -0.778262 | … | 0.107782 | 0.027566 | -0.020010 | -0.031566 | -0.014095 | -0.258257 | -0.036553 | -0.107874 | 0.159594 | -0.092825 |
Embarked_Q | -0.012718 | -0.130054 | -0.100943 | 0.011585 | -0.088651 | -0.048678 | 0.003650 | -0.164166 | 1.000000 | -0.491656 | … | -0.061459 | -0.042877 | -0.020282 | -0.019941 | -0.008904 | 0.142369 | -0.087190 | 0.127214 | -0.122491 | -0.018423 |
Embarked_S | -0.059153 | -0.169894 | 0.071881 | -0.049836 | 0.115193 | 0.073709 | -0.149683 | -0.778262 | -0.491656 | 1.000000 | … | -0.056023 | 0.002960 | 0.030575 | 0.040560 | 0.018111 | 0.137351 | 0.087771 | 0.014246 | -0.062909 | 0.093671 |
Pclass_1 | 0.362587 | 0.599956 | -0.013033 | 0.026495 | -0.107371 | -0.034256 | 0.285904 | 0.325722 | -0.166101 | -0.181800 | … | 0.275698 | 0.242963 | -0.073083 | -0.035441 | 0.048310 | -0.776987 | -0.029656 | -0.126551 | 0.165965 | -0.067523 |
Pclass_2 | -0.014193 | -0.121372 | -0.010057 | 0.022714 | -0.028862 | -0.052419 | 0.093349 | -0.134675 | -0.121973 | 0.196532 | … | -0.037929 | -0.050210 | 0.127371 | -0.032081 | -0.014325 | 0.176485 | -0.039976 | -0.035075 | 0.097270 | -0.118495 |
Pclass_3 | -0.302093 | -0.419616 | 0.019521 | -0.041544 | 0.116562 | 0.072610 | -0.322308 | -0.171430 | 0.243706 | -0.003805 | … | -0.207455 | -0.169063 | -0.041178 | 0.056964 | -0.030057 | 0.527614 | 0.058430 | 0.138250 | -0.223338 | 0.155560 |
Master | -0.363923 | 0.011596 | 0.253482 | 0.002254 | 0.164375 | 0.329171 | 0.085221 | -0.014172 | -0.009091 | 0.018297 | … | -0.042192 | 0.001860 | 0.058311 | -0.013690 | -0.006113 | 0.041178 | 0.355061 | -0.265355 | 0.120166 | 0.301809 |
Miss | -0.254146 | 0.092051 | 0.066473 | -0.050027 | -0.672819 | 0.077564 | 0.332795 | -0.014351 | 0.198804 | -0.113886 | … | -0.012516 | 0.008700 | -0.003088 | 0.061881 | -0.013832 | -0.004364 | 0.087350 | -0.023890 | -0.018085 | 0.083422 |
Mr | 0.165476 | -0.192192 | -0.304780 | 0.014116 | 0.870678 | -0.243104 | -0.549199 | -0.065538 | -0.080224 | 0.108924 | … | -0.030261 | -0.032953 | -0.026403 | -0.072514 | 0.023611 | 0.131807 | -0.326487 | 0.386262 | -0.300872 | -0.194207 |
Mrs | 0.198091 | 0.139235 | 0.213491 | 0.033299 | -0.571176 | 0.061643 | 0.344935 | 0.098379 | -0.100374 | -0.022950 | … | 0.080393 | 0.045538 | 0.013376 | 0.042547 | -0.011742 | -0.162253 | 0.157233 | -0.354649 | 0.361247 | 0.012893 |
Officer | 0.162818 | 0.028696 | -0.032631 | 0.002231 | 0.087288 | -0.013813 | -0.031316 | 0.003678 | -0.003212 | -0.001202 | … | 0.006055 | -0.024048 | -0.017076 | -0.008281 | -0.003698 | -0.067030 | -0.026921 | 0.013303 | 0.003966 | -0.034572 |
Royalty | 0.059466 | 0.026214 | -0.030197 | 0.004400 | -0.020408 | -0.010787 | 0.033391 | 0.077213 | -0.021853 | -0.054250 | … | -0.012950 | -0.012202 | -0.008665 | -0.004202 | -0.001876 | -0.071672 | -0.023600 | 0.008761 | -0.000073 | -0.017542 |
Cabin_A | 0.125177 | 0.020094 | -0.030707 | -0.002831 | 0.047561 | -0.039808 | 0.022287 | 0.094914 | -0.042105 | -0.056984 | … | -0.024952 | -0.023510 | -0.016695 | -0.008096 | -0.003615 | -0.242399 | -0.042967 | 0.045227 | -0.029546 | -0.033799 |
Cabin_B | 0.113458 | 0.393743 | 0.073051 | 0.015895 | -0.094453 | -0.011569 | 0.175095 | 0.161595 | -0.073613 | -0.095790 | … | -0.043624 | -0.041103 | -0.029188 | -0.014154 | -0.006320 | -0.423794 | 0.032318 | -0.087912 | 0.084268 | 0.013470 |
Cabin_C | 0.167993 | 0.401370 | 0.009601 | 0.006092 | -0.077473 | 0.048616 | 0.114652 | 0.158043 | -0.059151 | -0.101861 | … | -0.053083 | -0.050016 | -0.035516 | -0.017224 | -0.007691 | -0.515684 | 0.037226 | -0.137498 | 0.141925 | 0.001362 |
Cabin_D | 0.132886 | 0.072737 | -0.027385 | 0.000549 | -0.057396 | -0.015727 | 0.150716 | 0.107782 | -0.061459 | -0.056023 | … | 1.000000 | -0.034317 | -0.024369 | -0.011817 | -0.005277 | -0.353822 | -0.025313 | -0.074310 | 0.102432 | -0.049336 |
Cabin_E | 0.106600 | 0.073949 | 0.001084 | -0.008136 | -0.040340 | -0.027180 | 0.145321 | 0.027566 | -0.042877 | 0.002960 | … | -0.034317 | 1.000000 | -0.022961 | -0.011135 | -0.004972 | -0.333381 | -0.017285 | -0.042535 | 0.068007 | -0.046485 |
Cabin_F | -0.072644 | -0.037567 | 0.020481 | 0.000306 | -0.006655 | -0.008619 | 0.057935 | -0.020010 | -0.020282 | 0.030575 | … | -0.024369 | -0.022961 | 1.000000 | -0.007907 | -0.003531 | -0.236733 | 0.005525 | 0.004055 | 0.012756 | -0.033009 |
Cabin_G | -0.085977 | -0.022857 | 0.058325 | -0.045949 | -0.083285 | 0.006015 | 0.016040 | -0.031566 | -0.019941 | 0.040560 | … | -0.011817 | -0.011135 | -0.007907 | 1.000000 | -0.001712 | -0.114803 | 0.035835 | -0.076397 | 0.087471 | -0.016008 |
Cabin_T | 0.032461 | 0.001179 | -0.012304 | -0.023049 | 0.020558 | -0.013247 | -0.026456 | -0.014095 | -0.008904 | 0.018111 | … | -0.005277 | -0.004972 | -0.003531 | -0.001712 | 1.000000 | -0.051263 | -0.015438 | 0.022411 | -0.019574 | -0.007148 |
Cabin_U | -0.271918 | -0.507197 | -0.036806 | 0.000208 | 0.137396 | 0.009064 | -0.316912 | -0.258257 | 0.142369 | 0.137351 | … | -0.353822 | -0.333381 | -0.236733 | -0.114803 | -0.051263 | 1.000000 | -0.014155 | 0.175812 | -0.211367 | 0.056438 |
FamilySize | -0.196996 | 0.226465 | 0.792296 | -0.031437 | -0.188583 | 0.861952 | 0.016639 | -0.036553 | -0.087190 | 0.087771 | … | -0.025313 | -0.017285 | 0.005525 | 0.035835 | -0.015438 | -0.014155 | 1.000000 | -0.688864 | 0.302640 | 0.801623 |
Family_Single | 0.116675 | -0.274826 | -0.549022 | 0.028546 | 0.284537 | -0.591077 | -0.203367 | -0.107874 | 0.127214 | 0.014246 | … | -0.074310 | -0.042535 | 0.004055 | -0.076397 | 0.022411 | 0.175812 | -0.688864 | 1.000000 | -0.873398 | -0.318944 |
Family_Small | -0.038189 | 0.197281 | 0.248532 | 0.002975 | -0.255196 | 0.253590 | 0.279855 | 0.159594 | -0.122491 | -0.062909 | … | 0.102432 | 0.068007 | 0.012756 | 0.087471 | -0.019574 | -0.211367 | 0.302640 | -0.873398 | 1.000000 | -0.183007 |
Family_Large | -0.161210 | 0.170853 | 0.624627 | -0.063415 | -0.077748 | 0.699681 | -0.125147 | -0.092825 | -0.018423 | 0.093671 | … | -0.049336 | -0.046485 | -0.033009 | -0.016008 | -0.007148 | 0.056438 | 0.801623 | -0.318944 | -0.183007 | 1.000000 |
32 rows × 32 columns
# Focus on the correlation with Survived; ascending=False sorts in descending order
corrDf['Survived'].sort_values(ascending=False)
Survived 1.000000
Mrs 0.344935
Miss 0.332795
Pclass_1 0.285904
Family_Small 0.279855
Fare 0.257307
Cabin_B 0.175095
Embarked_C 0.168240
Cabin_D 0.150716
Cabin_E 0.145321
Cabin_C 0.114652
Pclass_2 0.093349
Master 0.085221
Parch 0.081629
Cabin_F 0.057935
Royalty 0.033391
Cabin_A 0.022287
FamilySize 0.016639
Cabin_G 0.016040
Embarked_Q 0.003650
PassengerId -0.005007
Cabin_T -0.026456
Officer -0.031316
SibSp -0.035322
Age -0.070323
Family_Large -0.125147
Embarked_S -0.149683
Family_Single -0.203367
Cabin_U -0.316912
Pclass_3 -0.322308
Sex -0.543351
Mr -0.549199
Name: Survived, dtype: float64
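The title Mrs has the largest positive correlation with survival (0.34), closely followed by Miss (0.33). The strongest correlations overall are negative: Mr (-0.55) and Sex (-0.54, where male=1).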
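Because a strong negative correlation is as informative as a strong positive one, it can help to rank features by the absolute value of the coefficient. A short sketch using a few coefficients copied from the ranking above (the variable names are illustrative):

```python
import pandas as pd

# A handful of coefficients copied from the Survived ranking above.
corr_with_survived = pd.Series({
    'Mrs': 0.344935,
    'Fare': 0.257307,
    'PassengerId': -0.005007,
    'Sex': -0.543351,
    'Mr': -0.549199,
})

# Rank by magnitude so strongly negative features are not pushed to the bottom.
ranked = corr_with_survived.abs().sort_values(ascending=False)

print(ranked)
```

In this ordering Mr and Sex come first and PassengerId last, which matches the intuition that an ID number carries no signal.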
We select title (titleDf), passenger class (pclassDf), family size (familyDf), fare (Fare), cabin (cabinDf), port of embarkation (embarkedDf) and sex (Sex) as the model inputs
# Feature selection
full_X=pd.concat([titleDf,
                  pclassDf,
                  familyDf,
                  full['Fare'],
                  cabinDf,
                  embarkedDf,
                  full['Sex']
                 ],axis=1)
full_X.head()
| | Master | Miss | Mr | Mrs | Officer | Royalty | Pclass_1 | Pclass_2 | Pclass_3 | FamilySize | … | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Embarked_C | Embarked_Q | Embarked_S | Sex |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
5 rows × 27 columns
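# The original (labelled) dataset has 891 rows
sourceRow=891
# Original dataset: features
source_X=full_X.loc[0:sourceRow-1,:]
# Original dataset: labels
source_Y=full.loc[0:sourceRow-1,'Survived']
# Prediction dataset: features
pred_X=full_X.loc[sourceRow:,:]
# Confirm the selected datasets
print('Rows in the original dataset:',source_X.shape[0])
print('Rows in the prediction dataset:',pred_X.shape[0])
Rows in the original dataset: 891
Rows in the prediction dataset: 418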
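The `sourceRow-1` in the slices above is needed because `.loc` slices by label and includes the end label, unlike ordinary Python slicing. A minimal sketch on a made-up six-row frame (`df` and `n_train` are illustrative names):

```python
import pandas as pd

# Made-up frame with the default integer index 0..5.
df = pd.DataFrame({'x': range(6)})
n_train = 4

by_label = df.loc[0:n_train - 1]   # label slice: the end label is included
by_position = df.iloc[:n_train]    # positional slice: the end is excluded
rest = df.iloc[n_train:]           # everything after the first n_train rows

print(len(by_label), len(by_position), len(rest))
```

Both forms select the same rows here because the index is the default 0-based range produced by `ignore_index=True`.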
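# Hold-out split: reserve part of the labelled data for testing
# (sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
# Training and test sets used to build the model
train_X,test_X,train_Y,test_Y=train_test_split(source_X,source_Y,train_size=.8)
# Print the dataset sizes
print('Original features:',source_X.shape,
      'Training features:',train_X.shape,
      'Test features:',test_X.shape)
print('Original labels:',source_Y.shape,
      'Training labels:',train_Y.shape,
      'Test labels:',test_Y.shape)
Original features: (891, 27) Training features: (712, 27) Test features: (179, 27)
Original labels: (891,) Training labels: (712,) Test labels: (179,)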
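The split above is random, so the accuracy reported later will vary from run to run. As a hedged sketch on made-up arrays (`X` and `y` are assumptions, not the Titanic features), `random_state` makes the split repeatable and `stratify` keeps the class proportions equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify=y preserves the 75/25 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)
print('positives in train/test:', y_tr.sum(), y_te.sum())
```

Whether to fix the seed is a judgment call; it makes a write-up reproducible but can hide how much the score depends on the particular split.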
# Use logistic regression
# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the model (logistic regression)
model=LogisticRegression()
# Step 3: train the model
model.fit(train_X,train_Y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# score returns the model's accuracy on the test set
model.score(test_X,test_Y)
0.8268156424581006
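For a classifier, `score` reports mean accuracy: the fraction of test rows whose predicted label equals the true label. The sketch below computes that by hand on made-up labels (`y_true` and `y_pred` are illustrative), and then checks the arithmetic behind the figure above, which corresponds to 148 correct predictions out of the 179 test rows.

```python
import numpy as np

# Made-up true and predicted labels for eight passengers.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Accuracy is the share of positions where prediction and truth agree.
accuracy = (y_true == y_pred).mean()
print(accuracy)

# The score reported above is 148 correct out of 179 test rows.
print(148 / 179)
```

With only 179 test rows, one prediction more or less moves the score by about 0.56 percentage points, so small differences between runs should not be over-interpreted.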
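Use the prediction dataset to get the predictions and save them to a CSV file
# Use the trained model to predict survival for the prediction dataset
pred_Y=model.predict(pred_X)
'''
The predictions are floats (the Survived column became float64 when the unlabelled test rows were merged in),
but Kaggle expects integers,
so convert the data type
'''
pred_Y=pred_Y.astype(int)
# Passenger IDs
passenger_id=full.loc[sourceRow:,'PassengerId']
# DataFrame of passenger ID and predicted survival
predDf=pd.DataFrame(
    {'PassengerId':passenger_id,
     'Survived':pred_Y})
predDf.shape
predDf.head()
# Save the result
predDf.to_csv(path+'/titanic_pred.csv',index=False)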
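Before uploading, it can be worth reading the file back and checking its shape. The following sketch writes a made-up three-row submission to a temporary directory (the IDs and predictions are illustrative; the real file should have 418 rows) and verifies the two expected columns and the 0/1 values.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for predDf; the real one has one row per test passenger.
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'titanic_pred.csv')
    submission.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column
    check = pd.read_csv(out_path)

print(check.shape)
print(check.dtypes)
```

If `index=False` were omitted, the file would gain a third, unnamed column, which is an easy mistake to catch with this kind of round-trip check.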