What kinds of passengers were more likely to survive the Titanic?
Download the data from the Kaggle Titanic project page: https://www.kaggle.com/c/titanic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the training set
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")
# Load the test set
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")
print('Training set shape:',train.shape)
print('Test set shape:',test.shape)
Training set shape: (891, 12)
Test set shape: (418, 11)
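# Combine the training and test sets so both are preprocessed together
# (DataFrame.append was removed in pandas 2.0; pd.concat with sort=True reproduces its alphabetical column order)
full = pd.concat([train, test], ignore_index=True, sort=True)
print('Combined data set shape:',full.shape)
Combined data set shape: (1309, 12)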
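# Inspect the data and what each feature means:
'''
Age: age
Cabin: cabin number
Embarked: port of embarkation
Fare: ticket fare
Name: passenger name
Parch: number of parents and children aboard (relatives of a different generation)
PassengerId: passenger ID
Pclass: passenger class
Sex: sex
SibSp: number of siblings and spouses aboard (relatives of the same generation)
Survived: whether the passenger survived
Ticket: ticket number
'''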
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
# View summary statistics
full.describe()
| | Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived |
---|---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.881138 | 33.295479 | 0.385027 | 655.000000 | 2.294882 | 0.498854 | 0.383838 |
std | 14.413493 | 51.758668 | 0.865560 | 378.020061 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 21.000000 | 7.895800 | 0.000000 | 328.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 655.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 31.275000 | 0.000000 | 982.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 1309.000000 | 3.000000 | 8.000000 | 1.000000 |
# View each column's data type and non-null count
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
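The combined data set has 1,309 rows.
The columns with missing values are Age (263 missing), Fare (1), Embarked (2), and Cabin (1,014). Survived is missing only for the 418 test-set rows, which have no labels.
# Age: fill with the mean
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare: fill with the mean
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
# Embarked: fill with the most frequent value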
full['Embarked'].describe()
count 1307
unique 3
top S
freq 914
Name: Embarked, dtype: object
full['Embarked']=full['Embarked'].fillna('S')
# Cabin: too many values are missing, so fill with 'U' (unknown)
full['Cabin']=full['Cabin'].fillna('U')
# Inspect the data after filling in the missing values
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 1309 non-null object
Embarked 1309 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
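The fill rules used above (mean for numeric columns, most frequent value for categorical ones) can be tried on a small frame first. The following is a minimal sketch on a made-up `toy` DataFrame; its values are invented for illustration and are not taken from the Titanic files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the Titanic files
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: replace missing values with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: replace missing values with the most frequent value
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(toy)
print("missing values left:", int(toy.isnull().sum().sum()))
```

Taking the mode from the data, rather than hard-coding 'S', keeps the rule correct if the data changes.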
By data type, the 12 columns fall into three groups:
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 1309 non-null object
Embarked 1309 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
For passenger sex (Sex), port of embarkation (Embarked), and passenger class (Pclass), split each column into its category labels and encode them as 0 and 1.
sex_mapDict = {
'male':1,'female':0}
# map applies the given mapping (or function) to every element of the Series
full['Sex'] = full['Sex'].map(sex_mapDict)
embarkedDf = pd.DataFrame()
# get_dummies performs one-hot encoding
embarkedDf = pd.get_dummies(full['Embarked'],prefix='Embarked')
embarkedDf.head()
| | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
full = pd.concat([full,embarkedDf],axis=1)
full.drop('Embarked',axis=1,inplace=True)
pcalssDf = pd.DataFrame()
pcalssDf = pd.get_dummies(full['Pclass'],prefix='Pclass')
pcalssDf.head()
| | Pclass_1 | Pclass_2 | Pclass_3 |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
full = pd.concat([full,pcalssDf],axis=1)
full.drop('Pclass',axis=1,inplace=True)
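The one-hot encoding used for Embarked and Pclass can be checked on a tiny example. This is a sketch on a made-up `ports` Series (hypothetical values). It passes `dtype=int` explicitly because pandas 2.x returns boolean indicator columns by default, whereas the outputs above show 0/1 integers.

```python
import pandas as pd

# Hypothetical toy column with three ports
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# dtype=int yields 0/1 integer indicators regardless of the pandas default
dummies = pd.get_dummies(ports, prefix="Embarked", dtype=int)
print(dummies)

# Every row has exactly one indicator set to 1
print((dummies.sum(axis=1) == 1).all())
```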
full['Name'].head()
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
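# Extract the title from a name
def get_title(name):
    # Names look like "Surname, Title. Given names"
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    # strip() removes leading and trailing characters, here whitespace
    str3 = str2.strip()
    return str3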
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title)
titleDf.groupby('Title').count()
Title |
---|
Capt |
Col |
Don |
Dona |
Dr |
Jonkheer |
Lady |
Major |
Master |
Miss |
Mlle |
Mme |
Mr |
Mrs |
Ms |
Rev |
Sir |
the Countess |
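As an alternative to the split-based helper, the title can be pulled out with a vectorized regular expression. This is only a sketch: the pattern is an assumption that every name follows the "Surname, Title. Given names" layout, and the three sample names are the complete ones shown in the output above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Because `[^.]+` matches any run of characters up to the period, multi-word titles such as "the Countess" are captured as a whole.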
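'''
Define the following title categories:
Officer: military officers and professional titles (e.g. Dr, Rev)
Royalty: royalty and nobility
Mr: adult man
Mrs: married woman
Miss: unmarried woman
Master: young boy
'''
# Map each title string found in the names to one of the categories above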
title_mapDict = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
| | Master | Miss | Mr | Mrs | Officer | Royalty |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)
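One caveat with dictionary-based `map`: any title missing from the dictionary becomes NaN and then silently gets an all-zero row from `get_dummies`. The sketch below, on made-up data, shows one way to detect such titles before encoding. The "Sgt" title and the "Other" fallback category are hypothetical and are not part of the mapping above.

```python
import pandas as pd

# Hypothetical subset of a title mapping
title_map = {"Mr": "Mr", "Capt": "Officer"}
# "Sgt" is deliberately absent from the mapping
titles = pd.Series(["Mr", "Capt", "Sgt"])

mapped = titles.map(title_map)

# Titles with no dictionary entry come back as NaN
unmapped = titles[mapped.isnull()].unique().tolist()
print("unmapped titles:", unmapped)

# One option: send them to an explicit fallback category
mapped = mapped.fillna("Other")
print(mapped.tolist())
```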
full['Cabin'].head()
0 U
1 C85
2 U
3 C123
4 U
Name: Cabin, dtype: object
# The first letter of the cabin number identifies the cabin category
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c : c[0])
full['Cabin'].head()
0 U
1 C
2 U
3 C
4 U
Name: Cabin, dtype: object
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )
cabinDf.head()
| | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
full = pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace= True)
full.head()
| | Age | Fare | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Embarked_C | ... | Royalty | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 3 | 1 | 1 | 0.0 | A/5 21171 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 38.0 | 71.2833 | 0 | 2 | 1 | 0 | 1 | 1.0 | PC 17599 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 26.0 | 7.9250 | 0 | 3 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 35.0 | 53.1000 | 0 | 4 | 1 | 0 | 1 | 1.0 | 113803 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 35.0 | 8.0500 | 0 | 5 | 3 | 1 | 0 | 0.0 | 373450 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
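# Holds the family information
familyDf = pd.DataFrame()
'''
Family size = parents and children (Parch) + siblings and spouses (SibSp) + the passenger
(the passenger is also a family member, hence the +1)
'''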
familyDf[ 'family_size' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1
familyDf['family_size'].describe()
count 1309.000000
mean 1.883881
std 1.583639
min 1.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 11.000000
Name: family_size, dtype: float64
%matplotlib notebook
familyDf['family_size'].plot()
'''
Family categories:
Single (family_single): family size = 1
Small (family_small): 2 <= family size <= 4
Large (family_large): family size >= 5
'''
familyDf['family_single'] = familyDf['family_size'].map(lambda s : 1 if s==1 else 0)
familyDf['family_small'] = familyDf['family_size'].map(lambda s : 1 if 2<=s<=4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s : 1 if s>4 else 0)
familyDf.head()
| | family_size | family_single | family_small | family_large |
---|---|---|---|---|
0 | 2 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 |
3 | 2 | 0 | 1 | 0 |
4 | 1 | 1 | 0 | 0 |
full = pd.concat([full,familyDf],axis=1)
full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)
full.head()
| | Age | Fare | PassengerId | Pclass | Sex | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | ... | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 1 | 3 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 38.0 | 71.2833 | 2 | 1 | 0 | 1.0 | PC 17599 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 26.0 | 7.9250 | 3 | 3 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
3 | 35.0 | 53.1000 | 4 | 1 | 0 | 1.0 | 113803 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 35.0 | 8.0500 | 5 | 3 | 1 | 0.0 | 373450 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5 rows × 28 columns
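The three family-size buckets can also be produced in one step with `pd.cut`. This is a sketch of an alternative to the three lambdas above, run on made-up sizes chosen to hit each bucket and both boundaries; the bin edges are an assumption that encodes the same rule (1, 2 to 4, 5 and above).

```python
import pandas as pd

# Hypothetical family sizes covering each bucket and both boundaries
sizes = pd.Series([1, 2, 4, 5, 11], name="family_size")

# Right-inclusive bins: (0, 1], (1, 4], (4, inf]
bucket = pd.cut(
    sizes,
    bins=[0, 1, 4, float("inf")],
    labels=["single", "small", "large"],
)

# One indicator column per bucket, as 0/1 integers
family = pd.get_dummies(bucket, prefix="family", dtype=int)
print(family)
```

Keeping the edges in one list makes it harder for the three conditions to drift out of sync.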
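Age and Fare span much wider ranges than the other features, which are 0/1 indicators, so both are standardized. StandardScaler rescales each column to zero mean and unit variance; the results are not confined to [-1, 1], but they land on a scale comparable to the other features.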
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# StandardScaler expects a 2-D array, so reshape the column to (n_samples, 1)
len(full['Age'].values.reshape(-1,1))
1309
# fit_transform learns the mean and standard deviation of Age and applies them in one call
full['Age_scaled'] = scaler.fit_transform(full['Age'].values.reshape(-1,1)).ravel()
full.head()
| | Age | Fare | PassengerId | Pclass | Sex | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large | Age_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 1 | 3 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | -0.611972 |
1 | 38.0 | 71.2833 | 2 | 1 | 0 | 1.0 | PC 17599 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.630431 |
2 | 26.0 | 7.9250 | 3 | 3 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -0.301371 |
3 | 35.0 | 53.1000 | 4 | 1 | 0 | 1.0 | 113803 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.397481 |
4 | 35.0 | 8.0500 | 5 | 3 | 1 | 0.0 | 373450 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0.397481 |
5 rows × 29 columns
full.drop([ 'Age'],axis=1, inplace=True)
full.head()
| | Fare | PassengerId | Pclass | Sex | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Master | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large | Age_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.2500 | 1 | 3 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | -0.611972 |
1 | 71.2833 | 2 | 1 | 0 | 1.0 | PC 17599 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.630431 |
2 | 7.9250 | 3 | 3 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -0.301371 |
3 | 53.1000 | 4 | 1 | 0 | 1.0 | 113803 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.397481 |
4 | 8.0500 | 5 | 3 | 1 | 0.0 | 373450 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0.397481 |
5 rows × 28 columns
# Standardize Fare the same way
full['Fare_scaled'] = scaler.fit_transform(full['Fare'].values.reshape(-1,1)).ravel()
full.head()
| | Fare | PassengerId | Pclass | Sex | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Master | ... | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large | Age_scaled | Fare_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.2500 | 1 | 3 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | -0.611972 | -0.503595 |
1 | 71.2833 | 2 | 1 | 0 | 1.0 | PC 17599 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.630431 | 0.734503 |
2 | 7.9250 | 3 | 3 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -0.301371 | -0.490544 |
3 | 53.1000 | 4 | 1 | 0 | 1.0 | 113803 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.397481 | 0.382925 |
4 | 8.0500 | 5 | 3 | 1 | 0.0 | 373450 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0.397481 | -0.488127 |
5 rows × 29 columns
full.drop([ 'Fare'],axis=1, inplace=True)
full.head()
| | PassengerId | Pclass | Sex | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Master | Miss | ... | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large | Age_scaled | Fare_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | -0.611972 | -0.503595 |
1 | 2 | 1 | 0 | 1.0 | PC 17599 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.630431 | 0.734503 |
2 | 3 | 3 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -0.301371 | -0.490544 |
3 | 4 | 1 | 0 | 1.0 | 113803 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.397481 | 0.382925 |
4 | 5 | 3 | 1 | 0.0 | 373450 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0.397481 | -0.488127 |
5 rows × 28 columns
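To see what the scaling step computes, here is a sketch in plain NumPy on a handful of made-up ages (hypothetical values). StandardScaler applies the same z-score formula, using the population standard deviation (ddof=0).

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

# z-score: subtract the mean, divide by the population standard deviation
z = (ages - ages.mean()) / ages.std()

print(z.round(3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
print("min:", round(float(z.min()), 3), "max:", round(float(z.max()), 3))
```

The minimum and maximum fall outside [-1, 1]: standardization fixes the mean and spread, not the range.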
# Feature information after preprocessing
full.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
Embarked_C 1309 non-null uint8
Embarked_Q 1309 non-null uint8
Embarked_S 1309 non-null uint8
Master 1309 non-null uint8
Miss 1309 non-null uint8
Mr 1309 non-null uint8
Mrs 1309 non-null uint8
Officer 1309 non-null uint8
Royalty 1309 non-null uint8
Cabin_A 1309 non-null uint8
Cabin_B 1309 non-null uint8
Cabin_C 1309 non-null uint8
Cabin_D 1309 non-null uint8
Cabin_E 1309 non-null uint8
Cabin_F 1309 non-null uint8
Cabin_G 1309 non-null uint8
Cabin_T 1309 non-null uint8
Cabin_U 1309 non-null uint8
family_single 1309 non-null int64
family_small 1309 non-null int64
family_large 1309 non-null int64
Age_scaled 1309 non-null float64
Fare_scaled 1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB
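Compute the correlation coefficient between each feature and Survived, then select the features that are related to survival.
# Feature selection
# numeric_only=True (pandas 1.5+) skips the text column Ticket; pandas 2.x no longer drops it silently
corrDf = full.corr(numeric_only=True)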
corrDf
| | PassengerId | Pclass | Sex | Survived | Embarked_C | Embarked_Q | Embarked_S | Master | Miss | Mr | ... | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | family_single | family_small | family_large | Age_scaled | Fare_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 1.000000 | -0.038354 | 0.013406 | -0.005007 | 0.048101 | 0.011585 | -0.049836 | 0.002254 | -0.050027 | 0.014116 | ... | -0.008136 | 0.000306 | -0.045949 | -0.023049 | 0.000208 | 0.028546 | 0.002975 | -0.063415 | 0.025731 | 0.031416 |
Pclass | -0.038354 | 1.000000 | 0.124617 | -0.338481 | -0.269658 | 0.230491 | 0.091320 | 0.095257 | 0.024487 | 0.121492 | ... | -0.225649 | 0.013122 | 0.052133 | -0.042750 | 0.713857 | 0.147393 | -0.218303 | 0.127306 | -0.366371 | -0.558477 |
Sex | 0.013406 | 0.124617 | 1.000000 | -0.543351 | -0.066564 | -0.088651 | 0.115193 | 0.164375 | -0.672819 | 0.870678 | ... | -0.040340 | -0.006655 | -0.083285 | 0.020558 | 0.137396 | 0.284537 | -0.255196 | -0.077748 | 0.057397 | -0.185484 |
Survived | -0.005007 | -0.338481 | -0.543351 | 1.000000 | 0.168240 | 0.003650 | -0.149683 | 0.085221 | 0.332795 | -0.549199 | ... | 0.145321 | 0.057935 | 0.016040 | -0.026456 | -0.316912 | -0.203367 | 0.279855 | -0.125147 | -0.070323 | 0.257307 |
Embarked_C | 0.048101 | -0.269658 | -0.066564 | 0.168240 | 1.000000 | -0.164166 | -0.778262 | -0.014172 | -0.014351 | -0.065538 | ... | 0.027566 | -0.020010 | -0.031566 | -0.014095 | -0.258257 | -0.107874 | 0.159594 | -0.092825 | 0.076179 | 0.286241 |
Embarked_Q | 0.011585 | 0.230491 | -0.088651 | 0.003650 | -0.164166 | 1.000000 | -0.491656 | -0.009091 | 0.198804 | -0.080224 | ... | -0.042877 | -0.020282 | -0.019941 | -0.008904 | 0.142369 | 0.127214 | -0.122491 | -0.018423 | -0.012718 | -0.130054 |
Embarked_S | -0.049836 | 0.091320 | 0.115193 | -0.149683 | -0.778262 | -0.491656 | 1.000000 | 0.018297 | -0.113886 | 0.108924 | ... | 0.002960 | 0.030575 | 0.040560 | 0.018111 | 0.137351 | 0.014246 | -0.062909 | 0.093671 | -0.059153 | -0.169894 |
Master | 0.002254 | 0.095257 | 0.164375 | 0.085221 | -0.014172 | -0.009091 | 0.018297 | 1.000000 | -0.110595 | -0.258902 | ... | 0.001860 | 0.058311 | -0.013690 | -0.006113 | 0.041178 | -0.265355 | 0.120166 | 0.301809 | -0.363923 | 0.011596 |
Miss | -0.050027 | 0.024487 | -0.672819 | 0.332795 | -0.014351 | 0.198804 | -0.113886 | -0.110595 | 1.000000 | -0.585809 | ... | 0.008700 | -0.003088 | 0.061881 | -0.013832 | -0.004364 | -0.023890 | -0.018085 | 0.083422 | -0.254146 | 0.092051 |
Mr | 0.014116 | 0.121492 | 0.870678 | -0.549199 | -0.065538 | -0.080224 | 0.108924 | -0.258902 | -0.585809 | 1.000000 | ... | -0.032953 | -0.026403 | -0.072514 | 0.023611 | 0.131807 | 0.386262 | -0.300872 | -0.194207 | 0.165476 | -0.192192 |
Mrs | 0.033299 | -0.179945 | -0.571176 | 0.344935 | 0.098379 | -0.100374 | -0.022950 | -0.093887 | -0.212435 | -0.497310 | ... | 0.045538 | 0.013376 | 0.042547 | -0.011742 | -0.162253 | -0.354649 | 0.361247 | 0.012893 | 0.198091 | 0.139235 |
Officer | 0.002231 | -0.137341 | 0.087288 | -0.031316 | 0.003678 | -0.003212 | -0.001202 | -0.029567 | -0.066899 | -0.156611 | ... | -0.024048 | -0.017076 | -0.008281 | -0.003698 | -0.067030 | 0.013303 | 0.003966 | -0.034572 | 0.162818 | 0.028696 |
Royalty | 0.004400 | -0.104916 | -0.020408 | 0.033391 | 0.077213 | -0.021853 | -0.054250 | -0.015002 | -0.033945 | -0.079466 | ... | -0.012202 | -0.008665 | -0.004202 | -0.001876 | -0.071672 | 0.008761 | -0.000073 | -0.017542 | 0.059466 | 0.026214 |
Cabin_A | -0.002831 | -0.202143 | 0.047561 | 0.022287 | 0.094914 | -0.042105 | -0.056984 | -0.000711 | -0.035697 | 0.015372 | ... | -0.023510 | -0.016695 | -0.008096 | -0.003615 | -0.242399 | 0.045227 | -0.029546 | -0.033799 | 0.125177 | 0.020094 |
Cabin_B | 0.015895 | -0.353414 | -0.094453 | 0.175095 | 0.161595 | -0.073613 | -0.095790 | -0.017168 | 0.035069 | -0.096776 | ... | -0.041103 | -0.029188 | -0.014154 | -0.006320 | -0.423794 | -0.087912 | 0.084268 | 0.013470 | 0.113458 | 0.393743 |
Cabin_C | 0.006092 | -0.430044 | -0.077473 | 0.114652 | 0.158043 | -0.059151 | -0.101861 | -0.047456 | -0.013418 | -0.068072 | ... | -0.050016 | -0.035516 | -0.017224 | -0.007691 | -0.515684 | -0.137498 | 0.141925 | 0.001362 | 0.167993 | 0.401370 |
Cabin_D | 0.000549 | -0.265341 | -0.057396 | 0.150716 | 0.107782 | -0.061459 | -0.056023 | -0.042192 | -0.012516 | -0.030261 | ... | -0.034317 | -0.024369 | -0.011817 | -0.005277 | -0.353822 | -0.074310 | 0.102432 | -0.049336 | 0.132886 | 0.072737 |
Cabin_E | -0.008136 | -0.225649 | -0.040340 | 0.145321 | 0.027566 | -0.042877 | 0.002960 | 0.001860 | 0.008700 | -0.032953 | ... | 1.000000 | -0.022961 | -0.011135 | -0.004972 | -0.333381 | -0.042535 | 0.068007 | -0.046485 | 0.106600 | 0.073949 |
Cabin_F | 0.000306 | 0.013122 | -0.006655 | 0.057935 | -0.020010 | -0.020282 | 0.030575 | 0.058311 | -0.003088 | -0.026403 | ... | -0.022961 | 1.000000 | -0.007907 | -0.003531 | -0.236733 | 0.004055 | 0.012756 | -0.033009 | -0.072644 | -0.037567 |
Cabin_G | -0.045949 | 0.052133 | -0.083285 | 0.016040 | -0.031566 | -0.019941 | 0.040560 | -0.013690 | 0.061881 | -0.072514 | ... | -0.011135 | -0.007907 | 1.000000 | -0.001712 | -0.114803 | -0.076397 | 0.087471 | -0.016008 | -0.085977 | -0.022857 |
Cabin_T | -0.023049 | -0.042750 | 0.020558 | -0.026456 | -0.014095 | -0.008904 | 0.018111 | -0.006113 | -0.013832 | 0.023611 | ... | -0.004972 | -0.003531 | -0.001712 | 1.000000 | -0.051263 | 0.022411 | -0.019574 | -0.007148 | 0.032461 | 0.001179 |
Cabin_U | 0.000208 | 0.713857 | 0.137396 | -0.316912 | -0.258257 | 0.142369 | 0.137351 | 0.041178 | -0.004364 | 0.131807 | ... | -0.333381 | -0.236733 | -0.114803 | -0.051263 | 1.000000 | 0.175812 | -0.211367 | 0.056438 | -0.271918 | -0.507197 |
family_single | 0.028546 | 0.147393 | 0.284537 | -0.203367 | -0.107874 | 0.127214 | 0.014246 | -0.265355 | -0.023890 | 0.386262 | ... | -0.042535 | 0.004055 | -0.076397 | 0.022411 | 0.175812 | 1.000000 | -0.873398 | -0.318944 | 0.116675 | -0.274826 |
family_small | 0.002975 | -0.218303 | -0.255196 | 0.279855 | 0.159594 | -0.122491 | -0.062909 | 0.120166 | -0.018085 | -0.300872 | ... | 0.068007 | 0.012756 | 0.087471 | -0.019574 | -0.211367 | -0.873398 | 1.000000 | -0.183007 | -0.038189 | 0.197281 |
family_large | -0.063415 | 0.127306 | -0.077748 | -0.125147 | -0.092825 | -0.018423 | 0.093671 | 0.301809 | 0.083422 | -0.194207 | ... | -0.046485 | -0.033009 | -0.016008 | -0.007148 | 0.056438 | -0.318944 | -0.183007 | 1.000000 | -0.161210 | 0.170853 |
Age_scaled | 0.025731 | -0.366371 | 0.057397 | -0.070323 | 0.076179 | -0.012718 | -0.059153 | -0.363923 | -0.254146 | 0.165476 | ... | 0.106600 | -0.072644 | -0.085977 | 0.032461 | -0.271918 | 0.116675 | -0.038189 | -0.161210 | 1.000000 | 0.171521 |
Fare_scaled | 0.031416 | -0.558477 | -0.185484 | 0.257307 | 0.286241 | -0.130054 | -0.169894 | 0.011596 | 0.092051 | -0.192192 | ... | 0.073949 | -0.037567 | -0.022857 | 0.001179 | -0.507197 | -0.274826 | 0.197281 | 0.170853 | 0.171521 | 1.000000 |
27 rows × 27 columns
'''
Correlation of each feature with survival (Survived);
ascending=False sorts in descending order
'''
corrDf['Survived'].sort_values(ascending =False)
Survived 1.000000
Mrs 0.344935
Miss 0.332795
family_small 0.279855
Fare_scaled 0.257307
Cabin_B 0.175095
Embarked_C 0.168240
Cabin_D 0.150716
Cabin_E 0.145321
Cabin_C 0.114652
Master 0.085221
Cabin_F 0.057935
Royalty 0.033391
Cabin_A 0.022287
Cabin_G 0.016040
Embarked_Q 0.003650
PassengerId -0.005007
Cabin_T -0.026456
Officer -0.031316
Age_scaled -0.070323
family_large -0.125147
Embarked_S -0.149683
family_single -0.203367
Cabin_U -0.316912
Pclass -0.338481
Sex -0.543351
Mr -0.549199
Name: Survived, dtype: float64
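In the ranking above, the strongest signals sit at both ends of the list: Mr and Sex are strongly negative, while Mrs and Miss are positive. Sorting by absolute value puts them together. A minimal sketch on made-up data, where the hypothetical `x1` is perfectly negatively correlated with the label and `x2` is uncorrelated:

```python
import pandas as pd

# Hypothetical toy data: x1 is the exact opposite of the label, x2 is unrelated
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 0, 1],
    "x1":    [1, 1, 0, 0, 1, 0],
    "x2":    [1, 0, 1, 0, 0, 0],
})

# Correlation of every feature with the label, excluding the label itself
corr = toy.corr()["label"].drop("label")

# Rank by strength of association, ignoring the sign
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

A plain descending sort would place `x1` last even though it is the most informative feature here.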
full_x = pd.concat([ titleDf,# title
pcalssDf,# passenger class
familyDf,# family size
full['Fare_scaled'],# ticket fare
full['Age_scaled'],# age
cabinDf,# cabin
embarkedDf,# port of embarkation
full['Sex']# sex
],axis=1)
full_x.head()
| | Master | Miss | Mr | Mrs | Officer | Royalty | Pclass_1 | Pclass_2 | Pclass_3 | family_size | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Embarked_C | Embarked_Q | Embarked_S | Sex |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
5 rows × 28 columns
Fit a machine learning model on the training data with a chosen algorithm, then evaluate the model on held-out test data.
sourceRow = 891
source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1,'Survived']
pred_x = full_x.loc[sourceRow:,:]
print('Training set shape:',source_x.shape)
Training set shape: (891, 28)
print('Kaggle test set shape:',pred_x.shape)
Kaggle test set shape: (418, 28)
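'''
Split the original labeled data (source) into a training set (used to fit the model) and a test set (used to evaluate it).
train_test_split is commonly used in validation workflows; it randomly splits the samples into train and test data in a given proportion.
train_data: the feature set to split
train_target: the labels to split
test_size: fraction of samples in the test split; an integer is read as an absolute number of samples
'''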
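# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
# Split into the training and test sets used to build the model
train_x, test_x, train_y, test_y = train_test_split(source_x ,
source_y,
train_size=.8)
# Print the data set sizes
print ('Original features:',source_x.shape,
'Training features:',train_x.shape ,
'Test features:',test_x.shape)
print ('Original labels:',source_y.shape,
'Training labels:',train_y.shape ,
'Test labels:',test_y.shape)
Original features: (891, 28) Training features: (712, 28) Test features: (179, 28)
Original labels: (891,) Training labels: (712,) Test labels: (179,)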
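The split above passes no `random_state`, so the rows that land in each set, and therefore the score, change from run to run. The sketch below uses synthetic arrays (hypothetical data, not the Titanic features) to show a reproducible split that also preserves the class ratio through `stratify`. The seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positive labels
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the label ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print("positive rate:", y_tr.mean(), y_te.mean())
```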
# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(train_x, train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
model.score(test_x , test_y )
0.82681564245810057
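An accuracy measured on a single 80/20 split depends on which rows end up in the test set. K-fold cross-validation averages over several splits and gives a steadier estimate. This is a sketch on synthetic data from `make_classification`, not on the Titanic features; `cv=5` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Five folds: each fold serves once as the held-out set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(float(scores.mean()), 3))
```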
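# Predict on the Kaggle test set for submission
pred_y = model.predict(pred_x)
'''
The predictions come back as floats (0.0, 1.0),
but Kaggle expects integers (0, 1) in the submission,
so convert the data type
'''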
pred_y=pred_y.astype(int)
# Passenger IDs
passenger_id = full.loc[sourceRow:,'PassengerId']
# DataFrame with the passenger ID and the predicted survival
predDf = pd.DataFrame(
{
'PassengerId': passenger_id , 'Survived': pred_y } )
predDf.shape
(418, 2)
predDf.head()
| | PassengerId | Survived |
---|---|---|
891 | 892 | 0 |
892 | 893 | 1 |
893 | 894 | 0 |
894 | 895 | 0 |
895 | 896 | 1 |
predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )