titanic_kaggle

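Predicting Titanic Survival with Logistic Regression

Contents

  1. Define the problem
  2. Understand the data
  • Collect the data
  • Import the data
  • Inspect the data set
  3. Data cleaning
  • Data preprocessing
  • Feature engineering
  • Feature selection
  4. Build the model
  5. Model evaluation
  6. Deploy the solution
  • Submit the results to Kaggle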

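1. Define the problem

What kinds of passengers were more likely to survive the sinking of the Titanic?

2. Understand the data

2.1 Collect the data

Download the data from the Kaggle Titanic competition page: https://www.kaggle.com/c/titanic

2.2 Import the data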

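import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the training set (adjust the paths to wherever the CSV files were saved)
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")
# Load the test set
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")
print('Training set shape:',train.shape)
print('Test set shape:',test.shape)
Training set shape: (891, 12)
Test set shape: (418, 11)
# Combine the training and test sets so that both are cleaned the same way.
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it, and
# sort=True keeps the alphabetical column order seen in the outputs that follow.
full = pd.concat([train, test], ignore_index=True, sort=True)
print('Combined data set shape:',full.shape)
Combined data set shape: (1309, 12)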

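2.3 Inspect the data set

# Look at the data and what each feature means:
'''
Age: age
Cabin: cabin number
Embarked: port of embarkation
Fare: ticket fare
Name: passenger name
Parch: number of parents/children aboard
PassengerId: passenger ID
Pclass: ticket class
Sex: sex
SibSp: number of siblings/spouses aboard
Survived: whether the passenger survived (1 = yes, 0 = no)
Ticket: ticket number
'''
full.head()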
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450
# Summary statistics
full.describe()
Age Fare Parch PassengerId Pclass SibSp Survived
count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 1309.000000 891.000000
mean 29.881138 33.295479 0.385027 655.000000 2.294882 0.498854 0.383838
std 14.413493 51.758668 0.865560 378.020061 0.837836 1.041658 0.486592
min 0.170000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000
25% 21.000000 7.895800 0.000000 328.000000 2.000000 0.000000 0.000000
50% 28.000000 14.454200 0.000000 655.000000 3.000000 0.000000 0.000000
75% 39.000000 31.275000 0.000000 982.000000 3.000000 1.000000 1.000000
max 80.000000 512.329200 9.000000 1309.000000 3.000000 8.000000 1.000000
# Data type and non-null count of each column
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

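3. Data cleaning

3.1 Data preprocessing

Handling missing values

The combined data set has 1309 rows.
The columns with missing data are:

  • Age: 1046 values present, 263 missing; filled with the mean.
  • Fare: 1308 values present, 1 missing; filled with the mean.
  • Embarked: 1307 values present, 2 missing; filled with the most frequent value.
  • Cabin: 295 values present, 1014 missing; because so many are missing, they are filled with a new marker 'U' (unknown).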
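The per-column missing counts listed above can be obtained directly with isnull().sum(). The following is a minimal sketch on a made-up toy frame (the values are placeholders, not Titanic data) that also applies the same two fill strategies: the mean for numeric columns and the most frequent value for a categorical column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Numeric columns: fill with the column mean.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())

# Categorical column: fill with the most frequent value (the mode).
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
print(toy)
```

Taking the mode programmatically avoids hard-coding the fill value, at the cost of hiding which value was actually used.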
# Age: fill with the mean
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare: fill with the mean
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
# Embarked: find the most frequent value
full['Embarked'].describe()
count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object
full['Embarked']=full['Embarked'].fillna('S')
# Cabin: too many values are missing, so fill with 'U' (unknown)
full['Cabin']=full['Cabin'].fillna('U')
# Check the data after filling the missing values
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

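3.2 Feature engineering

The 12 columns fall into the following groups by data type:

  1. Numeric:
  • PassengerId
  • Age
  • Fare
  • SibSp (number of siblings/spouses aboard)
  • Parch (number of parents/children aboard)
  2. Time series: none
  3. Categorical (categories given directly)
  • Sex: male, female
  • Embarked: S = Southampton, England (departure port); C = Cherbourg, France (first stop); Q = Queenstown, Ireland (second stop)
  • Pclass: 1 = first class, 2 = second class, 3 = third class
  4. Categorical (strings): features may be extracted from these
  • Name
  • Cabin
  • Ticket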
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2.1 Categorical data (categories given directly)

For Sex, Embarked and Pclass, each category label is split out and represented with 0 and 1.

  • Sex
sex_mapDict = {
     'male':1,'female':0}
# map: apply the mapping (a dict or a function) to every value of the Series
full['Sex'] = full['Sex'].map(sex_mapDict)
  • Embarked
embarkedDf = pd.DataFrame()
# get_dummies performs one-hot encoding
embarkedDf = pd.get_dummies(full['Embarked'],prefix='Embarked')
embarkedDf.head()
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
full = pd.concat([full,embarkedDf],axis=1)
full.drop('Embarked',axis=1,inplace=True)
  • Pclass
pcalssDf = pd.DataFrame()
pcalssDf = pd.get_dummies(full['Pclass'],prefix='Pclass')
pcalssDf.head()
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
# pcalssDf is joined in later, when the feature matrix full_x is assembled,
# so the original Pclass column stays in full for now.
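As a small illustration of the two encodings used in this section, the sketch below applies them to a made-up three-row frame: a two-valued column becomes a single 0/1 column via map, and a three-valued column becomes one indicator column per category via get_dummies. Passing dtype=int is an assumption added here so that the output stays 0/1 on newer pandas versions, which default to booleans.

```python
import pandas as pd

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Two categories: a single 0/1 column is enough.
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Three categories: one indicator column per category.
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(columns="Embarked"), embarked_dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the Embarked_* columns, which is what "one-hot" refers to.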

3.2.2 Categorical data (strings)

  • Extract the title from the name
full['Name'].head()
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
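# Extract the title
def get_title(name):
    # 'Braund, Mr. Owen Harris' -> ' Mr. Owen Harris'
    str1 = name.split(',')[1]
    # ' Mr. Owen Harris' -> ' Mr'
    str2 = str1.split('.')[0]
    # strip() removes leading and trailing whitespace
    str3 = str2.strip()
    return str3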
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title) 
titleDf.groupby('Title').count()
Title
Capt
Col
Don
Dona
Dr
Jonkheer
Lady
Major
Master
Miss
Mlle
Mme
Mr
Mrs
Ms
Rev
Sir
the Countess
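'''
Define the following title categories:
Officer: military officers and professional titles (e.g. Dr, Rev)
Royalty: royalty and nobility
Mr: adult men
Mrs: married women
Miss: unmarried women and girls
Master: boys
'''
# Mapping from the title string in the name to the title category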
title_mapDict = {
     
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
Master Miss Mr Mrs Officer Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)
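The title extraction can also be written as a vectorized regular expression instead of a Python function. This is only a sketch of an alternative, not the approach used in this post: the pattern, the abbreviated mapping, the "Other" fallback category and the last sample name are all hypothetical. It also shows that map leaves NaN for any title missing from the dictionary, which would otherwise silently produce an all-zero row after one-hot encoding.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Doe, Prof. Jane",  # hypothetical name with a title missing from the mapping
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# Abbreviated, hypothetical subset of the title mapping.
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss"}

# Titles absent from the mapping become NaN; send them to a fallback category.
categories = titles.map(title_map).fillna("Other")
print(categories.tolist())
```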
  • Cabin
full['Cabin'].head()
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object
# The first letter of the cabin number identifies the cabin category (deck)
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c : c[0])
full['Cabin'].head()
0    U
1    C
2    U
3    C
4    U
Name: Cabin, dtype: object
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )
cabinDf.head()
Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1
full = pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace= True)
full.head()
Age Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Embarked_C ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 22.0 7.2500 0 1 3 1 1 0.0 A/5 21171 0 ... 0 0 0 0 0 0 0 0 0 1
1 38.0 71.2833 0 2 1 0 1 1.0 PC 17599 1 ... 0 0 0 1 0 0 0 0 0 0
2 26.0 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 0 ... 0 0 0 0 0 0 0 0 0 1
3 35.0 53.1000 0 4 1 0 1 1.0 113803 0 ... 0 0 0 1 0 0 0 0 0 0
4 35.0 8.0500 0 5 3 1 0 0.0 373450 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 27 columns

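3.2.3 Numeric data

  • Family size and family category
# Holds the family information
familyDf = pd.DataFrame()

'''
Family size = parents/children (Parch) + siblings/spouses (SibSp) + the passenger
(the passenger is also a member of the family, hence the +1)
'''
familyDf[ 'family_size' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1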

familyDf['family_size'].describe()
count    1309.000000
mean        1.883881
std         1.583639
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: family_size, dtype: float64
%matplotlib notebook
familyDf['family_size'].plot()


'''
Family categories:
Single (family_single): family size = 1
Small family (family_small): 2 <= family size <= 4
Large family (family_large): family size >= 5
'''

familyDf['family_single'] = familyDf['family_size'].map(lambda s : 1 if s==1 else 0)
familyDf['family_small'] = familyDf['family_size'].map(lambda s : 1 if 2<=s<=4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s : 1 if s>4 else 0)
familyDf.head()
family_size family_single family_small family_large
0 2 0 1 0
1 2 0 1 0
2 1 1 0 0
3 2 0 1 0
4 1 1 0 0
full = pd.concat([full,familyDf],axis=1)
full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)
full.head()
Age Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S ... Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large
0 22.0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 ... 0 0 0 0 0 0 1 0 1 0
1 38.0 71.2833 2 1 0 1.0 PC 17599 1 0 0 ... 1 0 0 0 0 0 0 0 1 0
2 26.0 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 ... 0 0 0 0 0 0 1 1 0 0
3 35.0 53.1000 4 1 0 1.0 113803 0 0 1 ... 1 0 0 0 0 0 0 0 1 0
4 35.0 8.0500 5 3 1 0.0 373450 0 0 1 ... 0 0 0 0 0 0 1 1 0 0

5 rows × 28 columns
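The three family-size indicator columns built above could also be produced by binning. The sketch below is one possible alternative, using pd.cut with the same boundaries (1, 2 to 4, 5 and above) on a handful of made-up sizes; it is not the code used in this post.

```python
import pandas as pd

# Made-up family sizes covering each category boundary.
family_size = pd.Series([1, 2, 4, 5, 11])

# Bins are right-inclusive: (0, 1], (1, 4], (4, inf).
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["family_single", "family_small", "family_large"],
)

# One indicator column per category, in the order of the labels.
family_dummies = pd.get_dummies(family_type, dtype=int)
print(family_dummies)
```

Keeping the boundaries in one list makes it harder for the three conditions to drift out of sync when the thresholds change.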

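  • Age and Fare

The values of Age and Fare span a much wider range than the 0/1 indicator features, so both are rescaled with StandardScaler, which standardizes a column to zero mean and unit variance. The results are centred on 0 but are not confined to [-1, 1].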
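As a quick check of what StandardScaler computes, the sketch below compares its output with the formula z = (x - mean) / std on the first five Age values shown earlier. Using only five values is an assumption made for illustration, so the numbers differ from the Age_scaled column, which is fitted on all 1309 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Age values from the head() output, as a column vector.
ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0]])

scaled = StandardScaler().fit_transform(ages)

# The same thing by hand: subtract the mean, divide by the population std (ddof=0).
manual = (ages - ages.mean()) / ages.std()

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

The smallest standardized value here is below -1, which shows that standardization fixes the mean and variance rather than the range; MinMaxScaler is the scikit-learn class for rescaling to a fixed interval.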

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# StandardScaler expects a 2-D array, so the column is reshaped to (n_samples, 1);
# .values is needed because Series.reshape is deprecated
len(full['Age'].values.reshape(-1,1))
1309
# fit_transform learns the mean and standard deviation and applies them in one call
full['Age_scaled'] = scaler.fit_transform(full['Age'].values.reshape(-1,1)).ravel()
full.head()
Age Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled
0 22.0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 ... 0 0 0 0 0 1 0 1 0 -0.611972
1 38.0 71.2833 2 1 0 1.0 PC 17599 1 0 0 ... 0 0 0 0 0 0 0 1 0 0.630431
2 26.0 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 ... 0 0 0 0 0 1 1 0 0 -0.301371
3 35.0 53.1000 4 1 0 1.0 113803 0 0 1 ... 0 0 0 0 0 0 0 1 0 0.397481
4 35.0 8.0500 5 3 1 0.0 373450 0 0 1 ... 0 0 0 0 0 1 1 0 0 0.397481

5 rows × 29 columns

full.drop([ 'Age'],axis=1, inplace=True)
full.head()
Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled
0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 0 ... 0 0 0 0 0 1 0 1 0 -0.611972
1 71.2833 2 1 0 1.0 PC 17599 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0.630431
2 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 -0.301371
3 53.1000 4 1 0 1.0 113803 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0.397481
4 8.0500 5 3 1 0.0 373450 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 0.397481

5 rows × 28 columns

# Refit the same scaler on Fare and transform it in one call
full['Fare_scaled'] = scaler.fit_transform(full['Fare'].values.reshape(-1,1)).ravel()
full.head()
Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 0 ... 0 0 0 0 1 0 1 0 -0.611972 -0.503595
1 71.2833 2 1 0 1.0 PC 17599 1 0 0 0 ... 0 0 0 0 0 0 1 0 0.630431 0.734503
2 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 ... 0 0 0 0 1 1 0 0 -0.301371 -0.490544
3 53.1000 4 1 0 1.0 113803 0 0 1 0 ... 0 0 0 0 0 0 1 0 0.397481 0.382925
4 8.0500 5 3 1 0.0 373450 0 0 1 0 ... 0 0 0 0 1 1 0 0 0.397481 -0.488127

5 rows × 29 columns

full.drop([ 'Fare'],axis=1, inplace=True)
full.head()
PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master Miss ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
0 1 3 1 0.0 A/5 21171 0 0 1 0 0 ... 0 0 0 0 1 0 1 0 -0.611972 -0.503595
1 2 1 0 1.0 PC 17599 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0.630431 0.734503
2 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 1 ... 0 0 0 0 1 1 0 0 -0.301371 -0.490544
3 4 1 0 1.0 113803 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0.397481 0.382925
4 5 3 1 0.0 373450 0 0 1 0 0 ... 0 0 0 0 1 1 0 0 0.397481 -0.488127

5 rows × 28 columns

# Feature information after all processing
full.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null int64
Survived         891 non-null float64
Ticket           1309 non-null object
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
Master           1309 non-null uint8
Miss             1309 non-null uint8
Mr               1309 non-null uint8
Mrs              1309 non-null uint8
Officer          1309 non-null uint8
Royalty          1309 non-null uint8
Cabin_A          1309 non-null uint8
Cabin_B          1309 non-null uint8
Cabin_C          1309 non-null uint8
Cabin_D          1309 non-null uint8
Cabin_E          1309 non-null uint8
Cabin_F          1309 non-null uint8
Cabin_G          1309 non-null uint8
Cabin_T          1309 non-null uint8
Cabin_U          1309 non-null uint8
family_single    1309 non-null int64
family_small     1309 non-null int64
family_large     1309 non-null int64
Age_scaled       1309 non-null float64
Fare_scaled      1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB

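3.3 Feature selection

Compute the correlation coefficient between each feature and Survived, and select the features that are related to survival.

# Feature selection
# numeric_only=True (pandas >= 1.5) skips the string column Ticket
corrDf = full.corr(numeric_only=True)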
corrDf
PassengerId Pclass Sex Survived Embarked_C Embarked_Q Embarked_S Master Miss Mr ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
PassengerId 1.000000 -0.038354 0.013406 -0.005007 0.048101 0.011585 -0.049836 0.002254 -0.050027 0.014116 ... -0.008136 0.000306 -0.045949 -0.023049 0.000208 0.028546 0.002975 -0.063415 0.025731 0.031416
Pclass -0.038354 1.000000 0.124617 -0.338481 -0.269658 0.230491 0.091320 0.095257 0.024487 0.121492 ... -0.225649 0.013122 0.052133 -0.042750 0.713857 0.147393 -0.218303 0.127306 -0.366371 -0.558477
Sex 0.013406 0.124617 1.000000 -0.543351 -0.066564 -0.088651 0.115193 0.164375 -0.672819 0.870678 ... -0.040340 -0.006655 -0.083285 0.020558 0.137396 0.284537 -0.255196 -0.077748 0.057397 -0.185484
Survived -0.005007 -0.338481 -0.543351 1.000000 0.168240 0.003650 -0.149683 0.085221 0.332795 -0.549199 ... 0.145321 0.057935 0.016040 -0.026456 -0.316912 -0.203367 0.279855 -0.125147 -0.070323 0.257307
Embarked_C 0.048101 -0.269658 -0.066564 0.168240 1.000000 -0.164166 -0.778262 -0.014172 -0.014351 -0.065538 ... 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.107874 0.159594 -0.092825 0.076179 0.286241
Embarked_Q 0.011585 0.230491 -0.088651 0.003650 -0.164166 1.000000 -0.491656 -0.009091 0.198804 -0.080224 ... -0.042877 -0.020282 -0.019941 -0.008904 0.142369 0.127214 -0.122491 -0.018423 -0.012718 -0.130054
Embarked_S -0.049836 0.091320 0.115193 -0.149683 -0.778262 -0.491656 1.000000 0.018297 -0.113886 0.108924 ... 0.002960 0.030575 0.040560 0.018111 0.137351 0.014246 -0.062909 0.093671 -0.059153 -0.169894
Master 0.002254 0.095257 0.164375 0.085221 -0.014172 -0.009091 0.018297 1.000000 -0.110595 -0.258902 ... 0.001860 0.058311 -0.013690 -0.006113 0.041178 -0.265355 0.120166 0.301809 -0.363923 0.011596
Miss -0.050027 0.024487 -0.672819 0.332795 -0.014351 0.198804 -0.113886 -0.110595 1.000000 -0.585809 ... 0.008700 -0.003088 0.061881 -0.013832 -0.004364 -0.023890 -0.018085 0.083422 -0.254146 0.092051
Mr 0.014116 0.121492 0.870678 -0.549199 -0.065538 -0.080224 0.108924 -0.258902 -0.585809 1.000000 ... -0.032953 -0.026403 -0.072514 0.023611 0.131807 0.386262 -0.300872 -0.194207 0.165476 -0.192192
Mrs 0.033299 -0.179945 -0.571176 0.344935 0.098379 -0.100374 -0.022950 -0.093887 -0.212435 -0.497310 ... 0.045538 0.013376 0.042547 -0.011742 -0.162253 -0.354649 0.361247 0.012893 0.198091 0.139235
Officer 0.002231 -0.137341 0.087288 -0.031316 0.003678 -0.003212 -0.001202 -0.029567 -0.066899 -0.156611 ... -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 0.013303 0.003966 -0.034572 0.162818 0.028696
Royalty 0.004400 -0.104916 -0.020408 0.033391 0.077213 -0.021853 -0.054250 -0.015002 -0.033945 -0.079466 ... -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 0.008761 -0.000073 -0.017542 0.059466 0.026214
Cabin_A -0.002831 -0.202143 0.047561 0.022287 0.094914 -0.042105 -0.056984 -0.000711 -0.035697 0.015372 ... -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 0.045227 -0.029546 -0.033799 0.125177 0.020094
Cabin_B 0.015895 -0.353414 -0.094453 0.175095 0.161595 -0.073613 -0.095790 -0.017168 0.035069 -0.096776 ... -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 -0.087912 0.084268 0.013470 0.113458 0.393743
Cabin_C 0.006092 -0.430044 -0.077473 0.114652 0.158043 -0.059151 -0.101861 -0.047456 -0.013418 -0.068072 ... -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 -0.137498 0.141925 0.001362 0.167993 0.401370
Cabin_D 0.000549 -0.265341 -0.057396 0.150716 0.107782 -0.061459 -0.056023 -0.042192 -0.012516 -0.030261 ... -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.074310 0.102432 -0.049336 0.132886 0.072737
Cabin_E -0.008136 -0.225649 -0.040340 0.145321 0.027566 -0.042877 0.002960 0.001860 0.008700 -0.032953 ... 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.042535 0.068007 -0.046485 0.106600 0.073949
Cabin_F 0.000306 0.013122 -0.006655 0.057935 -0.020010 -0.020282 0.030575 0.058311 -0.003088 -0.026403 ... -0.022961 1.000000 -0.007907 -0.003531 -0.236733 0.004055 0.012756 -0.033009 -0.072644 -0.037567
Cabin_G -0.045949 0.052133 -0.083285 0.016040 -0.031566 -0.019941 0.040560 -0.013690 0.061881 -0.072514 ... -0.011135 -0.007907 1.000000 -0.001712 -0.114803 -0.076397 0.087471 -0.016008 -0.085977 -0.022857
Cabin_T -0.023049 -0.042750 0.020558 -0.026456 -0.014095 -0.008904 0.018111 -0.006113 -0.013832 0.023611 ... -0.004972 -0.003531 -0.001712 1.000000 -0.051263 0.022411 -0.019574 -0.007148 0.032461 0.001179
Cabin_U 0.000208 0.713857 0.137396 -0.316912 -0.258257 0.142369 0.137351 0.041178 -0.004364 0.131807 ... -0.333381 -0.236733 -0.114803 -0.051263 1.000000 0.175812 -0.211367 0.056438 -0.271918 -0.507197
family_single 0.028546 0.147393 0.284537 -0.203367 -0.107874 0.127214 0.014246 -0.265355 -0.023890 0.386262 ... -0.042535 0.004055 -0.076397 0.022411 0.175812 1.000000 -0.873398 -0.318944 0.116675 -0.274826
family_small 0.002975 -0.218303 -0.255196 0.279855 0.159594 -0.122491 -0.062909 0.120166 -0.018085 -0.300872 ... 0.068007 0.012756 0.087471 -0.019574 -0.211367 -0.873398 1.000000 -0.183007 -0.038189 0.197281
family_large -0.063415 0.127306 -0.077748 -0.125147 -0.092825 -0.018423 0.093671 0.301809 0.083422 -0.194207 ... -0.046485 -0.033009 -0.016008 -0.007148 0.056438 -0.318944 -0.183007 1.000000 -0.161210 0.170853
Age_scaled 0.025731 -0.366371 0.057397 -0.070323 0.076179 -0.012718 -0.059153 -0.363923 -0.254146 0.165476 ... 0.106600 -0.072644 -0.085977 0.032461 -0.271918 0.116675 -0.038189 -0.161210 1.000000 0.171521
Fare_scaled 0.031416 -0.558477 -0.185484 0.257307 0.286241 -0.130054 -0.169894 0.011596 0.092051 -0.192192 ... 0.073949 -0.037567 -0.022857 0.001179 -0.507197 -0.274826 0.197281 0.170853 0.171521 1.000000

27 rows × 27 columns

'''
Correlation of each feature with survival (Survived);
ascending=False sorts in descending order
'''
corrDf['Survived'].sort_values(ascending =False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
family_small     0.279855
Fare_scaled      0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Master           0.085221
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
Age_scaled      -0.070323
family_large    -0.125147
Embarked_S      -0.149683
family_single   -0.203367
Cabin_U         -0.316912
Pclass          -0.338481
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
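The ranking above is read by eye in this post. One way to make the selection explicit is to keep features whose absolute correlation with Survived exceeds a cutoff. The sketch below does this on made-up data; the 0.1 threshold and the column Noise are hypothetical choices for illustration, not values used by the author.

```python
import pandas as pd

# Made-up data: Sex is perfectly anti-correlated with Survived,
# Noise is uncorrelated with it by construction.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex":      [1, 0, 0, 1, 0, 1, 1, 0],
    "Noise":    [1, 1, 0, 0, 1, 1, 0, 0],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["Survived"].drop("Survived")

threshold = 0.1  # hypothetical cutoff
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(selected)
```

The absolute value matters: strongly negative features such as Sex and Mr in the ranking above are as informative as strongly positive ones.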
full_x = pd.concat([ titleDf,# title
                     pcalssDf,# ticket class
                     familyDf,# family size
                     full['Fare_scaled'],# fare
                     full['Age_scaled'],# age
                     cabinDf,# cabin
                     embarkedDf,# port of embarkation
                     full['Sex']# sex
                    ],axis=1)
full_x.head()
Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 family_size ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Embarked_C Embarked_Q Embarked_S Sex
0 0 0 1 0 0 0 0 0 1 2 ... 0 0 0 0 0 1 0 0 1 1
1 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 1 0 0 0
2 0 1 0 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 1 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 1

5 rows × 28 columns

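4. Build the model

A machine learning model is fitted on the training data with a chosen algorithm and then evaluated on the test data.

4.1 Create the training and test data sets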

sourceRow = 891

source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1,'Survived']

pred_x = full_x.loc[sourceRow:,:]
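print('Training set shape:',source_x.shape)
Training set shape: (891, 28)
print('Kaggle test set shape:',pred_x.shape)
Kaggle test set shape: (418, 28)
'''
Split the labelled source data into a training set (used to fit the model)
and a test set (used to evaluate it).
train_test_split randomly splits the samples into train and test data in the given proportion.
First argument: the feature matrix to split
Second argument: the labels to split
train_size / test_size: fraction of the samples in that part; an integer is taken as a number of samples
'''

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split

# Training and test sets used to build the model
train_x, test_x, train_y, test_y = train_test_split(source_x ,
                                                    source_y,
                                                    train_size=.8)
# Print the data set shapes
print ('Source features:',source_x.shape, 
       'Training features:',train_x.shape ,
      'Test features:',test_x.shape)

print ('Source labels:',source_y.shape, 
       'Training labels:',train_y.shape ,
      'Test labels:',test_y.shape)
Source features: (891, 28) Training features: (712, 28) Test features: (179, 28)
Source labels: (891,) Training labels: (712,) Test labels: (179,)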
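The split above is random, so the rows in each part (and the resulting score) change from run to run. A minimal sketch of a reproducible variant follows, assuming placeholder features and labels: random_state fixes the shuffle, and stratify keeps the class ratio the same in both parts. Neither argument is used in the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 3 random features, 40% positive labels.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
y = pd.Series([0] * 60 + [1] * 40)

train_x, test_x, train_y, test_y = train_test_split(
    X, y,
    train_size=0.8,
    random_state=42,  # fixed seed: the same rows land in each part on every run
    stratify=y,       # keep the 60/40 class ratio in both parts
)

print(train_x.shape, test_x.shape)
print(train_y.mean(), test_y.mean())
```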

4.2 Choose a machine learning algorithm

# Logistic regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

4.3 Train the model

model.fit(train_x, train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

5. Model evaluation

# For a classifier, score() returns the mean accuracy on the given data
model.score(test_x , test_y )
0.82681564245810057
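A single accuracy number from one random split can be noisy. The sketch below, run on synthetic data from make_classification rather than the Titanic features, shows that score() equals accuracy_score, how a confusion matrix breaks the accuracy down by class, and how 5-fold cross-validation gives several estimates instead of one. The data set, max_iter=1000 and cv=5 are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, train_size=0.8, random_state=0)

model = LogisticRegression(max_iter=1000).fit(train_x, train_y)
pred = model.predict(test_x)

# score() on a classifier is the fraction of correct predictions.
acc = accuracy_score(test_y, pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(test_y, pred)

# Five accuracy estimates, each from a different held-out fold.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(acc)
print(cm)
print(cv_scores.mean(), cv_scores.std())
```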

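6. Deploy the solution

# Predict on the Kaggle test set so the result can be uploaded to Kaggle
pred_y = model.predict(pred_x)

'''
The predictions are floats (0.0, 1.0),
but Kaggle requires integers (0, 1) in the submission,
so the data type has to be converted
'''
pred_y=pred_y.astype(int)

# Passenger IDs
passenger_id = full.loc[sourceRow:,'PassengerId']
# Data frame: passenger ID and predicted survival
predDf = pd.DataFrame(
    {'PassengerId': passenger_id,
     'Survived': pred_y})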

predDf.shape
(418, 2)
predDf.head()
PassengerId Survived
891 892 0
892 893 1
893 894 0
894 895 0
895 896 1
predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )
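Before uploading, it can help to check that the file has the expected shape. The sketch below builds a three-row stand-in for the prediction frame (the predictions are made up) and verifies the two column names, the integer 0/1 values and the uniqueness of the IDs; which checks are worth running is an assumption here, not a Kaggle-provided validator.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real passenger IDs and model output (made-up predictions).
passenger_id = pd.Series([892, 893, 894], index=[891, 892, 893])
pred_y = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({
    "PassengerId": passenger_id.values,
    "Survived": pred_y.astype(int),  # floats converted to integers
})

# Basic format checks before writing the file.
assert list(submission.columns) == ["PassengerId", "Survived"]
assert submission["Survived"].isin([0, 1]).all()
assert submission["PassengerId"].is_unique

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```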
