Feature Engineering (8): Case Study (2) — Predicting Titanic Survival with Logistic Regression

The Titanic divided its passengers into first-, second-, and third-class cabins; class determined the quality of safety equipment, entertainment, and dining, and so had some bearing on survival.
It was also a gentleman's era: during the disaster many men gave up their places in the lifeboats so that women and children could escape first, going calmly to their deaths, which makes sex and age survival factors as well.
From this background we can make an initial guess that cabin class, age, and sex influenced the survival rate.

Some people, such as women, children, and the upper class, were more likely to survive than others. What sorts of passengers were most likely to survive the Titanic?

The data can be downloaded from:
https://www.kaggle.com/competitions/titanic/data

1. Importing the Data

import warnings
warnings.filterwarnings('ignore')

# Import the data-handling packages
import numpy as np
import pandas as pd
# Load the data
train_data = pd.read_csv("./titanic_data/train.csv")
test_data = pd.read_csv("./titanic_data/test.csv")

print('Training set:',train_data.shape,'Test set:',test_data.shape)
Training set: (891, 12) Test set: (418, 11)
# Combine the two sets so they can be cleaned together
full_data = train_data.append(test_data,ignore_index=True)
print('Combined dataset:',full_data.shape)
Combined dataset: (1309, 12)
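A side note: `DataFrame.append` was deprecated and then removed in pandas 2.0, so the merge above fails on a current pandas. The same result comes from `pd.concat`; a minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Two small frames shaped like train/test (hypothetical toy data)
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test = pd.DataFrame({'PassengerId': [3, 4]})

# pd.concat is the replacement for the removed DataFrame.append
full = pd.concat([train, test], ignore_index=True)
print(full.shape)  # (4, 2); Survived is NaN for the test rows
```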

2. Inspecting the Dataset

# Peek at the data
full_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Descriptive statistics for the numeric columns
full_data.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 1309.000000 891.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 655.000000 0.383838 2.294882 29.881138 0.498854 0.385027 33.295479
std 378.020061 0.486592 0.837836 14.413493 1.041658 0.865560 51.758668
min 1.000000 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 328.000000 0.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 655.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 982.000000 1.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1309.000000 1.000000 3.000000 80.000000 8.000000 9.000000 512.329200

describe() only shows descriptive statistics for numeric columns; other types, such as the string-typed Name and Cabin columns, are omitted.
This makes sense: the summary statistics are computed from numbers, so the column has to be numeric.

# Data type and non-null count of every column
full_data.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

We can see the dataset has 1,309 rows in total.

Numeric columns with missing values:

  • 1) Age has 1,046 non-null values, so 1309 − 1046 = 263 are missing, a missing rate of 263/1309 ≈ 20%

  • 2) Fare has 1,308 non-null values, so 1 value is missing

String columns with missing values:

  • 1) Embarked has 1,307 non-null values; only 2 are missing, which is very few

  • 2) Cabin has 295 non-null values, so 1309 − 295 = 1014 are missing, a missing rate of 1014/1309 ≈ 77.5%, which is substantial

This points the way for the next step, data cleaning: only once we know which columns have missing values can we handle them in a targeted way.
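The missing counts and rates above were worked out by hand from the `info()` output. If helpful, pandas can compute every column's missing rate in one line; a small sketch on made-up values:

```python
import pandas as pd

# Toy frame standing in for full_data (hypothetical values)
df = pd.DataFrame({'Age': [22.0, None, 35.0, None],
                   'Fare': [7.25, 71.28, 8.05, 8.05],
                   'Cabin': [None, 'C85', None, None]})

# isnull().mean() gives the missing ratio of every column at once
missing_rate = df.isnull().mean().sort_values(ascending=False)
print(missing_rate)
```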

3. Data Cleaning (Data Preparation)

3.1 Data Preprocessing

Handling missing values:

During the data-understanding stage above, we found that the dataset has 1,309 rows.

  • Numeric columns with missing values: Age and Fare.
  • String columns with missing values: Embarked and Cabin.

This points the way for data cleaning: only once we know which columns have missing values can we handle them in a targeted way. Many machine learning algorithms require that the features passed in contain no null values. Common strategies:

  • For numeric data, fill with the mean
  • For categorical data, fill with the most frequent category
  • Predict the missing values with a model, e.g. K-NN
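The third strategy, model-based imputation, is not used in this walkthrough, but as a sketch of what it looks like, scikit-learn ships a `KNNImputer` (the values below are hypothetical):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix (columns: age, fare) with one missing age
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [np.nan, 8.05],
              [35.0, 53.10]])

# Each missing entry is replaced by the mean of that feature
# over the k nearest rows (by distance on the observed features)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```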
# 1. Age and Fare are numeric, so we fill both columns with their mean
full_data['Age'] = full_data['Age'].fillna(full_data['Age'].mean())

full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].mean())

# Age and Fare no longer contain nulls
full_data.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# 2. Fill the Embarked column
'''
Departure port: S = Southampton, England
First stop:     C = Cherbourg, France
Second stop:    Q = Queenstown, Ireland
'''
# S is by far the most common category, so we fill missing values with it
full_data['Embarked'].value_counts()
S    914
C    270
Q    123
Name: Embarked, dtype: int64
# Fill missing values with the most frequent port, S
full_data['Embarked'] = full_data['Embarked'].fillna('S')

# Embarked no longer contains nulls
full_data.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
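Rather than hard-coding 'S' after reading `value_counts()`, the most frequent category can be computed with `mode()`; a small sketch on a made-up column:

```python
import pandas as pd

# Toy Embarked column (hypothetical values)
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

# mode()[0] picks the most frequent category, so nothing needs hard-coding
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```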
# 3. Fill the Cabin column
full_data['Cabin'].value_counts()

C23 C25 C27        6
G6                 5
B57 B59 B63 B66    5
C22 C26            4
F33                4
                  ..
A14                1
E63                1
E12                1
E38                1
C105               1
Name: Cabin, Length: 186, dtype: int64
# Cabin has too many missing values to impute; fill them with U, for unknown
full_data['Cabin'] = full_data['Cabin'].fillna('U')


# All feature columns now have no nulls; Survived is the label column and needs no filling
full_data.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# Sanity-check the data
full_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 U S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 U S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 U S

3.2 Feature Extraction

Looking at the data types, the columns fall into three groups. The categorical columns will be encoded numerically and then one-hot encoded.

(1) Numeric:
PassengerId, Age, Fare, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch)

(2) Time series: none
(3) Categorical:

  • 1) With explicit categories

      Sex: male, female
      Embarked: departure port S = Southampton, England; stop 1: C = Cherbourg, France; stop 2: Q = Queenstown, Ireland
      Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
    
  • 2) Free-form strings from which features may be extracted, also treated as categorical

      Name
      Cabin
      Ticket
    

3.2.1 Categorical Data with Explicit Categories

# 1. Map Sex to numbers: male → 1, female → 0
sex_dict = {
    'male':1,
    'female':0
}

full_data['Sex'] = full_data['Sex'].map(sex_dict)
full_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U S
# 2. One-hot encode the Embarked column
'''
Use get_dummies for one-hot encoding, which produces dummy variables
'''
embarkedDf = pd.get_dummies(full_data['Embarked'],prefix='Embarked')
embarkedDf.head()
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
# Append the one-hot dummy columns to the original dataset
full_data = pd.concat([full_data,embarkedDf],axis=1)

'''
Embarked has been one-hot encoded into dummy columns, so we drop the original Embarked column.

How drop removes a column:
In drop(name, axis=1), axis=1 tells pandas to look for `name` along the column axis,
so the named column is removed in its entirety.
In short, to drop one or more columns, remember the syntax: drop([col1, col2], axis=1)
'''
full_data.drop('Embarked',axis=1,inplace=True)

full_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U 0 0 1
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1 0 0
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 0 1
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U 0 0 1
# 3. One-hot encode the Pclass column
# Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class


pclassDf = pd.get_dummies(full_data['Pclass'],prefix='Pclass')
pclassDf.head()
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
# Append the one-hot dummy columns to the original dataset
full_data = pd.concat([full_data,pclassDf],axis=1)

full_data.drop('Pclass',axis=1,inplace=True)

full_data.head()
PassengerId Survived Name Sex Age SibSp Parch Ticket Fare Cabin Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3
0 1 0.0 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 U 0 0 1 0 0 1
1 2 1.0 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1 0 0 1 0 0
2 3 1.0 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 0 1 0 0 1
3 4 1.0 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 0 0 1 1 0 0
4 5 0.0 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 U 0 0 1 0 0 1

3.2.2 Categorical Data from Strings

# 1. Extract the title from the Name column
'''
Note a striking pattern in the passenger names (Name):
every name contains a form of address, i.e. a title. Extracting it gives us
a very useful new variable that can help with prediction.
'''
full_data['Name'].head(10)
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
'''
Define a function that extracts the title from a name
'''
def getTitle(name):
    # e.g. 'Braund, Mr. Owen Harris' -> ' Mr' -> 'Mr'
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    return str2.strip()
titleDf = pd.DataFrame()
titleDf['Title'] = full_data['Name'].map(getTitle)
titleDf
Title
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
... ...
1304 Mr
1305 Dona
1306 Mr
1307 Mr
1308 Master

1309 rows × 1 columns
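As an aside, the same titles can be pulled out with a single regular expression instead of two `split` calls; a sketch on a few sample names:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Capture the text between the comma and the following period
titles = names.str.extract(r',\s*([A-Za-z ]+)\.', expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'Master']
```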

'''
Define the following title categories:
Officer  officials and officers
Royalty  nobility
Mr       adult man
Mrs      married woman
Miss     unmarried woman or girl
Master   boy or young unmarried male
'''

# Mapping from the title strings found in names to the title categories
title_dict = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Jonkheer": "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Dona": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty"
}


titleDf['Title'] = titleDf['Title'].map(title_dict)

# One-hot encode
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
Master Miss Mr Mrs Officer Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
# Append the one-hot columns to full_data and drop the Name column
full_data = pd.concat([full_data,titleDf],axis=1)

full_data.drop('Name',axis=1,inplace=True)
full_data
PassengerId Survived Sex Age SibSp Parch Ticket Fare Cabin Embarked_C ... Embarked_S Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer Royalty
0 1 0.0 1 22.000000 1 0 A/5 21171 7.2500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1 2 1.0 0 38.000000 1 0 PC 17599 71.2833 C85 1 ... 0 1 0 0 0 0 0 1 0 0
2 3 1.0 0 26.000000 0 0 STON/O2. 3101282 7.9250 U 0 ... 1 0 0 1 0 1 0 0 0 0
3 4 1.0 0 35.000000 1 0 113803 53.1000 C123 0 ... 1 1 0 0 0 0 0 1 0 0
4 5 0.0 1 35.000000 0 0 373450 8.0500 U 0 ... 1 0 0 1 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1304 1305 NaN 1 29.881138 0 0 A.5. 3236 8.0500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1305 1306 NaN 0 39.000000 0 0 PC 17758 108.9000 C105 1 ... 0 1 0 0 0 0 0 0 0 1
1306 1307 NaN 1 38.500000 0 0 SOTON/O.Q. 3101262 7.2500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1307 1308 NaN 1 29.881138 0 0 359309 8.0500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1308 1309 NaN 1 29.881138 1 1 2668 22.3583 U 1 ... 0 0 0 1 1 0 0 0 0 0

1309 rows × 21 columns

# 2. Extract the deck letter (the first character) from the Cabin column
full_data['Cabin'] = full_data['Cabin'].map(lambda c: c[0])
full_data.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Cabin Embarked_C ... Embarked_S Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer Royalty
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 U 0 ... 1 0 0 1 0 0 1 0 0 0
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 C 1 ... 0 1 0 0 0 0 0 1 0 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 U 0 ... 1 0 0 1 0 1 0 0 0 0
3 4 1.0 0 35.0 1 0 113803 53.1000 C 0 ... 1 1 0 0 0 0 0 1 0 0
4 5 0.0 1 35.0 0 0 373450 8.0500 U 0 ... 1 0 0 1 0 0 1 0 0 0

5 rows × 21 columns

# One-hot encode the deck
cabinDf = pd.get_dummies(full_data['Cabin'],prefix='Cabin')
cabinDf.head()
Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1
full_data = pd.concat([full_data,cabinDf],axis=1)

full_data.drop('Cabin',axis=1,inplace=True)
full_data.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Embarked_C Embarked_Q ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 1 0 ... 0 0 0 1 0 0 0 0 0 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 4 1.0 0 35.0 1 0 113803 53.1000 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 5 0.0 1 35.0 0 0 373450 8.0500 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

# 3. Build family-size and family-category features
familyDf = pd.DataFrame()

'''
FamilySize = number of parents/children aboard (Parch)
           + number of siblings/spouses aboard (SibSp)
           + the passenger themself
'''


familyDf['FamilySize'] = full_data['Parch'] + full_data['SibSp'] + 1

familyDf.head()
FamilySize
0 2
1 2
2 1
3 2
4 1
'''
Family categories:
Family_Small:   family size = 1
Family_Middle:  2 <= family size <= 4
Family_Large:   family size >= 5
'''


familyDf['Family_Small']  =  familyDf['FamilySize'].map(lambda cnt: 1 if cnt == 1 else 0 )
familyDf['Family_Middle'] =  familyDf['FamilySize'].map(lambda cnt: 1 if 2 <= cnt <= 4 else 0 )
familyDf['Family_Large']  =  familyDf['FamilySize'].map(lambda cnt: 1 if cnt >= 5 else 0 )


familyDf.head()
FamilySize Family_Small Family_Middle Family_Large
0 2 0 1 0
1 2 0 1 0
2 1 1 0 0
3 2 0 1 0
4 1 1 0 0
# Append to full_data
full_data = pd.concat([full_data,familyDf],axis=1)

full_data.head()
PassengerId Survived Sex Age SibSp Parch Ticket Fare Embarked_C Embarked_Q ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize Family_Small Family_Middle Family_Large
0 1 0.0 1 22.0 1 0 A/5 21171 7.2500 0 0 ... 0 0 0 0 0 1 2 0 1 0
1 2 1.0 0 38.0 1 0 PC 17599 71.2833 1 0 ... 0 0 0 0 0 0 2 0 1 0
2 3 1.0 0 26.0 0 0 STON/O2. 3101282 7.9250 0 0 ... 0 0 0 0 0 1 1 1 0 0
3 4 1.0 0 35.0 1 0 113803 53.1000 0 0 ... 0 0 0 0 0 0 2 0 1 0
4 5 0.0 1 35.0 0 0 373450 8.0500 0 0 ... 0 0 0 0 0 1 1 1 0 0

5 rows × 33 columns

# Number of features so far
full_data.shape
(1309, 33)

3.3 Feature Selection

# Correlation matrix (on pandas >= 2.0, write full_data.corr(numeric_only=True))
corrDf = full_data.corr()
corrDf
PassengerId Survived Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize Family_Small Family_Middle Family_Large
PassengerId 1.000000 -0.005007 0.013406 0.025731 -0.055224 0.008942 0.031416 0.048101 0.011585 -0.049836 ... 0.000549 -0.008136 0.000306 -0.045949 -0.023049 0.000208 -0.031437 0.028546 0.002975 -0.063415
Survived -0.005007 1.000000 -0.543351 -0.070323 -0.035322 0.081629 0.257307 0.168240 0.003650 -0.149683 ... 0.150716 0.145321 0.057935 0.016040 -0.026456 -0.316912 0.016639 -0.203367 0.279855 -0.125147
Sex 0.013406 -0.543351 1.000000 0.057397 -0.109609 -0.213125 -0.185484 -0.066564 -0.088651 0.115193 ... -0.057396 -0.040340 -0.006655 -0.083285 0.020558 0.137396 -0.188583 0.284537 -0.255196 -0.077748
Age 0.025731 -0.070323 0.057397 1.000000 -0.190747 -0.130872 0.171521 0.076179 -0.012718 -0.059153 ... 0.132886 0.106600 -0.072644 -0.085977 0.032461 -0.271918 -0.196996 0.116675 -0.038189 -0.161210
SibSp -0.055224 -0.035322 -0.109609 -0.190747 1.000000 0.373587 0.160224 -0.048396 -0.048678 0.073709 ... -0.015727 -0.027180 -0.008619 0.006015 -0.013247 0.009064 0.861952 -0.591077 0.253590 0.699681
Parch 0.008942 0.081629 -0.213125 -0.130872 0.373587 1.000000 0.221522 -0.008635 -0.100943 0.071881 ... -0.027385 0.001084 0.020481 0.058325 -0.012304 -0.036806 0.792296 -0.549022 0.248532 0.624627
Fare 0.031416 0.257307 -0.185484 0.171521 0.160224 0.221522 1.000000 0.286241 -0.130054 -0.169894 ... 0.072737 0.073949 -0.037567 -0.022857 0.001179 -0.507197 0.226465 -0.274826 0.197281 0.170853
Embarked_C 0.048101 0.168240 -0.066564 0.076179 -0.048396 -0.008635 0.286241 1.000000 -0.164166 -0.778262 ... 0.107782 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.036553 -0.107874 0.159594 -0.092825
Embarked_Q 0.011585 0.003650 -0.088651 -0.012718 -0.048678 -0.100943 -0.130054 -0.164166 1.000000 -0.491656 ... -0.061459 -0.042877 -0.020282 -0.019941 -0.008904 0.142369 -0.087190 0.127214 -0.122491 -0.018423
Embarked_S -0.049836 -0.149683 0.115193 -0.059153 0.073709 0.071881 -0.169894 -0.778262 -0.491656 1.000000 ... -0.056023 0.002960 0.030575 0.040560 0.018111 0.137351 0.087771 0.014246 -0.062909 0.093671
Pclass_1 0.026495 0.285904 -0.107371 0.362587 -0.034256 -0.013033 0.599956 0.325722 -0.166101 -0.181800 ... 0.275698 0.242963 -0.073083 -0.035441 0.048310 -0.776987 -0.029656 -0.126551 0.165965 -0.067523
Pclass_2 0.022714 0.093349 -0.028862 -0.014193 -0.052419 -0.010057 -0.121372 -0.134675 -0.121973 0.196532 ... -0.037929 -0.050210 0.127371 -0.032081 -0.014325 0.176485 -0.039976 -0.035075 0.097270 -0.118495
Pclass_3 -0.041544 -0.322308 0.116562 -0.302093 0.072610 0.019521 -0.419616 -0.171430 0.243706 -0.003805 ... -0.207455 -0.169063 -0.041178 0.056964 -0.030057 0.527614 0.058430 0.138250 -0.223338 0.155560
Master 0.002254 0.085221 0.164375 -0.363923 0.329171 0.253482 0.011596 -0.014172 -0.009091 0.018297 ... -0.042192 0.001860 0.058311 -0.013690 -0.006113 0.041178 0.355061 -0.265355 0.120166 0.301809
Miss -0.050027 0.332795 -0.672819 -0.254146 0.077564 0.066473 0.092051 -0.014351 0.198804 -0.113886 ... -0.012516 0.008700 -0.003088 0.061881 -0.013832 -0.004364 0.087350 -0.023890 -0.018085 0.083422
Mr 0.014116 -0.549199 0.870678 0.165476 -0.243104 -0.304780 -0.192192 -0.065538 -0.080224 0.108924 ... -0.030261 -0.032953 -0.026403 -0.072514 0.023611 0.131807 -0.326487 0.386262 -0.300872 -0.194207
Mrs 0.033299 0.344935 -0.571176 0.198091 0.061643 0.213491 0.139235 0.098379 -0.100374 -0.022950 ... 0.080393 0.045538 0.013376 0.042547 -0.011742 -0.162253 0.157233 -0.354649 0.361247 0.012893
Officer 0.002231 -0.031316 0.087288 0.162818 -0.013813 -0.032631 0.028696 0.003678 -0.003212 -0.001202 ... 0.006055 -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 -0.026921 0.013303 0.003966 -0.034572
Royalty 0.004400 0.033391 -0.020408 0.059466 -0.010787 -0.030197 0.026214 0.077213 -0.021853 -0.054250 ... -0.012950 -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 -0.023600 0.008761 -0.000073 -0.017542
Cabin_A -0.002831 0.022287 0.047561 0.125177 -0.039808 -0.030707 0.020094 0.094914 -0.042105 -0.056984 ... -0.024952 -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 -0.042967 0.045227 -0.029546 -0.033799
Cabin_B 0.015895 0.175095 -0.094453 0.113458 -0.011569 0.073051 0.393743 0.161595 -0.073613 -0.095790 ... -0.043624 -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 0.032318 -0.087912 0.084268 0.013470
Cabin_C 0.006092 0.114652 -0.077473 0.167993 0.048616 0.009601 0.401370 0.158043 -0.059151 -0.101861 ... -0.053083 -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 0.037226 -0.137498 0.141925 0.001362
Cabin_D 0.000549 0.150716 -0.057396 0.132886 -0.015727 -0.027385 0.072737 0.107782 -0.061459 -0.056023 ... 1.000000 -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.025313 -0.074310 0.102432 -0.049336
Cabin_E -0.008136 0.145321 -0.040340 0.106600 -0.027180 0.001084 0.073949 0.027566 -0.042877 0.002960 ... -0.034317 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.017285 -0.042535 0.068007 -0.046485
Cabin_F 0.000306 0.057935 -0.006655 -0.072644 -0.008619 0.020481 -0.037567 -0.020010 -0.020282 0.030575 ... -0.024369 -0.022961 1.000000 -0.007907 -0.003531 -0.236733 0.005525 0.004055 0.012756 -0.033009
Cabin_G -0.045949 0.016040 -0.083285 -0.085977 0.006015 0.058325 -0.022857 -0.031566 -0.019941 0.040560 ... -0.011817 -0.011135 -0.007907 1.000000 -0.001712 -0.114803 0.035835 -0.076397 0.087471 -0.016008
Cabin_T -0.023049 -0.026456 0.020558 0.032461 -0.013247 -0.012304 0.001179 -0.014095 -0.008904 0.018111 ... -0.005277 -0.004972 -0.003531 -0.001712 1.000000 -0.051263 -0.015438 0.022411 -0.019574 -0.007148
Cabin_U 0.000208 -0.316912 0.137396 -0.271918 0.009064 -0.036806 -0.507197 -0.258257 0.142369 0.137351 ... -0.353822 -0.333381 -0.236733 -0.114803 -0.051263 1.000000 -0.014155 0.175812 -0.211367 0.056438
FamilySize -0.031437 0.016639 -0.188583 -0.196996 0.861952 0.792296 0.226465 -0.036553 -0.087190 0.087771 ... -0.025313 -0.017285 0.005525 0.035835 -0.015438 -0.014155 1.000000 -0.688864 0.302640 0.801623
Family_Small 0.028546 -0.203367 0.284537 0.116675 -0.591077 -0.549022 -0.274826 -0.107874 0.127214 0.014246 ... -0.074310 -0.042535 0.004055 -0.076397 0.022411 0.175812 -0.688864 1.000000 -0.873398 -0.318944
Family_Middle 0.002975 0.279855 -0.255196 -0.038189 0.253590 0.248532 0.197281 0.159594 -0.122491 -0.062909 ... 0.102432 0.068007 0.012756 0.087471 -0.019574 -0.211367 0.302640 -0.873398 1.000000 -0.183007
Family_Large -0.063415 -0.125147 -0.077748 -0.161210 0.699681 0.624627 0.170853 -0.092825 -0.018423 0.093671 ... -0.049336 -0.046485 -0.033009 -0.016008 -0.007148 0.056438 0.801623 -0.318944 -0.183007 1.000000

32 rows × 32 columns

'''
Correlation of each feature with Survived, in descending order
'''
corrDf['Survived'].sort_values(ascending=False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Middle    0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Small    -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64

Based on the size of each feature's correlation with Survived, we select the following features as model inputs:

title (the titleDf frame built above), passenger class (pclassDf), family size (familyDf), fare (Fare), deck (cabinDf), port of embarkation (embarkedDf), and sex (Sex)

full_X = pd.concat(
    [
        titleDf,
        pclassDf,
        familyDf,
        full_data['Fare'],
        cabinDf,
        embarkedDf,
        full_data['Sex']
    ],axis=1
)

full_X.head()
Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 FamilySize ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Embarked_C Embarked_Q Embarked_S Sex
0 0 0 1 0 0 0 0 0 1 2 ... 0 0 0 0 0 1 0 0 1 1
1 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 1 0 0 0
2 0 1 0 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 1 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 1

5 rows × 27 columns
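The feature choice above was made by eyeballing the sorted correlations. If one wanted to automate it, a simple (admittedly crude) rule is to keep features whose absolute correlation with Survived clears a cutoff; a sketch using a few values from the table (the 0.1 cutoff is chosen arbitrarily):

```python
import pandas as pd

# A handful of the correlations with Survived from above
corr = pd.Series({'Mrs': 0.34, 'Fare': 0.26, 'Embarked_Q': 0.004,
                  'Mr': -0.55, 'PassengerId': -0.005})

# Keep features whose absolute correlation exceeds the cutoff
selected = corr[corr.abs() > 0.1].index.tolist()
print(selected)  # ['Mrs', 'Fare', 'Mr']
```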

4. Building the Model

  • The Titanic test set is what we will eventually submit to Kaggle; it contains no Survived values, so it cannot be used to evaluate the model.

  • We therefore use the Kaggle training set as our source dataset (source), and split it into a training set (train, for fitting the model) and a test set (test, for evaluating the model).

# The source dataset has 891 rows
source_row = 891


# Features of the source dataset
source_X = full_X.loc[0:source_row-1,:]
# Labels of the source dataset
source_y = full_data.loc[0:source_row-1,'Survived']


# Features of the prediction dataset
pred_X = full_X.loc[source_row:,:]


print('Source dataset size:',source_X.shape[0])
print('Prediction dataset size:',pred_X.shape[0])
Source dataset size: 891
Prediction dataset size: 418
# 1. Split the source dataset
from sklearn.model_selection import train_test_split


train_X,test_X,train_y,test_y  = train_test_split(
    source_X,
    source_y,
    test_size=0.2,
    train_size=0.8
)



# 2. Choose an algorithm; we use the most basic one, logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()


# 3. Train the model
lr.fit(train_X,train_y)
# 4. Evaluate the model by its accuracy on the held-out test set
# (train_test_split was not given a random_state, so the exact score varies between runs)
lr.score(test_X,test_y)
0.8156424581005587
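Because `train_test_split` was called without a fixed `random_state`, the single score above depends on the luck of the split. Cross-validation gives a steadier estimate; a sketch on synthetic data standing in for (source_X, source_y):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels (shapes only loosely match)
X, y = make_classification(n_samples=891, n_features=27, random_state=0)

# 5-fold cross-validation averages out the luck of any single split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```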

5. Submitting to Kaggle

# Predict on the prediction dataset
pred_y = lr.predict(pred_X)

# Kaggle requires the Survived values to be integers
pred_y = pred_y.astype(int)


# Passenger IDs
passenger_id = full_data.loc[source_row:,'PassengerId']

predDf = pd.DataFrame(
    {
        'PassengerId':passenger_id,
        'Survived':pred_y
    }
)

predDf.head()
PassengerId Survived
891 892 0
892 893 1
893 894 0
894 895 0
895 896 1
# Save the submission file
predDf.to_csv('./titanic_data/titanic_pred.csv',index=False)
