第二章:第一节数据清洗及特征处理-课程

【回顾&引言】前面一章的内容大家可以感觉到我们主要是对基础知识做一个梳理,让大家了解数据分析的一些操作,主要做了数据的各个角度的观察。那么在这里,我们主要是做数据分析的流程性学习,主要是包括了数据清洗以及数据的特征处理,数据重构以及数据可视化。这些内容是为数据分析最后的建模和模型评价做一个铺垫。

开始之前,导入numpy、pandas包和数据

#加载所需的库
import numpy as np
import pandas as pd
#加载数据train.csv
df=pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

2 第二章:数据清洗及特征处理

我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。

2.1 缺失值观察与处理

我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢

2.1.1 任务一:缺失值观察

(1) 请查看每个特征缺失值个数
(2) 请查看Age, Cabin, Embarked列的数据
以上方式都有多种方式,所以大家多多益善

#写入代码
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
#写入代码
df.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
#写入代码
df[df.Age.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

177 rows × 12 columns

df[df.Cabin.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

687 rows × 12 columns

df[df.Embarked.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

2.1.2 任务二:对缺失值进行处理

(1)处理缺失值一般有几种思路

(2) 请尝试对Age列的数据的缺失值进行处理

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

#处理缺失值的一般思路:
#提醒:可使用的函数有--->dropna函数与fillna函数
df_drop=df.dropna(subset=['Age'])
df_drop[df_drop.Age.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
#写入代码
df_ffill=df.copy()
df_ffill.Age.fillna(method='ffill',inplace=True)
df_ffill[df_ffill.Age.isna()]
# df.Age.fillna(method='ffill')
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
#写入代码
df_bfill=df.copy()
df_bfill.Age.fillna(method='bfill',inplace=True)
df_bfill[df_bfill.Age.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
#写入代码
df_meanfill=df.copy()
df_meanfill.Age.mean()
df_meanfill.Age.fillna(df_meanfill.Age.mean(),inplace=True)
df_meanfill[df_meanfill.Age.isna()]

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
df_meanfill.Age.mean()
29.699117647058763

【思考1】dropna和fillna有哪些参数,分别如何使用呢?

dropna

  • (a)axis参数
  • (b)how参数(可以选all或者any,表示全为缺失去除和存在缺失去除)
  • (c)subset参数(即在某一组列范围中搜索缺失值)

fillna

  • (a)值填充与前后向填充(分别与ffill方法和bfill方法等价)method=

【思考2】检索空缺值用np.nan要比用None好,这是为什么?

#思考回答

  • 在所有的表格读取后,无论列是存放什么类型的数据,默认的缺失值全为np.nan类型
  • 但这里存在一个问题就是np.nan!=np.nan

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

2.2 重复值观察与处理

由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢

2.2.1 任务一:请查看数据中的重复值

  • .duplicated()可选参数keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为True
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df[df.duplicated()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
df.drop_duplicates()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df[df.duplicated('Ticket')]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.0000 C23 C25 C27 S
117 118 0 2 Turpin, Mr. William John Robert male 29.0 1 0 11668 21.0000 NaN S
119 120 0 3 Andersson, Miss. Ellis Anna Maria female 2.0 4 2 347082 31.2750 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

210 rows × 12 columns

#写入代码

for name ,group in df.groupby('Ticket'):
    print(name)
    display(group)
    break
110152
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
257 258 1 1 Cherry, Miss. Gladys female 30.0 0 0 110152 86.5 B77 S
504 505 1 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5 B79 S
759 760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 110152 86.5 B77 S

可以发现有很多人是使用了同样的票上船的,如果是要求一人一票的话那么可能这部分的数据就是存在问题的,也可能是逃票了或者其他原因。

2.2.2 任务二:对重复值进行处理

(1)重复值有哪些处理方式呢?

(2)处理我们数据的重复值

方法多多益善

  • .drop_duplicates(‘Class’)

#重复值有哪些处理方式:

  • 删除重复值
  • 单独对于重复值进行修改
#写入代码


df.drop_duplicates()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

2.2.3 任务三:将前面清洗的数据保存为csv格式

#写入代码
df.to_csv('test_clear.csv')


2.3 特征观察与处理

我们对特征进行一下观察,可以把特征大概分为两大类:
数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征
文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

2.3.1 任务一:对年龄进行分箱(离散化)处理

(1) 分箱操作是什么?

(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示

(4) 将连续变量Age按10% 30% 50 70% 90%五个年龄段,并用分类变量12345表示

(5) 将上面的获得的数据分别进行保存,保存为csv格式

df['Age_cut']=pd.cut(df.Age,5, right=False, labels=[1,2,3,4,5])#
df['Age_cut']
0        2
1        3
2        2
3        3
4        3
      ... 
886      2
887      2
888    NaN
889      2
890      2
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
df.to_csv('test_ave.csv')
#分箱操作是什么:
df['Age_cut']=pd.cut(df.Age,[0,5,15,30,50,80], right=False, labels=[1,2,3,4,5])
df['Age_cut']
0        3
1        4
2        3
3        4
4        4
      ... 
886      3
887      3
888    NaN
889      3
890      4
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
df.to_csv('test_ave.csv')
#写入代码
df['Age_cut']=pd.qcut(df.Age,[0,.1,.3,0.5,0.7,0.9], labels=[1,2,3,4,5])
df['Age_cut']
0        2
1        5
2        3
3        4
4        4
      ... 
886      3
887      2
888    NaN
889      3
890      4
Name: Age_cut, Length: 891, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
#写入代码
df.to_csv('test_ave.csv')
#写入代码

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 任务二:对文本变量进行转换

(1) 查看文本变量名及种类
(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示
(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示

#写入代码
df.Sex.describe()
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object
df.Sex.value_counts()
male      577
female    314
Name: Sex, dtype: int64
df.Cabin.value_counts()
G6             4
C23 C25 C27    4
B96 B98        4
F33            3
F2             3
              ..
A5             1
B71            1
C111           1
B79            1
D48            1
Name: Cabin, Length: 147, dtype: int64
df.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_cut Sex_num
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 5 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 4 1
#写入代码
df['Sex_num'] = df['Sex'].replace(r'male',1).replace(r'female',2)
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_cut Sex_num
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 5 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 4 1
#写入代码
df['Sex_num'] = df['Sex'].map({'male':1,'female':2})
df.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_cut Sex_num
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 5 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 4 1
#方法三: 使用sklearn.preprocessing的LabelEncoder
from sklearn.preprocessing import LabelEncoder
for feat in ['Sex','Cabin', 'Ticket']:
    lbl = LabelEncoder()  
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode_map"] = df[feat].map(label_dict)
    df[feat + "_labelEncode_lbl"] = lbl.fit_transform(df[feat].astype(str))

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_cut Sex_num Sex_labelEncode_map Sex_labelEncode_lbl Cabin_labelEncode_map Cabin_labelEncode_lbl Ticket_labelEncode_map Ticket_labelEncode_lbl
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 1 0 1 0.0 147 0 523
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 5 2 1 0 1.0 81 1 596
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 2 1 0 0.0 147 2 669
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 2 1 0 2.0 55 3 49
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 4 1 0 1 0.0 147 4 472
###  factorize方法
#### 该方法主要用于自然数编码,并且缺失值会被记做-1,其中sort参数表示是否排序后赋值
codes, uniques = pd.factorize(df['Sex'], sort=True)
df['Sex_factorize']=codes
codes, uniques = pd.factorize(df['Cabin'], sort=True)
df['Cabin_factorize']=codes
codes, uniques = pd.factorize(df['Embarked'], sort=True)
df['Embarked_factorize']=codes
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Sex_num Sex_labelEncode_map Sex_labelEncode_lbl Cabin_labelEncode_map Cabin_labelEncode_lbl Ticket_labelEncode_map Ticket_labelEncode_lbl Sex_factorize Cabin_factorize Embarked_factorize
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 ... 1 0 1 0.0 147 0 523 1 -1 2
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 ... 2 1 0 1.0 81 1 596 0 81 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 ... 2 1 0 0.0 147 2 669 0 -1 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 ... 2 1 0 2.0 55 3 49 0 55 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 ... 1 0 1 0.0 147 4 472 1 -1 2

5 rows × 23 columns

#one_hot
df.join(pd.get_dummies(df[['Sex', 'Cabin','Embarked']])).head()
#可选prefix参数添加前缀,prefix_sep添加分隔符
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Cabin_F G73 Cabin_F2 Cabin_F33 Cabin_F38 Cabin_F4 Cabin_G6 Cabin_T Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 ... 0 0 0 0 0 0 0 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 ... 0 0 0 0 0 0 0 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 ... 0 0 0 0 0 0 0 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 ... 0 0 0 0 0 0 0 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 175 columns

2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)

#写入代码
df['Titles']=df['Name'].str.extract('(["Mr","Miss","Mrs"]+)\.')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Sex_labelEncode_map Sex_labelEncode_lbl Cabin_labelEncode_map Cabin_labelEncode_lbl Ticket_labelEncode_map Ticket_labelEncode_lbl Sex_factorize Cabin_factorize Embarked_factorize Titles
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 ... 0 1 0.0 147 0 523 1 -1 2 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 ... 1 0 1.0 81 1 596 0 81 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 ... 1 0 0.0 147 2 669 0 -1 2 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 ... 1 0 2.0 55 3 49 0 55 2 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 ... 0 1 0.0 147 4 472 1 -1 2 Mr

5 rows × 24 columns

#保存最终你完成的已经清理好的数据
# 保存上面的为最终结论
df.to_csv('test_fin.csv')

你可能感兴趣的:(数据分析)