# Load the required libraries
import numpy as np
import pandas as pd
# Load the data: train.csv
train_data = pd.read_csv('../titanic/train.csv')
The data we get is usually not clean. "Not clean" means it contains missing values, outliers and so on, and it needs some processing before any further analysis or modelling. So the first step after loading the data is data cleaning. In this chapter we work through missing values, duplicates, string handling and data transformation, cleaning the data into a form that can be analysed or modelled.
Real data often has plenty of missing values. For example, we can already see NaN in the Cabin column. Do the other columns have missing values too, and how should they be handled?
(1) Count the missing values in each feature.
(2) Look at the data in the Age, Cabin and Embarked columns.
# Overview of the table: column dtypes and non-null counts
train_data.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Count the missing values in each column
train_data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
tips: From the above, there are 891 rows in total.
Age is missing about a quarter of its values, so various imputation methods can be considered for it.
Cabin is missing roughly 7/9 of its values (687/891); to limit the noise fed to a model, either drop the column outright or build a model to predict and fill it.
Embarked is only missing 2 values; imputing them or dropping those rows both work and make little difference.
# Look at the Age, Cabin and Embarked columns
columns = ['Age','Cabin','Embarked']
train_data[columns].head()
  | Age | Cabin | Embarked |
---|---|---|---|
0 | 22.0 | NaN | S |
1 | 38.0 | C85 | C |
2 | 26.0 | NaN | S |
3 | 35.0 | C123 | S |
4 | 35.0 | NaN | S |
(1) What are the general approaches to handling missing values?
(2) Try handling the missing values in the Age column.
(3) Try handling the missing values of the whole table at once, using several different methods.
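For (1) and (3): the usual options are dropping the rows or columns that contain missing values, or filling them with a constant, a statistic (mean / median / mode), or a neighbouring value. A minimal sketch on a copy of the table, so the working DataFrame stays untouched (each call returns a new DataFrame):
# Common whole-table strategies, demonstrated on a copy so train_data is unchanged
tmp = train_data.copy()
tmp.dropna()                              # drop every row that contains any NaN
tmp.dropna(axis=1)                        # drop every column that contains any NaN
tmp.fillna(0)                             # fill every NaN with a constant
tmp.ffill()                               # fill with the previous valid value in the column
tmp.fillna(tmp.mean(numeric_only=True))   # fill numeric columns with their column means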
# Fill the missing values in Age with the overall median age
age_median = train_data.Age.median()
train_data['Age'] = train_data['Age'].fillna(age_median)
train_data.Age.describe()
count 891.000000
mean 29.361582
std 13.019697
min 0.420000
25% 22.000000
50% 28.000000
75% 35.000000
max 80.000000
Name: Age, dtype: float64
# A better approach is to group by sex, compute the median age of males and of females separately, and fill with those per-sex medians
age_median_sex = train_data.groupby('Sex').Age.median()
train_data.set_index('Sex', inplace=True)   # use 'Sex' as the index of the original table
# train_data.head()
age_median_sex
Sex
female 28.0
male 28.0
Name: Age, dtype: float64
"""
Pandas 的值在运算的过程中,会根据索引的值来进行自动的匹配。
在这里我们可以看到上一步骤的Series:age_median_sex的索引是 female 和 male 两个值,
所以需要把原始数据titanic_df中的性别也设置为索引,用 fillna 自动匹配相应的索引进行填充。
"""
train_data.Age.fillna(age_median_sex,inplace = True)
train_data.reset_index(inplace=True)
train_data.Age.describe()
count 891.000000
mean 29.361582
std 13.019697
min 0.420000
25% 22.000000
50% 28.000000
75% 35.000000
max 80.000000
Name: Age, dtype: float64
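Note that Age was already filled with the overall median above, so this per-sex fill has nothing left to fill and the describe output is unchanged. On a freshly loaded copy, the same idea can also be written with groupby().transform, which avoids moving Sex in and out of the index; a small sketch assuming the original train.csv:
# Per-sex median fill without touching the index (assumes a fresh copy of train.csv)
fresh = pd.read_csv('../titanic/train.csv')
fresh['Age'] = fresh['Age'].fillna(fresh.groupby('Sex')['Age'].transform('median'))
fresh['Age'].isnull().sum()   # 0: every row has a Sex, so every NaN gets its group's median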
# Drop the Cabin column, which is mostly missing
train_data.drop(['Cabin'],axis = 1,inplace = True)
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | female | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | female | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | female | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
4 | male | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
train_data.describe(include=['object'])   # include=['object'] gives descriptive statistics for the text (categorical) columns
  | Sex | Name | Ticket | Embarked |
---|---|---|---|---|
count | 891 | 891 | 891 | 889 |
unique | 2 | 891 | 681 | 3 |
top | male | Nicholson, Mr. Arthur Ernest | CA. 2343 | S |
freq | 577 | 1 | 7 | 644 |
# 'S' is by far the most frequent value
# The same information can be obtained from a simple frequency count of the Embarked column
train_data.Embarked.value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
train_data.fillna({'Embarked': 'S'}, inplace=True)
train_data['Embarked'].isnull().sum()
0
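The same fill can be written without hard-coding 'S', by taking the most frequent value from the data itself; a small equivalent sketch:
# Fill Embarked with its mode instead of a hard-coded 'S' (same result on this dataset)
embarked_mode = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(embarked_mode)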
【Thinking 1】What parameters do dropna and fillna take, and how are they used?
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter notes:
axis: 0 drops rows containing missing values, 1 drops columns; how: 'any' drops when any value is missing, 'all' only when every value is missing; thresh keeps only rows/columns with at least that many non-missing values; subset restricts the check to the given columns.
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
value may be a scalar, dict, Series or DataFrame; method='ffill'/'bfill' propagates neighbouring valid values; limit caps the number of consecutive fills.
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
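A few concrete calls to make the parameters above less abstract (each returns a new DataFrame, since inplace is left at its default False):
# dropna: how / thresh / subset control which rows are dropped
train_data.dropna(how='all')                    # drop rows in which every value is NaN
train_data.dropna(thresh=10)                    # keep rows with at least 10 non-NaN values
train_data.dropna(subset=['Age', 'Embarked'])   # only check these columns for NaN

# fillna: value may be a scalar, dict or Series; limit caps consecutive fills
train_data.fillna({'Age': train_data['Age'].median(), 'Embarked': 'S'})
train_data.fillna(method='bfill', limit=1)      # back-fill at most one consecutive NaN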
Could the data, for one reason or another, contain duplicate records, and if so how should they be handled?
# Look for fully duplicated rows
train_data[train_data.duplicated()]
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
(0 rows: the table contains no fully duplicated records)
(1) What are the ways of handling duplicate values?
(2) Handle the duplicates in our data; the more methods the better.
# One way to handle duplicates is simply to drop them:
train_data.drop_duplicates().head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | female | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | female | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | female | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
4 | male | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
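This table has no fully duplicated rows, so drop_duplicates() returns it unchanged; when partial duplicates matter, the subset and keep parameters are the useful knobs. A short sketch (Ticket is used purely for illustration):
# duplicated() / drop_duplicates() can look at a subset of columns
train_data.duplicated(subset=['Ticket']).sum()               # rows sharing a Ticket with an earlier row
train_data.drop_duplicates(subset=['Ticket'], keep='first')  # keep only the first row of each Ticket
train_data.drop_duplicates(subset=['Ticket'], keep=False)    # drop every row whose Ticket is not unique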
Looking over the features, they fall roughly into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Survived and Pclass are discrete numeric features, while Age, SibSp, Parch and Fare are treated here as continuous numeric features.
Text features: Name, Sex, Cabin, Embarked, Ticket, of which Sex, Cabin, Embarked and Ticket are categorical text features. Numeric features can usually be fed to a model directly, but continuous variables are sometimes discretised for the sake of model stability and robustness. Text features generally have to be converted into numeric form before they can be used for modelling.
(1) What is binning (discretisation)?
(2) Cut the continuous variable Age into 5 equal-width age bands, labelled with the categories 1-5.
(3) Cut Age into the five age bands [0,5) [5,15) [15,30) [30,50) [50,80), labelled with the categories 1-5.
(4) Cut Age at the 10% 30% 50% 70% 90% quantiles into five bands, labelled with the categories 1-5.
(5) Save each of the resulting tables in csv format.
Binning (discretisation) groups a continuous variable into a small number of intervals, each of which is then treated as a category. Some of its benefits:
1.1 Binned features are robust to extreme values; for example, in many web-analytics systems a session is forcibly split at midnight, so a session length can never exceed one day.
1.2 In a logistic-regression model, a variable discretised into N dummy variables gets a separate weight per dummy, which introduces non-linearity and improves the expressive power and fit of the model;
1.3 Missing values can simply enter the model as one extra category;
1.4 Binning lowers the computational complexity of the model and speeds it up, which is friendlier for later deployment to production.
# Cut Age into 5 equal-width bands, labelled 1-5
train_data['Age'] = pd.cut(train_data['Age'],5,labels = ['1','2','3','4','5'])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 1 | 0 | 3 | Braund, Mr. Owen Harris | 2 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | female | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 3 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | female | 3 | 1 | 3 | Heikkinen, Miss. Laina | 2 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | female | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 3 | 1 | 0 | 113803 | 53.1000 | S |
4 | male | 5 | 0 | 3 | Allen, Mr. William Henry | 3 | 0 | 0 | 373450 | 8.0500 | S |
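Task (5) asks to save each binned result; a minimal sketch (the file name is just a placeholder), where index=False avoids the extra 'Unnamed: 0' column that otherwise appears when the file is read back:
# Save the equal-width binning result; the file name is only an example
train_data.to_csv('train_age_cut5.csv', index=False)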
# Cut Age into the bands [0,5) [5,15) [15,30) [30,50) [50,80), labelled 1-5 (Age must be numeric again here, e.g. after re-reading train.csv)
train_data['Age'] = pd.cut(train_data['Age'],[0,5,15,30,50,80],labels = ['1','2','3','4','5'])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 1 | 0 | 3 | Braund, Mr. Owen Harris | 3 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | female | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 4 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | female | 3 | 1 | 3 | Heikkinen, Miss. Laina | 3 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | female | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 0 | 113803 | 53.1000 | S |
4 | male | 5 | 0 | 3 | Allen, Mr. William Henry | 4 | 0 | 0 | 373450 | 8.0500 | S |
# Cut Age at the 10% 30% 50% 70% 90% quantiles into five bands, labelled 1-5
train_data['Age'] = pd.cut(train_data['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])
train_data.head()
  | Unnamed: 0 | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 1 | 0 | 3 | Braund, Mr. Owen Harris | NaN | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | 1 | female | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | NaN | 1 | 0 | PC 17599 | 71.2833 | C |
2 | 2 | female | 3 | 1 | 3 | Heikkinen, Miss. Laina | NaN | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | 3 | female | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | NaN | 1 | 0 | 113803 | 53.1000 | S |
4 | 4 | male | 5 | 0 | 3 | Allen, Mr. William Henry | NaN | 0 | 0 | 373450 | 8.0500 | S |
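The all-NaN Age column above is the symptom of passing quantile fractions to pd.cut, which treats 0.1-0.9 as literal age boundaries that no passenger falls inside. Quantile-based binning is what pd.qcut does; a corrected sketch, assuming Age has been restored to its numeric values by re-reading train.csv:
# Bin Age at the 10%/30%/50%/70%/90% quantiles with pd.qcut (Age must be numeric again)
fresh = pd.read_csv('../titanic/train.csv')
fresh['Age'] = pd.qcut(fresh['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=['1', '2', '3', '4', '5'])
# note: ages above the 90th percentile fall outside the last bin and stay NaN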
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
(1) List the text variables and their distinct categories.
(2) Represent the text variables Sex, Cabin and Embarked with integer codes 1, 2, 3, ...
(3) Represent the text variables Sex, Cabin and Embarked with one-hot encoding.
# List the text variables and their categories
train_data.describe(include=['object'])
  | Sex | Name | Ticket | Embarked |
---|---|---|---|---|
count | 891 | 891 | 891 | 891 |
unique | 2 | 891 | 681 | 3 |
top | male | Nicholson, Mr. Arthur Ernest | CA. 2343 | S |
freq | 577 | 1 | 7 | 646 |
train_data.Embarked.value_counts()
S 646
C 168
Q 77
Name: Embarked, dtype: int64
train_data.Sex.value_counts()
male 577
female 314
Name: Sex, dtype: int64
train_data.Sex.unique()
array(['male', 'female'], dtype=object)
# Method 1: use replace to turn the text variables into integer codes, starting with Sex
train_data['Sex'] = train_data.Sex.replace(['male','female'],[1,2])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | 2 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
4 | 1 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
train_data['Embarked'] = train_data.Embarked.replace(['S','C','Q'],[1,2,3])
train_data.head()
  | Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 1 |
1 | 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 2 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 1 |
3 | 2 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 |
4 | 1 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | 1 |
# Method 2: use map
# train_data = pd.read_csv('data_tmp.csv')
# train_data.head()
sex_map = {'female': 1, 'male': 2}
train_data['Sex'] = train_data.Sex.map(sex_map)
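A third option for turning text into integer codes is pd.factorize, which assigns codes in order of first appearance; a small sketch on a fresh copy (Sex_code is just an illustrative column name):
# pd.factorize returns an array of integer codes plus the array of unique values
fresh = pd.read_csv('../titanic/train.csv')
codes, uniques = pd.factorize(fresh['Sex'])
fresh['Sex_code'] = codes   # male -> 0, female -> 1, in order of first appearance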
# Reload the raw data to look at the Cabin column again (it was dropped earlier)
train_data = pd.read_csv('../titanic/train.csv')
train_data.Cabin.unique()
# train_data.Cabin.nunique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
'C148'], dtype=object)
# Label-encode the text variables as integer codes (demonstrated on Ticket and Cabin)
import pandas as pd
train_data = pd.read_csv('../titanic/train.csv')
from sklearn.preprocessing import LabelEncoder

for f in ['Ticket', 'Cabin']:
    # map each distinct value of the column to an integer code
    label_dict = dict(zip(train_data[f].unique(), range(train_data[f].nunique())))
    train_data[f + '_labelEncoder'] = train_data[f].map(label_dict)
train_data.head()
  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Ticket_labelEncoder | Cabin_labelEncoder |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 0.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2 | 0.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 | 2.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 0.0 |
# The same label encoding with sklearn's LabelEncoder
for feat in ['Ticket', 'Cabin']:
    lbl = LabelEncoder()
    train_data[feat + '_labelEncoder'] = lbl.fit_transform(train_data[feat].astype(str))
# One-hot encoding with pd.get_dummies
for f in ['Age', 'Embarked']:
    x = pd.get_dummies(train_data[f], prefix=f)
    train_data = pd.concat([train_data, x], axis=1)
train_data.head()
  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Age_65.0 | Age_66.0 | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 195 columns
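One-hot encoding every distinct Age value is what produces the 195 columns above; in practice Age would be binned first (or drop_first=True used) to keep the width down, and the original column is usually dropped after the dummies are joined. A hedged sketch on a fresh copy:
# One-hot encode Embarked, drop the first dummy to avoid redundancy, and drop the original column
fresh = pd.read_csv('../titanic/train.csv')
onehot = pd.get_dummies(fresh['Embarked'], prefix='Embarked', drop_first=True)   # 2 columns instead of 3
fresh = pd.concat([fresh.drop(columns=['Embarked']), onehot], axis=1)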
# Extract the title (Mr., Mrs., Miss., ...) from the Name column
train_data['title'] = train_data.Name.str.extract(r'([A-Za-z]+\.)', expand=False)
train_data.title[:20]
0 Mr.
1 Mrs.
2 Miss.
3 Mrs.
4 Mr.
5 Mr.
6 Mr.
7 Master.
8 Mrs.
9 Mrs.
10 Miss.
11 Miss.
12 Mr.
13 Mr.
14 Miss.
15 Mrs.
16 Master.
17 Mr.
18 Mrs.
19 Mrs.
Name: title, dtype: object
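A quick sanity check on the extraction is to look at the full distribution of titles (output not shown here):
# How many distinct titles were extracted, and how common is each?
train_data['title'].value_counts()
train_data['title'].isnull().sum()   # rows where no 'Word.' pattern was found in Name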