Identifying missing values
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'A':['a1', 'a1', 'a2', 'a2'],
... 'B':['b1', 'b2', None, 'b2'],
... 'C':[1, 2, 3, 4],
... 'D':[5, 6, None, 8],
... 'E':[5, None, 7, 8]
... })
>>> df
A B C D E
0 a1 b1 1 5.0 5.0
1 a1 b2 2 6.0 NaN
2 a2 None 3 NaN 7.0
3 a2 b2 4 8.0 8.0
>>> df.isna()
A B C D E
0 False False False False False
1 False False False False True
2 False True False True False
3 False False False False False
>>> df.isna().sum()
A 0
B 1
C 0
D 1
E 1
dtype: int64
>>> df.isna().sum(axis=1)
0 0
1 1
2 2
3 0
dtype: int64
>>> df.isna().sum().sum()
3
>>> df.loc[df.isna().any(axis=1)]
A B C D E
1 a1 b2 2 6.0 NaN
2 a2 None 3 NaN 7.0
>>> df.loc[:, df.isna().any()]
B D E
0 b1 5.0 5.0
1 b2 6.0 NaN
2 None NaN 7.0
3 b2 8.0 8.0
>>> df.loc[~df.isna().any(axis=1)]
A B C D E
0 a1 b1 1 5.0 5.0
3 a2 b2 4 8.0 8.0
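The boolean-mask filter above can also be written with `dropna`. The following is a minimal sketch on the same toy frame; the variable names `by_mask`, `by_dropna`, and `complete_cols` are illustrative only.

```python
import pandas as pd

# Same toy frame as above: None/NaN mark the missing cells.
df = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2'],
    'B': ['b1', 'b2', None, 'b2'],
    'C': [1, 2, 3, 4],
    'D': [5, 6, None, 8],
    'E': [5, None, 7, 8],
})

# Keep only rows without any missing value, two equivalent ways.
by_mask = df.loc[~df.isna().any(axis=1)]
by_dropna = df.dropna()

# Keep only columns without any missing value.
complete_cols = df.dropna(axis=1)

print(by_dropna)
print(complete_cols.columns.tolist())
```

`dropna` returns a new frame and leaves `df` unchanged, just like the mask-based selection.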
Filling missing values
>>> df.fillna(0)
A B C D E
0 a1 b1 1 5.0 5.0
1 a1 b2 2 6.0 0.0
2 a2 0 3 0.0 7.0
3 a2 b2 4 8.0 8.0
>>> df.replace({pd.NA:0})
A B C D E
0 a1 b1 1 5.0 5.0
1 a1 b2 2 6.0 NaN
2 a2 0 3 NaN 7.0
3 a2 b2 4 8.0 8.0
>>> import numpy as np
>>> df.replace({np.nan:0})
A B C D E
0 a1 b1 1 5.0 5.0
1 a1 b2 2 6.0 0.0
2 a2 0 3 0.0 7.0
3 a2 b2 4 8.0 8.0
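Filling every column with the same constant puts a numeric `0` into the text column `B`, as the output above shows. A possible alternative, sketched here under the assumption that a placeholder string suits `B` and the column median suits `D`, is to pass `fillna` a dict of per-column fill values; the choice of `'missing'` and of the median is illustrative, not part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2'],
    'B': ['b1', 'b2', None, 'b2'],
    'C': [1, 2, 3, 4],
    'D': [5, 6, None, 8],
    'E': [5, None, 7, 8],
})

# Per-column fill values: a placeholder string for the text column,
# the column median for the numeric column. Columns not listed in the
# dict (here E) are left untouched.
filled = df.fillna({'B': 'missing', 'D': df['D'].median()})
print(filled)
```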
Filling by interpolation
>>> df.interpolate()
A B C D E
0 a1 b1 1 5.0 5.0
1 a1 b2 2 6.0 6.0
2 a2 None 3 7.0 7.0
3 a2 b2 4 8.0 8.0
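Interpolation only makes sense for numeric data, which is why column `B` still contains `None` above. A hedged sketch that restricts the call to the numeric columns explicitly (the `result` and `interp` names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2'],
    'B': ['b1', 'b2', None, 'b2'],
    'C': [1, 2, 3, 4],
    'D': [5, 6, None, 8],
    'E': [5, None, 7, 8],
})

# Linearly interpolate the numeric columns only (C, D, E).
interp = df.select_dtypes(include='number').interpolate()

# Write the interpolated columns back into a copy of the frame;
# the text columns A and B are not modified.
result = df.copy()
result[interp.columns] = interp
print(result)
```

With the default linear method, each gap is filled with the midpoint of its two neighbours here: 7.0 for `D` and 6.0 for `E`.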
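Duplicate values and deleting data
By default keep='first': the first occurrence is kept and later repeats are flagged as duplicates.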
>>> df = pd.DataFrame({
... 'A': ['x', 'x', 'z'],
... 'B': ['x', 'x', 'x'],
... 'C': [1, 1, 2]
... })
>>> df
A B C
0 x x 1
1 x x 1
2 z x 2
>>> df.duplicated()
0 False
1 True
2 False
dtype: bool
>>> df[df.duplicated(keep = 'last')]
A B C
0 x x 1
>>> df.drop_duplicates()
A B C
0 x x 1
2 z x 2
>>> df.drop([0,2])
A B C
1 x x 1
>>> df.drop(['A','C'], axis = 1)
B
0 x
1 x
2 x
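The `keep` and `subset` parameters change which rows count as duplicates. A small sketch on the same three-row frame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', 'z'],
    'B': ['x', 'x', 'x'],
    'C': [1, 1, 2],
})

# keep=False flags every member of a duplicated group, not just the repeats.
all_copies = df.duplicated(keep=False)

# keep='last' retains the last occurrence instead of the first.
keep_last = df.drop_duplicates(keep='last')

# subset restricts the comparison to the listed columns.
unique_b = df.drop_duplicates(subset=['B'])

print(all_copies.tolist())
print(keep_last.index.tolist())
print(unique_b.index.tolist())
```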
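Data binning (also called discrete binning or bucketing) is a data preprocessing technique that groups the original values into a small number of intervals, called bins. It is a form of quantization.
Binning smooths the input data and, on small datasets, can also reduce overfitting.
pandas discretizes continuous data mainly through two functions: pd.cut (equal-width bins) and pd.qcut (quantile-based bins).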
df.Age.max(), df.Age.min()
(80.0, 0.0)
df['Age_num'] = pd.cut(df['Age'],bins = 5, labels = [1,2,3,4,5])
df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_num |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3 |
df.Age.groupby(pd.cut(df.Age, bins =4)).count()
Age
(-0.08, 20.0] 356
(20.0, 40.0] 385
(40.0, 60.0] 128
(60.0, 80.0] 22
Name: Age, dtype: int64
df.Age.groupby(pd.qcut(df.Age, 4)).count()
Age
(-0.001, 6.0] 224
(6.0, 24.0] 230
(24.0, 35.0] 220
(35.0, 80.0] 217
Name: Age, dtype: int64
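The difference between the two outputs above is that `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` chooses the edges so that each bin holds roughly the same number of rows. A minimal sketch on a made-up series of eight ages (the values are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical ages, chosen so the two strategies give different counts.
ages = pd.Series([1, 5, 18, 22, 35, 41, 60, 79])

# Equal-width bins: the range 1..79 is cut into 4 intervals of equal length.
width_counts = pd.cut(ages, bins=4).value_counts().sort_index()

# Quantile bins: edges are the quartiles, so each bin gets about n/4 values.
quantile_counts = pd.qcut(ages, 4).value_counts().sort_index()

print(width_counts)
print(quantile_counts)
```

Equal-width bins keep the interval boundaries easy to interpret. Quantile bins avoid nearly empty bins when the data are skewed.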
Inspecting variable categories
Converting text to numeric variables
df.Sex.unique()
array(['male', 'female', 0], dtype=object)
# Using the replace function
df['Sex'].replace(['male','female'],[1,2]).unique()
array([1, 2, 0], dtype=int64)
# Using the map function (values absent from the dict, here 0, become NaN)
df['Sex'].map({'male':1, 'female':2}).unique()
array([ 1., 2., nan])
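The `nan` in the `map` output appears because `map` returns NaN for any value that is not a key of the dict, whereas `replace` leaves such values as they are. If a default code is wanted for unmapped values, one possible approach is to chain `fillna`; the sketch below assumes `0` is an acceptable default.

```python
import pandas as pd

# Small stand-in for the Sex column, including the stray 0 seen above.
sex = pd.Series(['male', 'female', 0])
mapping = {'male': 1, 'female': 2}

# Unmapped values (the 0) become NaN, which forces a float dtype.
mapped = sex.map(mapping)

# Assumed default of 0 for unmapped values, then back to integers.
mapped_filled = sex.map(mapping).fillna(0).astype(int)

print(mapped.tolist())
print(mapped_filled.tolist())
```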
df.Cabin.nunique(), df.Ticket.nunique()
(135, 543)
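# Handling text features with many categories
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # Option 1: build a value -> integer dict by hand and map it
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Option 2: LabelEncoder; this overwrites the column produced by option 1
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))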
df.Cabin_labelEncode.unique()
array([135, 74, 50, 0, 119, 133, 45, 101, 11, 57, 92, 20, 18,
73, 131, 129, 112, 10, 83, 89, 47, 33, 106, 97, 41, 130,
55, 15, 12, 62, 132, 25, 39, 7, 94, 85, 80, 71, 93,
76, 37, 124, 42, 52, 81, 49, 103, 28, 82, 56, 67, 115,
65, 32, 69, 114, 59, 14, 51, 78, 117, 134, 113, 95, 21,
121, 72, 43, 105, 64, 118, 8, 46, 48, 79, 116, 107, 123,
22, 58, 88, 38, 111, 96, 36, 23, 24, 17, 75, 70, 2,
44, 68, 1, 125, 26, 3, 87, 100, 104, 4, 30, 6, 98,
122, 35, 31, 99, 29, 16, 128, 66, 110, 77, 53, 60, 127,
13, 61, 91, 108, 84, 126, 19, 102, 40, 86, 9, 34, 120,
90, 109, 5, 63, 27, 54])
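As a pandas-only alternative to `LabelEncoder` (a swapped-in technique, not the approach used above), `pd.factorize` also assigns one integer per distinct value. It numbers values in order of first appearance and encodes missing values as -1, instead of turning them into the string `'nan'` as the `astype(str)` call above does. A minimal sketch on a made-up cabin column:

```python
import pandas as pd

# Hypothetical stand-in for the Cabin column, with one missing entry.
cabin = pd.Series(['C85', None, 'C123', 'C85'])

# codes: one integer per row; uniques: the distinct non-missing values,
# in order of first appearance. Missing values get the code -1.
codes, uniques = pd.factorize(cabin)

print(codes.tolist())
print(list(uniques))
```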
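Converting text to one-hot encoding
A dummy variable (also called an indicator variable or nominal variable) is an artificial variable that represents a qualitative attribute. It is a quantified independent variable, usually taking the value 0 or 1, and is commonly used for one-hot feature extraction.
# Convert to one-hot encoding: bin Age first, then convert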
pd.get_dummies(pd.cut(df['Age'],4, labels = [0,1,2,3]), prefix = 'Age').tail()
 | Age_0 | Age_1 | Age_2 | Age_3 |
---|---|---|---|---|
886 | 0 | 1 | 0 | 0 |
887 | 1 | 0 | 0 | 0 |
888 | 1 | 0 | 0 | 0 |
889 | 0 | 1 | 0 | 0 |
890 | 0 | 1 | 0 | 0 |
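The same call works directly on a text column. The sketch below one-hot encodes a made-up `Embarked` column; `dtype=int` is passed so the result holds 0/1 integers regardless of the pandas version, since the default output dtype differs between versions.

```python
import pandas as pd

# Hypothetical stand-in for the Embarked column.
embarked = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

# One indicator column per distinct value; exactly one 1 in each row.
dummies = pd.get_dummies(embarked, prefix='Embarked', dtype=int)
print(dummies)
```

The indicator columns are named `prefix_value` and ordered by value, so they can be joined back to the original frame with `pd.concat([df, dummies], axis=1)`.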
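Extracting a Title feature from the plain-text Name feature
.str.extract() uses a regular expression to pull data out of text into separate columns.
# Extract the title: the run of letters immediately before the first period
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)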
df.Title.value_counts()
Mr 398
Miss 146
Mrs 108
Master 36
Rev 6
Dr 6
Mlle 2
Col 2
Major 2
Sir 1
Jonkheer 1
Mme 1
Countess 1
Don 1
Lady 1
Ms 1
Capt 1
Name: Title, dtype: int64
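To see how the pattern behaves on individual names, here is a small sketch using three names from the table shown earlier. The capture group `([A-Za-z]+)` matches a run of letters and `\.` requires a literal period right after it, so the first such match in each name is the title.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# expand=False returns a Series (one capture group) rather than a DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```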