pandas Notes (in progress): Data Cleaning and Preparation

Handling Missing Data

Detecting missing data

isnull
pandas represents missing data with NaN, which isnull() detects:

>>> string_data = pd.Series(['aar', 'art', np.nan, 'avocado'])
>>> string_data
0        aar
1        art
2        NaN
3    avocado
dtype: object
>>> string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None is also reported as True by isnull(). notnull() is the inverse.
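A quick check of both points (assuming numpy and pandas are imported as np and pd):

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing by isnull()
s = pd.Series([1.0, None, np.nan, 4.0])
print(s.isnull().tolist())   # [False, True, True, False]
print(s.notnull().tolist())  # [True, False, False, True]
```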

Dropping

dropna
By default, dropna drops any row containing an NA. To drop only rows that are entirely NA, pass how='all'; to drop only all-NA columns (leaving all rows alone), pass axis=1, how='all'.

>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])
>>> data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
>>> data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
# equivalent to
>>> data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
>>> data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.dropna()
     0    1    2
0  1.0  6.5  3.0

The thresh parameter keeps only rows with at least that many non-NA values:

>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.dropna(thresh=3)
     0    1    2
0  1.0  6.5  3.0

Filling

fillna
fillna returns a new object; to modify the original in place, pass inplace=True.

>>> data.fillna(0)
     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
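A minimal sketch of the inplace variant; with inplace=True the call mutates the original frame and returns None:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan], [np.nan, 2.0]])
ret = df.fillna(0, inplace=True)  # fills df itself and returns None
print(ret)                        # None
print(df.isnull().values.any())   # False
```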

To fill different columns with different values, pass a dict: {1: 0.5, 2: 0} fills column 1's NAs with 0.5 and column 2's with 0.
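For instance:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [np.nan, np.nan, 3.0]])
# dict keys are column labels; column 0 is not listed, so its NA is kept
filled = df.fillna({1: 0.5, 2: 0})
print(filled)
```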

fillna also takes a method argument: pad / ffill fill each NA with the previous value in the same column (staying NA if there is nothing above), while backfill / bfill use the next value below. (In newer pandas versions the dedicated ffill() / bfill() methods are preferred over method=.)

>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.fillna(method='ffill')
     0    1    2
0  1.0  6.5  3.0
1  1.0  6.5  3.0
2  1.0  6.5  3.0
3  1.0  6.5  3.0
# ffill propagates the last valid value in each column downward,
# so here every NaN is filled from the value above it
>>> data.iloc[0,1] = 5
>>> data.fillna(method='backfill')
     0    1    2
0  1.0  5.0  3.0
1  1.0  6.5  3.0
2  NaN  6.5  3.0
3  NaN  6.5  3.0
# with backfill, [1,1] and [2,1] are filled from [3,1]'s value,
# while [2,0] and [3,0] have nothing below them and stay NaN

The limit parameter caps how much gets filled: with ffill/bfill it is the maximum number of consecutive NaNs to fill. To restrict filling to a single column, pass a dict as above, or operate on that column directly.
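A sketch of limit combined with forward filling — only the first NaN in each consecutive run gets filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
# limit=1 forward-fills at most one consecutive NaN
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, nan, 5.0]
```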

Alternatively, fill with the mean, e.g. data.fillna(data.mean()).
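Here data.mean() is a Series of per-column means, so each column's NAs are filled with that column's own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 4.0]})
# column a's NaN becomes mean(1, 3) = 2.0; column b's becomes mean(2, 4) = 3.0
print(df.fillna(df.mean()))
```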

Removing Duplicates

By default the first occurrence of a row is kept and later occurrences are dropped, but keep='last' keeps the last occurrence and drops the earlier duplicates instead.

data = pd.DataFrame({'k1':['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
# check which rows are duplicates: True only where a row exactly matches an earlier one
print(data.duplicated())
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

print(data.drop_duplicates())
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

# drop rows whose k2 value is duplicated, keeping the last occurrence
print(data.drop_duplicates(['k2'], keep='last'))
    k1  k2
1  two   1
2  one   2
4  one   3
6  two   4

Transforming Data with a Function

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
 'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'
}

data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

Modifying the Index with a Function

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# create the transformation function
transform = lambda x: x[:4].upper()
# map it over the index
data.index = data.index.map(transform)
print(data)
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

Replacing Values

fillna can be seen as a special case of replacing values. Like fillna, replace returns a new object (an easy way to remember it).

data = pd.Series([1., -999., 2., -999., -1000., 3.])
# replace -999 with NaN
print(data.replace(-999, np.nan))
# replace multiple values with a single value
print(data.replace([-999, -1000], np.nan))
# pass two lists to replace each value with its counterpart
print(data.replace([-999, -1000], [np.nan, 0]))
# a dict works too
print(data.replace({-999:np.nan}))

Renaming the Index

Use rename. Note that rename also returns a new object; to avoid that, pass inplace=True.

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# str.title capitalizes only the first letter
print(data.rename(index=str.title, columns=str.upper))
# rename according to the dicts
print(data.rename(index={'Ohio':'India'}, columns={'three':'3'}))
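And a sketch of the inplace form, which mutates the frame and returns None:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)),
                  index=['Ohio', 'Colorado'],
                  columns=['one', 'two'])
df.rename(index={'Ohio': 'India'}, inplace=True)  # modifies df directly
print(df.index.tolist())  # ['India', 'Colorado']
```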

Binning

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# passing labels gives the bins custom names:
# pd.cut(ages, bins, labels=group_names)
cats = pd.cut(ages, bins)
# the bin code for each value
print(cats.codes)
# [0 0 0 1 0 0 2 1 3 2 2 1]
print(cats.categories)
#IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
print(pd.value_counts(cats))
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

If you don't have exact boundaries, you can instead pass a number of bins, and pandas computes equal-width bins from the minimum and maximum:

data = np.random.rand(20)
# precision=2 limits bin edges to two decimal places
print(pd.cut(data, 4, precision=2))
[(0.73, 0.98], (0.01, 0.25], (0.01, 0.25], (0.01, 0.25], (0.01, 0.25], ..., (0.49, 0.73], (0.01, 0.25], (0.01, 0.25], (0.73, 0.98], (0.49, 0.73]]
Length: 20
Categories (4, interval[float64, right]): [(0.01, 0.25] < (0.25, 0.49] < (0.49, 0.73] <
                                           (0.73, 0.98]]

If you want each bin to hold the same number of points, use qcut, which bins by sample quantiles (with 4 bins, quartiles):

cats = pd.qcut(data, 4) # pd.value_counts(cats) shows 4 groups of equal size
# you can also specify your own quantiles
cat = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
print(pd.value_counts(cat))
(0.153, 0.552]     8
(0.552, 0.809]     8
(0.0934, 0.153]    2
(0.809, 0.99]      2
dtype: int64

Detecting and Filtering Outliers

data = pd.DataFrame(np.random.randn(1000, 4))
print(data.describe())
# select the rows of one column whose absolute value exceeds 3
col = data[2]
print(col[np.abs(col) > 3])
# select every row containing a value whose absolute value exceeds 3
print(data[(np.abs(data) > 3).any(axis=1)])
# cap values above 3 at 3 and below -3 at -3
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())
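The capping line works because np.sign(data) is -1, 0, or 1 element-wise, so np.sign(data) * 3 is ±3 with the original sign, and the boolean mask writes those caps only where |value| > 3. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [-5.0, -1.0, 0.5, 8.0]})
# only -5.0 and 8.0 exceed the threshold, so only they are capped
df[np.abs(df) > 3] = np.sign(df) * 3
print(df['x'].tolist())  # [-3.0, -1.0, 0.5, 3.0]
```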
