isnull
pandas uses NaN to represent missing data; isnull() makes it visible:
>>> string_data = pd.Series(['aar', 'art', np.nan, 'avocado'])
>>> string_data
0 aar
1 art
2 NaN
3 avocado
dtype: object
>>> string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
Python's built-in None is also reported as True by isnull(). notnull() is the element-wise inverse.
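A quick sketch confirming both points above (None flagged as missing, notnull as the inverse):

```python
import numpy as np
import pandas as pd

s = pd.Series(['aar', None, np.nan])
mask = s.isnull()       # None and np.nan are both flagged as missing
inverse = s.notnull()   # element-wise negation of isnull()
print(mask)
print(inverse)
```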
dropna
dropna by default drops any row that contains an NA. To drop only rows in which every value is NA, pass how='all'; to drop only all-NA columns (leaving all-NA rows and partially-NA rows alone), pass axis=1, how='all'.
>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])
>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
>>> data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
# equivalent to boolean indexing:
>>> data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
>>> data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> data.dropna()
0 1 2
0 1.0 6.5 3.0
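The how='all' and axis=1 variants described above can be sketched on a small frame (a minimal example, not from the original notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [2.0, 3.0, np.nan]])
# how='all' drops only row 1, which is entirely NA
rows_kept = df.dropna(how='all')
# axis=1 with how='all' drops only column 2, which is entirely NA
cols_kept = df.dropna(axis=1, how='all')
print(rows_kept)
print(cols_kept)
```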
The thresh parameter keeps only rows with at least that many non-NA values:
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> data.dropna(thresh=3)
0 1 2
0 1.0 6.5 3.0
fillna
fillna returns a new object; to modify in place, pass inplace=True.
>>> data.fillna(0)
0 1 2
0 1.0 6.5 3.0
1 1.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 6.5 3.0
To fill different columns with different values, pass a dict: e.g. {1: 0.5, 2: 0} replaces NAs in column 1 with 0.5 and NAs in column 2 with 0.
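A sketch of the dict form, reusing the frame from the dropna examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])
# column 1's NAs become 0.5, column 2's become 0; column 0 is untouched
filled = df.fillna({1: 0.5, 2: 0})
print(filled)
```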
fillna also accepts a method argument: 'pad'/'ffill' fill each NA with the previous valid value in the same column (if there is none, it stays NA); 'backfill'/'bfill' use the next valid value below.
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> data.fillna(method='ffill')
0 1 2
0 1.0 6.5 3.0
1 1.0 6.5 3.0
2 1.0 6.5 3.0
3 1.0 6.5 3.0
# With ffill, [1,1] and [2,1] take the 6.5 from [0,1] above them,
# and [2,0], [3,0] carry forward the 1.0 from [1,0]
>>> data.iloc[0,1] = 5
>>> data.fillna(method='backfill')
0 1 2
0 1.0 5.0 3.0
1 1.0 6.5 3.0
2 NaN 6.5 3.0
3 NaN 6.5 3.0
# With backfill, [1,1] and [2,1] are filled from [3,1] below them,
# while [2,0] and [3,0] have nothing below, so they stay NaN
The limit parameter caps how many consecutive NAs get filled (it is not a column selector); to fill only one column, pass a dict keyed by column, or call fillna on that column directly.
You can also fill with a statistic, e.g. data.fillna(data.mean()) fills each column with its mean.
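Both points can be sketched on a small Series (using Series.ffill, the non-deprecated spelling of fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
# limit=2: fill at most two consecutive NaNs going forward; the third stays NaN
limited = s.ffill(limit=2)
# fill with the mean of the non-missing values: (1 + 5) / 2 = 3.0
mean_filled = s.fillna(s.mean())
print(limited)
print(mean_filled)
```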
duplicated / drop_duplicates
By default the first occurrence of a duplicated row is kept and later repeats are dropped; keep='last' instead keeps the last occurrence and drops the earlier ones.
data = pd.DataFrame({'k1':['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
# Check for duplicates: a row is True only when it repeats an earlier row entirely
print(data.duplicated())
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
print(data.drop_duplicates())
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
# consider only column k2 when deciding what counts as a duplicate
print(data.drop_duplicates(['k2'], keep='last'))
k1 k2
1 two 1
2 one 2
4 one 3
6 two 4
map
map applies a function (or dict lookup) to every element; here it lowercases each food name and looks it up:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
Modifying the index with a function
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
# define a transform: keep the first four characters, upper-cased
transform = lambda x: x[:4].upper()
# apply it to the index with map
data.index = data.index.map(transform)
print(data)
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
replace
fillna can be seen as a special case of value replacement; like fillna, replace returns a new object (an easy way to remember it).
data = pd.Series([1., -999., 2., -999., -1000., 3.])
# replace -999 with NaN
print(data.replace(-999, np.nan))
# replace several values with a single value
print(data.replace([-999, -1000], np.nan))
# pass two lists: each value in the first is replaced by the corresponding value in the second
print(data.replace([-999, -1000], [np.nan, 0]))
# a dict works too
print(data.replace({-999: np.nan}))
rename
rename also returns a new object; pass inplace=True if you don't want one created.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
# str.title capitalizes just the first letter
print(data.rename(index=str.title, columns=str.upper))
# rename according to dicts
print(data.rename(index={'Ohio':'India'}, columns={'three':'3'}))
cut / qcut
pd.cut bins values into intervals defined by the given edges:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# the labels argument names the bins, e.g.
# pd.cut(ages, bins, labels=group_names)
cats = pd.cut(ages, bins)
# integer codes of the bin each value falls into
print(cats.codes)
# [0 0 0 1 0 0 2 1 3 2 2 1]
print(cats.categories)
#IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
print(pd.value_counts(cats))
(18, 25] 5
(25, 35] 3
(35, 60] 3
(60, 100] 1
dtype: int64
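The labels argument mentioned above can be sketched like this (the group name strings are made up for illustration):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# hypothetical label names, one per bin
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
cats = pd.cut(ages, bins, labels=group_names)
print(cats[:3])
```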
If you have no exact edges, pass the number of bins instead; pandas computes equal-width bins from the data's min and max.
data = np.random.rand(20)
# precision=2 limits the bin edges to 2 decimal places
print(pd.cut(data, 4, precision=2))
[(0.73, 0.98], (0.01, 0.25], (0.01, 0.25], (0.01, 0.25], (0.01, 0.25], ..., (0.49, 0.73], (0.01, 0.25], (0.01, 0.25], (0.73, 0.98], (0.49, 0.73]]
Length: 20
Categories (4, interval[float64, right]): [(0.01, 0.25] < (0.25, 0.49] < (0.49, 0.73] <
(0.73, 0.98]]
If you want each bin to hold the same number of points, use qcut, which cuts at sample quantiles.
cats = pd.qcut(data, 4) # pd.value_counts(cats) shows 4 bins with equal counts
# but you can also pass your own quantiles
cat = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
print(pd.value_counts(cat))
(0.153, 0.552] 8
(0.552, 0.809] 8
(0.0934, 0.153] 2
(0.809, 0.99] 2
dtype: int64
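The equal-count claim for qcut is easy to verify on fresh data (the seed here is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.random(20)
cats = pd.qcut(data, 4)
# each quantile bin receives the same number of points: 20 / 4 = 5
counts = pd.Series(cats).value_counts()
print(counts)
```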
Detecting and filtering outliers
data = pd.DataFrame(np.random.randn(1000, 4))
print(data.describe())
# values in one column whose absolute value exceeds 3
col = data[2]
print(col[np.abs(col) > 3])
# rows containing at least one value with absolute value > 3
print(data[(np.abs(data) > 3).any(axis=1)])
# cap outliers: values > 3 become 3, values < -3 become -3
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())