In pandas, np.nan represents missing data; such values are excluded from computations by default.
reindex() lets you change, add, or delete the index on a specified axis, and returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
'''
A B C D F E
2013-01-01 0.000000 0.000000 -0.541762 5 NaN 1.0
2013-01-02 -0.884117 -0.650741 0.217345 5 1.0 1.0
2013-01-03 0.220822 0.790527 0.692172 5 2.0 NaN
2013-01-04 1.260276 1.000297 0.809801 5 3.0 NaN
'''
Drop any rows that contain missing data:
df1.dropna(how='any')
'''
A B C D F E
2013-01-02 -0.884117 -0.650741 0.217345 5 1.0 1.0
'''
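Besides how='any', dropna() also accepts how='all' (drop only rows where every value is missing) and a subset= of columns to consider. A minimal standalone sketch on toy data, not the df1 above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, 3.0, np.nan]})
# how='any' drops a row if any value is missing; how='all' only if all are
kept_any = toy.dropna(how='any')
kept_all = toy.dropna(how='all')
print(len(kept_any), len(kept_all))  # 1 2
```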
Fill in missing data:
df1.fillna(value=5)
'''
A B C D F E
2013-01-01 0.000000 0.000000 -0.541762 5 5.0 1.0
2013-01-02 -0.884117 -0.650741 0.217345 5 1.0 1.0
2013-01-03 0.220822 0.790527 0.692172 5 2.0 5.0
2013-01-04 1.260276 1.000297 0.809801 5 3.0 5.0
'''
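Besides filling with a scalar, missing values can also be propagated from neighboring rows. A small sketch on standalone data, not the df1 above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# Forward-fill: each NaN takes the last valid value before it
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```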
Get the boolean mask of which values are NaN:
pd.isnull(df1)
'''
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
'''
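A common follow-up to the boolean mask is counting missing values per column; since True sums as 1, this is just a sum over the mask. A toy sketch:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, np.nan, 1.0]})
# The boolean mask sums as 0/1, giving the NaN count per column
na_counts = demo.isnull().sum()
print(na_counts.to_dict())  # {'x': 1, 'y': 2}
```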
Operations in general exclude missing data.
Performing a descriptive statistic:
df.mean()
'''
A 0.474038
B 0.474008
C 0.442600
D 5.000000
F 3.000000
dtype: float64
'''
The same operation on the other axis:
df.mean(1)
'''
2013-01-01 1.114559
2013-01-02 0.936497
2013-01-03 1.740704
2013-01-04 2.214075
2013-01-05 2.501728
2013-01-06 2.384123
Freq: D, dtype: float64
'''
When operating between objects of different dimensionality, pandas first aligns them on their labels; in addition, it automatically broadcasts along the specified dimension.
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
'''
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
'''
df.sub(s, axis='index')
'''
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.779178 -0.209473 -0.307828 4.0 1.0
2013-01-04 -1.739724 -1.999703 -2.190199 2.0 0.0
2013-01-05 -3.320619 -3.531391 -4.639352 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
'''
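The alignment behavior above can be reproduced on tiny, fully standalone data (toy values, not the df/s above): after shift(1), the first label has no matching value, so subtraction yields NaN there.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=4)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=idx)
shifted = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx).shift(1)
# Rows are aligned by label first; labels with no valid match give NaN
diff = frame.sub(shifted, axis='index')
print(diff['A'].tolist())  # [nan, -8.0, -17.0, -26.0]
```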
Applying a function to the data with apply():
df.apply(np.cumsum)
'''
A B C D F
2013-01-01 0.000000 0.000000 -0.541762 5 NaN
2013-01-02 -0.884117 -0.650741 -0.324417 10 1.0
2013-01-03 -0.663295 0.139786 0.367755 15 3.0
2013-01-04 0.596982 1.140083 1.177556 20 6.0
2013-01-05 2.276362 2.608693 1.538204 25 10.0
2013-01-06 2.844229 2.844045 2.655599 30 15.0
'''
df.apply(lambda x:x.max()-x.min())
'''
A 2.563498
B 2.119350
C 1.659157
D 0.000000
F 4.000000
dtype: float64
'''
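By default apply() works column-wise, as above; passing axis=1 applies the function to each row instead. A small standalone sketch:

```python
import pandas as pd

tbl = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 6.0, 8.0]})
# By default the function receives each column as a Series
col_range = tbl.apply(lambda x: x.max() - x.min())
print(col_range.tolist())  # [2.0, 4.0]
# axis=1 passes each row instead
row_range = tbl.apply(lambda x: x.max() - x.min(), axis=1)
print(row_range.tolist())  # [3.0, 4.0, 5.0]
```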
Counting how often each value appears (histogramming):
s = pd.Series(np.random.randint(0, 7, size=10))
s
'''
0 2
1 0
2 5
3 1
4 6
5 1
6 1
7 1
8 5
9 4
dtype: int32
'''
s.value_counts()
'''
1 4
5 2
6 1
4 1
2 1
0 1
dtype: int64
'''
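value_counts() sorts by count in descending order; with normalize=True it returns relative frequencies instead of counts. A deterministic toy example:

```python
import pandas as pd

vals = pd.Series([1, 1, 2, 3, 3, 3])
# value_counts() sorts by count, descending
counts = vals.value_counts()
print(counts.to_dict())  # {3: 3, 1: 2, 2: 1}
# normalize=True returns relative frequencies instead of counts
freqs = vals.value_counts(normalize=True)
print(freqs.loc[3])  # 0.5
```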
Series comes with a set of string processing methods under its str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str methods uses regular expressions by default.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
'''
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
'''
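The regex-by-default behavior mentioned above can be seen with str.contains(), which also takes regex=False to match literally. A small sketch:

```python
import pandas as pd

strs = pd.Series(['Aaba', 'Baca', 'CABA'])
# contains() treats its pattern as a regular expression by default
rx = strs.str.contains('^A')
print(rx.tolist())  # [True, False, False]
# regex=False matches the pattern as a literal substring instead
lit = strs.str.contains('^A', regex=False)
print(lit.tolist())  # [False, False, False]
```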
pandas provides a variety of methods for easily combining Series, DataFrame, and Panel objects according to different kinds of set logic.
Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10,4))
df
'''
0 1 2 3
0 1.560690 2.253479 1.728586 1.224112
1 -1.237557 -1.571768 -1.687004 -0.226474
2 -0.591146 -0.054644 0.600806 0.076132
3 -0.567678 0.426496 -0.972487 0.200211
4 -2.073311 -1.566767 -0.533602 1.366468
5 2.244767 1.612232 1.934717 -0.403805
6 -2.640917 0.640549 1.257238 0.043773
7 1.545405 1.771884 -0.273687 2.441483
8 -0.440476 0.567536 2.379072 1.152354
9 -0.047853 -0.440427 -1.382389 0.647217
'''
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
'''
[ 0 1 2 3
0 1.560690 2.253479 1.728586 1.224112
1 -1.237557 -1.571768 -1.687004 -0.226474
2 -0.591146 -0.054644 0.600806 0.076132,
0 1 2 3
3 -0.567678 0.426496 -0.972487 0.200211
4 -2.073311 -1.566767 -0.533602 1.366468
5 2.244767 1.612232 1.934717 -0.403805
6 -2.640917 0.640549 1.257238 0.043773,
0 1 2 3
7 1.545405 1.771884 -0.273687 2.441483
8 -0.440476 0.567536 2.379072 1.152354
9 -0.047853 -0.440427 -1.382389 0.647217]
'''
pd.concat(pieces)
'''
0 1 2 3
0 1.560690 2.253479 1.728586 1.224112
1 -1.237557 -1.571768 -1.687004 -0.226474
2 -0.591146 -0.054644 0.600806 0.076132
3 -0.567678 0.426496 -0.972487 0.200211
4 -2.073311 -1.566767 -0.533602 1.366468
5 2.244767 1.612232 1.934717 -0.403805
6 -2.640917 0.640549 1.257238 0.043773
7 1.545405 1.771884 -0.273687 2.441483
8 -0.440476 0.567536 2.379072 1.152354
9 -0.047853 -0.440427 -1.382389 0.647217
'''
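concat() can also label each piece via keys=, producing a MultiIndex on the rows so you can recover the pieces later. A small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
# keys= labels each piece, producing a MultiIndex on the rows
combined = pd.concat([df1, df2], keys=['x', 'y'])
print(combined.loc['y', 'A'].tolist())  # [3, 4]
```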
SQL-style merges:
left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left
'''
key lval
0 foo 1
1 foo 2
'''
right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right
'''
key lval
0 foo 4
1 foo 5
'''
pd.merge(left, right, on='key')
'''
key lval_x lval_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
'''
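As in SQL, the join type is controlled by the how= parameter; the default is an inner join. A sketch with non-overlapping keys (toy frames, not the left/right above):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
# Inner join (the default) keeps only keys present in both frames
inner = pd.merge(left, right, on='key')
print(inner['key'].tolist())  # ['foo']
# Outer join keeps all keys, filling missing values with NaN
outer = pd.merge(left, right, on='key', how='outer')
print(sorted(outer['key'].tolist()))  # ['bar', 'baz', 'foo']
```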
Append rows to a DataFrame:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
'''
A B C D
0 0.415810 -1.106857 -0.687920 2.422911
1 0.696149 -1.235975 0.201409 1.424596
2 -0.540622 0.121096 -0.861667 -0.171690
3 0.163904 1.324567 -0.768324 -0.205520
4 -1.581152 -0.079061 0.251810 -0.195755
5 1.254246 1.604556 0.766464 -1.090743
6 0.608609 1.000765 -0.407980 0.034970
7 -3.111914 2.163344 0.619885 -0.705518
'''
s = df.iloc[3]
s
'''
A 0.163904
B 1.324567
C -0.768324
D -0.205520
Name: 3, dtype: float64
'''
df.append(s, ignore_index=True)
'''
A B C D
0 0.415810 -1.106857 -0.687920 2.422911
1 0.696149 -1.235975 0.201409 1.424596
2 -0.540622 0.121096 -0.861667 -0.171690
3 0.163904 1.324567 -0.768324 -0.205520
4 -1.581152 -0.079061 0.251810 -0.195755
5 1.254246 1.604556 0.766464 -1.090743
6 0.608609 1.000765 -0.407980 0.034970
7 -3.111914 2.163344 0.619885 -0.705518
8 0.163904 1.324567 -0.768324 -0.205520
'''
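Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; the same result can be obtained with pd.concat(). A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

base = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
row = base.iloc[1]
# to_frame().T turns the row Series back into a one-row DataFrame
grown = pd.concat([base, row.to_frame().T], ignore_index=True)
print(len(grown))  # 3
```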
By "group by" we mean a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
'''
A B C D
0 foo one 0.190663 0.589384
1 bar one 1.056331 0.035044
2 foo two 0.723645 0.372672
3 bar three -1.306869 0.435296
4 foo two 0.673661 -1.292242
5 bar two -0.823728 0.837556
6 foo one 0.638573 2.453041
7 bar three 0.508922 0.578740
'''
Grouping and then applying sum() to the resulting groups:
df.groupby('A').sum()
'''
C D
A
bar -0.565344 1.886637
foo 2.226542 2.122855
'''
Grouping by multiple columns forms a hierarchical index, to which we can again apply a function:
df.groupby(['A','B']).sum()
'''
C D
A B
bar one 1.056331 0.035044
three -0.797947 1.014036
two -0.823728 0.837556
foo one 0.829236 3.042425
two 1.397306 -0.919570
'''
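Beyond a single reduction like sum, agg() applies several reductions to each group at once. A deterministic toy sketch:

```python
import pandas as pd

grp = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})
# agg() applies several reductions to each group at once
stats = grp.groupby('A')['C'].agg(['sum', 'mean'])
print(stats.loc['foo', 'sum'], stats.loc['foo', 'mean'])  # 4.0 2.0
```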
Hierarchical indexing (MultiIndex) lets an axis carry multiple levels:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
'''
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
'''
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
'''
A B
first second
bar one 0.033187 -1.575675
two -0.116085 0.463558
baz one -1.264759 -0.076262
two 0.290001 -0.908792
foo one 0.417612 -0.119032
two -0.041429 0.372586
qux one -1.233305 -2.380777
two -0.647859 -0.367107
'''
df2 = df[:4]
df2
'''
A B
first second
bar one 0.033187 -1.575675
two -0.116085 0.463558
baz one -1.264759 -0.076262
two 0.290001 -0.908792
'''
The stack() method "compresses" a level in the DataFrame's columns:
stacked = df2.stack()
stacked
'''
first  second
bar    one     A    0.033187
               B   -1.575675
       two     A   -0.116085
               B    0.463558
baz    one     A   -1.264759
               B   -0.076262
       two     A    0.290001
               B   -0.908792
dtype: float64
'''
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
'''
                     A         B
first second
bar   one     0.033187 -1.575675
      two    -0.116085  0.463558
baz   one    -1.264759 -0.076262
      two     0.290001 -0.908792
'''
stacked.unstack(1)
'''
second        one       two
first
bar   A  0.033187 -0.116085
      B -1.575675  0.463558
baz   A -1.264759  0.290001
      B -0.076262 -0.908792
'''
stacked.unstack(0)
'''
first         bar       baz
second
one    A  0.033187 -1.264759
       B -1.575675 -0.076262
two    A -0.116085  0.290001
       B  0.463558 -0.908792
'''
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
df
'''
A B C D E
0 one A foo 0.328414 -0.219345
1 one B foo -0.363916 1.015422
2 two C foo -0.413828 -0.346556
3 three A bar -0.817349 -0.561905
4 one B bar 0.421502 0.410673
5 one C bar 0.147630 -0.646454
6 two A foo 3.257885 1.025650
7 three B foo 0.664719 0.004742
8 one C foo 0.875158 -0.921567
9 one A bar -1.001131 0.867019
10 two B bar -0.260534 -0.202553
11 three C bar -0.142559 0.114470
'''
We can produce pivot tables from this data very easily:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
'''
C bar foo
A B
one A -0.244547 -1.141382
B -0.784806 -1.055516
C 0.364710 -0.624468
three A 1.100998 NaN
B NaN -0.522151
C -0.442411 NaN
two A NaN -1.758408
B 1.586401 NaN
C NaN 0.601868
'''
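pivot_table() also accepts aggfunc= to control the aggregation (the default is the mean) and fill_value= to replace the NaN holes. A deterministic toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two'],
                   'C': ['bar', 'foo', 'foo'],
                   'D': [1.0, 2.0, 3.0]})
# aggfunc= controls the aggregation; fill_value= replaces the NaN holes
table = pd.pivot_table(df, values='D', index='A', columns='C',
                       aggfunc='sum', fill_value=0)
print(table.loc['two', 'bar'])
```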