In many applications, data may be spread across a number of files or databases, in a form that is not convenient for analysis. Data wrangling is the process of unifying such scattered data. The operations covered here include:
- join
- combine
- reshape
- merge
- concatenate
- pivot
- stack
import pandas as pd
import numpy as np
Combining and Merging Datasets
pandas provides several ways to combine data:
- pandas.merge connects rows in DataFrames based on one or more keys. This is analogous to the join operation in SQL and other relational databases.
- pandas.concat concatenates or "stacks" objects together along an axis.
- The combine_first instance method enables splicing together overlapping data, using values from one object to fill in missing values in the other.
Examples of each of these are given below; these operations are used throughout the rest of the book.
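As a quick taste of combine_first before the longer examples below, here is a minimal sketch; the index labels and numbers are made up for illustration:

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping, differently ordered indexes
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# Keep b's values where present; fall back to the aligned value from a
result = b.combine_first(a)
```

The result's index is the union of the two inputs' indexes, so no labels are lost.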
1. Concatenating Along an Axis
Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function can do this with NumPy arrays:
arr = np.arange(12.).reshape((3, 4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
np.concatenate([arr, arr], axis=1)
array([[ 0., 1., 2., 3., 0., 1., 2., 3.],
[ 4., 5., 6., 7., 4., 5., 6., 7.],
[ 8., 9., 10., 11., 8., 9., 10., 11.]])
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
df = pd.concat([s1, s2, s3], axis=1, keys=['one','two','three'])
df
| | one | two | three |
| --- | --- | --- | --- |
| a | 0.0 | NaN | NaN |
| b | 1.0 | NaN | NaN |
| c | NaN | 2.0 | NaN |
| d | NaN | 3.0 | NaN |
| e | NaN | 4.0 | NaN |
| f | NaN | NaN | 5.0 |
| g | NaN | NaN | 6.0 |
The join_axes argument used in older pandas versions has since been removed; to select a particular set of index labels, reindex the result instead:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three']).reindex(['a', 'c', 'b', 'e'])
| | one | two | three |
| --- | --- | --- | --- |
| a | 0.0 | NaN | NaN |
| c | NaN | 2.0 | NaN |
| b | 1.0 | NaN | NaN |
| e | NaN | 4.0 | NaN |
df.stack()
a one 0.0
b one 1.0
c two 2.0
d two 3.0
e two 4.0
f three 5.0
g three 6.0
dtype: float64
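For contrast with the axis=1 call above, pd.concat defaults to axis=0 and simply glues the pieces end to end; passing keys then creates a hierarchical index on the concatenation axis. A small sketch re-using the same three Series:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# axis=0 (the default) produces one longer Series; keys labels each piece,
# yielding a two-level (hierarchical) index on the result
stacked = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
```

Each value is then addressable by a (key, label) pair, e.g. `stacked.loc[('two', 'c')]`.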
2. Database-Style DataFrame Joins
Merge or join operations combine datasets by linking rows using one or more keys. Such operations are central to relational databases. The merge function in pandas is the main entry point for these operations:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1
| | data1 | key |
| --- | --- | --- |
| 0 | 0 | b |
| 1 | 1 | b |
| 2 | 2 | a |
| 3 | 3 | c |
| 4 | 4 | a |
| 5 | 5 | a |
| 6 | 6 | b |
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
df2
| | data2 | key |
| --- | --- | --- |
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | d |
df = pd.merge(df1, df2, on='key')
df
| | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0 | b | 1 |
| 1 | 1 | b | 1 |
| 2 | 6 | b | 1 |
| 3 | 2 | a | 0 |
| 4 | 4 | a | 0 |
| 5 | 5 | a | 0 |
If the column names are different in each object, you can specify them separately:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
| | data1 | lkey | data2 | rkey |
| --- | --- | --- | --- | --- |
| 0 | 0 | b | 1 | b |
| 1 | 1 | b | 1 | b |
| 2 | 6 | b | 1 | b |
| 3 | 2 | a | 0 | a |
| 4 | 4 | a | 0 | a |
| 5 | 5 | a | 0 | a |
pd.merge(df1, df2, how='outer')
| | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0.0 | b | 1.0 |
| 1 | 1.0 | b | 1.0 |
| 2 | 6.0 | b | 1.0 |
| 3 | 2.0 | a | 0.0 |
| 4 | 4.0 | a | 0.0 |
| 5 | 5.0 | a | 0.0 |
| 6 | 3.0 | c | NaN |
| 7 | NaN | d | 2.0 |
These are the options for how: 'inner' (use only the key combinations observed in both tables, the default), 'left' (use all key combinations found in the left table), 'right' (use all key combinations found in the right table), and 'outer' (use all key combinations observed in both tables together):
pd.merge(df1, df2, on='key', how='left')
| | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0 | b | 1.0 |
| 1 | 1 | b | 1.0 |
| 2 | 2 | a | 0.0 |
| 3 | 3 | c | NaN |
| 4 | 4 | a | 0.0 |
| 5 | 5 | a | 0.0 |
| 6 | 6 | b | 1.0 |
pd.merge(df1, df2, on='key', how='outer')
| | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0.0 | b | 1.0 |
| 1 | 1.0 | b | 1.0 |
| 2 | 6.0 | b | 1.0 |
| 3 | 2.0 | a | 0.0 |
| 4 | 4.0 | a | 0.0 |
| 5 | 5.0 | a | 0.0 |
| 6 | 3.0 | c | NaN |
| 7 | NaN | d | 2.0 |
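Beyond the how option, merge can join on multiple keys by passing a list of column names, and the suffixes option controls how overlapping non-key column names are disambiguated. A sketch with made-up frames (the names left/right, lval/rval are illustrative, not from the text):

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Join on multiple keys by passing a list of column names
merged = pd.merge(left, right, on=['key1', 'key2'], how='outer')

# When joining on key1 only, both key2 columns survive and are
# disambiguated with the given suffixes (default: '_x', '_y')
overlap = pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
```

Think of the multiple-key case as forming tuples of the key columns and joining on those tuples.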
Reshaping and Pivoting
1. Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
- stack: this "rotates" or pivots the columns into the rows
- unstack: this pivots the rows into the columns
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number'))
data
| number | one | two | three |
| --- | --- | --- | --- |
| state | | | |
| Ohio | 0 | 1 | 2 |
| Colorado | 3 | 4 | 5 |
result = data.stack()
result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
result.unstack(1)
| number | one | two | three |
| --- | --- | --- | --- |
| state | | | |
| Ohio | 0 | 1 | 2 |
| Colorado | 3 | 4 | 5 |
result.unstack(0)
| state | Ohio | Colorado |
| --- | --- | --- |
| number | | |
| one | 0 | 3 |
| two | 1 | 4 |
| three | 2 | 5 |
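Note that unstacking can introduce missing data when not all of the values in a level are found in each subgroup. A small sketch with two made-up Series:

```python
import numpy as np
import pandas as pd

# Two Series whose indexes only partially overlap (values are illustrative)
sa = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
sb = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([sa, sb], keys=['one', 'two'])

# 'one' has no 'e' label and 'two' has no 'a'/'b', so unstacking
# introduces NaN for the missing combinations
unstacked = data2.unstack()
```

Stacking the result back filters out those missing entries, so a stack/unstack round trip need not be lossless.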
A common format for storing multiple time series in databases and CSV files is the so-called long or stacked format. Let's load some example data and do a little time series wrangling and data cleaning:
data = pd.read_csv('../examples/macrodata.csv')
data.head()
| | year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
| 1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
| 2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
| 3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
| 4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
| item | realgdp | infl | unemp |
| --- | --- | --- | --- |
| 0 | 2710.349 | 0.00 | 5.8 |
| 1 | 2778.801 | 2.34 | 5.1 |
| 2 | 2775.488 | 2.74 | 5.3 |
| 3 | 2785.204 | 0.27 | 5.6 |
| 4 | 2847.699 | 2.31 | 5.2 |
data.index = periods.to_timestamp('D', 'end')
data[:10]
| item | realgdp | infl | unemp |
| --- | --- | --- | --- |
| date | | | |
| 1959-03-31 | 2710.349 | 0.00 | 5.8 |
| 1959-06-30 | 2778.801 | 2.34 | 5.1 |
| 1959-09-30 | 2775.488 | 2.74 | 5.3 |
| 1959-12-31 | 2785.204 | 0.27 | 5.6 |
| 1960-03-31 | 2847.699 | 2.31 | 5.2 |
| 1960-06-30 | 2834.390 | 0.14 | 5.2 |
| 1960-09-30 | 2839.022 | 2.70 | 5.6 |
| 1960-12-31 | 2802.616 | 1.21 | 6.3 |
| 1961-03-31 | 2819.264 | -0.40 | 6.8 |
| 1961-06-30 | 2872.005 | 1.47 | 7.0 |
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]
| | date | item | value |
| --- | --- | --- | --- |
| 0 | 1959-03-31 | realgdp | 2710.349 |
| 1 | 1959-03-31 | infl | 0.000 |
| 2 | 1959-03-31 | unemp | 5.800 |
| 3 | 1959-06-30 | realgdp | 2778.801 |
| 4 | 1959-06-30 | infl | 2.340 |
| 5 | 1959-06-30 | unemp | 5.100 |
| 6 | 1959-09-30 | realgdp | 2775.488 |
| 7 | 1959-09-30 | infl | 2.740 |
| 8 | 1959-09-30 | unemp | 5.300 |
| 9 | 1959-12-31 | realgdp | 2785.204 |
pivoted = ldata.pivot(index='date', columns='item', values='value')
pivoted[:10]
| item | infl | realgdp | unemp |
| --- | --- | --- | --- |
| date | | | |
| 1959-03-31 | 0.00 | 2710.349 | 5.8 |
| 1959-06-30 | 2.34 | 2778.801 | 5.1 |
| 1959-09-30 | 2.74 | 2775.488 | 5.3 |
| 1959-12-31 | 0.27 | 2785.204 | 5.6 |
| 1960-03-31 | 2.31 | 2847.699 | 5.2 |
| 1960-06-30 | 0.14 | 2834.390 | 5.2 |
| 1960-09-30 | 2.70 | 2839.022 | 5.6 |
| 1960-12-31 | 1.21 | 2802.616 | 6.3 |
| 1961-03-31 | -0.40 | 2819.264 | 6.8 |
| 1961-06-30 | 1.47 | 2872.005 | 7.0 |
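pivot is just a shortcut for creating a hierarchical index with set_index and then reshaping with unstack. A sketch on a tiny frame in the same long format (the dates and values are made up):

```python
import pandas as pd

# A small long-format frame: one row per (date, item) observation
ldata = pd.DataFrame({'date': ['2020-03-31', '2020-03-31',
                               '2020-06-30', '2020-06-30'],
                      'item': ['realgdp', 'unemp'] * 2,
                      'value': [100.0, 5.0, 110.0, 6.0]})

pivoted = ldata.pivot(index='date', columns='item', values='value')

# Equivalent: build a (date, item) hierarchical index, then unstack 'item'
equivalent = ldata.set_index(['date', 'item'])['value'].unstack()
```

The set_index/unstack form is handy when you need finer control, e.g. selecting several value columns before unstacking.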
The inverse operation of pivot for a DataFrame is pandas.melt. Rather than transforming one column into many, it merges multiple columns into one, producing a DataFrame that is longer than the input:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]})
df
| | A | B | C | key |
| --- | --- | --- | --- | --- |
| 0 | 1 | 4 | 7 | foo |
| 1 | 2 | 5 | 8 | bar |
| 2 | 3 | 6 | 9 | baz |
melted = pd.melt(df, ['key'])
melted
| | key | variable | value |
| --- | --- | --- | --- |
| 0 | foo | A | 1 |
| 1 | bar | A | 2 |
| 2 | baz | A | 3 |
| 3 | foo | B | 4 |
| 4 | bar | B | 5 |
| 5 | baz | B | 6 |
| 6 | foo | C | 7 |
| 7 | bar | C | 8 |
| 8 | baz | C | 9 |
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped
| variable | A | B | C |
| --- | --- | --- | --- |
| key | | | |
| bar | 2 | 5 | 8 |
| baz | 3 | 6 | 9 |
| foo | 1 | 4 | 7 |
reshaped.reset_index()
| variable | key | A | B | C |
| --- | --- | --- | --- | --- |
| 0 | bar | 2 | 5 | 8 |
| 1 | baz | 3 | 6 | 9 |
| 2 | foo | 1 | 4 | 7 |
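melt can also be restricted to a subset of columns with value_vars, or used without any group identifiers at all. Re-using the df defined above:

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Melt only columns A and B, keeping 'key' as the group identifier
subset = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

# melt with no id_vars: every listed column becomes (variable, value) pairs
no_id = pd.melt(df, value_vars=['A', 'B', 'C'])
```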
Hierarchical Indexing
data = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a 1 -0.505050
2 0.483053
3 -1.166091
b 1 1.580432
3 -0.733281
c 1 0.564248
2 0.455442
d 2 1.002617
3 -0.207445
dtype: float64
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
data['b']
1 1.580432
3 -0.733281
dtype: float64
data['b':'d']
b 1 1.580432
3 -0.733281
c 1 0.564248
2 0.455442
d 2 1.002617
3 -0.207445
dtype: float64
data.loc[['b', 'd'], 3]
b 3 -0.733281
d 3 -0.207445
dtype: float64
data.loc[:,2]
a 0.483053
c 0.455442
d 1.002617
dtype: float64
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building a pivot table. For example, we can rearrange this Series into a DataFrame using its unstack method:
b = data.unstack()
b
| | 1 | 2 | 3 |
| --- | --- | --- | --- |
| a | -0.505050 | 0.483053 | -1.166091 |
| b | 1.580432 | NaN | -0.733281 |
| c | 0.564248 | 0.455442 | NaN |
| d | NaN | 1.002617 | -0.207445 |
b.stack()
a 1 -0.505050
2 0.483053
3 -1.166091
b 1 1.580432
3 -0.733281
c 1 0.564248
2 0.455442
d 2 1.002617
3 -0.207445
dtype: float64
With a DataFrame, either axis can have a hierarchical index:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
| | | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | | Green | Red | Green |
| a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
| b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key1 | key2 | | | |
| a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
| b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
frame['Ohio']
| | color | Green | Red |
| --- | --- | --- | --- |
| key1 | key2 | | |
| a | 1 | 0 | 1 |
| | 2 | 3 | 4 |
| b | 1 | 6 | 7 |
| | 2 | 9 | 10 |
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names=['state', 'color'])
MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
labels=[[1, 1, 0], [0, 1, 0]],
names=['state', 'color'])
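A MultiIndex built this way can be constructed up front and reused, for example as the columns of a new DataFrame (a sketch reproducing the column layout above, without the row MultiIndex):

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=cols)

# Partial indexing with the outer level selects a sub-DataFrame
ohio = frame2['Ohio']
# A tuple selects a single column as a Series
red = frame2[('Ohio', 'Red')]
```

Partial indexing drops the outer level, so `ohio` has plain Green/Red columns.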
1. Reordering and Sorting Levels
At times you may need to rearrange the order of the levels on an axis, or sort the data by the values in one specific level. swaplevel takes two level numbers or names and returns a new object with the levels interchanged (the data itself is unaltered):
frame.swaplevel('key1', 'key2')
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key2 | key1 | | | |
| 1 | a | 0 | 1 | 2 |
| 2 | a | 3 | 4 | 5 |
| 1 | b | 6 | 7 | 8 |
| 2 | b | 9 | 10 | 11 |
frame.sort_index(level=1)
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key1 | key2 | | | |
| a | 1 | 0 | 1 | 2 |
| b | 1 | 6 | 7 | 8 |
| a | 2 | 3 | 4 | 5 |
| b | 2 | 9 | 10 | 11 |
frame.sort_index(level='key1')
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key1 | key2 | | | |
| a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
| b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
frame.swaplevel(0, 1).sort_index(level=0)
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key2 | key1 | | | |
| 1 | a | 0 | 1 | 2 |
| | b | 6 | 7 | 8 |
| 2 | a | 3 | 4 | 5 |
| | b | 9 | 10 | 11 |
2. Summary Statistics by Level
Many descriptive and summary statistics on DataFrame and Series can be computed by index level, aggregating on a particular axis at the level you specify. (Older pandas exposed this as a level option on sum and friends; current pandas expresses it with groupby.) Using the DataFrame above, we can aggregate by level on either the rows or the columns:
frame
| | state | Ohio | | Colorado |
| --- | --- | --- | --- | --- |
| | color | Green | Red | Green |
| key1 | key2 | | | |
| a | 1 | 0 | 1 | 2 |
| | 2 | 3 | 4 | 5 |
| b | 1 | 6 | 7 | 8 |
| | 2 | 9 | 10 | 11 |
frame.groupby(level='key1').sum()
| state | Ohio | | Colorado |
| --- | --- | --- | --- |
| color | Green | Red | Green |
| key1 | | | |
| a | 3 | 5 | 7 |
| b | 15 | 17 | 19 |
frame.groupby(level='key2').sum()
| state | Ohio | | Colorado |
| --- | --- | --- | --- |
| color | Green | Red | Green |
| key2 | | | |
| 1 | 6 | 8 | 10 |
| 2 | 12 | 14 | 16 |
frame.T.groupby(level='color').sum().T
| | color | Green | Red |
| --- | --- | --- | --- |
| key1 | key2 | | |
| a | 1 | 2 | 1 |
| | 2 | 8 | 4 |
| b | 1 | 14 | 7 |
| | 2 | 20 | 10 |
3. Indexing with a DataFrame's Columns
It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two',
'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
frame
| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 0 | 7 | one | 0 |
| 1 | 1 | 6 | one | 1 |
| 2 | 2 | 5 | one | 2 |
| 3 | 3 | 4 | two | 0 |
| 4 | 4 | 3 | two | 1 |
| 5 | 5 | 2 | two | 2 |
| 6 | 6 | 1 | two | 3 |
frame2 = frame.set_index(['c', 'd'], drop=False)
frame2
| | | a | b | c | d |
| --- | --- | --- | --- | --- | --- |
| c | d | | | | |
| one | 0 | 0 | 7 | one | 0 |
| | 1 | 1 | 6 | one | 1 |
| | 2 | 2 | 5 | one | 2 |
| two | 0 | 3 | 4 | two | 0 |
| | 1 | 4 | 3 | two | 1 |
| | 2 | 5 | 2 | two | 2 |
| | 3 | 6 | 1 | two | 3 |
reset_index, on the other hand, does the opposite of set_index: the hierarchical index levels are moved back into the columns:
frame3 = frame.set_index(['c','d'])
frame3.reset_index()
| | c | d | a | b |
| --- | --- | --- | --- | --- |
| 0 | one | 0 | 0 | 7 |
| 1 | one | 1 | 1 | 6 |
| 2 | one | 2 | 2 | 5 |
| 3 | two | 0 | 3 | 4 |
| 4 | two | 1 | 4 | 3 |
| 5 | two | 2 | 5 | 2 |
| 6 | two | 3 | 6 | 1 |