By "group by" we mean a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a new data structure

The Splitting step is the most straightforward: in many scenarios we simply want to split the data into groups and do something with each group. In the Applying step, we might wish to do one of the following:

- Aggregation: compute a summary statistic for each group, e.g. the mean, sum, or count
- Transformation: perform a group-specific computation and return a like-indexed object, e.g. standardizing within a group (zscore) or filling NAs within groups
- Filtration: discard some groups according to a group-wise criterion, e.g. dropping groups with only a few members, or filtering based on the group sum or mean
- Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into either of the above two categories
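The three apply-step operations can be sketched on a tiny made-up DataFrame (a minimal illustration; the `key`/`val` names are invented for this example):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                   'val': [1.0, 3.0, 2.0, 4.0, 6.0]})

# Aggregation: one summary value per group
agg = df.groupby('key')['val'].mean()

# Transformation: a like-indexed result (demean within each group)
demeaned = df['val'] - df.groupby('key')['val'].transform('mean')

# Filtration: drop groups with fewer than three rows
filtered = df.groupby('key').filter(lambda g: len(g) >= 3)
```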
Operations on pandas data structures tend to be rich and intuitive: we treat each group as a DataFrame and call the relevant functions on it. Readers familiar with SQL-based tools will recognize GroupBy from statements like the following:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
pandas makes such operations just as simple and readable. Below we cover each use of GroupBy and provide some non-trivial examples.
More advanced strategies can be found in the cookbook.
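The SQL query above translates almost directly to pandas (a sketch; the table and column names simply mirror the SQL):

```python
import pandas as pd

some_table = pd.DataFrame({'Column1': ['x', 'x', 'y'],
                           'Column2': ['u', 'u', 'v'],
                           'Column3': [1.0, 3.0, 5.0],
                           'Column4': [10, 20, 30]})

# SELECT Column1, Column2, mean(Column3), sum(Column4)
# FROM SomeTable GROUP BY Column1, Column2
result = some_table.groupby(['Column1', 'Column2']).agg(
    {'Column3': 'mean', 'Column4': 'sum'})
```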
pandas objects can be split on any of their axes. Abstractly, a grouping is a mapping of labels to group names. A GroupBy object can be created as follows:
# default is axis=0
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])
This mapping can be specified in several ways: a column name, a function called on each axis label, a list or array of the same length as the axis, or a dict/Series mapping labels to group names. We usually refer to the objects being grouped on as the keys. Take for example the following DataFrame:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
Calling the DataFrame's groupby() method returns a GroupBy object; we can group by column A or B, or by both A and B at once:
In [3]: grouped = df.groupby('A')
In [4]: grouped = df.groupby(['A', 'B'])
The code above splits on the index (row) axis. (Translator's note: groupby('A') groups by the contents of column A; since A contains the two values foo and bar, df is split into two groups, one where A is foo and one where A is bar, i.e. the split is along the index axis.) The following splits by columns instead:
In [5]: def get_letter_type(letter):
...: if letter.lower() in 'aeiou': # columns named a/e/i/o/u form one group, named 'vowel'
...: return 'vowel'
...: else:
...: return 'consonant'
...:
In [6]: grouped = df.groupby(get_letter_type, axis=1)
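Note that `axis=1` grouping has been deprecated in recent pandas releases; a version-independent way to get the same column grouping is to transpose first (a sketch, not from the original text, with a small invented DataFrame):

```python
import pandas as pd

def get_letter_type(letter):
    # columns whose names are vowels form the 'vowel' group
    return 'vowel' if letter.lower() in 'aeiou' else 'consonant'

df = pd.DataFrame({'A': [1, 2], 'E': [3, 4], 'B': [5, 6]})

# Equivalent to df.groupby(get_letter_type, axis=1).sum() on older pandas:
summed = df.T.groupby(get_letter_type).sum().T
```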
pandas Index objects support duplicate values. If a non-unique index is used as the group key, all values for the same index label form one group, so aggregation functions never see duplicate index labels:
In [7]: lst = [1, 2, 3, 1, 2, 3]
In [8]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
In [9]: grouped = s.groupby(level=0)
# Translator's note: print(s) gives
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
In [10]: grouped.first()
Out[10]:
1 1
2 2
3 3
dtype: int64
In [11]: grouped.last()
Out[11]:
1 10
2 20
3 30
dtype: int64
In [12]: grouped.sum()
Out[12]:
1 11
2 22
3 33
dtype: int64
The groupby operation is lazy: creating the GroupBy object only verifies that the passed mapping is valid.
By default the groups are sorted by key; passing sort=False avoids the sorting overhead:
In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
In [14]: df2.groupby(['X']).sum()
Out[14]:
Y
X
A 7
B 3
In [15]: df2.groupby(['X'], sort=False).sum()
Out[15]:
Y
X
B 3
A 7
groupby does not reorder the observations within each group; it preserves the order in which they appear in the original DataFrame:
In [16]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [17]: df3.groupby(['X']).get_group('A')
Out[17]:
X Y
0 A 1
2 A 3
In [18]: df3.groupby(['X']).get_group('B')
Out[18]:
X Y
1 B 4
3 B 2
The groups attribute is a dict whose keys are the group labels and whose values are the axis labels belonging to each group:
In [19]: df.groupby('A').groups
Out[19]:
{'bar': Int64Index([1, 3, 5], dtype='int64'),
'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
In [20]: df.groupby(get_letter_type, axis=1).groups
Out[20]:
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
'vowel': Index(['A'], dtype='object')}
Calling the standard Python len function on the GroupBy object returns the number of groups, i.e. the length of the groups dict:
In [21]: grouped = df.groupby(['A', 'B'])
In [22]: grouped.groups
Out[22]:
{('bar', 'one'): Int64Index([1], dtype='int64'),
('bar', 'three'): Int64Index([3], dtype='int64'),
('bar', 'two'): Int64Index([5], dtype='int64'),
('foo', 'one'): Int64Index([0, 6], dtype='int64'),
('foo', 'three'): Int64Index([7], dtype='int64'),
('foo', 'two'): Int64Index([2, 4], dtype='int64')}
In [23]: len(grouped)
Out[23]: 6
In an interactive session, pressing TAB on a GroupBy object completes column names and other attributes:
In [24]: df
Out[24]:
height weight gender
2000-01-01 42.849980 157.500553 male
2000-01-02 49.607315 177.340407 male
2000-01-03 56.293531 171.524640 male
2000-01-04 48.421077 144.251986 female
2000-01-05 46.556882 152.526206 male
2000-01-06 68.448851 168.272968 female
2000-01-07 70.757698 136.431469 male
2000-01-08 58.909500 176.499753 female
2000-01-09 76.435631 174.094104 female
2000-01-10 45.306120 177.540920 male
In [25]: gb = df.groupby('gender')
In [26]: gb.<TAB>
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
With hierarchically indexed data, it is natural to group by any level of the hierarchy. Let's first create a Series with a two-level MultiIndex:
In [27]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [28]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [29]: s = pd.Series(np.random.randn(8), index=index)
In [30]: s
Out[30]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
We can then group s by one level:
In [31]: grouped = s.groupby(level=0)
In [32]: grouped.sum()
Out[32]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
If the MultiIndex has names assigned, these can be passed instead of the level number:
In [33]: s.groupby(level='second').sum()
Out[33]:
second
one 0.980950
two 1.991575
dtype: float64
Aggregation functions such as sum take a level parameter directly, and the resulting index is named after the chosen level:
In [34]: s.sum(level='second')
Out[34]:
second
one 0.980950
two 1.991575
dtype: float64
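In newer pandas releases (1.x and later), passing level= to aggregation methods like sum is deprecated and later removed; the groupby form shown in In [33] is the forward-compatible spelling (a sketch with small invented data):

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'foo', 'foo'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Preferred over s.sum(level='second') in modern pandas:
by_second = s.groupby(level='second').sum()
```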
Grouping by multiple index levels is also supported:
In [35]: s
Out[35]:
first second third
bar doo one -1.131345
two -0.089329
baz bee one 0.337863
two -0.945867
foo bop one -0.932132
two 1.956030
qux bop one 0.017587
two -0.016692
dtype: float64
In [36]: s.groupby(level=['first', 'second']).sum()
Out[36]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
New in version 0.20: index levels may be passed as keys directly:
In [37]: s.groupby(['first', 'second']).sum()
Out[37]:
first second
bar doo -1.220674
baz bee -0.608004
foo bop 1.023898
qux bop 0.000895
dtype: float64
A DataFrame may be grouped by a combination of columns and index levels: specify the columns as strings and the index levels as pd.Grouper objects.
In [38]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [39]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [40]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
....: 'B': np.arange(8)},
....: index=index)
....:
In [41]: df
Out[41]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
The following example groups df by the second index level and the A column:
In [42]: df.groupby([pd.Grouper(level=1), 'A']).sum()
Out[42]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
Index levels may also be specified by name:
In [43]: df.groupby([pd.Grouper(level='second'), 'A']).sum()
Out[43]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
New in version 0.20: index level names may be passed directly as keys:
In [44]: df.groupby(['second', 'A']).sum()
Out[44]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
Once you have created the GroupBy object, you may want to do different things to each of the columns. Using [] to select a single column works as expected:
In [45]: grouped = df.groupby(['A'])
In [46]: grouped_C = grouped['C']
In [47]: grouped_D = grouped['D']
This is mainly syntactic sugar (translator's note: an added syntax that changes no functionality and is purely for convenience) and is equivalent to the more verbose form:
In [48]: df['C'].groupby(df['A'])
Out[48]: &lt;pandas.core.groupby.groupby.SeriesGroupBy object at 0x1c2f67b128&gt;
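The equivalence between the two spellings can be checked directly (a small sketch with fresh, invented data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                   'C': [1.0, 2.0, 3.0]})

# Sugar form vs. the explicit form it desugars to
sugar = df.groupby('A')['C'].sum()
explicit = df['C'].groupby(df['A']).sum()
```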
With the GroupBy object in hand, iterating through the grouped data is very natural and works much like itertools.groupby():
In [49]: grouped = df.groupby('A')
In [50]: for name, group in grouped:
....: print(name)
....: print(group)
....:
bar
A B C D
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
foo
A B C D
0 foo one -0.575247 1.346061
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In the case of grouping by multiple keys, the group name will be a tuple:
In [51]: for name, group in df.groupby(['A', 'B']):
....: print(name)
....: print(group)
....:
('bar', 'one')
A B C D
1 bar one 0.254161 1.511763
('bar', 'three')
A B C D
3 bar three 0.215897 -0.990582
('bar', 'two')
A B C D
5 bar two -0.077118 1.211526
('foo', 'one')
A B C D
0 foo one -0.575247 1.346061
6 foo one -0.408530 0.268520
('foo', 'three')
A B C D
7 foo three -0.862495 0.02458
('foo', 'two')
A B C D
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
This is just standard Python syntax, so the tuple can be unpacked inside the for loop:
for (k1,k2), group in grouped:
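A complete, runnable version of the unpacking loop (with a small DataFrame invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two'],
                   'C': [1, 2, 3]})

names = []
for (k1, k2), group in df.groupby(['A', 'B']):
    names.append((k1, k2))  # the group key, unpacked into k1 and k2
```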
A single group can be selected with get_group():
In [52]: grouped.get_group('bar')
Out[52]:
A B C D
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
Or, for an object grouped on multiple columns, by passing a tuple:
In [53]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[53]:
A B C D
1 bar one 0.254161 1.511763
Once the GroupBy object has been created, several methods are available to operate on the grouped data, in the style of the aggregating API, the window function API, and the resample API.
A common aggregation uses aggregate(), or equivalently its alias agg():
In [54]: grouped = df.groupby('A')
In [55]: grouped.aggregate(np.sum)
Out[55]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [56]: grouped = df.groupby(['A', 'B'])
In [57]: grouped.aggregate(np.sum)
Out[57]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
As seen above, the result of the aggregation has the group names as the new index; with multiple keys it is a MultiIndex by default. This can be changed with the as_index option.
(Translator's note: as_index=False flattens the multi-level index into regular columns.)
In [58]: grouped = df.groupby(['A', 'B'], as_index=False)
In [59]: grouped.aggregate(np.sum)
Out[59]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
In [60]: df.groupby('A', as_index=False).sum()
Out[60]:
A C D
0 bar 0.392940 1.732707
1 foo -1.796421 2.824590
The same result can be obtained with the DataFrame's reset_index() function:
In [61]: df.groupby(['A', 'B']).sum().reset_index()
Out[61]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
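That the two spellings agree can be verified directly (a sketch with small invented data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'bar', 'foo'],
                   'B': ['one', 'two', 'one'],
                   'C': [1.0, 2.0, 3.0]})

# as_index=False vs. aggregating and then resetting the index
flat = df.groupby(['A', 'B'], as_index=False).sum()
via_reset = df.groupby(['A', 'B']).sum().reset_index()
```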
Another simple aggregation is computing the size of each group with the GroupBy size method. It returns a Series whose index is the group names and whose values are the group sizes:
In [62]: grouped.size()
Out[62]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
In [63]: grouped.describe()
Out[63]:
C ... D
count mean std min 25% 50% 75% ... mean std min 25% 50% 75% max
0 1.0 0.254161 NaN 0.254161 0.254161 0.254161 0.254161 ... 1.511763 NaN 1.511763 1.511763 1.511763 1.511763 1.511763
1 1.0 0.215897 NaN 0.215897 0.215897 0.215897 0.215897 ... -0.990582 NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2 1.0 -0.077118 NaN -0.077118 -0.077118 -0.077118 -0.077118 ... 1.211526 NaN 1.211526 1.211526 1.211526 1.211526 1.211526
3 2.0 -0.491888 0.117887 -0.575247 -0.533567 -0.491888 -0.450209 ... 0.807291 0.761937 0.268520 0.537905 0.807291 1.076676 1.346061
4 1.0 -0.862495 NaN -0.862495 -0.862495 -0.862495 -0.862495 ... 0.024580 NaN 0.024580 0.024580 0.024580 0.024580 0.024580
5 2.0 0.024925 1.652692 -1.143704 -0.559389 0.024925 0.609240 ... 0.592714 1.462816 -0.441652 0.075531 0.592714 1.109898 1.627081
[6 rows x 16 columns]
Note: when aggregating, the grouping columns are by default (as_index=True) not returned as columns; they become the index of the returned object (translator's note: a multi-level index for multiple keys). With as_index=False the grouping columns are returned as regular columns (translator's note: often easier to read).
Aggregating functions reduce the dimension of the returned object. Some common aggregating functions are listed below:
| Function | Description |
|---|---|
| mean() | Compute mean of groups |
| sum() | Compute sum of group values |
| size() | Compute group sizes |
| count() | Compute count of group |
| std() | Standard deviation of groups |
| var() | Compute variance of groups |
| sem() | Standard error of the mean of groups |
| describe() | Generates descriptive statistics |
| first() | Compute first of group values |
| last() | Compute last of group values |
| nth() | Take nth value, or a subset if n is a list |
| min() | Compute min of group values |
| max() | Compute max of group values |
The aggregating functions above exclude NA values. Any function that reduces a Series to a scalar can be used as an aggregation function, e.g. df.groupby('A').agg(lambda ser: 1).
Note that nth() can act as a reducer or a filter, see here
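The reducer-versus-filter distinction for nth() can be seen on a small invented example: nth(0) picks the first row of each group as-is (so it can pick up an NaN), while first() skips NA values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': [np.nan, 4.0, 5.0]})
g = df.groupby('A')

nth0 = g.nth(0)        # one row per group, NaN included
first = g['B'].first() # NA-skipping reducer: x -> 4.0
```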
A list or dict of functions can be passed to aggregate, producing a DataFrame as output (translator's note: a single function would produce a Series):
In [64]: grouped = df.groupby('A')
In [65]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[65]:
sum mean std
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
On a grouped DataFrame, passing a list of functions produces a DataFrame with hierarchical columns, as shown below (translator's note: every column is aggregated with every function):
In [66]: grouped.agg([np.sum, np.mean, np.std])
Out[66]:
C D
sum mean std sum mean std
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
The resulting aggregation columns are named after the functions; to rename them, chain a rename() call with a dict:
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[67]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
A grouped DataFrame can be renamed the same way:
In [68]: (grouped.agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[68]:
C D
foo bar baz foo bar baz
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
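Since pandas 0.25, "named aggregation" offers a cleaner way to select columns, aggregate, and name the output in one step (an alternative spelling not covered in the original text; the output names `c_sum`/`c_mean` are invented):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'bar', 'foo'],
                   'C': [1.0, 3.0, 5.0]})

# keyword = (input column, aggregation function)
out = df.groupby('A').agg(
    c_sum=('C', 'sum'),
    c_mean=('C', 'mean'))
```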
By passing a dict to aggregate, you can apply a different aggregation to each column:
In [69]: grouped.agg({'C' : np.sum,
....: 'D' : lambda x: np.std(x, ddof=1)})
....:
Out[69]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
The function names may also be given as strings, in which case they must either be implemented on GroupBy or be reachable via dispatching:
In [70]: grouped.agg({'C' : 'sum', 'D' : 'std'})
Out[70]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
Note: when passing a dict to agg, the output column order is not guaranteed; to fix the order, pass an OrderedDict (from the collections module) instead, as below:
In [71]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[71]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
In [72]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[72]:
D C
A
bar 1.366330 0.130980
foo 0.884785 -0.359284
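On Python 3.7+ with a modern pandas, plain dicts preserve insertion order, so the OrderedDict workaround is generally no longer needed; the output columns follow the dict's insertion order (a sketch under that assumption):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'foo'],
                   'C': [1.0, 2.0],
                   'D': [3.0, 4.0]})

# D before C in the dict -> D before C in the output columns
out = df.groupby('A').agg({'D': 'sum', 'C': 'mean'})
```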
The three aggregation functions sum, std, and sem are implemented in Cython for speed:
In [73]: df.groupby('A').sum()
Out[73]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [74]: df.groupby(['A', 'B']).mean()
Out[74]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.491888 0.807291
three -0.862495 0.024580
two 0.024925 0.592714
To be continued…