groupby is one of the most frequently used pandas operations. This post walks through three ways to use it:
import pandas as pd
pd.__version__
'0.23.4'
label = [0,0,1,1,0,1,0,1,0,1]
sex = ["male","female","male","female","male","female","male","female","male","female"]
age = [23,26,21,30,56,45,25,37,48,33]
province = ["江苏","湖北","河南","山东","江苏","湖北","河南","湖北","河南","山东"]
rawDf = pd.DataFrame({
"sex":sex,"age":age,"province":province,"label":label})
rawDf
| | sex | age | province | label |
---|---|---|---|---|
0 | male | 23 | 江苏 | 0 |
1 | female | 26 | 湖北 | 0 |
2 | male | 21 | 河南 | 1 |
3 | female | 30 | 山东 | 1 |
4 | male | 56 | 江苏 | 0 |
5 | female | 45 | 湖北 | 1 |
6 | male | 25 | 河南 | 0 |
7 | female | 37 | 湖北 | 1 |
8 | male | 48 | 河南 | 0 |
9 | female | 33 | 山东 | 1 |
help(rawDf.groupby)
Help on method groupby in module pandas.core.generic:
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs) method of pandas.core.frame.DataFrame instance
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.
Parameters
----------
by : mapping, function, label, or list of labels
Used to determine the groups for the groupby.
If ``by`` is a function, it's called on each value of the object's
index. If a dict or Series is passed, the Series or dict VALUES
will be used to determine the groups (the Series' values are first
aligned; see ``.align()`` method). If an ndarray is passed, the
values are used as-is determine the groups. A label or list of
labels may be passed to group by the columns in ``self``. Notice
that a tuple is interpreted a (single) key.
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular
level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively "SQL-style" grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default False
This only applies if any of the groupers are Categoricals
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
.. versionadded:: 0.23.0
Returns
-------
GroupBy object
Examples
--------
DataFrame results
>>> data.groupby(func, axis=0).mean()
>>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
Notes
-----
See the `user guide
`_ for more.
See also
--------
resample : Convenience method for frequency conversion and resampling
of time series.
rawDf.groupby(by="sex").count()
sex | age | province | label |
---|---|---|---|
female | 5 | 5 | 5 |
male | 5 | 5 | 5 |
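Calling groupby followed directly by an aggregate function has two inconveniences: the grouping key ends up in the index rather than in a regular column, and the statistic is computed for every remaining column.
For the first problem, pass as_index=False; sex then becomes the first column.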
rawDf.groupby(by="sex",as_index=False).count()
| | sex | age | province | label |
---|---|---|---|---|
0 | female | 5 | 5 | 5 |
1 | male | 5 | 5 | 5 |
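As a small sketch of the same idea, calling reset_index() on the default result moves the group key out of the index and gives the same layout as as_index=False. The tiny `df` below is made-up illustration data, not the rawDf from this post.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [23, 26, 21, 30],
    "label": [0, 0, 1, 1],
})

# By default the group key ("sex") becomes the index of the result
indexed = df.groupby("sex").count()

# reset_index() turns that index back into an ordinary first column
flat = indexed.reset_index()

# Passing as_index=False up front produces the same column layout
direct = df.groupby("sex", as_index=False).count()

print(flat)
print(direct)
```

Which form to use is mostly a matter of taste; reset_index() is handy when the grouped result already exists.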
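For the second problem, if you only want the statistics for certain columns, restrict the output by column name. Pass the names as a list; the bare tuple form ["province","label"] is rejected by newer pandas versions.
rawDf.groupby(by="sex")[["province","label"]].count()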
sex | province | label |
---|---|---|
female | 5 | 5 |
male | 5 | 5 |
In SQL we often compute several aggregates in a single query, for example:
SELECT sex,COUNT(province),sum(label)
FROM tabledemo
GROUP BY sex
In pandas the same thing can be done with the apply method.
rawDf.groupby(by="sex").apply(lambda x:pd.DataFrame({
"x1":[x.province.count()],"x2":[x.label.sum()]}))
sex | | x1 | x2 |
---|---|---|---|
female | 0 | 5 | 4 |
male | 0 | 5 | 1 |
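For comparison, the SQL query above can also be sketched with agg instead of apply. The named-aggregation form assumes pandas 0.25 or later (newer than the 0.23.4 used in this post); the dict form should also work on older versions. The `df` here is made-up illustration data.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "province": ["江苏", "湖北", "河南", "山东", "江苏", "湖北"],
    "label": [0, 0, 1, 1, 0, 1],
})

# Named aggregation (assumes pandas >= 0.25):
# output_column=(input_column, aggregate_function)
summary = (
    df.groupby("sex")
      .agg(x1=("province", "count"), x2=("label", "sum"))
      .reset_index()  # move "sex" from the index into a column
)

# Dict form: one aggregate per input column, output keeps the input names
legacy = df.groupby("sex").agg({"province": "count", "label": "sum"})

print(summary)
print(legacy)
```

Unlike the apply version, agg does not add an inner 0 level to the index, because each aggregate already returns one value per group.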
Of course, the logic can also be passed in as a helper function:
def aggF(subdf):
    # Each value is wrapped in a list so the result is a one-row DataFrame
    provinceCnt = subdf.province.count()
    provinceDnt = subdf.province.nunique()
    labelSum = subdf.label.sum()
    return pd.DataFrame({
        "provinceCnt": [provinceCnt],
        "provinceDnt": [provinceDnt],
        "labelSum": [labelSum]
    })
rawDf.groupby(by="sex",as_index=False).apply(lambda x:aggF(subdf=x))
| | | provinceCnt | provinceDnt | labelSum |
---|---|---|---|---|
0 | 0 | 5 | 2 | 4 |
1 | 0 | 5 | 2 | 1 |
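Equivalently, the helper can pass scalar values together with an explicit index=[0]. The call below also switches to as_index=True:
def aggF(subdf):
    provinceCnt = subdf.province.count()
    provinceDnt = subdf.province.nunique()
    labelSum = subdf.label.sum()
    # Scalars need an explicit index to build a one-row DataFrame
    return pd.DataFrame({
        "provinceCnt": provinceCnt,
        "provinceDnt": provinceDnt,
        "labelSum": labelSum
    },index=[0])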
rawDf.groupby(by="sex",as_index=True).apply(lambda x:aggF(subdf=x))
sex | | provinceCnt | provinceDnt | labelSum |
---|---|---|---|---|
female | 0 | 5 | 2 | 4 |
male | 0 | 5 | 2 | 1 |
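The extra 0 in the index of the tables above appears because each group returns a one-row DataFrame. One possible variation, sketched here with a hypothetical helper `agg_row` and made-up data, is to return a Series instead, so that apply yields exactly one row per group.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "province": ["江苏", "湖北", "河南", "山东", "江苏", "湖北"],
    "label": [0, 0, 1, 1, 0, 1],
})

def agg_row(subdf):
    # A Series return value becomes a single row; its index becomes the columns
    return pd.Series({
        "provinceCnt": subdf["province"].count(),
        "provinceDnt": subdf["province"].nunique(),
        "labelSum": subdf["label"].sum(),
    })

# Selecting the value columns explicitly keeps the group key out of subdf
result = df.groupby("sex")[["province", "label"]].apply(agg_row)

print(result)
```

The result is indexed by sex only, with no inner level to drop afterwards.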
There is a pitfall here: after groupby + apply, the groupby key is no longer available as one of the columns. The only workaround I know of so far is a for loop, which is clumsy but good enough for now.
def defaultFunction(subDf, groupIndex):
    # Count bad (label == 1) and good (label == 0) rows within one group
    badCnt = subDf.query("label==1").shape[0]
    goodCnt = subDf.query("label==0").shape[0]
    totalCnt = badCnt + goodCnt
    badRate = badCnt / totalCnt
    return pd.DataFrame({
        "groupIndex": [groupIndex],
        "badCnt": [badCnt],
        "goodCnt": [goodCnt],
        "totalCnt": [totalCnt],
        "badRate": [badRate]
    })
def groupFunction(Df,groupKey):
    # Pass the key as a scalar so that `name` is a plain value, not a 1-tuple
    pieces = []
    for name,group in Df.groupby(by=groupKey):
        pieces.append(defaultFunction(group,name))
    # ignore_index=True renumbers the rows 0..n-1
    return pd.concat(pieces,ignore_index=True)
groupFunction(rawDf,"sex")
| | groupIndex | badCnt | goodCnt | totalCnt | badRate |
---|---|---|---|---|---|
0 | female | 4 | 1 | 5 | 0.8 |
1 | male | 1 | 4 | 5 | 0.2 |
groupFunction(rawDf,"province")
| | groupIndex | badCnt | goodCnt | totalCnt | badRate |
---|---|---|---|---|---|
0 | 山东 | 2 | 0 | 2 | 1.000000 |
1 | 江苏 | 0 | 2 | 2 | 0.000000 |
2 | 河南 | 1 | 2 | 3 | 0.333333 |
3 | 湖北 | 2 | 1 | 3 | 0.666667 |
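As a possible alternative to the loop, the same table can be sketched with built-in aggregates followed by reset_index(), which turns the group key back into a column. This assumes label only contains 0 and 1 with no missing values; the function name `bad_rate_table` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "label": [0, 0, 1, 1, 0, 1],
})

def bad_rate_table(frame, group_key):
    grouped = frame.groupby(group_key)["label"]
    out = pd.DataFrame({
        "badCnt": grouped.sum(),      # label == 1 marks a bad row
        "totalCnt": grouped.count(),  # assumes no missing labels
    })
    out["goodCnt"] = out["totalCnt"] - out["badCnt"]
    out["badRate"] = out["badCnt"] / out["totalCnt"]
    # Name the index "groupIndex", then move it into a regular column
    out = out.rename_axis("groupIndex").reset_index()
    return out[["groupIndex", "badCnt", "goodCnt", "totalCnt", "badRate"]]

result = bad_rate_table(df, "sex")
print(result)
```

This avoids calling a Python function once per group, which may matter when there are many groups.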
2020-02-29, Qixia District, Nanjing