pandas 数据处理-Group by操作

使用 “group by” 方式我们通常会有以下一个或几个步骤:

  • Splitting:根据某一准则对数据分组

  • Applying :对每一分组数据运用某个方法

  • Combining :将结果组合为数据结构
    在上述步骤中,split 方法较直接,在 split 之后我们希望对分组数据做相关计算,在 apply 步骤中我们可能想对数据进行如下操作:

  • Aggregation::聚合操作,对分组数据做汇总统计,如计算sums 或 means、统计分组个数 counts

  • Transformation:对分组数据做特定操作,如:分组数据进行标准化、按照分组数据数据选择值填充控制

  • Filtration:过滤数据。对组计算 True 或 False,按组进弃数据。如:丢弃某些 counts 较小的组,根据组sums 或 means 过滤数据

1 Split

pandas objects 可以基于任何轴进行分割,group by 会创建一个 GroupBy object 对象

import numpy as np
import pandas as pd
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),)

grouped = df.groupby("class")
grouped = df.groupby("order", axis="columns")
grouped = df.groupby(["class", "order"])
grouped
Out[2]: 
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023F22DEA9C8>
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
# 设置 ["A", "B"] 为行索引
df2 = df.set_index(["A", "B"])

Out[2]: 
			C		D
A	B		
foo	one	0.194055	-0.087457
bar	one	-1.542546	-1.442626
foo	two	0.867688	-0.540060
bar	three	1.622831	0.331491
foo	two	-0.364909	0.639529
bar	two	0.771066	-0.675301
foo	one	1.071776	0.884663
	three	1.367875	1.474144

1.1 GroupBy sorting

groupby 之后数据会按照 key 值进行排序,可以显示声明不排序

df2.groupby(["X"], sort=False).sum()

1.2 GroupBy dropna

默认情况下, 在 groupby 操作中会排除 NA key 的统计,可以通过设置 dropna=False 去除这一限制

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
# dropna默认设置为 True
df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

1.3 GroupBy object attributes

groups 的属性是一个字典,key 为分组唯一值,value 为属于key的轴标签

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
gb= df.groupby(["A", "B"])
gb.groups
Out[32]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
# len 为字典键值个数
len(grouped)
# 可用的属性
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

1.4 Grouping DataFrame with Index levels and columns

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
# 多标签
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

Out[53]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7
# 按照轴索引的第二个轴和 A 列分组df.groupby([pd.Grouper(level=1), "A"]).sum()

Out[54]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
# 通过轴名称
df.groupby([pd.Grouper(level="second"), "A"]).sum()
# 直接根据名称
df.groupby(["second", "A"]).sum()

1.5 DataFrame column selection in GroupBy

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
grouped = df.groupby(["A"])
grouped_C = grouped["C"]
# 上面两个操作过程可以简写为
df["C"].groupby(df["A"])

2 遍历组 Iterating through groups

# 如果 key 为多个, group name 为元组
grouped = df.groupby('A')
for name, group in grouped:
    print(name)
    print(group)
  
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

3 选择组 Selecting a group

grouped.get_group("bar")
# 多个key的选择,传入数组
df.groupby(["A", "B"]).get_group(("bar", "one"))

4 数据聚合 Aggregation

GroupBy object 对象创建之后我们可以对分组数据做计算

grouped = df.groupby("A")
grouped.aggregate(np.sum)
Out[69]: 
 		C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
Out[71]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

从上面结果可以看到,对分组数据做计算之后,分组 name 作为了新的索引。如果不想分组name 最为新索引,可以用 as_index 操作数据

grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)
Out[73]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
# sum 操作可以简写为
df.groupby("A", as_index=False).sum()
# reset_index 方法直接重新设置索引
df.groupby(["A", "B"]).sum().reset_index()
Out[75]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

统计每组个数时,返回为 Series ,group name 为索引,值为组的大小

grouped = df.groupby(["A", "B"], as_index=False)
grouped.size()
Out[76]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
# 分组数据统计汇总
grouped.describe()
# 组内某个非
df4.groupby("A")["B"].nunique()

可以应用于分组数据的方法
pandas 数据处理-Group by操作_第1张图片
汇总返回的都为 Series 类型数据

4.1 对分组数据一次执行多个方法

执行单个汇总方法,返回的为 Series 类型数据。当传入多个统计方法时候,返回为 DataFrame

grouped = df.groupby("A")
grouped["C"].agg([np.sum, np.mean, np.std])
Out[83]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
Out[84]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

应用 lambda 函数

grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[88]: 
     <lambda_0>  <lambda_1>
A                          
bar    0.331279    0.084917
foo    2.337259   -0.215962

4.2 命名聚合Named aggregation

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)
# 针对列指定函数
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
)
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
# 可以简写为
animals.groupby("kind").agg(
    min_height=("height", "min"),
    max_height=("height", "max"),
    average_weight=("weight", np.mean),
)

# 不同列用不同函数
grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

# 用户自定义函数User Defined Function (UDF)
animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())

5 过滤 Filtration

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff.groupby("B").filter(lambda x: len(x) > 2)
Out[142]: 
   A  B
2  2  b
3  3  b
4  4  b
5  5  b
dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
Out[143]: 
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN

参考

Group by: split-apply-combine

你可能感兴趣的:(pandas,python,pandas,group,by)