Pandas_Aggregating Data_pivot_table()

Aggregating data

pivot_table()

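Turns column values into the row index and the column index, and can aggregate the data at the same time.

(My impression is that pivot_table simply places the grouping keys on index and columns, giving a two-dimensional grouping.)

(pivot() can only reshape column values into the row index and column index. It cannot aggregate, and it fails with an error if any index/columns combination appears more than once.)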
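
The difference between pivot() and pivot_table() on duplicated entries can be sketched as follows. The small DataFrame is made up for illustration (it is not the sales data used later), and the string "sum" is used as the aggregation function.

```python
import pandas as pd

# Illustrative data: the ("Craig", "CPU") pair appears twice.
df = pd.DataFrame({
    "Rep": ["Craig", "Craig", "Craig", "John"],
    "Product": ["CPU", "CPU", "Software", "CPU"],
    "Price": [30000, 35000, 10000, 35000],
})

# pivot() only reshapes, so the duplicated pair makes it raise ValueError.
try:
    df.pivot(index="Rep", columns="Product", values="Price")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here: sum).
table = df.pivot_table(index="Rep", columns="Product", values="Price",
                       aggfunc="sum", fill_value=0)

print(pivot_failed)
print(table)
```

Here the two "Craig"/"CPU" rows are combined into a single cell by pivot_table(), while pivot() refuses to reshape them.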

pivot_table() is available both as a top-level function (pd.pivot_table) and as a DataFrame method (df.pivot_table).

"My general rule of thumb is that once you use multiple groupby you should evaluate whether a pivot table is a useful approach." (Chris Moffitt)
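
To make the link with groupby concrete, here is a minimal sketch comparing a two-key groupby followed by unstack() with the corresponding pivot_table() call. The data is invented for illustration, and the comparison only covers this simple case (one values column, one function).

```python
import pandas as pd

# Illustrative data, not the sales data used later.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred"],
    "Product": ["CPU", "CPU", "Software", "CPU", "Monitor"],
    "Price": [30000, 35000, 10000, 65000, 5000],
})

# Group by two keys, then move one key from the rows to the columns.
via_groupby = (
    df.groupby(["Manager", "Product"])["Price"].mean().unstack("Product")
)

# The same two keys, given directly as index and columns.
via_pivot = df.pivot_table(index="Manager", columns="Product",
                           values="Price", aggfunc="mean")

print(via_groupby)
print(via_pivot)
```

Both results have one row per Manager and one column per Product, with NaN where a combination does not occur.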

df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
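
As a quick check of the two calling styles, the sketch below runs the method form and the top-level form on the same made-up DataFrame. The only difference is that pd.pivot_table takes the DataFrame as its first argument.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred"],
    "Price": [30000, 10000, 65000],
})

# Method form: the DataFrame is the object the method is called on.
as_method = df.pivot_table(index="Manager", values="Price", aggfunc="mean")

# Top-level form: the DataFrame is passed as the data argument.
as_function = pd.pivot_table(df, index="Manager", values="Price", aggfunc="mean")

print(as_method.equals(as_function))
```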

Parameter      Type                                       Default      Description
data           DataFrame                                  (required)   pd.pivot_table only: the DataFrame to operate on
values         column, optional                           None         The data to compute on: the column(s) to aggregate
index          column, grouper, array, or list of these   None         Row group keys: column names or other keys to group by, used as the row index of the resulting DataFrame
columns        column, grouper, array, or list of these   None         Column group keys: column names or other keys to group by, used as the column index of the resulting DataFrame
aggfunc        function, list of functions, or dict       numpy.mean   Aggregation function or list of functions; a dict maps each values column to its function(s)
fill_value     scalar                                     None         Value used in place of missing values
dropna         boolean                                    True         If True, columns whose values are all NaN are dropped; if False, they are kept
margins        boolean                                    False        If True, adds row/column totals
margins_name   string                                     'All'        Name of the totals row/column when margins=True

If a list [ ] appears in aggfunc, the names of the functions in that list always appear in the columns of the resulting DataFrame (☑ appears, ☒ does not appear):

  • ☒ aggfunc = np.sum
  • ☑ aggfunc = [ np.sum ]
  • ☑ aggfunc = [ np.sum,np.mean ]
  • ☒ aggfunc = { 'Price':np.sum }
  • ☑ aggfunc = { 'Price':[np.sum] }
  • ☒ aggfunc = { 'Price':np.sum,'Quantity':len }
  • ☑ aggfunc = { 'Price':[np.sum],'Quantity':len }
  • ☑ aggfunc = { 'Price':[np.sum,np.mean],'Quantity':len }
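
The rule about lists in aggfunc can be checked by counting the column levels of the result. This is a small sketch on invented data; it uses the string names "sum", "mean" and "count" rather than np.sum, np.mean and len, which is a substitution made here for brevity.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred"],
    "Price": [30000, 10000, 65000, 5000],
    "Quantity": [1, 2, 2, 1],
})

# One representative of each way of writing aggfunc.
forms = {
    "function": "sum",
    "list": ["sum"],
    "dict of functions": {"Price": "sum", "Quantity": "count"},
    "dict with a list": {"Price": ["sum", "mean"], "Quantity": "count"},
}

levels = {}
for label, aggfunc in forms.items():
    table = df.pivot_table(index="Manager",
                           values=["Price", "Quantity"],
                           aggfunc=aggfunc)
    # One column level: only the values names.
    # Two column levels: the function names appear as well.
    levels[label] = table.columns.nlevels
    print(label, list(table.columns))
```

The forms that contain a list produce two column levels, because the function names become part of the column index.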

>>> df.head()
   Account  Name     Rep     Manager  Product      Quantity  Price  Status
0  714466   Trantow  Craig   Henley   CPU          1         30000  presented
1  714466   Trantow  Craig   Henley   Software     1         10000  presented
2  714466   Trantow  Craig   Henley   Maintenance  2         5000   pending
3  737550   Fritsch  Craig   Henley   CPU          1         35000  declined
4  146832   Kiehn    Daniel  Henley   CPU          2         65000  won

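#Define the "Status" column as a category and set the order of its categories
#This gives the desired ordering throughout the rest of the analysis
#(set_categories returns a new Series; the inplace argument is not available in pandas 2.0 and later)
>>> df["Status"] = df["Status"].astype("category")
>>> df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])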
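
The effect of the category order on a pivot table can be sketched on a few invented rows. Passing observed=True is a choice made here so that only categories present in the data are used; it is not part of the original example.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Status": ["presented", "won", "declined", "pending", "won"],
    "Price": [30000, 65000, 35000, 5000, 40000],
})

# Store "Status" as a category with an explicit order.
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(
    ["won", "pending", "presented", "declined"]
)

table = df.pivot_table(index="Status", values="Price",
                       aggfunc="sum", observed=True)

# Rows follow the category order, not alphabetical order.
print(list(table.index))  # ['won', 'pending', 'presented', 'declined']
```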

-----------------------------------------------------------------------------  
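#Set a single row index
#index=["Name"]
#Only the columns that can be aggregated are kept; the other columns are dropped
>>> df.pivot_table(index=["Name"])
#or, with the value columns listed explicitly:
>>> df.pivot_table(index=["Name"],values=["Account","Price","Quantity"])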

	   Account	Price	Quantity
Name			
Barton 	   740150	35000	1.000000
Fritsch	   737550	35000	1.000000
Herman	   141962	65000	2.000000
Jerde	   412290	5000	2.000000
Kassulke   307599	7000	3.000000

#Set several row indexes
#index=["Name","Rep","Manager"]
>>> df.pivot_table(index=["Name","Rep","Manager"])
#or, with the value columns listed explicitly:
>>> df.pivot_table(index=["Name","Rep","Manager"],
                   values=["Account","Price","Quantity"])

			Account	Price	Quantity
Name	Rep	Manager			
Barton	John	Debra	740150	35000	1.000000
Fritsch	Craig	Debra	737550	35000	1.000000
Herman	Cedric	Fred	141962	65000	2.000000
Jerde	John	Debra	412290	5000	2.000000
Kassulke Wendy	Fred	307599	7000	3.000000

#Set a more meaningful row index
>>> df.pivot_table(index=["Manager","Rep"])
#or, with the value columns listed explicitly:
>>> df.pivot_table(index=["Manager","Rep"],
                   values=["Account","Price","Quantity"])
		Account		Price		Quantity
Manager	Rep			
Debra	Craig	720237.0	20000.000000	1.250000
     	Daniel	194874.0	38333.333333	1.666667
     	John	576220.0	20000.000000	1.500000
Fred	Cedric	196016.5	27500.000000	1.250000
    	Wendy	614061.5	44250.000000	3.000000
-----------------------------------------------------------------------------             
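#Prepare to aggregate: keep only the "Price" column, which is the one to aggregate
#The mean of the "Price" column is computed automatically
# aggfunc=np.mean is the default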
>>> df.pivot_table(index=["Manager","Rep"],values=["Price"])

		Price
Manager	Rep	
Debra	Craig	20000
	Daniel	38333
	John	20000
Fred	Cedric	27500
	Wendy	44250

#Apply a "sum" to the "Price" column
#instead of the default "mean"
# aggfunc=np.sum
>>> df.pivot_table(index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

		Price
Manager	Rep	
Debra	Craig	80000
	Daniel	115000
	John	40000
Fred	Cedric	110000
	Wendy	177000

#Several functions can be applied to a column at once
#Apply both "mean" and "count" (len) to the "Price" column
# aggfunc=[np.mean,len]
>>> df.pivot_table(index=["Manager","Rep"],
                   values=["Price"],aggfunc=[np.mean,len])

		mean	len
                Price	Price
Manager	Rep	
Debra	Craig	20000	4
	Daniel	38333	3
	John	20000	2
Fred	Cedric	27500	4
	Wendy	44250	4
-----------------------------------------------------------------------------             
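#The "Price" values are made up of several "Product" items
#Grouping by "Product" gives different views of the data depending on whether "Product" is placed in index or in columns
#"Product" in columns: columns=["Product"] (a horizontal view; the result is wide)
#"Product" in index:   index=["Product"] (a vertical view; the result is long)


#★ First, columns=["Product"] moves the grouping item to the columns
#Setting columns produces a multi-level column index
#         columns is the innermost column level
#         values  is an outer column level

#fill_value=0 replaces every resulting missing value (NaN) with 0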
>>> df.pivot_table(index=["Manager","Rep"],
                   values=["Price"],
                   columns=["Product"],aggfunc=[np.sum],fill_value=0)

                sum
                Price
        Product	CPU	Maintenance	Monitor	Software
Manager	Rep				
Debra	Craig	65000	5000	        0	10000
        Daniel	105000	0	        0	10000
        John	35000	5000	        0	0
Fred	Cedric	95000	5000	        0	10000
        Wendy	165000	7000	        5000	0
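
To isolate what fill_value does, the following sketch builds the same table with and without it. The data is invented so that two Rep/Product combinations are missing.

```python
import pandas as pd

# Illustrative data: Craig has no Maintenance row, John has no Software row.
df = pd.DataFrame({
    "Rep": ["Craig", "Craig", "John", "John"],
    "Product": ["CPU", "Software", "CPU", "Maintenance"],
    "Price": [30000, 10000, 35000, 5000],
})

# Without fill_value, missing combinations show up as NaN.
raw = df.pivot_table(index="Rep", columns="Product",
                     values="Price", aggfunc="sum")

# With fill_value=0, the same cells are filled with 0.
filled = df.pivot_table(index="Rep", columns="Product",
                        values="Price", aggfunc="sum", fill_value=0)

missing_before = int(raw.isna().sum().sum())
missing_after = int(filled.isna().sum().sum())
print(missing_before, missing_after)
```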

#Add the "Quantity" column
#After "Quantity" is added, columns is still the innermost column level
#Order of the column levels: 'values' above 'columns'
>>> df.pivot_table(index=["Manager","Rep"],
                   values=["Price","Quantity"],
                   columns=["Product"],aggfunc=[np.sum],fill_value=0)

                sum
                Price                                       Quantity
        Product	CPU	Maintenance	Monitor	Software    CPU	Maintenance	Monitor	Software
Manager	Rep				
Debra	Craig	65000	5000	        0	10000        2	2	        0	1
        Daniel	105000	0	        0	10000        4	0	        0	1
        John	35000	5000	        0	0            1	2	        0	0
Fred	Cedric	95000	5000	        0	10000        3	1	        0	1
        Wendy	165000	7000	        5000	0            7	3	        2	0

#★ Next, index=["Product"] moves the grouping item to the rows
>>> df.pivot_table(index=["Manager","Rep","Product"],
                   values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0,dropna=True)

                                sum
                                Price	Quantity
Manager	Rep	Product		
Debra	Craig	CPU	        65000	2
                Maintenance	5000	2
                Software	10000	1
        Daniel	CPU	        105000	4
                Software	10000	1
        John	CPU	        35000	1
                Maintenance	5000	2
Fred	Cedric	CPU	        95000	3
                Maintenance	5000	1
                Software	10000	1
        Wendy	CPU	        165000	7
                Maintenance	7000	3
                Monitor	        5000	2
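
The wide and long views can be compared side by side on a small invented dataset. Both calls aggregate the same numbers; only the placement of "Product" differs.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred"],
    "Rep": ["Craig", "Craig", "John", "Wendy", "Wendy"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Monitor"],
    "Price": [30000, 10000, 35000, 65000, 5000],
})

# "Product" in columns: one column per product (wide).
wide = df.pivot_table(index=["Manager", "Rep"], columns="Product",
                      values="Price", aggfunc="sum", fill_value=0)

# "Product" in index: one row per Manager/Rep/Product combination (long).
long = df.pivot_table(index=["Manager", "Rep", "Product"],
                      values="Price", aggfunc="sum")

print(wide.shape)  # rows = Manager/Rep pairs, columns = products
print(long.shape)  # rows = observed combinations, one value column
```

The wide table needs fill_value for combinations that do not occur, while the long table simply has no row for them.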

-----------------------------------------------------------------------------             
#Setting aggfunc to a dict lets each values column use a different function
#The dict key is a values column
#The dict value is the function (or list of functions) to apply
#aggfunc={"Quantity":len,"Price":np.sum}

>>> df.pivot_table(index=["Manager","Status"],
                   columns=["Product"],
                   values=["Quantity","Price"],
                   aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)

    	            Price	                                Quantity
        Product	    CPU    Maintenance    Monitor    Software	CPU	Maintenance	Monitor	Software
Manager	Status								
Debra	declined    70000        0            0            0    2	0	        0	0
        pending     40000      10000	      0	           0    1	2	        0	0
        presented   30000	 0	      0	         20000  1	0	        0	2
        won         65000	 0	      0	           0    1	0	        0	0
Fred	declined    65000	 0	      0	           0    1	0	        0	0
        pending     0	       5000	      0	           0    0	1	        0	0
        presented   30000	 0	    5000	 10000  1	0	        1	1
        won         165000     7000	      0	           0    2	1	        0	0

#aggfunc={"Quantity":len,"Price":[np.sum,np.mean]}
>>> df.pivot_table(index=["Manager","Status"],
                   columns=["Product"],
                   values=["Quantity","Price"],
                   aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)

                                Price	                                                                                        Quantity
                                mean                                            sum	                                        len
                Product	        CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	        Status												
Debra Henley	declined	35000	0	        0	0	        70000	0	        0       0               2       0    	        0        0
                pending	        40000	5000	        0	0	        40000	10000    	0       0               1       2    	        0        0
                presented	30000	0	        0	10000	        30000	0	 	0	20000	        1       0    	        0        2
                won	        65000	0	        0	0	        65000	0	        0       0               1       0    	        0        0
Fred Anderson	declined	65000	0	        0	0	        65000	0	        0       0               1       0    	        0        0
                pending	        0	5000	        0	0	        0       5000            0	0               0       1    	        0        0
                presented	30000	0	        5000	10000	        30000	0	    	5000 	10000	 	1	0    	        1        1
                won	        82500	7000	        0	0	        165000	7000	 	0	0	 	2	1    	        0        0
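
A reduced version of the dict form, on invented data, shows how each values column gets its own function while columns=["Product"] still spreads the result across products.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred"],
    "Product": ["CPU", "CPU", "Software", "CPU", "Monitor"],
    "Price": [30000, 35000, 10000, 65000, 5000],
    "Quantity": [1, 1, 1, 2, 2],
})

# "Price" is summed, while "Quantity" is counted with len (number of rows).
table = df.pivot_table(index="Manager", columns="Product",
                       values=["Price", "Quantity"],
                       aggfunc={"Price": "sum", "Quantity": len},
                       fill_value=0)

# Columns are (values name, product) pairs.
print(table[("Price", "CPU")])
print(table[("Quantity", "CPU")])
```

Note that len counts rows, so the "Quantity" part of the table is a count of deals rather than a sum of quantities.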

-----------------------------------------------------------------------------             
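#Add totals: the row/column totals are computed automatically
#margins=True

#The margins depend on how the data is grouped (that is, on index / columns)
#This makes sense: a "total" presupposes grouping; without groups there is nothing to total
#With index set, a margin appears at the bottom (a totals row for the row groups)
#With columns set, a margin appears at the far right (a totals column for the column groups)
#★ So with index but no columns, there is only the one margin at the bottom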
>>> df.pivot_table(index="Manager",
                   values="Price",
                   margins=True,margins_name="New Sum",
                   aggfunc=np.sum,fill_value=0)

	        Price
Manager	
Debra Henley	235000
Fred Anderson	287000
New Sum	        522000

#★ After columns is added, a margin appears at the far right
>>> df.pivot_table(index="Manager",
                   values="Price",
                   columns="Product",
                   margins=True,margins_name="New Sum",
                   aggfunc=np.sum,fill_value=0)

        Product	CPU	Maintenance	Monitor	Software	New Sum
Manager					
Debra Henley	205000	10000	        0	20000	        235000
Fred Anderson	260000	12000	        5000	10000	        287000
New Sum	        465000	22000	        5000	30000	        522000

#★ In addition, each values column gets its own margin at the far right
#This requires columns (without columns, there is only the margin at the bottom)
>>> df.pivot_table(index=["Manager"],
                   values=["Price","Quantity"],
                   columns="Product",
                   margins=True,margins_name="New Sum",
                   aggfunc=np.sum,fill_value=0)

	        Price	                                                Quantity
        Product	CPU	Maintenance	Monitor	Software	New Sum	CPU	Maintenance	Monitor	Software	New Sum
Manager										
Debra Henley	205000	10000	        0	20000	        235000	7	4	        0	2	        13
Fred Anderson	260000	12000	        5000	10000    	287000	10	4	        2	1	        17
New Sum	        465000	22000	        5000	30000	        522000	17	8	        2	3	        30

#★ How the margin is computed
#It is not simply a sum; it follows the same rule as aggfunc
#With aggfunc=np.mean, the margin is also a mean
#With aggfunc=np.sum, the margin is also a sum
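
The margin rule can be verified with numbers small enough to check by hand. The data below is invented, and the string names "sum" and "mean" stand in for np.sum and np.mean.

```python
import pandas as pd

# Illustrative data: three rows for Debra, one for Fred.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred"],
    "Price": [10000, 20000, 30000, 100000],
})

by_sum = df.pivot_table(index="Manager", values="Price",
                        aggfunc="sum", margins=True)
by_mean = df.pivot_table(index="Manager", values="Price",
                         aggfunc="mean", margins=True)

# With sum, the margin is the total of all rows: 160000.
print(by_sum.loc["All", "Price"])

# With mean, the margin is the mean of all four rows: 40000.
# It is neither the sum of the group means (120000)
# nor the mean of the group means (60000).
print(by_mean.loc["All", "Price"])
```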

-----------------------------------------------------------------------------  
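#★ Note the ordering of "Status": when the column is a category with the order set earlier, the rows within each Manager follow that order (won, pending, presented, declined); as a plain string column they sort alphabetically, as in the listing below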
>>> df.pivot_table(index=["Manager","Status"],
                   values=["Price"],
                   aggfunc=[np.sum],fill_value=0,margins=True)

                                sum
                                Price
Manager        	Status	
Debra Henley	declined	70000
                pending	        50000
                presented	50000
                won	        65000
Fred Anderson	declined	65000
                pending	        5000
                presented	45000
                won	        172000
All                             522000


 

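Advanced filtering of the pivot table

The resulting DataFrame can be stored in a variable (table here), so that it can be analyzed further with other functions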
table = df.pivot_table(index=["Manager","Status"],columns=["Product"],
            values=["Quantity","Price"],
            aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)

table 
... ...
table.query('Manager == ["Debra Henley"]')
# or
table.query('Status == ["pending","won"]')
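
A self-contained sketch of the same filtering idea follows. It uses invented data and a simpler table (a single values column and a single function), so that query() only has to resolve the names of the index levels.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred"],
    "Status": ["presented", "pending", "won", "declined", "won"],
    "Price": [30000, 5000, 65000, 65000, 165000],
})

table = df.pivot_table(index=["Manager", "Status"],
                       values="Price", aggfunc="sum")

# query() can refer to index levels by name.
debra = table.query('Manager == "Debra"')

# Comparing a level with a list keeps the rows whose value is in the list.
pending_or_won = table.query('Status == ["pending", "won"]')

print(debra)
print(pending_or_won)
```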
 

 


References for this article:

详解 Pandas 透视表(pivot_table) (A Detailed Guide to the Pandas Pivot Table)

Pandas Pivot Table Explained_pbpython.com

Pandas Pivot Table Explained_jupyter.org

