The dataset used in this article can be downloaded from the following link:
http://note.youdao.com/noteshare?id=5f44492149116cb6c52233786c1ca98d&sub=6C35AFC6AF9441648F245393DCAC61CB
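Both MySQL and pandas work with two-dimensional tabular data like an Excel sheet. In such a table, each row can be seen as a record and each column as a field.
Anyone who has learned MySQL knows that it has a major pain point in data processing and statistical analysis: the order in which a statement is written differs from the order in which it is executed, which makes it easy for beginners to write incorrect SQL.
In practice there are two common styles for processing Excel-like two-dimensional tabular data: the SQL style and the pandas style.
SQL style, syntax order: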
SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
WHERE Condition1
GROUP BY Column1, Column2
HAVING Condition2
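SQL style, logical execution order:
FROM ... WHERE ... GROUP BY ... HAVING ... SELECT ... ORDER BY ... LIMIT
(Standard SQL evaluates HAVING before SELECT; MySQL, as an extension, still lets HAVING refer to a SELECT alias.)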
pandas style, where the syntax order and the logical execution order are the same:
df[Condition1].groupby([Column1,Column2],as_index=False).agg({Column3: "mean",Column4:"sum"})
select deptno,sum(sal) sums
from emp
group by deptno
having sums > 9000;
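import pandas as pd
from IPython.display import display  # already available as a builtin inside Jupyter
df = pd.read_excel(r"C:\Users\黄伟\Desktop\emp.xlsx")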
display(df)
df = df.groupby("deptno",as_index=False).agg({"sal":"sum"})
display(df)
df1 = df[df["sal"]>9000]
display(df1)
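The two-step version above can also be written as one chain that reads in the same order as it executes. The following is a minimal sketch, not the author's code: the `emp` frame is a hypothetical stand-in for emp.xlsx (its real contents are not reproduced here), and the column name `sums` is chosen to mirror the SQL alias.

```python
import pandas as pd

# Hypothetical stand-in for the emp table; values are made up for illustration.
emp = pd.DataFrame({
    "deptno": [10, 10, 20, 20, 30],
    "sal": [5000, 4500, 3000, 2000, 9500],
})

# GROUP BY deptno, then SUM(sal) AS sums, then HAVING sums > 9000, left to right.
result = (
    emp.groupby("deptno", as_index=False)
       .agg(sums=("sal", "sum"))      # named aggregation: new column "sums"
       .query("sums > 9000")          # plays the role of HAVING
       .reset_index(drop=True)
)
print(result)
```

Because the filter runs on the already aggregated frame, there is no ambiguity about whether `sums` exists yet, which is exactly the confusion the SQL ordering can cause.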
We can group a Series or DataFrame with the groupby method, which returns a GroupBy object. Printing that object directly, however, shows no information about the groups.
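A small sketch of what "no information" means in practice, using a made-up three-row frame. The variable names `grouped`, `kind` and `n_groups` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"], "num": [2, 4, 0]})
grouped = df.groupby("name")

# The repr is just the class name plus a memory address; no rows are shown.
print(grouped)

# Group information only appears when it is requested explicitly.
kind = type(grouped).__name__
n_groups = grouped.ngroups
print(kind, n_groups)
```

The grouping is lazy: nothing is computed or displayed until an aggregation, an iteration, or an attribute such as `ngroups` asks for it.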
x = {"name":["a","a","b","b","c","c","c"],"num":[2,4,0,5,5,10,15]}
df = pd.DataFrame(x)
display(df)
df.groupby("name",as_index=True).agg({"num":"sum"})
df.groupby("name",as_index=False).agg({"num":"sum"})
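To make the difference between the two as_index settings concrete, the sketch below inspects the index and columns of each result. It reuses the same toy data; `by_index` and `by_column` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "b", "c", "c", "c"],
                   "num": [2, 4, 0, 5, 5, 10, 15]})

# as_index=True (the default): the group key becomes the row index.
by_index = df.groupby("name", as_index=True).agg({"num": "sum"})
# as_index=False: the group key stays an ordinary column, like a SQL result set.
by_column = df.groupby("name", as_index=False).agg({"num": "sum"})

print(by_index.index.tolist(), by_index.columns.tolist())
print(by_column.index.tolist(), by_column.columns.tolist())
```

With as_index=False the result can be filtered or merged on "name" directly, without a reset_index() call first.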
x = {"name":["a","a","b","b","c","c","c"],"num":[2,4,0,5,5,10,15]}
df = pd.DataFrame(x)
display(df)
# This frame has no "deptno" column, so group by "name"
df.groupby("name").groups
df.groupby("name").size()
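As a hedged illustration of what these two attributes return on the toy frame (grouped by its "name" column), the sketch below converts the results to plain Python containers. `members` and `counts` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "b", "c", "c", "c"],
                   "num": [2, 4, 0, 5, 5, 10, 15]})
grouped = df.groupby("name")

# .groups maps each group key to the index labels of the rows in that group.
members = {key: list(labels) for key, labels in grouped.groups.items()}
# .size() counts the rows in each group.
counts = grouped.size()

print(members)
print(counts)
```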
x = {"name":["a","a","b","b","c","c","c"],"num":[2,4,0,5,5,10,15]}
df = pd.DataFrame(x)
display(df)
groupdf = df.groupby("name")
for (x,y) in groupdf:
    display(x, y)
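Iterating over the GroupBy object yields (key, sub-DataFrame) pairs. When only one group is needed, get_group() avoids the loop. A minimal sketch on the same toy data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "b", "c", "c", "c"],
                   "num": [2, 4, 0, 5, 5, 10, 15]})
grouped = df.groupby("name")

# Each iteration gives the group key and the rows belonging to that key.
keys = []
for key, sub in grouped:
    keys.append(key)
    print(key, sub["num"].tolist())

# Fetch a single group directly by its key.
c_rows = grouped.get_group("c")
print(c_rows)
```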
The following data is used to demonstrate these four kinds of grouping argument:
df = pd.DataFrame({"部门":["A", "A", "B", "B"],
"小组":["g1", "g2", "g1", "g2"],
"利润":[10, 20, 15, 28],
"人员":["a", "b", "c", "d"],
"年龄":[20, 15, 18, 30]})
display(df)
g = df.groupby("部门")
display(g)
for (x,y) in g:
    display(x, y)
g = df.groupby(["部门","小组"])
display(g)
for (x,y) in g:
    display(x, y)
g = df.groupby({0:1, 1:1, 2:1, 3:2})
display(g)
for (x,y) in g:
    display(x, y)
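When a dict is passed to groupby, its keys are matched against the row index labels and its values become the group names. The sketch below repeats the idea with string group names so the mapping is easier to read. The names "first_three" and "last" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"部门": ["A", "A", "B", "B"],
                   "利润": [10, 20, 15, 28]})

# Keys are index labels (0..3 here); values are the group each row is assigned to.
mapping = {0: "first_three", 1: "first_three", 2: "first_three", 3: "last"}
profit = df.groupby(mapping)["利润"].sum()
print(profit)
```

This is handy when the grouping rule lives outside the table, for example in a lookup built from another source.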
df = pd.DataFrame({"部门":["A", "A", "A", "B", "B", "B"],
"利润":[10, 32, 20, 15, 28, 10],
"销售量":[20, 15, 33, 18, 30, 22]})
display(df)
df["排名"] = df["销售量"].groupby(df["部门"]).rank()
df
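Grouping a Series by another Series, as above, gives the same ranks as selecting the column from a DataFrame groupby. The sketch below shows both spellings and adds a descending rank, which is usually what a "top seller" ranking wants. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"部门": ["A", "A", "A", "B", "B", "B"],
                   "销售量": [20, 15, 33, 18, 30, 22]})

# Two spellings of the same per-department rank (ascending by default).
by_series = df["销售量"].groupby(df["部门"]).rank()
by_name = df.groupby("部门")["销售量"].rank()

# ascending=False gives rank 1 to the largest value within each department.
top_first = df.groupby("部门")["销售量"].rank(ascending=False)
print(pd.DataFrame({"asc": by_name, "desc": top_first}))
```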
df = pd.DataFrame({"部门":["A", "A", "B", "B", "C", "C"],
"小组":["g1", "g2", "g1", "g2", "g1", "g2"],
"利润":[10, 20, 15, 28, 12, 14],
"人员":["a", "b", "c", "d", "e", "f"],
"年龄":[20, 15, 18, 30, 23, 34]})
df = df.set_index("部门")
display(df)
def func(x):
    # x is an index label (a 部门 value), not a row
    if x=="A" or x=="B":
        return 0
    else:
        return 1
g = df.groupby(func)
display(g)
for (x,y) in g:
    display(x, y)
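A function passed to groupby is called on each index label, which is why the frame is indexed by 部门 first. The sketch below follows the grouping with an aggregation so the effect is visible in one result. The function name `bucket` and the labels "AB" and "other" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"部门": ["A", "A", "B", "B", "C", "C"],
                   "利润": [10, 20, 15, 28, 12, 14]}).set_index("部门")

def bucket(label):
    # label is one index value such as "A"; the return value is the group key.
    return "AB" if label in ("A", "B") else "other"

profit = df.groupby(bucket)["利润"].sum()
print(profit)
```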
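When groupby() is used, the result is a GroupBy object. When it is not, the whole table can be treated as a single group, so for aggregation purposes it behaves much like a GroupBy object.
On a GroupBy object we can either call the aggregation methods sum(), mean(), count(), max() and min() directly, or call its agg() method and pass the desired arguments to agg().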
df = pd.DataFrame({"部门":["A", "A", "B", "B", "C", "C"],
"小组":["g1", "g2", "g1", "g2", "g1", "g2"],
"利润":[10, 20, 15, 28, 12, 14],
"人员":["a", "b", "c", "d", "e", "f"],
"年龄":[20, 15, 18, 30, 23, 34]})
display(df)
df["利润"].mean()
df[["年龄","利润"]].mean()
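Treating the whole table as one group also works with agg(), not only with the direct methods. A minimal sketch on the numeric columns of the same data, where `summary` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"利润": [10, 20, 15, 28, 12, 14],
                   "年龄": [20, 15, 18, 30, 23, 34]})

# Several aggregations at once over the ungrouped table:
# one row per aggregation, one column per original column.
summary = df[["利润", "年龄"]].agg(["mean", "max"])
print(summary)
```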
df = pd.DataFrame({"部门":["A", "A", "B", "B", "C", "C"],
"小组":["g1", "g2", "g1", "g2", "g1", "g2"],
"利润":[10, 20, 15, 28, 12, 14],
"人员":["a", "b", "c", "d", "e", "f"],
"年龄":[20, 15, 18, 30, 23, 34]})
display(df)
df.groupby("部门")["利润"].mean()
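# numeric_only=True skips the text columns 小组 and 人员; pandas 2.x raises a TypeError without it
df.groupby("部门").mean(numeric_only=True)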
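The explanations below refer to "aggregation function strings". This is a name I made up for strings such as "sum", "mean", "count", "max" and "min". Note also that DataFrame.agg() takes an axis parameter that selects whether to aggregate along rows or columns; the agg() method of a GroupBy object has no such parameter.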
df = pd.DataFrame({"部门":["A", "A", "B", "B"],
"利润":[10, 20, 15, 28],
"年龄":[20, 15, 18, 30]})
display(df)
df1 = df.groupby("部门").agg("mean")
display(df1)
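The axis parameter mentioned earlier applies to agg() on an ungrouped DataFrame. The sketch below, using only the numeric columns of the same data, shows the two directions; `per_column` and `per_row` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"利润": [10, 20, 15, 28],
                   "年龄": [20, 15, 18, 30]})

# axis=0 (the default): aggregate down each column, one result per column.
per_column = df.agg("sum")
# axis=1: aggregate across each row, one result per row.
per_row = df.agg("sum", axis=1)

print(per_column)
print(per_row)
```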
df = pd.DataFrame({"部门":["A", "A", "B", "B"],
"利润":[10, 20, 15, 28],
"年龄":[20, 15, 18, 30]})
display(df)
df1 = df.groupby("部门").agg(["sum","mean"])
display(df1)
df = pd.DataFrame({"部门":["A", "A", "B", "B"],
"利润":[10, 20, 15, 28],
"年龄":[20, 15, 18, 30]})
display(df)
df1 = df.groupby("部门").agg({"利润":["sum","mean"],"年龄":["max","min"]})
display(df1)
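Passing lists inside the dict produces two-level column labels, which can be awkward to select from later. One possible alternative is pandas' named aggregation, sketched below. The output column names (profit_sum and so on) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"部门": ["A", "A", "B", "B"],
                   "利润": [10, 20, 15, 28],
                   "年龄": [20, 15, 18, 30]})

# Dict-of-lists form: columns become (column, function) pairs.
nested = df.groupby("部门").agg({"利润": ["sum", "mean"], "年龄": ["max", "min"]})
print(nested.columns.tolist())

# Named aggregation: new_name=(source_column, function) gives flat column names.
flat = df.groupby("部门").agg(
    profit_sum=("利润", "sum"),
    profit_mean=("利润", "mean"),
    age_max=("年龄", "max"),
    age_min=("年龄", "min"),
)
print(flat)
```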
df = pd.DataFrame({"部门":["A", "A", "A", "B", "B", "B"],
"利润":[10, 32, 20, 15, 28, 10],
"销售量":[20, 15, 33, 18, 30, 22]})
display(df)
df.groupby("部门").agg(lambda x:x.max()-x.min())
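The lambda above receives each column of each group as a Series and returns one number. A named function does the same job and is easier to reuse. A minimal sketch, where `value_range` is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({"部门": ["A", "A", "A", "B", "B", "B"],
                   "利润": [10, 32, 20, 15, 28, 10],
                   "销售量": [20, 15, 33, 18, 30, 22]})

def value_range(s):
    # s is one column of one group (a Series); return a single scalar.
    return s.max() - s.min()

ranges = df.groupby("部门").agg(value_range)
print(ranges)
```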