SparkSQL/Hive 提供了许多的分析函数,用于完成复杂统计分析。
sum、avg、min、max,分别用于计算分组内相关统计信息。
1、用SQL实现下面的操作
测试数据:
±-------±------------------±–+
|cookieid| createtime| pv|
±-------±------------------±–+
| cookie1|2015-04-10 00:00:00| 1|
| cookie1|2015-04-11 00:00:00| 5|
| cookie1|2015-04-12 00:00:00| 7|
| cookie1|2015-04-13 00:00:00| 3|
| cookie1|2015-04-14 00:00:00| 2|
| cookie1|2015-04-15 00:00:00| 4|
| cookie1|2015-04-16 00:00:00| 4|
| cookie2|2015-04-10 00:00:00| 2|
| cookie2|2015-04-11 00:00:00| 3|
| cookie2|2015-04-12 00:00:00| 5|
| cookie2|2015-04-13 00:00:00| 6|
| cookie2|2015-04-14 00:00:00| 3|
| cookie2|2015-04-15 00:00:00| 9|
| cookie2|2015-04-16 00:00:00| 7|
±-------±------------------±–+
根据以下要求实现新的列:
根据cookieid分组,按照createtime排序,新列的值为第一行到当前行pv的合计值
SQL实现:
val sql =
"""select cookieid,createtime,pv,
|sum(pv) over(partition by cookieid order by createtime) as pv1
|from t1""".stripMargin
spark.sql(sql).show()
结果:
计算方式:sum(pv)
分组方式:partition by cookieid
排序方式:order by createtime
计算范围:未指定,默认第一行到当前行
SQL实现pv2列指定计算范围:
val sql =
"""select cookieid,createtime,pv,
|sum(pv) over(partition by cookieid order by createtime) as pv1,
|sum(pv) over(partition by cookieid order by createtime
|rows between unbounded preceding and current row) as pv2
|from t1""".stripMargin
spark.sql(sql).show()
上面的操作得出pv1和pv2完全一样
2、分析函数一般语法结构
分析函数名(参数) over(partition by子句 order by子句 rows/range子句)
(1)分析函数名:sum、max、min、count、avg等聚集函数,或lead、lag行比较函数 或 排名函数等;
(2)over:关键字,表示前面的是分析函数,而不是普通的聚合函数
(3)分析子句:
over关键字后面挂号内的内容;分析子句又由以下三部分组成:
partition by:分组子句,表示分析函数的计算范围,不同的组互不相干;
order by:排序子句,表示分组后,组内的排序方式;
rows/range:窗口子句,是在分组(partition by)后,组内的子分组(也称窗口),是分析函数的计算范围窗口。窗口有两种,rows和range;
3、window子句
rows between也叫window子句,当同一个select查询中存在多个窗口函数时,他们相互之间是没有影响的.每个窗口函数应用自己的规则
rows between unbounded preceding and current row
rows between … and …(开始到结束,位置不能交换)
unbounded preceding:从第一行开始
current row:当前行
unbounded following:最后一行
n preceding:前n行
n following:后n行
4、rows和range的区别
range是逻辑窗口,是指定当前行对应值的范围取值,行数不固定,只要行值在范围内,对应列都包含在内。
rows是物理窗口,即根据order by 子句排序后,取的前N行及后N行的数据计算(与当前行的值无关,只与排序后的行号相关)。
测试range:
spark.sql("""SELECT cookieid, createtime, pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY pv
RANGE BETWEEN 1 preceding AND 2 following) AS pv2
FROM t1
""").show
可以得出结论:
当pv = 1时,sum为1-1 <= pv <= 1+2的和,sum = 1+2+3
当pv = 2是,sum为2-1 <= pv <= 2+2的和,sum = 1+2+3+4+4
…
以此类推下去,结果如上例中所示
测试rows:
spark.sql("""SELECT cookieid, createtime, pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY pv
ROWS BETWEEN 1 preceding AND 2 following) AS pv2
FROM t1
""").show
有结果可以很明显的看出pv2的结果为前一行到后两行的和
spark.sql("""
SELECT cookieid, createtime, pv,
row_number() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank1,
rank() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank2,
dense_rank() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank3,
ntile(3) OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank4
FROM t1
""").show
row_number() 是没有重复值的排序(即使两行记录相等也是不重复的)
rank() 是跳跃排序,两个第二名下来就是第四名
dense_rank() 是连续排序,两个第二名仍然跟着第三名
ntile(n),用于将分组数据按照顺序切分成n片
6、行函数
6.1 lag、lead
spark.sql("""select cookieid,createtime,pv,
|lag(pv) over (partition by cookieid order by pv) as col1,
|lag(pv,1,0) over (partition by cookieid order by pv) as col2,
|lag(pv,2) over (partition by cookieid order by pv) as col3
|from t1""".stripMargin).show()
lag(field,n,value)
表示获取字段field的前n行信息,如果没有就用value代替,如果没有写value则用null代替
spark.sql("""select cookieid,createtime,pv,
|lead(pv) over (partition by cookieid order by pv) as col1,
|lead(pv,1,0) over (partition by cookieid order by pv) as col2,
|lead(pv,2) over (partition by cookieid order by pv) as col3
|from t1""".stripMargin).show()
lead(field,n,value)
表示获取字段field的后n行信息,如果没有就用value代替,如果没有写value则用null代替
6.2 first_value、last_value
first_value,取分组内排序后,截止到当前行,第一个值
last_value,取分组内排序后,截止到当前行,最后一个值
spark.sql(
"""select cookieid,createtime,pv,
|row_number() over(partition by cookieid order by pv desc) as rank1,
|first_value(createtime) over(partition by cookieid order by pv desc) as rank2,
|first_value(pv) over(partition by cookieid order by pv desc) as rank3
|from t1""".stripMargin).show()
spark.sql(
"""select cookieid,createtime,pv,
|row_number() over(partition by cookieid order by pv desc) as rank1,
|last_value(createtime) over(partition by cookieid order by pv desc) as rank2,
|last_value(pv) over(partition by cookieid order by pv desc) as rank3
|from t1""".stripMargin).show()
备注:lag、lead、first_value、last_value 不支持窗口子句