hive上有个表,其中某列p_9的数据格式是用逗号分隔的字符串。通过下面的方式计算p_9列使用逗号分隔后元素的长度。
select rg,sum(size(split(p_9,","))) from ttengine_api_data where dt='2017-08-07' group by rg;
OK
0 137683
1 150155
如果p_9列不为空,那么计算是没问题的。如果是空(“”或者null),则计算后是有问题的。仔细查了一下,发现是size(split(p_9,",")) 有问题,即:
如果p_9是空或者null,那么split成数组后,在计算数据的长度居然是1。知道了原因,那么改起来很简单,使用下面的方式统计就没问题了:
select rg,sum(if(length(p_9)==0,0,size(split(p_9,",")))) from ttengine_api_data where dt='2017-08-07' group by rg;
OK
0 0
1 6373
有如下访客访问次数的统计表 t_access
访客 月份 访问次数
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-02 10
B 2015-02 5
…… …… ……
要求输出每个客户在每个月的总访问次数,以及在当前月份之前所有月份的累积访问次数。
输出表
访客 月份 月访问总计 累计访问总计
A 2015-01 33 33
A 2015-02 10 43
……. ……. ……. ……. ……. ……. ……. ……. ……. …….
B 2015-01 30 30
B 2015-02 15 45
……. ……. ……. ……. ……. ……. ……. ……. ……. …….
【解析】
1)第一步,先求每个用户的月总访问次数
select username,month,sum(count) as salary from t_access_times group by username,month
+-----------+----------+---------+--+
| username | month | count |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
2)第二步,将月总访问次数表 自己连接 自己连接(内连接)
(select username,month,sum(count) as salary from t_access_times group by username,month) A
join
(select username,month,sum(count) as salary from t_access_times group by username,month) B
on
A.username=B.username
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
3)第三步,从上一步的结果中进行分组查询,分组的字段是a.username a.month,求月累计值: 将b.month <= a.month的所有b.salary求和即可
3.HQL
SELECT
A.username,
A.month,
MAX(A.count),
SUM(B.count)
FROM
(
SELECT
username,
MONTH,
SUM(COUNT) AS COUNT
FROM
t_accessgroup BY username,
MONTH
)
A
INNER JOIN
(
SELECT
username,
MONTH,
SUM(COUNT) AS COUNT
FROM
t_access
GROUP BY
username,
MONTH
)
B
ON
A.username = B.username
WHERE
B.month <= A.month
GROUP BY
A.username,
A.month
ORDER BY
A.username,
A.month;
HIVE直接读入json的函数有两个:
(1)get_json_object(string json_string, string path)
返回值: string
说明:解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。
举例:
hive >
SELECT
get_json_object(‘{“store”:{“fruit”:[{"weight":8, "type":" apple"}, {"weight":9, "type":" pear"}], “bicycle”:{“price”:19.95, ”color”:”red”}}, “email”:”amy@only_for_json_udf_test.net”, “owner”:”amy” } ‘, ’$.owner’)
FROM
dual;
结果:amy
这个函数每次只能返回一个数据项。
(2)json_tuple(jsonStr, k1, k2, ...)
参数为一组键k1,k2……和JSON字符串,返回值的元组。该方法比 get_json_object 高效,因为可以在一次调用中输入多个键
SELECT
a.timestamp,
b.*
FROM
log a lateral VIEW json_tuple(a.appevent, 'eventid', 'eventname') b AS f1,
f2;
处理数据样例:
{"GPS_LAT":39.8965125, "GPS_LONG":116.3493225, "GPS_SPEED":20.9993625, "GPS_STATE":" A", "GPS_TIME":" 2014-01-02 00:00:16", "IMEI":" 508597", "after_oxygen_sensor":132, "air_condion_state":3, "bdoneNo_after_mileage":0, "bdoneNo_zero_mileage":8044, "db_speed":22, "direction_angle":358.2585, "front_oxygen_sensor":64, "instant_fuel":233, "speed":1210, "torque":33, "total_fuel":0}
处理HIVE语句:
CREATE TABLE 2014jrtest AS
SELECT
json_tuple(line, 'GPS_LAT', 'GPS_LONG', 'GPS_SPEED', 'GPS_STATE', 'GPS_TIME', 'IMEI', 'after_oxygen_sensor', 'air_condion_state', 'bdoneNo_after_mileage', 'bdoneNo_zero_mileage', 'db_speed', 'direction_angle', 'front_oxygen_sensor', 'instant_fuel', 'speed', 'torque', 'total_fuel')
FROM
2014test;
hive之Json解析(普通Json和Json数组)
SELECT
MAX(salary) - MIN(salary) AS growth
FROM
salaries
GROUP BY
emp_no
HAVING
emp_no = '10001' 【严谨的思路】
SELECT
(
(
SELECT
salary
FROM
salaries
WHERE
emp_no = 10001
ORDER BY
to_date DESC LIMIT 1
)
-
(
SELECT salary FROM salaries WHERE emp_no = 10001 ORDER BY to_date ASC LIMIT 1
)
) AS growth
2、【累计求和】所有聚合函数都能用做窗口函数,其语法和专用窗口函数完全相同——sum(<汇总列>) over(<排序列>) as 别名;
SELECT
emp_no,
salary,
SUM(salary) over(order by emp_no) AS running_total
FROM
salaries
WHERE
to_date = '9999-01-01';
【把所有小于等于当前编号的表s1和当前编号表s2联立起来,然后按照当前编号分组,计算出所有小于等于当前标号的工资总数】
SELECT
b.emp_no,
b.salary,
SUM(a.salary) AS running_total
FROM
salaries a
INNER JOIN salaries b
ON
a.emp_no <= b.emp_no
AND a.to_date = "9999-01-01"
AND b.to_date = "9999-01-01"
GROUP BY
b.emp_no,
b.salary
3、对于employees表中,给出奇数行的first_name
select e1.first_name from employees as e1
where (select count(e2.first_name) from employees as e2
where e1.first_name >= e2.first_name)%2 = 1;
4、 EXISTS用于检查子查询是否至少会返回一行数据,该子查询实际上并不返回任何数据,而是返回值True或False
SQL中EXISTS的用法
select * from employees
where not exists (select * from dept_emp where emp_no=employees.emp_no)
5、LIMIT 1 OFFSET 2 -- 去掉OFFSET排名倒数第一第二的时间,取倒数第三
【limit y,x和limit x offset y等价】
【LIMIT 后的数字代表返回几条记录,OFFSET 后的数字代表从第几条记录开始返回(第一条记录序号为0),也可理解为跳过多少条记录后开始返回】
【在 LIMIT X,Y 中,Y代表返回几条记录,X代表从第几条记录开始返回(第一条记录序号为0),切勿记反】
`以下的两种方式均表示取``2``,``3``,``4``三条条数据。`
`1``.select* from test LIMIT` `1``,``3``;`
`当limit后面跟两个参数的时候,第一个数表示要跳过的数量,后一位表示要取的数量。`
`2``.select * from test LIMIT` `3` `OFFSET` `1``;(在mysql` `5``以后支持这种写法)`
`当 limit和offset组合使用的时候,limit后面只能有一个参数,表示要取的的数量,offset表示要跳过的数量 。`
6、查找排除当前最大、最小salary之后的员工的平均工资avg_salary
select avg(salary) as avg_salary
from salaries
where to_date = '9999-01-01'
and salary<(select max(salary) from salaries)
and salary>(select min(salary) from salaries)
7、统计gmv分布
SELECT
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.95, 9999) AS q095,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.94, 9999) AS q094,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.93, 9999) AS q093,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.92, 9999) AS q092,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.91, 9999) AS q091,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.90, 9999) AS q090,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.85, 9999) AS q085,
percentile_approx(CAST(batch_gmv_amount AS DOUBLE), 0.80, 9999) AS q080
FROM
(
SELECT
id,
batch_gmv_amount
FROM
mytable
)
c
参考:
【SQL刷题】SQL语法学习与面试题练习