This article covers subqueries and the WITH AS statement in Hive, laying the groundwork for the multi-table queries introduced later.
SELECT column_name
FROM (
SELECT column_name
FROM table_name
) subquery_alias;
Note: in Hive, a subquery in the FROM clause must be given an alias; omitting the alias causes a parse error, as demonstrated later in this section.
Example: query table t_od_use_cnt for the platforms and versions that had more than 100 active users on 20190101.
The statement is as follows:
SELECT platform
,app_version
FROM (
SELECT platform
,app_version
,count(user_id) AS num
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
GROUP BY platform
,app_version
HAVING num > 100
) a
GROUP BY platform
,app_version;
The output is as follows:
hive (app)> SELECT platform
> ,app_version
> FROM (
> SELECT platform
> ,app_version
> ,count(user_id) AS num
> FROM app.t_od_use_cnt
> WHERE date_8 = 20190101
> AND is_active = 1
> GROUP BY platform
> ,app_version
> HAVING num > 100
> ) a
> GROUP BY platform
> ,app_version;
platform app_version
1 1.1
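Incidentally, since the inner query already groups by platform and app_version, every row it returns is unique, and the outer GROUP BY only deduplicates. A hypothetical equivalent way to write the outer query is with SELECT DISTINCT:

```sql
-- A sketch, not the author's original: DISTINCT replaces the outer GROUP BY,
-- because the inner query already returns one row per (platform, app_version).
SELECT DISTINCT platform
,app_version
FROM (
SELECT platform
,app_version
,count(user_id) AS num
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
GROUP BY platform
,app_version
HAVING num > 100
) a;
```

Both forms should return the same rows; which one reads better is a matter of taste.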
If the subquery is not followed by the alias a, the error is as follows:
hive (app)> SELECT platform
> ,app_version
> FROM (
> SELECT platform
> ,app_version
> ,count(user_id) AS num
> FROM app.t_od_use_cnt
> WHERE date_8 = 20190101
> AND is_active = 1
> GROUP BY platform
> ,app_version
> HAVING num > 100
> )
> GROUP BY platform
> ,app_version;
FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:10924)
at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45850)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5335)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3741)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:45873)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:41516)
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:41402)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:40413)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:40283)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1590)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 13:0 Failed to recognize predicate 'GROUP'. Failed rule: 'identifier' in subquery source
The long stack trace in the middle of the error can be ignored; when reading an error, go straight to the content after FAILED: on the last line.
From the error we can see that line 13 of the script fails, because the subquery source has no identifier, i.e. the subquery was not given an alias.
This is also why I encourage everyone to format their scripts nicely: the error message tells you which line failed, and well-formatted code lets you locate the problem quickly.
SELECT column_name
FROM table_name1
WHERE column_name [NOT] IN (
SELECT column_name
FROM table_name2
);
Note: the subquery after IN may return only a single column.
Here we create the second table of this tutorial, the new-user table t_od_new_user.
The CREATE TABLE statement is as follows:
CREATE TABLE t_od_new_user (
platform string comment 'platform 1:android,2:ios'
,app_version string comment 'app version'
,channel INT comment 'channel'
,user_id BIGINT comment 'user id'
) partitioned BY (date_8 INT comment 'date') row format delimited fields terminated BY ',';
Create a new directory to hold the data files:
mkdir -p /root/hive_practice_data/t_od_new_user
Put the three data files in that directory, then load them into the 20190101-20190103 partitions of table t_od_new_user; see Beginner's Guide (Part 10) for the detailed procedure.
The data can be downloaded from the following Baidu Netdisk link:
https://pan.baidu.com/s/1wXU05_chobz_Ry8TdWLOug
Extraction code: wzk5
The load commands are as follows:
load data local inpath '/root/hive_practice_data/t_od_new_user/20190101_newuser.csv'
into table t_od_new_user partition (date_8=20190101);
load data local inpath '/root/hive_practice_data/t_od_new_user/20190102_newuser.csv'
into table t_od_new_user partition (date_8=20190102);
load data local inpath '/root/hive_practice_data/t_od_new_user/20190103_newuser.csv'
into table t_od_new_user partition (date_8=20190103);
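After loading, it is worth sanity-checking that all three partitions actually contain data. A simple (hypothetical) check is a row count per partition:

```sql
-- Verify the three partitions were loaded: one row per date with its record count.
SELECT date_8
,count(*) AS cnt
FROM app.t_od_new_user
GROUP BY date_8;
```

If any of the three dates is missing from the output, re-check the corresponding load command and file path.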
Now we can move on to an example of a subquery after WHERE:
Example: for 20190101, count the active users in table t_od_use_cnt who are also new users that day, grouped by platform and version.
The script is as follows:
SELECT platform
,app_version
,count(user_id) AS num
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
AND user_id IN (
SELECT user_id
FROM app.t_od_new_user
WHERE date_8 = 20190101
)
GROUP BY platform
,app_version;
The result is as follows:
hive (app)> SELECT platform
> ,app_version
> ,count(user_id) AS num
> FROM app.t_od_use_cnt
> WHERE date_8 = 20190101
> AND is_active = 1
> AND user_id IN (
> SELECT user_id
> FROM app.t_od_new_user
> WHERE date_8 = 20190101
> )
> GROUP BY platform
> ,app_version;
platform app_version num
1 1.1 24
1 1.2 21
1 1.3 25
1 1.4 21
1 1.5 25
2 1.1 19
2 1.2 18
2 1.3 15
2 1.4 20
2 1.5 18
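The NOT IN form from the template works the same way. For example, to count the active users on 20190101 who are not new users that day (a sketch under the same table assumptions; not run against the tutorial data, so no output is shown):

```sql
-- Same query as above, but excluding new users instead of keeping them.
SELECT platform
,app_version
,count(user_id) AS num
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
AND user_id NOT IN (
SELECT user_id
FROM app.t_od_new_user
WHERE date_8 = 20190101
)
GROUP BY platform
,app_version;
```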
WITH a
AS (
SELECT column_name
FROM table_name1
)
,b
AS (
SELECT column_name
FROM a
)
SELECT column_name
FROM b;
Hive's WITH AS statement can define any number of HQL fragments, each of which can be referenced by the rest of the SQL statement, and each fragment's alias is up to you. This style has several advantages: it keeps complex logic readable, avoids deeply nested subqueries, and lets a fragment be reused by name instead of being written out repeatedly.
In short, I strongly recommend structuring your script logic with WITH AS wherever possible.
Here we use WITH AS to rewrite the script from section 1.1.2 above:
WITH a
AS (
SELECT platform
,app_version
,count(user_id) AS num
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
GROUP BY platform
,app_version
HAVING num > 100
)
SELECT platform
,app_version
FROM a
GROUP BY platform
,app_version;
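As the template showed, a later fragment can also reference an earlier one. For example, the WHERE ... IN query from earlier could be rewritten with two chained fragments (a sketch; it should produce the same result as the earlier query, but treat the fragment names as illustrative):

```sql
-- Two fragments: new_users holds the day's new users,
-- active holds the day's active usage rows; the final SELECT joins them via IN.
WITH new_users
AS (
SELECT user_id
FROM app.t_od_new_user
WHERE date_8 = 20190101
)
,active
AS (
SELECT platform
,app_version
,user_id
FROM app.t_od_use_cnt
WHERE date_8 = 20190101
AND is_active = 1
)
SELECT platform
,app_version
,count(user_id) AS num
FROM active
WHERE user_id IN (
SELECT user_id
FROM new_users
)
GROUP BY platform
,app_version;
```

Each condition now lives in its own named fragment, which is exactly the readability benefit described above.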
The result is the same as above.
If you've read this far, give this post a like in the top-right corner and follow me, thanks!