Operators and functions in Hive are similar to those in relational databases. This post mainly covers some of Hive's functions.
I Function Categories
Hive's functions fall into the following categories (a few illustrative calls are sketched after the list):
- Mathematical functions: used for math operations, e.g. Rand() and E();
- Collection functions: used to get the size, keys, and values of complex types, e.g. Size(Array);
- Type conversion functions: mainly Cast and Binary, which convert one type into another;
- Date functions: perform date-related operations, e.g. Year(string date) and Month(string date);
- Conditional functions: check a condition and return a value accordingly, e.g. Coalesce, IF, and CASE WHEN;
- String functions: perform string-related operations, e.g. Upper(string A) and Trim(string A);
- Aggregate functions: perform aggregations, e.g. Sum() and Count(*);
- Table-generating functions: turn a single input into multiple output rows, e.g. Explode(Map) and Json_tuple(jsonString, k1, k2, ...);
- Custom functions: written in Java as extensions of Hive's built-in functions.
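To make the categories concrete, here is a minimal sketch with one call per category, written against the employee table used later in this post (results omitted; the literals are arbitrary illustration values):
select rand() from employee limit 1;                        -- math: random number in [0, 1)
select size(array(1,2,3)) from employee limit 1;            -- collection: number of elements
select cast('100' as int) from employee limit 1;            -- type conversion: string to int
select year('2018-07-09') from employee limit 1;            -- date: extract the year
select coalesce(null, 'fallback') from employee limit 1;    -- conditional: first non-NULL argument
select upper('hive') from employee limit 1;                 -- string: to uppercase
select count(*) from employee;                              -- aggregate: row count
select explode(array('a','b','c')) from employee limit 1;   -- table-generating: one row per element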
II Viewing Functions
In Hive, the supported functions, and how to use them, can be looked up as follows:
1 List the functions in Hive
0: jdbc:hive2://localhost:10000/hive> show functions;
2 View how a function is used
0: jdbc:hive2://localhost:10000/hive> desc function upper;
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
+----------------------------------------------------+
1 row selected (0.254 seconds)
0: jdbc:hive2://localhost:10000/hive> desc function extended upper;
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
| Synonyms: ucase |
| Example: |
| > SELECT upper('Facebook') FROM src LIMIT 1; |
| 'FACEBOOK' |
| Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper |
| Function type:BUILTIN |
+----------------------------------------------------+
7 rows selected (0.22 seconds)
Note: the extended form returns more detail, including synonyms, an example, the implementing class, and the function type.
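If you only care about functions matching a name pattern, SHOW FUNCTIONS also accepts a regular-expression filter; a small sketch (depending on the Hive version, the LIKE keyword may be required or omitted, and the output is not shown here):
show functions like 'date.*';    -- functions whose names start with "date"
show functions like '.*upper.*'; -- functions whose names contain "upper"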
III Function Examples
1 Sample data
0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name | employee.work_place | employee.sex_age | employee.skills_score | employee.depart_title |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Michael | ["Montreal","Toronto"] | {"sex":"Male","age":30} | {"DB":80} | {"Product":["Developer","Lead"]} |
| Will | ["Montreal"] | {"sex":"Male","age":35} | {"Perl":85} | {"Product":["Lead"],"Test":["Lead"]} |
| Shelley | ["New York"] | {"sex":"Female","age":27} | {"Python":80} | {"Test":["Lead"],"COE":["Architect"]} |
| Lucy | ["Vancouver"] | {"sex":"Female","age":57} | {"Sales":89,"HR":94} | {"Sales":["Lead"]} |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
4 rows selected (0.353 seconds)
2 Complex data types
1) SIZE
This function returns the size of a Map, Array, or nested Map/Array; it returns -1 when the value is NULL (e.g. a map key that does not exist, as in the nest_size column below).
0: jdbc:hive2://localhost:10000/hive> select size(work_place) as array_size,
. . . . . . . . . . . . . . . . . . > size(skills_score) as map_size,
. . . . . . . . . . . . . . . . . . > size(depart_title) as complex_size,
. . . . . . . . . . . . . . . . . . > size(depart_title['Product']) as nest_size
. . . . . . . . . . . . . . . . . . > from employee;
+-------------+-----------+---------------+------------+
| array_size | map_size | complex_size | nest_size |
+-------------+-----------+---------------+------------+
| 2 | 1 | 1 | 2 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 2 | -1 |
| 1 | 2 | 1 | -1 |
+-------------+-----------+---------------+------------+
4 rows selected (0.38 seconds)
2)ARRAY_CONTAINS / SORT_ARRAY
array_contains checks whether an array contains a given value, returning true if it does and false otherwise; sort_array sorts an array in ascending order.
0: jdbc:hive2://localhost:10000/hive> select name,array_contains(work_place,'Toronto') as is_Toronto,
. . . . . . . . . . . . . . . . . . > sort_array(work_place) as sort_array
. . . . . . . . . . . . . . . . . . > from employee;
+----------+-------------+-------------------------+
| name | is_toronto | sort_array |
+----------+-------------+-------------------------+
| Michael | true | ["Montreal","Toronto"] |
| Will | false | ["Montreal"] |
| Shelley | false | ["New York"] |
| Lucy | false | ["Vancouver"] |
+----------+-------------+-------------------------+
4 rows selected (0.547 seconds)
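The collection-function family also includes map_keys() and map_values(), which combine naturally with array_contains(); a hedged sketch against the same table (output omitted):
select name,
       map_keys(skills_score)                           as skills,
       map_values(skills_score)                         as scores,
       array_contains(map_keys(skills_score), 'Python') as knows_python
from employee;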
3 Date functions
1) From_unixtime(unix_timestamp())
This combination is similar to SYSDATE in Oracle: it dynamically returns the current date and time.
0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()) as current_time
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+----------------------+
| current_time |
+----------------------+
| 2018-07-09 11:04:31 |
+----------------------+
1 row selected (0.473 seconds)
2)unix_timestamp
unix_timestamp can be used to compare two dates, or in an ORDER BY clause for sorting. In the example below, the time elapsed since 07:00:00 is about 4.2 hours, and 4.2/24 ≈ 0.175 days.
0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()),(unix_timestamp()-unix_timestamp('2018-07-09 7:00:00'))/60/60/24 as daydiff from employee limit 1;
+----------------------+----------------------+
| _c0 | daydiff |
+----------------------+----------------------+
| 2018-07-09 11:12:06 | 0.17506944444444442 |
+----------------------+----------------------+
1 row selected (0.905 seconds)
3)To_date
This function returns just the year-month-day (date) part of a timestamp.
0: jdbc:hive2://localhost:10000/hive> select to_date(from_unixtime(unix_timestamp())) as day from employee limit 1;
+-------------+
| day |
+-------------+
| 2018-07-09 |
+-------------+
1 row selected (0.485 seconds)
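Besides the conversions above, Hive also provides date-arithmetic functions such as datediff(), date_add(), and date_sub(); a minimal sketch with arbitrary literal dates (output omitted):
select datediff('2018-07-09', '2018-07-01') as days_between,     -- 8
       date_add('2018-07-09', 7)            as one_week_later,   -- 2018-07-16
       date_sub('2018-07-09', 7)            as one_week_earlier  -- 2018-07-02
from employee limit 1;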
4 CASE WHEN
CASE WHEN evaluates a condition and returns the value of the matching branch. In the example below, 1 is not null, so the ELSE branch (0) is returned.
0: jdbc:hive2://localhost:10000/hive> select
. . . . . . . . . . . . . . . . . . > case when 1 is null then 'True' else 0 end as case_result
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+--------------+
| case_result |
+--------------+
| 0 |
+--------------+
1 row selected (0.37 seconds)
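In practice CASE WHEN is usually applied to real columns; a sketch that buckets the employees by age, using an arbitrary cut-off of 35 (output omitted):
select name,
       case when sex_age.age >= 35 then 'senior'
            else 'junior'
       end as age_group
from employee;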
5 Parsing and searching
1) Data preparation
0: jdbc:hive2://localhost:10000/hive> insert into employee
. . . . . . . . . . . . . . . . . . > select 'Steven' as name,array(null) as work_place,
. . . . . . . . . . . . . . . . . . > named_struct('sex','Male','age',30) as sex_age,
. . . . . . . . . . . . . . . . . . > map('Python',90) as skills_score,
. . . . . . . . . . . . . . . . . . > map('R&D',array('Developer')) as depart_title
. . . . . . . . . . . . . . . . . . > from employee limit 1;
No rows affected (194.183 seconds)
0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name | employee.work_place | employee.sex_age | employee.skills_score | employee.depart_title |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Steven | NULL | {"sex":"Male","age":30} | {"Python":90} | {"R&D":["Developer"]} |
| Michael | ["Montreal","Toronto"] | {"sex":"Male","age":30} | {"DB":80} | {"Product":["Developer","Lead"]} |
| Will | ["Montreal"] | {"sex":"Male","age":35} | {"Perl":85} | {"Product":["Lead"],"Test":["Lead"]} |
| Shelley | ["New York"] | {"sex":"Female","age":27} | {"Python":80} | {"Test":["Lead"],"COE":["Architect"]} |
| Lucy | ["Vancouver"] | {"sex":"Female","age":57} | {"Sales":89,"HR":94} | {"Sales":["Lead"]} |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
5 rows selected (0.786 seconds)
2)Lateral View
LATERAL VIEW with explode() skips rows for which explode() returns NULL (note that Steven, whose work_place is NULL, is missing from the result below).
0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score
. . . . . . . . . . . . . . . . . . > from employee
. . . . . . . . . . . . . . . . . . > lateral view explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
| name | workplace | skills | score |
+----------+------------+---------+--------+
| Michael | Montreal | DB | 80 |
| Michael | Toronto | DB | 80 |
| Will | Montreal | Perl | 85 |
| Shelley | New York | Python | 80 |
| Lucy | Vancouver | Sales | 89 |
| Lucy | Vancouver | HR | 94 |
+----------+------------+---------+--------+
6 rows selected (0.459 seconds)
3)Outer Lateral View
LATERAL VIEW OUTER keeps rows even when explode() returns NULL (Steven now appears, with a NULL workplace).
0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score
. . . . . . . . . . . . . . . . . . > from employee
. . . . . . . . . . . . . . . . . . > lateral view outer explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
| name | workplace | skills | score |
+----------+------------+---------+--------+
| Steven | NULL | Python | 90 |
| Michael | Montreal | DB | 80 |
| Michael | Toronto | DB | 80 |
| Will | Montreal | Perl | 85 |
| Shelley | New York | Python | 80 |
| Lucy | Vancouver | Sales | 89 |
| Lucy | Vancouver | HR | 94 |
+----------+------------+---------+--------+
7 rows selected (0.366 seconds)
4)Split
split() splits a string into an array, using the second argument (a regular expression) as the delimiter.
0: jdbc:hive2://localhost:10000/hive> select split('/user/hive/warehouse/hive.db/employee','/') from employee limit 1;
+----------------------------------------------------+
| _c0 |
+----------------------------------------------------+
| ["","user","hive","warehouse","hive.db","employee"] |
+----------------------------------------------------+
1 row selected (0.303 seconds)
5)Reverse
reverse() reverses a string. In the example below, reversing the path, splitting it on '/', taking element 0, and reversing again extracts the last component of the path (the table name, employee).
0: jdbc:hive2://localhost:10000/hive> select reverse(split(reverse('/user/hive/warehouse/hive.db/employee'),'/')[0]) from employee limit 1;
+-----------+
| _c0 |
+-----------+
| employee |
+-----------+
1 row selected (0.392 seconds)
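If the nested reverse/split feels hard to read, the same "last path component" can also be obtained with regexp_extract(); a sketch shown as an alternative (it should likewise return employee):
-- group index 0 returns the whole match: the trailing run of non-'/' characters
select regexp_extract('/user/hive/warehouse/hive.db/employee', '[^/]+$', 0)
from employee limit 1;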
6)Collect_set
collect_set() aggregates values from multiple rows into a single array, removing duplicate values.
0: jdbc:hive2://localhost:10000/hive> select collect_set(work_place[0]) from employee;
+--------------------------------------+
| _c0 |
+--------------------------------------+
| ["Montreal","New York","Vancouver"] |
+--------------------------------------+
1 row selected (140.198 seconds)
7)Collect_list
collect_list() is similar to collect_set(), but it does not remove duplicates.
0: jdbc:hive2://localhost:10000/hive> select collect_list(work_place[0]) from employee;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+-------------------------------------------------+
| _c0 |
+-------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"] |
+-------------------------------------------------+
1 row selected (133.142 seconds)
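collect_set() and collect_list() are aggregate functions, so they are typically combined with GROUP BY; a minimal sketch that collects first work places per gender, assuming the Hive version at hand allows grouping by a struct field (output omitted):
select sex_age.sex                as sex,
       collect_set(work_place[0]) as first_work_places
from employee
group by sex_age.sex;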
For more details, refer to the Hive functions reference (Show Functions).