To further manipulate data, we can use expressions, operators, and functions in Hive to transform it. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) documents a rich set of expressions and functions. Rather than repeating all of them here, I will pick out a few that I personally find most useful in day-to-day work and share them with you.
Hive defines relational, arithmetic, and logical operators, as well as operators for complex types. The relational, logical, and arithmetic operators behave like the standard operators in SQL/Java, so I will not repeat them here. For operators on complex data types, see the partitioned-table example in Hive_4. DML -- 数据转换.
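As a quick reminder of the complex-type operators, here is a minimal sketch against the employee table used later in this post (array indexing, map key lookup, and struct field access; the column types are assumed from the examples that follow):

SELECT work_place[0]      AS first_city,  -- array element by index
       skills_score['DB'] AS db_score,    -- map value by key (NULL if the key is absent)
       sex_age.age        AS age          -- struct field by name
FROM employee;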
Hive functions fall into the following categories:
- Mathematical functions, such as RAND() and E().
- Collection functions, such as SIZE(Array), which return the size of a collection.
- Type conversion functions, such as CAST and BINARY, which convert one type to another.
- Date functions, such as YEAR(string date) and MONTH(string date).
- Conditional functions, such as COALESCE, IF, and CASE WHEN.
- String functions, such as UPPER(string A) and TRIM(string A).
- Aggregate functions, such as SUM() and COUNT(*).
- Table-generating functions, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, …), which turn a single input row into multiple output rows.
You can run the following commands in the Hive CLI to inspect Hive's built-in functions and UDFs:
SHOW FUNCTIONS; --List all functions
DESCRIBE FUNCTION <function_name>; --Show details of a specific function
DESCRIBE FUNCTION EXTENDED <function_name>; --Show even more details
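For instance, to look up the built-in upper function (any function name listed by SHOW FUNCTIONS works here):

DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;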
Next, let's look at how to use these functions, along with a few usage tips:
The SIZE function returns the size of a MAP, an ARRAY, or a nested MAP/ARRAY type. It returns -1 when the size is unknown. For example:

jdbc:hive2://> SELECT work_place, skills_score, depart_title
. . . . . . .> FROM employee;
+----------------------+--------------------+-------------------------------------+
|      work_place      |    skills_score    |            depart_title             |
+----------------------+--------------------+-------------------------------------+
|["Montreal","Toronto"]|{"DB":80}           |{"Product":["Developer","Lead"]}     |
|["Montreal"]          |{"Perl":85}         |{"Product":["Lead"],"Test":["Lead"]} |
|["New York"]          |{"Python":80}       |{"Test":["Lead"],"COE":["Architect"]}|
|["Vancouver"]         |{"Sales":89,"HR":94}|{"Sales":["Lead"]}                   |
+----------------------+--------------------+-------------------------------------+
4 rows selected (0.084 seconds)
jdbc:hive2://> SELECT SIZE(work_place) AS array_size,
. . . . . . .> SIZE(skills_score) AS map_size,
. . . . . . .> SIZE(depart_title) AS complex_size,
. . . . . . .> SIZE(depart_title["Product"]) AS nest_size
. . . . . . .> FROM employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.062 seconds)
The ARRAY_CONTAINS function checks whether an array contains a given value and returns TRUE/FALSE. The SORT_ARRAY function sorts the contents of an array in ascending order. For example:

jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto') AS is_Toronto,
. . . . . . .> SORT_ARRAY(work_place) AS sorted_array
. . . . . . .> FROM employee;
+-------------+-------------------------+
| is_toronto  |      sorted_array       |
+-------------+-------------------------+
| true        | ["Montreal","Toronto"]  |
| false       | ["Montreal"]            |
| false       | ["New York"]            |
| false       | ["Vancouver"]           |
+-------------+-------------------------+
4 rows selected (0.059 seconds)
FROM_UNIXTIME(UNIX_TIMESTAMP()) works like SYSDATE in Oracle: it returns the current date and time on the Hive server. For example:
jdbc:hive2://> SELECT
. . . . . . .> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time
. . . . . . .> FROM employee LIMIT 1;
+----------------------+
|     current_time     |
+----------------------+
| 2014-11-15 19:28:29  |
+----------------------+
1 row selected (0.047 seconds)
The UNIX_TIMESTAMP() function can be used to compare two dates, or to sort rows by a date stored as a string by placing it after ORDER BY, e.g. ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy'). For example:

--Compare two different dates
jdbc:hive2://> SELECT (UNIX_TIMESTAMP('2015-01-21 18:00:00')
. . . . . . .> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24
. . . . . . .> AS daydiff FROM employee LIMIT 1;
+---------------------+
|       daydiff       |
+---------------------+
| 11.291666666666666  |
+---------------------+
1 row selected (0.093 seconds)
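And a minimal sketch of the ORDER BY usage mentioned above, assuming a hypothetical table some_table with a string_date column that stores dates as 'dd-MM-yyyy' strings (table and column names are illustrative only):

--Sort by a date kept as a 'dd-MM-yyyy' string
SELECT string_date
FROM some_table
ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy');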
The TO_DATE function removes the hours, minutes, and seconds from a date, which is handy when you need to check whether a date-time column falls within a date range, e.g. WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-30'. For example:
jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
. . . . . . .> AS current_date FROM employee LIMIT 1;
+---------------+
| current_date  |
+---------------+
| 2014-11-15    |
+---------------+
1 row selected (0.153 seconds)
In a CASE WHEN expression, the data types returned by the THEN and ELSE branches need to be consistent, otherwise an exception is thrown. For example, if the expression after ELSE returns int where bigint is expected, the query fails. This can be worked around with IF (see the sketch after the example below). For example:
jdbc:hive2://> SELECT
. . . . . . .> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END
. . . . . . .> AS case_result FROM employee LIMIT 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.063 seconds)
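And the IF workaround mentioned above, as a minimal sketch; IF(condition, value_if_true, value_if_false) lets you keep both branches in the same type explicitly (the literal values here are only illustrative):

SELECT IF(1 IS NULL, 'TRUE', 'FALSE') AS if_result
FROM employee LIMIT 1;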
The explode function can be used together with LATERAL VIEW to work with ARRAY and MAP data. If the exploded column contains a NULL, the entire row is filtered out, as happens to Steven's row in the example below.
Used together with SPLIT, explode, and other UDTFs, LATERAL VIEW can split a single column into multiple rows, on top of which the split values can then be aggregated (see the sketch after the LATERAL VIEW examples below). The examples in this post already cover the LATERAL VIEW statement in detail.

--Prepare the data
jdbc:hive2://> INSERT INTO TABLE employee
. . . . . . .> SELECT 'Steven' AS name, array(null) as work_place,
. . . . . . .> named_struct("sex","Male","age",30) as sex_age,
. . . . . . .> map("Python",90) as skills_score,
. . . . . . .> map("R&D",array('Developer')) as depart_title
. . . . . . .> FROM employee LIMIT 1;
No rows affected (28.187 seconds)
jdbc:hive2://> SELECT name, work_place, skills_score
. . . . . . .> FROM employee;
+----------+-------------------------+-----------------------+
| name | work_place | skills_score |
+----------+-------------------------+-----------------------+
| Michael | ["Montreal","Toronto"] | {"DB":80} |
| Will | ["Montreal"] | {"Perl":85} |
| Shelley | ["New York"] | {"Python":80} |
| Lucy | ["Vancouver"] | {"Sales":89,"HR":94} |
| Steven | NULL | {"Python":90} |
+----------+-------------------------+-----------------------+
5 rows selected (0.053 seconds)
--LATERAL VIEW filters out rows whose exploded value is NULL
jdbc:hive2://> SELECT name, workplace, skills, score
. . . . . . .> FROM employee
. . . . . . .> LATERAL VIEW explode(work_place) wp -- splits the array column into multiple rows (rows with a NULL value are dropped entirely)
. . . . . . .> AS workplace
. . . . . . .> LATERAL VIEW explode(skills_score) ss
. . . . . . .> AS skills, score;
+----------+------------+---------+--------+
| name | workplace | skills | score |
+----------+------------+---------+--------+
| Michael | Montreal | DB | 80 |
| Michael | Toronto | DB | 80 |
| Will | Montreal | Perl | 85 |
| Shelley | New York | Python | 80 |
| Lucy | Vancouver | Sales | 89 |
| Lucy | Vancouver | HR | 94 |
+----------+------------+---------+--------+
6 rows selected (24.733 seconds)
--LATERAL VIEW OUTER keeps rows whose exploded value is NULL
jdbc:hive2://> SELECT name, workplace, skills, score
. . . . . . .> FROM employee
. . . . . . .> LATERAL VIEW OUTER explode(work_place) wp -- splits the array column into multiple rows, keeping NULL rows
. . . . . . .> AS workplace
. . . . . . .> LATERAL VIEW explode(skills_score) ss
. . . . . . .> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
| Steven   | None       | Python  | 90     |
+----------+------------+---------+--------+
7 rows selected (24.573 seconds)
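Here is the SPLIT + explode pattern mentioned earlier, sketched against a hypothetical table sentences with a single STRING column named line that holds comma-separated words (the table and column names are illustrative only):

--Split a delimited string column into rows, then aggregate the exploded values
SELECT word, COUNT(*) AS cnt
FROM sentences
LATERAL VIEW explode(SPLIT(line, ',')) t AS word
GROUP BY word;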
String reversal: the REVERSE function reverses the order of the characters in a string, and the SPLIT function tokenizes a string with a given separator. The following example extracts the file name from a Linux path:
jdbc:hive2://> SELECT
. . . . . . .> reverse(split(reverse('/home/user/employee.txt'),'/')[0])
. . . . . . .> AS linux_file_name FROM employee LIMIT 1;
+------------------+
| linux_file_name  |
+------------------+
| employee.txt     |
+------------------+
1 row selected (0.1 seconds)
Whereas explode outputs each element of an ARRAY/MAP as a separate row, collect_set and collect_list do the opposite: they gather values from multiple rows back into a single collection. collect_set removes duplicates from the result set, while collect_list does not. For example:
jdbc:hive2://> SELECT collect_set(work_place[0])
. . . . . . .> AS flat_workplace0 FROM employee;
+--------------------------------------+
|           flat_workplace0            |
+--------------------------------------+
| ["Vancouver","Montreal","New York"]  |
+--------------------------------------+
1 row selected (43.455 seconds)
jdbc:hive2://> SELECT collect_list(work_place[0])
. . . . . . .> AS flat_workplace0 FROM employee;
+--------------------------------------------------+
|                 flat_workplace0                  |
+--------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"]   |
+--------------------------------------------------+
1 row selected (45.488 seconds)
Hive provides two virtual columns: INPUT__FILE__NAME (the name of the input file for a mapper task) and BLOCK__OFFSET__INSIDE__FILE (the current global file offset; for block-compressed files it is the file offset of the current block, i.e. the offset of the block's first byte within the file). Both are very helpful for troubleshooting queries that return unexpected or NULL results. The following example shows how to use the virtual columns by displaying the physical location of the data in HDFS, which is especially useful for bucketed and partitioned tables:
--Simple examples (taken from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns)
select INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE from src;
select key, count(INPUT__FILE__NAME) from src group by key order by key;
select * from src where BLOCK__OFFSET__INSIDE__FILE > 12000 order by key;
jdbc:hive2://> SELECT INPUT__FILE__NAME,
. . . . . . .> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE
. . . . . . .> FROM employee_id_buckets;
+---------------------------------------------------------+----------+
|                    input__file__name                    | offside  |
+---------------------------------------------------------+----------+
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 55       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 120      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 175      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 240      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 295      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 360      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 415      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 480      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 535      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 592      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 657      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 712      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 769      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 834      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 57       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 122      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 177      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 234      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 291      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 348      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 405      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 462      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 517      |
+---------------------------------------------------------+----------+
25 rows selected (0.073 seconds)
jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;
+-----------------------------------------------------------------------------+
|                              input__file__name                              |
+-----------------------------------------------------------------------------+
| hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0        |
| hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0       |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt   |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt   |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt   |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt   |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0       |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0       |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0       |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0       |
+-----------------------------------------------------------------------------+
10 rows selected (0.47 seconds)

There are also a few handy functions that are not mentioned in the Hive wiki:
--Functions for checking NULL values
jdbc:hive2://> SELECT work_place, isnull(work_place) is_null,
. . . . . . .> isnotnull(work_place) is_not_null FROM employee;
+-------------------------+----------+--------------+
| work_place | is_null | is_not_null |
+-------------------------+----------+--------------+
| ["Montreal","Toronto"] | false | true |
| ["Montreal"] | false | true |
| ["New York"] | false | true |
| ["Vancouver"] | false | true |
| NULL | true | false |
+-------------------------+----------+--------------+
5 rows selected (0.058 seconds)
--assert_true: throws an exception if the condition is false
jdbc:hive2://> SELECT assert_true(work_place IS NULL)
. . . . . . .> FROM employee;
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. (state=,code=0)
--elt(n, str1, str2, ...): returns the string at the n-th position
jdbc:hive2://> SELECT elt(2,'New York','Montreal','Toronto')
. . . . . . .> FROM employee LIMIT 1;
+-----------+
|    _c0    |
+-----------+
| Montreal  |
+-----------+
1 row selected (0.055 seconds)
--Returns the name of the current Hive database (available since Hive 0.13.0)
jdbc:hive2://> SELECT current_database();
+----------+
|   _c0    |
+----------+
| default  |
+----------+
1 row selected (0.057 seconds)

Transactions
Before Hive 0.13.0, Hive did not support row-level transactions, so individual rows could not be updated, inserted, or deleted; data could only be overwritten at the table or partition level. This made concurrent reads and writes, as well as data cleaning, quite cumbersome in Hive.
Starting with Hive 0.13.0, Hive fully supports row-level transactions with ACID (Atomicity, Consistency, Isolation, and Durability) semantics. Transactions are supported automatically for bucketed tables stored in the ORC (Optimized Row Columnar) file format. For more on Hive ACID, see the translated post: Hive ACID 特性 - 0.13.0.
To enable transactions in Hive, the following configuration parameters need to be set:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
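As a minimal sketch of a table that can take part in such transactions (the table and column names are illustrative only), the table has to be bucketed, stored as ORC, and marked transactional:

CREATE TABLE employee_acid (
  emp_id INT,
  name   STRING
)
CLUSTERED BY (emp_id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');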
The SHOW TRANSACTIONS command, also added in Hive 0.13.0, displays the currently open and aborted transactions in the system:
jdbc:hive2://> SHOW TRANSACTIONS;
+-----------------+--------------------+-------+-----------+
|      txnid      |       state        | user  |   host    |
+-----------------+--------------------+-------+-----------+
| Transaction ID  | Transaction State  | User  | Hostname  |
+-----------------+--------------------+-------+-----------+
1 row selected (15.209 seconds)
Starting with Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE statements support row-level operations with the following syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...];
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
DELETE FROM tablename [WHERE expression];
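For instance, against the illustrative employee_acid table sketched above (again, hypothetical names), the three statements look like this:

INSERT INTO TABLE employee_acid VALUES (100, 'Michael'), (101, 'Will');
UPDATE employee_acid SET name = 'William' WHERE emp_id = 101;
DELETE FROM employee_acid WHERE emp_id = 100;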