Hive_4. DML -- Functions & Operators & Transactions

Today I'd like to summarize the commonly used functions in Hive, along with transaction operations.

To further manage data, we can use expressions, operators, and functions in Hive to transform it. The Hive wiki provides a rich set of expressions and functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF). Rather than repeating all of them here, I'll pick out the ones I personally find most useful in day-to-day work and share them with you.

Hive defines relational operators, arithmetic operators, logical operators, and constructors for complex types. The relational, logical, and arithmetic operators behave much like the standard operators in SQL/Java, so I won't repeat them here. For operators on complex data types, see the partitioned-table examples in Hive_4. DML -- 数据转换 (Data Transformation).
Hive functions fall into the following categories (a few quick examples follow the list):

  • Mathematical functions: used mainly to perform mathematical calculations, such as RAND() and E().
  • Collection functions: used to find the size, keys, and values of complex data types, such as SIZE(Array).
  • Type conversion functions: most commonly CAST and BINARY, used to convert one data type to another.
  • Date functions: used for date-related calculations, such as YEAR(string date) and MONTH(string date).
  • Conditional functions: used to check a condition and return a value accordingly, such as COALESCE, IF, and CASE WHEN.
  • String functions: used to perform string operations, such as UPPER(string A) and TRIM(string A).
  • Aggregate functions: used to perform aggregations, such as SUM() and COUNT(*).
  • Table-generating functions: used to turn a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, ...).
  • Customized functions: user-defined functions written in Java to extend Hive; these will be covered in a dedicated Hive UDF post.
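
To give a quick feel for several of these categories, here is a small sketch; it reuses the employee table from this series, and the literal values are purely illustrative:

SELECT ROUND(RAND() * 100) AS random_num,      -- mathematical
       CAST('100' AS INT) + 1 AS casted,       -- type conversion
       YEAR('2014-11-15') AS yr,               -- date
       COALESCE(NULL, 'fallback') AS coalesced,-- conditional
       UPPER(TRIM(' hive ')) AS upper_trimmed  -- string
FROM employee LIMIT 1;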

You can run the following commands in the Hive CLI to list Hive's built-in functions and UDFs:

SHOW FUNCTIONS; --list all functions
DESCRIBE FUNCTION <function_name>; --details of a specific function
DESCRIBE FUNCTION EXTENDED <function_name>; --even more details
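
For instance, to look up the built-in SIZE function used later in this post:

DESCRIBE FUNCTION size;
DESCRIBE FUNCTION EXTENDED size;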

Next, let's look at how to use these functions, along with some practical tips:

Tips for complex data type functions:
The SIZE() function computes the size of a MAP, an ARRAY, or a nested MAP/ARRAY. It returns -1 when the size is unknown. For example:
jdbc:hive2://> SELECT work_place, skills_score, depart_title 
. . . . . . .> FROM employee;
+----------------------+--------------------+-------------------------------------+
| work_place | skills_score | depart_title |
+----------------------+--------------------+-------------------------------------+
|["Montreal","Toronto"]|{"DB":80} |{"Product":["Developer","Lead"]} |
|["Montreal"] |{"Perl":85} |{"Product":["Lead"],"Test":["Lead"]} |
|["New York"] |{"Python":80} |{"Test":["Lead"],"COE":["Architect"]}|
|["Vancouver"] |{"Sales":89,"HR":94}|{"Sales":["Lead"]} |
+----------------------+--------------------+-------------------------------------+
4 rows selected (0.084 seconds)

jdbc:hive2://> SELECT SIZE(work_place) AS array_size, 
. . . . . . .> SIZE(skills_score) AS map_size, 
. . . . . . .> SIZE(depart_title) AS complex_size, 
. . . . . . .> SIZE(depart_title["Product"]) AS nest_size 
. . . . . . .> FROM employee;
+-------------+-----------+---------------+------------+
| array_size | map_size | complex_size | nest_size |
+-------------+-----------+---------------+------------+
| 2 | 1 | 1 | 2 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 2 | -1 |
| 1 | 2 | 1 | -1 |
+-------------+-----------+---------------+------------+
4 rows selected (0.062 seconds)

ARRAY_CONTAINS checks whether an array contains a given value, returning TRUE or FALSE. SORT_ARRAY sorts the elements of an array in ascending order. For example:

jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto') AS is_Toronto,
. . . . . . .> SORT_ARRAY(work_place) AS sorted_array 
. . . . . . .> FROM employee;
+-------------+-------------------------+
| is_toronto | sorted_array |
+-------------+-------------------------+
| true | ["Montreal","Toronto"] |
| false | ["Montreal"] |
| false | ["New York"] |
| false | ["Vancouver"] |
+-------------+-------------------------+
4 rows selected (0.059 seconds)
Tips for date functions:
FROM_UNIXTIME(UNIX_TIMESTAMP()) works like SYSDATE in Oracle: it returns the current date and time of the Hive server. For example:
jdbc:hive2://> SELECT 
. . . . . . .> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time 
. . . . . . .> FROM employee LIMIT 1;
+----------------------+
| current_time |
+----------------------+
| 2014-11-15 19:28:29 |
+----------------------+
1 row selected (0.047 seconds)

UNIX_TIMESTAMP() can be used to compare two dates, or to sort string-typed dates in an ORDER BY clause, e.g., ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy'). For example:

--compare two different dates
jdbc:hive2://> SELECT (UNIX_TIMESTAMP('2015-01-21 18:00:00') 
. . . . . . .> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24 
. . . . . . .> AS daydiff FROM employee LIMIT 1;
+---------------------+
| daydiff |
+---------------------+
| 11.291666666666666 |
+---------------------+
1 row selected (0.093 seconds)
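
The ORDER BY usage mentioned above might look like the following sketch; some_table and its string_date column (holding dates such as '21-01-2015') are hypothetical:

-- hypothetical: sort string-typed 'dd-MM-yyyy' dates chronologically
SELECT name, string_date
FROM some_table
ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy');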

TO_DATE strips the hours, minutes, and seconds from a date, which is useful for checking whether a date-time column falls within a date range, e.g., WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-30'. For example:

jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) 
. . . . . . .> AS current_date FROM employee LIMIT 1;
+---------------+
| current_date |
+---------------+
| 2014-11-15 |
+---------------+
1 row selected (0.153 seconds)
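
The range-check pattern mentioned above would then look something like this; some_table and its update_datetime column are hypothetical:

-- hypothetical: keep only rows updated in November 2014, ignoring the time of day
SELECT * FROM some_table
WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-30';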
Handling different data types -- CASE
Before Hive 0.13.0, the THEN and ELSE branches of a CASE expression had to return the same data type, otherwise an exception was thrown; for example, if a bigint was expected but an int was returned, the query would fail. This could be worked around with IF (a workaround sketch follows the example below).
Hive 0.13.0 lifted this restriction, as the following shows:
jdbc:hive2://> SELECT 
. . . . . . .> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END 
. . . . . . .> AS case_result FROM employee LIMIT 1;
+--------------+
| case_result |
+--------------+
| 0 |
+--------------+
1 row selected (0.063 seconds)
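
On versions before 0.13.0, besides the IF workaround mentioned above, an equally simple fix is to cast both branches to a common type, for example:

-- pre-0.13.0 workaround sketch: make THEN and ELSE return the same type
SELECT CASE WHEN 1 IS NULL THEN 'TRUE' ELSE CAST(0 AS STRING) END 
AS case_result FROM employee LIMIT 1;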
Parsing & search tips -- (row-to-column conversion?) -- what is certain is that LATERAL VIEW can split one row into multiple rows
LATERAL VIEW works together with a user-defined table-generating function, such as EXPLODE(), to flatten a map- or array-typed column. The explode function can be combined with LATERAL VIEW to process ARRAY and MAP data. If the exploded column contains a null value, the entire row is filtered out; in the example below, that is why Steven's row disappears.
Since Hive 0.12.0, LATERAL VIEW OUTER can be used to keep the rows whose exploded value is null.
In practice, LATERAL VIEW is most often combined with SPLIT and the explode UDTF to split a single column into multiple rows, after which the split-out data can be aggregated. The examples in this post already cover the main uses of LATERAL VIEW; a minimal word-count sketch follows below, and you can also refer to this post on using LATERAL VIEW explode for WordCount in Hive:
http://www.codesec.net/view/149295.html
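
Here is that word-count sketch; the docs table with a single line STRING column is a hypothetical stand-in:

-- split each line into words, explode to one word per row, then aggregate
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word
ORDER BY cnt DESC;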
--prepare the data
jdbc:hive2://> INSERT INTO TABLE employee
. . . . . . .> SELECT 'Steven' AS name, array(null) as work_place,
. . . . . . .> named_struct("sex","Male","age",30) as sex_age, 
. . . . . . .> map("Python",90) as skills_score, 
. . . . . . .> map("R&D",array('Developer')) as depart_title
. . . . . . .> FROM employee LIMIT 1;
No rows affected (28.187 seconds)

jdbc:hive2://> SELECT name, work_place, skills_score 
. . . . . . .> FROM employee;
+----------+-------------------------+-----------------------+
| name | work_place | skills_score |
+----------+-------------------------+-----------------------+
| Michael | ["Montreal","Toronto"] | {"DB":80} |
| Will | ["Montreal"] | {"Perl":85} |
| Shelley | ["New York"] | {"Python":80} |
| Lucy | ["Vancouver"] | {"Sales":89,"HR":94} |
| Steven | NULL | {"Python":90} |
+----------+-------------------------+-----------------------+
5 rows selected (0.053 seconds)

--LATERAL VIEW drops rows where the exploded value is null 
jdbc:hive2://> SELECT name, workplace, skills, score
. . . . . . .> FROM employee
. . . . . . .> LATERAL VIEW explode(work_place) wp    -- splits the array column into multiple rows (rows containing null are dropped entirely)
. . . . . . .> AS workplace
. . . . . . .> LATERAL VIEW explode(skills_score) ss 
. . . . . . .> AS skills, score;
+----------+------------+---------+--------+
| name | workplace | skills | score |
+----------+------------+---------+--------+
| Michael | Montreal | DB | 80 |
| Michael | Toronto | DB | 80 |
| Will | Montreal | Perl | 85 |
| Shelley | New York | Python | 80 |
| Lucy | Vancouver | Sales | 89 |
| Lucy | Vancouver | HR | 94 |
+----------+------------+---------+--------+
6 rows selected (24.733 seconds)

--LATERAL VIEW OUTER keeps rows where the exploded value is null
jdbc:hive2://> SELECT name, workplace, skills, score
. . . . . . .> FROM employee
. . . . . . .> LATERAL VIEW OUTER explode(work_place) wp   -- splits the array column into multiple rows, keeping null rows
. . . . . . .> AS workplace
. . . . . . .> LATERAL VIEW explode(skills_score) ss 
. . . . . . .> AS skills, score;
+----------+------------+---------+--------+
| name | workplace | skills | score |
+----------+------------+---------+--------+
| Michael | Montreal | DB | 80 |
| Michael | Toronto | DB | 80 |
| Will | Montreal | Perl | 85 |
| Shelley | New York | Python | 80 |
| Lucy | Vancouver | Sales | 89 |
| Lucy | Vancouver | HR | 94 |
| Steven | None | Python | 90 |
+----------+------------+---------+--------+
7 rows selected (24.573 seconds)

String reversal function: REVERSE
REVERSE reverses the order of the characters in a string. SPLIT tokenizes a string around a given separator. The following example extracts the file name from a Linux path:

jdbc:hive2://> SELECT
. . . . . . .> reverse(split(reverse('/home/user/employee.txt'),'/')[0])
. . . . . . .> AS linux_file_name FROM employee LIMIT 1;
+------------------+
| linux_file_name |
+------------------+
| employee.txt |
+------------------+
1 row selected (0.1 seconds)

In contrast, explode outputs each element of an array/map as a separate row, while collect_set and collect_list do the opposite: they collect elements from multiple rows into a single result set. collect_set deduplicates the result set, whereas collect_list does not. For example:

jdbc:hive2://> SELECT collect_set(work_place[0]) 
. . . . . . .> AS flat_workplace0 FROM employee;
+--------------------------------------+
| flat_workplace0 |
+--------------------------------------+
| ["Vancouver","Montreal","New York"] |
+--------------------------------------+
1 row selected (43.455 seconds)

jdbc:hive2://> SELECT collect_list(work_place[0]) 
. . . . . . .> AS flat_workplace0 FROM employee;
+-------------------------------------------------+
| flat_workplace0 |
+-------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"] |
+-------------------------------------------------+
1 row selected (45.488 seconds)
Virtual columns:
In Hive, virtual columns are a special type of function. Hive currently provides two of them: INPUT__FILE__NAME (the name of the input file for a mapper task) and BLOCK__OFFSET__INSIDE__FILE (the current global file offset; for block-compressed files, it is the file offset of the current block, i.e., the offset of the block's first byte). They are very helpful for diagnosing queries that return unexpected or NULL results. The examples below use virtual columns to display the physical location of data in HDFS, which is especially informative for bucketed and partitioned tables:
--simple examples (taken from: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns)
select INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE from src;
select key, count(INPUT__FILE__NAME) from src group by key order by key;
select * from src where BLOCK__OFFSET__INSIDE__FILE > 12000 order by key;
--virtual columns on a bucketed table
jdbc:hive2://> SELECT INPUT__FILE__NAME, 
. . . . . . .> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE 
. . . . . . .> FROM employee_id_buckets;
+---------------------------------------------------------+----------+
| input__file__name | offside |
+---------------------------------------------------------+----------+
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 0 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 55 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 120 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 175 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 240 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 295 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 360 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 415 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 480 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 535 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 592 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 657 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 712 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 769 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 834 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 0 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 57 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 122 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 177 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 234 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 291 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 348 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 405 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 462 |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 517 |
+---------------------------------------------------------+----------+
25 rows selected (0.073 seconds)

--virtual columns on a partitioned table (partitioned by date)
jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;
+-------------------------------------------------------------------------+
| input__file__name |
+-------------------------------------------------------------------------+
|hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0     
|hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0    
|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt
|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt
|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt
|hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt
|hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    
|hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    
|hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    
|hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0    
+-------------------------------------------------------------------------+
10 rows selected (0.47 seconds)

Functions not covered in the Hive wiki:

--null-check functions  
jdbc:hive2://> SELECT work_place, isnull(work_place) is_null, 
. . . . . . .> isnotnull(work_place) is_not_null FROM employee;
+-------------------------+----------+--------------+
| work_place | is_null | is_not_null |
+-------------------------+----------+--------------+
| ["Montreal","Toronto"] | false | true |
| ["Montreal"] | false | true |
| ["New York"] | false | true |
| ["Vancouver"] | false | true |
| NULL | true | false |
+-------------------------+----------+--------------+
5 rows selected (0.058 seconds)

--assert_true: throws an exception if the condition is false 
jdbc:hive2://> SELECT assert_true(work_place IS NULL) 
. . . . . . .> FROM employee;
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. (state=,code=0)

--elt(n, str1, str2, ...): returns the string at the n-th position 
jdbc:hive2://> SELECT elt(2,'New York','Montreal','Toronto')
. . . . . . .> FROM employee LIMIT 1;
+-----------+
| _c0 |
+-----------+
| Montreal |
+-----------+
1 row selected (0.055 seconds)

--returns the name of the current Hive database (since Hive 0.13.0)
jdbc:hive2://> SELECT current_database();
+----------+
| _c0 |
+----------+
| default |
+----------+
1 row selected (0.057 seconds)

Transactions

Before Hive 0.13.0, Hive did not support row-level transactions, which meant individual rows could not be updated, inserted, or deleted; data could only be overwritten at the table or partition level. This made concurrent reads/writes and data cleaning in Hive quite cumbersome.

Since Hive 0.13.0, Hive fully supports row-level transactions with ACID (Atomicity, Consistency, Isolation, and Durability) semantics. Transactions are supported automatically for bucketed tables in the ORC (Optimized Row Columnar) file format. For more on Hive ACID, see this translated post: Hive ACID 特性 - 0.13.0.

To enable transactions in Hive, the following parameters need to be set:

SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
The SHOW TRANSACTIONS command, added in Hive 0.13.0, lists the transactions currently open or aborted in the system:
jdbc:hive2://> SHOW TRANSACTIONS;
+-----------------+--------------------+-------+-----------+
| txnid | state | user | host |
+-----------------+--------------------+-------+-----------+
| Transaction ID | Transaction State | User | Hostname |
+-----------------+--------------------+-------+-----------+
1 row selected (15.209 seconds)

Since Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE statements support row-level operations with the following syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] 
VALUES values_row [, values_row …];
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE FROM tablename [WHERE expression]
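
As a hedged sketch of how these statements fit together (the employee_txn table and its columns are illustrative; the table must be bucketed, stored as ORC, and flagged as transactional):

-- illustrative transactional table: bucketed, ORC, transactions enabled
CREATE TABLE employee_txn (id INT, name STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE employee_txn VALUES (1, 'Michael'), (2, 'Will');
UPDATE employee_txn SET name = 'William' WHERE id = 2;
DELETE FROM employee_txn WHERE id = 1;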

