1. Create the table

CREATE EXTERNAL TABLE employees(
  id STRING,
  name STRING,
  salary FLOAT,
  age INT,
  birthday DATE,
  subordinates ARRAY<STRING>,
  score MAP<STRING,FLOAT>,
  address STRUCT<street:STRING,city:STRING,province:STRING>
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
2. Insert data

2.1 vi employees.txt (fields in the actual file are tab-separated, per the table definition)

g201425003 wangwu1 5500 20 1987-07-12 zhaoliu1,wangwu1 Chinese:90,English:88 国营1号,西直门,北京
g201425004 wangwu2 6400 20 1987-07-12 zhaoliu2,wangba2 Japana:85,English:60 国营2号,西直门2,北京
g201425005 wangwu3 8400 20 1987-07-12 zhaoliu3,wangba3 Japana:80,English:70 高碑店,保定市,河北
g201425006 wangwu4 8400 20 1987-07-12 zhaoliu4,wangba4 Japana:80,English:70 高碑店,保定市,河北
2.2 Upload to HDFS

hdfs dfs -put employees.txt

3. Load the data into employees

load data inpath 'employees.txt' OVERWRITE into table employees partition(country='China');

4. Query the data

4.1 Basic query

select * from employees;

4.2 Accessing elements of complex-type fields

select name
      ,subordinates[0] subordinate
      ,score['English'] English
      ,address.street
      ,address.city
from employees;
4.3 Viewing complex-type fields as a whole

select name
      ,subordinates
      ,score
      ,address
from employees;
Result:

name      subordinates          score                            address
zhangsan  ["lisi","wangwu"]     {"Chinese":90.0,"English":88.0}  {"street":"国营1号","city":"西直门","province":"北京"}
wangwu    ["zhaoliu","wangba"]  {"Japana":90.0,"English":88.0}   {"street":"国营2号","city":"西直门2","province":"北京"}
wangwu    ["zhaoliu","wangba"]  {"Japana":90.0,"English":88.0}   {"street":"高碑店","city":"保定市","province":"河北"}
From these results we can conclude:
- ARRAY fields are displayed as lists; their elements are stored in order and accessed by index.
- MAP and STRUCT fields are displayed as JSON: MAP values are accessed with ['key'], STRUCT fields with the dot operator.
Hive also ships with many built-in functions, for example:

select upper(name) from employees;
select map_values(score) from employees;
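With the four sample rows loaded above, the output should look roughly like this (the order of map values is not guaranteed):

upper(name):       WANGWU1, WANGWU2, WANGWU3, WANGWU4
map_values(score): [90.0,88.0], [85.0,60.0], [80.0,70.0], [80.0,70.0]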
The two most commonly used aggregate functions are count and avg:
select count(id),round(avg(salary),2) avg_salary from employees;
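With the four sample rows above (salaries 5500, 6400, 8400, 8400), this should return roughly:

_c0  avg_salary
4    7175.0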
The explode function turns one row of an ARRAY or MAP complex-type column into multiple rows; see the sketch below.
explode(ARRAY array)
explode(MAP map)
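A minimal sketch against the employees table above (the column aliases are illustrative):

-- one output row per array element
select explode(subordinates) as subordinate from employees;
-- a map explodes into two columns: key and value
select explode(score) as (subject, points) from employees;

Note that explode, like other UDTFs, cannot be mixed with ordinary columns in the same select list unless you use lateral view.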
- Padding a string
Syntax: lpad(STRING s, INT len, STRING pad)
s: the input string
len: the length of the output string
pad: the padding string; if s is shorter than len, lpad (rpad) pads it from the left (right) until it reaches len
select name,lpad(name,8,'*') lname, rpad(name,8,'*') rname from employees;
Result:

name      lname     rname
zhangsan  zhangsan  zhangsan
wangwu    **wangwu  wangwu**
wangwu    **wangwu  wangwu**
- Concatenating multiple strings
Syntax: concat(STRING s1, STRING s2, STRING s3, ...)
Example: select name,concat(name,'_china','_sx') from employees;
Output:

name      _c1
zhangsan  zhangsan_china_sx
wangwu    wangwu_china_sx
wangwu    wangwu_china_sx
- Concatenating strings with a separator (similar to concat)
Syntax: concat_ws(STRING separator, STRING s1, STRING s2, ...)
Example: select name,concat_ws('|',name,'nan','2000') from employees;
Result:

name      _c1
zhangsan  zhangsan|nan|2000
wangwu    wangwu|nan|2000
wangwu    wangwu|nan|2000
- Reversing a string
Syntax: reverse(STRING s)
Example: select name,reverse(name) from employees;
Result:

name      _c1
zhangsan  nasgnahz
wangwu    uwgnaw
wangwu    uwgnaw
- Getting the number of elements in a collection
Syntax: size(ARRAY array), size(MAP map)
Example:
select subordinates,size(subordinates) array_count,score,size(score) map_count from employees;
Result:

subordinates          array_count  score                            map_count
["lisi","wangwu"]     2            {"Chinese":90.0,"English":88.0}  2
["zhaoliu","wangba"]  2            {"Japana":85.0,"English":60.0}   2
["zhaoliu","wangba"]  2            {"Japana":80.0,"English":70.0}   2
- Splitting a string
Syntax: split(STRING s, STRING pattern)
Splits the string by the regular expression pattern and returns an ARRAY of the resulting substrings.
Example: select split("I|am|a|student",'\\|') from employees limit 1;
Result:

["I","am","a","student"]
- Parsing a URL
Syntax: parse_url(STRING url, STRING partname[, STRING key])
Example:
select parse_url('http://item.jd.com/1856588.html','HOST') host, parse_url('http://item.jd.com/1856588.html','PROTOCOL') protocol from employees limit 1;
Result:

host         protocol
item.jd.com  http
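The optional third argument extracts a single query-string parameter. A sketch with a hypothetical URL that carries a query string:

-- with partname 'QUERY', the key selects one parameter; this should return 5
select parse_url('http://item.jd.com/1856588.html?id=5','QUERY','id') id from employees limit 1;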
- Converting a string to a map
Syntax: str_to_map(STRING s, STRING delimiter1, STRING delimiter2)
First argument: the string to convert
Second argument: the separator between key-value pairs
Third argument: the separator between each key and its value
Example:
select score,str_to_map('Chinese:90.0,English:88.0',',',':') strMap,str_to_map('Chinese:90.0,English:88.0',',',':')['Chinese'] Chinese from employees;
Result:

score                            strMap                               Chinese
{"Chinese":90.0,"English":88.0}  {"Chinese":"90.0","English":"88.0"}  90.0
- Taking a substring
Syntax: substr(STRING s, INT start, INT length)
Example: select substr('20160106112134432',0,8) day from employees limit 1;
Result: 20160106
- Getting the current time
Syntax: unix_timestamp() returns the current time as a Unix timestamp
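A quick check (the returned value depends on when you run it):

select unix_timestamp() from employees limit 1;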
- Formatting a timestamp as a string
Syntax: from_unixtime(BIGINT unixtime, STRING format)
Formats the timestamp according to format and returns the result as a STRING.
Example:
select unix_timestamp() currentUnixTime,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') formatCurrentTime from employees limit 1;
Result:
1452051408  2016-01-05 19:36:48
- Converting a formatted string to a timestamp
Syntax: unix_timestamp(STRING date, STRING pattern)
Parses a date string in the given pattern and returns the Unix timestamp.
Example:
select unix_timestamp('2016-01-05 19:36:48','yyyy-MM-dd HH:mm:ss') from employees limit 1;
Result:
1452051408
- Extracting the date, year, month, day, hour, minute, and second from a timestamp string
Example:
select d,to_date(d) dt,year(d) y,month(d) mon,day(d) dd,hour(d) h,minute(d) mi,second(d) s from t1;
Result:

d                    dt          y     mon  dd  h   mi  s
2016-01-05 19:36:48  2016-01-05  2016  1    5   19  36  48
- Computing the number of days between a start date and an end date
Syntax: datediff(STRING enddate, STRING startdate)
In testing, this only works on strings in yyyy-MM-dd format.
Example: select datediff('2016-01-05','2016-01-02') from t1;
Result: 3
A typical query returns many rows of data. The LIMIT clause restricts the number of rows returned:

select * from employees limit 2;
Column aliases are particularly useful in nested queries. Here we reuse the earlier example as a nested query:

from (
  select name,salary,address.street as street,address.city as city,address.province as province
  from employees
) e
select e.name,e.salary,e.city
where e.salary > 1000;
name      salary  city
zhangsan  1200.5  西直门
The CASE ... WHEN ... THEN expression works like an if statement and is used to transform column values in query results:

select name,salary,
  case
    when salary < 6000 then 'low'
    when salary >= 6400 and salary < 8000 then 'middle'
    when salary >= 8000 and salary < 9000 then 'high'
    else 'other'
  end as bracket
from employees;
Output:

name     salary  bracket
wangwu1  5500.0  low
wangwu2  6400.0  middle
wangwu3  8400.0  high
wangwu4  8400.0  high
If you run typical HQL, you will notice that most queries trigger a MapReduce job. For certain queries, however, Hive can avoid MapReduce entirely; this is known as local mode. For example:

select * from t1;

In this case Hive simply reads the files in t1's storage directory and writes the formatted contents to the console.

Queries whose where clause filters only on partition columns likewise need no MapReduce job (design your partitions well so that such queries can skip MapReduce). For example:

select * from employees where country='China';
In addition, enabling Hive's local mode can improve query efficiency, but it only applies under certain conditions. A job actually runs in local mode only when:

1. the job's input size is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB);
2. the job's number of map tasks is smaller than hive.exec.mode.local.auto.tasks.max (default 4);
3. the job's number of reduce tasks is 0 or 1.
Experiment 1: execution without local mode

hive (default)> select name from employees;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1452047254290_0045, Tracking URL = http://mycluster:8088/proxy/application_1452047254290_0045/
Kill Command = /home/hadoop/app/hadoop-2.6.0/bin/hadoop job -kill job_1452047254290_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-05 22:41:36,714 Stage-1 map = 0%, reduce = 0%
2016-01-05 22:41:46,020 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
MapReduce Total cumulative CPU time: 1 seconds 210 msec
Ended Job = job_1452047254290_0045
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.21 sec HDFS Read: 649 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 210 msec
OK
name
wangwu1
wangwu2
wangwu3
wangwu4
Time taken: 27.942 seconds, Fetched: 4 row(s)
Experiment 2: execution with local mode enabled

hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto.inputbytes.max=50000000;
hive> set hive.exec.mode.local.auto.tasks.max=10;

hive (default)> select name from employees;
Automatically selecting local only mode for query
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2016-01-05 22:43:11,738 null map = 100%, reduce = 0%
Ended Job = job_local2070598612_0001
Execution completed successfully
MapredLocal task succeeded
OK
name
wangwu1
wangwu2
wangwu3
wangwu4
Time taken: 10.77 seconds, Fetched: 4 row(s)
Note: if you use Hive's local mode regularly, it is best to add set hive.exec.mode.local.auto=true; to your $HOME/.hiverc file ($HOME is the home directory of the user running Hive).
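For example, a minimal $HOME/.hiverc using the settings from Experiment 2 above:

-- applied automatically at the start of every hive CLI session
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=50000000;
set hive.exec.mode.local.auto.tasks.max=10;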