hive

Purpose of Hive: it automatically converts HQL into MapReduce jobs, so being able to write HQL (essentially SQL) is all you need.

mapreduce

Map is essentially decomposition. Picture a red car and a crew of workers who take it apart into individual parts: that is Map.

Reduce is assembly. Take a pile of car parts plus parts from all kinds of other machines, put them together into a Transformer: that is Reduce.

How do you count the words in a 1 TB or 1 PB file?

 

The input is a set of documents, and every line contains many words. Handing each worker its own chunk of the input is the Split step. How is each chunk processed? Every line is broken into (word, count) pairs, with each occurrence recorded as 1. The Shuffle step then routes identical words into the same bucket: all the Bear records go together, all the Car records go together, as determined by the shuffle's partitioning algorithm (much of this is automatic nowadays, though random sampling is sometimes used to balance the partitions). Once that is done, the final Reduce step aggregates the records for each word: Car appears 3 times, so it writes 3; River appears 2 times, so it writes 2; the partial results are then combined so the final result can be served. Notice that the arrows pointing to the final result arrive out of order: execution is highly parallel and no step waits for every task to finish, which is what makes this such an effective optimization.
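In Hive, this entire pipeline is expressed as a single query that the engine compiles into the Split / Shuffle / Reduce steps above. A minimal word-count sketch in HiveQL, assuming a hypothetical table docs with one line of text per row:

-- hypothetical input table: one line of text per row
create table if not exists docs(line string);

-- split each line into words, explode into one word per row, then count each word
select word, count(*) as cnt
from docs
lateral view explode(split(line, '\\s+')) t as word
group by word;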


 

Create a database

hive(default)> create database hive_db;                           creates the hive_db.db directory

Create a table in hive_db

hive(default)> create table hive_db.test(id int);                in Hive both databases and tables correspond to directories (on HDFS)

Change the storage location

hive(default)> create database hive_db2 location '/hive_db2.db';               / is the HDFS root directory; the quoted path sets the directory name to hive_db2.db

 

List databases

hive(default)> show databases;

hive(default)> create database if not exists hive_db;                   avoids an error if the database already exists

Pattern matching

hive(default)> show databases like 'hive*';

 

Describe a table

hive(default)> desc xxx;

Describe a database

hive(default)> desc database hive_db;          basic information only (not modifiable)

hive(default)> desc database extended hive_db;           more detailed; also shows extra properties

 

Modify a database

hive(default)> alter database hive_db set dbproperties('CTtime'='2020-6-22');     adds properties as key-value pairs; view them with desc database extended

Drop a database

hive(default)> drop database hive_db;       only drops an empty database

hive(default)> drop database hive_db cascade;    force drop (works even if the database is not empty)

 

Create a table

# [ ] marks an optional clause; EXTERNAL creates an external table, without it a managed (internal) table is created by default

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]
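A minimal sketch of this syntax (the student table and its columns are made up for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS student(
    id   int    COMMENT 'student id',
    name string COMMENT 'student name'
)
COMMENT 'basic student information';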

 

hive(default)> create table student1 as select * from student;   copies both the schema and the data

hive(default)> create table student2 like student;            copies only the schema

 

Show table data and details

hive(default)> select * from test;

hive(default)> desc student;

hive(default)> desc extended student;

hive(default)> desc formatted student;            detailed, formatted information

 

Managed (internal) table: dropping it deletes both the metadata and the data itself.

External table: Hive does not consider itself to fully own the data. Dropping an external table deletes only the metadata describing the table, not the data.
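A minimal sketch of the difference (the table names and the HDFS path /data/ext_student are made up for illustration):

-- managed table: drop table removes both the metadata and the data files
create table student_managed(id int, name string);

-- external table: drop table removes only the metadata; the files under
-- /data/ext_student stay on HDFS
create external table student_ext(id int, name string)
location '/data/ext_student';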


 

 

Deduplicating table data

1. distinct (see the sketch after this list for methods 1 and 2)

2. group by                 (group by sorts first, then groups)

3. row_number() over (partition by ...)   outputs the n-th row of each partition; if a partition has no row with that sequence number, nothing is output for it
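A minimal sketch of methods 1 and 2, assuming a hypothetical table t1 with columns name, class and s; method 3 is illustrated with the t2 example below:

-- method 1: distinct over all selected columns
select distinct name, class, s from t1;

-- method 2: grouping by every column also removes duplicate rows
select name, class, s from t1 group by name, class, s;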

    

The class-score table t2 below shows how row_number() is used.

Contents of t2:
name    class   s
cfe     2       74
dss     1       95
ffd     1       95
fda     1       80
gds     2       92
gf      3       99
ddd     3       99
adf     3       45
asdf    3       55
3dd     3       78

    Get the highest score in each class, keeping only one record per class:
    select * from
    ( select class, s, row_number() over(partition by class order by s desc) mm from t2
    ) t
    where mm = 1;
1       95        1  -- class 1 has two students with 95, but only one row is kept
2       92        1
3       99        1  -- class 3 has two students with 99, but only one row is kept
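If the tied rows should also be kept (both 95s in class 1, both 99s in class 3), rank() or dense_rank() can replace row_number(); a minimal sketch:

    select * from
    ( select class, s, rank() over(partition by class order by s desc) mm from t2
    ) t
    where mm = 1;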

 

 

 

and / or:

If a WHERE clause mixes and with or, the or effectively splits the conditions on either side of it: the and conditions are evaluated first and the or afterwards, because and binds more tightly than or.

Logical operator precedence, from high to low: not, and, or.

The way to avoid surprises is to use parentheses () to make the intended evaluation order explicit.
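A minimal sketch of the difference, using a hypothetical orders table with status and amount columns:

-- without parentheses, and binds tighter, so this means
--   status = 'paid'  or  (status = 'shipped' and amount > 100)
select * from orders where status = 'paid' or status = 'shipped' and amount > 100;

-- with parentheses the intended grouping is explicit:
--   (status = 'paid' or status = 'shipped')  and  amount > 100
select * from orders where (status = 'paid' or status = 'shipped') and amount > 100;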

 

#!/usr/bin/env bash
# Join push records with their click logs for the last two days, compute the number of
# pushes, the number of clicks and the click/push ratio per experiment group (group_id),
# and append the result to bigdata.txt.
day_two=`date -d "-2day" "+%Y%m%d"`  # date two days ago, formatted as yyyymmdd
day_one=`date -d "-1day" "+%Y%m%d"`  # date one day ago (no spaces around = in bash assignments)
echo "$day_two" >> bigdata.txt
hive -e "
select count(a.mid), count(b.click_time), count(b.click_time)/count(a.mid), c.group_id
from
(select * from relate_push_cuiwei5_traindata_real_push
where dt='${day_two}' and (substr(luicode,1,8) = '10000323') and ((substr(sid,1,5) = 'hdfs2') or (substr(sid,1,6) = 'hdfs-2'))) a
left outer join
(select * from relate_push_cuiwei5_traindata_push_click_log
where (dt='${day_one}' or dt='${day_two}') and (substr(previous_uicode,1,8) = '10000323') and (substr(lfid,1,6) = '0000b0')) b
on (a.uid = b.uid and a.mid = b.mid)
left outer join
(select *,
-- derive an experiment bucket (group_id) from the 2nd-to-last and 6th-to-last digits of uid
case when substr(uid,-2,1)='7' and substr(uid,-6,1)= '1' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '2' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '3' then 'a1'
when substr(uid,-2,1)='7' and substr(uid,-6,1)= '4' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '5' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '6' then 'a2'
when substr(uid,-2,1)='7' and substr(uid,-6,1)= '7' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '8' or substr(uid,-2,1)='7' and substr(uid,-6,1)= '9' or substr(uid,-2,1)='7' and substr(uid,-6,1)='0' then 'a3'
when substr(uid,-2,1)='8' and substr(uid,-6,1)= '1' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '2' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '3' then 'b1'
when substr(uid,-2,1)='8' and substr(uid,-6,1)= '4' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '5' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '6' then 'b2'
when substr(uid,-2,1)='8' and substr(uid,-6,1)= '7' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '8' or substr(uid,-2,1)='8' and substr(uid,-6,1)= '9' or substr(uid,-2,1)='8' and substr(uid,-6,1)='0' then 'b3'
when substr(uid,-2,1)='9' and substr(uid,-6,1)= '1' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '2' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '3' then 'c1'
when substr(uid,-2,1)='9' and substr(uid,-6,1)= '4' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '5' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '6' then 'c2'
when substr(uid,-2,1)='9' and substr(uid,-6,1)= '7' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '8' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '9' or substr(uid,-2,1)='9' and substr(uid,-6,1)= '0' then 'c3'
when substr(uid,-2,1)='0' and substr(uid,-6,1)= '1' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '2' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '3' then 'd1'
when substr(uid,-2,1)='0' and substr(uid,-6,1)= '4' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '5' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '6' then 'd2'
when substr(uid,-2,1)='0' and substr(uid,-6,1)= '7' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '8' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '9' or substr(uid,-2,1)='0' and substr(uid,-6,1)= '0' then 'd3'
when substr(uid,-2,1)='1' or substr(uid,-2,1)='2' or substr(uid,-2,1)='3' then 'b'
when substr(uid,-2,1)='4' or substr(uid,-2,1)='5' or substr(uid,-2,1)='6' then 'c'
else 'str' end as group_id
from relate_push_cuiwei5_traindata_real_push) c
on (a.uid = c.uid and a.mid = c.mid and a.dt = c.dt and a.luicode = c.luicode and a.sid = c.sid)
group by c.group_id;
" >>bigdata.txt

 

 

crontab -e

0 23 * * * source ~/.bash_profile && cd /data_new/yichen9 && sh hive_bigdata.sh >>hive_bigdata.log 2>&1 &
0 23 * * *                  run at 23:00 every day (fields: minute hour day-of-month month day-of-week)
source ~/.bash_profile      load the environment
cd /data_new/yichen9        change to the working directory
sh hive_bigdata.sh          run the script
>> hive_bigdata.log 2>&1    append stdout and stderr to the log file
