MapReduce
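At its core, Map is decomposition: take a red toy car and have a group of workers strip it down into parts. That is Map.
Reduce is composition: take the many car parts, along with parts from all sorts of other devices, and assemble them into a Transformer. That is Reduce.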
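Take word count as an example. The input is a set of documents, and every line contains a number of different words. In the Split step the lines are handed out to different workers. In the Map step each worker breaks its line into words and emits every word together with a count; since each occurrence counts once, the count is always 1. Shuffle then puts identical words into the same box: all the Bear entries together, all the Car entries together. Which box a word lands in is decided by the algorithm used when Shuffle writes its output. Today the framework handles most of this automatically, although it sometimes relies on random sampling to do so. The last step, Reduce, adds up the entries in each box (3 for Car, 2 for River) and then merges the per-word totals into the final result, which makes the result easy to serve. Note that the arrows pointing into the final result arrive out of order: execution is highly parallel and workers do not wait for one another to finish, which is what makes the process efficient.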
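The word-count flow described above can be sketched in a few lines of Python. This is a single-process illustration only, not how Hadoop implements it; the function names and the three input lines are assumptions chosen so that the totals match the ones quoted in the text (Car 3, River 2).

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word on one line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: put the counts for identical words into the same box."""
    boxes = defaultdict(list)
    for word, count in pairs:
        boxes[word].append(count)
    return boxes

def reduce_phase(boxes):
    """Reduce: add up the counts in each box."""
    return {word: sum(counts) for word, counts in boxes.items()}

def word_count(lines):
    pairs = []
    for line in lines:  # Split: in a real cluster each line could go to a different worker
        pairs.extend(map_phase(line))
    return reduce_phase(shuffle_phase(pairs))

# Hypothetical input lines (not taken from the notes).
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
print(word_count(lines))  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```

In this sketch the three phases run one after another; in a real job the map calls for different lines are independent, which is where the parallelism comes from.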
Create a database
hive(default)> create database hive_db;   -- creates the directory hive_db.db
Create a table in hive_db
hive(default)> create table hive_db.test(id int);   -- in Hive, both databases and tables correspond to directories
Change the storage path
hive(default)> create database hive_db2 location '/hive_db2.db';   -- / is the root directory; the quoted path sets the directory name, hive_db2.db
List databases
hive(default)> show databases;
hive(default)> create database if not exists hive_db;   -- avoids an error when hive_db already exists
Pattern matching
hive(default)> show databases like 'hive*';
Show table details
hive(default)> desc xxx;
Show database details
hive(default)> desc database hive_db;   -- basic information, which cannot be modified
hive(default)> desc database extended hive_db;   -- more detail, including any extra properties
Modify
hive(default)> alter database hive_db set dbproperties("CTtime"="2020-6-22");   -- adds a property as a key-value pair; view it with extended
Delete
hive(default)> drop database hive_db;   -- drops an empty database
hive(default)> drop database hive_db cascade;   -- forces the drop, even if the database is not empty
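Create a table
# [ ] marks an optional clause. EXTERNAL creates an external table; without it the table is internal by default.
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]   -- col_comment is the comment on the column
[COMMENT table_comment]
hive(default)> create table student1 as select * from student;   -- copies the table structure and the data
hive(default)> create table student2 like student;   -- copies the table structure only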
Show the data in a table
hive(default)> select * from test;
hive(default)> desc student;
hive(default)> desc extended student;
hive(default)> desc formatted student;   -- detailed information in a formatted layout
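Internal table: dropping it deletes both the metadata and the underlying data.
External table: Hive does not consider itself the full owner of the data, so dropping an external table removes only the metadata that describes the table and leaves the data in place.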
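Deduplicating table data
1. distinct
2. group by (group by sorts first, then groups)
3. row_number() over(partition by ....) (outputs the row at a chosen position within each partition; if a partition has no row at that position, nothing is output for it)
The class score table t2 below illustrates its use.
Table t2 contains: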
name class s
cfe 2 74
dss 1 95
ffd 1 95
fda 1 80
gds 2 92
gf 3 99
ddd 3 99
adf 3 45
asdf 3 55
3dd 3 78
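Get the highest score in each class, keeping only one record per class
select * from
( select class, s, row_number() over(partition by class order by s desc) mm from t2
) t
where mm=1;
1 95 1 -- two students scored 95, but only one row is shown
2 92 1
3 99 1 -- two students scored 99, but again only one row is shown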
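To try the row_number() query without a Hive cluster, the sketch below loads the t2 rows into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here: it needs version 3.25 or later for window functions, and the final `ORDER BY class` and the rank() comparison are additions for illustration, not part of the original query.

```python
import sqlite3

# The t2 rows listed above: (name, class, s)
rows = [
    ("cfe", 2, 74), ("dss", 1, 95), ("ffd", 1, 95), ("fda", 1, 80),
    ("gds", 2, 92), ("gf", 3, 99), ("ddd", 3, 99), ("adf", 3, 45),
    ("asdf", 3, 55), ("3dd", 3, 78),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (name TEXT, class INTEGER, s INTEGER)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", rows)

# row_number() numbers the rows of each class from 1, so mm = 1 keeps
# exactly one row per class even when the top score is tied.
top_per_class = conn.execute("""
    SELECT class, s, mm FROM (
        SELECT class, s,
               row_number() OVER (PARTITION BY class ORDER BY s DESC) AS mm
        FROM t2
    ) t
    WHERE mm = 1
    ORDER BY class
""").fetchall()
print(top_per_class)  # [(1, 95, 1), (2, 92, 1), (3, 99, 1)]

# For contrast, rank() gives tied rows the same number, so ties survive.
tied = conn.execute("""
    SELECT name, class, s FROM (
        SELECT name, class, s,
               rank() OVER (PARTITION BY class ORDER BY s DESC) AS rk
        FROM t2
    ) t
    WHERE rk = 1
""").fetchall()
print(len(tied))  # 5
```

Which of the tied students ends up in the mm = 1 row is not specified by the query; add a second sort key to `order by` if that matters.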
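When a where clause mixes and with or, the or splits the conditions on its left and right into separate groups, because and is evaluated first and or afterwards: and has higher precedence than or.
Logical operators, from highest to lowest precedence: not, and, or
To get a different grouping:
Use () to change the evaluation order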
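The precedence rule can be checked with three small expressions. The sketch below evaluates them in SQLite through Python's sqlite3 module, used here only as a convenient stand-in for Hive; the `evaluate` helper is a hypothetical name introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expr):
    """Evaluate a boolean SQL expression; SQLite returns 1 for true, 0 for false."""
    return conn.execute("SELECT " + expr).fetchone()[0]

# Without parentheses AND is applied first: 1 OR (0 AND 0)
implicit = evaluate("1 OR 0 AND 0")
# Parentheses force OR to be applied first: (1 OR 0) AND 0
grouped = evaluate("(1 OR 0) AND 0")
# NOT binds tighter than AND: (NOT 0) AND 0
negated = evaluate("NOT 0 AND 0")

print(implicit, grouped, negated)  # 1 0 0
```

The first two expressions differ only in the parentheses, yet give opposite results, which is the situation the note above warns about.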
#!/usr/bin/env bash
# No spaces around "=" in the assignments
day_two=$(date -d "-2day" "+%Y%m%d")
day_one=$(date -d "-1day" "+%Y%m%d")
echo "$day_two" >> bigdata.txt
hive -e "
select count(a.mid),
       count(b.click_time),
       count(b.click_time) / count(a.mid),
       c.group_id
from (
    select * from relate_push_cuiwei5_traindata_real_push
    where dt = '${day_two}'
      and (substr(luicode,1,8) = '10000323')
      and ((substr(sid,1,5) = 'hdfs2') or (substr(sid,1,6) = 'hdfs-2'))
) a
left outer join (
    select * from relate_push_cuiwei5_traindata_push_click_log
    where (dt = '${day_one}' or dt = '${day_two}')
      and (substr(previous_uicode,1,8) = '10000323')
      and (substr(lfid,1,6) = '0000b0')
) b on (a.uid = b.uid and a.mid = b.mid)
left outer join (
    select *,
        case
            when substr(uid,-2,1) = '7' and substr(uid,-6,1) in ('1','2','3') then 'a1'
            when substr(uid,-2,1) = '7' and substr(uid,-6,1) in ('4','5','6') then 'a2'
            when substr(uid,-2,1) = '7' and substr(uid,-6,1) in ('7','8','9','0') then 'a3'
            when substr(uid,-2,1) = '8' and substr(uid,-6,1) in ('1','2','3') then 'b1'
            when substr(uid,-2,1) = '8' and substr(uid,-6,1) in ('4','5','6') then 'b2'
            when substr(uid,-2,1) = '8' and substr(uid,-6,1) in ('7','8','9','0') then 'b3'
            when substr(uid,-2,1) = '9' and substr(uid,-6,1) in ('1','2','3') then 'c1'
            when substr(uid,-2,1) = '9' and substr(uid,-6,1) in ('4','5','6') then 'c2'
            when substr(uid,-2,1) = '9' and substr(uid,-6,1) in ('7','8','9','0') then 'c3'
            when substr(uid,-2,1) = '0' and substr(uid,-6,1) in ('1','2','3') then 'd1'
            when substr(uid,-2,1) = '0' and substr(uid,-6,1) in ('4','5','6') then 'd2'
            when substr(uid,-2,1) = '0' and substr(uid,-6,1) in ('7','8','9','0') then 'd3'
            when substr(uid,-2,1) in ('1','2','3') then 'b'
            when substr(uid,-2,1) in ('4','5','6') then 'c'
            else 'str'
        end as group_id
    from relate_push_cuiwei5_traindata_real_push
) c on (a.uid = c.uid and a.mid = c.mid and a.dt = c.dt and a.luicode = c.luicode and a.sid = c.sid)
group by c.group_id;
" >> bigdata.txt
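The two moving parts of the script above, the date partitions and the CASE bucketing, are easier to check in isolation. The Python sketch below mirrors them under stated assumptions: `partition_dates` and `group_id` are hypothetical helper names, and `group_id` assumes every uid has at least 6 characters, since the notes do not say how shorter values should be handled.

```python
from datetime import date, timedelta

def partition_dates(today):
    """Return (day_two, day_one) as YYYYMMDD strings, like the two date calls."""
    fmt = "%Y%m%d"
    return ((today - timedelta(days=2)).strftime(fmt),
            (today - timedelta(days=1)).strftime(fmt))

# Second-to-last character of uid -> group letter
PREFIX = {"7": "a", "8": "b", "9": "c", "0": "d"}
# Sixth-from-last character of uid -> group number
SUFFIX = {}
for digits, number in (("123", "1"), ("456", "2"), ("7890", "3")):
    for digit in digits:
        SUFFIX[digit] = number

def group_id(uid):
    """Mirror the CASE expression: bucket a uid by two of its characters."""
    if len(uid) < 6:
        raise ValueError("this sketch assumes uid has at least 6 characters")
    second_last = uid[-2]  # substr(uid,-2,1) in the query
    sixth_last = uid[-6]   # substr(uid,-6,1) in the query
    if second_last in PREFIX and sixth_last in SUFFIX:
        return PREFIX[second_last] + SUFFIX[sixth_last]
    if second_last in ("1", "2", "3"):
        return "b"
    if second_last in ("4", "5", "6"):
        return "c"
    return "str"

print(partition_dates(date(2020, 6, 22)))       # ('20200620', '20200621')
print(group_id("100070"), group_id("400080"))   # a1 b2
```

A lookup table like this can also be used to spot-check a few uid values against the query output before trusting the grouped counts.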
crontab -e
0 23 * * * source ~/.bash_profile && cd /data_new/yichen9 && sh hive_bigdata.sh >>hive_bigdata.log 2>&1 &
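Field order: minute, hour (then * for day of month, month, and day of week), working directory, command, log file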