Writing SQL Statements in Hive

We have three datasets:
1. users.dat, record format: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code

2. movies.dat, record format: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres

3. ratings.dat, record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamp String
Field meanings: user ID, movie ID, rating, rating timestamp

Requirements:

Data requirements:
(1) Write a shell script to clean the data, or use RegexSerDe to extract the three datasets. (Hive's delimited-text format does not support multi-character field delimiters: it can parse ':' but not '::', so building tables the ordinary way on the raw files will not work. The data needs a simple cleaning pass first.)

Cleaning the data:
sed -i "s/::/%/g" users.dat
sed -i "s/::/%/g" ratings.dat
sed -i "s/::/%/g" movies.dat

Hive requirements:
(1) Create the tables correctly and load the data (three tables, three datasets)
create table users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) row format delimited fields terminated by "%";
load data local inpath "/home/hadoop/moviedata/users.dat" into table users;

create table ratings(UserID BigInt, MovieID BigInt, rate Double, ts String) row format delimited fields terminated by "%";
load data local inpath "/home/hadoop/moviedata/ratings.dat" into table ratings;

create table movies(MovieID BigInt, Title String, type String) row format delimited fields terminated by "%";
load data local inpath "/home/hadoop/moviedata/movies.dat" into table movies;

select * from movies limit 5;
select * from ratings limit 5;
select * from users limit 5;

Extracting the data with an explicit RegexSerDe instead (the contrib RegexSerDe class, which supports output.format.string):
create table users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties("input.regex"="(.*)::(.*)::(.*)::(.*)::(.*)", "output.format.string"="%1$s %2$s %3$s %4$s %5$s");
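With this table, the raw users.dat loads as-is, no sed pass needed (a minimal sketch):
load data local inpath "/home/hadoop/moviedata/users.dat" into table users;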

(2) Find the 10 movies rated most often, along with their rating counts (title, rating count)
Approach: the rating records live in the ratings table, but the title we must output lives in the movies table, so join: ratings a join movies b.

HQL:
create table answer2 as select b.title title, count(a.rate) as totalCount from ratings a join movies b on a.movieid = b.movieid group by b.title order by totalCount desc limit 10;

(3) Find the 10 highest-rated movies among male and among female users (gender, title, rating)
Approach: find the 10 movies men rate highest, plus the 10 movies women rate highest.
First compute each movie's average rating within one gender; ratings supplies movieid, userid, rate.

ratings a join users b on a.userid = b.userid
ratings a join movies c on a.movieid = c.movieid
where b.gender = "F" / where b.gender = "M"

HQL:
The 10 highest-rated movies among women:
create table answer3_F as select b.gender gender, c.title moviename, avg(a.rate) as avgrate from
ratings a join users b on a.userid = b.userid join movies c on a.movieid = c.movieid
where b.gender = "F"
group by c.title, b.gender order by avgrate desc limit 10;

The 10 highest-rated movies among men:
create table answer3_M as select b.gender gender, c.title moviename, avg(a.rate) as avgrate from
ratings a join users b on a.userid = b.userid join movies c on a.movieid = c.movieid
where b.gender = "M"
group by c.title, b.gender order by avgrate desc limit 10;

(4) For movieid = 2116, compute the average rating of each age group (there are only 7 distinct ages, so group by those) (age group, rating)
Approach:
There are only 7 distinct ages: select distinct age from users; -- the number of rows returned is the number of age groups

HQL (group by age, filtered to movieid = 2116):
create table answer4 as select b.age age, avg(a.rate) as avgrate from ratings a join users b on a.userid = b.userid where a.movieid = 2116 group by b.age order by age;

Without the movieid = 2116 condition:
create table answer4_all as select a.movieid as movieid, b.age age, avg(a.rate) as avgrate from ratings a join users b on a.userid = b.userid group by a.movieid, b.age order by movieid, age;

(5) Find the woman who watches the most movies (has rated most often), then the overall average rating of the 10 movies she rated highest (user, title, rating)
Breaking the requirement down:
1. Find the woman with the most ratings
2. Find the 10 movies she rated highest
3. Compute the average rating of those 10 movies

HQL:
create table answer5 as select a.userid, count(a.rate) as totalCount from ratings a join users b on a.userid = b.userid where b.gender = "F" group by a.userid order by totalCount desc limit 1; -- result: userid 1150

select rate, movieid from ratings where userid = 1150 order by rate desc limit 10;

create table answer5_3 as select a.movieid as movieid, avg(a.rate) as avgrate from ratings a left semi join (select rate, movieid from ratings where userid = 1150 order by rate desc limit 10) b on a.movieid = b.movieid group by a.movieid;
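answer5_3 carries only movieid and avgrate, but the requirement asks for (user, title, rating); a follow-up sketch to surface the title (1150 is the userid found in step 1):
select 1150 as userid, m.title, t.avgrate
from answer5_3 t join movies m on t.movieid = m.movieid
order by t.avgrate desc;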

(6) Find the year with the most good movies (average rating >= 4.0), then the 10 best movies of that year
Approach:
1. Compute every movie's average rating
2. Find the year with the most good movies
3. Find the 10 best movies of that year

The year is embedded in the title, e.g. 2160::Rosemary's Baby (1968)::Horror|Thriller, and can be cut out with substring:
select substring("Rosemary's Baby (1968)", -5, 4); -- returns 1968

HQL:
Produce four fields: movieid, moviename, movieyear, avgrate
create table answer6 as select a.movieid as movieid, b.title as moviename, substring(b.title, -5, 4) as movieyear, avg(rate) as avgrate from ratings a join movies b on a.movieid = b.movieid group by a.movieid, b.title;

Find the year with the most good movies:
create table answer6_2 as select movieyear, count(*) as totalCount from answer6 where avgrate >= 4 group by movieyear order by totalCount desc limit 1;

Plug in the year answer6_2 returns (1998 here):
create table answer6_3 as select movieid, moviename, movieyear, avgrate from answer6 where movieyear = 1998 order by avgrate desc limit 10;

(7) Among movies released in 1997, find the 10 highest-rated Comedy movies
Approach:
Build a table containing year, rating, movie ID, title, and genre;
with it we can decide whether each movie is a comedy,
then pick 1997's 10 highest-rated comedies from that table.

1. Produce five fields: movieid, moviename, movieyear, avgrate, type

create table answer7 as select a.movieid movieid, a.moviename moviename, a.movieyear movieyear, a.avgrate avgrate, b.type type from answer6 a join movies b on a.movieid = b.movieid;

HQL:
create table answer7_1 as select movieid, moviename, avgrate from answer7 where movieyear = 1997 and lcase(type) like '%comedy%' order by avgrate desc limit 10;

UDF: the implementation is omitted here.

(8) For each genre, the 5 highest-rated movies in the database (genre, title, average rating) (a TopN problem)

A row like
1 Toy Story (1995) 1995 4.146846413095811 Animation|Children's|Comedy

must be expanded into
1 Toy Story (1995) 1995 4.146846413095811 Animation
1 Toy Story (1995) 1995 4.146846413095811 Children's
1 Toy Story (1995) 1995 4.146846413095811 Comedy

Testing explode:
select explode(array(1,2,3));
select "huangbo", explode(array(1,2,3)); -- invalid: a UDTF cannot be mixed with other select expressions
select "huangbo", ss.id from dual lateral view explode(array(1,2,3)) ss as id; -- assumes a one-row helper table dual exists

lateral view explode(split(genres, '\\|')) ss as movietype; -- split takes a regex, so the pipe must be escaped

Approach:

answer7's five fields: movieid, moviename, movieyear, avgrate, type
type looks like Animation|Children's|Comedy
split(type, '\\|') = [Animation, Children's, Comedy]

Step 1: transform answer7's type field into a new table, one row per genre.
answer8's fields: movieid, moviename, movieyear, avgrate, movietype

create table answer8 as
select movieid, moviename, movieyear, avgrate, ss.movietype from answer7
lateral view explode(split(type, '\\|')) ss as movietype;

Step 2: find the 5 best movies of each genre
select movietype from answer8 group by movietype order by avgrate limit 5; -- wrong
Note: combining group by with limit 5 does not take the top 5 within each group; it takes the top 5 overall.

Partitioning does not help here; bucketing is the right idea.
Bucketing principle: records with the same field value always land in the same bucket, and one bucket can hold several distinct field values.
distribute by department sort by age desc;

dpt age rn
CS  34  1
CS  32  2
CS  30  3
CS  22  4
CS  18  5
CS  16  6

MA  54  1
MA  33  2
MA  22  3
MA  11  4
MA  8   5

What we effectively need to add is a ranking column.
Prerequisite: distribute the rows into groups and sort within each group.

The window function row_number() does exactly this:
create table answer8_order as select a.*, row_number() over (distribute by a.movietype sort by a.avgrate desc) rn from answer8 a;

answer8_order's fields: movieid, moviename, movieyear, avgrate, movietype, rn

select movieid, rn, movietype, avgrate from answer8_order where rn <= 5 limit 100;

(9) The highest-rated movie genre of each year (year, genre, average rating)

Approach: group by year and genre, then take the top genre within each year.

Expected shape of the result:
1997 action
1998 drama
...

Data basis: answer8's fields: movieid, moviename, movieyear, avgrate, movietype

Sketch: select movieyear, movietype, avg(avgrate) avgrate from answer8 group by movieyear, movietype order by avgrate desc;

Compute each genre's average rating per year:
create table answer9 as select movieyear, movietype, avg(avgrate) as type_rate from answer8 group by movieyear, movietype;

create table answer9_1 as select a.*, row_number() over (distribute by movieyear sort by type_rate desc) rn from answer9 a;

select movieyear, movietype, type_rate, rn from answer9_1 where rn = 1;

Similarly, the highest-rated movie released in each year:
answer7's five fields: movieid, moviename, movieyear, avgrate, type

HQL:
create table best_year_movie as select a.*, row_number() over (distribute by movieyear sort by avgrate desc) rn from answer7 a;

select * from best_year_movie where rn = 1;

(10) The highest-rated movie of each region; store the result in HDFS (region, title, rating)

Approach: a movie's regions are inferred from which users (identified by zipcode) rated it.

userid zipcode movieid rate
1      101     2001    2
2      101     2001    3
3      102     2001    4

Implementation steps: group by zipcode and movie, then rank with row_number().

Movie details: movies table
Ratings: ratings table
Region data: the zipcode column of the users table

Aggregation: avg(ratings.rate)

HQL:
create table answer10 as select b.zipcode zipcode, c.title title, avg(a.rate) as avgrate from ratings a join users b on a.userid = b.userid join movies c on a.movieid = c.movieid group by b.zipcode, c.title;

create table best_movie_district as select a.*, row_number() over (distribute by a.zipcode sort by a.avgrate desc) rn from answer10 a;

select * from best_movie_district where rn = 1 order by zipcode;
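The requirement also says to store the result in HDFS; a minimal sketch (the target path is an assumption, and row format on directory inserts needs Hive 0.11 or later):
insert overwrite directory '/moviedata/answer10'
row format delimited fields terminated by '\t'
select zipcode, title, avgrate from best_movie_district where rn = 1;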

How to approach these problems:
Work backwards from the final requirement, constructing the data you need step by step, until everything can be answered from the three base tables.

Core skills:

1. Join queries
2. The UDTF explode
3. The window function row_number()

Right at the start, join the three tables together, and ideally extract the year field too.

The remaining field that needs parsing: movietype
A UDTF use case
Testing the UDTF
1. Create a table with a single string column, line, to hold each row of movies.dat:
create table movie_line(line string) row format delimited fields terminated by '\t';

2. Load the data
load data local inpath '/home/hadoop/movies.dat' into table movie_line;

3. Write a UDTF that splits out the movie genres, turning one row into several, then register its jar:
add jar /home/hadoop/movieudtf.jar;

4. Create a temporary function
create temporary function process_movie_type as 'com.ghgj.hive.demo.MovieTypeUDTF';

5. Run the final transformation
create table movie_type as select adTable.movieid, adTable.moviename, adTable.movietype from movie_line lateral view process_movie_type(line) adTable as movieid, moviename, movietype;
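For comparison, the same one-row-to-many expansion can be sketched with built-ins alone, no custom UDTF, using the movies table created earlier:
select movieid, title as moviename, ss.movietype
from movies
lateral view explode(split(type, '\\|')) ss as movietype;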

Hive Day 5 Notes
Hive internals and optimization

1. create table answer1000000 as select b.zipcode zipcode, c.title, avg(a.rate) as avgrate from ratings a join users b on a.userid = b.userid join movies c on a.movieid = c.movieid group by b.zipcode, c.title;

While an HQL statement like this executes, Hive prints:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>

set mapred.reduce.tasks=2;   -- old parameter name
set mapreduce.job.reduces=2; -- current parameter name

Uneven reducer load, with a target of 256M per reducer:
reduceTask1 500M -> well above the 256M target
reduceTask2 10M
reduceTask3 10M

set mapreduce.job.reduces=3;

Parameter precedence:
set mapreduce.job.reduces=3; -- an explicit positive value wins

set mapreduce.job.reduces=-1;
With -1, the reducer count is derived from hive.exec.reducers.max and hive.exec.reducers.bytes.per.reducer.

If sizing by hive.exec.reducers.bytes.per.reducer would need more reduce tasks than hive.exec.reducers.max allows, the count is capped at the max, so some reduce tasks end up processing more than 256M of data.

At the cap: 1000 reducers * 256M each ≈ 256G of data.

In practice, a cluster typically runs about 2 reduce tasks per node.

select * from student;
Input directory of the underlying MR job: /user/hive/warehouse/mydb.db/student/

select * from student_ptn where city = "beijing";
Under the hood, the MR job's input directory is just the matching partition: /user/hive/warehouse/mydb.db/student_ptn/city=beijing/
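For reference, a partitioned table like student_ptn could be declared as follows (a sketch; the non-partition columns and the data path are illustrative assumptions):
create table student_ptn(id int, name string, age int)
partitioned by (city string)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/student.txt' into table student_ptn partition(city='beijing');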

Bucketed inserts
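A minimal bucketed-table-and-insert sketch (table name and columns are illustrative; hive.enforce.bucketing applies to Hive 1.x, later versions always enforce bucketing):
create table student_bck(id int, name string, age int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
set hive.enforce.bucketing=true;
insert overwrite table student_bck select id, name, age from student;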

The essence of Hive is translating HQL statements into MR programs.

The constructs whose MR translation we examine below: JOIN, GROUP BY, DISTINCT.

MapReduce joins come in two flavors:
1. reduce join
2. map join

Join

SELECT pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);

page_view: userid, pageid
user: userid, age

An HQL statement is translated into an MR program, which has a map phase and a reduce phase.
The MR job's input: all the rows of both tables.
Map phase:
1. Each user record is transformed: the key is the join column userid, the value is the column this query needs from that table, age
2. Each page_view record is transformed: the key is the join column userid, the value is the column this query needs from that table, pageid

Key-value pairs emitted by the map phase:
userid  age     pageid  flag
101     33              user
101             u101    pv
101     44              user
101             u102    pv

Reduce phase:
1. Separate the records coming from the two different tables (by the flag)
2. Join the rows of the two tables using a nested double loop
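When one table is small, the reduce phase can be skipped entirely with a map join; a sketch using Hive's MAPJOIN hint, which broadcasts the small table into memory on every map task:
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
Newer Hive versions can do this conversion automatically via set hive.auto.convert.join=true;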

Group By

select word, count(*) from wordcount group by word;

Intent of this HQL: group by pageid and age, count the records in each group
SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

Map phase:
For each row read, emit the pair (pageid, age) as the key
and 1 as the value.

Reduce phase:
Rows with identical pageid and age land in the same group; sum up that group's values.

Distinct: the core logic is deduplication

Intent: group by age, count the distinct pageids within each group
SELECT age, count(distinct pageid) FROM pv_users GROUP BY age;

Intent: group by age, count the records within each group
SELECT age, count(pageid) FROM pv_users GROUP BY age;

Map phase (for the distinct case):

Transform each row with key = age, value = pageid    -- naive, skew-prone
Transform each row with key = (age, pageid), value = anything    -- the better plan

Reduce phase:

key: age+pageid, values: (1,1,1,1,1)
The final output is just the keys.

The HQL corresponding to this keying scheme:
select distinct age, pageid from pv_users;
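This keying scheme is also the standard rewrite for a skew-prone count(distinct): deduplicate with a group by first, then count. A sketch:
select age, count(1)
from (select age, pageid from pv_users group by age, pageid) t
group by age;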

where filtering happens in the map phase
having filtering happens in the reduce phase

select ... from ... where ... group by ... having ... order by ... limit ...

from: the MR job's input directory
where: row filtering in the map task
group by
join
distinct
select: the columns after select become the values attached to the map task's keys
order by
sort by: sorted before the reduce tasks run, during the shuffle
having: row filtering in the reduce phase, by condition
limit: row filtering in the reduce phase, by row count

Hive shell operations

hive --service cli is equivalent to plain hive

In MySQL, the source command is mainly used to run a script that initializes a database.

source file
where the file contains the initial DDL and the initial insert statements.

Setting parameters for the Hive client:

hive-site.xml is applied first
hive -i initconf.conf; overrides the settings loaded from hive-site.xml
set mapred.reduce.tasks=2; the set command overrides whatever configuration the Hive client loaded at startup

The same layering applies when writing HDFS client code:
the cluster's config file hdfs-site.xml
the default hdfs-default.xml bundled in the project's jars
settings made in the project code: conf.set("dfs.blocksize", "134217728");

Hive data skew

MapReduce programs can also hit data skew during computation.

What data skew is: within the cluster, data becomes concentrated on a subset of nodes (data hot spots), so the computation at the end runs skewed.
Whichever computation framework or solution you pick, data skew can appear:
it is a property of the data, not a defect of the framework.

Data hot spots produce skew at computation time.

How skew shows up: in a job, the vast majority of tasks finish normally and on time, but a few nodes, hit by hot-spot data, need far longer to process their share than the other, normal tasks.

The job's progress moves quickly and evenly for most of the run, then stalls near 99%.

This must be optimized.
Root cause: until the entire job finishes, the JVMs bound to it are not released.

The ideal: however many tasks there are, each takes about the same time to process its data.

Situations that produce skew:
1. group by without an aggregate function
group by sends all rows with the same key into one group for processing.
When group by is combined with an aggregation, skew rarely shows up, since map-side partial aggregation absorbs most of it.

With bad luck, given 5 keys:
key1(10)   -----   reduceTask1
key2(10)   -----   reduceTask1
key3(10)   -----   reduceTask1
key4(10)   -----   reduceTask1

key5(40)   -----   reduceTask2
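A common mitigation is to let Hive split a skewed group-by into two MR jobs; a sketch of the relevant settings:
set hive.map.aggr=true;           -- map-side partial aggregation
set hive.groupby.skewindata=true; -- two-job plan: distribute randomly first, then group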

2. count(distinct)
Rewrite it as a group by (the two-stage rewrite shown earlier).

3. Joins

small table join small table: any skew is negligible
large table join small table: skew appears; convert it to a map join

Joins on columns of different data types:

in the log table, userid has type string
in the business table user, userid has type int

select a.*, b.* from log a join user b on a.userid = b.userid;

Option 1: cast the string userid to int
Values like 203 and 49093 convert cleanly,
but values like s42309 and a79872bou clearly cannot. Hive silently swallows the conversion error and produces NULL instead, and all those NULL keys land on the same reducer, which is exactly the skew.

Option 2: cast the int userid to string
No type-conversion error can occur in this direction.
If skew still appears after this cast, it comes from genuinely hot key values in the data, not from the types.
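A minimal sketch of option 2's fix, casting on the join condition:
select a.*, b.*
from log a join user b
on a.userid = cast(b.userid as string);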

你可能感兴趣的:(hive)