Hive Project in Practice (2)

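Table setup

Creating the tables
Four tables are needed here, even though there are only two data files. The tables used for the analysis are stored as ORC rather than in the default textfile format, and a raw text file cannot be loaded straight into an ORC table: LOAD DATA only moves the file into place and does not convert it. The data has to be written with an INSERT ... SELECT, that is, by querying another table and inserting the result into the ORC table. Hence four tables: two textfile tables, video and user, and two ORC tables, video_orc and user_orc.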

1. First create the textfile tables

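-- video: one row per video; category and relatedId hold several values separated by "&"
create table video(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- user: one row per uploader
-- (USER is a reserved word in Hive 1.2 and later; if the parser rejects the name, write it in backticks wherever it appears)
create table user(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;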
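Load the data into the two tables from HDFS:

load data inpath '<HDFS path of the video data file>' into table video;
load data inpath '<HDFS path of the user data file>' into table user;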
2. Create the two ORC tables

-- video_orc: same columns as video, stored as ORC and bucketed by uploader
create table video_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
clustered by (uploader) into 8 buckets
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;

-- user_orc: same columns as user, stored as ORC and bucketed by uploader
create table user_orc(
uploader string,
videos int,
friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as orc;
Load the data into the two ORC tables by selecting from the textfile tables:

insert into table video_orc select * from video;
insert into table user_orc select * from user;
The data is now loaded into both ORC tables and can be inspected with a simple query:

select * from video_orc limit 10;
select * from user_orc limit 10;
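
One quick way to confirm that the INSERT ... SELECT step copied everything is to compare row counts between each textfile table and its ORC counterpart, and to look at the storage format Hive recorded for the table. This is a minimal sketch that assumes the four tables exist under the names video, user, video_orc and user_orc; the backticks around user are there only because USER is a reserved word in newer Hive releases.

```sql
-- Row counts should match between each textfile table and its ORC copy.
select count(*) from video;
select count(*) from video_orc;
select count(*) from `user`;
select count(*) from user_orc;

-- Prints the table metadata, including the SerDe and the input/output format classes.
describe formatted video_orc;
```

If a pair of counts differs, the insert for that table is the first thing to re-run.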
 

For reference, the complete set of definitions, with each table name matched to the data it holds (video and video_orc hold the video records, user and user_orc hold the uploader records):

create table video(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

create table user(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;

create table video_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
clustered by (uploader) into 8 buckets
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;

create table user_orc(
uploader string,
videos int,
friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as orc;


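Top 10 video categories by popularity (number of videos)

a. Explode the category array so that each (video, category) pair becomes its own row; call the result t1

select videoId, category_name
from video_orc lateral view explode(category) t_category as category_name;

b. Group by category, count the videos in each, and keep the 10 largest
select category_name, count(1) as cnt from t1 group by category_name order by cnt desc limit 10;

Combined into one statement, with step a as the subquery t1:

SELECT
	category_name AS category,
	COUNT(1) AS cnt
FROM
	(SELECT videoId, category_name FROM video_orc LATERAL VIEW explode(category) t_category AS category_name) t1
GROUP BY
	category_name
ORDER BY
	cnt DESC
LIMIT 10;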
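
To see what the lateral view step does in isolation, explode can be run on a literal array. The sketch below uses made-up category values rather than rows from video_orc, and it assumes a Hive version that allows SELECT without a FROM clause (0.13 or later). Each array element comes back as its own row, which is what lets the outer query group by a single category name.

```sql
-- explode() turns one array value into one row per element.
-- The category names here are illustrative literals, not data from the table.
select explode(array('Music', 'Comedy')) as category_name;
-- returns two rows: Music, Comedy

-- With lateral view, each exploded element is joined back to its source row,
-- so a video tagged with two categories yields two (videoId, category_name) rows.
select videoId, category_name
from video_orc lateral view explode(category) t_category as category_name
limit 5;
```

A consequence worth keeping in mind: a video with several categories is counted once in each of them, so the per-category counts can add up to more than the number of videos.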


Categories of the 20 most-viewed videos

a. Take the 20 videos with the most views; call the result t1
select videoId, category, views from video_orc order by views desc limit 20;

b. Explode the categories of those videos
select videoId, category_name from t1 lateral view explode(category) t_category as category_name;

Combined, counting how many of the top 20 videos fall into each category:

select category_name, count(*) as cnt
from (
	select category_name
	from (select videoId, category, views from video_orc order by views desc limit 20) t1
	lateral view explode(category) t_category as category_name
) t2
group by category_name
order by cnt desc;


Top 10 videos in each category
a. Explode the categories; call the result t1
select videoId, category_name, views
from video_orc lateral view explode(category) t_category as category_name;
b. Number the videos within each category with the row_number window function, most viewed first, and keep numbers 1 through 10
select t2.*
from (
	select category_name, views, videoId,
		row_number() over (partition by category_name order by views desc) as rk
	from (select videoId, category_name, views from video_orc lateral view explode(category) t_category as category_name) t1
) t2
where t2.rk <= 10;
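
The a/b steps above name their intermediate results t1 and t2. One way to keep that step-by-step shape in a single runnable statement is a common table expression. This is a sketch, assuming a Hive version that supports WITH (0.13 or later); rk is just an alias chosen here for the row number.

```sql
with t1 as (
  -- step a: one row per (video, category)
  select videoId, category_name, views
  from video_orc lateral view explode(category) t_category as category_name
),
t2 as (
  -- step b: number the videos inside each category, most viewed first
  select category_name, videoId, views,
         row_number() over (partition by category_name order by views desc) as rk
  from t1
)
select category_name, videoId, views, rk
from t2
where rk <= 10;
```

row_number gives tied videos distinct numbers, so each category returns at most 10 rows. Swapping in rank() would let ties share a position, which can return more than 10 rows for a category.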

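Top 10 videos in the Animals category
First explode the categories; call the result t1
select videoId, category_name, views
from video_orc lateral view explode(category) t_category as category_name;

Then filter t1 on the category name (a string literal, so it needs quotes) and sort by views in descending order:
select videoId, views from (select videoId, category_name, views from video_orc lateral view explode(category) t_category as category_name) t1 where t1.category_name = 'Animals' order by views desc limit 10;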
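
When only one category is of interest, the explode step can be skipped by testing the array directly. The sketch below uses Hive's built-in array_contains and assumes the category values are stored exactly as 'Animals'; the match is exact and case-sensitive, so stray spaces in the data would make it miss.

```sql
-- Keep only videos whose category array contains 'Animals',
-- then take the 10 with the most views.
select videoId, views
from video_orc
where array_contains(category, 'Animals')
order by views desc
limit 10;
```

This avoids producing one row per (video, category) pair just to discard most of them in the filter.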

 

 
