Data Cleaning - Common Hive Operations

1. Storing Hive query results in HDFS

hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/20190120' row format delimited fields terminated by '\t' 
      >select uid, ukey from r_show_explode_sohu where p_day='20190108' limit 10
Explanation: the target path here is 'hdfs://dc4/user/sohutvrec/zqj/20190120', the stored fields are uid and ukey separated by tabs ('\t'), and the query reads the 2019-01-08 (p_day='20190108') data from the r_show_explode_sohu table
Note: the target storage path must be a fully qualified path, and in Hive you can only use overwrite to replace data, not into to append, so the stored data needs to be partitioned (for example, by date)
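To verify the exported files afterward, a quick sketch (assuming the same paths as above, relative to the user's HDFS home directory):
hdfs>hdfs dfs -ls zqj/20190120
hdfs>hdfs dfs -cat zqj/20190120/* | head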

2. Deduplication queries - remove duplicate values from the queried fields, keeping only one copy

hive>select count(distinct ukey) as ukey from r_show_explode_sohu where p_day='20190108';
Explanation: distinct performs the deduplication; here it counts the distinct values of the ukey field in the r_show_explode_sohu table
Note: distinct must be placed right before the fields. It works well when querying a single field, but with multiple fields it applies to all of them at once (a row only counts as a duplicate when every field repeats)
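A minimal sketch of the multi-field behavior (reusing the same table and fields as above): a row survives deduplication as long as either uid or ukey differs:
hive>select distinct uid, ukey from r_show_explode_sohu where p_day='20190108';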

hive>select uid,ukey from(select t.*,row_number() over(partition by ukey order by 1) rn from r_show_explode_sohu t where p_day='20180713') a where a.rn=1
Explanation: row_number() over() is used for the dedup query; here the dedup key is ukey
Note: in Hive strict mode, the subquery must be given an alias, otherwise an error is thrown

Dedup reference: https://blog.csdn.net/xiaoshunzi111/article/details/70611946
row_number can dedup on multiple fields:
    SELECT field1,field2,field3,field4,field5
    FROM (
           SELECT T.*, ROW_NUMBER() OVER(PARTITION BY field1,field2,field3 ORDER BY 1) RN
           FROM table_name T
    ) A
    WHERE A.RN=1;

Pipe dedup: awk '!x[$0]++'   pipe the selected data through this command to dedup; works great~
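A usage sketch for the pipe (assuming the hive CLI can run queries with -e; the output file name is arbitrary):
hive -e "select ukey from r_show_explode_sohu where p_day='20190108'" | awk '!x[$0]++' > ukey_dedup.txt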
    
3. Aggregation queries - use sum to count the number of records in each status of the same field


hive>select first_category,sum(case when status='1' then 1 else 0 end) as statu_1,sum(case when status='2' then 1 else 0 end) as statu_2,sum(case when status='3' then 1 else 0 end) as statu_3,count(*) as statu_sum,'20190115' datatime from dw_video_common group by first_category
Explanation: case when...then...else...end marks records that satisfy condition A as 1 and all others as 0; sum then adds up the marks, giving the total number of records that satisfy condition A (similar to an if...else statement)
Note: count cannot be used directly here, because count counts all the marks (both the 1s and the 0s)
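If you do want count, one workaround (a sketch relying on count ignoring NULLs) is to drop the else branch so non-matching rows become NULL and are not counted:
hive>select first_category,count(case when status='1' then 1 end) as statu_1 from dw_video_common group by first_category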

Hive aggregate functions: https://blog.csdn.net/haramshen/article/details/52668586

4. How to inspect parquet files
1) The parquet-tools-1.6.0rc3-SNAPSHOT.jar file is required, so first copy it into your own directory
cp /home/sohutvrec/liying/sohutv/parquet-tools-1.6.0rc3-SNAPSHOT.jar ./
2) Copy the parquet file you want to inspect into your own directory
hdfs>hdfs dfs -get sohutvrec/userBasic/20190114/part-00000-37edd67e-3520-4cb9-81a7-b2b58459bb63-c000.snappy.parquet
3) Use parquet-tools-1.6.0rc3-SNAPSHOT.jar to view the specified parquet data
/usr/java/jdk1.8.0_121/bin/java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat part-00000-37edd67e-3520-4cb9-81a7-b2b58459bb63-c000.snappy.parquet |more

Note: use the schema subcommand to view the data types of the parquet file
/usr/java/jdk1.8.0_121/bin/java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema part-00099-a1723cd1-f163-4834-92b7-8661b64711d3-c000.snappy.parquet | more
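If this jar version also provides the head subcommand (an assumption, not checked here), you can print just the first few records without piping cat through more:
/usr/java/jdk1.8.0_121/bin/java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar head -n 5 part-00000-37edd67e-3520-4cb9-81a7-b2b58459bb63-c000.snappy.parquet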

5. Converting a Hive table from wide to long - using lateral view
1) Use one: expand array data in one row into multiple rows
hive>select distinct split(video.vid,',') as vid from sohutvrec.sohutv_engine_rec_log lateral view explode(videos) t1 as video WHERE featuretype='video' and dt='20190117' and hour>='00';
Explanation: the videos field here is an "array of structs", so "lateral view explode(videos) t1 as video" is used directly for the wide-to-long conversion
Note: when selecting vid at the front, you must use the alias "video" defined later in the lateral view clause
Reference: https://blog.csdn.net/dreamingfish2011/article/details/51250641
The data format is as follows: response is an array<struct<...>>, laid out horizontally, and needs to be converted to vertical~
request struct<...>
response        array<struct<...>>
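For context: a bare explode(videos) in the select list cannot be combined with other columns in the same select, which is exactly what lateral view solves. A minimal sketch against the same table (keeping the partition column dt next to the exploded rows):
hive>select dt, video.vid from sohutvrec.sohutv_engine_rec_log lateral view explode(videos) t1 as video where featuretype='video' and dt='20190117' limit 5;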

2) Use two: filtering queries on array elements
select distinct request.ukey, resp.vid from sohutvrec.sohutv_engine_rec_log lateral view explode(response) r1 as resp where resp.candidate='412' and dt='20190119' and hour>='00';
Explanation: this filters for all elements in the array where candidate='412' (that is, first turn one row into many rows, then select the rows you want)

6. Converting an array into a set - using collect_set(array.item)
select request.ukey,request.uid,collect_set(resp.vid) as vidset from sohutvrec.sohutv_engine_rec_log lateral view explode(response) r1 as resp where resp.candidate='412' and dt='20190119' group by request.ukey,request.uid;
Note: it must be used together with group by (group on the fields that are not wrapped in collect_set)
      
Converting an array into a string
select concat_ws(',',c_array)  from test_array  where dt='20190121' and size(c_array)=2 limit 2;  
reference:https://blog.csdn.net/gongmf/article/details/52680748
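The two functions can also be combined; a sketch that turns the deduplicated vid set directly into a comma-separated string (same table and filters as above; assumes vid is a string column, otherwise cast it first as in requirement 6 below):
hive>select request.ukey,request.uid,concat_ws(',',collect_set(resp.vid)) as vids from sohutvrec.sohutv_engine_rec_log lateral view explode(response) r1 as resp where resp.candidate='412' and dt='20190119' group by request.ukey,request.uid;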
  
  
 
Actual requirements:
1. Deduplicate the user uid-ukey info from the last six months in the sohutv_rec table and store it in HDFS; note that it should be partitioned by date
hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/sohutv_vetl/20190117' row format delimited fields terminated by '\t' 
     select uid,ukey from(select t.*,row_number() over(partition by ukey order by 1) rn from r_show_explode_sohu t where p_day>='20180717' and p_day<='20190117') a where a.rn=1

2. For first-level video categories (the first_category field), count the number of videos in each status (-1, 0, 1, 2, 3) and store the result in HDFS
hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/sohutv_rank/20190117' row format delimited fields terminated by '\t' 
    >select first_category,sum(case when status='-1' then 1 else 0 end) as label_n1,sum(case when status='0' then 1 else 0 end) as label_0,sum(case when status='1' then 1 else 0 end) as label_1,sum(case when status='2' then 1 else 0 end) as label_2,sum(case when status='3' then 1 else 0 end) as label_3,count(*) as label_all,'20190117' datatime from dw_video_common where dt=20190117 group by first_category

3. Create an external Hive table, partitioned by date and stored as parquet, to link Sohu Video user features with basic user behavior
1) Create the external Hive table, partitioned by date, stored as parquet
    create external table userbasic(
        ukey string,
        brand string,
        os string,
        city string,
        province string,
        area string
    ) partitioned by (day string) stored as parquet;
Note: the fields of the external table must exactly match the fields of the associated data; you can use the schema subcommand to check the parquet data types
For example: /usr/java/jdk1.8.0_121/bin/java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema part-00099-a1723cd1-f163-4834-92b7-8661b64711d3-c000.snappy.parquet | more

2) Attach the user feature files in HDFS (userbasic)
hive>alter table userbasic add partition (day=20190116) 
    >location 'hdfs://dc4/user/sohutvrec/sohutvrec/userBasic/20190116'
Note: the location must be written correctly; I got stuck here for quite a while
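To double-check that the partition was attached to the right location, a quick sketch (column names as defined above):
hive>show partitions userbasic;
hive>select ukey,brand,city from userbasic where day='20190116' limit 5;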

4. Hive table missing features - user/video dedup
Fetch the uid-ukey-passport fields from the sohutvrec.sohutv_engine_rec_log table where featuretype is user
hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/feature_miss_user/20190117' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    >select distinct request.ukey, request.uid, request.passport from sohutvrec.sohutv_engine_rec_log WHERE featuretype='user' and dt='20190117' and hour>='00';
Note: request is a struct; to pull out a specific field, just use structname.itemname

Fetch the vid info from the sohutvrec.sohutv_engine_rec_log table where featuretype is video
hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/feature_video/20190117' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    >select distinct split(video.vid,',') as vid from sohutvrec.sohutv_engine_rec_log lateral view explode(videos) t1 as video WHERE featuretype='video' and dt='20190117' and hour>='00';

5. In the sohutvrec.sohutv_engine_rec_log table, query the ukey-uid-vid volume that went through the bubble-up interface (candidate=412)
hive>insert overwrite directory 'hdfs://dc4/user/sohutvrec/zqj/sohutv_rank_2_3/20190119' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' 
    >select request.ukey,request.uid,collect_set(resp.vid) as vidset from sohutvrec.sohutv_engine_rec_log lateral view explode(response) r1 as resp where resp.candidate='412' and dt='20190119' group by request.ukey,request.uid;
Explanation: the elements inside the set need their own separator, set here with "COLLECTION ITEMS TERMINATED BY ','"
Note: to count how many rows were produced:
      hdfs>hdfs dfs -cat zqj/sohutv_rank_2_3/20190119/* | wc -l  

6. The vetl.t_app_vv_jfpass table stores the full user behavior log. We need to find all users (ukey) and their album ids (playlistid) for long videos (site=1) with a play time over 300 seconds. Note that the album ids for one user may repeat or be 0, so they need to be deduplicated and the 0s dropped. If a user has more than 5 album ids, keep the 5 most recent ones, ordered by time. Also note that t_app_vv_jfpass only contains uid, while we ultimately need ukey, so it must also be joined with vetl.apps_userid (over three months, since ukey is refreshed every three months) to map uid to ukey.

hive>insert overwrite directory 'hdfs://dc4/user/vrecsys/zqj/ukey_uid_last_three_month' row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',' 
      >select ukey, concat_ws(',',collect_list(concat_ws('_',cast(playlistid as string),cast(ordid as string)))) as playlist
      >from (select t2.ukey, t1.playlistid, row_number() over(partition by t2.ukey order by time desc) as ordid
      >      from (select uid,playlistid,max(created_on_time) as time from vetl.t_app_vv_jfpass where p_day>=20190303 and p_day<=20190402 and playtime>300 and playlistid!=0 group by uid,playlistid) t1
      >      join (select uid,ukey from vetl.apps_userid where p_day>=20190103 and p_day<=20190402 group by uid,ukey) t2 on t1.uid=t2.uid) t3
      >where ordid<=5 group by ukey

Approach: first, under all of the constraints, group the t_app_vv_jfpass table by uid and playlistid; since the same uid-playlistid pair may appear multiple times, take the most recent one. After this simple filtering we have the raw uid-playlistid-time data. Then join it with apps_userid, and finally group by ukey and sort within each group in descending time order (row_number() over(partition by t2.ukey order by time desc) as ordid). This yields ukey-playlistid-ordid data grouped by ukey; selecting on ordid then caps the number of album ids per user and produces the final result.

Tip: the window function row_number() over(partition by t2.ukey order by time desc) sorts within each partition; note that it executes after group by (it runs almost last in the query).
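A small sketch of that ordering (the column choices are only illustrative): the window function ranks groups that group by has already aggregated within the same query:
hive>select uid,count(*) as cnt,row_number() over(order by count(*) desc) as rk from vetl.t_app_vv_jfpass where p_day=20190402 group by uid limit 10;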
