Importing Text Files into Hive Tables

I. Determining the set of data files
1. Source channels
Writing your own web crawler: high development cost, and data collection is not the core task here.
Public datasets: no development cost, high quality, and available at whatever scale is needed.
Third-party data vendors: no development cost, but the data must be paid for.
2. Choosing a channel
Given the project requirements, a public dataset is sufficient.
Sogou Labs - http://www.sogou.com/labs/
Multi-domain public datasets - http://blog.csdn.net/marleylee/article/details/76587354
Public datasets from abroad - https://site.douban.com/146782/widget/notes/15524697/note/519440833/
A curated list of public datasets - https://mp.weixin.qq.com/s/8whZsvERs6zlUeYT677YyA
3. Choosing the dataset
Determine the data volume:
1. Total size
2. Total file count (or, equivalently, average file size)
3. Total record count: run wc -l on a sample file to count its records; since record size is roughly constant, sample file size / sample record count ≈ total size / total record count, so total records ≈ total size × (sample records / sample size)
This project uses:
About 220 million Weibo posts covering 2012, from the start of the year to the end
52 weeks of data, packaged as 52 zip files
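The record-count estimate above can be sketched in shell. The temporary directory and the tiny 3-line files below are hypothetical stand-ins for the real weekly CSVs:

```shell
# Sketch: estimate total records by sampling one file with wc -l and
# scaling by the file count (assumes weekly files are about the same size).
demo_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo_dir/week1.csv"
printf 'd\ne\nf\n' > "$demo_dir/week2.csv"

sample_records=$(wc -l < "$demo_dir/week1.csv")   # records in one sample file
file_count=$(ls "$demo_dir"/*.csv | wc -l)        # number of weekly files
echo "estimated total records: $((sample_records * file_count))"
```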
4. Data file format
CSV
5. Data schema
mid: unique id of the post
retweeted_status_mid: mid of the original post being retweeted
uid: id of the posting user
retweeted_uid: uid of the author of the original retweeted post
source: client/terminal the post was sent from
image: image attachment
text: post content
geo: geographic location
created_at: creation time
deleted_last_seen: time the post was last seen before deletion
permission_denied: set to "permission denied" once the post has been deleted
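One caveat for this schema: the text field of a post can itself contain commas, and the tables below split rows on a literal ',' (ROW FORMAT DELIMITED). A hypothetical record shows how a naive comma split truncates the text field and shifts every later column:

```shell
# Hypothetical CSV record whose text field is "hello, world":
line='3456,78,90,12,iPhone,img.jpg,hello, world,POINT(0 0),2012-01-01,,'
# Field 7 should be the full post text, but a plain comma split cuts it off:
echo "$line" | cut -d , -f 7   # prints only "hello"
```

If embedded commas turn out to be common in the real data, the rows need escaping or a different delimiter before loading.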

II. Loading the source data into the Hive warehouse
1. Batch-unzip the raw zip files into a target directory
ls input_zip_dir/*.zip | xargs -n1 unzip -o -d output_csv_dir
2. Create two tables with the same schema in Hive: weibo_origin and weibo_product

CREATE EXTERNAL TABLE weibo_origin(
mid string,
retweeted_status_mid string,
uid string,
retweeted_uid string,
source string,
image string,
text string,
geo string,
created_at string,
deleted_last_seen string,
permission_denied string
)
COMMENT 'weibo content table'
PARTITIONED BY (week_seq string COMMENT 'the week sequence')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

CREATE TABLE weibo_product(
mid string,
retweeted_status_mid string,
uid string,
retweeted_uid string,
source string,
image string,
text string,
geo string,
created_at string,
deleted_last_seen string,
permission_denied string
)
COMMENT 'weibo content table'
PARTITIONED BY (week_seq string COMMENT 'the week sequence')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS ORC;

3. Load the unzipped plain-text weekly data into weibo_origin, one partition per week
vi load_to_weibo_origin.sh
#!/bin/sh
source ../env.sh

output_csv_dir=/home/gudepeng/wbxm1/deal/wbsj
db_name=gudepeng
table_name=weibo_origin

# One week_seq per unzipped CSV, e.g. week1.csv -> week1
week_seq_list=$(ls $output_csv_dir/*.csv | xargs -n1 basename | cut -d . -f1)

for week_seq in $week_seq_list; do
    $HIVE -e "
    use $db_name;
    load data local inpath '$output_csv_dir/$week_seq.csv'
    overwrite into table $table_name partition(week_seq='$week_seq');
    "
done
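A note on deriving week_seq from file names: cut -d . -f1 applied to a full path keeps the directory prefix, so basename has to strip the directory first. The path below is a made-up example:

```shell
# cut alone keeps the directory part of the path:
echo '/home/gudepeng/deal/week1.csv' | cut -d . -f1
# prints /home/gudepeng/deal/week1

# basename strips the directory first, leaving just the week name:
echo '/home/gudepeng/deal/week1.csv' | xargs -n1 basename | cut -d . -f1
# prints week1
```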

4. Clean the raw data in weibo_origin and insert it, week by week, into weibo_product

vi load_to_weibo_product.sh
#!/bin/sh
source ../env.sh

week_seq=week1
db_name=gudepeng
from_table=weibo_origin
to_table=weibo_product

# Dynamic partitioning must be enabled, since partition(week_seq)
# is written without a static value.
$HIVE -e "
use $db_name;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
from (select * from $from_table where week_seq='$week_seq') oneweek
insert overwrite table $to_table partition(week_seq)
select * where mid != 'mid';
"
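The mid != 'mid' predicate above is presumably there because each weekly CSV starts with a header row, and LOAD DATA copies files verbatim, so the header becomes a data row whose first column is the literal string "mid". The same filter in plain shell, on made-up sample data:

```shell
# Made-up sample: two data rows plus a header line that slipped in as data.
f=$(mktemp)
printf 'mid,uid\n1001,42\n1002,43\n' > "$f"

# Keep only rows whose first field is not the literal header "mid":
awk -F, '$1 != "mid"' "$f"
```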
