Big Data Course Project: Implementation
Contents
I. Start the Hadoop cluster (pseudo-distributed)
II. Create a directory to store the data
III. Collect the files into HDFS
1. Create zebra.conf under Flume's data directory
2. Use Flume to collect the data and land it in HDFS
3. Run the command to store the data in HDFS
4. Check in Eclipse whether the files exist
IV. Start Hive
V. Work with Hive
1. Create the zebra database
2. Create an external table pointing at the data to process (external table + partition table, partitioned by date)
3. Repair the partitions
4. View the data
5. Add the partitions manually
6. View the data again
VI. Clean the data and extract the useful fields (18 fields)
1. DDL
2. Insert the data
3. View the data
VII. Organize the cleaned data into a fact table
1. DDL
2. Insert the data
VIII. Query the information of interest, using the application-popularity table as an example
1. DDL
2. Insert the data
3. Query the top 5 most popular apps
IX. Sqoop export workflow
1. Create the matching table in MySQL
2. Export the d_h_http_apptype table with Sqoop
3. View the data
4. Visualization page
I. Start the Hadoop cluster (pseudo-distributed)
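The start-up commands themselves are not in the notes; a minimal sketch, assuming Hadoop's sbin scripts are on the PATH:
start-dfs.sh
start-yarn.sh
jps    # NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager should all be running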
II. Create a directory to store the data
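A sketch of the directory setup; /home/zebra matches the spoolDir in the Flume config below, and the source path of the raw logs is hypothetical:
mkdir /home/zebra
cp /path/to/raw/logs/* /home/zebra    # drop the raw zebra logs here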
III. Collect the files into HDFS
Hive operates on data stored in HDFS, so the files must be put into HDFS before they can be processed.
1. Create zebra.conf under Flume's data directory
2. Use Flume to collect the data and land it in HDFS.
Flume collects the logs in daily batches: the timestamp interceptor together with the %Y-%m-%d escape in the sink path below writes one HDFS directory per day.
# Name the components of agent a1
a1.sources=r1
a1.channels=c1
a1.sinks=s1
# Source: watch /home/zebra for new files
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/zebra
# Timestamp interceptor supplies the timestamp used by %Y-%m-%d in the sink path
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
# Sink: write plain text to one HDFS directory per day
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://192.168.150.137:9000/zebra/reportTime=%Y-%m-%d
a1.sinks.s1.hdfs.fileType=DataStream
# Roll files every 30 s; disable rolling by size and by event count
a1.sinks.s1.hdfs.rollInterval=30
a1.sinks.s1.hdfs.rollSize=0
a1.sinks.s1.hdfs.rollCount=0
# Wire source and sink through a memory channel
a1.channels.c1.type=memory
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
3. Run the command to store the data in HDFS
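The exact launch command is not recorded in the notes; a typical invocation, assuming zebra.conf is in the current directory and flume-ng is on the PATH:
flume-ng agent -n a1 -c conf -f zebra.conf -Dflume.root.logger=INFO,console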
4. Check in Eclipse whether the files exist
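If the Eclipse Hadoop plugin is not set up, the same check works from the shell (the date directory here is only an example):
hdfs dfs -ls /zebra
hdfs dfs -ls /zebra/reportTime=2019-07-01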
IV. Start Hive
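The notes do not show the command; typically, from Hive's bin directory (the path is an assumption):
cd /path/to/hive/bin
sh hive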
V. Work with Hive
1. Create the zebra database
- Run: create database zebra;
- Run: use zebra;
2. Create an external table pointing at the data to process (an external, partitioned table, partitioned by date)
DDL: create EXTERNAL table zebra (a1 string,a2 string,a3 string,a4 string,a5 string,a6 string,a7 string,a8 string,a9 string,a10 string,a11 string,a12 string,a13 string,a14 string,a15 string,a16 string,a17 string,a18 string,a19 string,a20 string,a21 string,a22 string,a23 string,a24 string,a25 string,a26 string,a27 string,a28 string,a29 string,a30 string,a31 string,a32 string,a33 string,a34 string,a35 string,a36 string,a37 string,a38 string,a39 string,a40 string,a41 string,a42 string,a43 string,a44 string,a45 string,a46 string,a47 string,a48 string,a49 string,a50 string,a51 string,a52 string,a53 string,a54 string,a55 string,a56 string,a57 string,a58 string,a59 string,a60 string,a61 string,a62 string,a63 string,a64 string,a65 string,a66 string,a67 string,a68 string,a69 string,a70 string,a71 string,a72 string,a73 string,a74 string,a75 string,a76 string,a77 string) partitioned by (reporttime string) row format delimited fields terminated by '|' stored as textfile location '/zebra';
You can check whether the data was picked up with Hive's sampling syntax: select * from zebra TABLESAMPLE (1 ROWS);
3. Repair the partitions
The partition directories were created on HDFS by Flume rather than through Hive, so Hive does not know about them yet; register them with: msck repair table zebra;
4. View the data
If a query still returns no rows, the partitions were not registered: msck repair does not always work, and when it fails the partitions have to be added by hand.
5. Add the partitions manually
PS: the partition date must be the current date, i.e. the date of the reportTime directory Flume actually wrote; see the sketch below.
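A sketch of the manual add (2019-07-01 is only an example date; substitute the real one):
- alter table zebra add partition (reporttime='2019-07-01') location '/zebra/reportTime=2019-07-01';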
6. View the data again
VI. Clean the data and extract the useful fields (18 fields)
1. DDL
- create table dataclear(reporttime string,appType bigint,appSubtype bigint,userIp string,userPort bigint,appServerIP string,appServerPort bigint,host string,cellid string,appTypeCode bigint,interruptType String,transStatus bigint,trafficUL bigint,trafficDL bigint,retranUL bigint,retranDL bigint,procdureStartTime bigint,procdureEndTime bigint)row format delimited fields terminated by '|';
2. Insert the data
- insert overwrite table dataclear select concat(reporttime,' ','00:00:00'),a23,a24,a27,a29,a31,a33,a59,a17,a19,a68,a55,a34,a35,a40,a41,a20,a21 from zebra;
3. View the data
Timing notes from the original runs: 56 s, then about 20 s; either way, efficiency is very low.
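The query used for the check is not shown; a minimal spot-check of the cleaned table:
- select * from dataclear limit 5;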
VII. Organize the cleaned data into a fact table
1. DDL:
- create table f_http_app_host (reporttime string,appType bigint,appSubtype bigint,userIp string,userPort bigint,appServerIP string,appServerPort bigint,host string,cellid string,attempts bigint,accepts bigint,trafficUL bigint,trafficDL bigint,retranUL bigint,retranDL bigint,failCount bigint,transDelay bigint)row format delimited fields terminated by '|';
2. Insert the data (appTypeCode 103 appears to mark HTTP sessions; the if() expressions below derive attempts, accepts, traffic, retransmissions, failures, and delay for those sessions only):
- insert overwrite table f_http_app_host select reporttime,appType,appSubtype,userIp,userPort,appServerIP,appServerPort,host, if(cellid == '',"000000000",cellid),if(appTypeCode == 103,1,0),if(appTypeCode == 103 and find_in_set(transStatus,"10,11,12,13,14,15,32,33,34,35,36,37,38,48,49,50,51,52,53,54,55,199,200,201,202,203,204,205,206,302,304,306")!=0 and interruptType == 0,1,0),if(apptypeCode == 103,trafficUL,0), if(apptypeCode == 103,trafficDL,0), if(apptypeCode == 103,retranUL,0), if(apptypeCode == 103,retranDL,0), if(appTypeCode == 103 and transStatus == 1 and interruptType == 0,1,0),if(appTypeCode == 103, procdureEndTime - procdureStartTime,0) from dataclear;
VIII. Query the information of interest, using the application-popularity table as an example:
1. DDL:
- create table D_H_HTTP_APPTYPE(hourid string,appType int,appSubtype int,attempts bigint,accepts bigint,succRatio double,trafficUL bigint,trafficDL bigint,totalTraffic bigint,retranUL bigint,retranDL bigint,retranTraffic bigint,failCount bigint,transDelay bigint) row format delimited fields terminated by '|';
Aggregate over the fact table f_http_app_host, grouping by hour and app type and summing the per-row counters.
2. Insert the data:
- insert overwrite table D_H_HTTP_APPTYPE select reporttime,apptype,appsubtype,sum(attempts),sum(accepts),round(sum(accepts)/sum(attempts),2),sum(trafficUL),sum(trafficDL),sum(trafficUL)+sum(trafficDL),sum(retranUL),sum(retranDL),sum(retranUL)+sum(retranDL),sum(failCount),sum(transDelay)from f_http_app_host group by reporttime,apptype,appsubtype;
3. Query the top 5 most popular apps:
- select hourid,apptype,sum(totalTraffic) as tt from D_H_HTTP_APPTYPE group by hourid,apptype order by tt desc limit 5;
IX. Sqoop export workflow:
1. Create the matching table in MySQL (a sketch follows)
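The MySQL DDL is not included in the notes; a sketch mirroring the Hive columns (the database name matches the Sqoop URL below; the varchar length is an assumption):
create database zebra;
use zebra;
create table D_H_HTTP_APPTYPE(hourid varchar(64),appType int,appSubtype int,attempts bigint,accepts bigint,succRatio double,trafficUL bigint,trafficDL bigint,totalTraffic bigint,retranUL bigint,retranDL bigint,retranTraffic bigint,failCount bigint,transDelay bigint);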
2. Export the d_h_http_apptype table with Sqoop:
- sh sqoop export --connect jdbc:mysql://hadoop03:3306/zebra --username root --password root --export-dir '/user/hive/warehouse/zebra.db/d_h_http_apptype/000000_0' --table D_H_HTTP_APPTYPE -m 1 --fields-terminated-by '|'
3. View the data
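For example, from the MySQL client (host and credentials taken from the Sqoop command above):
mysql -h hadoop03 -uroot -proot zebra -e "select * from D_H_HTTP_APPTYPE limit 5;"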
4. Visualization page