Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/98945268
First, for setting up the Hadoop environment, see the post 《Hadoop之——基于3台服务器搭建Hadoop3.x集群(实测完整版)》; for installing and configuring Nginx, see 《Nginx+Tomcat+Memcached负载均衡集群服务搭建》; and for installing and configuring Hive, see 《Hive之——Hive2.3.4 安装和配置》 and 《Hive之——hive本地模式配置,连接mysql数据库--Hive2.3.3+Hadoop2.9.0+MySQL5.7.18》.
Installing Flume is straightforward: download it, extract the archive, and configure the system environment variables. Flume can be downloaded with the following command.
wget http://mirror.bit.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
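After the download finishes, the remaining install steps might look like the following sketch (the install directory /usr/local/flume-1.9.0 matches the path used later in this post; adjust it to your environment):
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/local
mv /usr/local/apache-flume-1.9.0-bin /usr/local/flume-1.9.0
# Append to /etc/profile (or ~/.bashrc), then re-source it
export FLUME_HOME=/usr/local/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin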
# Start Hadoop
start-dfs.sh
start-yarn.sh
# Start Nginx
/usr/local/nginx/sbin/nginx
# Start the Hive CLI
hive
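In a separate terminal, it is worth confirming that the services are actually up before going further (a quick sanity check; the Nginx address matches the one used later in this post):
jps    # should list NameNode, DataNode, ResourceManager, NodeManager, etc.
curl -I http://192.168.175.200/    # Nginx should answer with HTTP/1.1 200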
1. Create a database named hive_nginx_log with the following command.
hive> create database hive_nginx_log;
2. Examine the format of the Nginx log, shown below.
192.168.175.10 - - [31/Jul/2019:21:19:39 +0800] "GET /test/sharding HTTP/1.1" 200 798 "http://192.168.175.200/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
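This is Nginx's standard combined log format; for reference, the predefined directive in nginx.conf is equivalent to:
log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';
Each variable maps, in order, onto a column of the table created in the next step.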
3. Create the table nginx_log in the hive_nginx_log database.
use hive_nginx_log;
CREATE TABLE nginx_log(
client_ip STRING,
remote_login_name STRING,
remote_oauth_user STRING,
request_time_utf STRING,
request_method_url STRING,
status_code STRING,
send_bytes_size STRING,
source_access STRING,
client_info STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
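Note that org.apache.hadoop.hive.contrib.serde2.RegexSerDe lives in the hive-contrib JAR. It normally ships under Hive's lib directory, but if Hive reports the class as missing, it can be registered manually (the JAR path below is an assumption for a default Hive 2.3.4 install):
hive> ADD JAR /usr/local/hive/lib/hive-contrib-2.3.4.jar;
The table definition can then be checked with:
hive> DESCRIBE FORMATTED nginx_log;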
4. Create a Flume configuration file named flume-hive-nginx-log.conf that tails the Nginx access log and pushes its entries into the Hive table nginx_log.
# Name the agent's source, channel, and sink
myagent.sources = r1
myagent.channels = c1
myagent.sinks = k1
# Configure the source
myagent.sources.r1.type = exec
# The Nginx access log to monitor
myagent.sources.r1.command = tail -F /usr/local/nginx/logs/access.log
# Number of lines delivered to the channel per batch
myagent.sources.r1.batchSize = 5
# Define the channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 10000
myagent.channels.c1.transactionCapacity = 100
# Define the sink
myagent.sinks.k1.type = hdfs
# The HDFS sink expands time escapes such as %Y-%m-%d and %H
# This path corresponds to the Hive table's directory
myagent.sinks.k1.hdfs.path = hdfs://binghe100:9000/user/hive/warehouse/test_flume.db/nginx_log/%Y-%m-%d_%H
myagent.sinks.k1.hdfs.filePrefix = nginx-%Y-%m-%d_%H
myagent.sinks.k1.hdfs.fileSuffix = .log
myagent.sinks.k1.hdfs.fileType = DataStream
# Do not roll files based on the number of events
myagent.sinks.k1.hdfs.rollCount = 0
# Do not roll files based on time (the HDFS sink defaults to rolling every 30 seconds)
myagent.sinks.k1.hdfs.rollInterval = 0
# Roll a new file when it reaches 128 MB on HDFS
myagent.sinks.k1.hdfs.rollSize = 134217728
myagent.sinks.k1.hdfs.useLocalTimeStamp = true
# Wire the source, channel, and sink together
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1
5. Start Flume.
flume-ng agent --conf /usr/local/flume-1.9.0/conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent -Dflume.root.logger=INFO,console
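If the agent should keep running after the terminal closes, one common variation (an optional choice, not required by Flume) is to start it with nohup and log to a file:
nohup flume-ng agent --conf /usr/local/flume-1.9.0/conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent > /tmp/flume-nginx.log 2>&1 &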
6. Access Nginx by entering http://192.168.175.200 in a browser.
Once requests arrive, Flume's console output includes lines like the following.
hdfs://binghe100:9000/user/hive/warehouse/test_flume.db/nginx_log/2019-07-31_23/nginx-2019-07-31_23.1564589089324.log
7. As the path in the Flume output shows, the Hive table's partition for this hour can be pointed at the /user/hive/warehouse/test_flume.db/nginx_log/2019-07-31_23 directory. Because the partition's LOCATION is specified explicitly, this works even though the directory does not sit under the hive_nginx_log.db warehouse path. Run the following command in the Hive CLI.
ALTER TABLE nginx_log ADD IF NOT EXISTS PARTITION (dt='2019-07-31_23') LOCATION '/user/hive/warehouse/test_flume.db/nginx_log/2019-07-31_23/';
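To confirm that the partition was registered:
hive> SHOW PARTITIONS nginx_log;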
8. Query the data in the Hive table nginx_log.
hive> SELECT * FROM nginx_log;
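With the log fields parsed into columns, simple analyses become straightforward; for example, counting requests per status code for the hour just loaded (a sketch against the schema above):
hive> SELECT status_code, COUNT(*) AS hits
    > FROM nginx_log
    > WHERE dt = '2019-07-31_23'
    > GROUP BY status_code;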