Flume: Monitoring the Nginx Log and Sending It to a Hive Table

Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/98945268

I. Environment Preparation

First, for setting up the Hadoop environment, see the post "Hadoop: Building a Hadoop 3.x Cluster on 3 Servers (Complete, Tested Edition)"; for installing and configuring Nginx, see "Building a Load-Balanced Cluster with Nginx + Tomcat + Memcached"; for installing and configuring Hive, see "Hive: Installing and Configuring Hive 2.3.4" and "Hive: Configuring Hive in Local Mode with a MySQL Metastore (Hive 2.3.3 + Hadoop 2.9.0 + MySQL 5.7.18)".
Installing Flume itself is straightforward: download it, extract the archive, and configure the system environment variables. Flume can be downloaded with the following command.

wget http://mirror.bit.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
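A minimal install sketch, assuming the archive is unpacked under /usr/local (the FLUME_HOME path below matches the path used when starting Flume later in this post):

tar -zxvf apache-flume-1.9.0-bin.tar.gz
mv apache-flume-1.9.0-bin /usr/local/flume-1.9.0
# Append to /etc/profile, then run: source /etc/profile
export FLUME_HOME=/usr/local/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin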

II. Starting the Services

# Start Hadoop
start-dfs.sh
start-yarn.sh
# Start Nginx
/usr/local/nginx/sbin/nginx
# Start the Hive CLI
hive
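Before wiring in Flume, you can confirm that the Hadoop daemons are running by listing the JVM processes:

jps

Depending on your cluster layout, the output should include NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.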

III. Creating the Hive Database and Table

1. Run the following command to create a database named hive_nginx_log.

hive> create database hive_nginx_log;

2. Examine the format of the Nginx log, shown below.

192.168.175.10 - - [31/Jul/2019:21:19:39 +0800] "GET /test/sharding HTTP/1.1" 200 798 "http://192.168.175.200/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
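This is Nginx's predefined combined log format. In nginx.conf it is equivalent to the directive below; its fields map one-to-one onto the nine columns of the table created in the next step. If your access_log uses a custom format, the regex in the table definition must be adjusted accordingly.

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';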

3. Create the table nginx_log in the hive_nginx_log database.

use hive_nginx_log;
CREATE TABLE nginx_log(
  client_ip STRING,
  remote_login_name STRING,
  remote_oauth_user STRING,
  request_time_utf STRING,
  request_method_url STRING,
  status_code STRING,
  send_bytes_size STRING,
  source_access STRING,
  client_info STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
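To sanity-check the regex before involving Flume, you can load a single sample line into a throwaway partition and query it. The file /tmp/access_sample.log (containing the sample line above) and the dt='test' partition are just illustrative names; if any column comes back NULL, the regex did not match the line.

LOAD DATA LOCAL INPATH '/tmp/access_sample.log' INTO TABLE nginx_log PARTITION (dt='test');
SELECT client_ip, request_time_utf, request_method_url, status_code FROM nginx_log WHERE dt='test';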

4. Create a Flume configuration file named flume-hive-nginx-log.conf that monitors the Nginx log and pushes it into the HDFS directory backing the Hive table nginx_log.

# Define the agent and the names of its source, channel, and sink
myagent.sources = r1
myagent.channels = c1
myagent.sinks = k1
# Configure the source: an exec source that tails the Nginx access log
myagent.sources.r1.type = exec
# Log file to monitor
myagent.sources.r1.command = tail -F /usr/local/nginx/logs/access.log
# Maximum number of lines to read and send to the channel at a time
myagent.sources.r1.batchSize = 5
# Configure the channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 10000
myagent.channels.c1.transactionCapacity = 100
# Configure the sink: write events to HDFS
myagent.sinks.k1.type = hdfs
# This path corresponds to the Hive table's directory, bucketed by hour
myagent.sinks.k1.hdfs.path = hdfs://binghe100:9000/user/hive/warehouse/hive_nginx_log.db/nginx_log/%Y-%m-%d_%H
myagent.sinks.k1.hdfs.filePrefix = nginx-%Y-%m-%d_%H
myagent.sinks.k1.hdfs.fileSuffix = .log
myagent.sinks.k1.hdfs.fileType = DataStream
# Do not roll files based on the number of events
myagent.sinks.k1.hdfs.rollCount = 0
# Roll a new file when the current HDFS file reaches 128 MB
myagent.sinks.k1.hdfs.rollSize = 134217728
myagent.sinks.k1.hdfs.useLocalTimeStamp = true
# Wire the source and sink to the channel
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1

5. Start Flume.

flume-ng agent --conf conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent -Dflume.root.logger=INFO,console
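The command above runs in the foreground and logs to the console, which is convenient for debugging. For long-running collection you might run the agent in the background instead (a sketch; adjust the paths to your installation):

nohup flume-ng agent --conf /usr/local/flume-1.9.0/conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent > /tmp/flume-nginx.out 2>&1 &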

6. Access Nginx by entering http://192.168.175.200 in a browser.
Flume's console output then includes lines like the following.

hdfs://binghe100:9000/user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23/nginx-2019-07-31_23.1564589089324.log
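You can also confirm directly on HDFS that the file was written (the timestamped file name will differ on each run):

hdfs dfs -ls /user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23/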

7. The Flume output above shows that the table's current partition should point to the /user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23 directory, so run the following command in the Hive CLI.

ALTER TABLE nginx_log ADD IF NOT EXISTS PARTITION (dt='2019-07-31_23') LOCATION '/user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23/';
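To confirm the partition was registered:

hive> SHOW PARTITIONS nginx_log;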

8. Query the data in the Hive table nginx_log.

hive> SELECT * FROM nginx_log;
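Once data is flowing, ordinary HiveQL works against the parsed columns. For example, counting requests per HTTP status code within the partition added above:

hive> SELECT status_code, COUNT(*) FROM nginx_log WHERE dt='2019-07-31_23' GROUP BY status_code;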
