Spark Streaming Real-Time Stream Processing Project Notes

Chapter 2: Flume, a Distributed Log Collection Framework

Course outline
Business status analysis => Flume overview => Flume architecture and core components => Flume environment setup => Flume in practice

1. Business status analysis

  • Web servers/application servers are scattered across many machines
  • The Hadoop big data platform performs the statistical analysis
  • How do we get the logs onto the Hadoop platform?
  • Existing solutions and their problems

Problems with the traditional server-to-Hadoop approach:
1. Hard to monitor
2. Heavy IO read/write overhead
3. Poor fault tolerance and poor load balancing
4. High latency: jobs have to be launched at intervals instead of running continuously

2. Flume overview

Flume website: http://flume.apache.org/

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
In other words: Flume, originally developed at Cloudera, is a distributed, highly reliable, highly available service for efficiently collecting, aggregating, and moving massive amounts of log data.

Design goals
Reliability
Scalability
Manageability

Comparison with similar products
Flume: Cloudera/Apache, Java
Scribe: Facebook, C/C++, no longer maintained
Chukwa: Yahoo/Apache, Java, no longer maintained
Kafka: LinkedIn/Apache, Scala
Fluentd: Ruby
Logstash: Ruby/JRuby, the L in the ELK stack (Elasticsearch, Logstash, Kibana)

Flume history
Cloudera 0.9.2: Flume OG
FLUME-728 (the NG refactoring): Flume NG ==> moved to Apache
2012.7: version 1.0
2015.5: version 1.6
~: version 1.7

Flume architecture and core components

  1. Source: collects the data
  2. Channel: aggregates and buffers the events
  3. Sink: writes the data out

[Figure: Flume agent architecture, Source => Channel => Sink]

Flume installation prerequisites
1. Java Runtime Environment - Java 1.8 or later
2. Memory - Sufficient memory for configurations used by sources, channels or sinks
3. Disk Space - Sufficient disk space for configurations used by channels or sinks
4. Directory Permissions - Read/Write permissions for directories used by agent
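
A quick way to sanity-check these prerequisites from the shell (a minimal sketch; the paths are examples, not fixed by the course):

java -version                 # 1. JRE present and 1.8 or later
free -m                       # 2. enough memory for the planned channels
df -h /home/hadoop            # 3. enough free disk space (example path)
ls -ld /home/hadoop/data      # 4. the agent user needs read/write here (example path)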

Install the JDK
Download the JDK
Extract it to ~/app
Add Java to the system environment variables: vi ~/.bash_profile
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
Reload so the settings take effect: source ~/.bash_profile
Verify: java -version
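
The same steps as a single shell session (a sketch; the tarball name and download location are assumptions):

tar -zxvf ~/software/jdk-8u144-linux-x64.tar.gz -C ~/app     # assumed tarball name/path
echo 'export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144' >> ~/.bash_profile
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bash_profile
source ~/.bash_profile
java -version                                                # should report 1.8.0_144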

Install Flume
Download Flume
Extract it to ~/app
Add Flume to the system environment variables: vi ~/.bash_profile
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
Reload so the settings take effect: source ~/.bash_profile
Configure flume-env.sh: export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144
Verify: flume-ng version
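
The flume-env.sh step spelled out (a sketch; it assumes the stock conf/ layout that ships with the Flume tarball):

cd $FLUME_HOME/conf
cp flume-env.sh.template flume-env.sh                        # the template ships with the distribution
echo 'export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144' >> flume-env.sh
flume-ng version                                             # prints the version banner if everything is wired up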


Flume in practice:

Requirement 1: collect data from a specified network port and print it to the console

[Figure: the requirement-1 agent, netcat source => memory channel => logger sink]

The key to using Flume is writing the configuration file:
A) Configure the Source
B) Configure the Channel
C) Configure the Sink
D) Wire the three components together

a1: the agent name
r1: the source name
k1: the sink name
c1: the channel name

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Consult the official user guide:
http://flume.apache.org/FlumeUserGuide.html#netcat-source

a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444

type: The component type name, needs to be netcat
bind: The hostname or IP address to bind to
port: The port # to listen on

a1.sinks.k1.type = logger

type: The component type name, needs to be logger

a1.channels.c1.type = memory

type: The component type name, needs to be memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Note: a source can fan out to multiple channels, so the source property is channels (plural); a sink drains from exactly one channel at a time, so the sink property is channel (singular). A fan-out sketch follows below.
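
For illustration, a fan-out sketch with a hypothetical second channel c2 and sink k2 (not part of the example above):

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.selector.type = replicating    # default selector: copy each event to all listed channels
a1.sources.r1.channels = c1 c2               # channels, plural: one source feeding two channels
a1.sinks.k1.channel = c1                     # channel, singular: each sink drains exactly one channel
a1.sinks.k2.channel = c2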

Steps:
1. Write the configuration file
   In the conf directory: vi example.conf
   and paste in the configuration above.
2. Start the agent

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/example.conf \
-Dflume.root.logger=INFO,console

3. Test with telnet: telnet hadoop000 44444
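
Typing, say, hello into the telnet session should make the agent's console print an event like the one below (illustrative output; telnet appends a carriage return, hence the trailing 0D byte):

Event: { headers:{} body: 68 65 6C 6C 6F 0D                            hello. }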

Requirement 2: monitor a file and collect newly appended data to the console in real time

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log
a1.sources.r1.shell = /bin/sh -c

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Exec source properties (from the user guide):
channels: the channel(s) the source writes to
type: The component type name, needs to be exec
command: The command to execute
shell: A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.

Steps:
1. Write the configuration file
   In the conf directory: vi exec-memory-logger.conf
   and paste in the configuration above.
2. Start the agent

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/exec-memory-logger.conf \
-Dflume.root.logger=INFO,console

3. Test:

Open a new terminal window and append a few lines:

[hadoop@hadoop001 data]$ echo hello >> data.log
[hadoop@hadoop001 data]$ echo world >> data.log
[hadoop@hadoop001 data]$ echo welcome >> data.log

The original window prints the new events:

Event: { headers:{} body: 68 65 6C 6C 6F  				hello }
Event: { headers:{} body: 77 6F 72 6C 64  				world }
Event: { headers:{} body: 77 65 6C 63 6F 6D 65			welcome }

Requirement 3: collect logs from server A to server B in real time

Technology selection (data flow sketched after the list):

Agent on machine A: exec source + memory channel + avro sink
Agent on machine B: avro source + memory channel + logger sink
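
Both agents together form one pipeline; the avro sink on A pushes events over Avro RPC to the avro source listening on B (in this demo both agents happen to run on hadoop000, with port 44444 as the bridge):

machine A: exec source ==> memory channel ==> avro sink
                                                 |
                                      Avro RPC (hadoop000:44444)
                                                 |
machine B: avro source ==> memory channel ==> logger sink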

The two configuration files:

exec-memory-avro.conf

# Name the components on this agent
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

# Describe/configure the source
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /home/hadoop/data/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c

# Describe the sink
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = hadoop000
exec-memory-avro.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
exec-memory-avro.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel

avro-memory-logger.conf

# Name the components on this agent
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

# Describe/configure the source
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop000
avro-memory-logger.sources.avro-source.port = 44444

# Describe the sink
avro-memory-logger.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
avro-memory-logger.channels.memory-channel.type = memory

# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel

Start avro-memory-logger first, so the avro source is already listening when the avro sink tries to connect:

flume-ng agent \
--name avro-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/avro-memory-logger.conf \
-Dflume.root.logger=INFO,console

Then start exec-memory-avro:

flume-ng agent \
--name exec-memory-avro \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
-Dflume.root.logger=INFO,console

Test:
Open a new terminal window and append:

[hadoop@hadoop001 data]$ echo hello spark >> data.log
[hadoop@hadoop001 data]$ echo hello hadoop >> data.log

The logger agent's window prints the new events:

Event: { headers:{} body: 68 65 6C 6C 6F 20 73 70 61 72 6B                hello spark }
Event: { headers:{} body: 68 65 6C 6C 6F 20 68 61 64 6F 6F 70             hello hadoop }
