1. Download
http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
2. Upload the archive to a directory on the target server (master)
3. Extract it
tar -xvf apache-flume-1.8.0-bin.tar.gz
4. cd apache-flume-1.8.0-bin/conf
5. cp flume-conf.properties.template flume-conf.properties
6. vi flume-conf.properties
A bit of theory first: each piece of data Flume receives is called an event.
Note: telnet has to be installed separately; once the agent is running, connect with: telnet localhost 44445
Receive netcat data and display it on the console
## define the sources, channels and sinks
# the data source
agent1.sources = netcatSrc
# where the collected data is buffered; memoryChannel keeps it in memory
agent1.channels = memoryChannel
# where the data ends up; loggerSink prints it to the console as log output
agent1.sinks = loggerSink
## netcatSrc configuration
# the type of source to start
agent1.sources.netcatSrc.type = netcat
# the IP address to bind to
agent1.sources.netcatSrc.bind = localhost
# the port to listen on
agent1.sources.netcatSrc.port = 44445
## loggerSink configuration
# write events to the console as log output
agent1.sinks.loggerSink.type = logger
## memoryChannel configuration
agent1.channels.memoryChannel.type = memory
# holds at most 100 events
agent1.channels.memoryChannel.capacity = 100
## connect netcatSrc and loggerSink through memoryChannel
agent1.sources.netcatSrc.channels = memoryChannel
agent1.sinks.loggerSink.channel = memoryChannel
# "agent" starts an agent; --conf points at the conf directory; --name agent1 must match the agent name used in the configuration
# -Dflume.root.logger=INFO,console prints the agent's log output to the console at INFO level
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1 -Dflume.root.logger=INFO,console
Incoming data is buffered in memory first and then printed to the console.
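If telnet is not handy, the same kind of test event can be pushed to the netcat source with a few lines of Scala. This is only a sketch: the host and port come from the config above, the object name and message are made up.
import java.io.PrintWriter
import java.net.Socket

// Plays the role of `telnet localhost 44445`: opens a TCP connection to the
// netcat source and sends one line, which Flume wraps into a single event.
object NetcatTestClient {
  def main(args: Array[String]): Unit = {
    val socket = new Socket("localhost", 44445)
    val out = new PrintWriter(socket.getOutputStream, true)
    out.println("hello flume") // one line = one Flume event
    out.close()
    socket.close()
  }
}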
Receive telnet data and write it to HDFS every five events
## define the sources, channels and sinks
agent1.sources = netcatSrc
# a memoryChannel can lose data; for durability it should be replaced with a file channel
agent1.channels = memoryChannel
# this sink writes the data to HDFS
agent1.sinks = hdfsSink
## netcatSrc configuration
agent1.sources.netcatSrc.type = netcat
agent1.sources.netcatSrc.bind = localhost
agent1.sources.netcatSrc.port = 44445
## hdfsSink configuration
# save events to HDFS
agent1.sinks.hdfsSink.type = hdfs
# write files under this directory
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:8020/user/hadoop-jrq/spark-course/steaming/flume/%y-%m-%d
# write to HDFS once every 5 Flume events
agent1.sinks.hdfsSink.hdfs.batchSize = 5
# this must be true, otherwise the %y-%m-%d escape in the path cannot be resolved and the sink throws an error
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
## memoryChannel configuration
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100
## connect netcatSrc and hdfsSink through memoryChannel
agent1.sources.netcatSrc.channels = memoryChannel
agent1.sinks.hdfsSink.channel = memoryChannel
Start it:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1
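Because hdfs.batchSize is 5, nothing appears in HDFS until five events have been taken from the channel. A quick way to hit that threshold, using the same kind of socket client as above (a sketch, not part of the original setup):
import java.io.PrintWriter
import java.net.Socket

// Sends exactly five lines so the hdfsSink's batchSize (5) is reached
// and a write to HDFS is triggered.
object SendFiveEvents {
  def main(args: Array[String]): Unit = {
    val socket = new Socket("localhost", 44445)
    val out = new PrintWriter(socket.getOutputStream, true)
    (1 to 5).foreach(i => out.println(s"event-$i"))
    out.close()
    socket.close()
  }
}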
Monitor a file and write to HDFS every five events
## define the sources, channels and sinks
agent1.sources = logSrc
# buffer received data in files on local disk
agent1.channels = fileChannel
agent1.sinks = hdfsSink
## logSrc configuration
agent1.sources.logSrc.type = exec
# the -F option has a small quirk here, so -f with -n 0 is used instead
agent1.sources.logSrc.command = tail -fn0 /home/hadoop-jrq/spark-course/steaming/flume-course/demo3/logs/webserver.log
## hdfsSink configuration
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:8020/user/hadoop-jrq/spark-course/steaming/flume/%y-%m-%d
agent1.sinks.hdfsSink.hdfs.batchSize = 5
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
## fileChannel configuration
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /home/hadoop-jrq/spark-course/steaming/flume-course/demo2-2/checkpoint
agent1.channels.fileChannel.dataDirs = /home/hadoop-jrq/spark-course/steaming/flume-course/demo2-2/data
## connect logSrc and hdfsSink through fileChannel
agent1.sources.logSrc.channels = fileChannel
agent1.sinks.hdfsSink.channel = fileChannel
This is real-time log collection: new log lines are picked up as they are appended and saved to HDFS.
Start the agent:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1
Then append a test line to the monitored file:
echo testdata >> webserver.log
Monitor a log file, write it to Kafka, then process the data with Spark Streaming
Flume configuration:
agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1
agent1.sources.s1.type = exec
agent1.sources.s1.command = tail -fn0 /home/hadoop-jrq/spark-course/steaming/weblog.log
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# the Kafka topic to write to
agent1.sinks.k1.kafka.topic = pageview
# the Kafka cluster's broker addresses
agent1.sinks.k1.kafka.bootstrap.servers = master:9092,slave1:9092,slave2:9092
# send a batch to Kafka once every 20 events
agent1.sinks.k1.kafka.flumeBatchSize = 20
# producer ack setting; see my article on Kafka's ack mechanism if this is unfamiliar
agent1.sinks.k1.kafka.producer.acks = 1
agent1.sinks.k1.kafka.producer.linger.ms = 1
# compression codec
agent1.sinks.k1.kafka.producer.compression.type = snappy
agent1.sources.s1.channels = c1
agent1.sinks.k1.channel = c1
Start Flume: bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1
# simulate sending data by appending to the file the exec source is tailing
echo testdata >> weblog.log
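Before wiring up Spark Streaming, it is worth checking that events actually arrive in the pageview topic. A minimal consumer sketch, assuming the kafka-clients library (0.10 or later) is on the classpath; the broker list matches the sink configuration above:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Polls the pageview topic and prints whatever the Flume Kafka sink has written.
object PageviewTopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092")
    props.put("group.id", "pageview-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("pageview"))
    while (true) {
      val records = consumer.poll(1000)
      records.asScala.foreach(r => println(r.value()))
    }
  }
}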
Required dependencies (group:artifact:version):
org.apache.spark:spark-core_2.11:2.2.0
org.apache.spark:spark-streaming_2.11:2.2.0
org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0
redis.clients:jedis:2.5.2
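If the project is built with sbt rather than Maven, the equivalent build.sbt entries would look like this (a sketch of the same coordinates):
// build.sbt dependency section for the versions listed above
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.2.0",
  "org.apache.spark" % "spark-streaming_2.11" % "2.2.0",
  "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.2.0",
  "redis.clients" % "jedis" % "2.5.2"
)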
Spark Streaming code:
import com.jrq.streaming.InternalRedisClient
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object PageViewStream {
def main(args: Array[String]) {
if (args.length < 2) {
println("Not enough arguments: <brokers> <topics> are required")
System.exit(1)
}
println("starting")
val sparkConf = new SparkConf()
if (!sparkConf.contains("spark.master")) {
// My local environment has a problem, so the Hadoop home is set manually here to avoid an error; if yours does not complain, this line is unnecessary
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master")
// Running locally; this job needs at least 5 cores to start: 3 for receiving data and 2 for processing it
sparkConf.setMaster("local[6]")
}
sparkConf.setAppName("DirectKafkaWordCount")
// val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val sc = new SparkContext(sparkConf)
val Array(brokers, topics) = args
// Create the context
val ssc = new StreamingContext(sc, Seconds(1))
// The checkpoint path uses my HA nameservice (mycluster); without that HA setup, or when testing locally, use hdfs://master:8020/ instead, otherwise the job breaks when the master goes down
ssc.checkpoint("hdfs://mycluster/user/hadoop-jrq/spark-course/streaming/checkpoint")
// read data from Kafka
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
// create a Kafka direct stream
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
// and turn each line of text into a PageView object
val pageViews = messages.map { case (_, value) => PageView.fromString(value) }
// every 3 seconds, count the page views with an error status (status != 200) in the last 50 seconds
val badCount = pageViews.window(Seconds(50), Seconds(3))
.filter(_.status != 200).count()
// every 2 seconds, count how many users visited the URL https://foo.com/ in the last 15 seconds
val fooCount = pageViews.window(Seconds(15), Seconds(2)).filter(_.url == "https://foo.com/").map(view => (view.userID, 1)).groupByKey().count()
// every second, count the number of distinct active users in the last 30 seconds
val activeUserCount = pageViews.window(Seconds(30), Seconds(1))
.map(view => (view.userID, 1))
.groupByKey()
.count()
activeUserCount.print()
activeUserCount.foreachRDD { rdd =>
// iterate over each partition and write the result to Redis
rdd.foreachPartition { partitionRecords =>
// take a connection from the Redis connection pool
val jedis = InternalRedisClient.getPool.getResource
partitionRecords.foreach { count =>
jedis.set("active_user_count", count.toString)
}
// return the connection to the pool when done
InternalRedisClient.getPool.returnResource(jedis)
}
}
ssc.start()
ssc.awaitTermination()
}
}
/*
Represents one page-view log record for a website, in the format:
url status zipCode userID\n
*/
case class PageView(val url: String, val status: Int, val zipCode: Int, val userID: Int)
extends Serializable
object PageView extends Serializable {
def fromString(in: String): PageView = {
val parts = in.split(" ")
new PageView(parts(0), parts(1).toInt, parts(2).toInt, parts(3).toInt)
}
}
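For reference, this is how one of the simulated log lines from the end of this post parses (a small sketch just to show the field order):
// "url status zipCode userID"
val pv = PageView.fromString("http://baidu.com 200 899787 22224")
// pv.url == "http://baidu.com", pv.status == 200, pv.zipCode == 899787, pv.userID == 22224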
Redis connection pool code:
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool
/**
* Internal Redis client for managing {@link Jedis} connections backed by a {@link JedisPool}
*/
object InternalRedisClient extends Serializable {
@transient private lazy val pool: JedisPool = {
val maxTotal = 10
val maxIdle = 10
val minIdle = 1
val redisHost = "192.168.240.186"
val redisPort = 6379
val redisTimeout = 30000
val poolConfig = new GenericObjectPoolConfig()
poolConfig.setMaxTotal(maxTotal)
poolConfig.setMaxIdle(maxIdle)
poolConfig.setMinIdle(minIdle)
poolConfig.setTestOnBorrow(true)
poolConfig.setTestOnReturn(false)
poolConfig.setMaxWaitMillis(100000)
val hook = new Thread{
override def run = pool.destroy()
}
sys.addShutdownHook(hook.run)
new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)
}
def getPool: JedisPool = {
assert(pool != null)
pool
}
def main(args: Array[String]): Unit = {
val p = getPool
val j = p.getResource
j.set("much", "")
}
}
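To check the value that PageViewStream writes, the same pool can be used to read the key back. A sketch (the key name matches the one set in foreachRDD above; the object name is made up):
// Reads back the counter written by the streaming job.
object ReadActiveUserCount {
  def main(args: Array[String]): Unit = {
    val jedis = InternalRedisClient.getPool.getResource
    try {
      println("active_user_count = " + jedis.get("active_user_count"))
    } finally {
      InternalRedisClient.getPool.returnResource(jedis)
    }
  }
}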
How to start it on the cluster:
/home/hadoop-jrq/bigdata/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
--class com.jrq.streaming.kafka.PageViewStream \
--master spark://master:7077 \
--deploy-mode client \
--driver-memory 512m \
--executor-memory 512m \
--total-executor-cores 5 \
--executor-cores 2 \
/home/hadoop-jrq/spark-course/steaming/spark-streaming-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar \
master:9092,slave1:9092,slave2:9092 pageview
How to simulate sending a log line:
echo http://baidu.com 200 899787 22224 >> weblog.log
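To produce a steady stream of such lines instead of running echo by hand, a small generator sketch (the file path matches the exec source in the Flume configuration; the URLs and numeric values are made up):
import java.io.{FileWriter, PrintWriter}
import scala.util.Random

// Appends one "url status zipCode userID" line per second to weblog.log,
// which the Flume exec source tails and forwards to the pageview Kafka topic.
object WeblogGenerator {
  def main(args: Array[String]): Unit = {
    val urls = Seq("http://baidu.com", "https://foo.com/", "http://example.com/index.html")
    val statuses = Seq(200, 200, 200, 404, 500)
    val rnd = new Random()
    while (true) {
      val line = s"${urls(rnd.nextInt(urls.length))} ${statuses(rnd.nextInt(statuses.length))} " +
        s"${100000 + rnd.nextInt(900000)} ${rnd.nextInt(10000)}"
      val out = new PrintWriter(new FileWriter("/home/hadoop-jrq/spark-course/steaming/weblog.log", true))
      out.println(line)
      out.close()
      Thread.sleep(1000)
    }
  }
}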