1. First, a demo of something we already know from Linux: monitor a file with tail -F <filename>, and append data to the file from another window.
tail -F qqq.txt
echo "abcdfs" >> qqq.txt
Simulating the way a web server produces logs:
Output streams are buffered: data is first written to a buffer and only flushed to disk once the buffer reaches a certain size,
so Flume would not see records arriving one line at a time.
To simulate a web server producing logs, we make the stream flush after every line it writes.
1) SocketTest.java - a class that reads a source file and writes it line by line into a target file.
import java.io.*;

public class SocketTest {
    public static void main(String[] args) throws IOException, InterruptedException {
        File ctoFile = new File(args[0]);  // source file
        File dest = new File(args[1]);     // target file
        InputStreamReader rdCto = new InputStreamReader(new FileInputStream(ctoFile));
        OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(dest));
        BufferedReader bfReader = new BufferedReader(rdCto);
        BufferedWriter bwriter = new BufferedWriter(writer);
        PrintWriter pw = new PrintWriter(bwriter);
        String txtline = null;
        while ((txtline = bfReader.readLine()) != null) {
            Thread.sleep(2000);  // write one line every two seconds
            pw.println(txtline);
            pw.flush();          // flush each line so it reaches the file immediately
        }
        bfReader.close();
        pw.close();
    }
}
2) On Linux, create the directory and the file SocketTest/data.log:
mkdir SocketTest
touch SocketTest/data.log
3) Compile it to a .class file and upload it to the Linux system. The program takes two arguments: the first is the source data file, the second is the target file.
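For example (assuming a JDK is installed on the machine), the class can be compiled with javac and the resulting SocketTest.class uploaded:
$> javac SocketTest.java
Then run it, passing the source log and the target file: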
java SocketTest access.20120104.log SocketTest/data.log
4) Monitor the data.log file:
$> tail -F data.log
【flume + kafka + SparkStreaming】
Step 1: 【Use Flume to monitor the log file (replacing tail -F data.log)】
flume-exec-logger.conf
agent.sources = r1
agent.sinks = k1
agent.channels = c1
# source type: exec
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /home/hyxy/Desktop/SocketTest/data.log
agent.sources.r1.channels = c1
# log to the console so we can test whether events from the input file are being picked up
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100
agent.channels.c1.transactionCapacity = 100
Start Flume:
$>flume-ng agent -n agent -c /home/hyxy/apps/flume/conf/ -f /home/hyxy/apps/flume/conf/flume-exec-logger.conf -Dflume.root.logger=INFO,console
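Once the agent is running, the pipeline can be verified by appending a line to the monitored file from another terminal (the same idea as the tail -F demo above); the appended line should show up as an event in the Flume console output:
$> echo "test line" >> /home/hyxy/Desktop/SocketTest/data.log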
Step 2: 【flume + kafka】
flumekafa.conf
agent.sources = r1
agent.sinks = k1
agent.channels = c1
# source type: exec
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /home/hyxy/Desktop/SocketTest/data.log
agent.sources.r1.channels = c1
# Kafka sink for Flume 1.6: http://flume.apache.org/releases/content/1.6.0/FlumeUserGuide.html#kafka-sink
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = test222
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.batchSize = 20
agent.sinks.k1.requiredAcks = 1
agent.sinks.k1.channel = c1
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100
agent.channels.c1.transactionCapacity = 100
Start Kafka to test whether the data reaches Kafka, and consume it with a console consumer.
Start ZooKeeper:
zookeeper-server-start.sh ~/apps/kafka/config/zookeeper.properties
Start the Kafka server:
kafka-server-start.sh ~/apps/kafka/config/server.properties
Create the topic:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test222
Start SocketTest:
java SocketTest access.20120104.log ~/Desktop/SocketTest/data.log
Start Flume:
flume-ng agent -n agent -c /home/hyxy/apps/flume/conf/ -f /home/hyxy/apps/flume/conf/flumekafa.conf -Dflume.root.logger=INFO,console
Start a Kafka console consumer to verify:
kafka-console-consumer.sh --zookeeper localhost:2181 --topic test222 --from-beginning
【flume + kafka + SparkStreaming】
Use Spark Streaming in place of the Kafka console consumer:
$> kafka-console-consumer.sh --zookeeper localhost:2181 --topic teststreaming --from-beginning
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object FlumeKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fromkafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "master:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "teststreaming",
      "auto.offset.reset" -> "latest", // latest: start from the newest offset; earliest: consume from the beginning
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("teststreaming")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    // stream.map(x => x.key()).print()  // keys are null here; printed batches still prove data is arriving
    stream.map(x => x.value()).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
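Because enable.auto.commit is set to false above, the consumer offsets are not committed automatically. A minimal sketch of one common option (not part of the original example) is to commit each batch's offsets back to Kafka yourself, using the offset ranges exposed by the direct stream; this block would go before ssc.start():
// sketch: manual offset commit for the direct stream created above
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ...process the batch here...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
HasOffsetRanges and CanCommitOffsets come from org.apache.spark.streaming.kafka010, which is already covered by the wildcard import above.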
Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.2</version>
</dependency>
1) Simulate the web server producing logs
Start ZooKeeper:
zookeeper-server-start.sh ~/apps/kafka/config/zookeeper.properties
Start the Kafka server:
kafka-server-start.sh ~/apps/kafka/config/server.properties
Create the topic:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test222
Start SocketTest:
java SocketTest access.20120104.log ~/Desktop/SocketTest/data.log
Start Flume:
flume-ng agent -n agent -c /home/hyxy/apps/flume/conf/ -f /home/hyxy/apps/flume/conf/flumekafa.conf -Dflume.root.logger=INFO,console
2) Run the SparkStreaming program
-------------------------------------------
Time: 1565482650000 ms
-------------------------------------------
218.25.117.226 - - [04/Jan/2012:00:00:15 +0800] "GET /thread-1562140-1-1.html HTTP/1.1" 200 13427 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.8 (KHTML, like Gecko) Chrome/17.0.932.0 Safari/535.8"
123.125.71.84 - - [04/Jan/2012:00:00:16 +0800] "GET /thread-718324-1-1.html HTTP/1.1" 200 17133 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
218.25.117.226 - - [04/Jan/2012:00:00:16 +0800] "GET /thread-1561945-1-2.html HTTP/1.1" 200 14095 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.8 (KHTML, like Gecko) Chrome/17.0.932.0 Safari/535.8"
211.100.47.147 - - [04/Jan/2012:00:00:16 +0800] "GET / HTTP/1.0" 200 52803 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
116.114.60.72 - - [04/Jan/2012:00:00:16 +0800] "GET /thread-1562049-1-1.html HTTP/1.1" 200 21520 "http://www.itpub.net/forum-146-1.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
218.25.117.226 - - [04/Jan/2012:00:00:16 +0800] "GET /thread-1562298-1-1.html HTTP/1.1" 200 13207 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.8 (KHTML, like Gecko) Chrome/17.0.932.0 Safari/535.8"
216.251.112.134 - - [04/Jan/2012:00:00:16 +0800] "GET /thread-950066-1-3.html HTTP/1.1" 200 20044 "-" "Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7"
123.126.50.76 - - [04/Jan/2012:00:00:16 +0800] "GET /forum.php?mod=viewthread&tid=635079 HTTP/1.1" 200 14405 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
110.75.173.35 - - [04/Jan/2012:00:00:16 +0800] "GET /forum.php?mod=viewthread&tid=1373488&page=2 HTTP/1.1" 200 11808 "-" "Yahoo! Slurp China"
123.138.17.18 - - [04/Jan/2012:00:00:16 +0800] "GET / HTTP/1.1" 200 13203 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Maven dependency for SparkSession:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.2</version>
</dependency>
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
object ncSparkStreaming {
  def main(args: Array[String]): Unit = {
    // create the configuration
    val conf = new SparkConf().setAppName("nwc").setMaster("local[2]")
    // create the streaming context (a SparkContext is obtained through it)
    val ssc = new StreamingContext(conf, Seconds(5))
    // set the log level
    ssc.sparkContext.setLogLevel("ERROR")
    // get the data stream from a socket
    val lines = ssc.socketTextStream("master", 9999)
    // inside foreachRDD, obtain a SparkSession for each batch
    // lines.foreachRDD { rdd =>  // each micro-batch is one RDD
    lines.flatMap(x => x.split(" ")).foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      val df = rdd.toDF("word")            // build a DataFrame from each RDD
      df.createOrReplaceTempView("words")  // create a temporary view from the DataFrame
      val sqldf = spark.sql("select word, count(*) as total from words group by word") // query it with SQL
      sqldf.show()                         // show(), not print()
    }
    // start the computation
    ssc.start()
    // wait for the computation to terminate
    ssc.awaitTermination()
    // stop
    ssc.stop()
  }
}
Feed data into the streaming job through nc:
$> nc -l 9999
hello word hello 123
Output:
+-----+-----+
| word|total|
+-----+-----+
|hello|    2|
| word|    1|
|  123|    1|
+-----+-----+
persist():
Note: DStreams generated by window-based operations are automatically persisted in memory; the developer does not need to call persist() on them.
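As a point of reference (not in the original notes), a windowed word count over the lines DStream from the nc example above might look like the sketch below; the DStream produced by reduceByKeyAndWindow is kept in memory automatically. The 30-second window length and 10-second slide interval are arbitrary illustration values, and both must be multiples of the batch interval:
// sketch: windowed word count; the windowed DStream is persisted automatically
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()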
Metadata checkpointing:
Configuration - the configuration used to create the streaming application.
DStream operations - the set of DStream operations that define the streaming application.
Incomplete batches - batches whose jobs are queued but have not yet completed.
Data checkpointing:
Saves the generated RDDs to reliable storage so that data can be recovered from the checkpoint after a failure.
Checkpointing is the main mechanism Spark Streaming uses for fault tolerance, and it enables fast recovery:
by restoring from a checkpoint, Spark Streaming can read how far the previous run had progressed in processing the data and continue from there.
When should checkpointing be enabled?
Use of stateful transformations - if the application uses updateStateByKey or reduceByKeyAndWindow (with an inverse function),
a checkpoint directory must be provided to allow periodic RDD checkpointing (see the sketch after this list).
Recovering from failures of the driver running the application - metadata checkpoints are used to recover with progress information.
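A minimal sketch of the stateful case (reusing the lines DStream and ssc from the socket word-count examples in these notes; the names are assumptions, not the original code): updateStateByKey keeps a running count per word across batches, which is why ssc.checkpoint(...) must be set first.
// sketch: stateful word count with updateStateByKey; requires a checkpoint directory
ssc.checkpoint("hdfs://master:9000/check")
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
val runningCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int](updateFunc)
runningCounts.print()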
def main(args: Array[String]): Unit = {
  // checkpoint lives on HDFS (check which HA node is active)
  // use the factory method to obtain the StreamingContext instead of new:
  // getOrCreate first looks in the checkpoint directory; if checkpoint data is found, the context
  // is rebuilt from it, otherwise the function passed as the second argument creates a fresh one
  val ssc = StreamingContext.getOrCreate("hdfs://master:9000/check", functionToCreateContext _)
  // set the log level
  ssc.sparkContext.setLogLevel("ERROR")
  // start the computation
  ssc.start()
  // wait for the computation to terminate
  ssc.awaitTermination()
}

// creates the context the first time, when no checkpoint exists yet
def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("nwc").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(4))
  val lines = ssc.socketTextStream("master", 9999)
  // split each line into words on spaces
  val words = lines.flatMap(_.split(" "))
  // build (word, 1) pairs
  val pairs = words.map(word => (word, 1))
  // count occurrences
  val wordconts = pairs.reduceByKey(_ + _)
  // print the results
  wordconts.print(20)
  ssc.checkpoint("hdfs://master:9000/check") // hadoop fs -mkdir /check
  ssc // no explicit return needed: Scala returns the last expression
}
1. Start the HDFS service
zkServer.sh start (master slave1 slave2)
start-dfs.sh
2. Create the checkpoint path on HDFS
$> hadoop fs -mkdir /check
$> hadoop fs -chmod 777 /check
3. Start nc
$> nc -l 9999
hello word 123
heoo word
aaa
4. Run the streaming program
-------------------------------------------
Time: 1561005192000 ms
-------------------------------------------
(word,1)
(hello,1)
(123,1)
-------------------------------------------
Time: 1561005196000 ms
-------------------------------------------
(word,1)
(heoo,1)
-------------------------------------------
Time: 1561005200000 ms
-------------------------------------------
-------------------------------------------
Time: 1561005204000 ms
-------------------------------------------
(aaa,1)
5. Inspect the checkpoints generated on HDFS; the checkpoint files are compressed binary data
$> hadoop fs -lsr /check
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x - jÏëСÐÂ700µ羺 supergroup 0 2018-06-20 12:32 /check/83171791-f679-479a-8c23-005192000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3299 2018-06-20 12:33 /check/checkpoint-1561005192000.bk
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3295 2018-06-20 12:33 /check/checkpoint-1561005192000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3298 2018-06-20 12:33 /check/checkpoint-1561005196000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3302 2018-06-20 12:33 /check/checkpoint-1561005196000.bk
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3296 2018-06-20 12:33 /check/checkpoint-1561005200000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3300 2018-06-20 12:33 /check/checkpoint-1561005200000.bk
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3296 2018-06-20 12:33 /check/checkpoint-1561005204000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3300 2018-06-20 12:33 /check/checkpoint-1561005204000.bk
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3296 2018-06-20 12:33 /check/checkpoint-1561005208000
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 3300 2018-06-20 12:33 /check/checkpoint-1561005208000.bk
drwxrwxrwx - jÏëСÐÂ700µ羺 supergroup 0 2018-06-20 12:32 /check/receivedBlockMetadata
-rw-r--r-- 3 jÏëСÐÂ700µ羺 supergroup 0 2018-06-20 12:32 /check/receivedBlockMetadata/log-1561005172005-1561005232005