Using Flume to monitor all files in a Linux directory and reading the data in batch with Spark

Flume monitors all files under a directory on Linux and writes their contents to HDFS, producing multiple files whose names end with a timestamp; Spark then reads the data in batch.

  1. Configure flume-spooldir.conf

     ### define agent
     a3.sources = r3
     a3.channels = c3
     a3.sinks = k3
     
     ### define sources
     a3.sources.r3.type = spooldir
     ### directory to monitor
     a3.sources.r3.spoolDir = /usr/local/src/apache-flume-1.6.0-bin/data
     ### ignore files ending in .log
     a3.sources.r3.ignorePattern = ^(.)*\\.log$
     ### suffix appended to files once they are fully ingested
     a3.sources.r3.fileSuffix = .delete
     
     ### define channels
     a3.channels.c3.type = file
     a3.channels.c3.checkpointDir = /usr/local/src/apache-flume-1.6.0-bin/data/filechannel/checkpoint
     a3.channels.c3.dataDirs = /usr/local/src/apache-flume-1.6.0-bin/data/filechannel/data
     
     ### define sink
     a3.sinks.k3.type = hdfs
     ### create a directory on HDFS named after the current date (see the sketch after this config for how the date escape resolves)
     a3.sinks.k3.hdfs.path = hdfs://master:9000/user/root/%Y%m%d
     a3.sinks.k3.hdfs.writeFormat = Text
     a3.sinks.k3.hdfs.batchSize = 100
     a3.sinks.k3.hdfs.useLocalTimeStamp = true
     a3.sinks.k3.hdfs.round = true
     a3.sinks.k3.hdfs.roundValue = 10
     a3.sinks.k3.hdfs.roundUnit = minute
     a3.sinks.k3.hdfs.fileType = DataStream
     
     ### bind the sources and sink to the channel
     a3.sources.r3.channels = c3
     a3.sinks.k3.channel = c3
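
    With useLocalTimeStamp = true, the HDFS sink stamps every event with the local time, the %Y%m%d escape in hdfs.path expands into a per-day directory, and round / roundValue / roundUnit bucket that timestamp into 10-minute intervals. A minimal Scala sketch of how the escape resolves (illustration only, not part of the Flume setup):

     import java.text.SimpleDateFormat
     import java.util.Date

     // useLocalTimeStamp = true: the sink uses the local clock as the event timestamp
     val ts = System.currentTimeMillis()

     // %Y%m%d becomes the current date, e.g. "20190101", so events are written
     // under hdfs://master:9000/user/root/20190101
     val dayDir = new SimpleDateFormat("yyyyMMdd").format(new Date(ts))

     // round = true, roundValue = 10, roundUnit = minute: the timestamp used for
     // the path escapes is rounded down to the start of its 10-minute bucket
     val tenMinutesMs = 10 * 60 * 1000L
     val roundedTs = ts - (ts % tenMinutesMs)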
    
  2. Start Flume
    Edit start-spooldir.sh

     ./flume-ng agent \
     -c conf \
     -n a3 \
     -f /usr/local/src/apache-flume-1.6.0-bin/conf/flume-spooldir.conf
    
  3. Check whether the files were created on HDFS

     master: >>  hadoop fs -ls /user/root
     19/01/01 19:45:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
     Found 1 items
     drwxr-xr-x   - root supergroup          0 2019-01-01 19:44 /user/root/20190101
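
    Inside that date directory the sink writes FlumeData.<timestamp> files (FlumeData is the HDFS sink's default file prefix). The same check can also be done programmatically from the spark-shell opened in the next step; a minimal sketch, with the namenode address and date directory taken from the output above:

     import java.net.URI
     import org.apache.hadoop.fs.{FileSystem, Path}

     // `sc` is the SparkContext that spark-shell provides
     val fs = FileSystem.get(new URI("hdfs://master:9000"), sc.hadoopConfiguration)

     // list the FlumeData.* files the HDFS sink produced for that day
     fs.listStatus(new Path("/user/root/20190101"))
       .foreach(status => println(s"${status.getPath.getName}  ${status.getLen} bytes"))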
    
  4. Start spark-shell

     scala> import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
     import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
     
     scala> val sparksession = SparkSession.builder.getOrCreate()
     sparksession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@715d8b95
     // to read HDFS data in batch, configure basePath as the root directory used to access HDFS
     scala> val sp = sparksession.read.option("basePath","hdfs://192.168.152.128:9000/").textFile("20190101/FlumeData.*")
     sp: org.apache.spark.sql.Dataset[String] = [value: string]
    

     // show 50 rows and display each row in full (no truncation)
     scala> sp.show(50, false)
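
    From here the Dataset can be processed like any other text source. A short follow-up sketch (same session as above; the glob over the dated directories and the comma delimiter are assumptions for illustration):

     // read the FlumeData files of every dated directory in one go via a glob
     val all = sparksession.read.textFile("hdfs://192.168.152.128:9000/user/root/*/FlumeData.*")
     println(all.count())   // total number of collected lines

     // if the lines are delimited (a comma is assumed here), split each line into fields
     import sparksession.implicits._
     val fields = all.map(_.split(",", -1))
     fields.show(5, false)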
