Spark Big Data: File Streams as an Input Source

File Streams

  • Real-time log capture: monitor a directory and capture an event whenever a new file appears in it or an existing file changes.
  • 1. Create the directory to be monitored:
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir logfile
cd logfile
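The five commands above can be collapsed into a single `mkdir -p`. A minimal sketch, using a temporary directory in place of /usr/local/spark/mycode so it runs without root privileges:

```shell
# BASE stands in for /usr/local/spark/mycode (a temp dir here, so this runs anywhere)
BASE=$(mktemp -d)
# -p creates the whole streaming/logfile chain in one step
mkdir -p "$BASE/streaming/logfile"
ls "$BASE/streaming"
```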
  • 2. Spark Scala file-monitoring program implementing word count:
import org.apache.spark._
import org.apache.spark.streaming._
object WordCountStreaming {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("file_stream")
        val ssc = new StreamingContext(conf, Seconds(2))
        // define the data stream over the monitored directory
        val lines = ssc.textFileStream("file:///home/chenbengang/ziyu_bigdata/quick_learn_spark/logfile")
        // the streaming computation: classic word count
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        // start the streaming computation
        ssc.start()
        ssc.awaitTermination() // stops if an error occurs
    }
}
// Run /usr/sbt/sbt/bin/sbt package in the streaming directory to compile and package,
// or call WordCountStreaming.main(Array()) in the REPL to run it directly.
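The transformation chain in the program (flatMap, then map to pairs, then reduceByKey) is the classic word count. A minimal sketch on an ordinary Scala collection, with `groupBy` standing in for `reduceByKey`, shows what each step computes without needing Spark at all (the names here are illustrative, not from the original):

```scala
object WordCountLocal {
  // Collection analogue of the DStream pipeline: split lines into words,
  // pair each word with 1, then sum the counts per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .map(x => (x, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("hello spark", "hello streaming")))
}
```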

Creating a File Stream from a Standalone Application

1. Create the program's directory structure:

cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir -p src/main/scala //create the standard three-level subdirectory layout
cd src/main/scala
vim TestStreaming.scala

2. Write the following in TestStreaming.scala:

import org.apache.spark._
import org.apache.spark.streaming._
object WordCountStreaming {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("file_stream")
        val ssc = new StreamingContext(conf, Seconds(2))
        val lines = ssc.textFileStream("file:///home/chenbengang/ziyu_bigdata/quick_learn_spark/logfile")
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
    }
}

3. Set up for compiling and packaging:

cd /usr/local/spark/mycode/streaming
vim simple.sbt

4. Write simple.sbt:

name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.1"
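Note that Spark 2.2.1 is published for Scala 2.11, so `scalaVersion` must be a 2.11.x release and the artifact suffix must be `_2.11` (this also matches the scala-2.11 jar path used in step 6). Equivalently, sbt's `%%` operator appends the project's Scala binary version to the artifact name automatically, which keeps the suffix from drifting out of sync with `scalaVersion` (a sketch of the same dependency line):

```scala
// %% expands to spark-streaming_2.11 because scalaVersion is 2.11.x
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.1"
```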

5. Run the sbt package command:

cd /usr/local/spark/mycode/streaming
/usr/local/sbt/sbt package

6. Launch the program:

cd /usr/local/spark/mycode/streaming
/usr/local/spark/bin/spark-submit --class "WordCountStreaming" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar

7. Create a few .txt files containing some words in the logfile directory; the running job will pick them up and print their word counts.
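For example, the running job can be fed like this. `LOGDIR` stands in for the directory passed to `textFileStream` (a temp dir here so the sketch runs anywhere); note that `textFileStream` only notices files created or moved into the directory after `ssc.start()`:

```shell
# LOGDIR stands in for file:///home/chenbengang/ziyu_bigdata/quick_learn_spark/logfile
LOGDIR=$(mktemp -d)
# each new file appearing in the monitored directory becomes input for the next batch
echo "hello spark hello streaming" > "$LOGDIR/words1.txt"
echo "spark streaming word count"  > "$LOGDIR/words2.txt"
cat "$LOGDIR"/*.txt
```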
