Spark file streams

For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). Note that:

  1. The files must have the same data format.
  2. The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
  3. Once moved, the files must not be changed. So if a file is being continuously appended to, the new data will not be read.

For simple text files, there is an easier method, streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and hence do not require allocating cores.
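As a minimal sketch of the generic fileStream form, assuming the files are plain text read through Hadoop's TextInputFormat (the key is the byte offset within the file, the value is a line of text; the directory path here is only a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GenericFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GenericFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(20))

    // Key = byte offset within the file, Value = one line of text
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///path/to/dataDirectory")

    // Drop the offsets and keep the lines as plain strings
    val lines = stream.map { case (_, text) => text.toString }
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The following complete example uses the simpler textFileStream to count words in files placed into a local directory: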
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/** Created by MingDong on 2016/12/6. */
object FileWordCount {
  System.setProperty("hadoop.home.dir", "D:\\hadoop")
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("FileWordCount").setMaster("local[2]")

    // Create the StreamingContext from the Spark configuration, with a 20-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(20))

    // Specify the directory to monitor
    val lines = ssc.textFileStream("file:///D://data//")

    // Count the words in data newly added to the monitored directory and print the result
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the streaming computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}
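To try the example, start the application and then move or rename text files into D:\data while it is running; each 20-second batch picks up the files that appeared since the previous batch and prints their word counts. In line with the requirements above, files must appear in the directory atomically, and data appended to a file after it has been picked up will not be read.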
