For more code, see: https://github.com/xubo245/SparkLearning
1. Understanding: HdfsWordCount reads files from HDFS as a stream: you specify a directory, and at each batch interval Spark scans that path for files (subdirectories are not scanned). Whenever newly added files appear, they are fed into the streaming computation. The batch interval is set when the StreamingContext is created:
val ssc = new StreamingContext(sparkConf, Seconds(2))
The rest of the processing is much the same as in the earlier examples: split each line into words, map each word to a (word, 1) pair, and reduce by key. A runnable local-mode sketch follows below.
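For trying this out without a cluster, here is a minimal local-mode sketch. It is an illustration only: the local[2] master URL and the file:///tmp/streamingInput directory are assumptions, not part of the original example. Note that textFileStream only reads files that appear in the directory after the job has started.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalFileWordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: the usual recommendation of at least two threads for streaming jobs
    val conf = new SparkConf().setMaster("local[2]").setAppName("LocalFileWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second batch interval, as above
    // Monitor a (hypothetical) local directory; files already present at start are ignored
    val lines = ssc.textFileStream("file:///tmp/streamingInput")
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print() // prints up to the first ten counts of each batch to stdout
    ssc.start()
    ssc.awaitTermination()
  }
}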
2. Running:
Input:
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 2.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 3.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ cat 3.txt
hello world hello world hello world hello world hello world hello world hello world a a a a a a a b b b
Output:
16/04/26 21:26:06 INFO scheduler.DAGScheduler: Job 19 finished: print at HdfsWordCount.scala:52, took 0.023056 s
-------------------------------------------
Time: 1461677166000 ms
-------------------------------------------
(hello,1)
(world,1)
-------------------------------------------
Time: 1461677550000 ms
-------------------------------------------
(b,3)
(hello,7)
(world,7)
(a,7)
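Note that the counts are per batch, not cumulative: the first batch above reflects only 2.txt, and the second only 3.txt, because reduceByKey runs independently within each batch interval. To keep a running total across batches, Spark Streaming's updateStateByKey can be used. The sketch below reuses ssc and words from the full listing in the next section; the checkpoint path is a placeholder, not from the original post.

// Stateful variant (sketch): accumulate counts across batches.
// updateStateByKey requires a checkpoint directory to store the state.
ssc.checkpoint("hdfs://Master:9000/xubo/spark/checkpoint") // placeholder path

val runningCounts = words.map(x => (x, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], running: Option[Int]) =>
    Some(newValues.sum + running.getOrElse(0))
}
runningCounts.print()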
3. Source code:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.Streaming.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions

/**
 * Counts words in new text files created in the given directory
 * Usage: HdfsWordCount <directory>
 *   <directory> is the directory that Spark Streaming will use to find and read new text files.
 *
 * To run this on your local machine on directory `localdir`, run this example
 *    $ bin/run-example \
 *       org.apache.spark.examples.streaming.HdfsWordCount localdir
 *
 * Then create a text file in `localdir` and the words in the file will get counted.
 */
object HdfsWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println
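As an aside, textFileStream is a thin wrapper over the more general fileStream API. If finer control is needed, for example only picking up files with a particular extension, fileStream can be called directly with a path filter. The sketch below would replace the val lines = ssc.textFileStream(args(0)) line above; the .txt filter is an illustrative assumption, not from the original post.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Same directory monitoring as textFileStream, but with an explicit filter:
// only newly created files whose names end in ".txt" are read.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  args(0),
  (path: Path) => path.getName.endsWith(".txt"),
  newFilesOnly = true
).map(_._2.toString)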