Before going into the details of how to write your own Spark Streaming program, let's take a quick look at a simple Spark Streaming example. We want to count the words in text data received from a TCP data server. We proceed as follows:
First, import the Spark Streaming classes, along with some implicit conversions from StreamingContext that add useful methods to other classes we need (such as DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 1 second.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary in Spark 1.3+
// Create a local StreamingContext with two working thread and batch interval of 1 second.
// The master requires 2 cores to prevent from a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
Using this context, we can create a DStream that represents streaming data from a TCP source, specified by a hostname (e.g. localhost) and a port (e.g. 9999):
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
The DStream lines represents the stream of data that will be received from the data server. Each record in this DStream is a line of text.
Next, we split each line into words by spaces:
// Split each line into words
val words = lines.flatMap(_.split(" "))
flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. Here, each line is split into multiple words, and the resulting stream of words is represented by the words DStream.
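As an illustration of the one-to-many semantics, the same distinction between map and flatMap can be seen on plain Scala collections (this is only an analogy; DStream operations run per batch on the cluster):

```scala
val line = "hello world hello"

// map: one record in, one record out (here, one Array per line)
val mapped = List(line).map(_.split(" "))
// → List(Array("hello", "world", "hello"))

// flatMap: one record in, many records out (the arrays are flattened)
val flatted = List(line).flatMap(_.split(" "))
// → List("hello", "world", "hello")
```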
Next, we count these words:
import org.apache.spark.streaming.StreamingContext._ // not necessary in Spark 1.3+
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
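Within each batch, reduceByKey(_ + _) merges all (word, 1) pairs that share the same key. As a rough sketch of the same computation on plain Scala collections (again only an analogy for what happens inside one batch):

```scala
// One batch's worth of (word, 1) pairs:
val pairs = List(("hello", 1), ("world", 1), ("hello", 1))

// Group by the key, then sum the counts within each group:
val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
// → Map(hello -> 2, world -> 1) (map ordering may vary)
```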
Note that when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has begun yet. To start the processing after all the transformations have been set up, we finally call:
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
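awaitTermination() blocks the driver until the streaming context is stopped. If you would rather run for a bounded time and then shut down cleanly, a sketch along these lines is possible (awaitTerminationOrTimeout and the two-argument stop exist in the StreamingContext API from Spark 1.3 on; the 60-second timeout here is an arbitrary choice for illustration):

```scala
ssc.start()
// Wait up to 60 seconds; returns true if the context terminated on its own
if (!ssc.awaitTerminationOrTimeout(60 * 1000)) {
  // Stop the streaming context (and the underlying SparkContext),
  // letting in-flight batches finish first
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
```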
The complete code for this example can be found here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
We now implement the example above as a Maven project.
First, create a project in the same way as described in my earlier post on creating a Spark project with Maven: http://blog.csdn.net/happyanger6/article/details/46493763
Then, modify pom.xml to add the spark-streaming dependency (the _2.10 suffix in the artifact ID must match the Scala version that Spark was built against):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.0</version>
</dependency>
The final pom.xml looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.spark.stream</groupId>
<artifactId>spark-stream</artifactId>
<version>1.0-SNAPSHOT</version>
<name>${project.artifactId}</name>
<description>My wonderfull scala app</description>
<inceptionYear>2010</inceptionYear>
<licenses>
<license>
<name>My License</name>
<url>http://....</url>
<distribution>repo</distribution>
</license>
</licenses>
<properties>
<maven.compiler.source>1.5</maven.compiler.source>
<maven.compiler.target>1.5</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.6</scala.version>
</properties>
<!--
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
-->
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scala-tools.testing</groupId>
<artifactId>specs_2.9.3</artifactId>
<version>1.6.9</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest</artifactId>
<version>1.2</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--<arg>-make:transitive</arg>-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.10</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<!-- If you have classpath issue like NoDefClassError,... -->
<!-- useManifestOnlyJar>false</useManifestOnlyJar -->
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
</plugins>
</build>
</project>
The final directory structure is as follows:
.
./src
./src/main
./src/main/scala
./src/main/scala/com
./src/main/scala/com/spark
./src/main/scala/com/spark/stream
./src/main/scala/com/spark/stream/App.scala.bak
./src/main/scala/com/spark/stream/App.scala
./.gitignore
./pom.xml.bak
./pom.xml
./target
./target/classes
./target/classes/com
./target/classes/com/spark
./target/classes/com/spark/stream
./target/classes/com/spark/stream/App.class
./target/classes.timestamp
./target/test-classes
./target/surefire
./target/maven-archiver
./target/maven-archiver/pom.properties
./target/spark-stream-1.0-SNAPSHOT.jar
The code of App.scala is as follows:
package com.spark.stream
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
/**
* @author ${user.name}
*/
object App {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Split on spaces; note that split("") would split each line into single characters
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Compile and package the project with mvn clean package:
[root@localhost spark-stream]# mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building spark-stream 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ spark-stream ---
[INFO] Deleting /usr/local/maven-study/spark-stream/target
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ spark-stream ---
[debug] execute contextualize
........
...................
Results :
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ spark-stream ---
[INFO] Building jar: /usr/local/maven-study/spark-stream/target/spark-stream-1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22.149s
[INFO] Finished at: Sun Jun 28 11:22:43 CST 2015
[INFO] Final Memory: 14M/34M
[INFO] ------------------------------------------------------------------------
[root@localhost spark-stream]#
Then open another terminal, use nc to start a TCP data server, and type in some data:
[root@localhost ~]# nc -lk 9999
In yet another terminal, submit the jar we just built with spark-submit:
[root@localhost spark-stream]# spark-submit --class "com.spark.stream.App" target/spark-stream-1.0-SNAPSHOT.jar
15/06/28 11:28:50 INFO DAGScheduler: Job 2 finished: print at App.scala:21, took 0.028284 s
-------------------------------------------
Time: 1435462130000 ms
-------------------------------------------
(hello,1)
(world,1)
(the,1)