Spark Streaming Programming Guide (2): A Quick Example

Before we go into the details of how to write your own Spark Streaming program, let's take a quick look at a simple Spark Streaming program. We want to count the number of words in text data received from a TCP data server. To do this, we proceed as follows.


First, import the Spark Streaming classes, plus some implicit conversions from StreamingContext that add useful methods to other classes we need (such as DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 1 second.

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary in Spark 1.3+

// Create a local StreamingContext with two worker threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))


Using this context, we can create a DStream that represents streaming data from a TCP source, specified by a hostname (e.g. localhost) and a port (e.g. 9999):

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

The lines DStream represents the stream of data that will be received from the data server; each record in this DStream is one line of text.


Next, we split each line into words by spaces:

// Split each line into words
val words = lines.flatMap(_.split(" "))

flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. Here, each line is split into multiple words, and the words DStream represents the resulting stream of words.
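
To make flatMap's one-to-many behavior concrete, here is a minimal sketch on plain Scala collections (the sample data is hypothetical, for illustration only):

// flatMap maps each element to zero or more elements and flattens the results
val linesBatch = Seq("hello world", "the")
val wordsBatch = linesBatch.flatMap(_.split(" "))
// wordsBatch == Seq("hello", "world", "the")
// map on the same input would instead give the nested Seq(Array("hello", "world"), Array("the"))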


Next, we count these words:

import org.apache.spark.streaming.StreamingContext._ // not necessary in Spark 1.3+
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
The words DStream is further mapped to a DStream named pairs, consisting of (word, 1) key-value pairs, which is then reduced with reduceByKey to get the frequency of each word in each batch of data. Finally, wordCounts.print() prints a few of the counts generated every second.
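
As a rough sketch of what reduceByKey computes within a single batch, here is a plain-Scala equivalent that groups pairs by key and sums the values (hypothetical data, for illustration only):

val pairsBatch = Seq(("hello", 1), ("world", 1), ("hello", 1))
val counts = pairsBatch.groupBy(_._1).map { case (w, ones) => (w, ones.map(_._2).sum) }
// counts == Map("hello" -> 2, "world" -> 1)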

Note: when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has begun yet. To start the processing after all the transformations have been set up, we finally call:

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
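
The processing can also be stopped manually. Note that by default ssc.stop() stops the underlying SparkContext as well; to stop only the StreamingContext, pass stopSparkContext = false:

ssc.stop(stopSparkContext = false)  // stop the streaming computation but keep the SparkContext alive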


The complete code for this example can be downloaded here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala



Now let's implement the example above as a Maven project.

First, create a project in the same way as the Maven-based Spark project I described earlier: http://blog.csdn.net/happyanger6/article/details/46493763


Then, modify pom.xml to add the spark-streaming dependency:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>


The final pom.xml looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.spark.stream</groupId>
  <artifactId>spark-stream</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2010</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <properties>
    <maven.compiler.source>1.5</maven.compiler.source>
    <maven.compiler.target>1.5</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.6</scala.version>
  </properties>

<!--
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
-->
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-tools.testing</groupId>
      <artifactId>specs_2.9.3</artifactId>
      <version>1.6.9</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest</artifactId>
      <version>1.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <!--<arg>-make:transitive</arg>-->
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.10</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issue like NoDefClassError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
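
One caveat about this pom: it mixes the Scala 2.11 library (scala.version 2.11.6, scala-library 2.11.0) with Spark artifacts built for Scala 2.10 (the _2.10 suffix). This happens to work for this simple example, but in general the Scala library version should match the Spark artifact suffix to avoid binary incompatibilities.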


The final directory structure looks like this:

.
./src
./src/main
./src/main/scala
./src/main/scala/com
./src/main/scala/com/spark
./src/main/scala/com/spark/stream
./src/main/scala/com/spark/stream/App.scala.bak
./src/main/scala/com/spark/stream/App.scala
./.gitignore
./pom.xml.bak
./pom.xml
./target
./target/classes
./target/classes/com
./target/classes/com/spark
./target/classes/com/spark/stream
./target/classes/com/spark/stream/App.class
./target/classes.timestamp
./target/test-classes
./target/surefire
./target/maven-archiver
./target/maven-archiver/pom.properties
./target/spark-stream-1.0-SNAPSHOT.jar



The code of App.scala is as follows:

package com.spark.stream

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

/**
 * @author ${user.name}
 */
object App {

  def main(args : Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf,Seconds(1))

    val lines = ssc.socketTextStream("localhost",9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
    }

}


Build and package with mvn clean package:

[root@localhost spark-stream]# mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building spark-stream 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ spark-stream ---
[INFO] Deleting /usr/local/maven-study/spark-stream/target
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ spark-stream ---
[debug] execute contextualize
........
...................
Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ spark-stream ---
[INFO] Building jar: /usr/local/maven-study/spark-stream/target/spark-stream-1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22.149s
[INFO] Finished at: Sun Jun 28 11:22:43 CST 2015
[INFO] Final Memory: 14M/34M
[INFO] ------------------------------------------------------------------------
[root@localhost spark-stream]#



Next, open another terminal and use nc to start a TCP data server where we can type input data:

[root@localhost ~]# nc -lk 9999
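
Type some lines of text into this terminal; each line is sent to the listening socket and becomes one record in the lines DStream. For example, the input below produces the batch output shown at the end of this article:

hello world the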


Then, in another terminal, submit the jar we just built with spark-submit:

[root@localhost spark-stream]# spark-submit --class "com.spark.stream.App" target/spark-stream-1.0-SNAPSHOT.jar

15/06/28 11:28:50 INFO DAGScheduler: Job 2 finished: print at App.scala:21, took 0.028284 s
-------------------------------------------
Time: 1435462130000 ms
-------------------------------------------
(hello,1)
(world,1)
(the,1)


