A First Look at Spark Streaming: JavaNetworkWordCount

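Background

  I have had a little exposure to Spark, and the web is flooded with blog posts about real-time processing systems, so I felt it was worth getting to know. Big data is a hot topic right now, so it pays to read, listen, and practice as much as possible.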


Subject

 A sample run of the Spark Streaming example JavaNetworkWordCount.java

 Code link: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java

The source code is shown below:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.streaming;

import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

/**
* Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
*
* Usage: JavaNetworkWordCount <hostname> <port>
* <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
*
* To run this on your local machine, you need to first run a Netcat server
* `$ nc -lk 9999`
* and then run the example
* `$ bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost 9999`
*/
public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    StreamingExamples.setStreamingLogLevels();

    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Create a JavaReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterator<String> call(String x) {
        return Arrays.asList(SPACE.split(x)).iterator();
      }
    });
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}

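Some parts of the code were hard to understand at first, so I am recording them here as notes.

1. Set the batch interval, that is, how often the received data is cut into a batch and processed. Here it is 1 second, which shows up on the console as timestamps 1000 ms apart.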

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
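
The "Time: ... ms" values that the job prints are epoch milliseconds for each batch. The following is a minimal sketch, using only the JDK and no Spark API, that converts two such values (copied from the console output later in this post) to readable UTC instants and confirms they are one second apart. The class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

final class BatchTimeSketch {
  // Interpret a printed batch time (milliseconds since the Unix epoch) as an Instant.
  static Instant toInstant(long batchTimeMs) {
    return Instant.ofEpochMilli(batchTimeMs);
  }

  // Gap between two consecutive batch times.
  static Duration gap(long earlierMs, long laterMs) {
    return Duration.ofMillis(laterMs - earlierMs);
  }

  public static void main(String[] args) {
    long first = 1471951216000L;   // taken from the console output in this post
    long second = 1471951217000L;  // the next batch
    System.out.println(toInstant(first));   // prints 2016-08-23T11:20:16Z
    System.out.println(gap(first, second)); // prints PT1S, i.e. one second
  }
}
```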

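2. The reduceByKey method here only outputs the counts for the current batch. It works only on the keys and values that arrived in that batch and keeps nothing from earlier batches. The counterpart that carries results across batches is updateStateByKey. For details, see the link below. http://spark.apache.org/docs/latest/api/java/index.html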

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
});

The Javadoc describes reduceByKey as follows:

public JavaPairDStream<K,V> reduceByKey(Function2<V,V,V> func)

Return a new DStream by applying reduceByKey to each RDD. The values for each key are merged using the associative and commutative reduce function.
Hash partitioning is used to generate the RDDs with Spark's default number of partitions.

Parameters:
    func - (undocumented)
Returns:
    (undocumented)
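
To make the difference between per-batch counts and running totals concrete, here is a small plain-Java sketch. It does not call Spark at all: countBatch only mimics what reduceByKey produces for a single batch, and updateState mimics the idea behind updateStateByKey by folding each batch into a map that outlives the batch. The class name, method names, and sample words are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BatchVsStateSketch {
  // Count the words of one batch only (the per-batch view, like reduceByKey).
  static Map<String, Integer> countBatch(List<String> words) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String word : words) {
      counts.merge(word, 1, Integer::sum);
    }
    return counts;
  }

  // Fold one batch's counts into state kept across batches (the updateStateByKey idea).
  static void updateState(Map<String, Integer> state, Map<String, Integer> batchCounts) {
    batchCounts.forEach((word, count) -> state.merge(word, count, Integer::sum));
  }

  public static void main(String[] args) {
    List<List<String>> batches = Arrays.asList(
        Arrays.asList("hello", "world!"),
        Arrays.asList("hello", "hello"));
    Map<String, Integer> state = new TreeMap<>();
    for (List<String> batch : batches) {
      Map<String, Integer> perBatch = countBatch(batch);
      updateState(state, perBatch);
      System.out.println("batch=" + perBatch + " running=" + state);
    }
    // prints:
    // batch={hello=1, world!=1} running={hello=1, world!=1}
    // batch={hello=2} running={hello=3, world!=1}
  }
}
```

The second batch reports hello=2 on its own, while the running state reports hello=3. That is the behavior gap between the two approaches described above.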

3. To run it, use the run-example tool that ships with Spark, with the following command line.
./bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost 9999

The console printed a lot of INFO messages, along with many timestamps like the one below.
-------------------------------------------
Time: 1471950452000 ms
-------------------------------------------

But the expected result never appeared:
-------------------------------------------
Time: 1471950568000 ms
-------------------------------------------
(hello,1)
(world!,1)

I went through the log output carefully and then looked at $SPARK_HOME/conf/spark-env.sh, which contains the following notes.
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with 
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

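From this I concluded that HADOOP_CONF_DIR had not been loaded. I stopped the Spark service, edited spark-env.sh, sourced it, and restarted Spark. Rerunning the run-example command still produced no results. After reading the comments in the code, I realized the real cause was most likely that nothing was feeding the input source: the program was running normally, but with no input data there was nothing to process and therefore nothing to output.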


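Running [nc -lk 9999] failed because the nc command did not exist. Installing nc (netcat) took some time: following advice found online I tried yum, but it kept failing, so I found a Red Hat 6 image, pulled nc-1.84-22.el6.x86_64.rpm out of its package folder, and after installing that everything ran normally.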
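
If installing nc is a problem, a tiny JDK-only program can play the same role as the data source, since the example only needs a TCP server that sends newline-delimited UTF-8 text. The sketch below is an assumption-laden stand-in, not a netcat replacement: the class name LineServerSketch is made up, it serves a single client and then exits (unlike nc -lk, which keeps listening), and it defaults to port 9999 only because that is the port used in this post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class LineServerSketch {
  // Wait for one client, then forward every line read from 'source' to it.
  static void serveOnce(ServerSocket server, BufferedReader source) throws IOException {
    try (Socket client = server.accept();
         PrintWriter out = new PrintWriter(
             new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = source.readLine()) != null) {
        out.write(line);
        out.write('\n');  // the example expects '\n' delimited text
        out.flush();      // push each line out immediately
      }
    }
  }

  public static void main(String[] args) throws IOException {
    int port = args.length > 0 ? Integer.parseInt(args[0]) : 9999;
    try (ServerSocket server = new ServerSocket(port)) {
      BufferedReader stdin = new BufferedReader(
          new InputStreamReader(System.in, StandardCharsets.UTF_8));
      serveOnce(server, stdin);  // type or paste text; end of input closes the connection
    }
  }
}
```

Start it in one terminal, start the word count example against the same host and port in another, and whatever is typed or pasted into the first terminal is sent to the streaming job line by line.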

Procedure:

1. Open terminal 1 and run the Spark Streaming JavaNetworkWordCount program.

 

$SPARK_HOME/bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost 9999

At this point the console shows the following (empty batches, because no data has been input yet):
Spark assembly has been built with Hive, including Datanucleus jars on classpath
-------------------------------------------
Time: 1471951216000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951217000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951218000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951219000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951220000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951221000 ms
-------------------------------------------
# To keep the console clean and show only the output I cared about, I changed the console log level to ERROR.

2. Open terminal 2 and run the command

nc -lk 9999

This starts listening on port 9999 of localhost and waits for input. You can take any text file and paste its contents in. Here I used a Hadoop startup log.

[root@sv004 home]# nc -lk 9999
2016-08-22 20:43:58,994 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = sv004/172.28.156.85
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.2
STARTUP_MSG:   classpath = /home/project/hadoop-2.5.2/etc/hadoop:/home/project/hadoop-2.5.2/share/hadoop/common/lib/hamcrest-core-1.3.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-digester-1.8.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-math3-3.1.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-logging-1.1.3.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-compress-1.4.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/httpclient-4.2.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jersey-server-1.9.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/httpcore-4.2.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/slf4j-api-1.7.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jsp-api-2.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/hadoop-auth-2.5.2.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/netty-3.6.2.Final.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-collections-3.2.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jettison-1.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/junit-4.11.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/mockito-all-1.8.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/log4j-1.2.17.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/home/project/hado
op-2.5.2/share/hadoop/common/lib/activation-1.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jersey-json-1.9.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-el-1.0.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-configuration-1.6.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/xz-1.0.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jetty-util-6.1.26.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/zookeeper-3.4.6.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/xmlenc-0.52.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-lang-2.6.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/asm-3.2.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-codec-1.4.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jsch-0.1.42.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jetty-6.1.26.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-net-3.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/commons-httpclient-3.1.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/hadoop-annotations-2.5.2.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/jersey-core-1.9.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/servlet-api-2.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/avro-1.7.4.jar:/home/project/hadoop-2.5.2/share/hadoop/c
ommon/lib/slf4j-log4j12-1.7.5.jar:/home/project/hadoop-2.5.2/share/hadoop/common/lib/guava-11.0
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r cc72e9b000545b86b75a61f4835eb86d57bfafc0; compiled by 'jenkins' on 2014-11-14T23:45Z
STARTUP_MSG:   java = 1.7.0_67
************************************************************/
2016-08-22 20:43:59,011 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2016-08-22 20:43:59,022 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []
2016-08-22 20:43:59,399 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2016-08-22 20:43:59,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-08-22 20:43:59,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2016-08-22 20:43:59,495 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://sv004:9000
2016-08-22 20:43:59,495 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use sv004:9000 to access this namenode/service.
2016-08-22 20:43:59,606 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-08-22 20:43:59,856 INFO org.apache.hadoop.hdfs.DFSUtil: Starting web server as: ${dfs.web.authentication.kerberos.principal}
2016-08-22 20:43:59,856 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://0.0.0.0:50070
2016-08-22 20:43:59,932 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2016-08-22 20:43:59,936 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2016-08-22 20:43:59,947 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2016-08-22 20:43:59,950 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs
2016-08-22 20:43:59,950 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2016-08-22 20:43:59,950 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2016-08-22 20:43:59,975 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2016-08-22 20:43:59,977 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*
2016-08-22 20:43:59,992 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: 0.0.0.0:50070
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:891)
        at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:827)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:693)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:583)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1473)
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:444)
        at sun.nio.ch.Net.bind(Net.java:436)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:886)
        ... 8 more
2016-08-22 20:43:59,995 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-08-22 20:43:59,996 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-08-22 20:43:59,996 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-08-22 20:43:59,996 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.net.BindException: Port in use: 0.0.0.0:50070
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:891)
        at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:827)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:693)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:583)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1473)
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:444)
        at sun.nio.ch.Net.bind(Net.java:436)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
        at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:886)
        ... 8 more
2016-08-22 20:43:59,997 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-08-22 20:43:59,998 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sv004/172.28.156.85
************************************************************/



3. The word counts that terminal 1 prints after processing look like this:

-------------------------------------------
Time: 1471951472000 ms
-------------------------------------------
(http://0.0.0.0:50070,1)
(https://git-wip-us.apache.org/repos/asf/hadoop.git,1)
(this,1)
(is,2)
(snapshot,1)
(org.apache.hadoop.metrics2.impl.MetricsSystemImpl:,2)
(org.apache.hadoop.http.HttpRequestLog:,1)
([],2)
(Http,1)
(classes,1)
...

-------------------------------------------
Time: 1471951473000 ms
-------------------------------------------
(20:43:59,995,1)
(...,2)
(8,2)
(HttpServer.start(),1)
(org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142),2)
(org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407),2)
(org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:693),2)
(org.apache.hadoop.metrics2.impl.MetricsSystemImpl:,3)
(org.apache.hadoop.http.HttpServer2:,3)
(SHUTDOWN_MSG:,2)
...

-------------------------------------------
Time: 1471951474000 ms
-------------------------------------------

-------------------------------------------
Time: 1471951475000 ms
-------------------------------------------
(************************************************************/,1)

-------------------------------------------
Time: 1471951476000 ms
-------------------------------------------
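
The odd-looking keys in these results, such as (20:43:59,995,1) or (...,2), follow from the example splitting each line on a single space and nothing else, so punctuation stays attached to the token. Below is a minimal sketch of that tokenization with the same Pattern as the example but without Spark. The class name and the use of a TreeMap (chosen only to get a stable print order) are my own assumptions.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class SplitSketch {
  // Same delimiter as the example: exactly one space character.
  private static final Pattern SPACE = Pattern.compile(" ");

  // Split one line on single spaces and count each resulting token.
  static Map<String, Integer> countWords(String line) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String token : SPACE.split(line)) {
      counts.merge(token, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Punctuation is kept as part of the word.
    System.out.println(countWords("hello world!"));  // prints {hello=1, world!=1}
    // A stack-trace line with 8 leading spaces also yields 8 empty-string tokens.
    System.out.println(countWords("        ... 8 more"));
  }
}
```

Splitting on a pattern such as "\\s+" and dropping empty tokens would give cleaner words, but that is a change to the example rather than something the original code does.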

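PS: Earlier, when no data was coming through, I followed advice found online and changed local[*] in run-example to local[1] and then to local[2].

local[1] must never be used here. With 1, the single thread is taken by the receiver and nothing is left to process the data. 2 works, but as more input sources are added and the data volume grows, two threads will struggle. So on a single machine, if your virtual machine has 2 cores, changing the setting to 2 is fine, but with only 1 core your program will most likely never show any data. My recommendation is to keep the original setting, local[*], which uses as many worker threads as the machine has cores.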
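
The reason one thread is not enough can be reproduced with an ordinary thread pool. The sketch below is only an analogy built on java.util.concurrent, not Spark's scheduler: a long-running "receiver" task holds one thread for the whole run, and a "batch" task needs another free thread to execute. All names in it are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LocalThreadsSketch {
  // Returns true if a batch task gets to run while a receiver task occupies one thread.
  static boolean batchGetsProcessed(int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch stopReceiver = new CountDownLatch(1);
    CountDownLatch processed = new CountDownLatch(1);
    try {
      // The "receiver": blocks its thread until the application is stopped.
      pool.execute(() -> {
        try {
          stopReceiver.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      // The "batch job": can only run if some thread is still free.
      pool.execute(processed::countDown);
      return processed.await(500, TimeUnit.MILLISECONDS);
    } finally {
      stopReceiver.countDown();
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("1 thread:  " + batchGetsProcessed(1));  // prints 1 thread:  false
    System.out.println("2 threads: " + batchGetsProcessed(2));  // prints 2 threads: true
  }
}
```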


---over---

