Getting Started with Spark Java Program Examples

Spark installation/deploy modes:
local (local mode): commonly used for local development and testing; local mode is further divided into local single-threaded and local-cluster multi-threaded
standalone (cluster mode): a typical Master/Slave setup, which means the Master is a single point of failure; Spark supports ZooKeeper to implement HA
on YARN (cluster mode): runs on top of the YARN resource manager; YARN handles resource management while Spark handles task scheduling and computation
on Mesos (cluster mode): runs on top of the Mesos resource manager; Mesos handles resource management while Spark handles task scheduling and computation

on cloud (cluster mode): for example AWS EC2; this mode makes it easy to access Amazon S3. Spark supports multiple distributed storage systems, such as HDFS and S3.
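All of these modes run the same application code; what mostly changes on the application side is the master URL handed to SparkConf (or to spark-submit --master). Below is a minimal sketch, assuming the hypothetical class name MasterUrlDemo and the h40 host used later in this post:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterUrlDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("MasterUrlDemo")
        // "local"            -> local mode, single thread
        // "local[2]"         -> local mode, 2 threads
        // "spark://h40:7077" -> the standalone cluster built in this post
        // For YARN/Mesos the master is normally passed via spark-submit --master instead
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // print the master URL this application is actually using
    System.out.println("Running against master: " + conf.get("spark.master"));
    sc.stop();
  }
}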


Installation:
1. Spark standalone mode needs Hadoop HDFS as its persistence layer.
JDK 1.6 or above is required.
For setting up the Hadoop cluster, see: http://blog.csdn.net/m0_37739193/article/details/71222673


2. Install Scala (on all three nodes)
[hadoop@h40 ~]$ tar -zxvf scala-2.10.6.tgz
[hadoop@h41 ~]$ tar -zxvf scala-2.10.6.tgz
[hadoop@h42 ~]$ tar -zxvf scala-2.10.6.tgz


3. Install Spark
[hadoop@h40 ~]$ tar -zxvf spark-1.3.1-bin-hadoop2.6.tgz 


[hadoop@h40 ~]$ vi .bash_profile
export SPARK_HOME=/home/hadoop/spark-1.3.1-bin-hadoop2.6
export SCALA_HOME=/home/hadoop/scala-2.10.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
[hadoop@h40 ~]$ source .bash_profile

4. Configure Spark's configuration files
[hadoop@h40 ~]$ cd spark-1.3.1-bin-hadoop2.6/conf
[hadoop@h40 conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@h40 conf]$ vi spark-env.sh
Add:

export JAVA_HOME=/usr/jdk1.7.0_25/
export SPARK_MASTER_IP=h40
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
# In spark-1.6.0-bin-hadoop2.6 and spark-1.5.0-cdh5.5.2 this setting is export SPARK_EXECUTOR_INSTANCES=1
export SPARK_WORKER_MEMORY=1g

5. Configure slaves
[hadoop@h40 conf]$ vi slaves 
h41
h42


6. Sync to the other nodes
[hadoop@h40 ~]$ scp -r spark-1.3.1-bin-hadoop2.6 h41:/home/hadoop
[hadoop@h40 ~]$ scp -r spark-1.3.1-bin-hadoop2.6 h42:/home/hadoop


7. Start Spark
[hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ sbin/start-all.sh


8. Verify
[hadoop@h40 ~]$ jps
The master node should have a Master process:
8861 Master


[hadoop@h41 ~]$ jps
[hadoop@h42 ~]$ jps
The worker nodes should have a Worker process:
8993 Worker


Example 1: (the WordCount example shipped with spark-1.3.1-bin-hadoop2.6)

[hadoop@h40 examples]$ pwd
/home/hadoop/spark-1.3.1-bin-hadoop2.6/examples/src/main/java/org/apache/spark/examples
[hadoop@h40 examples]$ ls
JavaHdfsLR.java  JavaLogQuery.java  JavaPageRank.java  JavaSparkPi.java  JavaStatusTrackerDemo.java  JavaTC.java  JavaWordCount.java  ml  mllib  sql  streaming
[hadoop@h40 examples]$ cat JavaWordCount.java 

package org.apache.spark.examples;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}

Create the corresponding file in HDFS:
[hadoop@h40 ~]$ vi hehe.txt
hello world
hello hadoop
hello hive
[hadoop@h40 ~]$ hadoop fs -mkdir /spark
[hadoop@h40 ~]$ hadoop fs -put hehe.txt /spark


Then go to the Spark home directory and run:
[hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ bin/spark-submit --master spark://h40:7077 --name JavaWordCountByHQ --class org.apache.spark.examples.JavaWordCount --executor-memory 500m --total-executor-cores 2 lib/spark-examples-1.3.1-hadoop2.6.0.jar hdfs://h40:9000/spark/hehe.txt

... (most of the output omitted)
hive: 1
hadoop: 1
hello: 3
world: 1
...
(This example feels much like running a Hadoop MapReduce job; it is also offline/batch processing.)
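To push the MapReduce analogy a bit further, a batch job would normally write its result back to HDFS rather than collect it to the driver. Here is a minimal sketch of that variation (my own, not shipped with Spark; the class name JavaWordCountToHdfs and the output path hdfs://h40:9000/spark/out are only examples, and the output path must not exist beforehand):

import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public final class JavaWordCountToHdfs {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    // The master is expected to be supplied by spark-submit --master, as in the command above
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountToHdfs");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    JavaRDD<String> lines = ctx.textFile("hdfs://h40:9000/spark/hehe.txt");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(new FlatMapFunction<String, String>() {
          @Override
          public Iterable<String> call(String s) {
            return Arrays.asList(SPACE.split(s));
          }
        })
        .mapToPair(new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        });

    // saveAsTextFile writes one part-XXXXX file per partition, much like MapReduce output
    counts.saveAsTextFile("hdfs://h40:9000/spark/out");
    ctx.stop();
  }
}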


Example 2 (Spark Streaming):
Spark also ships an example for this, but I could not get it to work. Following the official steps, I first started nc -lk 9999, then ran bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount h40 9999 from the Spark home directory, but it never produced the expected result and just kept printing the following:

17/06/20 21:23:58 INFO dstream.SocketReceiver: Connected to h40:9999
17/06/20 21:23:59 INFO scheduler.JobScheduler: Added jobs for time 1497965039000 ms
17/06/20 21:24:00 INFO scheduler.JobScheduler: Added jobs for time 1497965040000 ms
17/06/20 21:24:01 INFO scheduler.JobScheduler: Added jobs for time 1497965041000 ms
17/06/20 21:24:02 INFO scheduler.JobScheduler: Added jobs for time 1497965042000 ms
17/06/20 21:24:03 INFO scheduler.JobScheduler: Added jobs for time 1497965043000 ms
17/06/20 21:24:04 INFO scheduler.JobScheduler: Added jobs for time 1497965044000 ms
17/06/20 21:24:05 INFO scheduler.JobScheduler: Added jobs for time 1497965045000 ms
17/06/20 21:24:06 INFO scheduler.JobScheduler: Added jobs for time 1497965046000 ms
17/06/20 21:24:07 INFO scheduler.JobScheduler: Added jobs for time 1497965047000 ms

[hadoop@h40 streaming]$ ls
JavaCustomReceiver.java  JavaFlumeEventCount.java  JavaNetworkWordCount.java  JavaQueueStream.java  JavaRecoverableNetworkWordCount.java  JavaStatefulNetworkWordCount.java
[hadoop@h40 streaming]$ pwd
/home/hadoop/spark-1.3.1-bin-hadoop2.6/examples/src/main/java/org/apache/spark/examples/streaming
[hadoop@h40 streaming]$ cat JavaNetworkWordCount.java

package org.apache.spark.examples.streaming;

import scala.Tuple2;
import com.google.common.collect.Lists;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import java.util.regex.Pattern;

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: JavaNetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost 9999`
 */
public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    StreamingExamples.setStreamingLogLevels();

    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Create a JavaReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
            args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
      }
    });
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}

I then modified the code a bit, and it worked:
package org.apache.spark.examples.streaming;  
  
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.examples.streaming.StreamingExamples;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

import com.google.common.collect.Lists;
  
public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {

    StreamingExamples.setStreamingLogLevels();

    // Run locally with two threads: one for the socket receiver, one for processing
    SparkConf sparkConf = new SparkConf().setAppName("wordcount").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Connect directly to the netcat server on h40:9999
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("h40", 9999);

    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
      }
    });
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}


Open a terminal and run nc -lk 9999, then import the required jar packages in MyEclipse and run the program directly:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/huiqiang/Desktop/%e6%96%b0%e5%bb%ba%e6%96%87%e4%bb%b6%e5%a4%b9/spark-1.3.1-bin-hadoop2.6/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/huiqiang/Desktop/%e6%96%b0%e5%bb%ba%e6%96%87%e4%bb%b6%e5%a4%b9/spark-1.3.1-bin-hadoop2.6/spark-examples-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/07/10 09:18:56 INFO StreamingExamples: Setting log level to [WARN] for streaming example. To override add a custom log4j.properties to the classpath.
-------------------------------------------
Time: 1499649550000 ms
-------------------------------------------

-------------------------------------------
Time: 1499649560000 ms
-------------------------------------------

-------------------------------------------
Time: 1499649570000 ms
-------------------------------------------
...
If you then type data into the nc session on port 9999, the word-count results are printed in the MyEclipse console.


Notes:

1. If you want to run this program locally on Linux, delete the line StreamingExamples.setStreamingLogLevels(); from the code, package it with MyEclipse into streaming.jar, upload it to the Linux machine, and run: [hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ bin/spark-submit --class org.apache.spark.examples.streaming.JavaNetworkWordCount streaming.jar

2. import com.google.common.collect.Lists; needs google-collections-1.0.jar, which the extracted spark-1.3.1-bin-hadoop2.6.tgz does not include. I found one online; if you need it, you can download it here: http://download.csdn.net/detail/m0_37739193/9893632 (a Guava-free variant is also sketched after these notes)

There is one strange phenomenon here that I cannot explain: when running the above program locally on Linux, even though google-collections-1.0.jar is not in Spark's lib directory, it runs fine under spark-1.3.1-bin-hadoop2.6, but under spark-1.6.3-bin-hadoop2.6 it throws Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Lists and Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Lists. Even after I uploaded google-collections-1.0.jar to the lib directory of spark-1.6.3-bin-hadoop2.6 and added that directory to export CLASSPATH in .bash_profile, the same error still appeared. I assumed that putting google-collections-1.0.jar in Spark's lib directory and adding that directory to the CLASSPATH in .bash_profile would make the error go away, but spark-1.3.1-bin-hadoop2.6 and spark-1.6.3-bin-hadoop2.6 behave completely differently from what I expected, and for now I do not really understand why.
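One way (my own suggestion, not something from the Spark examples) to sidestep the google-collections/Guava jar problem entirely is to drop the Lists.newArrayList call and use java.util.Arrays.asList, which also returns an Iterable<String> that flatMap accepts in Spark 1.x. A sketch of the modified streaming example with that single change (hypothetical class name JavaNetworkWordCountNoGuava):

import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

// Same logic as the modified JavaNetworkWordCount above, but without any Guava dependency
public final class JavaNetworkWordCountNoGuava {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("wordcount").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("h40", 9999);

    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        // Arrays.asList returns a plain java.util.List, which is an Iterable<String>
        return Arrays.asList(SPACE.split(x));
      }
    });
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}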


If you are interested, you can also read my other article: Java code for Spark Streaming with different data sources (socket, HDFS directory) and storage destinations (HDFS, local).


References:

http://blog.csdn.net/bluejoe2000/article/details/41556979
http://blog.csdn.net/huyangshu87/article/details/52288662
