On cloud (cluster mode): for example on AWS EC2, where this mode makes it easy to access Amazon S3. Spark supports multiple distributed storage systems, such as HDFS and S3.
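As a minimal sketch of what that means in code (the class name, package and the S3 bucket/path below are made up for illustration; the HDFS path is the one used later in this post), switching between HDFS and S3 is just a matter of the URI scheme passed to textFile:

package com.example; // hypothetical package, not part of the Spark distribution

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class StorageSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("StorageSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read from HDFS (namenode address as configured in this tutorial's cluster)
    JavaRDD<String> fromHdfs = sc.textFile("hdfs://h40:9000/spark/hehe.txt");

    // Read from Amazon S3 (hypothetical bucket; AWS credentials and the Hadoop S3
    // connector must be configured separately)
    JavaRDD<String> fromS3 = sc.textFile("s3n://my-bucket/input/data.txt");

    System.out.println("HDFS lines: " + fromHdfs.count() + ", S3 lines: " + fromS3.count());
    sc.stop();
  }
}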
Installation:
1. Spark standalone mode needs Hadoop's HDFS as its persistence layer
JDK 1.6 or later
To set up the Hadoop cluster, see: http://blog.csdn.net/m0_37739193/article/details/71222673
2. Install Scala (on all three nodes)
[hadoop@h40 ~]$ tar -zxvf scala-2.10.6.tgz
[hadoop@h41 ~]$ tar -zxvf scala-2.10.6.tgz
[hadoop@h42 ~]$ tar -zxvf scala-2.10.6.tgz
3. Install Spark
[hadoop@h40 ~]$ tar -zxvf spark-1.3.1-bin-hadoop2.6.tgz
[hadoop@h40 ~]$ vi .bash_profile
export SPARK_HOME=/home/hadoop/spark-1.3.1-bin-hadoop2.6
export SCALA_HOME=/home/hadoop/scala-2.10.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
[hadoop@h40 ~]$ source .bash_profile
4. Edit Spark's configuration file
[hadoop@h40 ~]$ cd spark-1.3.1-bin-hadoop2.6/conf
[hadoop@h40 conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@h40 conf]$ vi spark-env.sh
Add the following:
export JAVA_HOME=/usr/jdk1.7.0_25/
export SPARK_MASTER_IP=h40
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
# In spark-1.6.0-bin-hadoop2.6 and spark-1.5.0-cdh5.5.2 this setting is export SPARK_EXECUTOR_INSTANCES=1 instead
export SPARK_WORKER_MEMORY=1g
6. Sync Spark to the other nodes
[hadoop@h40 ~]$ scp -r spark-1.3.1-bin-hadoop2.6 h41:/home/hadoop
[hadoop@h40 ~]$ scp -r spark-1.3.1-bin-hadoop2.6 h42:/home/hadoop
7. Start Spark
[hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ sbin/start-all.sh
8. Verify
[hadoop@h40 ~]$ jps
The master node has a Master process:
8861 Master
[hadoop@h41 ~]$ jps
[hadoop@h42 ~]$ jps
The worker nodes have a Worker process:
8993 Worker
Example 1 (the WordCount example that ships with spark-1.3.1-bin-hadoop2.6):
[hadoop@h40 examples]$ pwd
/home/hadoop/spark-1.3.1-bin-hadoop2.6/examples/src/main/java/org/apache/spark/examples
[hadoop@h40 examples]$ ls
JavaHdfsLR.java JavaLogQuery.java JavaPageRank.java JavaSparkPi.java JavaStatusTrackerDemo.java JavaTC.java JavaWordCount.java ml mllib sql streaming
[hadoop@h40 examples]$ cat JavaWordCount.java
package org.apache.spark.examples;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?,?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}
Then go to the Spark home directory and run:
[hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ bin/spark-submit --master spark://h40:7077 --name JavaWordCountByHQ --class org.apache.spark.examples.JavaWordCount --executor-memory 500m --total-executor-cores 2 lib/spark-examples-1.3.1-hadoop2.6.0.jar hdfs://h40:9000/spark/hehe.txt
... (most of the output is omitted)
hive: 1
hadoop: 1
hello: 3
world: 1
...
(This example feels a lot like running a Hadoop MapReduce job; it is also offline/batch processing.)
Example 2 (Spark Streaming):
Spark ships with an example for this too, but I could not get it to work: following the official steps, I first started nc -lk 9999, then ran bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount h40 9999 from the Spark home directory, but it never produced the expected result and just kept printing the following:
17/06/20 21:23:58 INFO dstream.SocketReceiver: Connected to h40:9999
17/06/20 21:23:59 INFO scheduler.JobScheduler: Added jobs for time 1497965039000 ms
17/06/20 21:24:00 INFO scheduler.JobScheduler: Added jobs for time 1497965040000 ms
17/06/20 21:24:01 INFO scheduler.JobScheduler: Added jobs for time 1497965041000 ms
17/06/20 21:24:02 INFO scheduler.JobScheduler: Added jobs for time 1497965042000 ms
17/06/20 21:24:03 INFO scheduler.JobScheduler: Added jobs for time 1497965043000 ms
17/06/20 21:24:04 INFO scheduler.JobScheduler: Added jobs for time 1497965044000 ms
17/06/20 21:24:05 INFO scheduler.JobScheduler: Added jobs for time 1497965045000 ms
17/06/20 21:24:06 INFO scheduler.JobScheduler: Added jobs for time 1497965046000 ms
17/06/20 21:24:07 INFO scheduler.JobScheduler: Added jobs for time 1497965047000 ms
[hadoop@h40 streaming]$ ls
JavaCustomReceiver.java JavaFlumeEventCount.java JavaNetworkWordCount.java JavaQueueStream.java JavaRecoverableNetworkWordCount.java JavaStatefulNetworkWordCount.java
[hadoop@h40 streaming]$ pwd
/home/hadoop/spark-1.3.1-bin-hadoop2.6/examples/src/main/java/org/apache/spark/examples/streaming
[hadoop@h40 streaming]$ cat JavaNetworkWordCount.java
package org.apache.spark.examples.streaming;
import scala.Tuple2;
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.util.regex.Pattern;
/**
* Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
*
 * Usage: JavaNetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
*
* To run this on your local machine, you need to first run a Netcat server
* `$ nc -lk 9999`
* and then run the example
* `$ bin/run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost 9999`
*/
public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    StreamingExamples.setStreamingLogLevels();

    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Create a JavaReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
            args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
      }
    });
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}
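Since running the bundled example that way did not give me the result I wanted, I modified it into a local version (setMaster("local[2]"), with the host and port hard-coded to h40:9999) that can be run directly from myeclipse: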
package org.apache.spark.examples.streaming;

import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.examples.streaming.StreamingExamples;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

import com.google.common.collect.Lists;

public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    StreamingExamples.setStreamingLogLevels();

    // Run locally with two threads: one for the socket receiver, one for processing
    SparkConf sparkConf = new SparkConf().setAppName("wordcount").setMaster("local[2]");
    // 1 second batch size
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Read '\n' delimited text from the nc server listening on h40:9999
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("h40", 9999);

    // Split each line into words
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
      }
    });

    // Map each word to (word, 1) and sum the counts per word within each batch
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();
    ssc.start();
    ssc.awaitTermination();
  }
}
Open a terminal and run nc -lk 9999, then import the required jars into myeclipse and run the program directly:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/huiqiang/Desktop/%e6%96%b0%e5%bb%ba%e6%96%87%e4%bb%b6%e5%a4%b9/spark-1.3.1-bin-hadoop2.6/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/huiqiang/Desktop/%e6%96%b0%e5%bb%ba%e6%96%87%e4%bb%b6%e5%a4%b9/spark-1.3.1-bin-hadoop2.6/spark-examples-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/07/10 09:18:56 INFO StreamingExamples: Setting log level to [WARN] for streaming example. To override add a custom log4j.properties to the classpath.
-------------------------------------------
Time: 1499649550000 ms
-------------------------------------------
-------------------------------------------
Time: 1499649560000 ms
-------------------------------------------
-------------------------------------------
Time: 1499649570000 ms
-------------------------------------------
...
Then, whenever you type data into the nc session on port 9999, the word counts are printed in the myeclipse console.
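For example (the input here is hypothetical), typing hello world hello into the nc session should make the next batch printed by wordCounts.print() look roughly like:
-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,2)
(world,1)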
Notes:
1. If you want to run this program locally on Linux, delete the line StreamingExamples.setStreamingLogLevels(); from the code, package it into streaming.jar with myeclipse, upload it to the Linux machine, and run: [hadoop@h40 spark-1.3.1-bin-hadoop2.6]$ bin/spark-submit --class org.apache.spark.examples.streaming.JavaNetworkWordCount streaming.jar
2. import com.google.common.collect.Lists; needs google-collections-1.0.jar, but the extracted spark-1.3.1-bin-hadoop2.6.tgz does not include this jar. I found one online; if you need it, you can download it here: http://download.csdn.net/detail/m0_37739193/9893632
Here is a strange phenomenon whose cause I do not know: when running the above program locally on Linux, even though Spark's lib directory contains no google-collections-1.0.jar, it runs fine under spark-1.3.1-bin-hadoop2.6, yet under spark-1.6.3-bin-hadoop2.6 it fails with Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Lists and Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Lists. Even after I uploaded google-collections-1.0.jar to the lib directory of spark-1.6.3-bin-hadoop2.6 and added that directory to the export CLASSPATH line in .bash_profile, the same errors still appeared. I had assumed that putting google-collections-1.0.jar into Spark's lib directory and adding that directory to CLASSPATH in .bash_profile would make the error go away, but spark-1.3.1-bin-hadoop2.6 and spark-1.6.3-bin-hadoop2.6 behave nothing like I expected, and for now I do not really understand why.
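One way to sidestep the Guava / google-collections dependency entirely (a sketch I have not tested against spark-1.6.3-bin-hadoop2.6) is to drop the com.google.common.collect.Lists import and build the word list with the JDK's own java.util.Arrays, exactly as the batch JavaWordCount example above does. The flatMap block in the streaming program would then become:

// requires: import java.util.Arrays;  (JDK class, so no extra jar is needed on the classpath)
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String x) {
    // Arrays.asList returns a List<String>, which satisfies the Iterable<String> return type
    return Arrays.asList(SPACE.split(x));
  }
});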
If you are interested, you can also read my other article: Java code for Spark Streaming with different data sources (socket, HDFS directory) and storage locations (HDFS, local).
References:
http://blog.csdn.net/bluejoe2000/article/details/41556979
http://blog.csdn.net/huyangshu87/article/details/52288662