Installation environment: Ubuntu Server, Java, Scala
1: Install the Java environment on Linux (install the JDK yourself)
2: Install Scala 2.9.3
$ tar -zxf scala-2.9.3.tgz
$ sudo mv scala-2.9.3 /usr/lib
$ sudo vim /etc/profile
# add the following lines at the end
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
# save and exit vim
#make the bash profile take effect immediately
source /etc/profile
# test
$ scala -version
3: Install Spark
Download the latest version of Spark from the official site; as of this writing the latest is 1.5.1. Download page: http://spark.apache.org/downloads.html
Be sure to pick a pre-built package: choose "Pre-built for Hadoop 2.6 and later"; the downloaded file is spark-1.5.1-bin-hadoop2.6.tgz.
Extract it:
$ tar -zxf spark-1.5.1-bin-hadoop2.6.tgz
Set the SPARK_EXAMPLES_JAR environment variable
$ vim ~/.bashrc
# add the following lines at the end
export SPARK_EXAMPLES_JAR=$HOME/spark-0.7.2/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.2.jar
# save and exit vim
#make the bash profile take effect immediately
$ source ~/.bashrc
This step is actually the critical one, and unfortunately neither the official documentation nor the blog posts out there mention it. I only stumbled on it in two threads, "Running SparkPi" and "Null pointer exception when running ./run spark.examples.SparkPi local", and after adding this step SparkPi finally ran; before that it refused to run no matter what.
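With that in place, you can sanity-check the installation by running the bundled SparkPi example through the run-example helper script. This is only a quick check, assuming the pre-built spark-1.5.1-bin-hadoop2.6 directory from the download above:
$ cd spark-1.5.1-bin-hadoop2.6
$ ./bin/run-example SparkPi 10
# a successful run prints a line like "Pi is roughly 3.14..."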
(Optional) Set the SPARK_HOME environment variable and add SPARK_HOME/bin to PATH
$ vim ~/.bashrc
# add the following lines at the end
export SPARK_HOME=$HOME/spark-0.7.2
export PATH=$PATH:$SPARK_HOME/bin
# save and exit vim
#make the bash profile take effect immediately
$ source ~/.bashrc
Afterwards the two steps above did not seem necessary, but I followed them anyway. Spark is like Hadoop in this respect: once extracted, it is ready to use.
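As a quick check that the extracted package really does work out of the box, you can start a local-mode shell straight from the unpacked directory (a minimal sketch, assuming the spark-1.5.1-bin-hadoop2.6 directory downloaded above):
$ cd spark-1.5.1-bin-hadoop2.6
$ ./bin/spark-shell --master local[2]
scala> sc.parallelize(1 to 100).sum()
# should evaluate to 5050.0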
Running Spark on a single machine
4: Spark configuration
Configure the Spark environment variables
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh and add the following:
export JAVA_HOME=/usr/local/java-1.7.0
export HADOOP_HOME=/opt/hadoop-2.3.0-cdh5.0.0
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SCALA_HOME=/usr/local/scala-2.11.4
export SPARK_HOME=/home/lxw1234/spark-1.3.1-bin-hadoop2.3
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8099
export SPARK_WORKER_CORES=3                  # number of CPU cores used by each Worker
export SPARK_WORKER_INSTANCES=1              # number of Worker instances started on each slave
export SPARK_WORKER_MEMORY=10G               # amount of memory used by each Worker
export SPARK_WORKER_WEBUI_PORT=8081          # Worker WebUI port
export SPARK_EXECUTOR_CORES=1                # number of cores used by each Executor
export SPARK_EXECUTOR_MEMORY=1G              # amount of memory used by each Executor
export SPARK_CLASSPATH=/opt/hadoop-lzo/current/hadoop-lzo.jar   # required because LZO is used
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$CLASSPATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$HADOOP_HOME/lib/native
cp slaves.template slaves
vi slaves and add the following:
localhost
5: Configure passwordless SSH login
Because the Master and the Slave are on the same machine, configure passwordless SSH from this machine to itself. If there are additional slaves, passwordless SSH from the Master to each Slave must be configured as well.
cd ~/
ssh-keygen   (press Enter at every prompt)
cd .ssh/
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys
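To confirm that it works, ssh to localhost; it should log you in without asking for a password:
$ ssh localhost
$ exit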
6: Start the Spark Master
cd $SPARK_HOME/sbin/
./start-master.sh
The startup log is written to the $SPARK_HOME/logs/ directory. A normal startup log looks like this:
15/06/05 14:54:16 INFO server.AbstractConnector: Started SelectChannelConnector@localhost:6066
15/06/05 14:54:16 INFO util.Utils: Successfully started service on port 6066.
15/06/05 14:54:16 INFO rest.StandaloneRestServer: Started REST server for submitting applications on port 6066
15/06/05 14:54:16 INFO master.Master: Starting Spark master at spark://127.0.0.1:7077
15/06/05 14:54:16 INFO master.Master: Running Spark version 1.3.1
15/06/05 14:54:16 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 14:54:16 INFO server.AbstractConnector: Started [email protected]:8099
15/06/05 14:54:16 INFO util.Utils: Successfully started service 'MasterUI' on port 8099.
15/06/05 14:54:16 INFO ui.MasterWebUI: Started MasterWebUI at http://127.1.1.1:8099
15/06/05 14:54:16 INFO master.Master: I have been elected leader! New state: ALIVE
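Besides the log, the JDK's jps tool can confirm that the Master process is running:
$ jps
# the output should include a line ending in "Master", e.g. "12345 Master" (the pid will differ)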
7: Start the Spark Slaves
cd $SPARK_HOME/sbin/
./start-slaves.sh
This will ssh to each host listed in $SPARK_HOME/conf/slaves and start a Spark Worker on it.
After a successful start, the Web UI shows that a Worker has registered.
Open http://192.168.1.84:8080/ in a browser (replace with your Master's IP address; if you changed SPARK_MASTER_WEBUI_PORT as above, use that port, 8099, instead of the default 8080).
8: A small example (find the 50 most frequent words in a file)
Run ./spark-shell directly from the bin directory:
hadoop@Master:/usr/local/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/13 19:12:16 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/10/13 19:12:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/13 19:12:19 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/13 19:12:35 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/10/13 19:12:35 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/10/13 19:12:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/13 19:12:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/13 19:12:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
I did not look into what all these WARNs were about. Once inside the spark-shell, enter the following in order:
// read the file, split each line into words, and count occurrences of each word
val srcFile = sc.textFile("/usr/local/kern.log")
val a = srcFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// swap (word, count) to (count, word), sort by count descending, swap back, and print the top 50
a.map(word => (word._2, word._1)).sortByKey(false).map(word => (word._2, word._1)).take(50).foreach(println)
The results are printed in the terminal:
The job status can be viewed on port 4040: http://192.168.1.84:4040/jobs/
9: Spark Java programming (Spark and Spark Streaming)
1: Spark batch processing: count the lines containing "a" and the lines containing "b" in a file: SimpleApp.java
package org.apache.eagle.spark_streaming_kafka;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {

    public static void main(String[] args) {
        String logFile = "/var/log/boot.log";
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // count the lines that contain "a"
        long numAs = logData.filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        // count the lines that contain "b"
        long numBs = logData.filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
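To run it, package the class into a jar and submit it to the standalone master started above with spark-submit. This is only a sketch: the jar path target/spark-simple-app.jar is a placeholder for whatever your build produces, and the master URL matches the configuration above:
$ $SPARK_HOME/bin/spark-submit \
    --class org.apache.eagle.spark_streaming_kafka.SimpleApp \
    --master spark://127.0.0.1:7077 \
    target/spark-simple-app.jar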
2: Spark Streaming: read data from Kafka and do a word count.
package org.apache.eagle.spark_streaming_kafka;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import com.google.common.collect.Lists;

import scala.Tuple2;

public class JavaKafkaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    private JavaKafkaWordCount() {
    }

    public static void main(String[] args) {
        String zkQuorum = "10.64.255.161";
        String group = "test-consumer-group";
        SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");

        // streaming context with a 2-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

        // topic name -> number of receiver threads
        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        topicMap.put("noise", 1);
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);

        // keep only the message value of each (key, value) pair
        JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Lists.newArrayList(SPACE.split(x));
            }
        });

        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                });

        wordCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
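After submitting the streaming job (as a fat jar, see the note below), you can feed it test data with Kafka's console producer. The Kafka installation path and broker address here are assumptions for illustration; use your own cluster's values and the topic configured in the code ("noise"):
$ cd /path/to/kafka
$ bin/kafka-console-producer.sh --broker-list 10.64.255.161:9092 --topic noise
hello spark streaming hello kafka
Every 2-second batch, the job prints the word counts of whatever was typed during that batch.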
A few things to note:
1: Environment: make sure Spark is correctly installed on this machine, following the steps above. The ZooKeeper cluster and Kafka cluster must be set up, and the Kafka topic must already be created.
2: I previously got a "class not found" error at runtime (KafkaUtils) because the dependency jars had not been packaged into the final jar. Add the following to pom.xml:
<build>
  <sourceDirectory>src/main/java</sourceDirectory>
  <testSourceDirectory>src/test/java</testSourceDirectory>
  <plugins>
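The original snippet is truncated here. One common way to package all dependencies into a single runnable jar is the maven-shade-plugin; the configuration below is a minimal sketch of that approach (the version number is an assumption), not necessarily what the original pom contained:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>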