| Software | Version |
| --- | --- |
| JDK | jdk1.8.0_191 |
| ZooKeeper | zookeeper-3.4.12 |
| Hadoop | hadoop-2.8.5 |
| Alluxio | alluxio-1.8.0-hadoop-2.8 |
| Spark | spark-2.3.2-bin-hadoop2.7 |
Note: spark-2.3.2-bin-hadoop2.7 is built against Hadoop 2.7 and does not match the cluster's Hadoop 2.8.5, so this particular build is not recommended.
Extract:
$ tar -zxf /home/dpnice/Downloads/spark-2.3.2-bin-hadoop2.7.tgz -C /opt/Software/
Create a symlink:
$ sudo ln -s /opt/Software/spark-2.3.2-bin-hadoop2.7/ /spark
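As a quick sanity check for the version mismatch noted above, the bundled version banners can be printed (assuming /jdk is the JDK symlink referenced later in spark-env.sh):
$ /jdk/bin/java -version
$ /spark/bin/spark-submit --version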
Enter the conf directory and copy the templates:
$ cd /spark/conf
$ cp spark-env.sh.template spark-env.sh
$ cp spark-defaults.conf.template spark-defaults.conf
$ cp slaves.template slaves
Run $ vi spark-defaults.conf and add the following:
# location of the Alluxio client jar (alluxio-1.8.0-client.jar)
spark.driver.extraClassPath /opt/Software/alluxio-1.8.0-hadoop-2.8/client/alluxio-1.8.0-client.jar
spark.executor.extraClassPath /opt/Software/alluxio-1.8.0-hadoop-2.8/client/alluxio-1.8.0-client.jar
# ZooKeeper settings for Alluxio HA
spark.driver.extraJavaOptions -Dalluxio.zookeeper.address=cdh1:2181,cdh2:2181,cdh3:2181 -Dalluxio.zookeeper.enabled=true
spark.executor.extraJavaOptions -Dalluxio.zookeeper.address=cdh1:2181,cdh2:2181,cdh3:2181 -Dalluxio.zookeeper.enabled=true
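Equivalently, the same four settings can be passed per job with --conf instead of editing spark-defaults.conf; a sketch using the same jar path and ZooKeeper quorum as above:
$ /spark/bin/spark-shell \
  --conf spark.driver.extraClassPath=/opt/Software/alluxio-1.8.0-hadoop-2.8/client/alluxio-1.8.0-client.jar \
  --conf spark.executor.extraClassPath=/opt/Software/alluxio-1.8.0-hadoop-2.8/client/alluxio-1.8.0-client.jar \
  --conf "spark.driver.extraJavaOptions=-Dalluxio.zookeeper.address=cdh1:2181,cdh2:2181,cdh3:2181 -Dalluxio.zookeeper.enabled=true" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.zookeeper.address=cdh1:2181,cdh2:2181,cdh3:2181 -Dalluxio.zookeeper.enabled=true"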
Run $ vi spark-env.sh and add the following:
export JAVA_HOME=/jdk
export SPARK_WORKER_MEMORY=500m
export SPARK_WORKER_CORES=1
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=cdh1:2181,cdh2:2181,cdh3:2181 -Dspark.deploy.zookeeper.dir=/spark"
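Once the cluster is running (start commands below), whether the masters registered their recovery data under spark.deploy.zookeeper.dir can be checked with the ZooKeeper CLI; assuming ZooKeeper is installed under /opt/Software like the rest of the stack:
$ /opt/Software/zookeeper-3.4.12/bin/zkCli.sh -server cdh1:2181
# then, inside the CLI:
ls /spark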
Run $ vi slaves and list the worker nodes, one hostname per line (example below):
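For example, assuming the ZooKeeper hosts cdh1-cdh3 also run the Spark workers (substitute your actual worker hostnames):
cdh1
cdh2
cdh3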
Copy the configuration to each of the other nodes:
$ scp -r /opt/Software/spark-2.3.2-bin-hadoop2.7/conf/* dpnice@cdh2:/opt/Software/spark-2.3.2-bin-hadoop2.7/conf/
$ scp -r /opt/Software/spark-2.3.2-bin-hadoop2.7/conf/* dpnice@cdh3:/opt/Software/spark-2.3.2-bin-hadoop2.7/conf/
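Or, as a loop over the remaining nodes (this assumes the Spark distribution has already been extracted at the same path on cdh2 and cdh3):
$ for host in cdh2 cdh3; do scp -r /opt/Software/spark-2.3.2-bin-hadoop2.7/conf/* dpnice@$host:/opt/Software/spark-2.3.2-bin-hadoop2.7/conf/; done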
Start everything:
$ /spark/sbin/start-all.sh
The master runs on the node where start-all.sh is executed.
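Whether the daemons came up can be verified with jps on each node; a Master process should appear on this node, and a Worker process on every host listed in slaves:
$ /jdk/bin/jps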
Stop everything:
$ /spark/sbin/stop-all.sh
On one of the other nodes, start a standby master.
Start:
$ /spark/sbin/start-master.sh
Stop:
$ /spark/sbin/stop-master.sh
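Each master's state (ALIVE vs STANDBY) is shown on its web UI on port 8080; a quick check via the UI's JSON endpoint, assuming the default ports on cdh1/cdh2:
$ curl http://cdh1:8080/json/
$ curl http://cdh2:8080/json/
# the "status" field should read ALIVE on the active master and STANDBY on the backup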
Check the integration of Spark with Alluxio (Spark 2.x is supported):
/alluxio/integration/checker/bin/alluxio-checker.sh spark spark://cdh1:7077
/alluxio/integration/checker/bin/spark-checker.sh can be modified if needed.
/spark/bin/spark-shell --master spark://192.168.137.129:7077,192.168.137.128:7077 --executor-memory 450M
Pick word.txt, a file that is mapped into Alluxio and persisted in HDFS, and save the word-count result to the local Linux filesystem:
Caching of partially read blocks is enabled by default, but if that option has been turned off, the file will most likely not end up in Alluxio storage (not In-Alluxio). This is because Alluxio only stores fully read blocks: if the file is small, each executor in the Spark job reads only part of a block. To avoid this, the number of partitions can be set explicitly in Spark. Since this example has only one block, we set the partition count to 1.
sc.textFile("alluxio://192.168.137.128:19998/word.txt", 1).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")
When Alluxio runs in fault-tolerant mode, any of the Alluxio masters can be used.
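Which master is currently the leader can be printed from the Alluxio CLI; this assumes the 1.8 fs shell still ships the leader subcommand:
$ /alluxio/bin/alluxio fs leader   # prints the hostname of the current leading master (an assumption for 1.8)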
To get data locality when starting Spark, the hostname can be specified explicitly with the script Spark provides, starting the Spark worker with the slave's hostname:
$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname>
Alternatively, data locality can be obtained by setting SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/spark-env.sh, for example:
SPARK_LOCAL_HOSTNAME=simple30
Upload a file to Alluxio without persisting it to HDFS:
$ /alluxio/bin/alluxio fs copyFromLocal word_1.txt /
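ls also reports the persistence state, so the upload can be confirmed as cache-only; the file should show as not persisted yet 100% in Alluxio:
$ /alluxio/bin/alluxio fs ls /word_1.txt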
Run in spark-shell:
// print the result:
sc.textFile("alluxio://192.168.137.128:19998/word_1.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect.foreach(println(_))
// save the result to Alluxio:
sc.textFile("alluxio://192.168.137.128:19998/word_1.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("alluxio://192.168.137.128:19998/spark/out")
Reference:
http://www.alluxio.org/docs/1.8/cn/compute/Spark.html