A Brief Look at HiBench SparkBench (Cluster) Configuration

I. Preface:

1. Environment and versions:

Hadoop: 2.7.1
HiBench: 6.0
Spark: 2.1.0
Scala: 2.11.12
Java: JDK 8

Cluster nodes: 1 master + 3 slaves

II. Setting up a Spark on YARN cluster

Starting from the single-node Spark setup (https://blog.csdn.net/don_chiang709/article/details/80438589) and the Hadoop cluster setup (https://blog.csdn.net/don_chiang709/article/details/80647927), only a few changes are needed, as follows:

1. Hadoop: edit yarn-site.xml on the master and register the spark_shuffle auxiliary service. MapReduce jobs will automatically pick mapreduce_shuffle, while Spark jobs will use spark_shuffle (see the note on the shuffle-service jar after the config below).

 
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle,mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
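One step that is easy to miss: for YarnShuffleService to load, the Spark YARN shuffle jar has to be on every NodeManager's classpath (and the same yarn-site.xml change has to be present on every NodeManager), followed by a NodeManager restart. A minimal sketch, assuming the Spark 2.1.0 binary distribution layout and a typical Hadoop lib directory:

    # Run on every NodeManager host (paths are illustrative; adjust to your install dirs)
    cp $SPARK_HOME/yarn/spark-2.1.0-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/
    # Restart the NodeManagers so the spark_shuffle aux-service is picked up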
 

2. Spark:

a) Create the slaves file (~/Spark/conf/slaves) with the following entries:

cluster2.serversolution.sh.xxx
cluster3.serversolution.sh.xxx
cluster4.serversolution.sh.xxx

b) Copy the Spark installation directory to every slave node and set the SPARK_HOME variable in each node's .bashrc (a sketch follows).
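A minimal sketch of step b), assuming passwordless SSH from the master to the slaves listed above (the .bashrc one-liner is illustrative):

    for host in cluster2.serversolution.sh.xxx cluster3.serversolution.sh.xxx cluster4.serversolution.sh.xxx; do
        rsync -a ~/Spark/ ${host}:~/Spark/
        ssh ${host} 'grep -q SPARK_HOME ~/.bashrc || echo "export SPARK_HOME=~/Spark" >> ~/.bashrc'
    done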

3. HiBench: update spark.conf (HiBench/conf/spark.conf):

hibench.spark.home ~/Spark

hibench.spark.master   yarn
spark.deploy.mode  yarn-cluster

hibench.yarn.executor.num    4
hibench.yarn.executor.cores  11

# executor and driver memory in standalone & YARN mode
#spark.executor.memory  80g
spark.executor.memory  17g
spark.driver.memory    4g
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://cluster1.serversolution.sh.xxx:9000/sparklogs
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:NewRatio=1
spark.io.compression.lz4.blockSize      128k
spark.memory.fraction   0.8
spark.memory.storageFraction    0.2
spark.rdd.compress      true
spark.reducer.maxSizeInFlight   272m
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.index.cache.size  128m
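As a rough sanity check on the sizing above (assuming Spark 2.1's default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory)): 4 executors x (17 GB heap + ~1.7 GB overhead) is roughly 75 GB and 4 x 11 = 44 vcores across 3 NodeManagers, so at least one slave has to host two executors. Make sure yarn.nodemanager.resource.memory-mb and .cpu-vcores on the slaves can accommodate that, e.g.:

    yarn node -list        # all 3 NodeManagers should be listed as RUNNING
    # per-node memory/vcore capacity is shown in the ResourceManager web UI (default port 8088)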
 

4. Start/stop scripts

 ~/bin/S1spark-yarn.sh:

#!/bin/sh
#. ~/lib/hadoop-common-lib.sh

    # Bring up HDFS and YARN first
    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh

    # Spark HistoryServer, reading event logs from the directory configured in spark.conf
    $SPARK_HOME/sbin/start-history-server.sh hdfs://cluster1.serversolution.sh.xxx:9000/sparklogs
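One prerequisite not shown in the script: Spark refuses to enable event logging if the directory in spark.eventLog.dir does not exist yet, so it should be created on HDFS once before the first start. A minimal sketch, using the NameNode address from spark.conf above and the History Server's default port:

    # One-time setup: create the event-log directory referenced by spark.eventLog.dir
    hdfs dfs -mkdir -p hdfs://cluster1.serversolution.sh.xxx:9000/sparklogs
    # After starting, browse http://cluster1.serversolution.sh.xxx:18080 for completed applications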
 

~/bin/K1spark-yarn.sh

    $HADOOP_HOME/sbin/stop-dfs.sh
    $HADOOP_HOME/sbin/stop-yarn.sh

    $SPARK_HOME/sbin/stop-history-server.sh hdfs://cluster1.serversolution.sh.xxx:9000/sparklogs
 

III. Processes after startup:

master:

$ jps
9077 ResourceManager
8790 SecondaryNameNode
9383 HistoryServer
8492 NameNode


slaves:

$ clustercmd.sh jps
jps
---------------------------------------------
ssh cluster2.serversolution.sh.xxx 'jps'
10629 NodeManager
10424 DataNode
11023 Jps
---------------------------------------------
---------------------------------------------
ssh cluster3.serversolution.sh.xxx 'jps'
11106 NodeManager
11495 Jps
10890 DataNode
---------------------------------------------
---------------------------------------------
ssh cluster4.serversolution.sh.xxx 'jps'
11553 NodeManager
11347 DataNode
11950 Jps
---------------------------------------------

IV. Run a Spark TeraSort test

~/HiBench/bin/workloads/micro/terasort/spark/run.sh 

finish ScalaSparkTerasort bench
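For a first run, the TeraSort input data has to be generated with the workload's prepare script before run.sh, and the benchmark result (duration, throughput) is appended to HiBench's report file. The full sequence, following the HiBench 6.0 directory layout:

    ~/HiBench/bin/workloads/micro/terasort/prepare/prepare.sh   # generate input data on HDFS
    ~/HiBench/bin/workloads/micro/terasort/spark/run.sh         # run the Spark TeraSort workload
    cat ~/HiBench/report/hibench.report                         # summary of duration and throughput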
