Not exactly.
We only use Spark Core to replace MR for offline computation; data storage still relies on HDFS.
In the big data field of the future, the combination of Spark + Hadoop is the most popular and most promising one!
6.1 Limitations of MR
6.2 What problems does Spark solve?
Therefore, replacing Hadoop MapReduce with a new generation of big data processing platforms is the trend of technological development. Among the new-generation big data processing platforms, Spark is currently the most widely recognized and supported one.
$tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
$mv spark-2.2.0-bin-hadoop2.7/ spark
#Edit /opt/spark/conf/spark-env.sh and add:
export JAVA_HOME=/opt/jdk
export SPARK_MASTER_IP=hdp01
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
#Configure environment variables for spark in /etc/profile
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
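A quick optional check that the variables are picked up in the current shell, assuming the lines above were added to /etc/profile:
$source /etc/profile
$echo $SPARK_HOME
#Should print /opt/spark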
$start-all-spark.sh
Access the master web UI at http://hdp01:8080
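As an optional sanity check (not part of the original steps), the JDK's jps tool shows whether the Spark daemons started on hdp01:
$jps
#The output should include a Master process, plus a Worker if one runs on this host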
9. Install a Spark distributed cluster
[root@hdp01 /opt/spark/conf]
#In spark-env.sh, set:
export JAVA_HOME=/opt/jdk
#Configure the host of the master
export SPARK_MASTER_IP=hdp01
#Configure the port for master host communication
export SPARK_MASTER_PORT=7077
#Configure the number of CPU cores used by spark on each worker
export SPARK_WORKER_CORES=4
#Configure one worker instance per host
export SPARK_WORKER_INSTANCES=1
#The memory used by each worker is 2GB
export SPARK_WORKER_MEMORY=2g
#Directory of Hadoop's configuration files
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
[root@hdp01 /opt/spark/conf]
#List the worker hosts in the slaves file, one per line
hdp03
hdp04
hdp05
[root@hdp01 /opt/spark/conf]
$scp -r /opt/spark hdp02:/opt/
$scp -r /opt/spark hdp03:/opt/
$scp -r /opt/spark hdp04:/opt/
$scp -r /opt/spark hdp05:/opt/
[root@hdp01 /]
$scp -r /etc/profile hdp02:/etc/
$scp -r /etc/profile hdp03:/etc/
$scp -r /etc/profile hdp04:/etc/
$scp -r /etc/profile hdp05:/etc/
[root@hdp01 /]
$start-all-spark.sh
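A rough way to verify the cluster came up, assuming passwordless SSH between the nodes (the scp steps above already rely on it):
#The Master should be running on hdp01
$jps | grep Master
#Each host listed in the slaves file should now run a Worker, for example:
$ssh hdp03 "jps | grep Worker"
#The workers should also appear as ALIVE on http://hdp01:8080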
10. Configure a Spark high-availability cluster
First, stop the running Spark cluster.
#Comment out the following two lines in spark-env.sh
#export SPARK_MASTER_IP=hdp01
#export SPARK_MASTER_PORT=7077
#Add the following line to spark-env.sh to enable ZooKeeper-based recovery
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hdp03:2181,hdp04:2181,hdp05:2181 -Dspark.deploy.zookeeper.dir=/spark"
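The recovery mode above assumes a ZooKeeper ensemble is already running on hdp03, hdp04 and hdp05 (its installation is not covered here). A rough check, assuming zkServer.sh is on the PATH of those hosts:
$ssh hdp03 "zkServer.sh status"
$ssh hdp04 "zkServer.sh status"
$ssh hdp05 "zkServer.sh status"
#Each node should report Mode: leader or Mode: follower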
$scp /opt/spark/conf/spark-env.sh hdp02:/opt/spark/conf
$scp /opt/spark/conf/spark-env.sh hdp03:/opt/spark/conf
$scp /opt/spark/conf/spark-env.sh hdp04:/opt/spark/conf
$scp /opt/spark/conf/spark-env.sh hdp05:/opt/spark/conf
[root@hdp01 /]
$start-all-spark.sh
[root@hdp02 /]
$start-master.sh
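A sketch of how to confirm the high-availability setup, assuming default web UI ports: the master on hdp01 should be ALIVE and the one on hdp02 STANDBY, and killing the active master should make the standby take over after a short recovery period.
#On hdp02, the standby Master process should be running
$jps | grep Master
#Compare the two web UIs: http://hdp01:8080 should show Status: ALIVE, http://hdp02:8080 should show Status: STANDBY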
11. The first Spark Shell program
$spark-shell --master spark://hdp01:7077
#The spark shell can specify the resources (total cores, memory used on each worker) for the application at startup
$spark-shell --master spark://hdp01:7077 --total-executor-cores 6 --executor-memory 1g
#If not specified, all cores on each worker and 1GB of memory on each worker are used by default
scala> sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
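The word count above reads from hdfs://ns1/sparktest/, which has to exist beforehand. A minimal sketch of preparing that input, where words.txt is a hypothetical local file of comma-separated words:
$hdfs dfs -mkdir -p /sparktest
$hdfs dfs -put words.txt /sparktest/
#words.txt is only a placeholder; any comma-separated text file will do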
12. Roles in Spark
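While an application is running, the standalone roles can be observed as JVM processes; a rough illustration, assuming the word-count shell from the previous section is still connected:
#On the active master node: the Master daemon
$jps | grep Master
#On each worker node: the Worker daemon, plus a CoarseGrainedExecutorBackend (the executor) while an application runs
$ssh hdp03 jps
#On the machine that launched spark-shell: a SparkSubmit process (the driver)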
13. The general process of Spark submission
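A hedged sketch of the usual submission flow with spark-submit, using the SparkPi example that ships with Spark (the exact jar name depends on the build; spark-examples_2.11-2.2.0.jar is assumed for this version):
$spark-submit \
  --master spark://hdp01:7077,hdp02:7077 \
  --class org.apache.spark.examples.SparkPi \
  --total-executor-cores 4 \
  --executor-memory 1g \
  /opt/spark/examples/jars/spark-examples_2.11-2.2.0.jar 100
#The driver registers the application with the active master, the master schedules executors on the workers,
#tasks run in those executors, and the result comes back to the driver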
If you find anything incorrect, or would like to share more information on the topics above, feedback is welcome.
Translated from developpaper.com