A First Look at Spark

Since I already have a working Hadoop and HBase cluster, I chose the Spark 1.5.2 "without-hadoop" build.

Installation

tar -xf /home/yuzx/data/download/spark-1.5.2-bin-without-hadoop.tgz -C /home/yuzx/server
ln -sf -T /home/yuzx/server/spark-1.5.2-bin-without-hadoop /home/yuzx/server/spark
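
The configuration steps below reference ${SPARK_HOME}; here is a minimal sketch of pointing it at the symlink created above (appending it to ~/.bashrc is just one option and an assumption about your shell setup):

# Make SPARK_HOME point at the symlink so the later cp commands resolve
echo 'export SPARK_HOME=/home/yuzx/server/spark' >> ~/.bashrc
source ~/.bashrc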

Configuring spark-env.sh

The Spark installation directory ships a template for each configuration file.

# The conf directory contains a template for each configuration file
cp ${SPARK_HOME}/conf/spark-env.sh.template ${SPARK_HOME}/conf/spark-env.sh

spark-env.sh

#!/usr/bin/env bash

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

export SCALA_HOME=/home/yuzx/server/scala
export JAVA_HOME=/home/yuzx/server/jdk7

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
export HADOOP_CONF_DIR=/home/yuzx/server/hadoop/etc/hadoop

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: 'default')
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_IP=10.0.3.242

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)

# Pick whichever one of the following three options fits your setup
# http://spark.apache.org/docs/latest/hadoop-provided.html
# If 'hadoop' binary is on your PATH
#export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/home/yuzx/server/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
#export SPARK_DIST_CLASSPATH=$(hadoop --config /home/yuzx/server/hadoop/etc/hadoop classpath)

slaves

# A Spark Worker will be started on each of the machines listed below.
dn1
dn2
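
This file lives at ${SPARK_HOME}/conf/slaves; a sketch of creating it from the template that ships with the distribution (the template filename is stated from memory of the 1.5.2 layout):

cp ${SPARK_HOME}/conf/slaves.template ${SPARK_HOME}/conf/slaves
# then list one worker hostname per line, as above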

Startup

Run the following command on the master node:

sbin/start-all.sh

After startup, check the Java processes on each remote node with jps: the master node should show a Master process, and each worker (slave) node should show a Worker process.
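
As a quick sanity check (a sketch; the exact jps output lines are assumptions based on the names of the standalone daemons):

# On the master node
jps    # the list should include a "Master" process
# On dn1 / dn2
jps    # the list should include a "Worker" process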

Verifying the Spark cluster installation

On the local machine (not a cluster node), set up a client environment; you still need a spark-env.sh that at the very least exports HADOOP_CONF_DIR=XXX.
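
A minimal client-side spark-env.sh might look like the sketch below; the paths are assumptions (the Hadoop configuration directory matches the one used in the YARN examples further down, and the client-side Hadoop path is a hypothetical placeholder):

#!/usr/bin/env bash
# Minimal client spark-env.sh (sketch)

# Hadoop configuration copied from the cluster
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc

# The "without-hadoop" build also needs Hadoop classes on the client classpath
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)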

Running on the Spark cluster

# Launch in client mode: the driver program runs locally on the client, the executors run on the cluster's worker nodes, and the cluster manager is Spark's own standalone manager
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://10.0.3.242:7077 \
  --deploy-mode client \
  --executor-memory 2g \
  --executor-cores 1 \
  --total-executor-cores 100 \
  lib/spark-examples-1.5.2-hadoop2.2.0.jar \
  500

While the job is running, you can monitor it at http://127.0.0.1:4040/.
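
The standalone master also serves its own web UI, by default on port 8080 (the port is Spark's default and is an assumption about this cluster, since spark-env.sh above does not override it); it lists the registered workers and the running applications:

# From any machine that can reach the master
curl -s http://10.0.3.242:8080/ | head
# or simply open http://10.0.3.242:8080/ in a browser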

Running on YARN (the awesome Spark on YARN), i.e. running it on the Hadoop cluster

Reference (haven't had time to read it yet):
http://spark.apache.org/docs/latest/running-on-yarn.html

First copy the Hadoop configuration files out of the cluster; also note that the local machine's hosts file must map the hostnames of every Hadoop node, as sketched below.
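
A sketch of the kind of /etc/hosts entries this refers to (every address and the namenode hostname below are hypothetical placeholders for your own cluster):

# /etc/hosts on the client machine
10.0.3.242  nn1   # hypothetical: host running the namenode/resourcemanager
10.0.3.243  dn1
10.0.3.244  dn2
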
# This can also be set in spark-env.sh
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 50 \
  --executor-memory 2g \
  lib/spark-examples-1.5.2-hadoop2.2.0.jar \
  500

# yarn-client mode
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-client \
  --num-executors 50 \
  --executor-memory 2g \
  lib/spark-examples-1.5.2-hadoop2.2.0.jar \
  500

How do you view the logs in yarn-cluster mode?

In my environment it is: http://dn2:8042/node/containerlogs/container_1449649503862_0004_01_000001/yuzx/stdout/?start=-4096

Log in to the web UI on the namenode host, find the application under Finished, open it, then in the appattempt list there is a logs link on the far right; follow it and open stdout.
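
If log aggregation is enabled on the cluster (an assumption about its yarn-site.xml), the same container logs can also be pulled from the command line once the application has finished; the application id below is derived from the container id in the URL above:

# Fetch the aggregated logs of the finished yarn-cluster application
/home/yuzx/server/hadoop/bin/yarn logs -applicationId application_1449649503862_0004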

In yarn-client mode the output is printed directly to the local console.

How the execution works

Reference:
http://spark.apache.org/docs/latest/cluster-overview.html

  • org.apache.spark.examples.SparkPi is a Spark application, much like a small Java program with a main program
  • That main program is the driver program; it contains a SparkContext, and the SparkContext coordinates the execution of the Spark application
  • The SparkContext can connect to several kinds of cluster managers, namely:
    • Spark's own standalone cluster manager
    • Apache Mesos
    • Hadoop YARN
  • Once connected to a cluster manager, it acquires executors on the cluster's nodes; judging from the diagram these are processes on the Worker nodes (watching the SparkPi run with jps, they appear as processes like CoarseGrainedExecutorBackend, started after the Spark application starts and terminated when it ends)
    • Executors run the computations and store the application's data
  • Next, the SparkContext sends the application code to the executors (for SparkPi, that is spark-examples-1.5.2-hadoop2.2.0.jar)
  • Finally, the SparkContext sends tasks to the executors, which run them (heh, those wicked tasks); the sketch after this list shows one way to watch this happen
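
One way to observe this division of labour (a sketch; the process names you will see with jps are assumptions based on how the standalone deployment behaves, not exact output):

# Start an interactive driver on the client against the standalone master
./bin/spark-shell --master spark://10.0.3.242:7077 --executor-memory 1g
# While the shell is open:
#  - `jps` on the client shows a SparkSubmit process: that is the driver
#  - `jps` on dn1/dn2 shows a CoarseGrainedExecutorBackend process per
#    executor; these disappear as soon as you quit the shell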

Spark terminology

  • Application: a user program built on top of Spark (I think you could also just call it a Spark app); at runtime it consists of one driver program and a set of executor processes
  • Application jar: the user program packaged as a jar, like an executable jar (don't tell me you don't know what that is). What if it has dependent jars? Bundle everything into a single jar, also called an assembly jar or uber jar, i.e. a "super" jar that contains all of its dependencies. Note that the uber jar must not include the Hadoop or Spark jars themselves; those are supplied by the framework at runtime, so they have to be declared as provided/runtime dependencies (again, don't tell me you don't know that)
  • Driver program: the process that runs your application's main and creates the SparkContext; with deploy-mode=client this driver runs locally on the client
  • Cluster manager: an external service that allocates resources on the cluster (Standalone, Mesos, YARN)
  • Deploy mode: determines where the driver program runs; with cluster the framework launches the driver inside the cluster, with client it normally runs locally on the client
  • Worker node: a node that actually does the work, like the programmers, while the driver is the manager responsible for monitoring and coordination
  • Executor: a process that does the work on a worker node; just as a programmer can take on gigs from many different clients, each executor serves exactly one of them: every Spark application has its own executors, usually several of them, spread across the worker nodes, each one a separate process
  • Task: a running Spark application is broken down into many tasks, and each task is sent to one of the executors to run
  • Job: a parallel computation made up of multiple tasks
  • Stage: each job is split into sets of tasks that depend on one another, similar to the map and reduce stages in MapReduce
Translating all of this is pretty tiring ~~
