1. Download
Download address: https://archive.apache.org/dist/spark/
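If wget is available, the package used below can be pulled straight from the archive; the exact sub-path and file name are assumed from the usual layout of the Apache archive and may need adjusting for other versions:
# download the prebuilt Spark 1.6.0 package for Hadoop 2.6 (path assumed)
wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz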
2. Extract
Copy the package to the Linux machine, then extract it:
tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz
Rename the directory:
mv spark-1.6.0-bin-hadoop2.6 spark-1.6.0
3. Environment configuration
1) Open the system profile for editing
sudo vi /etc/profile
2) Add the following environment variables
export SPARK_HOME=/vol6/home/nudt_cb/psl/spark-1.6.0
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH
3) Reload the configuration
source /etc/profile
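A quick check such as the following confirms that the new variables are visible in the current shell and that the Spark binaries are on the PATH:
# verify the environment variables and PATH entry
echo $SPARK_HOME
which spark-shell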
4) Edit spark-env.sh
a) Create spark-env.sh from the shipped template
cp spark-env.sh.template spark-env.sh
b) Edit the file contents
vi spark-env.sh
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
export JAVA_HOME=/vol6/appsoftware/jdk1.8.0_101
export SPARK_HOME=/vol6/home/nudt_cb/psl/spark-1.6.0
export SPARK_MASTER_IP=***.***.***.***
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=2g
5) Configure slaves
a) Create slaves from the shipped template
cp slaves.template slaves
b) Edit the contents
vi slaves
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# A Spark Worker will be started on each of the machines listed below.
localhost
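The single localhost entry keeps everything on one node, which is enough for testing. For a real cluster the file would instead list one worker hostname per line; the names below are purely illustrative:
# conf/slaves for a hypothetical three-worker cluster (example hostnames)
worker-node-1
worker-node-2
worker-node-3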
4. Verify Spark
[nudt_cb@ln1%tianhe ~]$ spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
20/05/28 22:38:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
20/05/28 22:38:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
20/05/28 22:38:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
20/05/28 22:38:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
20/05/28 22:38:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
20/05/28 22:38:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
scala>
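Besides opening the interactive shell, a simple end-to-end check is to run one of the bundled examples; SparkPi with a small slice count finishes in a few seconds:
# run a bundled example to confirm the installation works end to end
$SPARK_HOME/bin/run-example SparkPi 10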
5. Configuration parameter reference
The following content is adapted from the official documentation:
http://spark.apache.org/docs/latest/spark-standalone.html
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful for testing. Note that the master machine accesses each of the worker machines via ssh. By default, ssh is run in parallel and requires password-less (using a private key) access to be set up. If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
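A typical way to set up the password-less ssh mentioned above is a key pair plus ssh-copy-id, run from the master node; the worker hostname below is only an example:
# generate a key pair on the master (accept the defaults, empty passphrase)
ssh-keygen -t rsa
# copy the public key to each worker listed in conf/slaves (hostname is an example)
ssh-copy-id nudt_cb@worker-node-1
# confirm that login now works without a password
ssh nudt_cb@worker-node-1 hostname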
Once this file is set up, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, which are available in SPARK_HOME/sbin:
- sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
- sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
- sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
- sbin/start-all.sh - Starts both a master and a number of slaves as described above.
- sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
- sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
- sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
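For instance, on the master node the whole standalone cluster can be brought up and an interactive shell attached to it as follows; the master host and port are assumptions and should match the values configured in spark-env.sh:
# bring up the master and all workers listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh
# attach a shell to the cluster (replace master-host with the real master address)
spark-shell --master spark://master-host:7077
# shut everything down again when finished
$SPARK_HOME/sbin/stop-all.sh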
You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by starting with conf/spark-env.sh.template, and copy it to all your worker machines for the settings to take effect. The following settings are available:
Environment Variable | Meaning
---|---
SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, for example a public one.
SPARK_MASTER_PORT | Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT | Port for the master web UI (default: 8080).
SPARK_MASTER_OPTS | Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See the list of possible options below.
SPARK_LOCAL_DIRS | Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
SPARK_WORKER_CORES | Total number of cores to allow Spark applications to use on the machine (default: all available cores).
SPARK_WORKER_MEMORY | Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
SPARK_WORKER_PORT | Start the Spark worker on a specific port (default: random).
SPARK_WORKER_WEBUI_PORT | Port for the worker web UI (default: 8081).
SPARK_WORKER_DIR | Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_OPTS | Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See the list of possible options below.
SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).
SPARK_DAEMON_CLASSPATH | Classpath for the Spark master and worker daemons themselves (default: none).
SPARK_PUBLIC_DNS | The public DNS name of the Spark master and workers (default: none).
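As an illustration, a spark-env.sh that pins down a few of the variables above might look like this; the host name and resource figures are placeholders, not recommendations:
# illustrative spark-env.sh fragment for a standalone node (values are examples)
export SPARK_MASTER_HOST=master-host         # bind the master to a specific host
export SPARK_MASTER_WEBUI_PORT=8080          # master web UI port (default shown)
export SPARK_WORKER_CORES=4                  # cores this worker offers to applications
export SPARK_WORKER_MEMORY=4g                # memory this worker offers to applications
export SPARK_LOCAL_DIRS=/data/spark-scratch  # fast local disk for shuffle and spilled RDD data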
Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
SPARK_MASTER_OPTS supports the following system properties:
Property Name | Default | Meaning
---|---|---
spark.deploy.retainedApplications | 200 | The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
spark.deploy.retainedDrivers | 200 | The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
spark.deploy.spreadOut | true | Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
spark.deploy.defaultCores | (infinite) | Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
spark.deploy.maxExecutorRetries | 10 | Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executor successfully started running in between those failures, and the application has no running executors, then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1.
spark.worker.timeout | 60 | Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
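These properties are passed to the master as JVM system properties, typically via SPARK_MASTER_OPTS in spark-env.sh; the values below are examples only:
# example: cap the default core allocation and lengthen the worker timeout
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.worker.timeout=120"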
SPARK_WORKER_OPTS supports the following system properties:
Property Name | Default | Meaning
---|---|---
spark.worker.cleanup.enabled | false | Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval | 1800 (30 minutes) | Controls the interval, in seconds, at which the worker cleans up old application work directories on the local machine.
spark.worker.cleanup.appDataTtl | 604800 (7 days, 7*24*3600) | The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work directory. Over time, the work directories can quickly fill up disk space, especially if you run jobs very frequently.
spark.storage.cleanupFilesAfterExecutorExit | true | Enable cleanup of non-shuffle files (such as temporary shuffle blocks, cached RDD/broadcast blocks, spill files, etc.) in worker directories after an executor exits. Note that this does not overlap with `spark.worker.cleanup.enabled`: this option cleans up non-shuffle files in the local directories of dead executors, while `spark.worker.cleanup.enabled` cleans up all files and subdirectories of stopped and timed-out applications. This only affects standalone mode; support for other cluster managers may be added in the future.
spark.worker.ui.compressedLogFileLengthCacheSize | 100 | For compressed log files, the uncompressed file size can only be computed by uncompressing the file. Spark caches the uncompressed file size of compressed log files. This property controls the cache size.
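Similarly, worker-side properties go into SPARK_WORKER_OPTS in spark-env.sh; this sketch enables the periodic cleanup described above, using the default interval and TTL from the table:
# example: turn on periodic cleanup of old application work directories
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"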