Official documentation: http://spark.apache.org/docs/latest/spark-standalone.html
1. Setting up a Standalone-mode cluster
2. Starting the cluster manually
2-1) Start the Spark Master service on the master node: ./sbin/start-master.sh
Once the Master service has started successfully, it prints a URL of the form spark://HOST:PORT. Workers use this URL to connect to and register with the Master, and it is also the value of the "master" parameter to set when writing programs (val conf = new SparkConf().setMaster("URL")). The URL can also be looked up in the Web UI.
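For example, a driver program connects to the standalone cluster by passing this URL to setMaster. A minimal sketch, assuming a hypothetical master host named master-node (replace it with the URL printed by start-master.sh):
import org.apache.spark.{SparkConf, SparkContext}

// "spark://master-node:7077" is a placeholder; use the URL reported by the Master.
val conf = new SparkConf()
  .setAppName("StandaloneConnectionExample")
  .setMaster("spark://master-node:7077")
val sc = new SparkContext(conf)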
If the Master is deployed on Linux, the URL can be retrieved on the Master node with: curl localhost:8080 | grep URL. The spark://HOST:PORT string in the output is the URL; alternatively, open http://[Master hostname or IP]:8080 in a browser to view it.
2-2) Start the Spark Worker service on each worker node: ./sbin/start-slave.sh spark://HOST:PORT (the argument is the Master URL from the previous step).
After the Worker service has started successfully, it can be seen in the Master's Web UI.
2-3) Startup options
master:
-i HOST, --ip HOST Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: 7077)
--webui-port PORT Port for web UI (default: 8080)
--properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf.
worker:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf.
3. Cluster launch scripts
To launch a Spark cluster with the launch scripts, create a file named slaves in the Spark conf directory and list the hostname of every worker node in it, one per line. If conf/slaves does not exist, the launch scripts start a single-machine cluster on the local host, which is mostly useful for testing. The master node issues commands to the other nodes over SSH; by default Spark runs these SSH commands in parallel, which requires passwordless SSH (public/private key) between the master and the worker nodes. If passwordless SSH is not configured, set the SPARK_SSH_FOREGROUND environment variable and provide the password for each worker serially, in the same order as the slaves file. Once the file is in place, the cluster can be started and stopped with the following scripts:
sbin/start-master.sh Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh Starts both a master and a number of slaves as described above.
sbin/stop-master.sh Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh Stops both the master and the slaves as described above.
Note that these scripts must be executed on the node where the Spark Master will run, not on your local machine.
The cluster can be configured further through environment variables in conf/spark-env.sh. If that file does not exist in the Spark conf directory, create it from spark-env.sh.template; after editing it, copy it to every node. Commonly used settings in this file are:
Environment Variable | Meaning
---|---
SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, for example a public one.
SPARK_MASTER_PORT | Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT | Port for the master web UI (default: 8080).
SPARK_MASTER_OPTS | Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.
SPARK_LOCAL_DIRS | Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
SPARK_WORKER_CORES | Total number of cores to allow Spark applications to use on the machine (default: all available cores).
SPARK_WORKER_MEMORY | Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
SPARK_WORKER_PORT | Start the Spark worker on a specific port (default: random).
SPARK_WORKER_WEBUI_PORT | Port for the worker web UI (default: 8081).
SPARK_WORKER_DIR | Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_OPTS | Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.
SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).
SPARK_PUBLIC_DNS | The public DNS name of the Spark master and workers (default: none).
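As the SPARK_WORKER_MEMORY entry above notes, the memory an individual application uses per executor is requested separately through its spark.executor.memory property. A minimal sketch (the 2g value is purely illustrative):
import org.apache.spark.SparkConf

// SPARK_WORKER_MEMORY caps what the whole worker machine offers to Spark;
// spark.executor.memory is what this one application requests per executor.
val conf = new SparkConf()
  .setAppName("ExecutorMemoryExample")
  .set("spark.executor.memory", "2g")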
SPARK_WORKER_OPTS supports the following properties:
Property Name | Default | Meaning
---|---|---
spark.worker.cleanup.enabled | false | Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval | 1800 (30 minutes) | Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl | 7 * 24 * 3600 (7 days) | The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
4. Using spark-shell
Spark provides the spark-shell command as an interactive shell. It is launched as follows:
./bin/spark-shell --master spark://[Master hostname or IP]:PORT
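Once the shell has connected, the SparkContext is already available as sc, so a quick sanity check of the cluster might look like the following sketch (the numbers are illustrative):
// Typed inside spark-shell; sc is the pre-created SparkContext.
val rdd = sc.parallelize(1 to 1000, 4)   // distribute 1..1000 over 4 partitions
println(rdd.map(_ * 2).sum())            // prints 1001000.0 if the cluster works
println(sc.master)                       // echoes the spark://HOST:PORT URL in use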
The following options (which configure the driver) can also be passed:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster")
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages
to avoid dependency conflicts.
--repositories Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver.
Note that jars added with --jars are automatically included in the classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
Depending on the deployment mode (Standalone, Mesos, YARN), additional mode-specific options are available:
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to
the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
The YARN-only option --num-executors has no direct counterpart in Standalone mode, but the same effect can be achieved indirectly by combining:
--total-executor-cores m
--executor-cores n
so that the number of executors becomes ceil(m / n). In code, the cap can be set with new SparkConf().set("spark.cores.max", "m"). The mapping between some spark-submit options and their corresponding configuration properties is listed below (see the sketch after the list):
driverExtraJavaOptions "spark.driver.extraJavaOptions"
driverExtraLibraryPath "spark.driver.extraLibraryPath"
driverMemory "spark.driver.memory" env.get("SPARK_DRIVER_MEMORY")
driverCores "spark.driver.cores"
executorMemory "spark.executor.memory" env.get("SPARK_EXECUTOR_MEMORY")
executorCores "spark.executor.cores" env.get("SPARK_EXECUTOR_CORES")
totalExecutorCores "spark.cores.max"
name "spark.app.name"
jars "spark.jars"
ivyRepoPath "spark.jars.ivy"
packages "spark.jars.packages"
packagesExclusions "spark.jars.excludes"
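Most of these spark-submit options therefore have a SparkConf equivalent that can be set in code. The sketch below is a hypothetical configuration corresponding to --total-executor-cores 8 and --executor-cores 2, so the master would launch ceil(8 / 2) = 4 executors:
import org.apache.spark.SparkConf

// Programmatic equivalents of spark-submit flags (values are illustrative):
val conf = new SparkConf()
  .setAppName("ExecutorSizingExample")   // --name
  .set("spark.cores.max", "8")           // --total-executor-cores
  .set("spark.executor.cores", "2")      // --executor-cores
  .set("spark.executor.memory", "2g")    // --executor-memory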
5. Resource scheduling
In Standalone mode, resource scheduling across applications currently only supports a simple FIFO policy. If several users share the cluster, the parameters described above can be used to cap each application's resources, for example the maximum number of cores. The official documentation gives the following simple example:
val conf = new SparkConf()
.setMaster(...)
.setAppName(...)
.set("spark.cores.max", "10")
val sc = new SparkContext(conf)
It also notes that a cluster-wide default for applications that do not set spark.cores.max can be configured in spark-env.sh through SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>".
6. Monitoring and logging
6-1) Monitoring: applications can be inspected through the Master Web UI (port 8080 by default; the port is configurable). If 8080 is already taken by another process, Spark increments the port number (8081, 8082, ...) until it finds a free one.
6-2) Logs: logs can be viewed through the Web UI, or directly on the corresponding worker node by going into the SPARK_WORKER_DIR directory and looking for the stderr and stdout files in the application's work folder.