Deploying Spark 1.0 on Hadoop 2.2.0
July 2014
Contents
Introduction
1 Cluster network environment and quick deployment
2 Passwordless SSH configuration
2.1 Configure passwordless SSH between all nodes
3 JDK installation and Java environment variables
3.1 Install JDK 1.7.0_21
3.2 Configure the Java environment variables
4 Hadoop cluster configuration
(1) Edit the Hadoop configuration files
(2) Copy the configured files to all data nodes
5 Starting the Hadoop cluster
6 Testing Hadoop
7 Calling the Hadoop cluster through the YARN client
8 Configuring the Spark 1.0 cluster
8.1 Configure the environment variables
8.2 Distribute the program to every node
8.3 Start the cluster
8.4 Run a test program
This document summarizes the configuration of Hadoop 2.2.0 on a distributed environment built with VMware 10.0 on a single server, running CentOS 6.4 x64. It is recommended to use host names (not IP addresses) throughout the Hadoop configuration files, to open the required ports in the firewall of every machine, and to set the sshd service to start on boot; the Java environment variables can be configured in /etc/profile.
For convenience, all required packages have been bundled and uploaded to a network drive; see http://yun.baidu.com/s/1eQeQ7DK
The cluster consists of five nodes: 1 namenode and 4 datanodes, connected over a LAN and able to ping each other.
All nodes run CentOS 6.4 64-bit; the firewall is disabled on every node, and the sshd service is enabled and set to start on boot.
a) First install one CentOS 6.4 machine in VMware and create the hadoop user. Assume the virtual machine is named NameNode.
b) Shut the virtual machine down, make 4 copies of the NameNode folder, and name them DataNode1, ..., DataNode4.
c) Open each DataNode in VMware and set the virtual machine's name.
d) Boot the operating system; when the dialog appears, choose "I copied it".
e) Boot each virtual machine and check its IP address:
ifconfig
The IP addresses are planned as follows:
192.168.1.150   namenode
192.168.1.151   datanode1
192.168.1.152   datanode2
192.168.1.153   datanode3
192.168.1.154   datanode4
f) On every virtual machine, permanently disable the firewall (very important, be sure to verify this) and disable SELinux:
chkconfig iptables off   (permanent)
service iptables stop    (takes effect immediately)
vim /etc/selinux/config    (the line to change is shown after the console output below)
[root@DataNode1 local]# chkconfig iptables off
[root@DataNode1 local]# service iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]
[root@DataNode1 local]#
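In /etc/selinux/config the line to change is the SELINUX setting (a reboot is needed for it to take effect; setenforce 0 turns enforcement off for the current session):

SELINUX=disabled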
g) Configure the NameNode
Step one: check the machine name:
#hostname
If it is wrong, log in as root and change it (an example of the file is shown below):
# vim /etc/sysconfig/network
Repeat this for every node; after the change, reboot the system: #reboot
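For reference, /etc/sysconfig/network on the namenode would look roughly like this (each DataNode uses its own host name, e.g. DataNode1):

NETWORKING=yes
HOSTNAME=NameNode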
h) Edit /etc/hosts
As root:
vim /etc/hosts
(1) Edit the /etc/hosts file on the namenode.
Write the names and IP addresses of all nodes into it, as shown below, and note that the 127.0.0.1 line must be commented out. (Double-check the IP addresses for duplicates or mistakes.)
192.168.1.150 namenode
192.168.1.151 datanode1
192.168.1.152 datanode2
192.168.1.153 datanode3
192.168.1.154 datanode4
# 127.0.0.1 centos63 localhost.localdomain localhost
(2) Copy the /etc/hosts file from the namenode to all data nodes, as follows:
Log in to the namenode as root;
Run:
scp /etc/hosts [email protected]:/etc/hosts
scp /etc/hosts [email protected]:/etc/hosts
scp /etc/hosts [email protected]:/etc/hosts
scp /etc/hosts [email protected]:/etc/hosts
i) Plan the system directories
Keep the install directory and the data directories separate, and keep the data directories away from Hadoop's own files; then, if the filesystem ever needs to be re-formatted, the data directories can simply be deleted and recreated.
If the data directories sit together with the install directory or user files, operating on them carries the risk of accidentally deleting programs or user data.
Full path                               Purpose
/opt/hadoop-2.2.0                       Main Hadoop installation directory
/home/hadoop/hd_space/tmp               Temporary directory
/home/hadoop/hd_space/dfs/name          HDFS namespace metadata on the namenode
/home/hadoop/hd_space/dfs/data          Physical storage of data blocks on the datanodes
/home/hadoop/hd_space/mapred/local      Local directory used while running MapReduce tasks
/home/hadoop/hd_space/mapred/system     Directory in HDFS holding shared files for MapReduce jobs
Create the directories.
On the NameNode, as root:
rm -rf /home/hadoop/hd_space
mkdir -p /home/hadoop/hd_space/tmp
mkdir -p /home/hadoop/hd_space/dfs/name
mkdir -p /home/hadoop/hd_space/dfs/data
mkdir -p /home/hadoop/hd_space/mapred/local
mkdir -p /home/hadoop/hd_space/mapred/system
chown -R hadoop:hadoop /home/hadoop/hd_space/
Also change the owner of the /home/hadoop directory (Hadoop works out of it, so the hadoop user must have rwx permission on it):
chown -R hadoop:hadoop /home/hadoop
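Because the data directories are separate from the install directory, a later re-format of HDFS only requires wiping and recreating hd_space. A small helper sketch (run as root on every node; the paths are the ones planned above):

# recreate the Hadoop data directories from scratch
rm -rf /home/hadoop/hd_space
mkdir -p /home/hadoop/hd_space/tmp \
         /home/hadoop/hd_space/dfs/name \
         /home/hadoop/hd_space/dfs/data \
         /home/hadoop/hd_space/mapred/local \
         /home/hadoop/hd_space/mapred/system
chown -R hadoop:hadoop /home/hadoop/hd_space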
With the base directories in place, the next step is to set up passwordless SSH so that Hadoop can manage the cluster conveniently.
Hadoop's management scripts rely on SSH: when the cluster is started from the namenode, the scripts log in to the other nodes over SSH to start the namenode and datanode daemons (the daemons themselves exchange heartbeats over Hadoop RPC, not SSH). To keep administration simple, configure passwordless SSH login between all nodes in both directions.
(0) How it works
For node A to connect to node B with passwordless public-key authentication, A is the client and B is the server. A key pair (a public key and a private key) is generated on A, and the public key is copied to B. When A connects to B over ssh, B generates a random number, encrypts it with A's public key, and sends it to A. A decrypts it with its private key and returns the result; once B confirms the decrypted value, it lets A connect. This is public-key authentication, and no password has to be typed. The essential step is copying A's public key to B.
Therefore, for passwordless authentication between all nodes, every node's public key must be copied to every node.
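As an aside, on systems that ship ssh-copy-id the manual copy-and-merge procedure below can be replaced by running it from every node; a sketch (host names as configured in /etc/hosts):

for host in namenode datanode1 datanode2 datanode3 datanode4
do
    ssh-copy-id hadoop@$host    # appends this node's public key to the remote ~/.ssh/authorized_keys
done

The manual steps below achieve the same result and are kept as originally documented.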
(1) Generate a key pair on every machine
(a) Log in to every node as the hadoop user and generate an RSA key pair:
ssh-keygen -t rsa
This creates a private key id_rsa and a public key id_rsa.pub under /home/hadoop/.ssh/.
# su hadoop
ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):   <- accept the default path
Enter passphrase (empty for no passphrase):   <- press Enter for an empty passphrase
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
(b) Send the public key id_rsa.pub of every datanode to the namenode.
On DataNode1:
scp id_rsa.pub hadoop@NameNode:/home/hadoop/.ssh/id_rsa.pub.datanode1
......
On DataNodeN:
scp id_rsa.pub hadoop@NameNode:/home/hadoop/.ssh/id_rsa.pub.datanoden
Check that the public keys of all data nodes have arrived on the namenode.
(c) On the namenode, combine all public keys (including its own) and send the result to every node:
[hadoop@NameNode .ssh]$ cat id_rsa.pub >> authorized_keys              # the namenode's own public key
[hadoop@NameNode .ssh]$ cat id_rsa.pub.datanode1 >> authorized_keys
[hadoop@NameNode .ssh]$ cat id_rsa.pub.datanode2 >> authorized_keys
[hadoop@NameNode .ssh]$ cat id_rsa.pub.datanode3 >> authorized_keys
[hadoop@NameNode .ssh]$ cat id_rsa.pub.datanode4 >> authorized_keys
chmod 644 ~/.ssh/authorized_keys
Copy the combined authorized_keys file to the .ssh directory of every DataNode over SSH:
scp ~/.ssh/authorized_keys hadoop@DataNode1:/home/hadoop/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@DataNode2:/home/hadoop/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@DataNode3:/home/hadoop/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@DataNode4:/home/hadoop/.ssh/authorized_keys
As you can see, once /etc/hosts is configured, the machines can be addressed by name and there is no need to remember each machine's IP address; this really pays off when the cluster has many machines with non-contiguous IPs.
Once authorized_keys has been distributed to every node, ssh logins succeed directly without a password.
With this configuration the namenode can log in to every datanode without a password, which can be verified with
"ssh DataNode1 (2, 3, 4)".
Finally, on the namenode run "ssh NameNode" and ssh to every data node once, because ssh only asks to confirm the host key the first time; do the same from each DataNode to the NameNode and to all the other data nodes.
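A quick sanity check is to run the following from the NameNode as the hadoop user; every iteration should print a host name without ever prompting for a password (host names as used above):

for src in NameNode DataNode1 DataNode2 DataNode3 DataNode4
do
    for dst in NameNode DataNode1 DataNode2 DataNode3 DataNode4
    do
        # accept the host key on first contact and print the remote host name
        ssh $src "ssh -o StrictHostKeyChecking=no $dst hostname"
    done
done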
At this point all nodes can reach each other; the next step is to configure the JDK.
1. Download the JDK.
Choose the Linux version; the downloaded file is jdk-7u21-linux-x64.tar.gz.
2. Extract it into /opt:
mv jdk-7u21-linux-x64.tar.gz /opt
cd /opt
tar xf jdk-7u21-linux-x64.tar.gz
Log in as root, run "vim /etc/profile", and add the following lines to configure the environment variables (note that /etc/profile is important; it will be used again later for the Hadoop configuration):
# set java environment
export JAVA_HOME=/opt/jdk1.7.0_21
export JRE_HOME=/opt/jdk1.7.0_21/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Save and exit, then make the settings take effect:
chmod +x /etc/profile
source /etc/profile
When this is done, running "java -version" on the command line shows whether it succeeded. Running java -version as the hadoop user succeeds as well.
On the namenode:
Log in as the hadoop user.
Download hadoop-2.2.0 (a pre-built 64-bit Hadoop 2.2 is available from the network drive at
http://pan.baidu.com/s/1sjz2ORN) and extract it into /opt.
(a) Configure /etc/profile:
# set hadoop
export HADOOP_HOME=/opt/hadoop-2.2.0
export HADOOP_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Set JAVA_HOME in hadoop-env.sh and yarn-env.sh:
$ vim $HADOOP_CONF_DIR/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_21
vim $HADOOP_CONF_DIR/yarn-env.sh
export JAVA_HOME=/opt/jdk1.7.0_21
Then edit the *-site.xml files (a sketch of their contents follows below):
vim $HADOOP_CONF_DIR/core-site.xml
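The listings of the *-site.xml files did not survive in this copy of the document, so the following is only a minimal sketch that is consistent with the directory plan above and with the hdfs://namenode:9000 address used later; the YARN and replication values are assumed stock Hadoop 2.2 settings and should be adjusted to your environment. As recommended in the introduction, host names (not IPs) are used in these files.

cd $HADOOP_CONF_DIR

cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hd_space/tmp</value>
  </property>
</configuration>
EOF

cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hd_space/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hd_space/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF

cat > mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

cat > yarn-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF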
(g) Configure the masters file: replace localhost with the namenode's host name.
NameNode
(h) Configure the slaves file: remove localhost and add the host names of all datanodes.
DataNode1
DataNode2
DataNode3
DataNode4
On the NameNode, run the following script to distribute the configuration:
for target in DataNode1 DataNode2 DataNode3 DataNode4
do
    scp -r /opt/hadoop-2.2.0/etc/hadoop $target:/opt/hadoop-2.2.0/etc
done
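If the Hadoop distribution itself (and the /etc/profile settings) has not yet been copied to the DataNodes, the same loop pattern can be used once; a sketch, assuming /opt on the DataNodes is writable by the user running it:

for target in DataNode1 DataNode2 DataNode3 DataNode4
do
    scp -r /opt/hadoop-2.2.0 $target:/opt
    scp /etc/profile $target:/etc/profile
done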
hadoop namenode -format
(Because the environment variables are configured, there is no need to type the full path /opt/hadoop-2.2.0/bin/hadoop.)
The output should contain "dfs/name has been successfully formatted"; otherwise the format failed.
Start Hadoop:
start-dfs.sh
start-yarn.sh
After a successful start, run jps on the namenode and the datanode machines. On the namenode machine you should see NameNode, SecondaryNameNode and ResourceManager:
[hadoop@NameNode hadoop]$ jps
9097 Jps
8662 SecondaryNameNode
8836 ResourceManager
8459 NameNode
[hadoop@NameNode hadoop]$
On each datanode machine you should see DataNode and NodeManager; if not, the startup failed and the configuration should be checked.
[root@DataNode1 .ssh]# jps
4885 Jps
4623 DataNode
4736 NodeManager
[root@DataNode1 .ssh]#
Check the cluster status:
hdfs dfsadmin -report
To stop Hadoop:
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
View HDFS: http://192.168.1.150:50070/dfshealth.jsp
View the ResourceManager (the YARN web UI listens on port 8088 by default): http://192.168.1.150:8088
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -mkdir /tmp
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -ls /
14/07/08 15:31:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwx------   - hadoop supergroup          0 2014-07-08 15:29 /tmp
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -copyFromLocal /opt/hadoop-2.2.0/test.txt hdfs://namenode:9000/tmp/test.txt
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -ls /tmp
14/07/08 15:34:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwx------   - hadoop supergroup          0 2014-07-08 15:29 /tmp/hadoop-yarn
-rw-r--r--   3 hadoop supergroup       2044 2014-07-08 15:34 /tmp/test.txt
Run the example job:
[hadoop@NameNode hadoop-2.2.0]$
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /tmp/test.txt /tmp-output
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -ls /tmp-output
14/07/08 16:07:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-07-08 15:35 /tmp-output/_SUCCESS
-rw-r--r--   3 hadoop supergroup       1045 2014-07-08 15:35 /tmp-output/part-r-00000
[hadoop@NameNode hadoop-2.2.0]$
View the result:
[hadoop@NameNode hadoop-2.2.0]$ hdfs dfs -cat /tmp-output/part-r-00000
BAD_ID=0 1
Bytes 2
CONNECTION=0 1
CPU 1
Combine 2
hdfs dfs -mkdir /jar
hdfs dfs -mkdir /jar/spark
hdfs dfs -copyFromLocal /opt/spark-1.0.0-bin-2.2.0/lib/spark-assembly-1.0.0-hadoop2.2.0.jar hdfs://namenode:9000/jar/spark/spark-assembly-1.0.0-hadoop2.2.0.jar
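To confirm the upload, list the directory; the assembly jar should appear with its full size:

hdfs dfs -ls /jar/spark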
The extracted Spark package only needs to be copied to one machine in the YARN cluster; a single node is enough, and there is no need to deploy it on every node unless several client nodes will be submitting Spark jobs.
Here we do not build a standalone Spark cluster; instead we use the YARN client to run on the Hadoop cluster's compute resources.
mv <extracted dir>/conf/spark-env.sh.template <extracted dir>/conf/spark-env.sh
Edit spark-env.sh:
export HADOOP_HOME=/opt/hadoop-2.2.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_EXECUTOR_INSTANCES=4
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=2G
SPARK_YARN_APP_NAME="Spark 1.0.0"
This is my configuration; it differs slightly from earlier versions, but not by much.
Now run the classic MapReduce example through the YARN client: the Spark version of word count.
Note in particular that SparkContext has changed; the first argument used in the word count examples of earlier versions must be removed.
For convenience I copied SPARK_HOME/lib/spark-assembly-1.0.0-hadoop2.2.0.jar into HDFS and reference it from there (referencing it on the local disk also works).
SPARK_JAR="hdfs://NameNode:9000/jar/spark/spark-assembly-1.0.0-hadoop2.2.0.jar" \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar ./lib/spark-examples-1.0.0-hadoop2.2.0.jar \
--class org.apache.spark.examples.JavaWordCount \
--arg hdfs://NameNode:9000/tmp/test.txt \
--num-executors 50 \
--executor-cores 1 \
--driver-memory 2048M \
--executor-memory 1000M \
--name "word count on spark"
The result can be viewed in the application's stdout (the YARN container logs).
The speed is acceptable: counting a 5.1 GB file with 4 nodes / 64 cores took 221 seconds.
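For reference, the same example can also be submitted with spark-submit in yarn-cluster mode instead of invoking org.apache.spark.deploy.yarn.Client directly; a rough sketch of an equivalent command (the executor numbers here are illustrative, not the ones measured above):

./bin/spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.JavaWordCount \
  --name "word count on spark" \
  --num-executors 4 \
  --executor-cores 1 \
  --executor-memory 1g \
  --driver-memory 2g \
  ./lib/spark-examples-1.0.0-hadoop2.2.0.jar \
  hdfs://NameNode:9000/tmp/test.txt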
Next, configure the standalone Spark 1.0 cluster. Add the compute nodes:
vi /opt/spark-1.0.0-bin-2.2.0/conf/slaves
DataNode1
DataNode2
DataNode3
DataNode4
Modify spark-env.sh:
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following:
export SCALA_HOME=/opt/scala-2.10.3
export JAVA_HOME=/opt/jdk1.7.0_21
export SPARK_MASTER_IP=192.168.1.150
export SPARK_WORKER_MEMORY=10G
# JVM memory settings
# Set SPARK_MEM if it isn't already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-10g}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
SPARK_WORKER_MEMORY is the maximum amount of memory Spark may use on each node; increasing it allows more data to be cached in memory, but be sure to leave enough memory for the operating system and the other services running on each slave.
http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space
Below is the relevant advice from Stack Overflow:
Have a look at the start up scripts: a Java heap size is set there; it looks like you're not setting this before running Spark worker.
# Set SPARK_MEM if it isn't already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
for target in DataNode1 DataNode2 DataNode3 DataNode4
do
    scp -r /opt/spark-1.0.0-bin-2.2.0 $target:/opt
done
cd /opt/spark-1.0.0-bin-2.2.0/sbin
./start-all.sh
[hadoop@NameNode sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.0.0-bin-2.2.0/sbin/../logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-NameNode.out
DataNode2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.0.0-bin-2.2.0/sbin/../logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-DataNode2.out
DataNode3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.0.0-bin-2.2.0/sbin/../logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-DataNode3.out
DataNode1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.0.0-bin-2.2.0/sbin/../logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-DataNode1.out
DataNode4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.0.0-bin-2.2.0/sbin/../logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-DataNode4.out
[hadoop@NameNode sbin]$
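To double-check that the standalone cluster is up, jps should now show a Master process on the NameNode and a Worker process on each DataNode in addition to the Hadoop daemons; a small check loop (using the JDK path from this setup, since a non-interactive ssh shell may not have jps on its PATH):

for host in NameNode DataNode1 DataNode2 DataNode3 DataNode4
do
    echo "== $host =="
    ssh $host "/opt/jdk1.7.0_21/bin/jps | grep -E 'Master|Worker'"
done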
View the cluster in a browser (the standalone Master web UI listens on port 8080 by default, e.g. http://192.168.1.150:8080).
[hadoop@NameNode spark-1.0.0-bin-2.2.0]$ bin/spark-shell --executor-memory 2g --driver-memory 1g --master spark://NameNode:7077
14/07/08 19:18:09 INFO spark.SecurityManager: Changing view acls to: hadoop
14/07/08 19:18:09 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/07/08 19:18:09 INFO spark.HttpServer: Starting HTTP Server
14/07/08 19:18:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/08 19:18:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:57198
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.0.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_21)
Type in expressions to have them evaluated.
Type :help for more information.
14/07/08 19:18:13INFO spark.SecurityManager: Changing view acls to: hadoop
14/07/08 19:18:13INFO spark.SecurityManager: SecurityManager: authentication disabled; ui aclsdisabled; users with view permissions: Set(hadoop)
14/07/08 19:18:13INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/08 19:18:13INFO Remoting: Starting remoting
14/07/08 19:18:14INFO Remoting: Remoting started; listening on addresses:[akka.tcp://spark@NameNode:51486]
14/07/08 19:18:14INFO Remoting: Remoting now listens on addresses:[akka.tcp://spark@NameNode:51486]
14/07/08 19:18:14INFO spark.SparkEnv: Registering MapOutputTracker
14/07/08 19:18:14INFO spark.SparkEnv: Registering BlockManagerMaster
14/07/08 19:18:14INFO storage.DiskBlockManager: Created local directory at/tmp/spark-local-20140708191814-fe19
14/07/08 19:18:14INFO storage.MemoryStore: MemoryStore started with capacity 5.8 GB.
14/07/08 19:18:14INFO network.ConnectionManager: Bound socket to port 47219 with id =ConnectionManagerId(NameNode,47219)
14/07/08 19:18:14INFO storage.BlockManagerMaster: Trying to register BlockManager
14/07/08 19:18:14INFO storage.BlockManagerInfo: Registering block manager NameNode:47219 with5.8 GB RAM
14/07/08 19:18:14INFO storage.BlockManagerMaster: Registered BlockManager
14/07/08 19:18:14INFO spark.HttpServer: Starting HTTP Server
14/07/08 19:18:14INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/08 19:18:14INFO server.AbstractConnector: Started [email protected]:35560
14/07/08 19:18:14INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.1.150:35560
14/07/08 19:18:14INFO spark.HttpFileServer: HTTP File server directory is/tmp/spark-201155bc-731d-4eea-b637-88982e32ee14
14/07/08 19:18:14INFO spark.HttpServer: Starting HTTP Server
14/07/08 19:18:14INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/08 19:18:14INFO server.AbstractConnector: Started [email protected]:53311
14/07/08 19:18:14INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/08 19:18:14INFO server.AbstractConnector: Started [email protected]:4040
14/07/08 19:18:14INFO ui.SparkUI: Started SparkUI at http://NameNode:4040
14/07/08 19:18:15 WARNutil.NativeCodeLoader: Unable to load native-hadoop library for yourplatform... using builtin-java classes where applicable
14/07/08 19:18:15INFO client.AppClient$ClientActor: Connecting to masterspark://NameNode:7077...
14/07/08 19:18:15INFO repl.SparkILoop: Created spark context..
14/07/08 19:18:15INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with appID app-20140708191815-0001
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor added: app-20140708191815-0001/0 onworker-20140708190701-DataNode4-48388 (DataNode4:48388) with 16 cores
14/07/08 19:18:15INFO cluster.SparkDeploySchedulerBackend: Granted executor IDapp-20140708191815-0001/0 on hostPort DataNode4:48388 with 16 cores, 2.0 GB RAM
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor added: app-20140708191815-0001/1 onworker-20140708190659-DataNode3-44272 (DataNode3:44272) with 16 cores
14/07/08 19:18:15INFO cluster.SparkDeploySchedulerBackend: Granted executor IDapp-20140708191815-0001/1 on hostPort DataNode3:44272 with 16 cores, 2.0 GB RAM
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor added: app-20140708191815-0001/2 onworker-20140708190700-DataNode2-57378 (DataNode2:57378) with 16 cores
14/07/08 19:18:15INFO cluster.SparkDeploySchedulerBackend: Granted executor IDapp-20140708191815-0001/2 on hostPort DataNode2:57378 with 16 cores, 2.0 GB RAM
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor added: app-20140708191815-0001/3 onworker-20140708190700-DataNode1-55222 (DataNode1:55222) with 16 cores
14/07/08 19:18:15INFO cluster.SparkDeploySchedulerBackend: Granted executor IDapp-20140708191815-0001/3 on hostPort DataNode1:55222 with 16 cores, 2.0 GB RAM
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor updated: app-20140708191815-0001/3is now RUNNING
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor updated: app-20140708191815-0001/2is now RUNNING
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor updated: app-20140708191815-0001/0is now RUNNING
14/07/08 19:18:15INFO client.AppClient$ClientActor: Executor updated: app-20140708191815-0001/1is now RUNNING
Spark context available as sc.
scala> 14/07/0819:18:18 INFO cluster.SparkDeploySchedulerBackend: Registered executor:Actor[akka.tcp://sparkExecutor@DataNode4:40761/user/Executor#807513222] with ID0
14/07/08 19:18:18INFO cluster.SparkDeploySchedulerBackend: Registered executor:Actor[akka.tcp://sparkExecutor@DataNode1:57590/user/Executor#-2071278347] withID 3
14/07/08 19:18:18INFO cluster.SparkDeploySchedulerBackend: Registered executor:Actor[akka.tcp://sparkExecutor@DataNode2:43335/user/Executor#-723681055] withID 2
14/07/08 19:18:18INFO cluster.SparkDeploySchedulerBackend: Registered executor:Actor[akka.tcp://sparkExecutor@DataNode3:43008/user/Executor#-1215215976] withID 1
14/07/08 19:18:18INFO storage.BlockManagerInfo: Registering block manager DataNode4:44391 with1177.6 MB RAM
14/07/08 19:18:18INFO storage.BlockManagerInfo: Registering block manager DataNode1:40306 with1177.6 MB RAM
14/07/08 19:18:18INFO storage.BlockManagerInfo: Registering block manager DataNode2:35755 with1177.6 MB RAM
14/07/08 19:18:18INFO storage.BlockManagerInfo: Registering block manager DataNode3:42366 with1177.6 MB RAM
scala> val rdd=sc.textFile("hdfs://NameNode:9000/tmp/test.txt")
14/07/08 19:18:39INFO storage.MemoryStore: ensureFreeSpace(141503) called with curMem=0,maxMem=6174041702
14/07/08 19:18:39INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimatedsize 138.2 KB, free 5.7 GB)
rdd: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
scala> rdd.cache()
res0: rdd.type = MappedRDD[1] at textFile at
scala> val wordcount=rdd.flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_)
14/07/08 19:19:04INFO mapred.FileInputFormat: Total input paths to process : 1
wordcount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at
scala> wordcount.take(10)
14/07/08 19:19:11INFO spark.SparkContext: Starting job: take at
14/07/08 19:19:11INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at
14/07/08 19:19:11INFO scheduler.DAGScheduler: Got job 0 (take at
14/07/08 19:19:11INFO scheduler.DAGScheduler: Final stage: Stage 0(take at
14/07/08 19:19:11INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/07/08 19:19:11INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/07/08 19:19:11INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] atreduceByKey at
14/07/08 19:19:11INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[4]at reduceByKey at
14/07/08 19:19:11INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/07/08 19:19:11INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 2:DataNode2 (NODE_LOCAL)
14/07/08 19:19:11INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2079 bytes in 6 ms
14/07/08 19:19:11INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 1 on executor 1:DataNode3 (NODE_LOCAL)
14/07/08 19:19:11INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 2079 bytes in 1 ms
14/07/08 19:19:12INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on DataNode3:42366(size: 3.2 KB, free: 1177.6 MB)
14/07/08 19:19:12INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on DataNode2:35755 (size:3.1 KB, free: 1177.6 MB)
14/07/08 19:19:13INFO scheduler.TaskSetManager: Finished TID 0 in 1830 ms on DataNode2(progress: 1/2)
14/07/08 19:19:13INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 0)
14/07/08 19:19:13INFO scheduler.TaskSetManager: Finished TID 1 in 1821 ms on DataNode3(progress: 2/2)
14/07/08 19:19:13INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have allcompleted, from pool
14/07/08 19:19:13INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 1)
14/07/08 19:19:13INFO scheduler.DAGScheduler: Stage 1 (reduceByKey at
14/07/08 19:19:13INFO scheduler.DAGScheduler: looking for newly runnable stages
14/07/08 19:19:13INFO scheduler.DAGScheduler: running: Set()
14/07/08 19:19:13INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
14/07/08 19:19:13INFO scheduler.DAGScheduler: failed: Set()
14/07/08 19:19:13INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
14/07/08 19:19:13INFO scheduler.DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[6] atreduceByKey at
14/07/08 19:19:13INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0(MapPartitionsRDD[6] at reduceByKey at
14/07/08 19:19:13INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/07/08 19:19:13INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 2 on executor 2:DataNode2 (PROCESS_LOCAL)
14/07/08 19:19:13INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1972 bytes in 1 ms
14/07/08 19:19:13INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations forshuffle 0 to spark@DataNode2:36057
14/07/08 19:19:13INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 146bytes
14/07/08 19:19:13INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
14/07/08 19:19:13INFO scheduler.TaskSetManager: Finished TID 2 in 404 ms on DataNode2 (progress:1/1)
14/07/08 19:19:13INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have allcompleted, from pool
14/07/08 19:19:13INFO scheduler.DAGScheduler: Stage 0 (take at
14/07/08 19:19:13INFO spark.SparkContext: Job finished: take at
res1: Array[(String, Int)] = Array((BAD_ID=0,1), (committed,1), (Written=196192,1), (tasks=1,3), (Framework,1), (outputs=1,1), (groups=18040,1), (map,2), (Reduce,4), (ystem,1))
scala> val wordsort=wordcount.map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1))
14/07/08 19:19:23 INFOspark.SparkContext: Starting job: sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Got job 1 (sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Final stage: Stage 2(sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 3)
14/07/08 19:19:23INFO scheduler.DAGScheduler: Missing parents: List()
14/07/08 19:19:23INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[7] at map at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 2(MappedRDD[7] at map at
14/07/08 19:19:23INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
14/07/08 19:19:23INFO scheduler.TaskSetManager: Starting task 2.0:0 as TID 3 on executor 2:DataNode2 (PROCESS_LOCAL)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Serialized task 2.0:0 as 1970 bytes in 0 ms
14/07/08 19:19:23INFO scheduler.TaskSetManager: Starting task 2.0:1 as TID 4 on executor 1:DataNode3 (PROCESS_LOCAL)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Serialized task 2.0:1 as 1970 bytes in 0 ms
14/07/08 19:19:23INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations forshuffle 0 to spark@DataNode3:59586
14/07/08 19:19:23INFO scheduler.DAGScheduler: Completed ResultTask(2, 0)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Finished TID 3 in 117 ms on DataNode2 (progress:1/2)
14/07/08 19:19:23INFO scheduler.DAGScheduler: Completed ResultTask(2, 1)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Finished TID 4 in 168 ms on DataNode3 (progress:2/2)
14/07/08 19:19:23INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have allcompleted, from pool
14/07/08 19:19:23INFO scheduler.DAGScheduler: Stage 2 (sortByKey at
14/07/08 19:19:23INFO spark.SparkContext: Job finished: sortByKey at
14/07/08 19:19:23INFO spark.SparkContext: Starting job: sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Got job 2 (sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Final stage: Stage 4(sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 5)
14/07/08 19:19:23INFO scheduler.DAGScheduler: Missing parents: List()
14/07/08 19:19:23INFO scheduler.DAGScheduler: Submitting Stage 4 (MappedRDD[9] at sortByKey at
14/07/08 19:19:23INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 4(MappedRDD[9] at sortByKey at
14/07/08 19:19:23INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
14/07/08 19:19:23INFO scheduler.TaskSetManager: Starting task 4.0:0 as TID 5 on executor 2:DataNode2 (PROCESS_LOCAL)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 2454 bytes in 0 ms
14/07/08 19:19:23 INFOscheduler.TaskSetManager: Starting task 4.0:1 as TID 6 on executor 0: DataNode4(PROCESS_LOCAL)
14/07/08 19:19:23INFO scheduler.TaskSetManager: Serialized task 4.0:1 as 2454 bytes in 0 ms
14/07/08 19:19:24INFO scheduler.DAGScheduler: Completed ResultTask(4, 0)
14/07/08 19:19:24INFO scheduler.TaskSetManager: Finished TID 5 in 104 ms on DataNode2 (progress:1/2)
14/07/08 19:19:24INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations forshuffle 0 to spark@DataNode4:45983
14/07/08 19:19:24INFO scheduler.DAGScheduler: Completed ResultTask(4, 1)
14/07/08 19:19:24INFO scheduler.TaskSetManager: Finished TID 6 in 908 ms on DataNode4 (progress:2/2)
14/07/08 19:19:24INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have allcompleted, from pool
14/07/08 19:19:24INFO scheduler.DAGScheduler: Stage 4 (sortByKey at
14/07/08 19:19:24INFO spark.SparkContext: Job finished: sortByKey at
wordsort: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at
scala> wordsort.take(10)
14/07/08 19:19:31INFO spark.SparkContext: Starting job: take at
14/07/08 19:19:31INFO scheduler.DAGScheduler: Registering RDD 7 (map at
14/07/08 19:19:31INFO scheduler.DAGScheduler: Got job 3 (take at
14/07/08 19:19:31INFO scheduler.DAGScheduler: Final stage: Stage 6(take at
14/07/08 19:19:31INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 7)
14/07/08 19:19:31INFO scheduler.DAGScheduler: Missing parents: List(Stage 7)
14/07/08 19:19:31INFO scheduler.DAGScheduler: Submitting Stage 7 (MappedRDD[7] at map at
14/07/08 19:19:31INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 7(MappedRDD[7] at map at
14/07/08 19:19:31INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
14/07/08 19:19:31INFO scheduler.TaskSetManager: Starting task 7.0:0 as TID 7 on executor 0:DataNode4 (PROCESS_LOCAL)
14/07/08 19:19:31INFO scheduler.TaskSetManager: Serialized task 7.0:0 as 2102 bytes in 1 ms
14/07/08 19:19:31INFO scheduler.TaskSetManager: Starting task 7.0:1 as TID 8 on executor 3:DataNode1 (PROCESS_LOCAL)
14/07/08 19:19:31INFO scheduler.TaskSetManager: Serialized task 7.0:1 as 2102 bytes in 0 ms
14/07/08 19:19:32INFO scheduler.TaskSetManager: Finished TID 7 in 93 ms on DataNode4 (progress:1/2)
14/07/08 19:19:32INFO scheduler.DAGScheduler: Completed ShuffleMapTask(7, 0)
14/07/08 19:19:32INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations forshuffle 0 to spark@DataNode1:46772
14/07/08 19:19:32INFO scheduler.TaskSetManager: Finished TID 8 in 820 ms on DataNode1 (progress:2/2)
14/07/08 19:19:32INFO scheduler.DAGScheduler: Completed ShuffleMapTask(7, 1)
14/07/08 19:19:32INFO scheduler.TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have allcompleted, from pool
14/07/08 19:19:32INFO scheduler.DAGScheduler: Stage 7 (map at
14/07/08 19:19:32INFO scheduler.DAGScheduler: looking for newly runnable stages
14/07/08 19:19:32INFO scheduler.DAGScheduler: running: Set()
14/07/08 19:19:32INFO scheduler.DAGScheduler: waiting: Set(Stage 6)
14/07/08 19:19:32INFO scheduler.DAGScheduler: failed: Set()
14/07/08 19:19:32INFO scheduler.DAGScheduler: Missing parents for Stage 6: List()
14/07/08 19:19:32INFO scheduler.DAGScheduler: Submitting Stage 6 (MappedRDD[12] at map at
14/07/08 19:19:32INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 6(MappedRDD[12] at map at
14/07/08 19:19:32INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 1 tasks
14/07/08 19:19:32INFO scheduler.TaskSetManager: Starting task 6.0:0 as TID 9 on executor 2:DataNode2 (PROCESS_LOCAL)
14/07/08 19:19:32INFO scheduler.TaskSetManager: Serialized task 6.0:0 as 2381 bytes in 0 ms
14/07/08 19:19:32INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations forshuffle 1 to spark@DataNode2:36057
14/07/08 19:19:32INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 149bytes
14/07/08 19:19:32INFO scheduler.DAGScheduler: Completed ResultTask(6, 0)
14/07/08 19:19:32INFO scheduler.TaskSetManager: Finished TID 9 in 119 ms on DataNode2 (progress:1/1)
14/07/08 19:19:32INFO scheduler.TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have allcompleted, from pool
14/07/08 19:19:32INFO scheduler.DAGScheduler: Stage 6 (take at
14/07/08 19:19:32INFO spark.SparkContext: Job finished: take at
res2: Array[(String, Int)] = Array(("",724), (Number,10), (of,10), (Map,5), (FILE:,5), (HDFS:,5), (output,5), (Reduce,4), (input,4), (time,4))
scala>
bin/spark-submit --master spark://NameNode:7077 --class org.apache.spark.examples.SparkPi --executor-memory 2g lib/spark-examples-1.0.0-hadoop2.2.0.jar 1000
Partial output:
14/07/08 19:37:12 INFO scheduler.TaskSetManager: Finished TID 994 in 610 ms on DataNode3 (progress: 998/1000)
14/07/08 19:37:12 INFO scheduler.DAGScheduler: Completed ResultTask(0, 994)
14/07/08 19:37:12 INFO scheduler.TaskSetManager: Finished TID 997 in 620 ms on DataNode3 (progress: 999/1000)
14/07/08 19:37:12 INFO scheduler.DAGScheduler: Completed ResultTask(0, 997)
14/07/08 19:37:12 INFO scheduler.TaskSetManager: Finished TID 993 in 625 ms on DataNode3 (progress: 1000/1000)
14/07/08 19:37:12 INFO scheduler.DAGScheduler: Completed ResultTask(0, 993)
14/07/08 19:37:12 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 25.020 s
14/07/08 19:37:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/08 19:37:12 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 25.502195433 s
Pi is roughly 3.14185688
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/metrics/json,null}
14/07/08 19:37:12 INFO handler.ContextHandler:stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/executors/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
14/07/08 19:37:12 INFO handler.ContextHandler:stopped o.e.j.s.ServletContextHandler{/environment/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/environment,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/storage/rdd,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/storage/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/stages/pool/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/stages/stage/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stoppedo.e.j.s.ServletContextHandler{/stages/json,null}
14/07/08 19:37:12 INFOhandler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
14/07/08 19:37:12 INFO ui.SparkUI: StoppedSpark web UI at http://NameNode:4040
14/07/08 19:37:12 INFOscheduler.DAGScheduler: Stopping DAGScheduler
14/07/08 19:37:12 INFOcluster.SparkDeploySchedulerBackend: Shutting down all executors
14/07/08 19:37:12 INFOcluster.SparkDeploySchedulerBackend: Asking each executor to shut down
14/07/08 19:37:13 INFOspark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/07/08 19:37:13 INFOnetwork.ConnectionManager: Selector thread was interrupted!
14/07/08 19:37:13 INFOnetwork.ConnectionManager: ConnectionManager stopped
14/07/08 19:37:13 INFO storage.MemoryStore:MemoryStore cleared
14/07/08 19:37:13 INFO storage.BlockManager:BlockManager stopped
14/07/08 19:37:13 INFOstorage.BlockManagerMasterActor: Stopping BlockManagerMaster
14/07/08 19:37:13 INFOstorage.BlockManagerMaster: BlockManagerMaster stopped
14/07/08 19:37:13 INFO spark.SparkContext:Successfully stopped SparkContext
14/07/08 19:37:13 INFOremote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/07/08 19:37:13 INFOremote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down;proceeding with flushing remote transports.