yum -y install lrzsz
The package installs automatically. To download a file, run sz [the file you want]; to upload, run rz and browse to the file on your local machine.
tar -zxvf jdk-8u231-linux-x64.tar.gz -C /usr/local
This extracts the Java package into /usr/local.
yum -y install vim
This installs the vim editor.
JAVA_HOME=/usr/local/jdk1.8.0_231
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH CLASSPATH
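These profile lines can be sanity-checked before touching the real /etc/profile; a minimal sketch that loads the same fragment from a scratch file and prints the expanded variables (the JDK path is the one used in this guide):

```shell
# Write the profile fragment above to a scratch file, source it, and
# confirm the variables expand as expected.
profile=$(mktemp)
cat > "$profile" <<'EOF'
JAVA_HOME=/usr/local/jdk1.8.0_231
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH CLASSPATH
EOF
. "$profile"
echo "$JAVA_HOME"
echo "$CLASSPATH"
```

On the real machine, append the lines to /etc/profile, run source /etc/profile, and verify with java -version.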
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz -C /usr/local
JAVA_HOME=/usr/local/jdk1.8.0_231
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH
export JAVA_HOME SPARK_HOME PATH CLASSPATH
run-example SparkPi 2 # the argument 2 sets the parallelism: two partitions
[root@ied opt]# run-example SparkPi 2
22/02/20 04:24:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/02/20 04:24:34 INFO SparkContext: Running Spark version 2.4.4
22/02/20 04:24:34 INFO SparkContext: Submitted application: Spark Pi
22/02/20 04:24:34 INFO SecurityManager: Changing view acls to: root
22/02/20 04:24:34 INFO SecurityManager: Changing modify acls to: root
22/02/20 04:24:34 INFO SecurityManager: Changing view acls groups to:
22/02/20 04:24:34 INFO SecurityManager: Changing modify acls groups to:
22/02/20 04:24:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
22/02/20 04:24:35 INFO Utils: Successfully started service 'sparkDriver' on port 41942.
22/02/20 04:24:35 INFO SparkEnv: Registering MapOutputTracker
22/02/20 04:24:36 INFO SparkEnv: Registering BlockManagerMaster
22/02/20 04:24:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/02/20 04:24:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/02/20 04:24:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8de32b0e-530a-47ba-ad2d-efcfaa2af498
22/02/20 04:24:36 INFO MemoryStore: MemoryStore started with capacity 413.9 MB
22/02/20 04:24:36 INFO SparkEnv: Registering OutputCommitCoordinator
22/02/20 04:24:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/02/20 04:24:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ied:4040
22/02/20 04:24:36 INFO SparkContext: Added JAR file:///usr/local/spark-2.4.4-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.4.jar at spark://ied:41942/jars/spark-examples_2.11-2.4.4.jar with timestamp 1645302276946
22/02/20 04:24:36 INFO SparkContext: Added JAR file:///usr/local/spark-2.4.4-bin-hadoop2.7/examples/jars/scopt_2.11-3.7.0.jar at spark://ied:41942/jars/scopt_2.11-3.7.0.jar with timestamp 1645302276946
22/02/20 04:24:37 INFO Executor: Starting executor ID driver on host localhost
22/02/20 04:24:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33814.
22/02/20 04:24:37 INFO NettyBlockTransferService: Server created on ied:33814
22/02/20 04:24:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/02/20 04:24:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ied, 33814, None)
22/02/20 04:24:37 INFO BlockManagerMasterEndpoint: Registering block manager ied:33814 with 413.9 MB RAM, BlockManagerId(driver, ied, 33814, None)
22/02/20 04:24:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ied, 33814, None)
22/02/20 04:24:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ied, 33814, None)
22/02/20 04:24:39 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
22/02/20 04:24:39 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
22/02/20 04:24:39 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
22/02/20 04:24:39 INFO DAGScheduler: Parents of final stage: List()
22/02/20 04:24:39 INFO DAGScheduler: Missing parents: List()
22/02/20 04:24:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
22/02/20 04:24:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 413.9 MB)
22/02/20 04:24:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 413.9 MB)
22/02/20 04:24:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ied:33814 (size: 1256.0 B, free: 413.9 MB)
22/02/20 04:24:40 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
22/02/20 04:24:40 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
22/02/20 04:24:40 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
22/02/20 04:24:40 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
22/02/20 04:24:40 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/02/20 04:24:40 INFO Executor: Fetching spark://ied:41942/jars/scopt_2.11-3.7.0.jar with timestamp 1645302276946
22/02/20 04:24:41 INFO TransportClientFactory: Successfully created connection to ied/192.168.225.100:41942 after 185 ms (0 ms spent in bootstraps)
22/02/20 04:24:41 INFO Utils: Fetching spark://ied:41942/jars/scopt_2.11-3.7.0.jar to /tmp/spark-1426c39a-4d28-40e6-84da-d2d5f6071ddf/userFiles-3f7a473d-50b4-46ed-be1f-d77e07167e09/fetchFileTemp2787747616090799670.tmp
22/02/20 04:24:42 INFO Executor: Adding file:/tmp/spark-1426c39a-4d28-40e6-84da-d2d5f6071ddf/userFiles-3f7a473d-50b4-46ed-be1f-d77e07167e09/scopt_2.11-3.7.0.jar to class loader
22/02/20 04:24:42 INFO Executor: Fetching spark://ied:41942/jars/spark-examples_2.11-2.4.4.jar with timestamp 1645302276946
22/02/20 04:24:42 INFO Utils: Fetching spark://ied:41942/jars/spark-examples_2.11-2.4.4.jar to /tmp/spark-1426c39a-4d28-40e6-84da-d2d5f6071ddf/userFiles-3f7a473d-50b4-46ed-be1f-d77e07167e09/fetchFileTemp5384793568751348333.tmp
22/02/20 04:24:42 INFO Executor: Adding file:/tmp/spark-1426c39a-4d28-40e6-84da-d2d5f6071ddf/userFiles-3f7a473d-50b4-46ed-be1f-d77e07167e09/spark-examples_2.11-2.4.4.jar to class loader
22/02/20 04:24:42 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 910 bytes result sent to driver
22/02/20 04:24:42 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
22/02/20 04:24:42 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/02/20 04:24:42 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 867 bytes result sent to driver
22/02/20 04:24:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1654 ms on localhost (executor driver) (1/2)
22/02/20 04:24:42 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 139 ms on localhost (executor driver) (2/2)
22/02/20 04:24:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/02/20 04:24:42 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 2.597 s
22/02/20 04:24:42 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 2.956212 s
Pi is roughly 3.1441757208786045
22/02/20 04:24:42 INFO SparkUI: Stopped Spark web UI at http://ied:4040
22/02/20 04:24:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/02/20 04:24:42 INFO MemoryStore: MemoryStore cleared
22/02/20 04:24:42 INFO BlockManager: BlockManager stopped
22/02/20 04:24:42 INFO BlockManagerMaster: BlockManagerMaster stopped
22/02/20 04:24:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/02/20 04:24:42 INFO SparkContext: Successfully stopped SparkContext
22/02/20 04:24:42 INFO ShutdownHookManager: Shutdown hook called
22/02/20 04:24:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-1426c39a-4d28-40e6-84da-d2d5f6071ddf
22/02/20 04:24:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8fe131d-a733-466f-9665-4277ace75a06
Note the line near the end of the output: Pi is roughly 3.1441757208786045
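SparkPi estimates π by Monte Carlo sampling: it throws random points at the unit square and counts how many land inside the inscribed circle. A single-machine sketch of the same idea (the sample count is arbitrary; SparkPi distributes this loop across partitions):

```shell
# Monte Carlo pi: fraction of random points inside the unit circle,
# times 4, approximates pi.
awk 'BEGIN {
  srand(42)                          # fixed seed; value still varies by awk build
  n = 200000; hits = 0
  for (i = 0; i < n; i++) {
    x = 2 * rand() - 1; y = 2 * rand() - 1
    if (x * x + y * y <= 1) hits++   # point fell inside the circle
  }
  printf "Pi is roughly %f\n", 4 * hits / n
}'
```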
>>> lines = sc.textFile('test.txt')
>>> sparkLines = lines.filter(lambda line: 'spark' in line)
>>> sparkLines.first()
'hello hadoop hello spark'
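The filter-plus-first() pattern above only needs to scan until the first match. A local shell analogy of that early exit, on a sample file written on the spot: grep -m 1 likewise stops reading at the first matching line instead of processing the whole file.

```shell
# Create a small sample file, then stop at the first line containing "spark".
sample=$(mktemp)
printf '%s\n' 'hello hadoop hello spark' 'hello world' 'spark again' > "$sample"
grep -m 1 'spark' "$sample"   # prints: hello hadoop hello spark
```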
Lazy evaluation
Spark computes these RDDs lazily: they are only actually computed the first time they are used in an action. This strategy may seem odd at first, but it makes good sense for big data. Consider the two examples above: we defined an RDD from a text file and then filtered out the lines containing "spark". If Spark read and stored every line of the file the moment we ran lines = sc.textFile(...), it would consume a great deal of storage on data we were about to filter away. Instead, once Spark knows the complete chain of transformations, it can compute only the data actually needed for the result. In fact, for the action first(), Spark only scans the file until it finds the first matching line; it does not read the entire file.
Spark in standalone mode is already installed on the ied host and does not need Hadoop, but a Spark pseudo-distributed environment must be built on top of a Hadoop pseudo-distributed environment. Change into the Hadoop directory with cd /usr/local/hadoop-2.7.1 and list its contents with ll:
bin directory: command scripts
etc/hadoop directory: Hadoop configuration files
lib directory: the JARs Hadoop depends on at run time
sbin directory: commands for starting and stopping Hadoop
libexec directory: more Hadoop commands, rarely used directly
export JAVA_HOME=/usr/local/jdk1.8.0_231
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ied:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-2.7.1/tmp</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ied</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
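A missing or mistyped closing tag in any of these files makes Hadoop fail at startup with an XML parse error, so it can be worth validating the files first. A sketch that checks well-formedness on a scratch copy of the core-site.xml fragment, using Python's standard XML parser as the checker:

```shell
# Validate that a Hadoop config file is well-formed XML before starting Hadoop.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ied:9000</value>
  </property>
</configuration>
EOF
python3 -c 'import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1]); print("well-formed")' "$cfg"
```

Run the same check against each of core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.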
vim /etc/profile
JAVA_HOME=/usr/local/jdk1.8.0_231
HADOOP_HOME=/usr/local/hadoop-2.7.1
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH
export JAVA_HOME HADOOP_HOME SPARK_HOME PATH CLASSPATH
hdfs namenode -format
This formats the name node, producing a usable HDFS distributed file system. The log line
22/02/22 21:09:34 INFO common.Storage: Storage directory /usr/local/hadoop-2.7.1/tmp/dfs/name has been successfully formatted.
shows that the name node was formatted successfully.
JAVA_HOME=/usr/local/jdk1.8.0_231
HADOOP_HOME=/usr/local/hadoop-2.7.1
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export JAVA_HOME HADOOP_HOME SPARK_HOME PATH CLASSPATH
start-master.sh
This starts the Spark master.
start-slaves.sh
This starts the Spark workers.
vim $SPARK_HOME/sbin/spark-config.sh
Add the JAVA_HOME environment variable here.
source $SPARK_HOME/sbin/spark-config.sh
This applies the configuration.
start-slaves.sh
This starts the Spark workers.
Stop and disable the firewall on the ied virtual machine.
Run the command: systemctl stop firewalld.service
vi /etc/resolv.conf
Edit the /etc/resolv.conf file to add a DNS (name resolution) server; once it is in place, domain names can be pinged.
yum -y install vim
This installs the vim editor.
192.168.1.103 master
192.168.1.104 slave1
192.168.1.105 slave2
systemctl stop firewalld.service # stop the firewall
systemctl disable firewalld.service # keep the firewall from starting at boot
systemctl status firewalld.service
In the /etc/sysconfig/selinux file, change SELINUX=enforcing to SELINUX=disabled to turn off the SELinux security mechanism.
On each of the three hosts (master, slave1, and slave2), set up passwordless SSH:
ssh-keygen
This generates a key pair.
ssh-copy-id root@master
This copies the public key to master.
ssh-copy-id root@slave1
This copies the public key to slave1.
ssh-copy-id root@slave2
This copies the public key to slave2.
JAVA_HOME=/usr/local/jdk1.8.0_231
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH CLASSPATH
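The SELinux edit mentioned a few steps back (SELINUX=enforcing to SELINUX=disabled in /etc/sysconfig/selinux) can be scripted with sed; a sketch shown on a scratch copy rather than the real file:

```shell
# Rewrite SELINUX=enforcing to SELINUX=disabled on a scratch copy of the file.
cfg=$(mktemp)
echo 'SELINUX=enforcing' > "$cfg"
sed 's/^SELINUX=enforcing$/SELINUX=disabled/' "$cfg" > "$cfg.fixed"
cat "$cfg.fixed"   # prints: SELINUX=disabled
```

The change takes effect after a reboot; setenforce 0 relaxes enforcement for the current session only.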
scp -r /usr/local/jdk1.8.0_231 root@slave1:/usr/local
scp -r /etc/profile root@slave1:/etc/profile
source /etc/profile
This applies the configuration.
tar -zxvf apache-zookeeper-3.7.0-bin.tar.gz -C /usr/local
This extracts the ZooKeeper package into the target directory. Rename it to zookeeper-3.7.0 with the command:
mv /usr/local/apache-zookeeper-3.7.0-bin /usr/local/zookeeper-3.7.0
Create a ZkData subdirectory, then run vim zoo.cfg to edit the zoo.cfg file, configuring the data directory and the server election ids:
dataDir=/usr/local/zookeeper-3.7.0/ZkData
# server's election id
server.1=192.168.1.103:2888:3888
server.2=192.168.1.104:2888:3888
server.3=192.168.1.105:2888:3888
Note: the number after "server." is the election id, used during leader election.
The ids must be distinct so they can be compared.
2888: the atomic-broadcast port; it can be customized.
3888: the election port; it can be customized.
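The zoo.cfg fragment pairs with a per-host myid file inside the data directory. A sketch in a scratch directory (the real dataDir is /usr/local/zookeeper-3.7.0/ZkData; myid 2 for slave1 is confirmed later in this guide, while 1 for master and 3 for slave2 are the values implied by the server.N lines):

```shell
# Recreate the zoo.cfg fragment in a scratch dir and pair it with a myid file.
zkdata=$(mktemp -d)
cat > "$zkdata/zoo.cfg" <<'EOF'
dataDir=/usr/local/zookeeper-3.7.0/ZkData
# server's election id
server.1=192.168.1.103:2888:3888
server.2=192.168.1.104:2888:3888
server.3=192.168.1.105:2888:3888
EOF
echo 1 > "$zkdata/myid"                 # on master; slave1 holds 2, slave2 holds 3
grep -c '^server\.' "$zkdata/zoo.cfg"   # number of ensemble members: 3
```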
JAVA_HOME=/usr/local/jdk1.8.0_231
ZK_HOME=/usr/local/zookeeper-3.7.0
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$ZK_HOME/bin:$PATH
export JAVA_HOME ZK_HOME PATH CLASSPATH
systemctl stop firewalld.service # stop the firewall for now
systemctl disable firewalld.service # keep the firewall from starting at boot
systemctl status firewalld # check the firewall status
scp -r /usr/local/zookeeper-3.7.0 root@slave1:/usr/local
scp /etc/profile root@slave1:/etc/profile
source /etc/profile
This applies the configuration. On slave1, go into the zookeeper-3.7.0/ZkData directory and change the content of myid to 2.
Copy the ZooKeeper installation directory from master to the same path on slave2 with the command: scp -r /usr/local/zookeeper-3.7.0 root@slave2:/usr/local
Copy /etc/profile from master to the same location on slave2 with the command: scp /etc/profile root@slave2:/etc/profile
zkServer.sh start
Run this on each of the three hosts, then check each one with:
zkServer.sh status
Of the three servers, one reports itself as leader and the other two as follower.
zkCli.sh
This starts the zk client; create a node named /zk01.
quit
This command exits the zk client. Next, run vim /etc/profile:
JAVA_HOME=/usr/local/jdk1.8.0_231
ZK_HOME=/usr/local/zookeeper-3.7.0
HADOOP_HOME=/usr/local/hadoop-2.7.1
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$ZK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export JAVA_HOME ZK_HOME HADOOP_HOME PATH CLASSPATH
export JAVA_HOME=/usr/local/jdk1.8.0_231
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-2.7.1/tmp</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop-2.7.1/tmp/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop-2.7.1/tmp/disk1, /usr/local/hadoop-2.7.1/tmp/disk2</value>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
The yarn.nodemanager.aux-services property used to default to "mapreduce.shuffle", but if you keep that value in hadoop-2.7 the NodeManager fails to start; use mapreduce_shuffle instead.
scp /etc/profile root@slave1:/etc/profile
hdfs namenode -format
22/02/26 13:23:22 INFO common.Storage: Storage directory /usr/local/hadoop-2.7.1/tmp/namenode has been successfully formatted.
This shows the name node was formatted successfully. The cluster has one name node (namenode), on the master VM, and three data nodes (datanodes), one on each VM.
The secondary name node (secondarynamenode) binds to 0.0.0.0 by default, but this can be changed:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:50090</value>
</property>
With this setting, the secondary name node (secondarynamenode) runs on the master VM (192.168.1.103).
Starting the YARN daemons brings up one resource manager (resourcemanager) on the master VM and three node managers (nodemanagers), one on each VM.
http://master:50070
http://192.168.1.103:50070
To use the hostname from a Windows browser, add the mapping to the C:\Windows\System32\drivers\etc\hosts file, then visit http://master:50070.
stop-all.sh
This is equivalent to running stop-dfs.sh and stop-yarn.sh together. It prints
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
meaning the stop-all.sh script has been deprecated, and we are better off using stop-dfs.sh and stop-yarn.sh.
JAVA_HOME=/usr/local/jdk1.8.0_231
ZK_HOME=/usr/local/zookeeper-3.7.0
HADOOP_HOME=/usr/local/hadoop-2.7.1
SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$ZK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export JAVA_HOME ZK_HOME HADOOP_HOME SPARK_HOME PATH CLASSPATH
export JAVA_HOME=/usr/local/jdk1.8.0_231
export SPARK_MASTER_IP=192.168.1.103 # Spark master IP
export SPARK_MASTER_PORT=7077 # Spark master port
export SPARK_WORKER_MEMORY=512m # total memory a node grants its executors
export SPARK_WORKER_CORES=1 # number of cores used on each machine
export SPARK_EXECUTOR_MEMORY=512m # memory per executor
export SPARK_EXECUTOR_CORES=1 # cores per executor
export SPARK_WORKER_INSTANCES=1 # number of worker processes per node
spark.master spark://master:7077 # set the master
spark.eventLog.enabled true # enable task (event) logging
spark.eventLog.dir hdfs://master:8021/spark-logs # task log location
./start-all.sh
./start-history-server.sh hdfs://master:8020/spark-logs
This is meant to enable the task-log history service. After start-all.sh succeeds, all three Workers show up with Ids.
Run the command: ./start-history-server.sh hdfs://master:8021/spark-logs
The history server fails to start, so http://master:18080 cannot be reached to view the task logs.
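A likely cause, though the original does not say so: the event-log URIs above use ports 8020/8021, while fs.defaultFS in this guide's core-site.xml is hdfs://master:9000, and the target directory may not exist yet in HDFS. A hedged spark-defaults.conf fragment that matches the configured NameNode address (create the directory first with hdfs dfs -mkdir /spark-logs):

```
spark.eventLog.dir hdfs://master:9000/spark-logs # must match fs.defaultFS
```

The same hdfs://master:9000/spark-logs URI would then be passed to start-history-server.sh.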
spark-shell --master spark://master:7077
vim test.txt