1. Introduction to Hadoop 3.x
Hadoop is a framework, implemented in Java, that lets users run distributed computations over massive data sets across clusters of machines using simple programming models. Hadoop 3.x consists of three main components:
- Hadoop HDFS, the distributed file storage system, which solves the storage of massive data with scalability, efficiency, and reliability.
- Hadoop YARN, the cluster resource management and job scheduling framework, which solves resource and task scheduling.
- Hadoop MapReduce, the distributed computing framework, which solves efficient computation over massive data.
Hadoop is the foundation underpinning the entire big-data ecosystem. MapReduce, the first-generation offline computing framework, is hardly used by companies anymore because of its efficiency problems, but its ideas and programming model are something no big-data learner can skip. HDFS and YARN, meanwhile, are the core technologies of the ecosystem: however the computing frameworks change, these two remain its bedrock.
2. Installing a Hadoop 3.x Cluster
Installing a Hadoop cluster really amounts to installing an HDFS cluster and a YARN cluster; the two are logically separate but usually live on the same physical machines.
Roles in an HDFS cluster:
- NameNode, which stores the file system metadata;
- SecondaryNameNode, which assists the NameNode with checkpointing and other persistence work;
- DataNode, which stores the actual data blocks.
Roles in a YARN cluster:
- ResourceManager, which schedules resources across the entire cluster;
- NodeManager, which starts and stops task containers on its own node and reports on them.
A basic understanding of each role is enough for now; later articles will cover their inner workings in detail. So a Hadoop cluster is really just an HDFS cluster plus a YARN cluster, while MapReduce is only a computing framework: its programs are essentially Java programs that get shipped to the cluster's nodes to take part in the computation.
We purchased three servers on Alibaba Cloud, each 1C2G running Linux CentOS 7.5 x64, named node1 (172.24.38.209), node2 (172.24.38.210), and node3 (172.24.38.211); make sure the three servers can reach one another. Our role plan for the three servers is as follows:
- node1 (172.24.38.209): NameNode, DataNode, ResourceManager, NodeManager
- node2 (172.24.38.210): SecondaryNameNode, DataNode, NodeManager
- node3 (172.24.38.211): DataNode, NodeManager
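The steps below address the machines by their private IPs. As an optional convenience (a small sketch, not required by anything that follows), you can map the node names on every server in /etc/hosts:
# append to /etc/hosts on all three servers (optional; the steps below use raw IPs)
172.24.38.209 node1
172.24.38.210 node2
172.24.38.211 node3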
Step 1: install the JDK runtime on every node. Assume the downloaded JDK archive is in /root/soft.
# enter the directory containing the JDK
cd /root/soft
# extract the archive
tar zxvf jdk-8u241-linux-x64.tar.gz
# configure environment variables: append the lines below to /etc/profile
vim /etc/profile
export JAVA_HOME=/root/soft/jdk1.8.0_241
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
source /etc/profile
# verify
[root@iZuf6gmsvearrd5uc3emkyZ soft]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
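Since every node needs the same JDK, one way to avoid repeating the download (a sketch, assuming the same /root/soft layout exists on node2 and node3) is to copy the extracted JDK from node1 and then repeat the /etc/profile edits on each node:
# copy the extracted JDK from node1 to the other two nodes (password prompts are expected at this stage)
scp -r /root/soft/jdk1.8.0_241 [email protected]:/root/soft
scp -r /root/soft/jdk1.8.0_241 [email protected]:/root/soft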
Step 2: on every node, synchronize the system time and disable the firewall.
# synchronize the time from an NTP server
ntpdate ntp5.aliyun.com
# disable the firewall
# check the firewall status
firewall-cmd --state
# stop the firewalld service
systemctl stop firewalld.service
# disable firewalld on boot
systemctl disable firewalld.service
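ntpdate performs a one-shot synchronization, so the clocks will slowly drift apart again. One option (a sketch, assuming the crond service is running) is to re-sync periodically via cron:
# re-sync the clock every hour; add this line via `crontab -e`
0 * * * * /usr/sbin/ntpdate ntp5.aliyun.com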
Step 3: enable passwordless SSH from node1 to node1, node2, and node3.
# generate an RSA key pair; just press Enter through all the prompts
ssh-keygen -t rsa
# enable passwordless SSH from node1 to node1 (itself)
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh-copy-id 172.24.38.209
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
[email protected]'s password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '172.24.38.209'"
and check to make sure that only the key(s) you wanted were added.
# enable passwordless SSH from node1 to node2
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh-copy-id 172.24.38.210
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host '172.24.38.210 (172.24.38.210)' can't be established.
ECDSA key fingerprint is SHA256:ah4dSYvdlmiJv/Q8aJ5Vdm/PtYGCLE61/hl8waEeeSg.
ECDSA key fingerprint is MD5:4b:53:93:61:2b:a7:6d:79:67:c4:54:ca:24:11:86:26.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
[email protected]'s password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '172.24.38.210'"
and check to make sure that only the key(s) you wanted were added.
# enable passwordless SSH from node1 to node3
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh-copy-id 172.24.38.211
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host '172.24.38.211 (172.24.38.211)' can't be established.
ECDSA key fingerprint is SHA256:F1oP0hFY+V3VHUL5rOSLEeTCv3m+y92u5RCW6RpBNDI.
ECDSA key fingerprint is MD5:ed:c7:a5:0b:f2:25:71:7b:fc:a8:e1:ce:fd:eb:19:7b.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
[email protected]'s password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '172.24.38.211'"
and check to make sure that only the key(s) you wanted were added.
# test passwordless SSH from node1 to node1
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh 172.24.38.209
Last login: Wed Nov 2 15:31:22 2022 from 172.24.38.209
Welcome to Alibaba Cloud Elastic Compute Service !
[root@iZuf6gmsvearrd5uc3emkyZ ~]# exit
logout
Connection to 172.24.38.209 closed.
# test passwordless SSH from node1 to node2
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh 172.24.38.210
Last login: Wed Nov 2 15:13:46 2022 from 122.193.199.200
Welcome to Alibaba Cloud Elastic Compute Service !
[root@iZuf6gmsvearrd5uc3emkzZ ~]# exit
logout
Connection to 172.24.38.210 closed.
# test passwordless SSH from node1 to node3
[root@iZuf6gmsvearrd5uc3emkyZ soft]# ssh 172.24.38.211
Last login: Wed Nov 2 14:50:32 2022 from 122.193.199.232
Welcome to Alibaba Cloud Elastic Compute Service !
[root@iZuf6gmsvearrd5uc3eml0Z ~]# exit
logout
Connection to 172.24.38.211 closed.
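To confirm all three hops in one go, you can run a quick non-interactive check (a sketch; BatchMode=yes makes ssh fail immediately instead of prompting for a password, so any hop that is not passwordless shows up as an error):
# verify passwordless SSH from node1 to all three nodes
for ip in 172.24.38.209 172.24.38.210 172.24.38.211; do
  ssh -o BatchMode=yes $ip hostname && echo "$ip OK"
done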
Step 4: install Hadoop on every node. Assume the downloaded Hadoop archive is in /root/soft. You can finish all the configuration changes on one server first and then scp the Hadoop directory to the other two.
# enter the directory
cd /root/soft
# extract the archive
tar zxvf hadoop-3.3.4.tar.gz
# edit the Hadoop configuration file hadoop-env.sh, adding the lines below
cd /root/soft/hadoop-3.3.4/etc/hadoop
vim hadoop-env.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# change this to your own JAVA_HOME
export JAVA_HOME=/root/soft/jdk1.8.0_241
# set the HADOOP environment variables: append the lines below to /etc/profile
vim /etc/profile
export HADOOP_HOME=/root/soft/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
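At this point a quick sanity check confirms the binaries are on the PATH (hadoop version is a standard Hadoop CLI command):
# verify the installation; should print Hadoop 3.3.4 and build information
hadoop version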
Next, edit the core-site.xml configuration file. fs.defaultFS points clients at the NameNode's RPC address, hadoop.tmp.dir sets the base directory for Hadoop data, and fs.trash.interval keeps deleted files in the HDFS trash for 1440 minutes (one day):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.24.38.209:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/data/hadoop</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
</configuration>
Next, edit the hdfs-site.xml configuration file, which places the SecondaryNameNode's HTTP endpoint on node2:
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>172.24.38.210:9868</value>
  </property>
</configuration>
Next, edit the mapred-site.xml configuration file, which tells MapReduce to run on YARN and configures the JobHistory server addresses:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>172.24.38.209:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>172.24.38.209:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>
Next, edit the yarn-site.xml configuration file, which places the ResourceManager on node1, enables the MapReduce shuffle auxiliary service, disables the physical/virtual memory checks (sensible on these small 1C2G machines), and turns on log aggregation with a seven-day retention (604800 seconds):
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>172.24.38.209</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://172.24.38.209:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>
Next, edit the workers file, which lists the nodes on which the start scripts will launch DataNodes and NodeManagers:
172.24.38.209
172.24.38.210
172.24.38.211
Then copy the fully configured hadoop-3.3.4 directory to the other two servers:
scp -r hadoop-3.3.4 [email protected]:/root/soft
scp -r hadoop-3.3.4 [email protected]:/root/soft
Finally, configure the Hadoop environment variables on every server:
vim /etc/profile
export HADOOP_HOME=/root/soft/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
Step 5: on the very first startup, format the NameNode on node1. Do this exactly once: reformatting generates a new cluster ID, and existing DataNodes will then refuse to register with the NameNode:
hdfs namenode -format
# part of the log output from a successful format
2022-11-02 16:40:49,064 INFO common.Storage: Storage directory /root/data/hadoop/dfs/name has been successfully formatted.
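If you are curious, the freshly formatted metadata directory (under the hadoop.tmp.dir configured earlier) can be inspected directly; it should contain a VERSION file, an initial fsimage, and a seen_txid file:
# peek at the NameNode metadata created by the format
ls /root/data/hadoop/dfs/name/current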
Step 6: start the HDFS cluster:
# start the HDFS cluster; the corresponding stop command is stop-dfs.sh
[root@iZuf6gmsvearrd5uc3emkyZ hadoop]# start-dfs.sh
Starting namenodes on [iZuf6gmsvearrd5uc3emkyZ]
Last login: Wed Nov 2 16:51:29 CST 2022 on pts/1
Starting datanodes
Last login: Wed Nov 2 16:53:30 CST 2022 on pts/1
172.24.38.210: WARNING: /root/soft/hadoop-3.3.4/logs does not exist. Creating.
172.24.38.211: WARNING: /root/soft/hadoop-3.3.4/logs does not exist. Creating.
Starting secondary namenodes [172.24.38.210]
Last login: Wed Nov 2 16:53:33 CST 2022 on pts/1
# check the Java processes on node1 to verify that the planned HDFS roles started
[root@iZuf6gmsvearrd5uc3emkyZ hadoop]# jps
8055 DataNode
8333 Jps
7919 NameNode
# check the Java processes on node2 to verify that the planned HDFS roles started
[root@iZuf6gmsvearrd5uc3emkzZ hadoop]# jps
1793 Jps
1738 SecondaryNameNode
1643 DataNode
# check the Java processes on node3 to verify that the planned HDFS roles started
[root@iZuf6gmsvearrd5uc3eml0Z hadoop]# jps
1605 DataNode
1671 Jps
The processes match our planned cluster roles, so the HDFS cluster started successfully.
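Besides jps, the NameNode itself can report on the cluster (hdfs dfsadmin -report is a standard command); with the setup above it should list three live DataNodes:
# print capacity and the list of live/dead DataNodes
hdfs dfsadmin -report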
Step 7: start the YARN cluster:
# start the YARN cluster; the corresponding stop command is stop-yarn.sh
[root@iZuf6gmsvearrd5uc3emkyZ hadoop]# start-yarn.sh
Starting resourcemanager
Last login: Wed Nov 2 16:53:43 CST 2022 on pts/1
Starting nodemanagers
Last login: Wed Nov 2 17:02:08 CST 2022 on pts/1
# check the Java processes on node1 to verify that the planned HDFS and YARN roles started
[root@iZuf6gmsvearrd5uc3emkyZ hadoop]# jps
8947 Jps
8487 ResourceManager
8055 DataNode
8617 NodeManager
7919 NameNode
# check the Java processes on node2 to verify that the planned HDFS and YARN roles started
[root@iZuf6gmsvearrd5uc3emkzZ hadoop]# jps
1875 NodeManager
1973 Jps
1738 SecondaryNameNode
1643 DataNode
# check the Java processes on node3 to verify that the planned HDFS and YARN roles started
[root@iZuf6gmsvearrd5uc3eml0Z hadoop]# jps
1605 DataNode
1835 Jps
1741 NodeManager
The processes match our planned cluster roles, so both the HDFS and the YARN cluster started successfully. If starting the two clusters one at a time feels tedious, you can use the one-shot commands instead:
# start both the HDFS and YARN clusters with one command
start-all.sh
# stop both the HDFS and YARN clusters with one command
stop-all.sh
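Similarly, you can confirm that all NodeManagers registered with the ResourceManager (yarn node -list is a standard command); expect three nodes in the RUNNING state:
# list the NodeManagers known to the ResourceManager
yarn node -list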
Both HDFS and YARN provide web UIs. By default, servers created on Alibaba Cloud do not expose unneeded ports to the public internet, so to reach the web UIs over the public network you must open the HDFS and YARN web ports in the security group in the Alibaba Cloud console (by default, 9870 for the NameNode UI and 8088 for the ResourceManager UI).
Then visit node1's public address (the server hosting the NameNode and ResourceManager) on those ports to browse HDFS and YARN.
Finally, let's run one of the MapReduce examples that ship with Hadoop to estimate the value of pi:
# enter the directory containing the example jars
[root@iZuf6gmsvearrd5uc3emkyZ ~]# cd /root/soft/hadoop-3.3.4/share/hadoop/mapreduce/
[root@iZuf6gmsvearrd5uc3emkyZ mapreduce]# ls
hadoop-mapreduce-client-app-3.3.4.jar hadoop-mapreduce-client-hs-plugins-3.3.4.jar hadoop-mapreduce-client-shuffle-3.3.4.jar lib-examples
hadoop-mapreduce-client-common-3.3.4.jar hadoop-mapreduce-client-jobclient-3.3.4.jar hadoop-mapreduce-client-uploader-3.3.4.jar sources
hadoop-mapreduce-client-core-3.3.4.jar hadoop-mapreduce-client-jobclient-3.3.4-tests.jar hadoop-mapreduce-examples-3.3.4.jar
hadoop-mapreduce-client-hs-3.3.4.jar hadoop-mapreduce-client-nativetask-3.3.4.jar jdiff
# run the computation job: 2 map tasks, 4 samples per map
[root@iZuf6gmsvearrd5uc3emkyZ mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.4.jar pi 2 4
Number of Maps = 2
Samples per Map = 4
Wrote input for Map #0
Wrote input for Map #1
Starting Job
# YARN's ResourceManager allocates resources for the job
2022-11-02 17:31:41,735 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /172.24.38.209:8032
2022-11-02 17:31:42,738 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1667379747044_0002
2022-11-02 17:31:43,046 INFO input.FileInputFormat: Total input files to process : 2
2022-11-02 17:31:43,953 INFO mapreduce.JobSubmitter: number of splits:2
2022-11-02 17:31:44,735 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1667379747044_0002
2022-11-02 17:31:44,736 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-11-02 17:31:45,110 INFO conf.Configuration: resource-types.xml not found
2022-11-02 17:31:45,110 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-11-02 17:31:45,264 INFO impl.YarnClientImpl: Submitted application application_1667379747044_0002
2022-11-02 17:31:45,349 INFO mapreduce.Job: The url to track the job: http://iZuf6gmsvearrd5uc3emkyZ:8088/proxy/application_1667379747044_0002/
2022-11-02 17:31:45,350 INFO mapreduce.Job: Running job: job_1667379747044_0002
2022-11-02 17:31:59,929 INFO mapreduce.Job: Job job_1667379747044_0002 running in uber mode : false
2022-11-02 17:31:59,931 INFO mapreduce.Job: map 0% reduce 0%
2022-11-02 17:32:14,551 INFO mapreduce.Job: map 50% reduce 0%
2022-11-02 17:32:15,568 INFO mapreduce.Job: map 100% reduce 0%
2022-11-02 17:32:23,754 INFO mapreduce.Job: map 100% reduce 100%
2022-11-02 17:32:25,819 INFO mapreduce.Job: Job job_1667379747044_0002 completed successfully
2022-11-02 17:32:26,168 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=50
FILE: Number of bytes written=830664
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=536
HDFS: Number of bytes written=215
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=24774
Total time spent by all reduces in occupied slots (ms)=6828
Total time spent by all map tasks (ms)=24774
Total time spent by all reduce tasks (ms)=6828
Total vcore-milliseconds taken by all map tasks=24774
Total vcore-milliseconds taken by all reduce tasks=6828
Total megabyte-milliseconds taken by all map tasks=25368576
Total megabyte-milliseconds taken by all reduce tasks=6991872
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=36
Map output materialized bytes=56
Input split bytes=300
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=56
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=781
CPU time spent (ms)=2500
Physical memory (bytes) snapshot=537804800
Virtual memory (bytes) snapshot=8220102656
Total committed heap usage (bytes)=295051264
Peak Map Physical memory (bytes)=213385216
Peak Map Virtual memory (bytes)=2737229824
Peak Reduce Physical memory (bytes)=113819648
Peak Reduce Virtual memory (bytes)=2745643008
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236
File Output Format Counters
Bytes Written=97
Job Finished in 44.671 seconds
# the computed result
Estimated value of Pi is 3.50000000000000000000
With only 2 maps and 4 samples per map the estimate is understandably coarse; passing larger arguments yields a better approximation. If a job runs for too long, you can also kill it manually:
# list the running applications
[root@iZuf6gmsvearrd5uc3emkyZ ~]# yarn application -list -appStates RUNNING
2022-11-02 17:29:29,506 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /172.24.38.209:8032
Total number of applications (application-types: [], states: [RUNNING] and tags: []):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1667379747044_0001 QuasiMonteCarlo MAPREDUCE root default RUNNING UNDEFINED 27.5% http://iZuf6gmsvearrd5uc3emkyZ:36895
# kill the specified application
[root@iZuf6gmsvearrd5uc3emkyZ ~]# yarn application -kill application_1667379747044_0001
2022-11-02 17:29:58,805 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /172.24.38.209:8032
Killing application application_1667379747044_0001
2022-11-02 17:30:00,183 INFO impl.YarnClientImpl: Killed application application_1667379747044_0001
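The examples jar contains far more than pi; wordcount is the other classic. A minimal sketch (the words.txt file and the /wc paths are made up for illustration):
# create a tiny input file and upload it to HDFS
echo "hello hadoop hello yarn" > /root/soft/words.txt
hadoop fs -mkdir -p /wc/input
hadoop fs -put /root/soft/words.txt /wc/input
# run the built-in wordcount example; the output directory must not exist yet
hadoop jar hadoop-mapreduce-examples-3.3.4.jar wordcount /wc/input /wc/output
# inspect the result (one part file, since the job uses a single reducer by default)
hadoop fs -cat /wc/output/part-r-00000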
That completes a full Hadoop cluster setup walkthrough.
In real enterprise settings it is more common to use a Hadoop cluster offered directly by a cloud vendor: when creating a MapReduce-type cluster you simply pick the cluster flavor, with no setup work of your own, though at a noticeably higher price. For personal study, building the cluster yourself is still worthwhile: it supports the learning that follows, teaches you how Hadoop is assembled, and sharpens your hands-on skills.