Set up a pseudo-distributed Hadoop environment (a fully distributed environment if resources allow) and successfully run the example programs.
At its core, Hadoop has two main layers: a storage layer (HDFS) and a processing layer (MapReduce).
Besides these two core components, the Hadoop framework also includes two further modules, Hadoop Common and Hadoop YARN.
Note: this lab mainly involves HDFS (the distributed file system), YARN (the resource management and scheduling framework), and MapReduce (offline/batch computation).
Note: master, slave1, and slave2 above are all hostnames (node names).
HDFS stores the job's data, configuration information, and so on; the final results are also stored on HDFS.
A MapReduce job executes in three phases: the map phase, the shuffle phase, and the reduce phase.
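As a rough illustration of these three phases (this is plain Unix shell, not Hadoop; word.txt stands for any local text file, such as the test file created later in this report):
# "map": split each line into words, one word per line
# "shuffle": sort so that identical words end up next to each other
# "reduce": count each group of identical words
cat word.txt | tr -s ' ' '\n' | sort | uniq -c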
During a MapReduce job:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /input /output
hadoop fs -cat    # print a file's contents on HDFS
hadoop fs -ls     # list an HDFS directory
hadoop fs -mkdir  # create an HDFS directory
hadoop fs -put ./testdata.txt /testdata    # upload a local file to HDFS (hadoop dfs is deprecated; use hadoop fs)
hadoop fs -rm -r /testdata    # -rmr is deprecated in Hadoop 3; use -rm -r to delete a directory
hadoop fs -rm /testdata.txt
Installation is straightforward; see this guide: VMware Workstation Pro 16 installation tutorial.
Installation is straightforward; see this guide: installing Ubuntu 20.04 on VMware 16.0.
After installing Ubuntu, switch the package sources to a domestic (Chinese) mirror to speed up downloads. The process is simple; see: switching sources on Ubuntu 20.04.
Note: it is recommended to disconnect from the network while installing Ubuntu. If the machine is online during installation, the installer will try to update packages from the default overseas sources, which are very slow, so the installation takes much longer.
Download the installer from the official site (an Oracle account is now required; register first if you do not have one):
Java SE Development Kit 8 Downloads
# Create the target directory
sudo mkdir /usr/local/java
# Extract the JDK archive into /usr/local/java
sudo tar -zxvf jdk-8u351-linux-x64.tar.gz -C /usr/local/java
Add the Java environment variables to Ubuntu's global environment:
dd@wdn:~/wdn_hdfs$ sudo vim /etc/profile
# Append the following to the end of the profile file
# java
export JAVA_HOME=/usr/local/java/jdk1.8.0_351
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export JAVA_PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin
export PATH=$PATH:${JAVA_PATH}
# Save and quit
:wq
# Reload the environment variables with source
dd@wdn:~/wdn_hdfs$ source /etc/profile
Then check the Java version; if it reports 1.8, the installation succeeded:
dd@wdn:~/wdn_hdfs$ java -version
java version "1.8.0_351"
Java(TM) SE Runtime Environment (build 1.8.0_351-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.351-b10, mixed mode)
If you want to use the optional start and stop scripts, ssh must be installed and sshd must be running, so that the Hadoop scripts that manage remote Hadoop daemons can be used. It is also recommended to install pdsh for better ssh resource management.
dd@wdn:~/wdn_hdfs$ sudo apt install ssh
dd@wdn:~/wdn_hdfs$ sudo apt install pdsh
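Optionally, confirm that the SSH daemon is running after installation (on Ubuntu the service name is ssh):
# Check the status of the SSH service
dd@wdn:~/wdn_hdfs$ sudo systemctl status ssh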
Download Hadoop from an Apache mirror (Apache Downloads); the version used here is Hadoop 3.3.4.
Extract the Hadoop installation package:
# Extract
dd@wdn:~/wdn_hdfs$ sudo tar -zxvf hadoop-3.3.4.tar.gz
# List the extracted files
dd@wdn:~/wdn_hdfs$ cd hadoop-3.3.4/
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ ls
bin include libexec licenses-binary NOTICE-binary README.txt share
etc lib LICENSE-binary LICENSE.txt NOTICE.txt sbin
Then check the Hadoop version to confirm the installation succeeded:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop version
Hadoop 3.3.4
Source code repository https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb
Compiled by stevel on 2022-07-29T12:32Z
Compiled with protoc 3.7.1
From source with checksum fb9dd8918a7b8a5b430d61af858f6ec
This command was run using /home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/common/hadoop-common-3.3.4.jar
Edit the etc/hadoop/hadoop-env.sh file and set the Java environment variable.
# Open etc/hadoop/hadoop-env.sh with vim
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ vim etc/hadoop/hadoop-env.sh
# Edit etc/hadoop/hadoop-env.sh and add the Java environment variable
export JAVA_HOME=/usr/local/java/jdk1.8.0_351
# Save and quit
Run the hadoop command to verify the configuration; if the usage documentation is shown, the configuration succeeded:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class
....
SUBCOMMAND may print help when invoked w/o parameters or with -h.
Hadoop can also run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
etc/hadoop/core-site.xml configuration. This file must be configured on every node. The setting below means the HDFS master node is reachable at localhost:9000. In pseudo-distributed mode there is only one node, which acts as both master and worker, so the master can be reached via localhost. In fully distributed mode, localhost should be replaced with the actual IP address of the master node.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
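As an optional sanity check, hdfs getconf reads this configuration (it works even before the daemons are started) and should report the address just configured:
# Print the effective value of fs.defaultFS (should show hdfs://localhost:9000)
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs getconf -confKey fs.defaultFS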
etc/hadoop/hdfs-site.xml configuration. This setting means every file in HDFS keeps 1 replica; in fully distributed mode a replication factor of 3 is recommended.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
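As an optional check, the effective replication factor can be read back the same way, and the replication of a file already stored in HDFS can be changed later with -setrep (the path below is just an example):
# Print the effective value of dfs.replication (should show 1 for this setup)
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs getconf -confKey dfs.replication
# Change the replication factor of an existing HDFS file to 3 and wait for it to finish
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -setrep -w 3 /testdata.txt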
The purpose of this step is to avoid having to type the Linux login password while starting Hadoop. In distributed mode, if this step is skipped, the master node will be asked to enter a password for every worker node.
Now check whether you can ssh to localhost without a password:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
If you cannot ssh to localhost without a passphrase, run the following commands:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ chmod 0600 ~/.ssh/authorized_keys
The namenode is the HDFS master node, and this command formats it. In later experiments, if an unexplained error cannot be resolved, you can re-format.
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs namenode -format
2022-12-17 22:28:54,654 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = wdn/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.3.4
STARTUP_MSG: classpath = /home/dd/wdn_hdfs/hadoop-3.3.4/etc/hadoop:/home/dd/wdn_hdfs/hadoop-applicationhistoryservice-3.3.4.jar:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.3.4.jar
...
STARTUP_MSG: build = https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb; compiled by 'stevel' on 2022-07-29T12:32Z
STARTUP_MSG: java = 1.8.0_351
************************************************************/
2022-12-17 22:28:54,662 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
...
2022-12-17 22:28:55,544 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2022-12-17 22:28:55,550 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at wdn/127.0.1.1
************************************************************/
The namenode is the master node and the datanodes are the worker nodes; this command starts all of them.
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [wdn]
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (which defaults to $HADOOP_HOME/logs).
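At this point you can optionally check which daemons are running with jps (a JDK tool); in this pseudo-distributed setup the list should roughly match the comment below:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ jps
# Expected processes (PIDs omitted): NameNode, DataNode, SecondaryNameNode, Jps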
By default, it is available at: http://localhost:9870/
Create an HDFS directory (the name is up to you):
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -mkdir /user
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -mkdir /user/wdn
These commands create the HDFS directories required to execute MapReduce jobs. Next, copy the configuration files into an input directory on HDFS:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -mkdir -p input
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -put etc/hadoop/*.xml input
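Optionally, confirm the upload before running the job:
# List the files just copied into the HDFS input directory
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -ls input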
Run the grep program from the Hadoop examples; it counts the occurrences of strings matching a given regular expression in the input (see the documentation for details):
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep input output 'dfs[a-z.]+'
2022-12-17 22:44:11,757 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
....
File System Counters
FILE: Number of bytes read=282303
FILE: Number of bytes written=918208
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=11765
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
HDFS: Number of bytes read erasure-coded=0
Map-Reduce Framework
Map input records=275
Map output records=1
Map output bytes=17
Map output materialized bytes=25
Input split bytes=118
Combine input records=1
Combine output records=1
Spilled Records=1
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=60
Total committed heap usage (bytes)=388497408
File Input Format Counters
Bytes Read=11765
...
2022-12-17 22:44:15,567 INFO mapreduce.Job: Counters: 36
File System Counters
FILE: Number of bytes read=1143072
FILE: Number of bytes written=3661477
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=59544
HDFS: Number of bytes written=233
HDFS: Number of read operations=73
HDFS: Number of large read operations=0
HDFS: Number of write operations=14
HDFS: Number of bytes read erasure-coded=0
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=17
Map output materialized bytes=25
Input split bytes=128
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=25
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1460666368
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=111
File Output Format Counters
Bytes Written=11
Copy the output files from the distributed file system to the local file system and examine them:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -get output output
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ cat output/*
1 dfsadmin
1 dfs.replication
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ sbin/stop-dfs.sh
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [wdn]
A MapReduce job can be run on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.
The following instructions assume that steps 1 through 4 above have already been executed.
etc/hadoop/mapred-site.xml configuration
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
etc/hadoop/yarn-site.xml configuration
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
By default, it is available at http://localhost:8088
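Besides the web UI, you can optionally confirm from the command line that the NodeManager has registered with the ResourceManager:
# List the NodeManagers known to the ResourceManager
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/yarn node -list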
Submit a MapReduce job to YARN.
In the Hadoop root directory, under share/hadoop/mapreduce, there is a jar named hadoop-mapreduce-examples-3.3.4.jar, which is the MapReduce example/test jar shipped with Hadoop.
Command format: hadoop jar [path to the jar file] <example name> arg1 arg2
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 3
Number of Maps = 2
Samples per Map = 3
Wrote input for Map #0
Wrote input for Map #1
Starting Job
2022-12-18 14:52:29,787 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2022-12-18 14:52:30,510 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/dd/.staging/job_1671346336225_0001
2022-12-18 14:52:30,847 INFO input.FileInputFormat: Total input files to process : 2
...
Open the YARN web UI and the running job can be seen:
When finished, stop the daemons with the following command.
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ sbin/stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
Three virtual machines are needed: one master node and two worker nodes. The IPs are assigned as follows:
master: 192.168.118.129
slave1: 192.168.118.130
slave2: 192.168.118.131
Note: the Hadoop root directory is /home/dd/wdn_hdfs/hadoop-3.3.4
etc/hadoop/hadoop-env.sh configuration
export JAVA_HOME=/usr/local/java/jdk1.8.0_351
# dd is my own username
export HDFS_NAMENODE_USER=dd
export HDFS_DATANODE_USER=dd
export HDFS_SECONDARYNAMENODE_USER=dd
export YARN_RESOURCEMANAGER_USER=dd
export YARN_NODEMANAGER_USER=dd
etc/hadoop/core-site.xml configuration
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/dd/wdn_hdfs/hadoop-3.3.4/tmp</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>dd</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml configuration
<configuration>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:9870</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>slave2:9868</value>
    </property>
</configuration>
etc/hadoop/yarn-site.xml configuration
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>slave1</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>/home/dd/wdn_hdfs/hadoop-3.3.4/etc/hadoop:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/common/lib/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/common/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/hdfs:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/hdfs/lib/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/hdfs/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/mapreduce/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/yarn:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/yarn/lib/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/yarn/*</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://slave1:19888/jobhistory/logs</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
etc/hadoop/mapred-site.xml configuration
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/mapreduce/*:/home/dd/wdn_hdfs/hadoop-3.3.4/share/hadoop/mapreduce/lib/*</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>slave1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>slave1:19888</value>
    </property>
</configuration>
workers file configuration
master
slave1
slave2
Here master is the hostname of this virtual machine, while slave1 and slave2 are the hostnames of the two virtual machines that will be cloned later.
Shut down the master virtual machine and clone it, ending up with three virtual machines:
Set the master node's IP to 192.168.118.129 and its hostname to master.
Set the slave1 node's IP to 192.168.118.130 and its hostname to slave1.
Set the slave2 node's IP to 192.168.118.131 and its hostname to slave2.
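As a sketch of how the hostname and static IP can be set on each clone, assuming Ubuntu 20.04 with netplan (the exact file name under /etc/netplan/ differs between installs); shown here for the master clone:
# Set this clone's hostname (use slave1 / slave2 on the other two machines)
dd@master:~$ sudo hostnamectl set-hostname master
# Edit the netplan configuration under /etc/netplan/ to give this machine the static
# address planned above (192.168.118.129 for master), then apply the change
dd@master:~$ sudo netplan apply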
Configure the IP address mapping on the master, slave1, and slave2 nodes:
# Open the hosts file
dd@master:~$ sudo vim /etc/hosts
# Add the IP address mappings
192.168.118.129 master
192.168.118.130 slave1
192.168.118.131 slave2
# Save
:wq
Run the following command on all three machines to install the SSH server. Ubuntu already ships with the SSH client by default, so only the server needs to be installed.
sudo apt-get install openssh-server
master configuration. Run the following commands in sequence:
# Generate a key pair
dd@master:~/.ssh$ ssh-keygen -t rsa
# Append the generated public key to the authorized_keys file in the same directory
dd@master:~/.ssh$ cat ./id_rsa.pub >> ./authorized_keys
# List the generated files
dd@master:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub
Here id_rsa is the private key, id_rsa.pub is the public key, and authorized_keys holds the public keys that are allowed to log in.
Then send master's public key to the slave1 and slave2 machines:
# Send to slave1
dd@master:~/.ssh$ scp id_rsa.pub dd@slave1:~/.ssh/master_rsa.pub
# Send to slave2
dd@master:~/.ssh$ scp id_rsa.pub dd@slave2:~/.ssh/master_rsa.pub
slave1 configuration. Run the following commands in sequence:
# Generate a key pair
dd@slave1:~/.ssh$ ssh-keygen -t rsa
# Append the generated public key to the authorized_keys file in the same directory
dd@slave1:~/.ssh$ cat ./id_rsa.pub >> ./authorized_keys
# List the generated files
dd@slave1:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub master_rsa.pub
Here master_rsa.pub is the public key just received from master.
Then send slave1's public key to the master and slave2 machines:
# Send to master
dd@slave1:~/.ssh$ scp id_rsa.pub dd@master:~/.ssh/slave1_rsa.pub
# Send to slave2
dd@slave1:~/.ssh$ scp id_rsa.pub dd@slave2:~/.ssh/slave1_rsa.pub
slave2 configuration. Run the following commands in sequence:
# Generate a key pair
dd@slave2:~/.ssh$ ssh-keygen -t rsa
# Append the generated public key to the authorized_keys file in the same directory
dd@slave2:~/.ssh$ cat ./id_rsa.pub >> ./authorized_keys
# List the generated files
dd@slave2:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub master_rsa.pub slave1_rsa.pub
Here master_rsa.pub and slave1_rsa.pub are the public keys received from master and slave1.
Then send slave2's public key to the master and slave1 machines:
# Send to master
dd@slave2:~/.ssh$ scp id_rsa.pub dd@master:~/.ssh/slave2_rsa.pub
# Send to slave1
dd@slave2:~/.ssh$ scp id_rsa.pub dd@slave1:~/.ssh/slave2_rsa.pub
On master, run the following commands to add the slave1 and slave2 keys to the authorized_keys file:
dd@master:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub known_hosts slave1_rsa.pub slave2_rsa.pub
# Add slave1's public key to the authorized_keys file
dd@master:~/.ssh$ cat ./slave1_rsa.pub >> ./authorized_keys
# Add slave2's public key to the authorized_keys file
dd@master:~/.ssh$ cat ./slave2_rsa.pub >> ./authorized_keys
# Set the permissions
chmod 0600 authorized_keys
On slave1, run the following commands to add the master and slave2 keys to the authorized_keys file:
dd@slave1:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub known_hosts master_rsa.pub slave2_rsa.pub
# Add master's public key to the authorized_keys file
dd@slave1:~/.ssh$ cat ./master_rsa.pub >> ./authorized_keys
# Add slave2's public key to the authorized_keys file
dd@slave1:~/.ssh$ cat ./slave2_rsa.pub >> ./authorized_keys
# Set the permissions
chmod 0600 authorized_keys
On slave2, run the following commands to add the master and slave1 keys to the authorized_keys file:
dd@slave2:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub known_hosts master_rsa.pub slave1_rsa.pub
# Add master's public key to the authorized_keys file
dd@slave2:~/.ssh$ cat ./master_rsa.pub >> ./authorized_keys
# Add slave1's public key to the authorized_keys file
dd@slave2:~/.ssh$ cat ./slave1_rsa.pub >> ./authorized_keys
# Set the permissions
chmod 0600 authorized_keys
Once everything is configured, test each connection to confirm it works (the first connection asks for the password once; after that it no longer prompts; you can exit after the first login and connect a second time to see the effect):
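For example, testing from master (a minimal check; repeat from the other nodes as desired):
# Log in to slave1 from master, then disconnect
dd@master:~$ ssh slave1
dd@slave1:~$ exit
# Same test for slave2
dd@master:~$ ssh slave2
dd@slave2:~$ exit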
The other two nodes can also log in without a password; this is not demonstrated here.
Hostname | NN | JN | DN | RM | NM | SNN
---|---|---|---|---|---|---
master | NameNode | | DataNode | | NodeManager |
slave1 | | JournalNode | DataNode | ResourceManager | NodeManager |
slave2 | | | DataNode | | NodeManager | SecondaryNameNode
On the master node: the first time the cluster is started, the file system must be initialized. Enter the following commands:
# Enter the hadoop-3.3.4 installation directory
dd@master:~$ cd wdn_hdfs/hadoop-3.3.4/
dd@master:~/wdn_hdfs/hadoop-3.3.4$ pwd
/home/dd/wdn_hdfs/hadoop-3.3.4
# Initialize (format) the HDFS file system
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs namenode -format
WARNING: /home/dd/wdn_hdfs/hadoop-3.3.4/logs does not exist. Creating.
2022-12-18 18:04:51,256 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.3.4
STARTUP_MSG: classpath = /home/dd/wdn_hdfs/hadoop-3.3.4/etc/hadoop
...
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/127.0.1.1
************************************************************/
Note: the HDFS file system only needs to be initialized the first time the cluster is started. Formatting the NameNode again later generates a new cluster ID, so the NameNode and DataNode cluster IDs will no longer match and the cluster will lose track of its existing data. If you really must format again, first delete the tmp and logs folders under /home/dd/wdn_hdfs/hadoop-3.3.4/ on all three machines, then format the NameNode.
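A minimal sketch of that cleanup, using the paths from this setup (run the removal on all three nodes, then format on master only):
# Remove the old HDFS data and logs left over from the previous format (all three nodes)
rm -rf /home/dd/wdn_hdfs/hadoop-3.3.4/tmp /home/dd/wdn_hdfs/hadoop-3.3.4/logs
# Re-format the NameNode (master only)
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs namenode -format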
After the file system has been initialized, continue by starting HDFS:
# Start HDFS
dd@master:~/wdn_hdfs/hadoop-3.3.4$ sbin/start-dfs.sh
Starting namenodes on [master]
Starting datanodes
Starting secondary namenodes [slave2]
Then run the jps command on each of the three virtual machines to check whether the daemons started successfully:
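Based on the configuration above, the jps output on each node should roughly contain the processes noted in the comments below (PIDs will differ; Jps itself is also listed):
dd@master:~$ jps     # expect: NameNode, DataNode, Jps
dd@slave1:~$ jps     # expect: DataNode, Jps
dd@slave2:~$ jps     # expect: DataNode, SecondaryNameNode, Jps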
Start YARN on the slave1 node. Since the configured ResourceManager node is slave1, run the following commands on slave1:
# Change to the hadoop-3.3.4 installation directory
dd@slave1:~$ cd ~/wdn_hdfs/hadoop-3.3.4/
# Start YARN
dd@slave1:~/wdn_hdfs/hadoop-3.3.4$ sbin/start-yarn.sh
Starting resourcemanager
resourcemanager is running as process 3668. Stop it first and ensure /tmp/hadoop-dd-resourcemanager.pid file is empty before retry.
Starting nodemanagers
slave1: Warning: Permanently added 'slave1' (ECDSA) to the list of known hosts.
Run jps on all three machines to check whether the startup succeeded:
Open a browser and go to http://master:9870. If the Hadoop page below appears, HDFS is configured successfully:
Then open the ResourceManager UI at http://slave1:8088; if the page below appears, the startup succeeded:
Create and open the word.txt file with the following commands:
# Create the file to upload
dd@master:~/wdn_hdfs/hadoop-3.3.4$ mkdir input
dd@master:~/wdn_hdfs/hadoop-3.3.4$ touch input/word.txt
dd@master:~/wdn_hdfs/hadoop-3.3.4$ chmod 777 input/word.txt
dd@master:~/wdn_hdfs/hadoop-3.3.4$ vim input/word.txt
Then add the following content to the file for testing:
hello hadoop hive hbase spark flink
hello hadoop hive hbase spark
hello hadoop hive hbase
hello hadoop hive
hello hadoop
hello
After adding the content, exit the editor and run the following commands:
# Upload word.txt to the /input directory on HDFS
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop fs -put /home/dd/wdn_hdfs/hadoop-3.3.4/input/word.txt /input
# Run the wordcount program to count the words in the file
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /input /output
Then view the results with the following commands:
# Go to the output directory specified when running the program and list its files
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 3 dd supergroup 0 2022-12-18 18:51 /output/_SUCCESS
-rw-r--r-- 3 dd supergroup 48 2022-12-18 18:51 /output/part-r-00000
The wordcount results are stored in the file below:
# View the result file
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hdfs dfs -cat /output/part-r-00000
flink 1
hadoop 5
hbase 3
hello 6
hive 4
spark 2
The final word counts are shown in the figure below:
Besides the command line, the results can also be viewed in the web UI:
Open a browser and go to http://master:9870 to reach the following page:
Switch to the slave1 node and start the history server:
# Start the history server
dd@slave1:~/wdn_hdfs/hadoop-3.3.4$ bin/mapred --daemon start historyserver
Then go back to the browser and open http://slave1:8088/cluster; click History to view the execution history of this job:
After opening logs, the history records can be seen:
On the master node, run the following command to shut down HDFS:
# On the master node, stop DFS
dd@master:~/wdn_hdfs/hadoop-3.3.4$ sbin/stop-dfs.sh
Stopping namenodes on [master]
Stopping datanodes
Stopping secondary namenodes [slave2]
On the slave1 node, run the following commands to shut down YARN and the history server:
# On the slave1 node, stop YARN
dd@slave1:~/wdn_hdfs/hadoop-3.3.4$ sbin/stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
# On the slave1 node, stop the history server
dd@slave1:~/wdn_hdfs/hadoop-3.3.4$ bin/mapred --daemon stop historyserver
Error location: this problem appeared in step 2 of the pseudo-distributed run, when starting the daemons.
Error message:
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ sbin/start-dfs.sh
Starting namenodes on [localhost]
pdsh@wdn: localhost: rcmd: socket: Permission denied
Starting datanodes
pdsh@wdn: localhost: rcmd: socket: Permission denied
Starting secondary namenodes [wdn]
pdsh@wdn: wdn: rcmd: socket: Permission denied
Solution: set the PDSH_RCMD_TYPE environment variable to ssh. Add the export line below to ~/.bashrc so it persists, then reload (running it directly in the current shell also works for that session):
dd@wdn:~$ export PDSH_RCMD_TYPE=ssh
dd@wdn:~$ source ~/.bashrc
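Another commonly used variant of the same fix (worth verifying on your own setup) is to put the variable into Hadoop's own environment file, so the start/stop scripts always pick it up:
# Append the setting to etc/hadoop/hadoop-env.sh under the Hadoop root directory
dd@wdn:~/wdn_hdfs/hadoop-3.3.4$ echo "export PDSH_RCMD_TYPE=ssh" >> etc/hadoop/hadoop-env.sh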
Error location: this error appeared in the fully distributed setup when uploading the file and submitting the job.
Error message:
# Submit the wordcount job
dd@master:~/wdn_hdfs/hadoop-3.3.4$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /input /output
2022-12-18 18:40:41,240 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at slave1/192.168.118.130:8032
2022-12-18 18:40:43,001 INFO ipc.Client: Retrying connect to server: slave1/192.168.118.130:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-12-18 18:40:44,003 INFO ipc.Client: Retrying connect to server: slave1/192.168.118.130:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-12-18 18:40:45,005 INFO ipc.Client: Retrying connect to server: slave1/192.168.118.130:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Solution:
Open the /etc/hosts file with sudo and comment out the 127.0.0.1 and 127.0.1.1 mappings.
Note: since this is a fully distributed setup, the hosts files on both the master and the slaves must be modified. In pseudo-distributed mode it does not matter, because there is only one machine and thus one hosts file.
# Open the hosts file
dd@master:~/wdn_hdfs/hadoop-3.3.4$ sudo vim /etc/hosts
[sudo] password for dd:
# Comment out the loopback mappings
# 127.0.0.1 localhost
# 127.0.1.1 master
192.168.118.129 master
192.168.118.130 slave1
192.168.118.131 slave2
# Save and quit
:wq
Through this experiment I gained a basic grasp of the overall Hadoop setup process. Because my laptop's performance was limited, running three virtual machines crashed the machine several times, and viewing the web pages also failed repeatedly. During the fully distributed installation and configuration of Hadoop I ran into many problems: installing the JDK, configuring environment variables, installing SSH, and so on. The experiment also revealed my own shortcomings, which I need to address in future study. An experiment tests one's patience: the steps must be carried out one at a time, and every step has to be done rigorously and carefully. Going forward I will keep strengthening my knowledge of Linux so that these tasks become second nature.