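Installing Hadoop with Docker

Building a Hadoop Platform on Docker
0. Introduction
This guide uses Docker to build a Hadoop technology platform. It covers installing Docker, Java, Scala, Hadoop, HBase, and Spark.
The cluster consists of 5 machines with the hostnames h01, h02, h03, h04, and h05. h01 is the master and the others are slaves.
Virtual machine configuration: 1 core with 2 threads, 8 GB of memory, and a 30 GB disk are recommended. An earlier setup with 4 GB of memory caused HBase and Spark to run abnormally.
JDK 1.8
Scala 2.11.12
Hadoop 3.3.3
HBase 3.0.0 (the 3.0.0-alpha-3 release)
Spark 3.3.0
1. Docker
1.1 Installing Docker on Ubuntu 22.04
On Ubuntu, Docker commands must be prefixed with sudo unless you are already logged in as root. Without sudo, Docker commands fail to run.
Install Docker from an account with administrator privileges.
mike@ubuntu2204:~$ wget -qO- https://get.docker.com/ | sh
After the installation completes, start the Docker service with sudo.
mike@ubuntu2204:~$ sudo service docker start
List all running containers. Docker has only just been installed and no container is running yet, so the result is empty, as shown below.
mike@ubuntu2204:~$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
mike@ubuntu2204:~$
1.2 Using Docker
Docker networks now provide DNS resolution, so we can create a dedicated virtual network for the Hadoop cluster with the command below. Pass-through, bridge, or macvlan mode could be used; bridge mode is used here. It lets the 5 hosts reach each other, the Docker host, and the gateway, and it gives them internet access so that software can be downloaded online.
mike@ubuntu2204:~$ sudo docker network create --driver=bridge hadoop
This command creates a virtual bridge network named hadoop, which provides automatic DNS resolution internally. Use the following command to list Docker's networks; the newly created hadoop bridge network appears in the output.
mike@ubuntu2204:~$ sudo docker network ls
[sudo] password for mike:
NETWORK ID NAME DRIVER SCOPE
3948edc3e8f3 bridge bridge local
337965dd9b1e hadoop bridge local
cb8f2c453adc host host local
fff4bd1c15ee mynet macvlan local
30e1132ad754 none null local
mike@ubuntu2204:~$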
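Finding an Ubuntu image
Open the https://hub.docker.com/ site, search for ubuntu, and pick the official image, which is the first result here.
Click the first ubuntu entry and look through the available versions; 22.04 is chosen here.
Download the ubuntu 22.04 image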
mike@ubuntu2204:~$ sudo docker pull ubuntu:22.04
List the images that have been downloaded
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 3 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
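Start a container from the image. The prompt changes to the container's own shell. The string after @ in the prompt is the container ID, which identifies the container in the docker start, attach, stop, and commit commands used later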
mike@ubuntu2204:~$ sudo docker run -it ubuntu:22.04 /bin/bash
root@27941809078c:/#
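Type exit to leave the container. It is better, though, to press Ctrl+P followed by Ctrl+Q, which detaches from the container but leaves it running in the background.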
mike@ubuntu2204:~$
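List all containers on this machine
The container just created appears here, running in the background. Because this tutorial was written after the fact, only the 5 Hadoop containers were kept to save memory; the original container had already been deleted.
mike@ubuntu2204:~$ sudo docker ps -a
[sudo] password for mike:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8016da5278ae newuhadoop "/bin/bash" 3 days ago Up 2 days h05
409c7e8aa2e9 newuhadoop "/bin/bash" 3 days ago Up 2 days h04
0d8af236e1e7 newuhadoop "/bin/bash" 3 days ago Up 2 days h03
72d62b7d4874 newuhadoop "/bin/bash" 3 days ago Up 2 days h02
d4d3ca3bbb61 newuhadoop "/bin/bash" 3 days ago Up 2 days 0.0.0.0:8088->8088/tcp, :::8088->8088/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp h01
mike@ubuntu2204:~$
Start a container that is in the exited state; the last argument is the container ID
mike@ubuntu2204:~$ sudo docker start 27941809078c
Enter a container
mike@ubuntu2204:~$ sudo docker attach 27941809078c
Stop a container
mike@ubuntu2204:~$ sudo docker stop 27941809078c
2. Installing the Cluster
This step mainly installs the JDK 1.8 environment (Spark depends on Scala, and Scala depends on JDK 1.8) together with Hadoop, which gives us the base image.
2.1 Installing Java and Scala
Enter the Ubuntu container created earlier
First switch the apt sources
2.1.1 Changing the apt sources
Back up the source list
root@27941809078c:/# cp /etc/apt/sources.list /etc/apt/sources_init.list
root@27941809078c:/#
Then delete the old source file (vim is not available yet at this point)
root@27941809078c:/# rm /etc/apt/sources.list
Copy the following command and press Enter to switch to the Aliyun Ubuntu 22.04 mirror in one step (you are already root, so the prompt is #):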
bash -c "cat << EOF > /etc/apt/sources.list && apt update
deb http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF"
Then run apt update and apt upgrade: update refreshes the package lists, and upgrade installs the newer package versions
root@27941809078c:/# apt update
root@27941809078c:/# apt upgrade
2.1.2 Installing Java and Scala
Install JDK 1.8 by entering the command directly
root@27941809078c:/# apt install openjdk-8-jdk
Test the installation
root@27941809078c:/# java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
root@27941809078c:/#
Next, install Scala
root@27941809078c:/# apt install scala
Test the installation
root@27941809078c:/# scala
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.
scala>
Type :quit to exit Scala
2.2 Installing Hadoop
Configure everything in the current container
Export the container as an image
Create five containers from this image and give each one a hostname
Enter the h01 container and start Hadoop
2.2.1 Installing Vim and the network tools
Install vim for editing files
root@27941809078c:/# apt install vim
Install the net-tools, iputils-ping, and iproute2 packages, which provide commands such as ping, ifconfig, and ip
root@27941809078c:/# apt install net-tools
root@27941809078c:/# apt install iputils-ping
root@27941809078c:/# apt install iproute2
2.2.2 Installing SSH
Install SSH and configure passwordless login. All the later containers are started from one image, like 5 locks and keys cast from the same mold that can open one another, so it is enough to configure passwordless SSH login to itself in the current container.
Install the SSH server
root@27941809078c:/# apt install openssh-server
Install the SSH client
root@27941809078c:/# apt install openssh-client
Go to the current user's home directory
root@27941809078c:/# cd ~
root@27941809078c:~#
Generate the key pair. No input is needed; just keep pressing Enter. The keys are stored in the .ssh folder under the current user's home directory. Files and folders whose names begin with . are hidden from ls; use ls -al to see them.
root@27941809078c:~# ssh-keygen -t rsa -P ""
Append the public key to the authorized_keys file
root@27941809078c:~# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@27941809078c:~#
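The key setup above can also be written so that it is safe to run more than once. The following is only a sketch, not the author's method: the function name authorize_key is hypothetical, and it assumes the public key file holds a single line. It appends the key only when it is not already listed and then applies the restrictive permissions that sshd normally expects.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE SSH_DIR   (hypothetical helper name)
# Appends the public key to SSH_DIR/authorized_keys unless an identical
# line is already present, then tightens the permissions.
authorize_key() {
    pub=$1
    dir=$2
    mkdir -p "$dir"
    touch "$dir/authorized_keys"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$(cat "$pub")" "$dir/authorized_keys"; then
        cat "$pub" >> "$dir/authorized_keys"
    fi
    chmod 700 "$dir"
    chmod 600 "$dir/authorized_keys"
}
```

Inside the container it would be called as authorize_key ~/.ssh/id_rsa.pub ~/.ssh after ssh-keygen has produced the key pair.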
Start the SSH service
root@27941809078c:~# service ssh start
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@27941809078c:~#
Log in to the container itself without a password
root@27941809078c:~# ssh 127.0.0.1
Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-41-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
Last login: Sun Jul 17 08:26:15 2022 from 172.18.0.1
* Starting OpenBSD Secure Shell server sshd
root@27941809078c:~#
Edit the .bashrc file so that the SSH service starts automatically whenever a shell starts
Open .bashrc with vim
root@27941809078c:~# vim ~/.bashrc
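Press i to put vim into insert mode; -- INSERT -- then appears at the bottom left of the terminal. Move the cursor to the end of the file and add one line (pressing Shift+G before entering insert mode jumps straight to the last line)
service ssh start
After the addition the file ends as follows (only the last few lines are shown)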
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
#if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
# . /etc/bash_completion
#fi
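service ssh start
Press Esc to make vim leave insert mode
Then type a colon : (with the keyboard in English input mode); a colon appears at the bottom left of the terminal
Then type the three characters wq!, which form a combined command
w means write (save)
q means quit
! means force
Then press Enter to exit vim
Passwordless SSH login is now fully configured.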
2.2.3 Installing Hadoop
Download the Hadoop installation archive
root@27941809078c:~# wget https://mirrors.aliyun.com/apache/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
Extract it into /usr/local and rename the folder
root@27941809078c:~# tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local/
root@27941809078c:~# cd /usr/local/
root@27941809078c:/usr/local# mv hadoop-3.3.3 hadoop
root@27941809078c:/usr/local#
Edit /etc/profile and add the following environment variables to it
First open /etc/profile with vim
vim /etc/profile
Append the following content
JAVA_HOME is the JDK installation path. For a JDK installed with apt it is the path shown here; you can check it with update-alternatives --config java.
#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HDFS_DATANODE_USER=root
export HDFS_DATANODE_SECURE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_NAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
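Apply the environment variables
root@27941809078c:/usr/local# source /etc/profile
root@27941809078c:/usr/local#
In the directory /usr/local/hadoop/etc/hadoop, edit 6 important configuration files
Edit hadoop-env.sh and append the following lines to the end of the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Edit core-site.xml so that it reads
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://h01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop3/hadoop/tmp</value>
  </property>
</configuration>
Edit hdfs-site.xml so that it reads
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop3/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop3/hadoop/hdfs/data</value>
  </property>
</configuration>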
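The Hadoop site files in this section all share one shape: name/value pairs wrapped in property elements inside a configuration element. As an illustration only, the sketch below generates such a file from argument pairs. The helper names hadoop_property and hadoop_site_xml and the CONF_DIR variable are hypothetical, values are not XML-escaped, and the output goes to a scratch directory by default so that nothing under /usr/local/hadoop is overwritten.

```shell
#!/bin/sh
# Print one <property> element for a Hadoop *-site.xml file.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a complete configuration document from name/value argument pairs.
hadoop_site_xml() {
    printf '<?xml version="1.0" encoding="UTF-8"?>\n<configuration>\n'
    while [ "$#" -ge 2 ]; do
        hadoop_property "$1" "$2"
        shift 2
    done
    printf '</configuration>\n'
}

# CONF_DIR is an assumption of this sketch; it defaults to a temporary
# directory rather than the real Hadoop configuration directory.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}
hadoop_site_xml \
    fs.default.name hdfs://h01:9000 \
    hadoop.tmp.dir /home/hadoop3/hadoop/tmp > "$CONF_DIR/core-site.xml"
echo "wrote $CONF_DIR/core-site.xml"
```

Setting CONF_DIR=/usr/local/hadoop/etc/hadoop before running it would write the real file; reviewing the generated XML first is the safer order.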
Edit mapred-site.xml so that it reads
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>
      /usr/local/hadoop/etc/hadoop,
      /usr/local/hadoop/share/hadoop/common/*,
      /usr/local/hadoop/share/hadoop/common/lib/*,
      /usr/local/hadoop/share/hadoop/hdfs/*,
      /usr/local/hadoop/share/hadoop/hdfs/lib/*,
      /usr/local/hadoop/share/hadoop/mapreduce/*,
      /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
      /usr/local/hadoop/share/hadoop/yarn/*,
      /usr/local/hadoop/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>
Edit yarn-site.xml so that it reads
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>h01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Edit the workers file so that it reads
h01
h02
h03
h04
h05
Hadoop is now configured
2.2.4 Starting the cluster in Docker
First export the current container as an image and list the images. Press Ctrl+P followed by Ctrl+Q to detach from the container and return to the host
mike@ubuntu2204:~$ sudo docker commit -m "hadoop" -a "hadoop" 27941809078c newuhadoop
sha256:648d8e082a231919faeaa14e09f5ce369b20879544576c03ef94074daf978823
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 4 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
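Open 5 terminals and run the following commands, one in each
The first command starts h01, which serves as the master node, so its ports are published to make the web pages reachable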
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h01" --name "h01" -p
9870:9870 -p 8088:8088 newuhadoop /bin/bash
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h01:/#
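The remaining four commands are almost identical. Note: after starting each container, press Ctrl+P followed by Ctrl+Q to return to the host before starting the next container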
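Because the four worker commands differ only in the host name, they can be produced by a loop. This sketch is an alternative to the interactive approach in this guide, not the author's method: the function name worker_run_commands is hypothetical, and it assumes the -dit flags (detached, interactive, with a TTY) so that no manual detach is needed. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) one `docker run` command per worker host.
worker_run_commands() {
    for host in h02 h03 h04 h05; do
        echo "sudo docker run -dit --network hadoop -h $host --name $host newuhadoop /bin/bash"
    done
}

worker_run_commands
```

Once the printed commands look right, piping them to a shell (worker_run_commands | sh) would execute them.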
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h02" --name "h02"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h02:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h03" --name "h03"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h03:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h04" --name "h04"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h04:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h05" --name "h05"
newuhadoop /bin/bash
[sudo] password for mike:
* Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h05:/#
Next, on host h01, start the Hadoop cluster
Format the NameNode first; without formatting, HDFS will not start
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
Go to Hadoop's sbin directory
root@h01:/# cd /usr/local/hadoop/sbin/
root@h01:/usr/local/hadoop/sbin#
Start Hadoop
root@h01:/usr/local/hadoop/sbin# ./start-all.sh
Starting namenodes on [h01]
h01: Warning: Permanently added 'h01,172.18.0.2' (ECDSA) to the list of known
hosts.
Starting datanodes
h05: Warning: Permanently added 'h05,172.18.0.6' (ECDSA) to the list of known
hosts.
h02: Warning: Permanently added 'h02,172.18.0.3' (ECDSA) to the list of known
hosts.
h03: Warning: Permanently added 'h03,172.18.0.4' (ECDSA) to the list of known
hosts.
h04: Warning: Permanently added 'h04,172.18.0.5' (ECDSA) to the list of known
hosts.
h03: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h05: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h02: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h04: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [h01]
Starting resourcemanager
Starting nodemanagers
root@h01:/usr/local/hadoop/sbin#
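Use jps to check the startup status of the cluster (the list is not fixed and varies with the applications running, but at least 3 processes should be present; the listing below was captured after HBase and Spark had also been started)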
root@h01:~# jps
10017 HRegionServer
10609 Master
9778 HQuorumPeer
8245 SecondaryNameNode
8087 DataNode
9881 HMaster
41081 Jps
10684 Worker
7965 NameNode
8477 ResourceManager
8591 NodeManager
root@h01:~#
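Reading the jps listing by eye gets tedious when it is repeated on five hosts. As a rough sketch, the function below reads jps output on standard input and reports which of the expected Hadoop daemons are missing. The name check_daemons is hypothetical, and the expected list assumes the layout in this guide, where h01 appears in the workers file and therefore also runs DataNode and NodeManager.

```shell
#!/bin/sh
# Reads `jps` output on stdin; prints each expected daemon that is absent.
# Exit status is 0 only when every daemon in the list is present.
check_daemons() {
    awk -v want="NameNode DataNode SecondaryNameNode ResourceManager NodeManager" '
        { seen[$2] = 1 }    # column 2 of each line is the process name
        END {
            n = split(want, names, " ")
            missing = 0
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) { print "missing: " names[i]; missing = 1 }
            exit missing
        }'
}
```

A typical call would be jps | check_daemons && echo "all Hadoop daemons are up". Names are compared exactly, so SecondaryNameNode does not count as NameNode.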
Use the command ./hdfs dfsadmin -report to view the status of the distributed file system
root@h01:/usr/local/hadoop/bin# ./hdfs dfsadmin -report
Configured Capacity: 90810798080 (84.57 GB)
Present Capacity: 24106247929 (22.45 GB)
DFS Remaining: 24097781497 (22.44 GB)
DFS Used: 8466432 (8.07 MB)
DFS Used%: 0.04%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (5):
Name: 172.18.0.2:9866 (h01)
Hostname: h01
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 2875392 (2.74 MB)
Non DFS Used: 11887669248 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.02%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:36:54 GMT 2022
Num of Blocks: 293
Name: 172.18.0.3:9866 (h02.hadoop)
Hostname: h02
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1396736 (1.33 MB)
Non DFS Used: 11889147904 (11.07 GB)
DFS Remaining: 4846399828 (4.51 GB)
DFS Used%: 0.01%
DFS Remaining%: 26.68%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 8
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:51:39 GMT 2022
Num of Blocks: 153
Name: 172.18.0.4:9866 (h03.hadoop)
Hostname: h03
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1323008 (1.26 MB)
Non DFS Used: 11889221632 (11.07 GB)
DFS Remaining: 5114835114 (4.76 GB)
DFS Used%: 0.01%
DFS Remaining%: 28.16%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:14:39 GMT 2022
Num of Blocks: 151
Name: 172.18.0.5:9866 (h04.hadoop)
Hostname: h04
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1527808 (1.46 MB)
Non DFS Used: 11889016832 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 00:42:09 GMT 2022
Num of Blocks: 134
Name: 172.18.0.6:9866 (h05.hadoop)
Hostname: h05
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1343488 (1.28 MB)
Non DFS Used: 11889201152 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:36:21 GMT 2022
Num of Blocks: 149
root@h01:/usr/local/hadoop/bin#
Open ports 8088 and 9870 of the Docker host in a browser to see the monitoring pages
The Hadoop cluster is now up
2.2.5 Running the built-in WordCount example
Use the license file as the input to count
root@h01:/usr/local/hadoop# cat LICENSE.txt > file1.txt
root@h01:/usr/local/hadoop# ls
Create the input folder in HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -mkdir /input
root@h01:/usr/local/hadoop/bin#
Upload file1.txt to HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -put ../file1.txt /input
root@h01:/usr/local/hadoop/bin#
List the contents of the input folder in HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /input
Found 1 items
-rw-r--r-- 2 root supergroup 15217 2022-07-17 08:50 /input/file1.txt
root@h01:/usr/local/hadoop/bin#
Run the wordcount example program; the command and its output are as follows:
root@h01:/usr/local/hadoop/bin# ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar wordcount /input /output
2022-07-20 05:12:38,394 INFO client.DefaultNoHARMFailoverProxyProvider:
Connecting to ResourceManager at h01/172.18.0.2:8032
2022-07-20 05:12:38,816 INFO mapreduce.JobResourceUploader: Disabling Erasure
Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1658047711391_0002
2022-07-20 05:12:39,076 INFO input.FileInputFormat: Total input files to process
: 1
2022-07-20 05:12:39,198 INFO mapreduce.JobSubmitter: number of splits:1
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1658047711391_0002
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-07-20 05:12:39,674 INFO conf.Configuration: resource-types.xml not found
2022-07-20 05:12:39,674 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-07-20 05:12:39,836 INFO impl.YarnClientImpl: Submitted application
application_1658047711391_0002
2022-07-20 05:12:39,880 INFO mapreduce.Job: The url to track the job:
http://h01:8088/proxy/application_1658047711391_0002/
2022-07-20 05:12:39,882 INFO mapreduce.Job: Running job: job_1658047711391_0002
2022-07-20 05:12:49,171 INFO mapreduce.Job: Job job_1658047711391_0002 running in
uber mode : false
2022-07-20 05:12:49,174 INFO mapreduce.Job: map 0% reduce 0%
2022-07-20 05:12:54,285 INFO mapreduce.Job: map 100% reduce 0%
2022-07-20 05:13:01,356 INFO mapreduce.Job: map 100% reduce 100%
2022-07-20 05:13:02,391 INFO mapreduce.Job: Job job_1658047711391_0002 completed
successfully
2022-07-20 05:13:02,524 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=12507
FILE: Number of bytes written=577413
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=15313
HDFS: Number of bytes written=9894
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3141
Total time spent by all reduces in occupied slots (ms)=3811
Total time spent by all map tasks (ms)=3141
Total time spent by all reduce tasks (ms)=3811
Total vcore-milliseconds taken by all map tasks=3141
Total vcore-milliseconds taken by all reduce tasks=3811
Total megabyte-milliseconds taken by all map tasks=3216384
Total megabyte-milliseconds taken by all reduce tasks=3902464
Map-Reduce Framework
Map input records=270
Map output records=1672
Map output bytes=20756
Map output materialized bytes=12507
Input split bytes=96
Combine input records=1672
Combine output records=657
Reduce input groups=657
Reduce shuffle bytes=12507
Reduce input records=657
Reduce output records=657
Spilled Records=1314
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=126
CPU time spent (ms)=1110
Physical memory (bytes) snapshot=474148864
Virtual memory (bytes) snapshot=5063700480
Total committed heap usage (bytes)=450887680
Peak Map Physical memory (bytes)=288309248
Peak Map Virtual memory (bytes)=2528395264
Peak Reduce Physical memory (bytes)=185839616
Peak Reduce Virtual memory (bytes)=2535305216
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=15217
File Output Format Counters
Bytes Written=9894
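For a quick local cross-check of what the job computes, the same counting can be approximated with standard tools. This is a sketch under the assumption that the example job splits its input on whitespace and counts every token; the function name local_wordcount is hypothetical, and small differences from the MapReduce result (for example in sort order for non-ASCII text) are possible.

```shell
#!/bin/sh
# Count whitespace-separated tokens read from stdin and print
# "token<TAB>count", one per line, sorted by token in byte order.
local_wordcount() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) printf "%s\t%d\n", w, count[w] }' | LC_ALL=C sort
}
```

Running local_wordcount < /usr/local/hadoop/file1.txt gives a listing that can be compared with the part-r-00000 file written by the job.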
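List the contents of the /output folder in HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /output
Found 2 items
-rw-r--r-- 2 root supergroup 0 2022-07-20 05:13 /output/_SUCCESS
-rw-r--r-- 2 root supergroup 9894 2022-07-20 05:13 /output/part-r-00000
root@h01:/usr/local/hadoop/bin#
View the contents of the part-r-00000 file
root@h01:/usr/local/hadoop/bin# ./hadoop fs -cat /output/part-r-00000
This concludes the Hadoop part
2.3 Installing HBase
Install HBase on top of the Hadoop cluster
Download HBase 3.0.0
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/3.0.0-alpha-3/hbase-3.0.0-alpha-3-bin.tar.gz
Extract it into /usr/local. The rest of this guide uses the path /usr/local/hbase-3.0.0, so rename the extracted folder (assumed here to be named hbase-3.0.0-alpha-3) to match
root@h01:~# tar -zxvf hbase-3.0.0-alpha-3-bin.tar.gz -C /usr/local/
root@h01:~# mv /usr/local/hbase-3.0.0-alpha-3 /usr/local/hbase-3.0.0
Edit the /etc/profile environment file and append the HBase variables below
export HBASE_HOME=/usr/local/hbase-3.0.0
export PATH=$PATH:$HBASE_HOME/bin
Apply the environment file
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
Use ssh h02 to enter the h02 container and edit its profile file in the same way, then do the same on h03, h04, and h05. In other words, every container needs those two environment lines appended to /etc/profile
In the directory /usr/local/hbase-3.0.0/conf, edit the configuration
Edit hbase-env.sh and append
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true
Edit hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://h01:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>h01:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>h01,h02,h03,h04,h05</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zoodata</value>
  </property>
</configuration>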
Edit the regionservers file so that it reads
h01
h02
h03
h04
h05
Use scp to copy the configured HBase to the other 4 containers
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h02:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h03:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h04:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h05:/usr/local/
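The four scp commands follow one pattern, and the same pattern comes up again whenever a configured directory has to be copied to the workers. A minimal sketch, assuming the worker names h02 to h05 from this guide and a hypothetical function name push_commands, prints the commands for any directory so they can be checked before being run:

```shell
#!/bin/sh
# Print one scp command per worker host for the given directory.
push_commands() {
    dir=$1
    for host in h02 h03 h04 h05; do
        echo "scp -r $dir root@$host:/usr/local/"
    done
}

push_commands /usr/local/hbase-3.0.0
```

Piping the output to sh on h01 would perform the copies; the passwordless SSH setup from section 2.2.2 is what lets them run without prompts.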
Start HBase
root@h01:/usr/local/hbase-3.0.0/bin# ./start-hbase.sh
h04: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h04.out
h02: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h02.out
h03: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h03.out
h05: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h05.out
h01: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h01.out
running master, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase--master-h01.out
h05: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h05.out
h01: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h01.out
h04: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h04.out
h03: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h03.out
h02: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h02.out
root@h01:/usr/local/hbase-3.0.0/bin#
Open the HBase shell
root@h01:/usr/local/hbase-3.0.0/bin# ./hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/book.html#shell
Version 3.0.0-alpha-3, rb3657484850f9fa9679f2186bf53e7df768f21c7, Wed Jun 15
07:56:54 UTC 2022
Took 0.0017 seconds
hbase:001:0>
Testing HBase
Create the table member
hbase:006:0> create 'member','id','address','info'
Created table member
Took 0.6838 seconds
=> Hbase::Table - member
hbase:007:0>
Add data and view the data in the table
hbase:007:0> put 'member', 'debugo','id','11'
Took 0.1258 seconds
hbase:008:0> put 'member', 'debugo','info:age','27'
Took 0.0108 seconds
hbase:009:0> count 'member'
1 row(s)
Took 0.0499 seconds
=> 1
hbase:010:0> scan 'member'
ROW COLUMN+CELL
debugo column=id:, timestamp=2022-07-20T05:37:58.720, value=11
debugo column=info:age, timestamp=2022-07-20T05:38:11.302, value=27
1 row(s)
Took 0.0384 seconds
hbase:011:0>
2.4 Installing Spark
Install Spark on top of Hadoop
Download Spark 3.3.0
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
Extract it into /usr/local
root@h01:~# tar -zxvf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
Rename the folder
root@h01:~# cd /usr/local/
root@h01:/usr/local# mv spark-3.3.0-bin-hadoop3 spark-3.3.0
Edit the /etc/profile environment file and append the Spark variables below
export SPARK_HOME=/usr/local/spark-3.3.0
export PATH=$PATH:$SPARK_HOME/bin
Apply the environment file
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
Use ssh h02 (and likewise h03, h04, and h05) to enter the other four containers and make the same change in each. In other words, every container needs those two environment lines appended to /etc/profile
In the directory /usr/local/spark-3.3.0/conf, edit the configuration
Rename the spark-env.sh template
root@h01:/usr/local/spark-3.3.0/conf# mv spark-env.sh.template spark-env.sh
root@h01:/usr/local/spark-3.3.0/conf#
Edit spark-env.sh and append
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_HOST=h01
export SPARK_MASTER_IP=h01
export SPARK_WORKER_MEMORY=4g
Rename the workers template. In Spark 3.3.0 the file is called workers.template; releases before 3.1 called it slaves.template
root@h01:/usr/local/spark-3.3.0/conf# mv workers.template workers
root@h01:/usr/local/spark-3.3.0/conf#
Edit workers so that it reads
h01
h02
h03
h04
h05
Use scp to copy the configured Spark to the other 4 containers
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h02:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h03:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h04:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h05:/usr/local/
Start Spark
root@h01:/usr/local/spark-3.3.0/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-3.3.0/logs/spark--org.apache.spark.deploy.master.Master-1-h01.out
h03: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h03.out
h02: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h02.out
h04: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h04.out
h05: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h05.out
h01: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h01.out
root@h01:/usr/local/spark-3.3.0/sbin#
3 Miscellaneous
3.1 Reformatting HDFS
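Reference: https://blog.csdn.net/gis_101/article/details/52821946
Reformatting deletes all data in the cluster, so consider backing up or migrating the data before you format.
First, on the master node (the NameNode), delete the contents of Hadoop's temporary storage directory tmp, the NameNode's persistent metadata directory dfs/name, and Hadoop's log directory logs (delete the contents of the directories, not the directories themselves).
On every data node (DataNode), delete the contents of the temporary storage directory tmp, the data directory dfs/data, and the log directory logs.
Format a new distributed file system:
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
Notes:
Hadoop's temporary storage directory tmp is the hadoop.tmp.dir property in core-site.xml, whose default value is /tmp/hadoop-${user.name}. If hadoop.tmp.dir is not configured, Hadoop creates a directory under /tmp when formatting. For example, if Hadoop is installed and configured under the user cloud, the temporary storage directory is /tmp/hadoop-cloud.
The NameNode metadata directory is the dfs.namenode.name.dir property in hdfs-site.xml, whose default value is ${hadoop.tmp.dir}/dfs/name. Likewise, if the property is not configured, Hadoop creates the directory itself when formatting. Before formatting you must clear the contents of dfs/data on every worker node (DataNode); otherwise the worker daemons fail to start when Hadoop is launched. The reason is that each format of the master node writes a new clusterID and namespaceID into the VERSION file under the NameNode's dfs/name/current directory. If a worker's dfs/data/current directory still exists, it is not recreated and keeps the old IDs, so the worker's IDs no longer match those of the NameNode, and Hadoop fails to start.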
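The two recurring tasks in this procedure, emptying a directory without removing it and checking whether two nodes agree on the clusterID, can be sketched as small shell functions. The function names are hypothetical, and the sketch assumes that VERSION files contain a line of the form clusterID=... and that find supports -mindepth and -delete (true for GNU findutils on Ubuntu).

```shell
#!/bin/sh
# Remove everything inside DIR but keep DIR itself.
clear_contents() {
    dir=$1
    [ -d "$dir" ] || return 0
    find "$dir" -mindepth 1 -delete
}

# Print the clusterID recorded in a Hadoop VERSION file.
cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Succeed only if the two VERSION files carry the same clusterID.
same_cluster() {
    [ "$(cluster_id "$1")" = "$(cluster_id "$2")" ]
}
```

On a node, clear_contents would be called once for each of the tmp, dfs/name or dfs/data, and logs directories used by that installation. After a format, same_cluster can compare the NameNode's VERSION file with a copy of a DataNode's VERSION file to confirm that the IDs match.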
