Data Warehouse Learning (1): Environment Setup

Table of Contents

  • 1. Environment Installation
    • 1.1 Prepare three virtual machines: hadoop102, hadoop103, hadoop104
      • 1.1.1 Vagrant VM configuration
      • 1.1.2 VM installation
    • 1.2 Change the root password
    • 1.3 Edit the hosts file
    • 1.4 Add a hadoop user
    • 1.5 Configure passwordless SSH login
    • 1.6 Configure the xsync script
    • 1.7 Create working directories
    • 1.8 Install the JDK
    • 1.9 Notes on bash configuration file load order
    • 1.10 Generate mock data
    • 1.11 The xcall script
  • 2. Install the Distributed Cluster
    • 2.1 Install Hadoop
      • 2.1.1 Installation
      • 2.1.2 Configure environment variables
      • 2.1.3 Distribute
      • 2.1.4 Cluster configuration
      • 2.1.5 Distribute the configuration
      • 2.1.6 Enable the log and history services
        • 2.1.6.1 Configuration
        • 2.1.6.2 Distribute
      • 2.1.7 Format the NameNode
      • 2.1.9 Start HDFS
      • 2.1.10 Start YARN
      • 2.1.11 Access the web UIs
      • 2.1.12 Create a cluster start/stop script
      • 2.1.13 Hadoop tuning
        • 2.1.13.1 Production experience: multiple HDFS storage directories
        • 2.1.13.2 Cluster data balancing
        • 2.1.13.3 Production experience: LZO compression support
        • 2.1.13.4 Production experience: creating LZO indexes
        • 2.1.13.5 Production experience: benchmarking
        • 2.1.13.6 Production experience: Hadoop parameter tuning
    • 2.2 Install Zookeeper
      • 2.2.1 Cluster planning
      • 2.2.2 Extract and install
      • 2.2.3 Configure the service
      • 2.2.4 Start the service
      • 2.2.5 ZK cluster start/stop script
    • 2.3 Install Kafka
      • 2.3.1 Cluster planning
      • 2.3.2 Extract and install
      • 2.3.3 Configuration
      • 2.3.4 Configure environment variables
      • 2.3.5 Kafka cluster start/stop script
      • 2.3.6 Production experience: sizing the Kafka cluster
    • 2.4 Install Flume
      • 2.4.1 Cluster planning
      • 2.4.2 Download location
      • 2.4.3 Installation and deployment
      • 2.4.4 Distribute
      • 2.4.5 Production experience: Flume component selection
    • 2.5 Install MySQL
      • 2.5.1 Check for an existing MySQL installation and remove it
      • 2.5.2 Install dependencies
      • 2.5.3 Install MySQL
      • 2.5.4 Start the service
      • 2.5.5 Check the service status
      • 2.5.6 Look up the initial password
      • 2.5.7 Change the password
      • 2.5.8 Enable remote login
    • 2.6 Generate data
    • 2.7 Install Sqoop
      • 2.7.1 Extract and install
      • 2.7.2 Rename
      • 2.7.3 Configuration
      • 2.7.4 Add the driver to the lib directory
      • 2.7.5 Verify the installation
      • 2.7.6 Test the database connection
      • 2.7.7 Test that Sqoop can import into HDFS
    • 2.8 Synchronization strategies
    • 2.9 Importing business data into HDFS
      • 2.9.1 Table synchronization strategy analysis
      • 2.9.2 First-day full synchronization script
        • 2.9.2.1 Writing the script
        • 2.9.2.2 Using the script
      • 2.9.3 Daily synchronization script
        • 2.9.3.1 Writing the script
        • 2.9.3.2 Using the script
    • 2.10 Install Hive
      • 2.10.1 Extract and install
      • 2.10.2 Rename
      • 2.10.3 Configuration
      • 2.10.4 Configure environment variables
      • 2.10.5 Copy the driver
      • 2.10.6 Log in to MySQL
      • 2.10.7 Create the metastore database
      • 2.10.8 Initialize the metastore
    • 2.11 Install Spark
      • 2.11.1 Differences
      • 2.11.2 Installation
      • 2.11.3 Configuration
      • 2.11.4 Upload the Spark pure-version jars to HDFS
      • 2.11.5 Modify hive-site.xml
      • 2.11.6 Increase the Application Master resource ratio
      • 2.11.7 Test
    • 2.12 Install HBase
      • 2.12.1 Extract to the target directory
      • 2.12.2 Configuration
      • 2.12.3 Create a symlink
      • 2.12.4 Configure environment variables
      • 2.12.5 Distribute
      • 2.12.6 Start the service
      • 2.12.7 Access the web UI
      • 2.12.8 Possible start-up problems
    • 2.13 Install Solr
      • 2.13.1 Create a system user
      • 2.13.2 Extract and rename
      • 2.13.3 Distribute
      • 2.13.4 Grant ownership to the solr user
      • 2.13.5 Modify the configuration file
      • 2.13.6 Start the service
      • 2.13.7 Check the web UI
    • 2.14 Install Atlas
      • 2.14.1 Extract, install and rename
      • 2.14.2 Configuration
      • 2.14.3 Configure Solr
      • 2.14.4 Configure Kafka
      • 2.14.5 Configure Atlas
      • 2.14.6 Kerberos-related configuration
    • 2.15 Integrate Atlas with Hive
    • 2.16 Extensions
      • 2.16.1 Building Atlas from source
        • Install Maven
        • Build the Atlas source
        • Atlas memory configuration
        • Configure the username and password
    • 2.17 Install Kylin
      • 2.17.1 Extract and install
      • 2.17.2 Adjust compatibility settings
      • 2.17.3 Start
      • 2.17.4 View the web admin page

1. Environment Installation

1.1 Prepare three virtual machines: hadoop102, hadoop103, hadoop104

1.1.1 Vagrant VM configuration

  • hadoop102
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"
  config.vm.hostname = "hadoop102"
  config.vm.network "private_network", ip: "192.168.10.102"
  # VM resources
  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.memory = "16384"
    vb.cpus = 2
  end
  # Initial provisioning script
  config.vm.provision "shell", inline: <<-SHELL
    yum -y update
    yum install -y vim wget
    # Allow password-based SSH login
    sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
    systemctl restart sshd
    # Disable the firewall
    systemctl stop firewalld.service
    systemctl disable firewalld.service
    # Disable SELinux
    sed -i 's/enforcing/disabled/' /etc/selinux/config
    sysctl --system
    # Enable time synchronization
    yum install ntpdate -y
    ntpdate time.windows.com
    # Set the time zone
    timedatectl set-timezone Asia/Shanghai
    # Install the lrzsz utilities
    yum install -y lrzsz
    # Switch the yum repo to the Aliyun mirror
    mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
    wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
    yum makecache
  SHELL
end
  • hadoop103
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"
  config.vm.hostname = "hadoop103"
  config.vm.network "private_network", ip: "192.168.10.103"
  # VM resources
  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.memory = "8192"
    vb.cpus = 2
  end
  # Initial provisioning script
  config.vm.provision "shell", inline: <<-SHELL
    yum -y update
    yum install -y vim wget
    # Allow password-based SSH login
    sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
    systemctl restart sshd
    # Disable the firewall
    systemctl stop firewalld.service
    systemctl disable firewalld.service
    # Disable SELinux
    sed -i 's/enforcing/disabled/' /etc/selinux/config
    sysctl --system
    # Enable time synchronization
    yum install ntpdate -y
    ntpdate time.windows.com
    # Set the time zone
    timedatectl set-timezone Asia/Shanghai
    # Install the lrzsz utilities
    yum install -y lrzsz
    # Switch the yum repo to the Aliyun mirror
    mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
    wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
    yum makecache
  SHELL
end
  • hadoop104
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"
  config.vm.hostname = "hadoop104"
  config.vm.network "private_network", ip: "192.168.10.104"
  # VM resources
  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.memory = "8192"
    vb.cpus = 2
  end
  # Initial provisioning script
  config.vm.provision "shell", inline: <<-SHELL
    yum -y update
    yum install -y vim wget
    # Allow password-based SSH login
    sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
    systemctl restart sshd
    # Disable the firewall
    systemctl stop firewalld.service
    systemctl disable firewalld.service
    # Disable SELinux
    sed -i 's/enforcing/disabled/' /etc/selinux/config
    sysctl --system
    # Enable time synchronization
    yum install ntpdate -y
    ntpdate time.windows.com
    # Set the time zone
    timedatectl set-timezone Asia/Shanghai
    # Install the lrzsz utilities
    yum install -y lrzsz
    # Switch the yum repo to the Aliyun mirror
    mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
    wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
    yum makecache
  SHELL
end

1.1.2 VM installation

vagrant up

1.2 Change the root password

[root@hadoop102 ~]# passwd 
Changing password for user root.
New password: 
BAD PASSWORD: The password is shorter than 8 characters
Retype new password: 
passwd: all authentication tokens updated successfully.

1.3 Edit the hosts file

192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
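
These three mappings need to be present in /etc/hosts on all three machines. A minimal sketch for appending them (run as root on each host, assuming none of the entries exist yet):

cat >> /etc/hosts <<'EOF'
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
EOF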

1.4 Add a hadoop user

# Add the user
useradd hadoop
# Set its password
passwd hadoop
# Edit the sudoers file
vim /etc/sudoers
# Grant the hadoop user passwordless sudo by adding this line
hadoop      ALL=(ALL) NOPASSWD: ALL

1.5 Configure passwordless SSH login

Passwordless login must be set up from each of the three servers to every host, including the server itself; a scripted version is sketched after the transcripts below.

  • hadoop102
[hadoop@hadoop102 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:rdNMUzGFduBxFh8fSavrF9px7CnLmDnHldiNnAG3zjg hadoop@hadoop102
The key's randomart image is:
+---[RSA 2048]----+
|            =+*=.|
|           .+*oo=|
|           .o+ oo|
|         . .  +  |
|        S +  *o=o|
|         = .E.B*+|
|        o o .o+.=|
|         . .*= +.|
|           +o++  |
+----[SHA256]-----+

[hadoop@hadoop102 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop102 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (192.168.10.103)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop102 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (192.168.10.104)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password: 
Permission denied, please try again.
hadoop@hadoop104's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.

  • hadoop103
[hadoop@hadoop103 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:apWniTAC+OP8VkuBklBAtGj+dd5wGJ7NKUIA8vFEY3E hadoop@hadoop103
The key's randomart image is:
+---[RSA 2048]----+
|=*+o*.E          |
|=..=.o           |
|++...o .         |
|ooo o o *..      |
| .+.oo OS=.      |
| o.o.o*+=+       |
|  o. o+oo.       |
|   ....          |
|   ..            |
+----[SHA256]-----+
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop102 (192.168.10.102)' can't be established.
ECDSA key fingerprint is SHA256:I2sJV0iLNvRNEeK50zp6/pYiwfmp+wiopWJiqiU2xrA.
ECDSA key fingerprint is MD5:94:99:9d:0f:a5:80:c4:2f:2c:af:88:cb:6a:12:91:65.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop103 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (127.0.1.1)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop103 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (192.168.10.104)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.

  • hadoop104
[hadoop@hadoop104 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:Ke7bWIHCwv1Q1NEjCYENcRayseJCGgQqBqdXk2xomzU hadoop@hadoop104
The key's randomart image is:
+---[RSA 2048]----+
|+..oB*B=.+       |
|+oo.EX. + o      |
|=+o=o..  . .     |
|==o+ . . .       |
|o + = o S        |
| . . = . .       |
|      o .        |
|     . +         |
|      +..        |
+----[SHA256]-----+

[hadoop@hadoop104 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop102 (192.168.10.102)' can't be established.
ECDSA key fingerprint is SHA256:I2sJV0iLNvRNEeK50zp6/pYiwfmp+wiopWJiqiU2xrA.
ECDSA key fingerprint is MD5:94:99:9d:0f:a5:80:c4:2f:2c:af:88:cb:6a:12:91:65.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop104 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (192.168.10.103)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@hadoop104 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (127.0.1.1)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.
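
The interactive ssh-copy-id round trip shown above can also be scripted. A minimal sketch, assuming the hadoop user and its password already exist on all three hosts (run it once as the hadoop user on each machine; ssh-copy-id still prompts for the password of each target):

# generate a key pair without a passphrase, then push the public key to every node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in hadoop102 hadoop103 hadoop104
do
  ssh-copy-id $host
done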

1.6 Configure the xsync script

[hadoop@hadoop102 ~]$ mkdir /home/hadoop/bin

Create /home/hadoop/bin/xsync with the following content:

#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo Not Enough Arguments!
  exit;
fi
#2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
  echo ====================  $host  ====================
  #3. Loop over all files/directories and send them one by one
  for file in $@
  do
    #4. Check that the file exists
    if [ -e $file ]
    then
      #5. Get the parent directory
      pdir=$(cd -P $(dirname $file); pwd)
      #6. Get the file name
      fname=$(basename $file)
      ssh $host "mkdir -p $pdir"
      rsync -av $pdir/$fname $host:$pdir
    else
      echo $file does not exist!
    fi
  done
done

[hadoop@hadoop102 ~]$ cd /home/hadoop/bin
[hadoop@hadoop102 bin]$ chmod 777 xsync
[hadoop@hadoop102 bin]$ ll
total 4
-rwxrwxrwx. 1 hadoop hadoop 624 Oct  2 17:56 xsync
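
Usage is simply xsync <path...>; each argument is copied to the same absolute path on hadoop102, hadoop103 and hadoop104 (rsync must be available on every host). A hypothetical first run, pushing the script directory itself to the other machines (~/bin only lands on PATH after a new login shell, hence the absolute path):

[hadoop@hadoop102 ~]$ /home/hadoop/bin/xsync /home/hadoop/bin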

1.7 Create working directories

# Create the working directories
sudo mkdir /opt/software
sudo mkdir /opt/module

# Grant ownership to the hadoop user (run from /opt)
sudo chown -R hadoop:hadoop module/ software/

1.8 Install the JDK

# Query all existing Java packages and remove them
rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps
# Upload the JDK archive and extract it to the target directory
tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
  • Add the environment variables
# Edit the my_env.sh script
[hadoop@hadoop102 jdk1.8.0_212]$ sudo vim /etc/profile.d/my_env.sh

# Content
# JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin

# Make the environment variables take effect
[hadoop@hadoop102 jdk1.8.0_212]$ source /etc/profile.d/my_env.sh

# Verify the installation
[hadoop@hadoop102 jdk1.8.0_212]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
  • Distribute from hadoop102
[hadoop@hadoop102 module]$ sudo /home/hadoop/bin/xsync /opt/module/jdk1.8.0_212/
[hadoop@hadoop102 module]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
  • hadoop103
[hadoop@hadoop103 opt]$ sudo chown -R hadoop:hadoop module/
[hadoop@hadoop103 opt]$ ll
total 0
drwxr-xr-x. 3 hadoop hadoop 26 Oct  2 18:45 module

[hadoop@hadoop103 opt]$ source /etc/profile.d/my_env.sh 
[hadoop@hadoop103 opt]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
  • hadoop104
[hadoop@hadoop104 opt]$ sudo chown -R hadoop:hadoop module/
[hadoop@hadoop104 opt]$ ll
total 0
drwxr-xr-x. 3 hadoop hadoop 26 Oct  2 18:45 module

[hadoop@hadoop104 opt]$ source /etc/profile.d/my_env.sh 
[hadoop@hadoop104 opt]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)

1.9 Notes on bash configuration file load order

  1. /etc/profile
  2. /etc/profile.d/*.sh
  3. ~/.bash_profile

An interactive login with a username and password loads all three files (1, 2 and 3); a non-login shell (e.g. a command run over SSH) only loads the second, which is why the environment variables above go into /etc/profile.d/my_env.sh.
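
A quick way to observe this, assuming my_env.sh from the previous section has already been distributed (hypothetical output):

# a command run over SSH is a non-login shell, yet it still picks up /etc/profile.d/*.sh
[hadoop@hadoop102 ~]$ ssh hadoop103 'echo $JAVA_HOME'
/opt/module/jdk1.8.0_212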

1.10 Generate mock data

# Grant ownership
[hadoop@hadoop102 ~]$ sudo chown -R hadoop:hadoop /opt
# Create the log directory
[hadoop@hadoop102 opt]$ mkdir applog

(To be completed.)

1.11 The xcall script

# Edit the xcall.sh file
[hadoop@hadoop102 bin]$ vim /home/hadoop/bin/xcall.sh

# Content
#! /bin/bash
for i in hadoop102 hadoop103 hadoop104
do
  echo ============$i================
  ssh $i "$*"
done

# Grant execute permission
[hadoop@hadoop102 bin]$ chmod 777 /home/hadoop/bin/xcall.sh

# Test the script
[hadoop@hadoop102 bin]$ xcall.sh jps
============hadoop102================
1290 Jps
============hadoop103================
1099 Jps
============hadoop104================
1106 Jps

2. Install the Distributed Cluster

2.1 Install Hadoop

2.1.1 Installation

# Upload the Hadoop archive and extract it
[hadoop@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

2.1.2 Configure environment variables

sudo vim /etc/profile.d/my_env.sh

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# Make it take effect
source /etc/profile.d/my_env.sh
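
A quick sanity check that the new PATH entries took effect (output abridged; the version string is whatever was packaged):

[hadoop@hadoop102 ~]$ hadoop version
Hadoop 3.1.3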

2.1.3 Distribute

[hadoop@hadoop102 hadoop-3.1.3]$ sudo /home/hadoop/bin/xsync /opt/module/hadoop-3.1.3/
[hadoop@hadoop102 hadoop-3.1.3]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
[hadoop@hadoop103 ~]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop104 ~]$ source /etc/profile.d/my_env.sh

2.1.4 Cluster configuration

       hadoop102            hadoop103                      hadoop104
HDFS   NameNode, DataNode   DataNode                       SecondaryNameNode, DataNode
YARN   NodeManager          ResourceManager, NodeManager   NodeManager

NameNode, SecondaryNameNode and ResourceManager all consume a lot of memory, so they should not be placed on the same server.

# Enter the Hadoop configuration directory
/opt/module/hadoop-3.1.3/etc/hadoop
  • core-site.xml
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>

    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Static user for the HDFS web UI -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>

    <!-- Hosts from which the hadoop proxy user (superuser) may connect -->
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups the hadoop proxy user (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>

    <!-- Users the hadoop proxy user (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.hadoop.users</name>
        <value>*</value>
    </property>
  • hdfs-site.xml
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>

    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>

    <!-- Test environment: HDFS replication factor of 1 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
  • yarn-site.xml
    <!-- Use mapreduce_shuffle for the MR shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>

    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

    <!-- Minimum and maximum memory a YARN container may be allocated -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>

    <!-- Physical memory the NodeManager may manage -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>

    <!-- Disable YARN's virtual memory check -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
  • mapred-site.xml
<!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
  • workers
hadoop102
hadoop103
hadoop104

2.1.5 Distribute the configuration

[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync core-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 68 bytes  received 12 bytes  160.00 bytes/sec
total size is 1,758  speedup is 21.98
==================== hadoop103 ====================
sending incremental file list
core-site.xml

sent 1,177 bytes  received 47 bytes  2,448.00 bytes/sec
total size is 1,758  speedup is 1.44
==================== hadoop104 ====================
sending incremental file list
core-site.xml

sent 1,177 bytes  received 47 bytes  2,448.00 bytes/sec
total size is 1,758  speedup is 1.44
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync hdfs-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 68 bytes  received 12 bytes  53.33 bytes/sec
total size is 1,246  speedup is 15.57
==================== hadoop103 ====================
sending incremental file list
hdfs-site.xml

sent 665 bytes  received 47 bytes  1,424.00 bytes/sec
total size is 1,246  speedup is 1.75
==================== hadoop104 ====================
sending incremental file list
hdfs-site.xml

sent 665 bytes  received 47 bytes  1,424.00 bytes/sec
total size is 1,246  speedup is 1.75
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync yarn-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 67 bytes  received 12 bytes  158.00 bytes/sec
total size is 1,909  speedup is 24.16
==================== hadoop103 ====================
sending incremental file list
yarn-site.xml

sent 2,023 bytes  received 41 bytes  4,128.00 bytes/sec
total size is 1,909  speedup is 0.92
==================== hadoop104 ====================
sending incremental file list
yarn-site.xml

sent 2,023 bytes  received 41 bytes  1,376.00 bytes/sec
total size is 1,909  speedup is 0.92
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync mapred-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 70 bytes  received 12 bytes  54.67 bytes/sec
total size is 913  speedup is 11.13
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml

sent 334 bytes  received 47 bytes  762.00 bytes/sec
total size is 913  speedup is 2.40
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml

sent 334 bytes  received 47 bytes  762.00 bytes/sec
total size is 913  speedup is 2.40
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync workers 
==================== hadoop102 ====================
sending incremental file list

sent 62 bytes  received 12 bytes  148.00 bytes/sec
total size is 30  speedup is 0.41
==================== hadoop103 ====================
sending incremental file list
workers

sent 139 bytes  received 41 bytes  120.00 bytes/sec
total size is 30  speedup is 0.17
==================== hadoop104 ====================
sending incremental file list
workers

sent 139 bytes  received 41 bytes  360.00 bytes/sec
total size is 30  speedup is 0.17

2.1.6 Enable the log and history services

2.1.6.1 Configuration

  • mapred-site.xml
<!-- JobHistory server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- JobHistory server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>
  • yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Log aggregation server URL -->
<property>  
    <name>yarn.log.server.url</name>  
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>

<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Note: after enabling log aggregation, the NodeManager, ResourceManager and JobHistoryServer must be restarted.
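
For reference, a restart could look like the sketch below, using the same paths as the rest of this guide (only relevant once the cluster is actually running; at this point in the walkthrough nothing has been started yet):

# restart YARN on hadoop103 and the history server on hadoop102
ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh && /opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
mapred --daemon stop historyserver
mapred --daemon start historyserver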

2.1.6.2 Distribute

[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync mapred-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 70 bytes  received 12 bytes  164.00 bytes/sec
total size is 1,211  speedup is 14.77
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml

sent 632 bytes  received 47 bytes  452.67 bytes/sec
total size is 1,211  speedup is 1.78
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml

sent 632 bytes  received 47 bytes  1,358.00 bytes/sec
total size is 1,211  speedup is 1.78
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync yarn-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 67 bytes  received 12 bytes  158.00 bytes/sec
total size is 1,909  speedup is 24.16
==================== hadoop103 ====================
sending incremental file list

sent 67 bytes  received 12 bytes  52.67 bytes/sec
total size is 1,909  speedup is 24.16
==================== hadoop104 ====================
sending incremental file list

sent 67 bytes  received 12 bytes  158.00 bytes/sec
total size is 1,909  speedup is 24.16

2.1.7 Format the NameNode

[hadoop@hadoop102 hadoop-3.1.3]$ bin/hdfs namenode -format
WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
2021-10-02 23:24:53,367 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop102/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.3
STARTUP_MSG:   classpath = /opt/module/hadoop-3.1.3/etc/hadoop:... (several hundred jar entries omitted)
STARTUP_MSG:   build = https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579; compiled by 'ztang' on 2019-09-12T02:47Z
STARTUP_MSG:   java = 1.8.0_212
************************************************************/
2021-10-02 23:24:53,376 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2021-10-02 23:24:53,454 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-dee766cc-c898-4983-b137-1902d4f74c0e
2021-10-02 23:24:53,884 INFO namenode.FSEditLog: Edit logging is async:true
2021-10-02 23:24:53,894 INFO namenode.FSNamesystem: KeyProvider: null
2021-10-02 23:24:53,895 INFO namenode.FSNamesystem: fsLock is fair: true
2021-10-02 23:24:53,895 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: supergroup          = supergroup
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: isPermissionEnabled = true
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: HA Enabled: false
2021-10-02 23:24:53,941 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2021-10-02 23:24:53,949 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2021-10-02 23:24:53,956 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2021-10-02 23:24:53,961 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2021-10-02 23:24:53,961 INFO blockmanagement.BlockManager: The block deletion will start around 2021 Oct 02 23:24:53
2021-10-02 23:24:53,962 INFO util.GSet: Computing capacity for map BlocksMap
2021-10-02 23:24:53,962 INFO util.GSet: VM type       = 64-bit
2021-10-02 23:24:53,964 INFO util.GSet: 2.0% max memory 3.4 GB = 70.6 MB
2021-10-02 23:24:53,964 INFO util.GSet: capacity      = 2^23 = 8388608 entries
2021-10-02 23:24:53,974 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2021-10-02 23:24:54,029 INFO Configuration.deprecation: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
2021-10-02 23:24:54,029 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2021-10-02 23:24:54,029 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: defaultReplication         = 1
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxReplication             = 512
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: minReplication             = 1
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: redundancyRecheckInterval  = 3000ms
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2021-10-02 23:24:54,068 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
2021-10-02 23:24:54,079 INFO util.GSet: Computing capacity for map INodeMap
2021-10-02 23:24:54,079 INFO util.GSet: VM type       = 64-bit
2021-10-02 23:24:54,080 INFO util.GSet: 1.0% max memory 3.4 GB = 35.3 MB
2021-10-02 23:24:54,080 INFO util.GSet: capacity      = 2^22 = 4194304 entries
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: ACLs enabled? false
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: XAttrs enabled? true
2021-10-02 23:24:54,082 INFO namenode.NameNode: Caching file names occurring more than 10 times
2021-10-02 23:24:54,086 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2021-10-02 23:24:54,088 INFO snapshot.SnapshotManager: SkipList is disabled
2021-10-02 23:24:54,092 INFO util.GSet: Computing capacity for map cachedBlocks
2021-10-02 23:24:54,092 INFO util.GSet: VM type       = 64-bit
2021-10-02 23:24:54,092 INFO util.GSet: 0.25% max memory 3.4 GB = 8.8 MB
2021-10-02 23:24:54,092 INFO util.GSet: capacity      = 2^20 = 1048576 entries
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2021-10-02 23:24:54,102 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2021-10-02 23:24:54,102 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2021-10-02 23:24:54,103 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2021-10-02 23:24:54,103 INFO util.GSet: VM type       = 64-bit
2021-10-02 23:24:54,103 INFO util.GSet: 0.029999999329447746% max memory 3.4 GB = 1.1 MB
2021-10-02 23:24:54,104 INFO util.GSet: capacity      = 2^17 = 131072 entries
2021-10-02 23:24:54,126 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1474479574-127.0.1.1-1633188294119
2021-10-02 23:24:54,134 INFO common.Storage: Storage directory /opt/module/hadoop-3.1.3/data/dfs/name has been successfully formatted.
2021-10-02 23:24:54,153 INFO namenode.FSImageFormatProtobuf: Saving image file /opt/module/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-10-02 23:24:54,224 INFO namenode.FSImageFormatProtobuf: Image file /opt/module/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 393 bytes saved in 0 seconds .
2021-10-02 23:24:54,235 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-10-02 23:24:54,241 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2021-10-02 23:24:54,242 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop102/127.0.1.1
************************************************************/
[hadoop@hadoop102 hadoop-3.1.3]$

Note: before formatting again, be sure to stop all previously started NameNode and DataNode processes first, and then delete the data and logs directories; see the sketch below.
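
A minimal cleanup sketch using the xcall.sh helper from section 1.11, assuming the default data and logs locations used in this guide:

# make sure no NameNode/DataNode processes are still running on any node
/home/hadoop/bin/xcall.sh jps
# then wipe the old metadata and logs on every node before formatting again
/home/hadoop/bin/xcall.sh "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"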

2.1.9 Start HDFS

[hadoop@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh 
Starting namenodes on [hadoop102]
Starting datanodes
hadoop103: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
hadoop104: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
Starting secondary namenodes [hadoop104]

[hadoop@hadoop102 hadoop-3.1.3]$ /home/hadoop/bin/xcall.sh jps
============hadoop102================
2610 DataNode
2838 Jps
2461 NameNode
============hadoop103================
1443 DataNode
1496 Jps
============hadoop104================
2002 Jps
1966 SecondaryNameNode
1871 DataNode

2.1.10 Start YARN

[hadoop@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh 
Starting resourcemanager
Starting nodemanagers
[hadoop@hadoop102 hadoop-3.1.3]$ /home/hadoop/bin/xcall.sh jps
============hadoop102================
2610 DataNode
3221 Jps
3079 NodeManager
2461 NameNode
============hadoop103================
1443 DataNode
1564 NodeManager
1660 Jps
============hadoop104================
2070 NodeManager
2166 Jps
1966 SecondaryNameNode
1871 DataNode

Note: the service is started and verified on different servers: YARN is started on hadoop103 (where the ResourceManager runs), while the jps check above is issued from hadoop102.

2.1.11 Access the web UIs

http://hadoop102:9870/
http://hadoop103:8088/cluster

Note: there is a small pitfall here. The localhost-related entries in /etc/hosts of the freshly installed machines had not been removed, so the web pages could not be reached from other machines until they were (see the sketch below).
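
A minimal fix sketch, assuming the offending entry is the usual 127.0.1.1 hostname mapping (the NameNode startup log above shows hadoop102/127.0.1.1); adjust the pattern to whatever your /etc/hosts actually contains:

# inspect the current mappings
cat /etc/hosts
# drop the 127.0.1.1 line, keeping the real 192.168.10.x entries
sudo sed -i '/^127\.0\.1\.1/d' /etc/hosts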

2.1.12 Create a cluster start/stop script

vim /home/hadoop/bin/hdp.sh

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi
case $1 in
"start")
        echo " =================== Starting the Hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== Stopping the Hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

[hadoop@hadoop102 hadoop-3.1.3]$ cd /home/hadoop/bin/
[hadoop@hadoop102 bin]$ ls
hdp.sh  xcall.sh  xsync
[hadoop@hadoop102 bin]$ chmod 777 hdp.sh 
[hadoop@hadoop102 bin]$ ll
total 12
-rwxrwxrwx  1 hadoop hadoop 1158 Oct  3 09:36 hdp.sh
-rwxrwxrwx. 1 hadoop hadoop  112 Oct  2 21:58 xcall.sh
-rwxrwxrwx. 1 hadoop hadoop  624 Oct  2 17:56 xsync
[hadoop@hadoop102 bin]$ ./hdp.sh stop
 =================== Stopping the Hadoop cluster ===================
 --------------- stopping historyserver ---------------
 --------------- stopping yarn ---------------
Stopping nodemanagers
Stopping resourcemanager
 --------------- stopping hdfs ---------------
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[hadoop@hadoop102 bin]$ xcall.sh jps
============hadoop102================
2239 Jps
============hadoop103================
2114 Jps
============hadoop104================
1451 Jps

2.1.13 Hadoop tuning

2.1.13.1 Production experience: multiple HDFS storage directories

(1) Add an extra disk to the Linux system.
Reference: https://www.cnblogs.com/yujianadu/p/10750698.html
(2) Disk layout of the production servers (illustrative figure not included here).

(3) Configure multiple directories in hdfs-site.xml; pay attention to the access permissions of the newly mounted disks.
The path where a DataNode stores its data is controlled by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If a server has several disks, this parameter must be changed. With disks mounted as in the layout above, the parameter would be set to the following value.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>

Note: because each server node has a different disk layout, this setting does not need to be distributed after it is configured.

2.1.13.2 集群数据均衡

1)节点间数据均衡
(1)开启数据均衡命令

start-balancer.sh -threshold 10

对于参数10,代表的是集群中各个节点的磁盘空间利用率相差不超过10%,可根据实际情况进行调整。
(2)停止数据均衡命令

stop-balancer.sh

注意:由于HDFS需要启动单独的Rebalance Server来执行Rebalance操作,所以尽量不要在NameNode上执行start-balancer.sh,而是找一台比较空闲的机器执行。
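
补充一个小技巧(以下命令为笔者补充的示例,带宽数值请按实际情况调整):在开启数据均衡前,可以先限制Balancer可占用的带宽,避免均衡过程影响正常业务。

# 将每个DataNode可用于均衡的带宽限制为10MB/s(参数单位为字节/秒)
hdfs dfsadmin -setBalancerBandwidth 10485760
# 然后在一台比较空闲的机器上开启均衡
start-balancer.sh -threshold 10
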
2)磁盘间数据均衡
(1)生成均衡计划(我们只有一块磁盘,不会生成计划)

hdfs diskbalancer -plan hadoop103

(2)执行均衡计划

hdfs diskbalancer -execute hadoop103.plan.json

(3)查看当前均衡任务的执行情况

hdfs diskbalancer -query hadoop103

(4)取消均衡任务

hdfs diskbalancer -cancel hadoop103.plan.json

2.1.13.3 项目经验之支持LZO压缩配置

1)hadoop-lzo编译
hadoop本身并不支持lzo压缩,故需要使用twitter提供的hadoop-lzo开源组件。hadoop-lzo需依赖hadoop和lzo进行编译,编译步骤如下。

Hadoop支持LZO

0. 环境准备
maven(下载安装,配置环境变量,修改settings.xml添加阿里云镜像)
gcc-c++
zlib-devel
autoconf
automake
libtool
通过yum安装即可,yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool

1. 下载、安装并编译LZO

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz

tar -zxvf lzo-2.10.tar.gz

cd lzo-2.10

./configure --prefix=/usr/local/hadoop/lzo/

make

make install

2. 编译hadoop-lzo源码

2.1 下载hadoop-lzo的源码,下载地址:https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 解压之后,修改pom.xml,将其中hadoop的版本号改为3.1.3
2.3 声明两个临时环境变量
     export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
     export LIBRARY_PATH=/usr/local/hadoop/lzo/lib 
2.4 编译
    进入hadoop-lzo-master,执行maven编译命令
    mvn package -Dmaven.test.skip=true
2.5 进入target,hadoop-lzo-0.4.21-SNAPSHOT.jar 即编译成功的hadoop-lzo组件

2)将编译好的hadoop-lzo-0.4.20.jar放入hadoop-3.1.3/share/hadoop/common/(若自行编译得到的是hadoop-lzo-0.4.21-SNAPSHOT.jar,使用该jar即可,注意后续命令中的文件名保持一致)

[hadoop@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[hadoop@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar

3)同步hadoop-lzo-0.4.20.jar到hadoop103、hadoop104

[hadoop@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar

4)core-site.xml增加配置支持LZO压缩

<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
    </property>

    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>

5)同步core-site.xml到hadoop103、hadoop104

[hadoop@hadoop102 hadoop]$ xsync core-site.xml

6)启动及查看集群

[hadoop@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hadoop@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

7)测试-数据准备

[hadoop@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /input
[hadoop@hadoop102 hadoop-3.1.3]$ hadoop fs -put README.txt /input

8)测试-压缩

[hadoop@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec  /input /output

2.1.13.4 项目经验之LZO创建索引

1)创建LZO文件的索引
LZO压缩文件的可切片特性依赖于其索引,故我们需要手动为LZO压缩文件创建索引。若无索引,则LZO文件的切片只有一个。

hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo

2)测试
(1)将bigtable.lzo(200M)上传到集群的/input目录

[hadoop@hadoop102 module]$ hadoop fs -mkdir /input
[hadoop@hadoop102 module]$ hadoop fs -put bigtable.lzo /input
(2)执行wordcount程序
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output1
(3)对上传的LZO文件建索引
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar  com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo
(4)再次执行WordCount程序
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output2

3)注意:如果以上任务,在运行过程中报如下异常
Container [pid=8468,containerID=container_1594198338753_0001_01_000002] is running 318740992B beyond the ‘VIRTUAL’ memory limit. Current usage: 111.5 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1594198338753_0001_01_000002 :
解决办法:在hadoop102的/opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml文件中增加如下配置,然后分发到hadoop103、hadoop104服务器上,并重新启动集群。

<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
</property>

2.1.13.5 项目经验之基准测试

在企业中,大家非常关心每天从Java后台拉取过来的数据需要多久能上传到集群,消费者也很关心需要多久才能从HDFS上拉取到需要的数据。
为了搞清楚HDFS的读写性能,生产环境上非常需要对集群进行压测。

HDFS的读写性能主要受网络和磁盘影响比较大。为了方便测试,将hadoop102、hadoop103、hadoop104虚拟机网络都设置为100mbps。
100Mbps的单位是bit,12.5M/s的单位是byte;1byte=8bit,所以100Mbps/8=12.5M/s。

测试网速:
(1)来到hadoop102的/opt/software目录,启动一个简易HTTP服务用于测试下载速度

[hadoop@hadoop102 software]$ python -m SimpleHTTPServer

(2)在Web页面上访问
hadoop102:8000
1)测试HDFS写性能
(1)写测试底层原理

(2)测试内容:向HDFS集群写10个128M的文件

[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB

2021-02-09 10:43:16,853 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:             Date & time: Tue Feb 09 10:43:16 CST 2021
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:         Number of files: 10
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:  Total MBytes processed: 1280
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:       Throughput mb/sec: 1.61
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:  Average IO rate mb/sec: 1.9
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:   IO rate std deviation: 0.76
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:      Test exec time sec: 133.05
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:

注意:-nrFiles n为生成mapTask的数量,生产环境一般可通过hadoop103:8088查看CPU核数,设置为(CPU核数 - 1)。
Number of files:生成的mapTask数量,一般设置为集群CPU总核数 - 1,我们的测试虚拟机按实际核数 - 1分配即可。(目标:让每个节点都参与测试)
Total MBytes processed:处理的总文件大小
Throughput mb/sec:单个mapTask的吞吐量
计算方式:处理的总文件大小/每一个mapTask写数据的时间累加
集群整体吞吐量:生成的mapTask数量*单个mapTask的吞吐量
Average IO rate mb/sec:平均mapTask的吞吐量
计算方式:每个mapTask处理的文件大小/每一个mapTask写数据的时间,全部相加后除以task数量
IO rate std deviation:标准差,反映各个mapTask处理能力的差异,越小越均衡
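
按上面的公式,用本次测试结果粗略估算一下集群整体吞吐量(以下为笔者补充的示例计算,非测试程序输出):

# 集群整体吞吐量 ≈ 生成的mapTask数量 × 单个mapTask吞吐量
# 本次测试共10个mapTask,单个吞吐量1.61MB/s
awk 'BEGIN{print 10 * 1.61}'   # 输出16.1
# 即客户端整体写入速度约16MB/s;由于每个数据块还要经网络复制2个副本,网络总流量约为其2倍,即约32MB/s,与下文的测试结果分析一致
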
注意:如果测试过程中,出现异常
①可以在yarn-site.xml中设置虚拟内存检测为false

<property>
     <name>yarn.nodemanager.vmem-check-enabled</name>
     <value>false</value>
</property>

②分发配置并重启Yarn集群
(3)测试结果分析
①由于副本1就在本地,所以该副本不参与测试

一共参与测试的文件:10个文件 * 2个副本 = 20个
压测后的速度:1.61MB/s
实测速度:1.61MB/s * 20个文件 ≈ 32MB/s
三台服务器的总带宽:12.5 + 12.5 + 12.5 = 37.5MB/s
实测速度已接近带宽上限,说明网络资源基本用满。
如果实测速度远远小于网络带宽,并且实测速度不能满足工作需求,可以考虑采用固态硬盘或者增加磁盘个数。
②如果客户端不在集群节点,那就三个副本都参与计算

2)测试HDFS读性能
(1)测试内容:读取HDFS集群10个128M的文件

[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB

2021-02-09 11:34:15,847 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2021-02-09 11:34:15,847 INFO fs.TestDFSIO:             Date & time: Tue Feb 09 11:34:15 CST 2021
2021-02-09 11:34:15,847 INFO fs.TestDFSIO:         Number of files: 10
2021-02-09 11:34:15,847 INFO fs.TestDFSIO:  Total MBytes processed: 1280
2021-02-09 11:34:15,848 INFO fs.TestDFSIO:       Throughput mb/sec: 200.28
2021-02-09 11:34:15,848 INFO fs.TestDFSIO:  Average IO rate mb/sec: 266.74
2021-02-09 11:34:15,848 INFO fs.TestDFSIO:   IO rate std deviation: 143.12
2021-02-09 11:34:15,848 INFO fs.TestDFSIO:      Test exec time sec: 20.83

(2)删除测试生成数据

[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -clean

(3)测试结果分析:为什么读取文件速度大于网络带宽?由于目前只有三台服务器,且有三个副本,数据读取就近原则,相当于都是读取的本地磁盘数据,没有走网络。

3)使用Sort程序评测MapReduce
(1)使用RandomWriter来产生随机数,每个节点运行10个Map任务,每个Map产生大约1G大小的二进制随机数

[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar randomwriter random-data

(2)执行Sort程序

[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar sort random-data sorted-data

(3)验证数据是否真正排好序了

[hadoop@hadoop102 mapreduce]$ 
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data

2.1.13.5 项目经验之Hadoop参数调优

1)HDFS参数调优hdfs-site.xml
The number of Namenode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured then Namenode RPC server threads listen to requests from all nodes.
NameNode有一个工作线程池,用来处理不同DataNode的并发心跳以及客户端并发的元数据操作。
对于大集群或者有大量客户端的集群来说,通常需要增大参数dfs.namenode.handler.count的默认值10。

<property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
</property>

dfs.namenode.handler.count = 20 × ln(集群规模),即集群规模的自然对数乘以20。比如集群规模为8台时,此参数设置为41。可通过简单的python代码计算该值,代码如下。

[atguigu@hadoop102 ~]$ python
Python 2.7.5 (default, Apr 11 2018, 07:36:10) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> print int(20*math.log(8))
41
>>> quit()
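
也可以把上面的计算封装成一个小脚本,传入集群规模即可得到建议值(以下脚本为笔者补充的示例):

#!/bin/bash
# 用法: ./handler_count.sh <集群DataNode台数>
# 计算公式: dfs.namenode.handler.count = 20 * ln(集群规模)
if [ $# -lt 1 ]; then
    echo "Usage: $0 <cluster_size>"
    exit 1
fi
python -c "import math; print(int(20*math.log($1)))"
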

2)YARN参数调优yarn-site.xml
(1)情景描述:总共7台机器,每天几亿条数据,数据源->Flume->Kafka->HDFS->Hive
面临问题:数据统计主要用HiveSQL,没有数据倾斜,小文件已经做了合并处理,开启的JVM重用,而且IO没有阻塞,内存用了不到50%。但是还是跑的非常慢,而且数据量洪峰过来时,整个集群都会宕掉。基于这种情况有没有优化方案。
(2)解决办法:
NodeManager内存和服务器实际内存配置尽量接近,如服务器有128g内存,但是NodeManager默认内存8G,不修改该参数最多只能用8G内存。NodeManager使用的CPU核数和服务器CPU核数尽量接近。

  • ①yarn.nodemanager.resource.memory-mb NodeManager使用内存数
  • ②yarn.nodemanager.resource.cpu-vcores NodeManager使用CPU核数

2.2 安装Zookeeper

2.2.1 集群规划

| hadoop102 | hadoop103 | hadoop104 |
| --------- | --------- | --------- |
| Zookeeper | Zookeeper | Zookeeper |

2.2.2 解压安装

[hadoop@hadoop102 software]$ tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
# 重命名
[hadoop@hadoop102 module]$ mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
# 分发
[hadoop@hadoop102 module]$ xsync zookeeper-3.5.7/

2.2.3 配置服务

# 创建数据目录
[hadoop@hadoop102 zookeeper-3.5.7]$ mkdir zkData
# 重命名
[hadoop@hadoop102 conf]$ mv zoo_sample.cfg zoo.cfg
[hadoop@hadoop102 conf]$ vim zoo.cfg
# 内容
dataDir=/opt/module/zookeeper-3.5.7/zkData

#######################cluster##########################
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888


# 创建服务ID
[hadoop@hadoop102 zkData]$ touch myid
[hadoop@hadoop102 zkData]$ vim myid
# 内容
2

# 分发
[hadoop@hadoop102 zookeeper-3.5.7]$ xsync zkData/
==================== hadoop102 ====================
sending incremental file list

sent 94 bytes  received 17 bytes  222.00 bytes/sec
total size is 2  speedup is 0.02
==================== hadoop103 ====================
sending incremental file list
zkData/
zkData/myid

sent 142 bytes  received 39 bytes  362.00 bytes/sec
total size is 2  speedup is 0.01
==================== hadoop104 ====================
sending incremental file list
zkData/
zkData/myid

sent 142 bytes  received 39 bytes  362.00 bytes/sec
total size is 2  speedup is 0.01

[hadoop@hadoop102 zookeeper-3.5.7]$ xsync conf/
configuration.xsl  log4j.properties   zoo.cfg            
[hadoop@hadoop102 zookeeper-3.5.7]$ xsync conf/zoo.cfg 
==================== hadoop102 ====================
sending incremental file list

sent 62 bytes  received 12 bytes  148.00 bytes/sec
total size is 942  speedup is 12.73
==================== hadoop103 ====================
sending incremental file list
zoo.cfg

sent 1,051 bytes  received 35 bytes  2,172.00 bytes/sec
total size is 942  speedup is 0.87
==================== hadoop104 ====================
sending incremental file list
zoo.cfg

sent 1,051 bytes  received 35 bytes  2,172.00 bytes/sec
total size is 942  speedup is 0.87

hadoop102服务ID为2、hadoop103服务ID为3、hadoop104服务ID为4。
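
zkData分发后,hadoop103、hadoop104上的myid仍然是2,需要分别改为3和4,可以直接通过ssh修改(以下命令为笔者补充的示例,路径与本文一致):

[hadoop@hadoop102 zookeeper-3.5.7]$ ssh hadoop103 "echo 3 > /opt/module/zookeeper-3.5.7/zkData/myid"
[hadoop@hadoop102 zookeeper-3.5.7]$ ssh hadoop104 "echo 4 > /opt/module/zookeeper-3.5.7/zkData/myid"
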

2.2.4 启动服务

[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Error contacting service. It is probably not running.
[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower


[hadoop@hadoop103 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: leader


[hadoop@hadoop104 ~]$ /opt/module/zookeeper-3.5.7/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
    

2.2.5 ZK集群启动停止脚本

(1)在hadoop102的/home/hadoop/bin目录下创建脚本

[hadoop@hadoop102 bin]$ vim zk.sh
	在脚本中编写如下内容
#!/bin/bash

case $1 in
"start"){
	for i in hadoop102 hadoop103 hadoop104
	do
        echo ---------- zookeeper $i 启动 ------------
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
	done
};;
"stop"){
	for i in hadoop102 hadoop103 hadoop104
	do
        echo ---------- zookeeper $i 停止 ------------    
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
	done
};;
"status"){
	for i in hadoop102 hadoop103 hadoop104
	do
        echo ---------- zookeeper $i 状态 ------------    
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
	done
};;

esac
(2)增加脚本执行权限

[hadoop@hadoop102 bin]$ chmod u+x zk.sh

(3)Zookeeper集群启动脚本

[hadoop@hadoop102 module]$ zk.sh start

(4)Zookeeper集群停止脚本

[hadoop@hadoop102 module]$ zk.sh stop

2.3 安装kafka

2.3.1 集群规划

| hadoop102 | hadoop103 | hadoop104 |
| --------- | --------- | --------- |
| zk        | zk        | zk        |
| kafka     | kafka     | kafka     |

2.3.2 解压安装

[hadoop@hadoop102 software]$ tar -zxvf kafka_2.11-2.4.1.tgz -C /opt/module/

# 重命名
[hadoop@hadoop102 software]$ cd ../module/
[hadoop@hadoop102 module]$ ls
hadoop-3.1.3  jdk1.8.0_212  kafka_2.11-2.4.1  zookeeper-3.5.7
[hadoop@hadoop102 module]$ mv kafka_2.11-2.4.1/ kafka

2.3.3 配置

[hadoop@hadoop102 kafka]$ cd config/
[hadoop@hadoop102 config]$ vi server.properties
修改或者增加以下内容:
#broker的全局唯一编号,不能重复
broker.id=0
#删除topic功能使能
delete.topic.enable=true
#kafka运行日志存放的路径
log.dirs=/opt/module/kafka/data
#配置连接Zookeeper集群地址
zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka

# 分发
[hadoop@hadoop102 module]$ xsync kafka/

注意:分别修改hadoop103、hadoop104两台机器上的broker.id为1和2.
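
修改broker.id也可以不登录到每台机器,直接用sed远程修改(以下命令为笔者补充的示例):

[hadoop@hadoop102 module]$ ssh hadoop103 "sed -i 's/^broker.id=0/broker.id=1/' /opt/module/kafka/config/server.properties"
[hadoop@hadoop102 module]$ ssh hadoop104 "sed -i 's/^broker.id=0/broker.id=2/' /opt/module/kafka/config/server.properties"
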

2.3.4 配置环境变量

[hadoop@hadoop102 kafka]$ sudo vim  /etc/profile.d/my_env.sh 

# KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
[hadoop@hadoop102 kafka]$ source /etc/profile.d/my_env.sh
# 分发
[hadoop@hadoop102 kafka]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh 
==================== hadoop102 ====================
Warning: Permanently added the ECDSA host key for IP address '192.168.10.102' to the list of known hosts.
root@hadoop102's password: 
root@hadoop102's password: 
sending incremental file list

sent 47 bytes  received 12 bytes  23.60 bytes/sec
total size is 300  speedup is 5.08
==================== hadoop103 ====================
root@hadoop103's password: 
root@hadoop103's password: 
sending incremental file list
my_env.sh

sent 394 bytes  received 41 bytes  174.00 bytes/sec
total size is 300  speedup is 0.69
==================== hadoop104 ====================
root@hadoop104's password: 
root@hadoop104's password: 
sending incremental file list
my_env.sh

sent 394 bytes  received 41 bytes  124.29 bytes/sec
total size is 300  speedup is 0.69

2.3.5 Kafka集群启动停止脚本

(1)在/home/hadoop/bin目录下创建脚本kf.sh

[hadoop@hadoop102 bin]$ vim kf.sh
	在脚本中填写如下内容
#! /bin/bash

case $1 in
"start"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo " --------启动 $i Kafka-------"
        ssh $i "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"
    done
};;
"stop"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo " --------停止 $i Kafka-------"
        ssh $i "/opt/module/kafka/bin/kafka-server-stop.sh stop"
    done
};;
esac

(2)增加脚本执行权限

[hadoop@hadoop102 bin]$ chmod u+x kf.sh

(3)kf集群启动脚本

[hadoop@hadoop102 module]$ kf.sh start

(4)kf集群停止脚本

[hadoop@hadoop102 module]$ kf.sh stop

2.3.6 项目经验之Kafka机器数量计算

Kafka机器数量(经验公式)= 2 *(峰值生产速度 * 副本数 / 100)+ 1
先拿到峰值生产速度,再根据设定的副本数,就能预估出需要部署Kafka的数量。
1)峰值生产速度
峰值生产速度可以压测得到。
2)副本数
副本数默认是1个,在企业里面2-3个都有,2个居多。
副本多可以提高可靠性,但是会降低网络传输效率。
比如我们的峰值生产速度是50M/s。副本数为2。
Kafka机器数量 = 2 *(50 * 2 / 100)+ 1 = 3台
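
把这个经验公式写成一个简单的计算脚本(以下脚本为笔者补充的示例,bash为整数运算,结果向下取整):

#!/bin/bash
# 用法: ./kafka_broker_num.sh <峰值生产速度MB/s> <副本数>
# 经验公式: Kafka机器数量 = 2 * (峰值生产速度 * 副本数 / 100) + 1
peak=$1
replica=$2
echo $(( 2 * (peak * replica / 100) + 1 ))
# 例如 ./kafka_broker_num.sh 50 2 输出 3
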
4.4.5 项目经验之Kafka压力测试
1)Kafka压测
用Kafka官方自带的脚本,对Kafka进行压测。
kafka-consumer-perf-test.sh
kafka-producer-perf-test.sh
Kafka压测时,在硬盘读写速度一定的情况下,可以查看到哪些地方出现了瓶颈(CPU,内存,网络IO)。一般都是网络IO达到瓶颈。
2)Kafka Producer压力测试

(0)压测环境准备
	①hadoop102、hadoop103、hadoop104的网络带宽都设置为100mbps。
	②关闭hadoop102主机,并根据hadoop102克隆出hadoop105(修改IP和主机名称)
	③hadoop105的带宽不设限
	④创建一个test topic,设置为3个分区2个副本
[hadoop@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka --create --replication-factor 2 --partitions 3 --topic test

(1)在/opt/module/kafka/bin目录下面有这两个文件。我们来测试一下

[hadoop@hadoop102 kafka]$ bin/kafka-producer-perf-test.sh  --topic test --record-size 100 --num-records 10000000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092

说明:
record-size是一条信息有多大,单位是字节。
num-records是总共发送多少条信息。
throughput 是每秒多少条信息,设成-1,表示不限流,尽可能快的生产数据,可测出生产者最大吞吐量。
(2)Kafka会打印下面的信息
699884 records sent, 139976.8 records/sec (13.35 MB/sec), 1345.6 ms avg latency, 2210.0 ms max latency.
713247 records sent, 141545.3 records/sec (13.50 MB/sec), 1577.4 ms avg latency, 3596.0 ms max latency.
773619 records sent, 153862.2 records/sec (14.67 MB/sec), 2326.8 ms avg latency, 4051.0 ms max latency.
773961 records sent, 154206.2 records/sec (15.71 MB/sec), 1964.1 ms avg latency, 2917.0 ms max latency.
776970 records sent, 154559.4 records/sec (15.74 MB/sec), 1960.2 ms avg latency, 2922.0 ms max latency.
776421 records sent, 154727.2 records/sec (15.76 MB/sec), 1960.4 ms avg latency, 2954.0 ms max latency.
参数解析:Kafka的吞吐量15MB/s左右是否符合预期呢?
hadoop102、hadoop103、hadoop104三台机器的网络总带宽约30MB/s(按可用带宽估算),由于是两个副本,所以Kafka的吞吐量约为30MB/s ÷ 2(副本)= 15MB/s。
结论:网络带宽和副本数都会影响吞吐量。
(4)调整batch.size
batch.size默认值是16k。
batch.size较小,会降低吞吐量。比如批次大小设置为0,则完全禁用批处理,消息会一条一条地发送;
batch.size过大,会增加消息发送延迟。比如说,Batch设置为64k,但是要等待5秒钟Batch才凑满了64k,才能发送出去。那这条消息的延迟就是5秒钟。

[hadoop@hadoop102 kafka]$ bin/kafka-producer-perf-test.sh  --topic test --record-size 100 --num-records 10000000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092 batch.size=500

输出结果
69169 records sent, 13833.8 records/sec (1.32 MB/sec), 2517.6 ms avg latency, 4299.0 ms max latency.
105372 records sent, 21074.4 records/sec (2.01 MB/sec), 6748.4 ms avg latency, 9016.0 ms max latency.
113188 records sent, 22637.6 records/sec (2.16 MB/sec), 11348.0 ms avg latency, 13196.0 ms max latency.
108896 records sent, 21779.2 records/sec (2.08 MB/sec), 12272.6 ms avg latency, 12870.0 ms max latency.
(5)linger.ms
如果设置batch.size为64k,但是比如过了10分钟也没有凑够64k,怎么办?
可以设置linger.ms。比如linger.ms=5,那么即使要发送的数据没有凑够64k,5ms后数据也会被发送出去。
(6)总结
同时设置batch.size和linger.ms,哪个条件先满足,消息就会被发送出去。
Kafka需要考虑高吞吐量与延时的平衡。
3)Kafka Consumer压力测试

(1)Consumer的测试,如果这四个指标(IO,CPU,内存,网络)都不能改变,考虑增加分区数来提升性能。

[hadoop@hadoop102 kafka]$ bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test --fetch-size 10000 --messages 10000000 --threads 1

①参数说明:
--broker-list 指定Kafka集群地址
--topic 指定topic的名称
--fetch-size 指定每次fetch的数据大小
--messages 总共要消费的消息个数
②测试结果说明:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2021-08-03 21:17:21:778, 2021-08-03 21:18:19:775, 514.7169, 8.8749, 5397198, 93059.9514
开始测试时间、测试结束时间、共消费数据514.7169MB、吞吐量8.8749MB/s、共消费5397198条消息、平均每秒消费93059.9514条。
(2)调整fetch-size
①增加fetch-size值,观察消费吞吐量。

[hadoop@hadoop102 kafka]$ bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test --fetch-size 100000 --messages 10000000 --threads 1

②测试结果说明:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2021-08-03 21:22:57:671, 2021-08-03 21:23:41:938, 514.7169, 11.6276, 5397198, 121923.7355
(3)总结
吞吐量受网络带宽和fetch-size的影响
4.4.6 项目经验之Kafka分区数计算
(1)创建一个只有1个分区的topic
(2)测试这个topic的producer吞吐量和consumer吞吐量。
(3)假设他们的值分别是Tp和Tc,单位可以是MB/s。
(4)然后假设总的目标吞吐量是Tt,那么分区数 = Tt / min(Tp,Tc)
例如:producer吞吐量 = 20m/s;consumer吞吐量 = 50m/s,期望吞吐量100m/s;
分区数 = 100 / 20 = 5分区
https://blog.csdn.net/weixin_42641909/article/details/89294698
分区数一般设置为:3-10个
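
同样可以把分区数的估算写成一个小脚本(以下脚本为笔者补充的示例,Tp、Tc、Tt的含义见上文,单位MB/s):

#!/bin/bash
# 用法: ./kafka_partition_num.sh <Tp:producer吞吐量> <Tc:consumer吞吐量> <Tt:期望总吞吐量>
# 公式: 分区数 = Tt / min(Tp, Tc)
Tp=$1; Tc=$2; Tt=$3
min=$(( Tp < Tc ? Tp : Tc ))
echo $(( Tt / min ))
# 例如 ./kafka_partition_num.sh 20 50 100 输出 5
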

2.4 flume安装

2.4.1 集群规划

| hadoop102 | hadoop103 | hadoop104 |
| --------- | --------- | --------- |
| flume     | flume     | flume     |

2.4.2安装地址

(1) Flume官网地址:http://flume.apache.org/
(2)文档查看地址:http://flume.apache.org/FlumeUserGuide.html
(3)下载地址:http://archive.apache.org/dist/flume/

2.4.3 安装部署

(1)将apache-flume-1.9.0-bin.tar.gz上传到linux的/opt/software目录下
(2)解压apache-flume-1.9.0-bin.tar.gz到/opt/module/目录下

[hadoop@hadoop102 software]$ tar -zxf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/

(3)修改apache-flume-1.9.0-bin的名称为flume

[hadoop@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume

(4)将lib文件夹下的guava-11.0.2.jar删除以兼容Hadoop 3.1.3

[hadoop@hadoop102 module]$ rm /opt/module/flume/lib/guava-11.0.2.jar

注意:删除guava-11.0.2.jar的服务器节点,一定要配置hadoop环境变量。否则会报如下异常。
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Lists
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
… 1 more
(5)将flume/conf下的flume-env.sh.template文件修改为flume-env.sh,并配置flume-env.sh文件

[hadoop@hadoop102 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop102 conf]$ vi flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_212

2.4.4 分发

[hadoop@hadoop102 module]$ xsync flume/

2.4.5 项目经验之Flume组件选型

1)Source
(1)Taildir Source相比Exec Source、Spooling Directory Source的优势
TailDir Source:断点续传、多目录。Flume1.6以前需要自己自定义Source记录每次读取文件位置,实现断点续传。不会丢数据,但是有可能会导致数据重复。
Exec Source可以实时搜集数据,但是在Flume不运行或者Shell命令出错的情况下,数据将会丢失。
Spooling Directory Source监控目录,支持断点续传。
(2)batchSize大小如何设置?
答:Event 1K左右时,500-1000合适(默认为100)
2)Channel
采用Kafka Channel,省去了Sink,提高了效率。KafkaChannel数据存储在Kafka里面,所以数据是存储在磁盘中。
注意在Flume1.7以前,Kafka Channel很少有人使用,因为发现parseAsFlumeEvent这个配置起不了作用。也就是无论parseAsFlumeEvent配置为true还是false,都会转为Flume Event。这样的话,造成的结果是,会始终都把Flume的headers中的信息混合着内容一起写入Kafka的消息中,这显然不是我所需要的,我只是需要把内容写入即可。
4.5.3 日志采集Flume配置
1)Flume配置分析

Flume直接读log日志的数据,log日志的格式是app.yyyy-mm-dd.log。
2)Flume的具体配置如下:
(1)在/opt/module/flume/conf目录下创建file-flume-kafka.conf文件
[hadoop@hadoop102 conf]$ vim file-flume-kafka.conf
在文件配置如下内容

#为各组件命名
a1.sources = r1
a1.channels = c1

#描述source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.interceptors =  i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.ETLInterceptor$Builder

#描述channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false

#绑定source和channel以及sink和channel的关系
a1.sources.r1.channels = c1

​ 注意:com.atguigu.flume.interceptor.ETLInterceptor是自定义的拦截器的全类名。需要根据用户自定义的拦截器做相应修改。
4.5.4 Flume拦截器
1)创建Maven工程flume-interceptor
2)创建包名:com.atguigu.flume.interceptor
3)在pom.xml文件中添加如下配置

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.62</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

注意:scope为provided的含义是该jar包只在编译时使用,打包时不打进去,因为集群上已经存在flume的jar包,本地编译时用一下即可。
4)在com.atguigu.flume.interceptor包下创建JSONUtils类

package com.atguigu.flume.interceptor;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONException;

public class JSONUtils {
    public static boolean isJSONValidate(String log){
        try {
            JSON.parse(log);
            return true;
        }catch (JSONException e){
            return false;
        }
    }
}

5)在com.atguigu.flume.interceptor包下创建ETLInterceptor类

package com.atguigu.flume.interceptor;

import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class ETLInterceptor implements Interceptor {

    @Override
    public void initialize() {
    
    }
    
    @Override
    public Event intercept(Event event) {
    
        byte[] body = event.getBody();
        String log = new String(body, StandardCharsets.UTF_8);
    
        if (JSONUtils.isJSONValidate(log)) {
            return event;
        } else {
            return null;
        }
    }
    
    @Override
    public List<Event> intercept(List<Event> list) {
    
        Iterator<Event> iterator = list.iterator();
    
        while (iterator.hasNext()){
            Event next = iterator.next();
            if(intercept(next)==null){
                iterator.remove();
            }
        }
    
        return list;
    }
    
    public static class Builder implements Interceptor.Builder{
    
        @Override
        public Interceptor build() {
            return new ETLInterceptor();
        }
        @Override
        public void configure(Context context) {
    
        }
    
    }
    
    @Override
    public void close() {
    
    }

}

6)打包
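
打包可以在flume-interceptor工程根目录下执行Maven命令(示例命令,通常在本地开发机上完成,目标jar会生成在target目录下):

mvn clean package -DskipTests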

7)需要先将打好的包放入到hadoop102的/opt/module/flume/lib文件夹下面。

[hadoop@hadoop102 lib]$ ls | grep interceptor
flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar

8)分发Flume到hadoop103、hadoop104

[hadoop@hadoop102 module]$ xsync flume/

9)分别在hadoop102、hadoop103上启动Flume

[hadoop@hadoop102 flume]$ bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &
[hadoop@hadoop103 flume]$ bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &

4.5.5 测试Flume-Kafka通道

(1)生成日志
[hadoop@hadoop102 ~]$ lg.sh
(2)消费Kafka数据,观察控制台是否有数据获取到

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --from-beginning --topic topic_log

说明:如果获取不到数据,先检查Kafka、Flume、Zookeeper是否都正确启动。再检查Flume的拦截器代码是否正常。
4.5.6 日志采集Flume启动停止脚本
(1)在/home/hadoop/bin目录下创建脚本f1.sh

[hadoop@hadoop102 bin]$ vim f1.sh
	在脚本中填写如下内容
#! /bin/bash

case $1 in
"start"){
        for i in hadoop102 hadoop103
        do
                echo " --------启动 $i 采集flume-------"
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log1.txt 2>&1  &"
        done
};;	
"stop"){
        for i in hadoop102 hadoop103
        do
                echo " --------停止 $i 采集flume-------"
                ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk  '{print \$2}' | xargs -n1 kill -9 "
        done

};;
esac

说明1:nohup,该命令可以在你退出帐户/关闭终端之后继续运行相应的进程。nohup就是不挂起的意思,不挂断地运行命令。
说明2:awk 默认分隔符为空格
说明3:$2在双引号内部会被解析为脚本的第二个参数,但是这里想表达的是awk取到的第二个字段(进程号),所以需要将其转义,写成\$2。
说明4:xargs 表示取出前面命令运行的结果,作为后面命令的输入参数。
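
针对说明3中的转义问题,下面用一个小例子演示区别(示例为笔者补充):

# 在本机直接执行时,awk程序在单引号中,$2不需要转义
ps -ef | grep file-flume-kafka | grep -v grep | awk '{print $2}'
# 放进ssh的双引号里时,$2会先被本地shell当作脚本参数替换掉,因此必须写成\$2
ssh hadoop103 "ps -ef | grep file-flume-kafka | grep -v grep | awk '{print \$2}'"
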
(2)增加脚本执行权限

[hadoop@hadoop102 bin]$ chmod u+x f1.sh

(3)f1集群启动脚本

[hadoop@hadoop102 module]$ f1.sh start

(4)f1集群停止脚本

[hadoop@hadoop102 module]$ f1.sh stop

4.6 消费Kafka数据Flume

集群规划
| 服务器hadoop102 | 服务器hadoop103 | 服务器hadoop104 |
| -------------- | -------------- | ---------------- |
|                |                | Flume(消费Kafka) |
4.6.1 项目经验之Flume组件选型
1)FileChannel和MemoryChannel区别
MemoryChannel传输数据速度更快,但因为数据保存在JVM的堆内存中,Agent进程挂掉会导致数据丢失,适用于对数据质量要求不高的需求。
FileChannel传输速度相对于Memory慢,但数据安全保障高,Agent进程挂掉也可以从失败中恢复数据。
选型:
金融类公司、对钱要求非常准确的公司通常会选择FileChannel
传输的是普通日志信息(京东内部一天丢100万-200万条,这是非常正常的),通常选择MemoryChannel。
2)FileChannel优化
通过配置dataDirs指向多个路径,每个路径对应不同的硬盘,增大Flume吞吐量。
官方说明如下:
Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel peformance
checkpointDir和backupCheckpointDir也尽量配置在不同硬盘对应的目录中,保证checkpoint坏掉后,可以快速使用backupCheckpointDir恢复数据。

3)Sink:HDFS Sink
(1)HDFS存入大量小文件,有什么影响?
元数据层面:每个小文件都有一份元数据,其中包括文件路径,文件名,所有者,所属组,权限,创建时间等,这些信息都保存在Namenode内存中。所以小文件过多,会占用Namenode服务器大量内存,影响Namenode性能和使用寿命
计算层面:默认情况下MR会对每个小文件启用一个Map任务计算,非常影响计算性能。同时也影响磁盘寻址时间。
(2)HDFS小文件处理
官方默认的这三个参数配置写入HDFS后会产生小文件,hdfs.rollInterval、hdfs.rollSize、hdfs.rollCount
基于以上hdfs.rollInterval=3600,hdfs.rollSize=134217728,hdfs.rollCount =0几个参数综合作用,效果如下:
①文件在达到128M时会滚动生成新文件
②文件创建超3600秒时会滚动生成新文件
4.6.2 消费者Flume配置
1)Flume配置分析

2)Flume的具体配置如下:
(1)在hadoop104的/opt/module/flume/conf目录下创建kafka-flume-hdfs.conf文件

[hadoop@hadoop104 conf]$ vim kafka-flume-hdfs.conf
在文件配置如下内容

## 组件

a1.sources=r1
a1.channels=c1
a1.sinks=k1

## source1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimeStampInterceptor$Builder

## channel1

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/


## sink1

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.round = false

#控制生成的小文件
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

## 控制输出文件类型,这里输出LZO压缩文件。

a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop

## 拼装

a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1

4.6.3 Flume时间戳拦截器
由于Flume默认会用Linux系统时间作为输出到HDFS路径的时间。如果数据是23:59分产生的,Flume消费Kafka里面的数据时,有可能已经是第二天了,那么这部分数据会被发往第二天的HDFS路径。我们希望根据日志里面的实际时间确定发往HDFS的路径,所以下面拦截器的作用是获取日志中的实际时间。
解决的思路:拦截json日志,通过fastjson框架解析json,获取实际时间ts。将获取的ts时间写入拦截器header头,header的key必须是timestamp,因为Flume框架会根据这个key的值识别为时间,写入到HDFS。
1)在com.atguigu.flume.interceptor包下创建TimeStampInterceptor类

package com.atguigu.flume.interceptor;

import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TimeStampInterceptor implements Interceptor {

    private ArrayList<Event> events = new ArrayList<>();
    
    @Override
    public void initialize() {
    
    }
    
    @Override
    public Event intercept(Event event) {
    
        Map<String, String> headers = event.getHeaders();
        String log = new String(event.getBody(), StandardCharsets.UTF_8);
    
        JSONObject jsonObject = JSONObject.parseObject(log);
    
        String ts = jsonObject.getString("ts");
        headers.put("timestamp", ts);
    
        return event;
    }
    
    @Override
    public List<Event> intercept(List<Event> list) {
        events.clear();
        for (Event event : list) {
            events.add(intercept(event));
        }
    
        return events;
    }
    
    @Override
    public void close() {
    
    }
    
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimeStampInterceptor();
        }
    
        @Override
        public void configure(Context context) {
        }
    }

}

2)重新打包

3)需要先将打好的包放入到hadoop102的/opt/module/flume/lib文件夹下面。

[hadoop@hadoop102 lib]$ ls | grep interceptor
flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar

4)分发Flume到hadoop103、hadoop104

[hadoop@hadoop102 module]$ xsync flume/

4.6.4 消费者Flume启动停止脚本
(1)在/home/hadoop/bin目录下创建脚本f2.sh

[hadoop@hadoop102 bin]$ vim f2.sh
	在脚本中填写如下内容
#! /bin/bash

case $1 in
"start"){
        for i in hadoop104
        do
                echo " --------启动 $i 消费flume-------"
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log2.txt   2>&1 &"
        done
};;
"stop"){
        for i in hadoop104
        do
                echo " --------停止 $i 消费flume-------"
                ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
        done

};;
esac

(2)增加脚本执行权限

[hadoop@hadoop102 bin]$ chmod u+x f2.sh

(3)f2集群启动脚本

[hadoop@hadoop102 module]$ f2.sh start

(4)f2集群停止脚本

[hadoop@hadoop102 module]$ f2.sh stop

4.6.5 项目经验之Flume内存优化
1)问题描述:如果启动消费Flume抛出如下异常
ERROR hdfs.HDFSEventSink: process failed
java.lang.OutOfMemoryError: GC overhead limit exceeded
2)解决方案步骤
(1)在hadoop102服务器的/opt/module/flume/conf/flume-env.sh文件中增加如下配置

export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

(2)同步配置到hadoop103、hadoop104服务器

[hadoop@hadoop102 conf]$ xsync flume-env.sh

3)Flume内存参数设置及优化
JVM heap一般设置为4G或更高
-Xmx与-Xms最好设置一致,减少内存抖动带来的性能影响,如果设置不一致容易导致频繁fullgc。
-Xms表示JVM Heap(堆内存)最小尺寸,初始分配;-Xmx 表示JVM Heap(堆内存)最大允许的尺寸,按需分配。如果不设置一致,容易在初始化时,由于内存不够,频繁触发fullgc。
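
按照上面的建议,可以把flume-env.sh中的JAVA_OPTS调整为-Xms与-Xmx一致(示例配置,4G为建议值,请按机器实际内存调整):

export JAVA_OPTS="-Xms4096m -Xmx4096m -Dcom.sun.management.jmxremote"
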
4.7 采集通道启动/停止脚本
(1)在/home/hadoop/bin目录下创建脚本cluster.sh

[hadoop@hadoop102 bin]$ vim cluster.sh
	在脚本中填写如下内容
#!/bin/bash

case $1 in
"start"){
        echo ================== 启动 集群 ==================

        #启动 Zookeeper集群
        zk.sh start
    
        #启动 Hadoop集群
        hdp.sh start
    
        #启动 Kafka采集集群
        kf.sh start
    
        #启动 Flume采集集群
        f1.sh start
    
        #启动 Flume消费集群
        f2.sh start
    
        };;

"stop"){
        echo ================== 停止 集群 ==================

        #停止 Flume消费集群
        f2.sh stop
    
        #停止 Flume采集集群
        f1.sh stop
    
        #停止 Kafka采集集群
        kf.sh stop
    
        #停止 Hadoop集群
        hdp.sh stop
    
        #停止 Zookeeper集群
        zk.sh stop

};;
esac

(2)增加脚本执行权限

[hadoop@hadoop102 bin]$ chmod u+x cluster.sh	

(3)cluster集群启动脚本

[hadoop@hadoop102 module]$ cluster.sh start

(4)cluster集群停止脚本

[hadoop@hadoop102 module]$ cluster.sh stop

常见问题及解决方案
5.1 2NN页面不能显示完整信息
1)问题描述
访问2NN页面http://hadoop104:9868,看不到详细信息
2)解决办法
(1)在浏览器上按F12,查看问题原因。定位bug在61行
(2)找到要修改的文件

[hadoop@hadoop102 static]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/hdfs/webapps/static

[hadoop@hadoop102 static]$ vim dfs-dust.js
:set nu
修改61行 
return new Date(Number(v)).toLocaleString();

(3)分发dfs-dust.js

[hadoop@hadoop102 static]$ xsync dfs-dust.js

(4)在http://hadoop104:9868/status.html 页面强制刷新

2.5 安装MySql

2.5.1 检查是否安装mysql并卸载

[hadoop@hadoop102 mysql]$ rpm -qa | grep -i -E mysql\|mariadb | xargs -n1 sudo rpm -e --nodeps

2.5.2 安装依赖

[hadoop@hadoop102 mysql]$ sudo yum install -y libaio
[hadoop@hadoop102 mysql]$ sudo yum -y install autoconf

2.5.3 安装mysql

[hadoop@hadoop102 mysql]$ rpm -ivh 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm 
warning: 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Permission denied)
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm 
warning: 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:mysql-community-common-5.7.16-1.e################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm 
warning: 02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:mysql-community-libs-5.7.16-1.el7################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm 
warning: 03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:mysql-community-libs-compat-5.7.1################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 04_mysql-community-client-5.7.16-1.el7.x86_64.rpm 
warning: 04_mysql-community-client-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:mysql-community-client-5.7.16-1.e################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 05_mysql-community-server-5.7.16-1.el7.x86_64.rpm 
warning: 05_mysql-community-server-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:mysql-community-server-5.7.16-1.e################################# [100%]

2.5.4 启动服务

[hadoop@hadoop102 mysql]$ sudo systemctl start mysqld.service

2.5.5 查看服务状态

[hadoop@hadoop102 mysql]$ sudo systemctl status mysqld.service 
● mysqld.service - MySQL Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-06 16:28:44 CST; 13s ago
  Process: 4932 ExecStart=/usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS (code=exited, status=0/SUCCESS)
  Process: 4858 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited, status=0/SUCCESS)
 Main PID: 4936 (mysqld)
   CGroup: /system.slice/mysqld.service
           └─4936 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid

Oct 06 16:28:41 hadoop102 systemd[1]: Starting MySQL Server...
Oct 06 16:28:44 hadoop102 systemd[1]: Started MySQL Server.

2.5.6 查看初始密码

[hadoop@hadoop102 mysql]$ sudo cat /var/log/mysqld.log | grep password
2021-10-06T08:28:41.617251Z 1 [Note] A temporary password is generated for root@localhost: iu0TV/6TpLn4

2.5.7 修改密码

[hadoop@hadoop102 mysql]$ mysql -uroot -piu0TV/6TpLn4
# 修改密码规则
mysql> set global validate_password_length=4;
Query OK, 0 rows affected (0.00 sec)
mysql> set global validate_password_policy=0;
Query OK, 0 rows affected (0.00 sec)
# 修改密码
mysql> alter user 'root'@'localhost' identified by '000000';
Query OK, 0 rows affected (0.00 sec)

2.5.8 开启远程登录

mysql> grant all privileges on *.* to 'root'@'%' identified by '000000';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select user,host from user;
+-----------+-----------+
| user      | host      |
+-----------+-----------+
| root      | %         |
| mysql.sys | localhost |
| root      | localhost |
+-----------+-----------+
3 rows in set (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

2.6 生成数据

将mock目录中业务目录下的gmall.sql导入到gmall数据库即可。

[hadoop@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-01-22.jar 

2.7 安装sqoop

2.7.1 解压安装

[hadoop@hadoop102 software]$ tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module

2.7.2 重命名

[hadoop@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop

2.7.3 配置

[hadoop@hadoop102 conf]$ mv sqoop-env-template.sh sqoop-env.sh
[hadoop@hadoop102 conf]$ vim sqoop-env.sh 
# 添加配置
export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3
export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7
export ZOOCFGDIR=/opt/module/zookeeper-3.5.7/conf

2.7.4 添加驱动到lib目录

[hadoop@hadoop102 conf]$ mv /opt/software/mysql/mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/

2.7.5 验证是否安装成功

[hadoop@hadoop102 sqoop]$ bin/sqoop help
Warning: /opt/module/sqoop/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/module/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2021-10-06 18:07:14,693 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

2.7.6 测试是否可以成功连接数据库

[hadoop@hadoop102 sqoop]$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 000000
Warning: /opt/module/sqoop/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/module/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2021-10-06 18:08:05,762 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
2021-10-06 18:08:05,777 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2021-10-06 18:08:05,848 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
gmall
mysql
performance_schema
sys

2.7.7 测试sqoop是否可以正常导入hdfs

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table user_info \
--columns id,login_name \
--where "id>=10 and id<=30" \
--target-dir /test \
--delete-target-dir \
--fields-terminated-by '\t' \
--num-mappers 2 \
--split-by id

2.8 同步策略

数据同步策略的类型包括:全量同步、增量同步、新增及变化同步、特殊情况

  • 全量表:存储完整的数据。
  • 增量表:存储新增加的数据。
  • 新增及变化表:存储新增加的数据和变化的数据。
  • 特殊表:只需要存储一次。

某些特殊的表,可不必遵循上述同步策略。例如某些不会发生变化的表(地区表,省份表,民族表)可以只存一份固定值。
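
落实到Sqoop的--query参数上,不同同步策略的差别主要体现在where条件上(以下为笔者补充的示意写法,$do_date为同步日期变量,表名、字段以实际业务为准):

# 全量同步:不加业务日期过滤,整表导入
--query "select ... from sku_info where \$CONDITIONS"
# 增量同步:只取当天新增的数据
--query "select ... from order_status_log where date_format(operate_time,'%Y-%m-%d')='$do_date' and \$CONDITIONS"
# 新增及变化同步:取当天新增或当天发生变化的数据
--query "select ... from user_info where (date_format(create_time,'%Y-%m-%d')='$do_date' or date_format(operate_time,'%Y-%m-%d')='$do_date') and \$CONDITIONS"
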

2.9 业务数据导入HDFS

2.9.1 分析表同步策略

在生产环境,个别小公司,为了简单处理,所有表全量导入。

中大型公司,由于数据量比较大,还是严格按照同步策略导入数据。

2.9.2 业务数据首日同步脚本

2.9.2.1 脚本编写

(1)在/home/hadoop/bin目录下创建

[hadoop@hadoop102 bin]$ vim mysql_to_hdfs_init.sh

添加如下内容:

#! /bin/bash

APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop

if [ -n "$2" ] ;then
   do_date=$2
else 
   echo "请传入日期参数"
   exit
fi 

import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/$APP \
--username root \
--password 000000 \
--target-dir /origin_data/$APP/db/$1/$do_date \
--delete-target-dir \
--query "$2 where \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
}

import_order_info(){
  import_data order_info "select
                            id, 
                            total_amount, 
                            order_status, 
                            user_id, 
                            payment_way,
                            delivery_address,
                            out_trade_no, 
                            create_time, 
                            operate_time,
                            expire_time,
                            tracking_no,
                            province_id,
                            activity_reduce_amount,
                            coupon_reduce_amount,                            
                            original_total_amount,
                            feight_fee,
                            feight_fee_reduce      
                        from order_info"
}

import_coupon_use(){
  import_data coupon_use "select
                          id,
                          coupon_id,
                          user_id,
                          order_id,
                          coupon_status,
                          get_time,
                          using_time,
                          used_time,
                          expire_time
                        from coupon_use"
}

import_order_status_log(){
  import_data order_status_log "select
                                  id,
                                  order_id,
                                  order_status,
                                  operate_time
                                from order_status_log"
}

import_user_info(){
  import_data "user_info" "select 
                            id,
                            login_name,
                            nick_name,
                            name,
                            phone_num,
                            email,
                            user_level, 
                            birthday,
                            gender,
                            create_time,
                            operate_time
                          from user_info"
}

import_order_detail(){
  import_data order_detail "select 
                              id,
                              order_id, 
                              sku_id,
                              sku_name,
                              order_price,
                              sku_num, 
                              create_time,
                              source_type,
                              source_id,
                              split_total_amount,
                              split_activity_amount,
                              split_coupon_amount
                            from order_detail"
}

import_payment_info(){
  import_data "payment_info"  "select 
                                id,  
                                out_trade_no, 
                                order_id, 
                                user_id, 
                                payment_type, 
                                trade_no, 
                                total_amount,  
                                subject, 
                                payment_status,
                                create_time,
                                callback_time 
                              from payment_info"
}

import_comment_info(){
  import_data comment_info "select
                              id,
                              user_id,
                              sku_id,
                              spu_id,
                              order_id,
                              appraise,
                              create_time
                            from comment_info"
}

import_order_refund_info(){
  import_data order_refund_info "select
                                id,
                                user_id,
                                order_id,
                                sku_id,
                                refund_type,
                                refund_num,
                                refund_amount,
                                refund_reason_type,
                                refund_status,
                                create_time
                              from order_refund_info"
}

import_sku_info(){
  import_data sku_info "select 
                          id,
                          spu_id,
                          price,
                          sku_name,
                          sku_desc,
                          weight,
                          tm_id,
                          category3_id,
                          is_sale,
                          create_time
                        from sku_info"
}

import_base_category1(){
  import_data "base_category1" "select 
                                  id,
                                  name 
                                from base_category1"
}

import_base_category2(){
  import_data "base_category2" "select
                                  id,
                                  name,
                                  category1_id 
                                from base_category2"
}

import_base_category3(){
  import_data "base_category3" "select
                                  id,
                                  name,
                                  category2_id
                                from base_category3"
}

import_base_province(){
  import_data base_province "select
                              id,
                              name,
                              region_id,
                              area_code,
                              iso_code,
                              iso_3166_2
                            from base_province"
}

import_base_region(){
  import_data base_region "select
                              id,
                              region_name
                            from base_region"
}

import_base_trademark(){
  import_data base_trademark "select
                                id,
                                tm_name
                              from base_trademark"
}

import_spu_info(){
  import_data spu_info "select
                            id,
                            spu_name,
                            category3_id,
                            tm_id
                          from spu_info"
}

import_favor_info(){
  import_data favor_info "select
                          id,
                          user_id,
                          sku_id,
                          spu_id,
                          is_cancel,
                          create_time,
                          cancel_time
                        from favor_info"
}

import_cart_info(){
  import_data cart_info "select
                        id,
                        user_id,
                        sku_id,
                        cart_price,
                        sku_num,
                        sku_name,
                        create_time,
                        operate_time,
                        is_ordered,
                        order_time,
                        source_type,
                        source_id
                      from cart_info"
}

import_coupon_info(){
  import_data coupon_info "select
                          id,
                          coupon_name,
                          coupon_type,
                          condition_amount,
                          condition_num,
                          activity_id,
                          benefit_amount,
                          benefit_discount,
                          create_time,
                          range_type,
                          limit_num,
                          taken_count,
                          start_time,
                          end_time,
                          operate_time,
                          expire_time
                        from coupon_info"
}

import_activity_info(){
  import_data activity_info "select
                              id,
                              activity_name,
                              activity_type,
                              start_time,
                              end_time,
                              create_time
                            from activity_info"
}

import_activity_rule(){
    import_data activity_rule "select
                                    id,
                                    activity_id,
                                    activity_type,
                                    condition_amount,
                                    condition_num,
                                    benefit_amount,
                                    benefit_discount,
                                    benefit_level
                                from activity_rule"
}

import_base_dic(){
    import_data base_dic "select
                            dic_code,
                            dic_name,
                            parent_code,
                            create_time,
                            operate_time
                          from base_dic"
}


import_order_detail_activity(){
    import_data order_detail_activity "select
                                                                id,
                                                                order_id,
                                                                order_detail_id,
                                                                activity_id,
                                                                activity_rule_id,
                                                                sku_id,
                                                                create_time
                                                            from order_detail_activity"
}


import_order_detail_coupon(){
    import_data order_detail_coupon "select
                                                                id,
                                                                order_id,
                                                                order_detail_id,
                                                                coupon_id,
                                                                coupon_use_id,
                                                                sku_id,
                                                                create_time
                                                            from order_detail_coupon"
}


import_refund_payment(){
    import_data refund_payment "select
                                                        id,
                                                        out_trade_no,
                                                        order_id,
                                                        sku_id,
                                                        payment_type,
                                                        trade_no,
                                                        total_amount,
                                                        subject,
                                                        refund_status,
                                                        create_time,
                                                        callback_time
                                                    from refund_payment"                                                    

}

import_sku_attr_value(){
    import_data sku_attr_value "select
                                                    id,
                                                    attr_id,
                                                    value_id,
                                                    sku_id,
                                                    attr_name,
                                                    value_name
                                                from sku_attr_value"
}


import_sku_sale_attr_value(){
    import_data sku_sale_attr_value "select
                                                            id,
                                                            sku_id,
                                                            spu_id,
                                                            sale_attr_value_id,
                                                            sale_attr_id,
                                                            sale_attr_name,
                                                            sale_attr_value_name
                                                        from sku_sale_attr_value"
}

case $1 in
  "order_info")
     import_order_info
;;
  "base_category1")
     import_base_category1
;;
  "base_category2")
     import_base_category2
;;
  "base_category3")
     import_base_category3
;;
  "order_detail")
     import_order_detail
;;
  "sku_info")
     import_sku_info
;;
  "user_info")
     import_user_info
;;
  "payment_info")
     import_payment_info
;;
  "base_province")
     import_base_province
;;
  "base_region")
     import_base_region
;;
  "base_trademark")
     import_base_trademark
;;
  "activity_info")
      import_activity_info
;;
  "cart_info")
      import_cart_info
;;
  "comment_info")
      import_comment_info
;;
  "coupon_info")
      import_coupon_info
;;
  "coupon_use")
      import_coupon_use
;;
  "favor_info")
      import_favor_info
;;
  "order_refund_info")
      import_order_refund_info
;;
  "order_status_log")
      import_order_status_log
;;
  "spu_info")
      import_spu_info
;;
  "activity_rule")
      import_activity_rule
;;
  "base_dic")
      import_base_dic
;;
  "order_detail_activity")
      import_order_detail_activity
;;
  "order_detail_coupon")
      import_order_detail_coupon
;;
  "refund_payment")
      import_refund_payment
;;
  "sku_attr_value")
      import_sku_attr_value
;;
  "sku_sale_attr_value")
      import_sku_sale_attr_value
;;
  "all")
   import_base_category1
   import_base_category2
   import_base_category3
   import_order_info
   import_order_detail
   import_sku_info
   import_user_info
   import_payment_info
   import_base_region
   import_base_province
   import_base_trademark
   import_activity_info
   import_cart_info
   import_comment_info
   import_coupon_use
   import_coupon_info
   import_favor_info
   import_order_refund_info
   import_order_status_log
   import_spu_info
   import_activity_rule
   import_base_dic
   import_order_detail_activity
   import_order_detail_coupon
   import_refund_payment
   import_sku_attr_value
   import_sku_sale_attr_value
;;
esac

Note 1:

[ -n STRING ] tests whether the value of a variable is non-empty:

– if the value is non-empty, it returns true

– if the value is empty, it returns false

Note 2:

For details on the date command, see:

[hadoop@hadoop102 ~]$ date --help
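
For reference, a minimal standalone sketch of the same argument-handling pattern the sync scripts use (the script name demo.sh is hypothetical):

#! /bin/bash
# $1 = table name, $2 = optional business date
if [ -n "$2" ]; then
    do_date=$2                       # a date was passed in, use it
else
    do_date=$(date -d '-1 day' +%F)  # otherwise default to yesterday, e.g. 2020-06-13
fi
echo "would import table '$1' for date $do_date"

Running ./demo.sh order_info 2020-06-14 prints the given date, while ./demo.sh order_info falls back to yesterday's date.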

(2) Grant the script execute permission

[hadoop@hadoop102 bin]$ chmod +x mysql_to_hdfs_init.sh

2.9.2.2 Using the script

[hadoop@hadoop102 bin]$ mysql_to_hdfs_init.sh all 2020-06-14

2.9.3 Daily business data sync script

2.9.3.1 Writing the script

(1) Create the script in the /home/hadoop/bin directory

[hadoop@hadoop102 bin]$ vim mysql_to_hdfs.sh

Add the following content:

#! /bin/bash

APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop

if [ -n "$2" ] ;then
    do_date=$2
else
    do_date=`date -d '-1 day' +%F`
fi

import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/$APP \
--username root \
--password 000000 \
--target-dir /origin_data/$APP/db/$1/$do_date \
--delete-target-dir \
--query "$2 and  \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
}

import_order_info(){
  import_data order_info "select
                            id, 
                            total_amount, 
                            order_status, 
                            user_id, 
                            payment_way,
                            delivery_address,
                            out_trade_no, 
                            create_time, 
                            operate_time,
                            expire_time,
                            tracking_no,
                            province_id,
                            activity_reduce_amount,
                            coupon_reduce_amount,                            
                            original_total_amount,
                            feight_fee,
                            feight_fee_reduce      
                        from order_info
                        where (date_format(create_time,'%Y-%m-%d')='$do_date' 
                        or date_format(operate_time,'%Y-%m-%d')='$do_date')"
}

import_coupon_use(){
  import_data coupon_use "select
                          id,
                          coupon_id,
                          user_id,
                          order_id,
                          coupon_status,
                          get_time,
                          using_time,
                          used_time,
                          expire_time
                        from coupon_use
                        where (date_format(get_time,'%Y-%m-%d')='$do_date'
                        or date_format(using_time,'%Y-%m-%d')='$do_date'
                        or date_format(used_time,'%Y-%m-%d')='$do_date'
                        or date_format(expire_time,'%Y-%m-%d')='$do_date')"
}

import_order_status_log(){
  import_data order_status_log "select
                                  id,
                                  order_id,
                                  order_status,
                                  operate_time
                                from order_status_log
                                where date_format(operate_time,'%Y-%m-%d')='$do_date'"
}

import_user_info(){
  import_data "user_info" "select 
                            id,
                            login_name,
                            nick_name,
                            name,
                            phone_num,
                            email,
                            user_level, 
                            birthday,
                            gender,
                            create_time,
                            operate_time
                          from user_info 
                          where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date' 
                          or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"
}

import_order_detail(){
  import_data order_detail "select 
                              id,
                              order_id, 
                              sku_id,
                              sku_name,
                              order_price,
                              sku_num, 
                              create_time,
                              source_type,
                              source_id,
                              split_total_amount,
                              split_activity_amount,
                              split_coupon_amount
                            from order_detail 
                            where DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'"
}

import_payment_info(){
  import_data "payment_info"  "select 
                                id,  
                                out_trade_no, 
                                order_id, 
                                user_id, 
                                payment_type, 
                                trade_no, 
                                total_amount,  
                                subject, 
                                payment_status,
                                create_time,
                                callback_time 
                              from payment_info 
                              where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date' 
                              or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"
}

import_comment_info(){
  import_data comment_info "select
                              id,
                              user_id,
                              sku_id,
                              spu_id,
                              order_id,
                              appraise,
                              create_time
                            from comment_info
                            where date_format(create_time,'%Y-%m-%d')='$do_date'"
}

import_order_refund_info(){
  import_data order_refund_info "select
                                id,
                                user_id,
                                order_id,
                                sku_id,
                                refund_type,
                                refund_num,
                                refund_amount,
                                refund_reason_type,
                                refund_status,
                                create_time
                              from order_refund_info
                              where date_format(create_time,'%Y-%m-%d')='$do_date'"
}

import_sku_info(){
  import_data sku_info "select 
                          id,
                          spu_id,
                          price,
                          sku_name,
                          sku_desc,
                          weight,
                          tm_id,
                          category3_id,
                          is_sale,
                          create_time
                        from sku_info where 1=1"
}

import_base_category1(){
  import_data "base_category1" "select 
                                  id,
                                  name 
                                from base_category1 where 1=1"
}

import_base_category2(){
  import_data "base_category2" "select
                                  id,
                                  name,
                                  category1_id 
                                from base_category2 where 1=1"
}

import_base_category3(){
  import_data "base_category3" "select
                                  id,
                                  name,
                                  category2_id
                                from base_category3 where 1=1"
}

import_base_province(){
  import_data base_province "select
                              id,
                              name,
                              region_id,
                              area_code,
                              iso_code,
                              iso_3166_2
                            from base_province
                            where 1=1"
}

import_base_region(){
  import_data base_region "select
                              id,
                              region_name
                            from base_region
                            where 1=1"
}

import_base_trademark(){
  import_data base_trademark "select
                                id,
                                tm_name
                              from base_trademark
                              where 1=1"
}

import_spu_info(){
  import_data spu_info "select
                            id,
                            spu_name,
                            category3_id,
                            tm_id
                          from spu_info
                          where 1=1"
}

import_favor_info(){
  import_data favor_info "select
                          id,
                          user_id,
                          sku_id,
                          spu_id,
                          is_cancel,
                          create_time,
                          cancel_time
                        from favor_info
                        where 1=1"
}

import_cart_info(){
  import_data cart_info "select
                        id,
                        user_id,
                        sku_id,
                        cart_price,
                        sku_num,
                        sku_name,
                        create_time,
                        operate_time,
                        is_ordered,
                        order_time,
                        source_type,
                        source_id
                      from cart_info
                      where 1=1"
}

import_coupon_info(){
  import_data coupon_info "select
                          id,
                          coupon_name,
                          coupon_type,
                          condition_amount,
                          condition_num,
                          activity_id,
                          benefit_amount,
                          benefit_discount,
                          create_time,
                          range_type,
                          limit_num,
                          taken_count,
                          start_time,
                          end_time,
                          operate_time,
                          expire_time
                        from coupon_info
                        where 1=1"
}

import_activity_info(){
  import_data activity_info "select
                              id,
                              activity_name,
                              activity_type,
                              start_time,
                              end_time,
                              create_time
                            from activity_info
                            where 1=1"
}

import_activity_rule(){
    import_data activity_rule "select
                                    id,
                                    activity_id,
                                    activity_type,
                                    condition_amount,
                                    condition_num,
                                    benefit_amount,
                                    benefit_discount,
                                    benefit_level
                                from activity_rule
                                where 1=1"
}

import_base_dic(){
    import_data base_dic "select
                            dic_code,
                            dic_name,
                            parent_code,
                            create_time,
                            operate_time
                          from base_dic
                          where 1=1"
}


import_order_detail_activity(){
    import_data order_detail_activity "select
                                                                id,
                                                                order_id,
                                                                order_detail_id,
                                                                activity_id,
                                                                activity_rule_id,
                                                                sku_id,
                                                                create_time
                                                            from order_detail_activity
                                                            where date_format(create_time,'%Y-%m-%d')='$do_date'"
}


import_order_detail_coupon(){
    import_data order_detail_coupon "select
                                                                id,
                                                                order_id,
                                                                order_detail_id,
                                                                coupon_id,
                                                                coupon_use_id,
                                                                sku_id,
                                                                create_time
                                                            from order_detail_coupon
                                                            where date_format(create_time,'%Y-%m-%d')='$do_date'"
}


import_refund_payment(){
    import_data refund_payment "select
                                                        id,
                                                        out_trade_no,
                                                        order_id,
                                                        sku_id,
                                                        payment_type,
                                                        trade_no,
                                                        total_amount,
                                                        subject,
                                                        refund_status,
                                                        create_time,
                                                        callback_time
                                                    from refund_payment
                                                    where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date' 
                                                    or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"                                                    

}

import_sku_attr_value(){
    import_data sku_attr_value "select
                                                    id,
                                                    attr_id,
                                                    value_id,
                                                    sku_id,
                                                    attr_name,
                                                    value_name
                                                from sku_attr_value
                                                where 1=1"
}


import_sku_sale_attr_value(){
    import_data sku_sale_attr_value "select
                                                            id,
                                                            sku_id,
                                                            spu_id,
                                                            sale_attr_value_id,
                                                            sale_attr_id,
                                                            sale_attr_name,
                                                            sale_attr_value_name
                                                        from sku_sale_attr_value
                                                        where 1=1"
}

case $1 in
  "order_info")
     import_order_info
;;
  "base_category1")
     import_base_category1
;;
  "base_category2")
     import_base_category2
;;
  "base_category3")
     import_base_category3
;;
  "order_detail")
     import_order_detail
;;
  "sku_info")
     import_sku_info
;;
  "user_info")
     import_user_info
;;
  "payment_info")
     import_payment_info
;;
  "base_province")
     import_base_province
;;
  "activity_info")
      import_activity_info
;;
  "cart_info")
      import_cart_info
;;
  "comment_info")
      import_comment_info
;;
  "coupon_info")
      import_coupon_info
;;
  "coupon_use")
      import_coupon_use
;;
  "favor_info")
      import_favor_info
;;
  "order_refund_info")
      import_order_refund_info
;;
  "order_status_log")
      import_order_status_log
;;
  "spu_info")
      import_spu_info
;;
  "activity_rule")
      import_activity_rule
;;
  "base_dic")
      import_base_dic
;;
  "order_detail_activity")
      import_order_detail_activity
;;
  "order_detail_coupon")
      import_order_detail_coupon
;;
  "refund_payment")
      import_refund_payment
;;
  "sku_attr_value")
      import_sku_attr_value
;;
  "sku_sale_attr_value")
      import_sku_sale_attr_value
;;
"all")
   import_base_category1
   import_base_category2
   import_base_category3
   import_order_info
   import_order_detail
   import_sku_info
   import_user_info
   import_payment_info
   import_base_trademark
   import_activity_info
   import_cart_info
   import_comment_info
   import_coupon_use
   import_coupon_info
   import_favor_info
   import_order_refund_info
   import_order_status_log
   import_spu_info
   import_activity_rule
   import_base_dic
   import_order_detail_activity
   import_order_detail_coupon
   import_refund_payment
   import_sku_attr_value
   import_sku_sale_attr_value
;;
esac

(2) Grant the script execute permission

[hadoop@hadoop102 bin]$ chmod +x mysql_to_hdfs.sh

2.9.3.2 Using the script

[hadoop@hadoop102 bin]$ mysql_to_hdfs.sh all 2020-06-15
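
To sanity-check a run, you can list the HDFS target directory the script writes to; the path below assumes the order_info table and the 2020-06-15 date used above:

[hadoop@hadoop102 bin]$ hdfs dfs -ls /origin_data/gmall/db/order_info/2020-06-15

You should see lzop-compressed part files plus the .index files produced by the DistributedLzoIndexer step.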

2.10 Installing Hive

2.10.1 Extract and install

[hadoop@hadoop102 hive]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/

2.10.2 Rename

[hadoop@hadoop102 module]$ mv apache-hive-3.1.2-bin/ hive

2.10.3 Configuration

[hadoop@hadoop102 conf]$ vim hive-site.xml

Add the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>000000</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop102</value>
    </property>

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>

    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

2.10.4 Configure environment variables

# Edit the environment file
[hadoop@hadoop102 conf]$ sudo vim /etc/profile.d/my_env.sh
# Add the following content
# HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
# Apply the changes
[hadoop@hadoop102 conf]$ source /etc/profile.d/my_env.sh

2.10.5 Copy the MySQL JDBC driver

[hadoop@hadoop102 mysql]$ cp /opt/software/mysql/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/

2.10.6 Log in to MySQL

[hadoop@hadoop102 mysql]$ mysql -uroot -p000000

2.10.7 Create the metastore database

mysql> create database metastore;
Query OK, 1 row affected (0.00 sec)

mysql> quit
Bye

2.10.8 Initialize the metastore schema

[hadoop@hadoop102 mysql]$ schematool -initSchema -dbType mysql -verbose

2.10.9 Verify the Hive installation

[hadoop@hadoop102 mysql]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 762f0e4f-4a2a-45d8-9a1b-51c2432ab65f

Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 89ed2aaf-1470-4b72-931c-f23063aa0ddf
hive (default)> show databases;
OK
database_name
default
Time taken: 0.614 seconds, Fetched: 1 row(s)

2.11 Installing Spark

2.11.1 The difference: Hive on Spark vs. Spark on Hive

Hive supports several execution engines: MR (the default), Tez, and Spark.

  • Hive on Spark: Hive both stores the metadata and parses/optimizes the SQL; the syntax is HQL, but the execution engine is Spark, which runs the job as RDD operations.
  • Spark on Hive: Hive only stores the metadata; Spark parses and optimizes the SQL, the syntax is Spark SQL, and Spark runs the job as RDD operations.

2.11.2 Installation

Compatibility note

Note: the Hive 3.1.2 and Spark 3.0.0 binaries downloaded from the official sites are not compatible by default, because Hive 3.1.2 was built against Spark 2.4.5. Hive 3.1.2 therefore has to be recompiled.

Build steps: download the Hive 3.1.2 source from the official site and change the Spark version referenced in the pom file to 3.0.0. If the build succeeds, package it and take the resulting jars; if it fails, fix the affected methods as indicated by the errors until the build passes, then package it.
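
A rough sketch of that rebuild, assuming the Hive 3.1.2 source has been extracted and the spark.version property in its pom.xml already changed to 3.0.0 (the profile and output path below are those of the standard Hive build, not specific to this guide):

# run inside the Hive source directory
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true
# the rebuilt binary tarball is typically produced under packaging/target/

If a module fails to compile, adjust the reported methods and rerun the same command until the package step succeeds.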

# Extract and install
[hadoop@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv spark-3.0.0-bin-hadoop3.2/ spark
# Add environment variables
[hadoop@hadoop102 spark]$ sudo vim /etc/profile.d/my_env.sh
# Content to add

# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin

# Apply the environment variables
[hadoop@hadoop102 spark]$ source /etc/profile.d/my_env.sh

2.11.3 Configuration

# Create the Spark configuration file for Hive (properties format, hence spark-defaults.conf)
[hadoop@hadoop102 module]$ vim /opt/module/hive/conf/spark-defaults.conf
# Add the following content
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:8020/spark-history
spark.executor.memory            1g
spark.driver.memory              1g

# Create the /spark-history path on HDFS
[hadoop@hadoop102 module]$ hadoop dfs -mkdir /spark-history
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

[hadoop@hadoop102 module]$ hadoop dfs -ls
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

ls: `.': No such file or directory
[hadoop@hadoop102 module]$ hadoop dfs -ls /
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2021-10-06 20:00 /origin_data
drwxr-xr-x   - hadoop supergroup          0 2021-10-07 00:58 /spark-history
drwxr-xr-x   - hadoop supergroup          0 2021-10-06 18:14 /test
drwxrwx---   - hadoop supergroup          0 2021-10-06 21:29 /tmp

2.11.4 Upload the pure (without-hadoop) Spark jars to HDFS

  • Note 1: the full Spark 3.0.0 distribution bundles Hive 2.3.7 support by default, which causes compatibility problems with the installed Hive 3.1.2. The pure Spark jars contain no Hadoop or Hive dependencies and avoid that conflict.

  • Note 2: Hive jobs are ultimately executed by Spark, and Spark task resources are scheduled by YARN, so a job may be assigned to any node in the cluster. The Spark dependencies therefore have to be uploaded to an HDFS path that every node can read.

# Extract the without-hadoop tarball
[hadoop@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-without-hadoop.tgz
# Create the /spark-jars directory on HDFS
[hadoop@hadoop102 spark]$ hdfs dfs -mkdir /spark-jars
# Upload the pure Spark jars to HDFS
[hadoop@hadoop102 spark]$ hdfs dfs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
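
As a quick optional check that the upload worked:

[hadoop@hadoop102 spark]$ hdfs dfs -ls /spark-jars | head -n 5

The directory should list the Spark jar files that were just uploaded.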

2.11.5 Modify hive-site.xml

# Edit the configuration file
[hadoop@hadoop102 spark]$ vim /opt/module/hive/conf/hive-site.xml

# Add the following content
<!-- Location of the Spark dependencies (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
  
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>

# Restart the cluster
[hadoop@hadoop102 ~]$ hdp.sh stop
 =================== 关闭 hadoop集群 ===================
 --------------- 关闭 historyserver ---------------
 --------------- 关闭 yarn ---------------
Stopping nodemanagers
hadoop104: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop103: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
Stopping resourcemanager
 --------------- 关闭 hdfs ---------------
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[hadoop@hadoop102 ~]$ hdp.sh start
 =================== 启动 hadoop集群 ===================
 --------------- 启动 hdfs ---------------
Starting namenodes on [hadoop102]
Starting datanodes
Starting secondary namenodes [hadoop104]
 --------------- 启动 yarn ---------------
Starting resourcemanager
Starting nodemanagers
 --------------- 启动 historyserver ---------------

2.11.6 Increase the Application Master resource ratio

The Capacity Scheduler limits the resources that Application Masters may occupy in each resource queue via the yarn.scheduler.capacity.maximum-am-resource-percent parameter. Its default value is 0.1, meaning Application Masters in a queue may use at most 10% of the queue's total resources; this prevents AMs from consuming most of the resources and leaving no room for Map/Reduce tasks.

In production the default is usually fine. In a learning environment, however, the total cluster capacity is very small, so if only 10% is reserved for Application Masters, a single AM may already hit the cap and only one job can run at a time. It is therefore reasonable to raise this value here.
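
A rough worked example with assumed numbers (not measured on this cluster): if each of the three NodeManagers offers 4 GB to YARN, the queue holds 12 GB in total.

AM cap at the default 0.1:  12 GB x 0.1 = 1.2 GB  -> roughly one ~1 GB Application Master, so only one concurrent job
AM cap at 0.8:              12 GB x 0.8 = 9.6 GB  -> several Application Masters can run at the same time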

# Edit the configuration file
[hadoop@hadoop102 spark]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml
# Modify the following property
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.8</value>
</property>

# Distribute
[hadoop@hadoop102 spark]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml 
==================== hadoop102 ====================
sending incremental file list

sent 77 bytes  received 12 bytes  178.00 bytes/sec
total size is 8,260  speedup is 92.81
==================== hadoop103 ====================
sending incremental file list
capacity-scheduler.xml

sent 868 bytes  received 107 bytes  1,950.00 bytes/sec
total size is 8,260  speedup is 8.47
==================== hadoop104 ====================
sending incremental file list
capacity-scheduler.xml

sent 868 bytes  received 107 bytes  1,950.00 bytes/sec
total size is 8,260  speedup is 8.47

2.11.7 Test

[hadoop@hadoop102 spark]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin:/opt/module/spark/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = a4481e73-37f7-42da-a8e8-3c2af926495b

Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = e71805fb-3229-4654-88ee-10cec1ff16e8
hive (default)> create table student(id int, name string);
OK
Time taken: 0.951 seconds
hive (default)> insert into table student values(1,'abc');
Query ID = hadoop_20211007011117_f6444aac-6394-4733-b1a3-61c6a45590b5
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1633523263782_0055
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1633523263782_0055
Hive on Spark Session Web UI URL: http://hadoop103:40693

Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0  
Stage-1 ........         0      FINISHED      1          1        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 6.14 s     
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 6.14 second(s)
Loading data to table default.student
OK
col1	col2
Time taken: 22.071 seconds

2.12 Installing HBase

2.12.1 Extract to the target directory

# Extract
[hadoop@hadoop102 hbase]$ tar -zxvf hbase-2.0.5-bin.tar.gz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv hbase-2.0.5/ hbase

2.12.2 Configuration

[hadoop@hadoop102 hbase]$ vim /opt/module/hbase/conf/hbase-site.xml
# Add the following content
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop102:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop102,hadoop103,hadoop104</value>
  </property>
  
# Do not let HBase manage its own ZooKeeper
[hadoop@hadoop102 hbase]$ vim /opt/module/hbase/conf/hbase-env.sh
export HBASE_MANAGES_ZK=false

# Configure the region servers
[hadoop@hadoop102 hbase]$ vim /opt/module/hbase/conf/regionservers 
hadoop102
hadoop103
hadoop104

2.12.3 Create symbolic links

[hadoop@hadoop102 module]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml /opt/module/hbase/conf/core-site.xml
[hadoop@hadoop102 module]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml /opt/module/hbase/conf/hdfs-site.xml

2.12.4 Configure environment variables

[hadoop@hadoop102 hbase]$ sudo vim /etc/profile.d/my_env.sh 

# Add the following content
# HBASE_HOME
export HBASE_HOME=/opt/module/hbase
export PATH=$PATH:$HBASE_HOME/bin

# Apply the environment variables
[hadoop@hadoop102 hbase]$ source /etc/profile.d/my_env.sh

2.12.5 Distribute

# Distribute the HBase installation
[hadoop@hadoop102 module]$ xsync hbase/
# Distribute the environment variables
[hadoop@hadoop102 hbase]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh 
==================== hadoop102 ====================
root@hadoop102's password: 
root@hadoop102's password: 
sending incremental file list

sent 48 bytes  received 12 bytes  24.00 bytes/sec
total size is 548  speedup is 9.13
==================== hadoop103 ====================
root@hadoop103's password: 
root@hadoop103's password: 
sending incremental file list
my_env.sh

sent 643 bytes  received 41 bytes  273.60 bytes/sec
total size is 548  speedup is 0.80
==================== hadoop104 ====================
root@hadoop104's password: 
root@hadoop104's password: 
sending incremental file list
my_env.sh

sent 643 bytes  received 41 bytes  456.00 bytes/sec
total size is 548  speedup is 0.80

2.12.6 Start the service

[hadoop@hadoop102 hbase]$ start-hbase.sh 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
running master, logging to /opt/module/hbase/logs/hbase-hadoop-master-hadoop102.out
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop103: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop103.out
hadoop104: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop104.out
hadoop102: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop102.out
hadoop103: SLF4J: Class path contains multiple SLF4J bindings.
hadoop103: SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop103: SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop103: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hadoop103: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop104: SLF4J: Class path contains multiple SLF4J bindings.
hadoop104: SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop104: SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop104: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hadoop104: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[hadoop@hadoop102 hbase]$ xcall.sh jps
============hadoop102================
22785 NameNode
23218 NodeManager
22582 QuorumPeerMain
27078 HMaster
27590 Jps
27243 HRegionServer
22941 DataNode
23389 JobHistoryServer
23791 Kafka
============hadoop103================
30499 QuorumPeerMain
30979 NodeManager
30597 DataNode
31589 Kafka
30776 ResourceManager
32060 HRegionServer
32254 Jps
============hadoop104================
27393 Jps
25895 SecondaryNameNode
25784 DataNode
25980 NodeManager
26444 Kafka
26476 Application
25693 QuorumPeerMain
[hadoop@hadoop102 hbase]$ 

2.12.7 Access the web UI

http://hadoop102:16010/master-status

2.12.8 Possible issues when starting the service

[hadoop@hadoop102 hbase]$ bin/hbase-daemon.sh start master
[hadoop@hadoop102 hbase]$ bin/hbase-daemon.sh start regionserver

Tip: if the clocks of the cluster nodes are out of sync, the region servers cannot start and throw a ClockOutOfSyncException.

Possible fixes:

a. Synchronize the time service (a quick skew check is sketched after this list)

See the companion documentation: 《尚硅谷大数据技术之Hadoop入门》

b. Set a larger value for the hbase.master.maxclockskew property

<property>
        <name>hbase.master.maxclockskew</name>
        <value>180000</value>
        <description>Time difference of regionserver from master</description>
</property>
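
For a rough check of clock skew across the three nodes, you can reuse the xcall.sh helper set up earlier (assuming it forwards its arguments to each host) and compare the epoch seconds it prints:

[hadoop@hadoop102 hbase]$ xcall.sh date +%s

If the values differ by more than a couple of seconds, fix time synchronization (NTP/chronyd) first rather than only raising hbase.master.maxclockskew.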

2.13 Installing Solr

2.13.1 Create a system user

  • hadoop102
[hadoop@hadoop102 hbase]$ sudo useradd solr
[hadoop@hadoop102 hbase]$ sudo echo solr | passwd --stdin solr
Only root can do that.
[hadoop@hadoop102 hbase]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.
  • hadoop103
[hadoop@hadoop103 module]$ sudo useradd solr
[hadoop@hadoop103 module]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.
  • hadoop104
[hadoop@hadoop104 conf]$ sudo useradd solr
[hadoop@hadoop104 conf]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.

2.13.2 Extract and rename

# Extract
[hadoop@hadoop102 solr]$ tar -zxvf solr-7.7.3.tgz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv solr-7.7.3/ solr

2.13.3 Distribute

[hadoop@hadoop102 module]$ xsync solr/

2.13.4 Grant ownership to the solr user

[hadoop@hadoop102 module]$ sudo chown -R solr:solr solr/
[hadoop@hadoop103 module]$ sudo chown -R solr:solr solr/
[hadoop@hadoop104 module]$ sudo chown -R solr:solr solr/

2.13.5 Modify the configuration file

[hadoop@hadoop102 module]$ sudo vim solr/bin/solr.in.sh 
# Add the following content
ZK_HOST="hadoop102:2181,hadoop103:2181,hadoop104:2181"

(Make sure hadoop103 and hadoop104 get the same solr.in.sh change as well, since all three nodes join the same SolrCloud/ZooKeeper ensemble.)

2.13.6 Start the service

[hadoop@hadoop102 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.  
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] ***  Your Max Processes Limit is currently 4096. 
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [\]  
Started Solr server on port 8983 (pid=28181). Happy searching!
[hadoop@hadoop103 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.  
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] ***  Your Max Processes Limit is currently 4096. 
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.
Started Solr server on port 8983 (pid=32545). Happy searching!
[hadoop@hadoop104 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.  
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] ***  Your Max Processes Limit is currently 4096. 
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.
Started Solr server on port 8983 (pid=27686). Happy searching!

Note: the warnings above mean that Solr recommends allowing at most 65000 open files and 65000 processes, while the system defaults are lower than the recommended values. To change them, follow the steps below; the change only takes effect after logging in again (or rebooting). You can leave the defaults for now. A quick way to check the current limits is sketched after these steps.

(1) Raise the open-files limit

Edit /etc/security/limits.conf and add the following lines:

* soft nofile 65000
* hard nofile 65000

(2) Raise the process limit

Edit /etc/security/limits.d/20-nproc.conf and add:

* soft   nproc   65000
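
To see what the current shell session actually allows, you can check the limits directly (the defaults reported in the warnings above were 1024 and 4096):

[hadoop@hadoop102 module]$ ulimit -n
[hadoop@hadoop102 module]$ ulimit -u

After editing the limits files and logging in again, both values should report 65000.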

2.13.7 Access the web UI

http://hadoop103:8983/solr/#/

2.14 Installing Atlas

2.14.1 Extract, install, and rename

[hadoop@hadoop102 atlas]$ tar -zxvf apache-atlas-2.1.0-bin.tar.gz -C /opt/module/
[hadoop@hadoop102 module]$ mv apache-atlas-2.1.0/ atlas

2.14.2 Configuration

# Configure HBase as the graph storage backend
[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following content
atlas.graph.storage.hostname=hadoop102:2181,hadoop103:2181,hadoop104:2181

# Modify the environment file
[hadoop@hadoop102 atlas]$ vim conf/atlas-env.sh 
# Add the following content
export HBASE_CONF_DIR=/opt/module/hbase/conf

2.14.3 Configure Solr

[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following content
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=hadoop102:2181,hadoop103:2181,hadoop104:2181

# Create the Solr collections
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create  -c vertex_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'vertex_index' with 3 shard(s), 2 replica(s) with config-set 'vertex_index'
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create -c edge_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'edge_index' with 3 shard(s), 2 replica(s) with config-set 'edge_index'
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create -c fulltext_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'fulltext_index' with 3 shard(s), 2 replica(s) with config-set 'fulltext_index'

2.14.4 Configure Kafka

[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following content
atlas.notification.embedded=false
atlas.kafka.data=/opt/module/kafka/data
atlas.kafka.zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
atlas.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092

2.14.5 Configure Atlas

[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following content
#########  Server Properties  #########
atlas.rest.address=http://hadoop102:21000
# If enabled and set to true, this will run setup steps when the server starts
atlas.server.run.setup.on.start=false

#########  Entity Audit Configs  #########
atlas.audit.hbase.zookeeper.quorum=hadoop102:2181,hadoop103:2181,hadoop104:2181

# Configure logging
[hadoop@hadoop102 conf]$ vim atlas-log4j.xml

# Uncomment the following block
<appender name="perf_appender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="file" value="${atlas.log.dir}/atlas_perf.log" />
    <param name="datePattern" value="'.'yyyy-MM-dd" />
    <param name="append" value="true" />
    <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="%d|%t|%m%n" />
    </layout>
</appender>

<logger name="org.apache.atlas.perf" additivity="false">
    <level value="debug" />
    <appender-ref ref="perf_appender" />
</logger>

2.14.6 Kerberos-related configuration

If the Hadoop cluster has Kerberos authentication enabled, Atlas must authenticate with Kerberos before it can interact with the cluster. If Kerberos is not enabled, this section can be skipped.

1. Create a Kerberos principal for Atlas and generate its keytab file

[hadoop@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"addprinc -randkey atlas/hadoop102"
[hadoop@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"xst -k /etc/security/keytab/atlas.service.keytab atlas/hadoop102"

2. Add the following parameters to the /opt/module/atlas/conf/atlas-application.properties configuration file

atlas.authentication.method=kerberos
atlas.authentication.principal=atlas/[email protected]
atlas.authentication.keytab=/etc/security/keytab/atlas.service.keytab
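
As an optional sanity check (a sketch, assuming the keytab generated above and the EXAMPLE.COM realm used in the configuration), verify that the Atlas principal can obtain a ticket before starting Atlas:

[hadoop@hadoop102 ~]$ kinit -kt /etc/security/keytab/atlas.service.keytab atlas/[email protected]
[hadoop@hadoop102 ~]$ klist

klist should show a valid ticket for atlas/[email protected].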

2.15 Integrating Atlas with Hive

1. Install the Hive Hook

1) Extract the Hive Hook

[hadoop@hadoop102 ~]# tar -zxvf apache-atlas-2.1.0-hive-hook.tar.gz

2) Copy the Hive Hook dependencies into the Atlas installation directory

[hadoop@hadoop102 ~]# cp -r apache-atlas-hive-hook-2.1.0/* /opt/module/atlas/

3) Modify the /opt/module/hive/conf/hive-env.sh configuration file

Note: rename the template file first

[hadoop@hadoop102 ~]# mv hive-env.sh.template hive-env.sh
# Add the following parameter
export HIVE_AUX_JARS_PATH=/opt/module/atlas/hook/hive

2. Modify the Hive configuration: add the following property to /opt/module/hive/conf/hive-site.xml to register the Hive Hook.

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

3. Modify the following parameters in /opt/module/atlas/conf/atlas-application.properties

######### Hive Hook Configs #######
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary

4. Copy the Atlas configuration file /opt/module/atlas/conf/atlas-application.properties to the /opt/module/hive/conf directory

[hadoop@hadoop102 ~]# cp /opt/module/atlas/conf/atlas-application.properties  /opt/module/hive/conf/

5. Start the Atlas service

[hadoop@hadoop102 atlas]$ bin/atlas_start.py

2.16 Extended content

2.16.1 Compiling Atlas from source

Install Maven

1) Download Maven: https://maven.apache.org/download.cgi

2) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory on the server

3) Extract apache-maven-3.6.1-bin.tar.gz into the /opt/module/ directory

[root@hadoop102 software]# tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/module/

4) Rename apache-maven-3.6.1 to maven

[root@hadoop102 module]# mv apache-maven-3.6.1/ maven

5) Add the environment variables to /etc/profile

#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven
export PATH=$PATH:$MAVEN_HOME/bin

6) Test the installation

[root@hadoop102 module]# source /etc/profile
[root@hadoop102 module]# mvn -v

7) Modify settings.xml to use the Aliyun mirror

[root@hadoop102 module]# cd /opt/module/maven/conf/
[root@hadoop102 maven]# vim settings.xml


<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

<mirror>
  <id>UK</id>
  <name>UK Central</name>
  <url>http://uk.maven.org/maven2</url>
  <mirrorOf>central</mirrorOf>
</mirror>

<mirror>
  <id>repo1</id>
  <mirrorOf>central</mirrorOf>
  <name>Human Readable Name for this Mirror.</name>
  <url>http://repo1.maven.org/maven2/</url>
</mirror>

<mirror>
  <id>repo2</id>
  <mirrorOf>central</mirrorOf>
  <name>Human Readable Name for this Mirror.</name>
  <url>http://repo2.maven.org/maven2/</url>
</mirror>
Compile the Atlas source code

1) Upload apache-atlas-2.1.0-sources.tar.gz to the /opt/software directory on hadoop102

2) Extract apache-atlas-2.1.0-sources.tar.gz into the /opt/module/ directory

[root@hadoop102 software]# tar -zxvf apache-atlas-2.1.0-sources.tar.gz -C /opt/module/

3) Download the Atlas dependencies and build

[root@hadoop102 software]# export MAVEN_OPTS="-Xms2g -Xmx2g"
[root@hadoop102 software]# cd /opt/module/apache-atlas-sources-2.1.0/
[root@hadoop102 apache-atlas-sources-2.1.0]# mvn clean -DskipTests install
[root@hadoop102 apache-atlas-sources-2.1.0]# mvn clean -DskipTests package -Pdist

# Must be executed from ${atlas_home} (the root of the Atlas source tree)

[root@hadoop102 apache-atlas-sources-2.1.0]# cd distro/target/
[root@hadoop102 target]# mv apache-atlas-2.1.0-server.tar.gz /opt/software/
[root@hadoop102 target]# mv apache-atlas-2.1.0-hive-hook.tar.gz /opt/software/

Tip: the build takes a long time (roughly half an hour) and downloads many dependencies. If it fails, the cause is very likely a network timeout; simply retry.

Atlas memory configuration

If you plan to store tens of thousands of metadata objects, it is recommended to tune these values for optimal JVM GC performance. The following are common server-side options.

1) Modify the configuration file /opt/module/atlas/conf/atlas-env.sh

# Set the Atlas server JVM options
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"

# Recommended settings for JDK 1.7
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=3072m -XX:PermSize=100M -XX:MaxPermSize=512m"

# Recommended settings for JDK 1.8
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"

# Required configuration for macOS users
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Parameter note: -XX:SoftRefLRUPolicyMSPerMB is particularly useful for GC performance under query-heavy workloads with many concurrent users.

Configure username and password

Atlas supports the following authentication methods: File, Kerberos, and LDAP.

The three methods are enabled or disabled in the atlas-application.properties configuration file:

atlas.authentication.method.kerberos=true|false
atlas.authentication.method.ldap=true|false
atlas.authentication.method.file=true|false

If two or more authentication methods are set to true and an earlier method fails, authentication falls back to the next one. For example, if Kerberos authentication is set to true and LDAP authentication is also set to true, then for a request without a Kerberos principal and keytab, LDAP authentication is used as the fallback.

This section only covers changing the username and password with the file-based method; for the other methods, refer to the official documentation.

1) Open the /opt/module/atlas/conf/users-credentials.properties file

[hadoop@hadoop102 conf]$ vim users-credentials.properties

#username=group::sha256-password
admin=ADMIN::8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918
rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d

(1) admin is the username.

(2) 8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918 is the sha256-hashed password; the default password is admin.

2) Example: change the username to atguigu and the password to atguigu

(1) Compute the sha256 hash of the password atguigu

[hadoop@hadoop102 conf]$ echo -n "atguigu"|sha256sum
2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2

(2) Update the username and password

[hadoop@hadoop102 conf]$ vim users-credentials.properties

#username=group::sha256-password
atguigu=ADMIN::2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2
rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d

2.14 Installing Kylin

2.14.1 Extract and install

[hadoop@hadoop102 kylin]$ tar -zxvf apache-kylin-3.0.2-bin.tar.gz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv apache-kylin-3.0.2-bin/ kylin

2.14.2 Modify the compatibility configuration

# Edit the script; the extra filters go into the find command around line 43
[hadoop@hadoop102 kylin]$ vim /opt/module/kylin/bin/find-spark-dependency.sh
# Filters to add
! -name '*jackson*' ! -name '*metastore*'
# The line after the change
spark_dependency=`find -L $spark_home/jars -name '*.jar' ! -name '*slf4j*' ! -name '*jackson*' ! -name '*metastore*' ! -name '*calcite*' ! -name '*doc*' ! -name '*test*' ! -name '*sources*' ''-printf '%p:' | sed 's/:$//'`
if [ -z "$spark_dependency" ]

2.14.3 Start

Hadoop, HBase, and ZooKeeper must already be running.

[hadoop@hadoop102 bin]$ ./kylin.sh start

2.14.4 View the web management page

http://hadoop102:7070/kylin
