I. Pre-deployment Preparation
1. Servers
A test environment can use three servers to build the cluster:
Hostname | IP Address | OS |
---|---|---|
data1 | 192.168.66.152 | CentOS 7 |
data2 | 192.168.66.153 | CentOS 7 |
data3 | 192.168.66.154 | CentOS 7 |
2. Component Versions and Downloads
Component | Version | Download URL |
---|---|---|
hadoop | hadoop-2.6.0-cdh5.15.0 | https://archive.cloudera.com/... |
hive | hive-1.1.0-cdh5.15.0 | https://archive.cloudera.com/... |
zookeeper | zookeeper-3.4.5-cdh5.15.0 | https://archive.cloudera.com/... |
hbase | hbase-1.2.0-cdh5.15.0 | https://archive.cloudera.com/... |
kafka | kafka_2.12-0.11.0.3 | http://kafka.apache.org/downl... |
flink | flink-1.10.1-bin-scala_2.12 | https://flink.apache.org/down... |
jdk | jdk-8u251-linux-x64 | https://www.oracle.com/java/t... |
3. Cluster Node Planning
Host | Services |
---|---|
data1 | NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka |
data2 | NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka |
data3 | DataNode, NodeManager, HRegionServer, JournalNode, QuorumPeerMain, Kafka |
II. Deployment
1. Change the hostnames
The default hostname on all three servers is localhost. To make it easy to communicate by hostname later, change the hostname of each of the three servers.
Log in to each server and edit its /etc/hostname file, naming the three machines data1, data2, and data3 respectively.
Then edit /etc/hosts on all three machines to add the hostname-to-IP mappings of all three nodes:
Append the following mappings to the end of the /etc/hosts file.
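These entries follow directly from the IP table in Section I:
192.168.66.152 data1
192.168.66.153 data2
192.168.66.154 data3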
The hostname change only takes effect after the machines are rebooted.
2. Add the hadoop user and group
On all three servers, create a dedicated group and user named hadoop for operating the Hadoop cluster.
# Add the hadoop group
sudo groupadd hadoop
# Add the hadoop user and put it in the hadoop group
sudo useradd -g hadoop hadoop
# Set a password for the hadoop user
sudo passwd hadoop
# Grant the hadoop user sudo privileges by editing /etc/sudoers
sudo vi /etc/sudoers
# Add a line below "root ALL=(ALL) ALL"
hadoop ALL=(ALL) ALL
# Switch to the newly created hadoop user; all subsequent installation steps are performed as this user
su hadoop
3. Passwordless SSH
During the Hadoop cluster installation, configured packages are copied to the other machines several times. To avoid typing a password for every ssh/scp, configure passwordless SSH login.
# On data1, generate a public/private key pair with ssh-keygen
# -t selects the RSA algorithm
# -P sets the passphrase; -P '' means an empty passphrase (without -P you would have to press Enter at three prompts; with -P it is a single step)
# -f specifies the file the key is written to
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Enter the .ssh directory; it now contains id_rsa (private key) and id_rsa.pub (public key)
cd ~/.ssh
# Append the public key to an authorized_keys file
cat id_rsa.pub >> authorized_keys
# Copy the generated authorized_keys file to data2 and data3
scp ~/.ssh/authorized_keys hadoop@data2:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@data3:~/.ssh/authorized_keys
# Set the permissions of authorized_keys to 600
chmod 600 ~/.ssh/authorized_keys
# Verify that passwordless SSH works
# If ssh data2 or ssh data3 from data1 no longer prompts for a password, the setup succeeded
ssh data2
ssh data3
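Note that the steps above only enable passwordless login from data1 to the other nodes, and the scp commands assume ~/.ssh already exists on data2 and data3. If passwordless login is also wanted from data2 and data3, one option (an addition to this guide, not a required step) is to generate a key on each node and push it with ssh-copy-id:
# Run on data2 and data3 as the hadoop user
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id hadoop@data1
ssh-copy-id hadoop@data2
ssh-copy-id hadoop@data3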
4. Disable the firewall
Since the Hadoop cluster is deployed on an internal network, it is recommended to disable the firewall beforehand to avoid odd problems during deployment.
# Check the firewall status (systemctl status firewalld also works)
firewall-cmd --state
# Stop the firewall for now (it will start again after a reboot)
sudo systemctl stop firewalld
# Prevent the firewall from starting at boot
sudo systemctl disable firewalld
5. Time synchronization
In a cluster, some services require the servers' clocks to be synchronized. HBase in particular will fail to start if the clocks of the three machines drift too far apart, so time synchronization must be configured beforehand. The common options are ntp and chrony (chrony is recommended). On CentOS 7 chrony is installed by default, so only configuration is needed.
5.1 Chrony server configuration
We use data1 as the chrony server and the other two machines (data2, data3) as chrony clients, i.e. data2 and data3 will synchronize their time from data1.
# Log in to data1
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf
# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
# Add one line with our own time source
# This IP is data1's own address, meaning data1 serves its own clock as the reference (use this when there is no internet access)
# Alternatively, with internet access, the Aliyun NTP servers can be used instead:
# server ntp1.aliyun.com iburst
# server ntp2.aliyun.com iburst
# server ntp3.aliyun.com iburst
# server ntp4.aliyun.com iburst
server 192.168.66.152 iburst
# Allow clients from this subnet to synchronize with this server
allow 192.168.66.0/24
# Set the stratum at which the local clock is served
local stratum 10
# Restart the chrony service
sudo systemctl restart chronyd.service
# Enable chrony at boot
sudo systemctl enable chronyd.service
# Check the chrony service status
systemctl status chronyd.service
5.2 Chrony client configuration
Perform the following on data2 and data3:
# Log in to data2 and data3
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf
# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
# Add one line pointing at our own time server
# This is data1's IP, meaning these machines synchronize their time from data1
server 192.168.66.152 iburst
# Restart the chrony service
sudo systemctl restart chronyd.service
# Enable chrony at boot
sudo systemctl enable chronyd.service
# Check the chrony service status
systemctl status chronyd.service
5.3 Verify synchronization
# Use timedatectl to check whether synchronization succeeded; run it on data1, data2, and data3
timedatectl
# The command returns output like:
Local time: Wed 2020-06-17 18:46:41 CST
Universal time: Wed 2020-06-17 10:46:41 UTC
RTC time: Wed 2020-06-17 10:46:40
Time zone: Asia/Shanghai (CST, +0800)
NTP enabled: yes
NTP synchronized: yes (this line reads yes once synchronization has succeeded)
RTC in local TZ: no
DST active: n/a
# If NTP synchronized shows no, synchronization failed; check the configuration
# If the configuration is correct and it still shows no, try:
sudo timedatectl set-local-rtc 0
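The chronyc client that ships with chrony can also show which time source is in use; a quick additional check:
# On data2/data3, data1 (192.168.66.152) should be listed, and the '*' marker
# indicates the source the machine is currently synchronized to
chronyc sources -v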
6. Install the JDK
Install JDK 8 on all three machines and configure the environment variables. After editing the profile file, remember to source it. A minimal sketch is shown below.
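The sketch assumes the JDK .tar.gz archive from the download table is extracted to /usr/local, which matches the JAVA_HOME used later in hadoop-env.sh:
# Extract the JDK (run on each machine)
sudo tar -zxvf jdk-8u251-linux-x64.tar.gz -C /usr/local
# Append to /etc/profile
export JAVA_HOME=/usr/local/jdk1.8.0_251
export PATH=$JAVA_HOME/bin:$PATH
# Apply the change and verify
source /etc/profile
java -version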
7. Deploy ZooKeeper
# Upload the ZooKeeper package to data1
# Extract the ZooKeeper tarball and move it to the target directory
tar -zxvf zookeeper-3.4.5-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/zookeeper-3.4.5-cdh5.15.0 /usr/local/zookeeper
# Go to ZooKeeper's conf directory to modify the configuration
cd /usr/local/zookeeper/conf
# Copy the default zoo_sample.cfg to zoo.cfg
cp zoo_sample.cfg zoo.cfg
# Edit the zoo.cfg configuration file
vi zoo.cfg
# Change the dataDir=/tmp/zookeeper parameter
dataDir=/usr/local/zookeeper/data
# Add the ZooKeeper ensemble configuration at the end of zoo.cfg
# The template is server.X=A:B:C, where X is a number identifying the server (the same value as myid)
# A is the server's IP address or hostname
# B is the port used to exchange messages with the ensemble leader
# C is the port used for leader election
server.1=data1:2888:3888
server.2=data2:2888:3888
server.3=data3:2888:3888
# The remaining parameters in zoo.cfg can be left at their defaults
# Create the data directory specified by dataDir above
mkdir /usr/local/zookeeper/data
# Enter that data directory and create a myid file containing a numeric id
# The id uniquely identifies this server and must be unique across the whole ensemble
# ZooKeeper uses this id to pick its server.X entry; e.g. an id of 1 maps to the server.1 line in zoo.cfg
cd /usr/local/zookeeper/data
touch myid
echo 1 > myid
# The ZooKeeper setup on data1 is now complete
# Next, distribute the ZooKeeper directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/zookeeper hadoop@data2:/usr/local/zookeeper
scp -rp /usr/local/zookeeper hadoop@data3:/usr/local/zookeeper
# On data2 and data3, edit /usr/local/zookeeper/data/myid
# Change the myid content to 2 on data2
# Change the myid content to 3 on data3
vi /usr/local/zookeeper/data/myid
# Configure the ZooKeeper environment variables on all three machines
# This avoids having to cd into ZooKeeper's bin directory for every zk command
sudo vi /etc/profile
# Add the following two lines at the end of the file
export ZK_HOME=/usr/local/zookeeper
export PATH=$ZK_HOME/bin:$PATH
# Remember to source the file after editing
source /etc/profile
# All configuration is now done; start the zk service on each of the three machines
zkServer.sh start
# After starting, check the current zk status
zkServer.sh status
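If the ensemble formed correctly, zkServer.sh status reports "Mode: leader" on one node and "Mode: follower" on the other two, and each node runs a QuorumPeerMain process:
# Run on each node; a QuorumPeerMain process should be listed
jps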
8. Deploy Hadoop
8.1 Install Hadoop
# Upload the Hadoop package to data1
# Extract the Hadoop tarball and move it to the target directory
tar -zxvf hadoop-2.6.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hadoop-2.6.0-cdh5.15.0 /usr/local/hadoop
# Configure the Hadoop environment variables
sudo vi /etc/profile
# Add at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Remember to source the file after editing
source /etc/profile
8.2 Edit hadoop-env.sh
# Go to Hadoop's configuration directory
cd /usr/local/hadoop/etc/hadoop
# Edit hadoop-env.sh
vi hadoop-env.sh
# Change export JAVA_HOME=${JAVA_HOME} to the JDK installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251
8.3 Edit core-site.xml
# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdhbds</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadooptmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>io.native.lib.available</name>
  <value>true</value>
  <description>Should native hadoop libraries, if present, be used.</description>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>A comma-separated list of the compression codec classes that can be used for compression/decompression. In addition to any classes specified with this property (which take precedence), codec classes on the classpath are discovered using a Java ServiceLoader.</description>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval. If zero, the value is set to the value of fs.trash.interval.</description>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>data1:2181,data2:2181,data3:2181</value>
  <description>The three ZooKeeper nodes.</description>
</property>
8.4 Edit hdfs-site.xml
# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
  <name>dfs.nameservices</name>
  <value>cdhbds</value>
  <description>Comma-separated list of nameservices.</description>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>The datanode server address and port for data transfer. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>52428800</value>
</property>
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>250</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50075</value>
  <description>The datanode http server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50020</value>
  <description>The datanode ipc server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.ha.namenodes.cdhbds</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cdhbds.nn1</name>
  <value>data1:8020</value>
  <description>RPC address of NameNode nn1.</description>
</property>
<property>
  <name>dfs.namenode.rpc-address.cdhbds.nn2</name>
  <value>data2:8020</value>
  <description>RPC address of NameNode nn2.</description>
</property>
<property>
  <name>dfs.namenode.http-address.cdhbds.nn1</name>
  <value>data1:50070</value>
  <description>HTTP address of NameNode nn1.</description>
</property>
<property>
  <name>dfs.namenode.http-address.cdhbds.nn2</name>
  <value>data2:50070</value>
  <description>HTTP address of NameNode nn2.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/namenode</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/data/checkpoint</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/datanode</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
  <description>Boolean which enables backend datanode-side support for the experimental DistributedFileSystem#getFileVBlockStorageLocations API.</description>
</property>
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
  <description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://data1:8485;data2:8485;data3:8485/cdhbds</value>
  <description>Three JournalNodes store the NameNode edit log; these are their hosts and ports.</description>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/journaldata/</value>
  <description>Storage path for the JournalNode data.</description>
</property>
<property>
  <name>dfs.journalnode.rpc-address</name>
  <value>0.0.0.0:8485</value>
</property>
<property>
  <name>dfs.journalnode.http-address</name>
  <value>0.0.0.0:8480</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cdhbds</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Class used to determine which NameNode is currently active.</description>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>10000</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
  <description>Whether automatic failover is enabled. See the HDFS High Availability documentation for details on automatic HA configuration.</description>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>20</value>
  <description>The number of server threads for the namenode.</description>
</property>
8.5 Edit mapred-site.xml
# Copy mapred-site.xml.template to mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
# Edit mapred-site.xml and add the following inside the <configuration> element
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.shuffle.port</name>
  <value>8350</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>0.0.0.0:10121</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19868</value>
</property>
<property>
  <name>mapreduce.jobtracker.http.address</name>
  <value>0.0.0.0:50330</value>
</property>
<property>
  <name>mapreduce.tasktracker.http.address</name>
  <value>0.0.0.0:50360</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.job.counters.max</name>
  <value>560</value>
  <description>Limit on the number of counters allowed per job.</description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>300</value>
</property>
8.6 Edit yarn-env.sh
# Edit the yarn-env.sh file
vi yarn-env.sh
# Change export JAVA_HOME=${JAVA_HOME} to the JDK installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251
8.7 Edit yarn-site.xml
# Add the following inside the <configuration> element
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>2000</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-rm-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm1</value>
  <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>data1:2181,data2:2181,data3:2181</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
  <value>5000</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>data1:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm1</name>
  <value>data1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>data1:50030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
  <value>data1:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm1</name>
  <value>data1:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.admin.address.rm1</name>
  <value>data1:8034</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>data2:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm2</name>
  <value>data2:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>data2:50030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
  <value>data2:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm2</name>
  <value>data2:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.admin.address.rm2</name>
  <value>data2:8034</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.address</name>
  <value>0.0.0.0:23344</value>
  <description>Address where the localizer IPC is.</description>
</property>
<property>
  <name>yarn.nodemanager.webapp.address</name>
  <value>0.0.0.0:23999</value>
  <description>NM Webapp address.</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>112640</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>31</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/yarn/logs</value>
</property>
8.8 Create fair-scheduler.xml
# Create a fair-scheduler.xml file in the same configuration directory (it is referenced by
# the yarn.scheduler.fair.allocation.file property above). The queue settings used here are:
#   minimum resources: 10240 mb, 10 vcores
#   maximum resources: 51200 mb, 18 vcores
#   scheduling policy: fair
#   remaining queue limits: 5 and 30
# A sketch of the full file is shown below.
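A minimal sketch of such an allocation file, assuming a single queue named default; the element names for the limits 5 and 30 are not recoverable from the values alone, so maxRunningApps and weight below are assumptions:
<?xml version="1.0"?>
<allocations>
  <queue name="default">
    <minResources>10240 mb, 10 vcores</minResources>
    <maxResources>51200 mb, 18 vcores</maxResources>
    <schedulingPolicy>fair</schedulingPolicy>
    <maxRunningApps>5</maxRunningApps>
    <weight>30</weight>
  </queue>
</allocations>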
8.9 Edit the slaves file
# Edit the slaves file
vi slaves
# Replace localhost with the following three lines; these machines become the worker nodes of the Hadoop cluster,
# i.e. the DataNode and NodeManager services will be started on them
data1
data2
data3
8.10 Distribute the Hadoop package
# Distribute the configured Hadoop directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hadoop hadoop@data2:/usr/local/hadoop
scp -rp /usr/local/hadoop hadoop@data3:/usr/local/hadoop
# After distributing, one setting in yarn-site.xml on data2 must be changed
# data1 and data2 are the two ResourceManager HA nodes
# Change the following property from rm1 to rm2, otherwise starting ResourceManager on data2
# will fail with an error that the data1 address/port is already in use
<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm2</value>
  <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
</property>
# Also configure the Hadoop environment variables on data2 and data3
sudo vi /etc/profile
# Add at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Remember to source the file after editing
source /etc/profile
8.11 Initialize and start the cluster
# First make sure the ZooKeeper ensemble started earlier is running
# Start the JournalNode service on each of the three machines
hadoop-daemon.sh start journalnode
# Format the NameNode on data1
hadoop namenode -format
# Copy the NameNode metadata directory from data1 to data2 so that both NameNodes start from the same metadata
# (this is the /data/namenode directory configured for dfs.namenode.name.dir in hdfs-site.xml)
scp -rp /data/namenode hadoop@data2:/data/namenode
# Format ZKFC on data1
hdfs zkfc -formatZK
# Start the HDFS distributed storage system from data1
start-dfs.sh
# Start the YARN cluster from data1
# This starts the ResourceManager service on data1 and the NodeManager service on data1, data2, and data3
start-yarn.sh
# Start the ResourceManager service on data2
yarn-daemon.sh start resourcemanager
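At this point, running jps on each node should roughly match the planning table in Section I (HBase and Kafka are not deployed yet):
# Expected processes per node:
#   data1, data2: NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController
#   data3:        DataNode, NodeManager, JournalNode, QuorumPeerMain
jps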
8.12 Verify the Hadoop cluster
(1) Open the HDFS web UI
From a browser on a Windows machine, visit http://data1:50070 to see basic information about the HDFS cluster.
(2) Open the YARN web UI
From a browser on a Windows machine, visit http://data1:50030 to see basic information about the YARN cluster's resources.
9. Deploy Hive
Because the Hive metadata is stored in MySQL, a MySQL database must be installed beforehand.
(1) Install the Hive package
# Upload the Hive package to data1, extract it, and move it to the target directory
tar -zxvf hive-1.1.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hive-1.1.0-cdh5.15.0 /usr/local/hive
# Configure the Hive environment variables
# Add the Hive PATH settings to /etc/profile
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
# Source /etc/profile after changing the environment variables
source /etc/profile
(2) Copy the MySQL driver
Copy the MySQL JDBC driver jar into Hive's lib directory, i.e. /usr/local/hive/lib.
(3) Edit hive-env.sh
# In Hive's conf directory, copy hive-env.sh.template to hive-env.sh
cp hive-env.sh.template hive-env.sh
# Set the following two parameters in hive-env.sh
HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
(4) Edit hive-site.xml
# Edit hive-site.xml and add the following properties
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.66.240:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description>This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
  <description>Creates the necessary schema on startup if one doesn't exist. Set this to false after creating it once.</description>
</property>
<property>
  <name>hive.mapjoin.check.memory.rows</name>
  <value>100000</value>
  <description>The number means after how many rows processed it needs to check the memory usage</description>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).</description>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask.size</name>
  <value>10000000</value>
  <description>If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). The default is 10MB.</description>
</property>
<property>
  <name>hive.auto.convert.join.use.nonstaged</name>
  <value>false</value>
  <description>For conditional joins, if input stream from a small alias can be directly applied to join operator without filtering or projection, the alias need not be pre-staged in distributed cache via mapred local task. Currently, this is not working with vectorization or tez execution engine.</description>
</property>
<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
  <description>The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run. They include: Cartesian Product, no partition being picked up for a query, comparing bigints and strings, comparing bigints and doubles, order by without limit.</description>
</property>
<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
<property>
  <name>hive.exec.parallel.thread.number</name>
  <value>8</value>
  <description>How many jobs at most can be executed in parallel</description>
</property>
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
  <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://data1:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <value>false</value>
  <description>Enable user impersonation for HiveServer2</description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
</property>
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>256000000</value>
</property>
<property>
  <name>hive.server2.logging.operation.enabled</name>
  <value>true</value>
</property>
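Since hive.metastore.uris points at thrift://data1:9083, the metastore service must be running before clients can connect, and HiveServer2 is usually started alongside it. A minimal sketch, assuming the MySQL settings above are in place (schematool and the service launcher ship with this Hive version):
# Initialize the metastore schema in MySQL (run once, on data1)
schematool -dbType mysql -initSchema
# Start the metastore and HiveServer2 in the background
nohup hive --service metastore > /tmp/hive-metastore.log 2>&1 &
nohup hive --service hiveserver2 > /tmp/hiveserver2.log 2>&1 &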
10. Deploy HBase
Upload and extract the HBase package to /usr/local/hbase on data1 (in the same way as the components above), then edit hbase-site.xml under /usr/local/hbase/conf and add the following properties:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://cdhbds/hbase</value>
</property>
<property>
  <name>dfs.nameservices</name>
  <value>cdhbds</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>/data/hbase/tmp</value>
</property>
<property>
  <name>hbase.master.port</name>
  <value>16000</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>data1,data2,data3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
(4) Copy the HDFS configuration files
Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory into HBase's conf directory.
(5) Configure the regionservers file
# List the HRegionServer hostnames: edit the regionservers file and add the following
data1
data2
data3
(6) Configure HMaster high availability
# In HBase's conf directory, create a backup-masters file and add the hostname of the standby HMaster node
data2
(7) Distribute the HBase package
# Distribute the configured HBase directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hbase hadoop@data2:/usr/local/hbase
scp -rp /usr/local/hbase hadoop@data3:/usr/local/hbase
(8) Start the HBase cluster
# Run the HBase cluster start command on data1
start-hbase.sh
# After startup, run jps on each of the three machines to see the HBase processes
# data1: HMaster and HRegionServer
# data2: HMaster and HRegionServer
# data3: HRegionServer
jps
(9) Verify HBase
# Run the hbase shell command on data1 to enter HBase's command-line client
hbase shell
The HBase web UI is also available at http://data1:60010.
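A quick smoke test inside the shell (the table name below is just an example):
# Run inside hbase shell
list
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:c1', 'v1'
scan 'smoke_test'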
11. Deploy Kafka
(1) Install the Kafka package
# Upload the Kafka package to data1, extract it, and move it to the target directory
tar -zxvf kafka_2.12-0.11.0.3.tgz -C /usr/local/
mv /usr/local/kafka_2.12-0.11.0.3 /usr/local/kafka
(2) Edit server.properties
# Edit the server.properties file and change the following settings
# Unique id of this broker
broker.id=0
# Set to the hostname of the current machine
listeners=PLAINTEXT://data1:9092
# Kafka log directory (this is also where the message data is stored)
log.dirs=/data/kafka-logs
# ZooKeeper connection string
zookeeper.connect=data1:2181,data2:2181,data3:2181
(3) Distribute the Kafka package
# Distribute the Kafka directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/kafka hadoop@data2:/usr/local/kafka
scp -rp /usr/local/kafka hadoop@data3:/usr/local/kafka
# On data2, change the following two settings in server.properties
# Unique id of this broker
broker.id=1
# Set to the hostname of the current machine
listeners=PLAINTEXT://data2:9092
# Likewise, change the corresponding settings on data3
# Unique id of this broker
broker.id=2
# Set to the hostname of the current machine
listeners=PLAINTEXT://data3:9092
(4) Start the Kafka cluster
# Start Kafka on each of the three machines
# Go to Kafka's bin directory and run the start command
cd /usr/local/kafka/bin
./kafka-server-start.sh -daemon ../config/server.properties
# The -daemon flag starts Kafka in the background
(5) Verify Kafka
# Run jps on each of the three machines to check that the Kafka process started
jps
# Create a topic
bin/kafka-topics.sh --create --zookeeper data1:2181,data2:2181,data3:2181 --replication-factor 3 --partitions 1 --topic test
# Start a console producer
bin/kafka-console-producer.sh --broker-list data1:9092,data2:9092,data3:9092 --topic test
# Start a console consumer
bin/kafka-console-consumer.sh --bootstrap-server data1:9092,data2:9092,data3:9092 --from-beginning --topic test
12. Deploy Flink on YARN
This Flink deployment uses the on-YARN mode with high availability (HA).
(1) Install the Flink package
# Upload the Flink package to data1, extract it, and move it to the target directory
tar -zxvf flink-1.10.1-bin-scala_2.12.tgz -C /usr/local/
mv /usr/local/flink-1.10.1 /usr/local/flink
(2) Edit flink-conf.yaml
# Go to Flink's conf directory and edit flink-conf.yaml
vi flink-conf.yaml
# Change the following settings
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://cdhbds/flink/ha/
high-availability.zookeeper.quorum: data1:2181,data2:2181,data3:2181
high-availability.zookeeper.path.root: /flink
state.backend: filesystem
state.checkpoints.dir: hdfs://cdhbds/flink/flink-checkpoints
state.savepoints.dir: hdfs://cdhbds/flink/flink-checkpoints
jobmanager.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/
historyserver.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/
yarn.application-attempts: 10
(3) Adjust the logging configuration
Flink's conf directory contains both log4j and logback configuration files, which produces a warning when the cluster starts:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/root/flink-1.7.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
So one of the log configuration files needs to go; renaming log4j.properties to log4j.properties.bak is enough.
(4) Configure the Hadoop classpath
# This Flink release does not bundle Hadoop integration; the official documentation says it must be set up separately.
# The documentation offers two options: either export the hadoop classpath,
# or copy flink-shaded-hadoop-2-uber-xx.jar into Flink's lib directory.
# Here we use the first option and configure the hadoop classpath.
# Edit /etc/profile and add the following line
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
# Source /etc/profile after changing the environment variables
source /etc/profile
(5) Start the Flink cluster in yarn-session mode
# Start the cluster in yarn-session mode: go to the bin directory and run
./yarn-session.sh -s 4 -jm 1024m -tm 4096m -nm flink-test -d
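Once the session is up, a bundled example job can be submitted to confirm the setup; the WordCount jar ships with the Flink distribution, and the paths below assume the install directory used above:
# Submit the streaming WordCount example to the running yarn-session
/usr/local/flink/bin/flink run /usr/local/flink/examples/streaming/WordCount.jar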
(6) Start the HistoryServer
# Command to start the HistoryServer
./historyserver.sh start