I. Pre-deployment Preparation
1. Servers
A test environment can use three servers to build the cluster:
Hostname | IP Address | OS |
---|---|---|
data1 | 192.168.66.152 | CentOS 7 |
data2 | 192.168.66.153 | CentOS 7 |
data3 | 192.168.66.154 | CentOS 7 |
2. Component Versions and Downloads
Component | Version | Download URL |
---|---|---|
hadoop | hadoop-2.6.0-cdh5.15.0 | https://archive.cloudera.com/... |
hive | hive-1.1.0-cdh5.15.0 | https://archive.cloudera.com/... |
zookeeper | zookeeper-3.4.5-cdh5.15.0 | https://archive.cloudera.com/... |
hbase | hbase-1.2.0-cdh5.15.0 | https://archive.cloudera.com/... |
kafka | kafka_2.12-0.11.0.3 | http://kafka.apache.org/downl... |
flink | flink-1.10.1-bin-scala_2.12 | https://flink.apache.org/down... |
jdk | jdk-8u251-linux-x64 | https://www.oracle.com/java/t... |
3. Cluster Node Planning
Host | Services |
---|---|
data1 | NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka |
data2 | NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka |
data3 | DataNode, NodeManager, HRegionServer, JournalNode, QuorumPeerMain, Kafka |
II. Deployment
1. Change the hostnames
The default hostname on all three servers is localhost. To make it easy to communicate by hostname later, change the hostname of each of the three servers.
Log in to each server and edit its /etc/hostname file, naming the three machines data1, data2, and data3 respectively.
Then edit /etc/hosts on all three machines to add the hostname-to-IP mappings of all three nodes:
Append the following mappings to the end of the /etc/hosts file.
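These entries follow directly from the IP table in Section I:
192.168.66.152 data1
192.168.66.153 data2
192.168.66.154 data3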
The hostname change only takes effect after the machines are rebooted.
2. Add the hadoop user and group
On all three servers, create a dedicated group and user named hadoop for operating the Hadoop cluster.
# Add the hadoop group
sudo groupadd hadoop
# Add the hadoop user and put it in the hadoop group
sudo useradd -g hadoop hadoop
# Set a password for the hadoop user
sudo passwd hadoop
# Grant the hadoop user sudo privileges by editing /etc/sudoers
sudo vi /etc/sudoers
# Add a line below "root ALL=(ALL) ALL"
hadoop ALL=(ALL) ALL
# Switch to the newly created hadoop user; all subsequent installation steps are performed as this user
su hadoop
3. Passwordless SSH
During the Hadoop cluster installation, configured packages are copied to the other machines several times. To avoid typing a password for every ssh/scp, configure passwordless SSH login.
# On data1, generate a public/private key pair with ssh-keygen
# -t selects the RSA algorithm
# -P sets the passphrase; -P '' means an empty passphrase (without -P you would have to press Enter at three prompts; with -P it is a single step)
# -f specifies the file the key is written to
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Enter the .ssh directory; it now contains id_rsa (private key) and id_rsa.pub (public key)
cd ~/.ssh
# Append the public key to an authorized_keys file
cat id_rsa.pub >> authorized_keys
# Copy the generated authorized_keys file to data2 and data3
scp ~/.ssh/authorized_keys hadoop@data2:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@data3:~/.ssh/authorized_keys
# Set the permissions of authorized_keys to 600
chmod 600 ~/.ssh/authorized_keys
# Verify that passwordless SSH works
# If ssh data2 or ssh data3 from data1 no longer prompts for a password, the setup succeeded
ssh data2
ssh data3
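Note that the steps above only enable passwordless login from data1 to the other nodes, and the scp commands assume ~/.ssh already exists on data2 and data3. If passwordless login is also wanted from data2 and data3, one option (an addition to this guide, not a required step) is to generate a key on each node and push it with ssh-copy-id:
# Run on data2 and data3 as the hadoop user
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id hadoop@data1
ssh-copy-id hadoop@data2
ssh-copy-id hadoop@data3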
4. Disable the firewall
Since the Hadoop cluster is deployed on an internal network, it is recommended to disable the firewall beforehand to avoid odd problems during deployment.
# Check the firewall status (systemctl status firewalld also works)
firewall-cmd --state
# Stop the firewall for now (it will start again after a reboot)
sudo systemctl stop firewalld
# Prevent the firewall from starting at boot
sudo systemctl disable firewalld
5. Time synchronization
In a cluster, some services require the servers' clocks to be synchronized. HBase in particular will fail to start if the clocks of the three machines drift too far apart, so time synchronization must be configured beforehand. The common options are ntp and chrony (chrony is recommended). On CentOS 7 chrony is installed by default, so only configuration is needed.
5.1 Chrony server configuration
We use data1 as the chrony server and the other two machines (data2, data3) as chrony clients, i.e. data2 and data3 will synchronize their time from data1.
# Log in to data1
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf
# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
# Add one line with our own time source
# This IP is data1's own address, meaning data1 serves its own clock as the reference (use this when there is no internet access)
# Alternatively, with internet access, the Aliyun NTP servers can be used instead:
# server ntp1.aliyun.com iburst
# server ntp2.aliyun.com iburst
# server ntp3.aliyun.com iburst
# server ntp4.aliyun.com iburst
server 192.168.66.152 iburst
# Allow clients from this subnet to synchronize with this server
allow 192.168.66.0/24
# Set the stratum at which the local clock is served
local stratum 10
# Restart the chrony service
sudo systemctl restart chronyd.service
# Enable chrony at boot
sudo systemctl enable chronyd.service
# Check the chrony service status
systemctl status chronyd.service
5.2 Chrony client configuration
Perform the following on data2 and data3:
# Log in to data2 and data3
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf
# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
# Add one line pointing at our own time server
# This is data1's IP, meaning these machines synchronize their time from data1
server 192.168.66.152 iburst
# Restart the chrony service
sudo systemctl restart chronyd.service
# Enable chrony at boot
sudo systemctl enable chronyd.service
# Check the chrony service status
systemctl status chronyd.service
5.3 Verify synchronization
# Use timedatectl to check whether synchronization succeeded; run it on data1, data2, and data3
timedatectl
# The command returns output like:
Local time: Wed 2020-06-17 18:46:41 CST
Universal time: Wed 2020-06-17 10:46:41 UTC
RTC time: Wed 2020-06-17 10:46:40
Time zone: Asia/Shanghai (CST, +0800)
NTP enabled: yes
NTP synchronized: yes (this line reads yes once synchronization has succeeded)
RTC in local TZ: no
DST active: n/a
# If NTP synchronized shows no, synchronization failed; check the configuration
# If the configuration is correct and it still shows no, try:
sudo timedatectl set-local-rtc 0
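The chronyc client that ships with chrony can also show which time source is in use; a quick additional check:
# On data2/data3, data1 (192.168.66.152) should be listed, and the '*' marker
# indicates the source the machine is currently synchronized to
chronyc sources -v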
6. Install the JDK
Install JDK 8 on all three machines and configure the environment variables. After editing the profile file, remember to source it. A minimal sketch is shown below.
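The sketch assumes the JDK .tar.gz archive from the download table is extracted to /usr/local, which matches the JAVA_HOME used later in hadoop-env.sh:
# Extract the JDK (run on each machine)
sudo tar -zxvf jdk-8u251-linux-x64.tar.gz -C /usr/local
# Append to /etc/profile
export JAVA_HOME=/usr/local/jdk1.8.0_251
export PATH=$JAVA_HOME/bin:$PATH
# Apply the change and verify
source /etc/profile
java -version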
7. Deploy ZooKeeper
# Upload the ZooKeeper package to data1
# Extract the ZooKeeper tarball and move it to the target directory
tar -zxvf zookeeper-3.4.5-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/zookeeper-3.4.5-cdh5.15.0 /usr/local/zookeeper
# Go to ZooKeeper's conf directory to modify the configuration
cd /usr/local/zookeeper/conf
# Copy the default zoo_sample.cfg to zoo.cfg
cp zoo_sample.cfg zoo.cfg
# Edit the zoo.cfg configuration file
vi zoo.cfg
# Change the dataDir=/tmp/zookeeper parameter
dataDir=/usr/local/zookeeper/data
# Add the ZooKeeper ensemble configuration at the end of zoo.cfg
# The template is server.X=A:B:C, where X is a number identifying the server (the same value as myid)
# A is the server's IP address or hostname
# B is the port used to exchange messages with the ensemble leader
# C is the port used for leader election
server.1=data1:2888:3888
server.2=data2:2888:3888
server.3=data3:2888:3888
# The remaining parameters in zoo.cfg can be left at their defaults
# Create the data directory specified by dataDir above
mkdir /usr/local/zookeeper/data
# Enter that data directory and create a myid file containing a numeric id
# The id uniquely identifies this server and must be unique across the whole ensemble
# ZooKeeper uses this id to pick its server.X entry; e.g. an id of 1 maps to the server.1 line in zoo.cfg
cd /usr/local/zookeeper/data
touch myid
echo 1 > myid
# The ZooKeeper setup on data1 is now complete
# Next, distribute the ZooKeeper directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/zookeeper hadoop@data2:/usr/local/zookeeper
scp -rp /usr/local/zookeeper hadoop@data3:/usr/local/zookeeper
# On data2 and data3, edit /usr/local/zookeeper/data/myid
# Change the myid content to 2 on data2
# Change the myid content to 3 on data3
vi /usr/local/zookeeper/data/myid
# Configure the ZooKeeper environment variables on all three machines
# This avoids having to cd into ZooKeeper's bin directory for every zk command
sudo vi /etc/profile
# Add the following two lines at the end of the file
export ZK_HOME=/usr/local/zookeeper
export PATH=$ZK_HOME/bin:$PATH
# Remember to source the file after editing
source /etc/profile
# All configuration is now done; start the zk service on each of the three machines
zkServer.sh start
# After starting, check the current zk status
zkServer.sh status
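If the ensemble formed correctly, zkServer.sh status reports "Mode: leader" on one node and "Mode: follower" on the other two, and each node runs a QuorumPeerMain process:
# Run on each node; a QuorumPeerMain process should be listed
jps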
8. Deploy Hadoop
8.1 Install Hadoop
# Upload the Hadoop package to data1
# Extract the Hadoop tarball and move it to the target directory
tar -zxvf hadoop-2.6.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hadoop-2.6.0-cdh5.15.0 /usr/local/hadoop
# Configure the Hadoop environment variables
sudo vi /etc/profile
# Add at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Remember to source the file after editing
source /etc/profile
8.2 Edit hadoop-env.sh
# Go to Hadoop's configuration directory
cd /usr/local/hadoop/etc/hadoop
# Edit hadoop-env.sh
vi hadoop-env.sh
# Change export JAVA_HOME=${JAVA_HOME} to the JDK installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251
8.3 Edit core-site.xml
# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdhbds</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadooptmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>io.native.lib.available</name>
  <value>true</value>
  <description>Should native hadoop libraries, if present, be used.</description>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>A comma-separated list of the compression codec classes that can be used for compression/decompression. In addition to any classes specified with this property (which take precedence), codec classes on the classpath are discovered using a Java ServiceLoader.</description>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval. If zero, the value is set to the value of fs.trash.interval.</description>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>data1:2181,data2:2181,data3:2181</value>
  <description>The three ZooKeeper nodes.</description>
</property>
8.4 Edit hdfs-site.xml
# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
  <name>dfs.nameservices</name>
  <value>cdhbds</value>
  <description>Comma-separated list of nameservices.</description>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>The datanode server address and port for data transfer. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>52428800</value>
</property>
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>250</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50075</value>
  <description>The datanode http server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50020</value>
  <description>The datanode ipc server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.ha.namenodes.cdhbds</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cdhbds.nn1</name>
  <value>data1:8020</value>
  <description>RPC address of NameNode nn1.</description>
</property>
<property>
  <name>dfs.namenode.rpc-address.cdhbds.nn2</name>
  <value>data2:8020</value>
  <description>RPC address of NameNode nn2.</description>
</property>
<property>
  <name>dfs.namenode.http-address.cdhbds.nn1</name>
  <value>data1:50070</value>
  <description>HTTP address of NameNode nn1.</description>
</property>
<property>
  <name>dfs.namenode.http-address.cdhbds.nn2</name>
  <value>data2:50070</value>
  <description>HTTP address of NameNode nn2.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/namenode</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/data/checkpoint</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/datanode</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
  <description>Boolean which enables backend datanode-side support for the experimental DistributedFileSystem#getFileVBlockStorageLocations API.</description>
</property>
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
  <description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://data1:8485;data2:8485;data3:8485/cdhbds</value>
  <description>Three JournalNodes store the NameNode edit log; these are their hosts and ports.</description>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/journaldata/</value>
  <description>Storage path for the JournalNode data.</description>
</property>
<property>
  <name>dfs.journalnode.rpc-address</name>
  <value>0.0.0.0:8485</value>
</property>
<property>
  <name>dfs.journalnode.http-address</name>
  <value>0.0.0.0:8480</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cdhbds</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Class used to determine which NameNode is currently active.</description>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>10000</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
  <description>Whether automatic failover is enabled. See the HDFS High Availability documentation for details on automatic HA configuration.</description>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>20</value>
  <description>The number of server threads for the namenode.</description>
</property>
8.5 Edit mapred-site.xml
# Copy mapred-site.xml.template to mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
# Edit mapred-site.xml and add the following inside the <configuration> element
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.shuffle.port</name>
  <value>8350</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>0.0.0.0:10121</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19868</value>
</property>
<property>
  <name>mapreduce.jobtracker.http.address</name>
  <value>0.0.0.0:50330</value>
</property>
<property>
  <name>mapreduce.tasktracker.http.address</name>
  <value>0.0.0.0:50360</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.job.counters.max</name>
  <value>560</value>
  <description>Limit on the number of counters allowed per job.</description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>300</value>
</property>
8.6 Edit yarn-env.sh
# Edit the yarn-env.sh file
vi yarn-env.sh
# Change export JAVA_HOME=${JAVA_HOME} to the JDK installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251
8.7 Edit yarn-site.xml
# Add the following inside the <configuration> element
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>2000</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-rm-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm1</value>
  <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>data1:2181,data2:2181,data3:2181</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
  <value>5000</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>data1:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm1</name>
  <value>data1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>data1:50030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
  <value>data1:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm1</name>
  <value>data1:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.admin.address.rm1</name>
  <value>data1:8034</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>data2:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm2</name>
  <value>data2:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>data2:50030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
  <value>data2:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm2</name>
  <value>data2:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.admin.address.rm2</name>
  <value>data2:8034</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.address</name>
  <value>0.0.0.0:23344</value>
  <description>Address where the localizer IPC is.</description>
</property>
<property>
  <name>yarn.nodemanager.webapp.address</name>
  <value>0.0.0.0:23999</value>
  <description>NM Webapp address.</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>112640</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>31</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/yarn/logs</value>
</property>
8.8 Create fair-scheduler.xml
# Create a fair-scheduler.xml file in the same configuration directory (it is referenced by
# the yarn.scheduler.fair.allocation.file property above). The queue settings used here are:
#   minimum resources: 10240 mb, 10 vcores
#   maximum resources: 51200 mb, 18 vcores
#   scheduling policy: fair
#   remaining queue limits: 5 and 30
# A sketch of the full file is shown below.
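A minimal sketch of such an allocation file, assuming a single queue named default; the element names for the limits 5 and 30 are not recoverable from the values alone, so maxRunningApps and weight below are assumptions:
<?xml version="1.0"?>
<allocations>
  <queue name="default">
    <minResources>10240 mb, 10 vcores</minResources>
    <maxResources>51200 mb, 18 vcores</maxResources>
    <schedulingPolicy>fair</schedulingPolicy>
    <maxRunningApps>5</maxRunningApps>
    <weight>30</weight>
  </queue>
</allocations>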
8.9 Edit the slaves file
# Edit the slaves file
vi slaves
# Replace localhost with the following three lines; these machines become the worker nodes of the Hadoop cluster,
# i.e. the DataNode and NodeManager services will be started on them
data1
data2
data3
8.10 Distribute the Hadoop package
# Distribute the configured Hadoop directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hadoop hadoop@data2:/usr/local/hadoop
scp -rp /usr/local/hadoop hadoop@data3:/usr/local/hadoop
# After distributing, one setting in yarn-site.xml on data2 must be changed
# data1 and data2 are the two ResourceManager HA nodes
# Change the following property from rm1 to rm2, otherwise starting ResourceManager on data2
# will fail with an error that the data1 address/port is already in use
<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm2</value>
  <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
</property>
# Also configure the Hadoop environment variables on data2 and data3
sudo vi /etc/profile
# Add at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Remember to source the file after editing
source /etc/profile
8.11 Initialize and start the cluster
# First make sure the ZooKeeper ensemble started earlier is running
# Start the JournalNode service on each of the three machines
hadoop-daemon.sh start journalnode
# Format the NameNode on data1
hadoop namenode -format
# Copy the NameNode metadata directory from data1 to data2 so that both NameNodes start from the same metadata
# (this is the /data/namenode directory configured for dfs.namenode.name.dir in hdfs-site.xml)
scp -rp /data/namenode hadoop@data2:/data/namenode
# Format ZKFC on data1
hdfs zkfc -formatZK
# Start the HDFS distributed storage system from data1
start-dfs.sh
# Start the YARN cluster from data1
# This starts the ResourceManager service on data1 and the NodeManager service on data1, data2, and data3
start-yarn.sh
# Start the ResourceManager service on data2
yarn-daemon.sh start resourcemanager
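At this point, running jps on each node should roughly match the planning table in Section I (HBase and Kafka are not deployed yet):
# Expected processes per node:
#   data1, data2: NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController
#   data3:        DataNode, NodeManager, JournalNode, QuorumPeerMain
jps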
8.12 Verify the Hadoop cluster
(1) Open the HDFS web UI
From a browser on a Windows machine, visit http://data1:50070 to see basic information about the HDFS cluster.
(2) Open the YARN web UI
From a browser on a Windows machine, visit http://data1:50030 to see basic information about the YARN cluster's resources.
9. Deploy Hive
Because the Hive metadata is stored in MySQL, a MySQL database must be installed beforehand.
(1) Install the Hive package
# Upload the Hive package to data1, extract it, and move it to the target directory
tar -zxvf hive-1.1.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hive-1.1.0-cdh5.15.0 /usr/local/hive
# Configure the Hive environment variables
# Add the Hive PATH settings to /etc/profile
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
# Source /etc/profile after changing the environment variables
source /etc/profile
(2) Copy the MySQL driver
Copy the MySQL JDBC driver jar into Hive's lib directory, i.e. /usr/local/hive/lib.
(3) Edit hive-env.sh
# In Hive's conf directory, copy hive-env.sh.template to hive-env.sh
cp hive-env.sh.template hive-env.sh
# Set the following two parameters in hive-env.sh
HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
(4) Edit hive-site.xml
# Edit hive-site.xml and add the following properties
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.66.240:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description>This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
  <description>Creates the necessary schema on startup if one doesn't exist. Set this to false after creating it once.</description>
</property>
<property>
  <name>hive.mapjoin.check.memory.rows</name>
  <value>100000</value>
  <description>The number means after how many rows processed it needs to check the memory usage</description>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).</description>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask.size</name>
  <value>10000000</value>
  <description>If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). The default is 10MB.</description>
</property>
<property>
  <name>hive.auto.convert.join.use.nonstaged</name>
  <value>false</value>
  <description>For conditional joins, if input stream from a small alias can be directly applied to join operator without filtering or projection, the alias need not be pre-staged in distributed cache via mapred local task. Currently, this is not working with vectorization or tez execution engine.</description>
</property>
<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
  <description>The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run. They include: Cartesian Product, no partition being picked up for a query, comparing bigints and strings, comparing bigints and doubles, order by without limit.</description>
</property>
<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
<property>
  <name>hive.exec.parallel.thread.number</name>
  <value>8</value>
  <description>How many jobs at most can be executed in parallel</description>
</property>
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
  <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://data1:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <value>false</value>
  <description>Enable user impersonation for HiveServer2</description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
</property>
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>256000000</value>
</property>
<property>
  <name>hive.server2.logging.operation.enabled</name>
  <value>true</value>
</property>
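Since hive.metastore.uris points at thrift://data1:9083, the metastore service must be running before clients can connect, and HiveServer2 is usually started alongside it. A minimal sketch, assuming the MySQL settings above are in place (schematool and the service launcher ship with this Hive version):
# Initialize the metastore schema in MySQL (run once, on data1)
schematool -dbType mysql -initSchema
# Start the metastore and HiveServer2 in the background
nohup hive --service metastore > /tmp/hive-metastore.log 2>&1 &
nohup hive --service hiveserver2 > /tmp/hiveserver2.log 2>&1 &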
10. Deploy HBase
Upload and extract the HBase package to /usr/local/hbase on data1 (in the same way as the components above), then edit hbase-site.xml under /usr/local/hbase/conf and add the following properties:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://cdhbds/hbase</value>
</property>
<property>
  <name>dfs.nameservices</name>
  <value>cdhbds</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>/data/hbase/tmp</value>
</property>
<property>
  <name>hbase.master.port</name>
  <value>16000</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>data1,data2,data3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
(4) Copy the HDFS configuration files
Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory into HBase's conf directory.
(5) Configure the regionservers file
# List the HRegionServer hostnames: edit the regionservers file and add the following
data1
data2
data3
(6) Configure HMaster high availability
# In HBase's conf directory, create a backup-masters file and add the hostname of the standby HMaster node
data2
(7) Distribute the HBase package
# Distribute the configured HBase directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hbase hadoop@data2:/usr/local/hbase
scp -rp /usr/local/hbase hadoop@data3:/usr/local/hbase
(8) Start the HBase cluster
# Run the HBase cluster start command on data1
start-hbase.sh
# After startup, run jps on each of the three machines to see the HBase processes
# data1: HMaster and HRegionServer
# data2: HMaster and HRegionServer
# data3: HRegionServer
jps
(9) Verify HBase
# Run the hbase shell command on data1 to enter HBase's command-line client
hbase shell
The HBase web UI is also available at http://data1:60010.
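A quick smoke test inside the shell (the table name below is just an example):
# Run inside hbase shell
list
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:c1', 'v1'
scan 'smoke_test'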
11. Deploy Kafka
(1) Install the Kafka package
# Upload the Kafka package to data1, extract it, and move it to the target directory
tar -zxvf kafka_2.12-0.11.0.3.tgz -C /usr/local/
mv /usr/local/kafka_2.12-0.11.0.3 /usr/local/kafka
(2) Edit server.properties
# Edit the server.properties file and change the following settings
# Unique id of this broker
broker.id=0
# Set to the hostname of the current machine
listeners=PLAINTEXT://data1:9092
# Kafka log directory (this is also where the message data is stored)
log.dirs=/data/kafka-logs
# ZooKeeper connection string
zookeeper.connect=data1:2181,data2:2181,data3:2181
(3) Distribute the Kafka package
# Distribute the Kafka directory from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/kafka hadoop@data2:/usr/local/kafka
scp -rp /usr/local/kafka hadoop@data3:/usr/local/kafka
# On data2, change the following two settings in server.properties
# Unique id of this broker
broker.id=1
# Set to the hostname of the current machine
listeners=PLAINTEXT://data2:9092
# Likewise, change the corresponding settings on data3
# Unique id of this broker
broker.id=2
# Set to the hostname of the current machine
listeners=PLAINTEXT://data3:9092
(4) Start the Kafka cluster
# Start Kafka on each of the three machines
# Go to Kafka's bin directory and run the start command
cd /usr/local/kafka/bin
./kafka-server-start.sh -daemon ../config/server.properties
# The -daemon flag starts Kafka in the background
(5) Verify Kafka
# Run jps on each of the three machines to check that the Kafka process started
jps
# Create a topic
bin/kafka-topics.sh --create --zookeeper data1:2181,data2:2181,data3:2181 --replication-factor 3 --partitions 1 --topic test
# Start a console producer
bin/kafka-console-producer.sh --broker-list data1:9092,data2:9092,data3:9092 --topic test
# Start a console consumer
bin/kafka-console-consumer.sh --bootstrap-server data1:9092,data2:9092,data3:9092 --from-beginning --topic test
12. Deploy Flink on YARN
This Flink deployment uses the on-YARN mode with high availability (HA).
(1) Install the Flink package
# Upload the Flink package to data1, extract it, and move it to the target directory
tar -zxvf flink-1.10.1-bin-scala_2.12.tgz -C /usr/local/
mv /usr/local/flink-1.10.1 /usr/local/flink
(2) Edit flink-conf.yaml
# Go to Flink's conf directory and edit flink-conf.yaml
vi flink-conf.yaml
# Change the following settings
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://cdhbds/flink/ha/
high-availability.zookeeper.quorum: data1:2181,data2:2181,data3:2181
high-availability.zookeeper.path.root: /flink
state.backend: filesystem
state.checkpoints.dir: hdfs://cdhbds/flink/flink-checkpoints
state.savepoints.dir: hdfs://cdhbds/flink/flink-checkpoints
jobmanager.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/
historyserver.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/
yarn.application-attempts: 10
(3) Adjust the logging configuration
Flink's conf directory contains both log4j and logback configuration files, which produces a warning when the cluster starts:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/root/flink-1.7.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
So one of the log configuration files needs to go; renaming log4j.properties to log4j.properties.bak is enough.
(4) Configure the Hadoop classpath
# This Flink release does not bundle Hadoop integration; the official documentation says it must be set up separately.
# The documentation offers two options: either export the hadoop classpath,
# or copy flink-shaded-hadoop-2-uber-xx.jar into Flink's lib directory.
# Here we use the first option and configure the hadoop classpath.
# Edit /etc/profile and add the following line
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
# Source /etc/profile after changing the environment variables
source /etc/profile
(5) Start the Flink cluster in yarn-session mode
# Start the cluster in yarn-session mode: go to the bin directory and run
./yarn-session.sh -s 4 -jm 1024m -tm 4096m -nm flink-test -d
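Once the session is up, a bundled example job can be submitted to confirm the setup; the WordCount jar ships with the Flink distribution, and the paths below assume the install directory used above:
# Submit the streaming WordCount example to the running yarn-session
/usr/local/flink/bin/flink run /usr/local/flink/examples/streaming/WordCount.jar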
(6) Start the HistoryServer
# Command to start the HistoryServer
./historyserver.sh start