非晚の

Apache Hive 3.1.3 Installation

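Contents

1 Hive official site and download locations
2 Installing the Hadoop cluster
3 Installing the Hive service
- 3.1 Hive deployment modes
  - 3.1.1 Metadata and the metastore
  - 3.1.2 Metastore configuration modes
  - 3.1.3 Clients
- 3.2 Embedded mode
- 3.3 Local mode
- 3.4 Remote mode
- 3.5 Connecting to Hive with different clients
- 3.6 Writing a script to start Hive automatically
- 3.7 Other Hive commands
4 Common Hive property settings
- 4.1 Configuring the Hive runtime log
- 4.2 Printing the current database and column headers
- 4.3 Ways to set parameters
5 Troubleshooting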

1 Hive official site and download locations

(1) Hive official site
http://hive.apache.org/
(2) Documentation
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
(3) Downloads
http://archive.apache.org/dist/hive/

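2 Installing the Hadoop cluster

Hive depends on a running Hadoop environment: HDFS provides data storage and MapReduce provides computation.
Deploy the Hadoop environment before installing Hive.
For Hadoop cluster installation, see: Apache Hadoop 3.3.3 Cluster Installation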
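Before moving on, it can help to confirm that the Hadoop daemons expected on the current node are actually running. The following is a minimal sketch, not part of the original procedure: missing_daemons is a hypothetical helper that reads jps output and prints the daemons that are absent. Which daemons belong on which node depends on your cluster layout, so the names passed in are an assumption.

```shell
#!/bin/bash
# Sketch only: print each expected daemon that does NOT appear in `jps` output.
# Usage: jps | missing_daemons NameNode DataNode
missing_daemons() {
  local running name
  # jps prints "<pid> <MainClass>" per line; keep only the class names.
  running=$(awk '{print $2}')
  for name in "$@"; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode.
    grep -qxF "$name" <<< "$running" || echo "$name"
  done
}
```

For example, `jps | missing_daemons NameNode DataNode NodeManager` prints nothing when all three daemons are up on the node.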

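3 Installing the Hive service

3.1 Hive deployment modes

Hive requires a JDK and Hadoop to be installed first. If MySQL is used to store the metadata, MySQL must be installed as well.

3.1.1 Metadata and the metastore

Metadata describes the objects created in Hive: databases, tables, table columns, and so on. It is stored in a relational database, such as Hive's built-in Derby or a third-party database such as MySQL.
The metastore is the metadata service. Clients connect to the metastore service, and the metastore in turn connects to the MySQL database to read and write metadata. With a metastore service in place, multiple clients can connect at the same time, and they do not need to know the MySQL username and password; they only need to connect to the metastore service.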

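3.1.2 Metastore configuration modes

(1) Embedded mode
Embedded mode stores metadata in the embedded Derby database and needs no separate metastore service; both the database and the metastore service run inside the main Hive process. This is the default configuration. Only one client can connect at a time, so it is suitable for experiments but not for production.
Pros: unpack the Hive package, set the environment variables, and run bin/hive.
Cons: only one session can run at a time, so multiple windows on the same node cannot share it. Hive instances started on different nodes each get their own metadata, which cannot be shared.

(2) Local mode
Local mode stores metadata in an external database. Supported databases are MySQL, Postgres, Oracle, and MS SQL Server.
Local mode needs no separate metastore service; it uses the metastore that runs in the same process as Hive. In other words, every Hive service you start also starts its own metastore.
Hive decides the mode from the hive.metastore.uris parameter: if it is empty, Hive runs in local mode.
Pros: usable from multiple nodes and multiple windows.
Cons: every Hive service that is started also starts a built-in metastore.

(3) Remote mode
In remote mode the metastore service is started separately, and each client is configured to connect to it. The metastore service and Hive run in different processes.
Remote mode is recommended for production, because other software that depends on Hive can then access Hive through the metastore.

In remote mode, set hive.metastore.uris to the host (or IP) and port where the metastore service runs, and start the metastore service manually.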
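The decision rule described above can be sketched as a small shell helper that inspects a hive-site.xml and reports which of the three modes it implies. This is only an illustration: get_hive_prop and metastore_mode are hypothetical names, and the parsing assumes each <name> element is followed by its <value> on the next line, which is a simplification and not a real XML parser.

```shell
#!/bin/bash
# Sketch only: print the <value> that follows <name>NAME</name> in a hive-site.xml.
# Assumes <name> and <value> are on separate, adjacent lines.
get_hive_prop() {
  local name="$1" file="$2"
  [ -f "$file" ] || return 0
  awk -v n="<name>${name}</name>" '
    index($0, n) { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Strip the 7-character <value> and the 8-character </value>.
      print substr($0, RSTART + 7, RLENGTH - 15); exit
    }
    found { found = 0 }
  ' "$file"
}

# Print embedded, local, or remote for the given hive-site.xml.
metastore_mode() {
  local file="$1" uris url
  uris=$(get_hive_prop hive.metastore.uris "$file")
  url=$(get_hive_prop javax.jdo.option.ConnectionURL "$file")
  if [ -n "$uris" ]; then
    echo remote        # a metastore URI is configured
  elif [ -z "$url" ] || [[ "$url" == jdbc:derby:* ]]; then
    echo embedded      # no external database configured, Derby is used
  else
    echo local         # external database, metastore runs inside the Hive process
  fi
}
```

A call such as `metastore_mode "$HIVE_HOME/conf/hive-site.xml"` prints one of the three mode names.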

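3.1.3 Clients

(1) Hive Client
The bin directory of the Hive package contains the first-generation client, bin/hive. It accesses the Hive metastore service, which lets you operate Hive.
Note from the project: the first-generation client is no longer recommended.
(2) Hive Beeline Client
Hive later introduced a second-generation client, beeline. Beeline does not access the metastore service directly; it connects to a hiveserver2 service, which must be started separately.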

3.2 Embedded mode

(1) Unpack the Hive package into the installation directory

[root@node01 package]# ll
total 1088964
-rw-r--r--. 1 root root 326940667 Jun  1 09:53 apache-hive-3.1.3-bin.tar.gz
[root@node01 package]#
[root@node01 package]# tar -zxvf apache-hive-3.1.3-bin.tar.gz -C /opt/soft/

[root@node01 soft]# mv apache-hive-3.1.3-bin/  hive-3.1.3
[root@node01 soft]# ll
total 0
drwxr-xr-x  10 root root 184 Jun  2 10:04 hive-3.1.3

(2) Add the Hive environment variables

vim /etc/profile

export HIVE_HOME=/opt/soft/hive-3.1.3
export PATH=$PATH:$HIVE_HOME/bin

source /etc/profile
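If the profile is edited by a script instead of by hand, it is worth making the edit idempotent so that repeated runs do not append duplicate lines. The sketch below assumes a hypothetical helper named add_hive_env; the profile path and Hive home are passed in as arguments.

```shell
#!/bin/bash
# Sketch only: append the Hive variables to a profile file once.
# Usage: add_hive_env /etc/profile /opt/soft/hive-3.1.3
add_hive_env() {
  local profile="$1" hive_home="$2"
  # Skip the append if HIVE_HOME is already exported in the file.
  if ! grep -q '^export HIVE_HOME=' "$profile" 2>/dev/null; then
    {
      echo "export HIVE_HOME=${hive_home}"
      # Single quotes keep $PATH and $HIVE_HOME unexpanded in the file.
      echo 'export PATH=$PATH:$HIVE_HOME/bin'
    } >> "$profile"
  fi
}
```

After calling it, `source` the profile as shown above so that the current shell picks up the change.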

(3) Initialize the metastore database

[root@node01 ~]# schematool -dbType derby -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/soft/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/soft/hadoop-3.3.3/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :    org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:       APP
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.derby.sql

(4) Start and use Hive
- Start Hive

hive

After starting, Hive creates the /tmp/hive directory on HDFS:

[root@node01 ~]# hadoop fs -ls /tmp
Found 3 items
drwx------   - root supergroup          0 2022-06-01 21:36 /tmp/hadoop-yarn
drwx-wx-wx   - root supergroup          0 2022-06-02 11:46 /tmp/hive
drwxrwxrwt   - root root                0 2022-06-01 22:00 /tmp/logs

- Use Hive

hive> show databases;
OK
default
Time taken: 0.464 seconds, Fetched: 1 row(s)

hive> show tables;
OK
Time taken: 0.04 seconds

hive> create table test(id int);
OK
Time taken: 0.473 seconds

hive>  insert into test values(1);
Query ID = root_20220602114906_0bc1432a-ce4a-479e-875d-050998d6dd0e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1654141576996_0001, Tracking URL = http://node02:8088/proxy/application_1654141576996_0001/
Kill Command = /opt/soft/hadoop-3.3.3/bin/mapred job  -kill job_1654141576996_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-06-02 11:49:20,924 Stage-1 map = 0%,  reduce = 0%
2022-06-02 11:49:30,190 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.84 sec
2022-06-02 11:49:37,443 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.91 sec
MapReduce Total cumulative CPU time: 7 seconds 910 msec
Ended Job = job_1654141576996_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://node01:8020/user/hive/warehouse/test/.hive-staging_hive_2022-06-02_11-49-06_745_8431357607505566080-1/-ext-10000
Loading data to table default.test
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.91 sec   HDFS Read: 12693 HDFS Write: 199 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 910 msec
OK
Time taken: 32.248 seconds

hive> select * from test;
OK
1
Time taken: 0.136 seconds, Fetched: 1 row(s)

- Open another terminal window, start Hive there as well, and watch the hive.log file under /tmp/root

tail -f /tmp/root/hive.log

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/metastore_db.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        ... 12 more

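The cause is that Hive uses Derby as its default metastore database. Once one Hive session is open it holds the database exclusively and does not share it with other clients, so we need to move Hive's metadata to MySQL.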
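To confirm that a failed start was caused by this Derby lock and not by something else, the log can be searched for the XSDB6 error code shown in the stack trace. A minimal sketch, assuming a hypothetical helper name and the default log path used in this article:

```shell
#!/bin/bash
# Sketch only: succeed if the Hive log contains Derby's XSDB6 error,
# which indicates that another process has already booted the database.
has_derby_lock_error() {
  local log="${1:-/tmp/root/hive.log}"
  grep -q 'ERROR XSDB6' "$log" 2>/dev/null
}
```

For example: `has_derby_lock_error && echo "metastore_db is held by another Hive session"`.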

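3.3 Local mode

(1) Prerequisites
- Install MySQL (or another supported metastore database) first. For installation steps, see: Installing MySQL 5.7 on Linux
- Upload the MySQL JDBC driver to Hive's lib directory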
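A quick check that the driver jar really landed in the lib directory can save a confusing schematool failure later. The sketch below is an assumption-laden illustration: has_mysql_driver is a hypothetical helper, and it assumes the jar file name starts with mysql-connector-, which is the usual naming for MySQL Connector/J but is not stated in this article.

```shell
#!/bin/bash
# Sketch only: succeed if a MySQL Connector/J jar is present in the given directory.
# Assumes the jar is named mysql-connector-*.jar.
has_mysql_driver() {
  local lib_dir="$1" jar
  for jar in "$lib_dir"/mysql-connector-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails.
    [ -e "$jar" ] && return 0
  done
  return 1
}
```

For example: `has_mysql_driver "$HIVE_HOME/lib" || echo "MySQL JDBC driver not found in $HIVE_HOME/lib"`.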

(2) Point the metastore at MySQL

cd $HIVE_HOME/conf
vim hive-site.xml




<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node01:3306/metastore?useSSL=false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>

(3) Log in to MySQL

mysql -uroot -p123456

(4) Create the Hive metastore database and grant remote login privileges

mysql>  create database metastore;
Query OK, 1 row affected (0.00 sec)


mysql> grant all on *.* to root@'%' identified by '123456';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

(5) Initialize the Hive metastore schema

schematool -initSchema -dbType mysql -verbose

Inspect the metastore database:

mysql> use metastore
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-------------------------------+
| Tables_in_metastore           |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
...
| WRITE_SET                     |
+-------------------------------+
74 rows in set (0.00 sec)

(6) Start Hive
Hive can now be accessed from several windows on the same node.

3.4 Remote mode

Access Hive through the metastore service.
(1) Add the following to hive-site.xml

vim hive-site.xml

  
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node01:9083</value>
  </property>
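Scripts that need to reach the metastore often have to split this URI into a host and a port, for instance to test whether the port is open. The following is a minimal sketch with a hypothetical helper name; it handles a single thrift://host:port value only, not a comma-separated list of URIs.

```shell
#!/bin/bash
# Sketch only: split a single thrift://host:port URI into "host port".
parse_thrift_uri() {
  local uri="$1" hostport
  case "$uri" in
    thrift://*) hostport="${uri#thrift://}" ;;
    *) echo "not a thrift:// URI: $uri" >&2; return 1 ;;
  esac
  # Require an explicit port.
  [[ "$hostport" == *:* ]] || { echo "missing port in: $uri" >&2; return 1; }
  # Host is the text before the last colon, port the text after it.
  echo "${hostport%:*} ${hostport##*:}"
}
```

Usage: `read -r host port < <(parse_thrift_uri thrift://node01:9083)` leaves node01 in host and 9083 in port.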

(2) Start the metastore

hive --service metastore

[root@node01 ~]# hive --service metastore
2022-06-02 14:38:21: Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/soft/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/soft/hadoop-3.3.3/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Note: the command runs in the foreground and occupies the terminal, so open a new shell window for other work.

(3) Start Hive

[root@node01 conf]# hive
Hive Session ID = d5e5cb6c-19d1-467f-ad86-71c9b7ce94e9
hive> show databases;
OK
default
Time taken: 0.489 seconds, Fetched: 1 row(s)
hive> show tables;
OK
test
Time taken: 0.06 seconds, Fetched: 1 row(s)
hive>

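3.5 Connecting to Hive with different clients

(1) Hive client: access Hive through the metastore service
As in the previous section, connect directly with the hive command.

hive

(2) Beeline client: access Hive over JDBC
- Add the following to hive-site.xml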

  
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node01</value>
  </property>

  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>

- Start the metastore and hiveserver2

hive --service metastore
hive --service hiveserver2

- Connect to Hive with the beeline client

beeline -u jdbc:hive2://node01:10000 -n root

[root@node01 hadoop]# beeline -u jdbc:hive2://node01:10000 -n root
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/soft/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/soft/hadoop-3.3.3/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://node01:10000
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.3 by Apache Hive
0: jdbc:hive2://node01:10000>
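HiveServer2 can take a while to open its port after the process starts, so a beeline connection attempted too early may be refused. One way to handle this in a script is to poll the port first. The sketch below uses bash's built-in /dev/tcp redirection; wait_for_port is a hypothetical helper, and the timeout value in the usage line is an arbitrary assumption.

```shell
#!/bin/bash
# Sketch only: wait until host:port accepts TCP connections, or give up
# after roughly TIMEOUT seconds (default 60). Requires bash for /dev/tcp.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-60}" waited=0
  # The subshell opens a TCP connection and closes it again when it exits.
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Usage: `wait_for_port node01 10000 120 && beeline -u jdbc:hive2://node01:10000 -n root`.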

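3.6 Writing a script to start Hive automatically

(1) Starting services in the foreground requires several shell windows. They can be started in the background instead:

nohup: placed at the start of a command; the process ignores the hangup signal, so it keeps running after the terminal is closed
/dev/null: a special file on Linux, often called the black hole; everything written to it is discarded
2>&1: redirects standard error to standard output
&: placed at the end of a command; runs it in the background

These are usually combined as: nohup [command] > file 2>&1 &, which writes the command's output to file and keeps the process running in the background.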

[root@node01 ~]# nohup hive --service metastore 2>&1 & 
[root@node01 ~]# nohup hive --service hiveserver2 2>&1 &
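The two commands above do not name an output file, so nohup falls back to writing nohup.out in the current directory. A minimal sketch of the full nohup ... > file 2>&1 & pattern, wrapped in a hypothetical helper named start_bg that also prints the PID of the background process:

```shell
#!/bin/bash
# Sketch only: run a command in the background, immune to hangups,
# with stdout and stderr both written to the given log file.
# Usage: start_bg LOGFILE COMMAND [ARGS...]
start_bg() {
  local log="$1"
  shift
  nohup "$@" > "$log" 2>&1 &
  # $! is the PID of the most recent background job.
  echo $!
}
```

For example, after `mkdir -p "$HIVE_HOME/logs"`, the call `start_bg "$HIVE_HOME/logs/metastore.log" hive --service metastore` starts the metastore and prints its PID.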

(2) Write a script to manage starting and stopping the Hive services

vim ~/bin/myhive.sh

#!/bin/bash
# Create the directory for the Hive service logs
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d "$HIVE_LOG_DIR" ]
then
    mkdir -p "$HIVE_LOG_DIR"
fi

# Check whether a process is running normally: $1 is the process name, $2 is its port.
# Prints the PIDs that match the name; returns 0 only if the PID listening on the port is among them.
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore service is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service is already running"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 service is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    metapid=$(check_process HiveMetastore 9083)
    server2pid=$(check_process HiveServer2 10000)
    [ "$metapid" ] && echo "Metastore service is running normally" || echo "Metastore service is not running normally"
    [ "$server2pid" ] && echo "HiveServer2 service is running normally" || echo "HiveServer2 service is not running normally"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac

(3) Make the script executable

chmod +x ~/bin/myhive.sh

(4) Start the Hive background services

myhive.sh start

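3.7 Other Hive commands

(1) Exit the hive shell:

hive(default)>exit;
hive(default)>quit;

(2) View the HDFS file system from the hive CLI

hive(default)>dfs -ls /;

(3) View the history of all commands entered in hive
Go to the current user's home directory (/root here) and look at the .hivehistory file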

cat .hivehistory

4 Common Hive property settings

4.1 Configuring the Hive runtime log

(1) By default Hive writes its log to /tmp/root/hive.log (under a directory named after the current user)
(2) Change the log location to /opt/soft/hive-3.1.3/logs
- Rename /opt/soft/hive-3.1.3/conf/hive-log4j2.properties.template to hive-log4j2.properties

[root@node01 conf]# pwd
/opt/soft/hive-3.1.3/conf
[root@node01 conf]# mv hive-log4j2.properties.template hive-log4j2.properties

- Change the log location in hive-log4j2.properties

property.hive.log.dir=/opt/soft/hive-3.1.3/logs
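The same change can be scripted. The sketch below rewrites the property.hive.log.dir line in place; set_hive_log_dir is a hypothetical helper, it assumes the key may be written with or without spaces around the equals sign, and it assumes the target directory contains no | or & characters, which would confuse the sed expression.

```shell
#!/bin/bash
# Sketch only: point property.hive.log.dir at a new directory.
# Usage: set_hive_log_dir /opt/soft/hive-3.1.3/conf/hive-log4j2.properties /opt/soft/hive-3.1.3/logs
set_hive_log_dir() {
  local file="$1" dir="$2" tmp
  tmp=$(mktemp) || return 1
  # Write to a temporary file first, then copy back so file permissions are kept.
  if sed "s|^property\.hive\.log\.dir[[:space:]]*=.*|property.hive.log.dir=${dir}|" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

Hive processes that are already running need a restart before they write to the new location.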

4.2 Printing the current database and column headers

Add the following two properties to hive-site.xml:

  
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>

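With these settings, the hive CLI prompt shows the current database, for example hive (default)>, and query results are printed with a header row of column names.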

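4.3 Ways to set parameters

(1) View all current configuration values

hive>set;

(2) Three ways to set parameters
- Configuration files

Default configuration file: hive-default.xml

User-defined configuration file: hive-site.xml

Note: user-defined settings override the defaults. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and Hive's settings override Hadoop's. Settings in configuration files apply to every Hive process started on the machine.

- Command-line options
When starting Hive, add -hiveconf param=value on the command line to set a parameter.
For example: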

bin/hive -hiveconf mapred.reduce.tasks=10;

Note: this applies only to the Hive session being started.
View the parameter setting:

hive (default)> set mapred.reduce.tasks;

- SET statements
Parameters can be set in HQL with the SET keyword.
For example:

hive (default)> set mapred.reduce.tasks=100;

Note: this applies only to the current Hive session.
View the parameter setting: