The first article in this series introduced the main features of Hive in the big-data ecosystem: it lets you express large-scale computation tasks as SQL statements, which is a great convenience for business users. The previous article described how to set up a single-node Hadoop environment on Linux. This post records, building on that environment, the concrete steps for setting up the data-warehouse tool Hive.
Environment
Component | Version
---|---
OS | Ubuntu 18.04
Java | openjdk version "1.8.0_212"
Hadoop | 3.2.0
MySQL | 5.7
Hive | 3.1.1
Configure MySQL
MySQL is configured here to store Hive's metadata (the metastore).
- Install `mysql` (detailed walkthrough: https://www.jianshu.com/p/f81ec3a7c903)
sudo apt install -y mysql-server
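A quick way to confirm the server was installed and is running (assuming a systemd-based Ubuntu, as in the environment table above):
mysql --version
sudo systemctl status mysql --no-pager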
- Create a `hive` user
mysql> create user 'hive'@'localhost' identified by '123456';
mysql> grant all on *.* to 'hive'@'localhost' identified by '123456';
mysql> flush privileges;
- Verify the `hive` user
mysql> select host, user, authentication_string from mysql.user;
+-----------+------------------+-------------------------------------------+
| host | user | authentication_string |
+-----------+------------------+-------------------------------------------+
| localhost | root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | mysql.session | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | debian-sys-maint | *01B9179E564621D21BF72BE9D4D31F64EAD60870 |
| % | root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | hive | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+-----------+------------------+-------------------------------------------+
6 rows in set (0.00 sec)
- Create the `hive` database
mysql -u hive -p
mysql> create database hive;
Configure Hive
- Download Hive
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
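The tarball still needs to be unpacked and moved into place; a minimal sketch, assuming the /home/hduser/hadoop-family/ layout used by the environment variables below:
tar -xzf apache-hive-3.1.1-bin.tar.gz
mv apache-hive-3.1.1-bin /home/hduser/hadoop-family/hive-3.1.1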
- Configure the environment variables
export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1
export PATH=${PATH}:${HADOOP_HOME}/bin/:${HADOOP_HOME}/sbin/:${HIVE_HOME}/bin:${DERBY_HOME}/bin
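These exports only last for the current shell; to make them persistent, one option (assuming bash as the login shell, with HADOOP_HOME and DERBY_HOME already exported by the earlier Hadoop setup) is to append them to ~/.bashrc and reload:
echo 'export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1' >> ~/.bashrc
echo 'export PATH=${PATH}:${HIVE_HOME}/bin' >> ~/.bashrc
source ~/.bashrc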
- Configure hive-site.xml (5 properties)
There is no hive-site.xml under hive-3.1.1/conf by default; create it from the template first:
cp hive-default.xml.template hive-site.xml
Then edit the following properties. Note that inside XML values, & must be escaped as &amp; (as in the JDBC URL below), and the invalid character &#8; around line 3210 of the file must be removed.
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;useSSL=false</value>
  <description>JDBC connect string for a JDBC metastore.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>localhost</value>
  <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
Create a local scratch directory (it must match the replacement path used in the sed commands below), then replace every occurrence of ${system:java.io.tmpdir} and ${system:user.name} in hive-site.xml:
mkdir -p /home/hduser/hadoop-family/hive-3.1.1/hive_data/tmp/
sed -i 's/${system:java.io.tmpdir}/\/home\/hduser\/hadoop-family\/hive-3.1.1\/hive_data\/tmp/g' hive-site.xml
sed -i 's/${system:user.name}/hduser/g' hive-site.xml
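A quick sanity check that the substitution caught everything; both commands should print 0 after the sed:
grep -c 'system:java.io.tmpdir' hive-site.xml
grep -c 'system:user.name' hive-site.xml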
- Download mysql-connector-java.jar and place it in hive-3.1.1/lib/
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz
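The download is a tar.gz archive rather than the bare jar; a sketch of extracting it and copying the driver into Hive's lib directory (the exact jar name inside the archive is an assumption, so check it after extracting):
tar -xzf mysql-connector-java-5.1.47.tar.gz
cp mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /home/hduser/hadoop-family/hive-3.1.1/lib/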
Configure Hadoop
Edit hadoop-3.2.0/etc/hadoop/core-site.xml and add the following. The reason is Hadoop's impersonation (proxy user) mechanism: an upper-layer system is not allowed to pass its real end user directly down to the Hadoop layer; instead it hands the real user to a designated proxy user, which performs the operations on Hadoop on that user's behalf, preventing arbitrary clients from manipulating Hadoop at will.
<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>
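Changes to core-site.xml only take effect after a restart; assuming the stock sbin scripts from the single-node Hadoop setup:
stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh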
Before creating tables with hive, you need to create Hive's hive.metastore.warehouse.dir in HDFS and grant it g+w permission:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
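You can verify that the directories exist with the expected group-write bit (output will vary with your user and timestamps):
hdfs dfs -ls -d /tmp /user/hive/warehouse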
Start Hive
Since Hive 2.1, you must first run the schematool command to initialize the metastore schema:
schematool -dbType mysql -initSchema
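To confirm the initialization, schematool can report the schema version, and the metastore tables (standard names such as DBS, TBLS, COLUMNS_V2) should now exist in the MySQL hive database:
schematool -dbType mysql -info
mysql -u hive -p -e 'use hive; show tables;'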
- Create a database
hive> create database hive_db;
- Create a table
hive> use hive_db;
hive> create table hive_table (
> id int,
> name varchar(16),
> age int,
> message varchar(128)
> );
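Standard commands to confirm the table was registered (neither launches a MapReduce job):
hive> show tables;
hive> describe hive_table;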
- Insert data
hive converts the sql statement into MapReduce jobs for execution:
hive> insert into hive_table values (1, 'xiaoxin', 16, 'I am wangxin'),(2, 'kunkun', 18, 'I am shikun');
Query ID = hduser_20190625042455_3600e443-663d-45f6-b842-37a3f2a8843b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1561428402986_0002, Tracking URL = http://ubuntu:8088/proxy/application_1561428402986_0002/
Kill Command = /home/hduser/hadoop-family/hadoop-3.2.0/bin/mapred job -kill job_1561428402986_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-06-25 04:25:04,263 Stage-1 map = 0%, reduce = 0%
2019-06-25 04:25:10,502 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.77 sec
2019-06-25 04:25:15,666 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.49 sec
MapReduce Total cumulative CPU time: 5 seconds 490 msec
Ended Job = job_1561428402986_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hive_db.db/hive_table/.hive-staging_hive_2019-06-25_04-24-55_404_2009487560348382612-1/-ext-10000
Loading data to table hive_db.hive_table
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.49 sec HDFS Read: 18587 HDFS Write: 380 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 490 msec
OK
Time taken: 21.773 seconds
- Query data
hive> select * from hive_table;
OK
1 xiaoxin 16 I am wangxin
2 kunkun 18 I am shikun
Time taken: 0.174 seconds, Fetched: 2 row(s)
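Note that a bare select * is served by a simple fetch task without MapReduce (hence the 0.174 seconds above), whereas an aggregation such as count(*) will again launch a job; for example:
hive> select count(*) from hive_table;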
- Inspect the Hadoop files
The newly created database directory is visible in HDFS:
$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hduser supergroup 0 2019-06-25 04:22 /user/hive/warehouse/hive_db.db
Start HiveServer2
HiveServer was deprecated after Hive 2.0 because it lacked features such as support for multiple concurrent client connections; HiveServer2 replaces it and comes with its own CLI tool, beeline.
- Start hiveserver2
$ hiveserver2
2019-06-25 19:20:16: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hive-3.1.1/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hadoop-3.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 2b252413-1672-4aee-943e-b1fac451ffeb
Hive Session ID = 4937ae43-4382-4333-ac1b-458851a55d45
Hive Session ID = ea183319-7ba4-4f3b-9d3b-d6ac6477aeca
Hive Session ID = 00343592-4acb-4d21-a2ec-af4ef765b2af
- Connect to hiveserver2
# Embedded (local) connection
beeline -u jdbc:hive2://
# Remote connection
beeline -u jdbc:hive2://localhost:10000 -n hduser -p 123456
# Interactive connection test
beeline
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hduser
Enter password for jdbc:hive2://localhost:10000: 123456
- Query data
After connecting successfully, the operations are the same as in the Hive CLI above:
0: jdbc:hive2://localhost:10000> select * from hive_db.hive_table;
INFO : Compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:hive_table.id, type:int, comment:null), FieldSchema(name:hive_table.name, type:varchar(16), comment:null), FieldSchema(name:hive_table.age, type:int, comment:null), FieldSchema(name:hive_table.message, type:varchar(128), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.171 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO : Completed executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.0 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------------+------------------+-----------------+---------------------+
| hive_table.id | hive_table.name | hive_table.age | hive_table.message |
+----------------+------------------+-----------------+---------------------+
| 1 | xiaoxin | 16 | I am wangxin |
| 2 | kunkun | 18 | I am shikun |
+----------------+------------------+-----------------+---------------------+
2 rows selected (0.214 seconds)