The first article in this series introduced the main features of Hive in the big-data ecosystem: it lets you express large-scale computation tasks as SQL statements, which is a great convenience for business users. The previous article described how to set up a single-node Hadoop environment on Linux. This post records, building on that environment, the concrete steps for setting up the data-warehouse tool Hive.
Environment
Component | Version
---|---
OS | Ubuntu 18.04
Java | openjdk version "1.8.0_212"
Hadoop | 3.2.0
MySQL | 5.7
Hive | 3.1.1
Configure MySQL
MySQL is configured here to store Hive's metadata (the metastore).
- Install `mysql` (detailed walkthrough: https://www.jianshu.com/p/f81ec3a7c903)
sudo apt install -y mysql-server
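A quick way to confirm the server was installed and is running (assuming a systemd-based Ubuntu, as in the environment table above):
mysql --version
sudo systemctl status mysql --no-pager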
- Create a `hive` user
mysql> create user 'hive'@'localhost' identified by '123456';
mysql> grant all on *.* to 'hive'@'localhost' identified by '123456';
mysql> flush privileges;
- Verify the `hive` user
mysql> select host, user, authentication_string from mysql.user;
+-----------+------------------+-------------------------------------------+
| host | user | authentication_string |
+-----------+------------------+-------------------------------------------+
| localhost | root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | mysql.session | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | mysql.sys | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE |
| localhost | debian-sys-maint | *01B9179E564621D21BF72BE9D4D31F64EAD60870 |
| % | root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| localhost | hive | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+-----------+------------------+-------------------------------------------+
6 rows in set (0.00 sec)
- Create the `hive` database
mysql -u hive -p
mysql> create database hive;
Configure Hive
- Download Hive
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
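The tarball still needs to be unpacked and moved into place; a minimal sketch, assuming the /home/hduser/hadoop-family/ layout used by the environment variables below:
tar -xzf apache-hive-3.1.1-bin.tar.gz
mv apache-hive-3.1.1-bin /home/hduser/hadoop-family/hive-3.1.1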
- Configure the environment variables
export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1
export PATH=${PATH}:${HADOOP_HOME}/bin/:${HADOOP_HOME}/sbin/:${HIVE_HOME}/bin:${DERBY_HOME}/bin
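These exports only last for the current shell; to make them persistent, one option (assuming bash as the login shell, with HADOOP_HOME and DERBY_HOME already exported by the earlier Hadoop setup) is to append them to ~/.bashrc and reload:
echo 'export HIVE_HOME=/home/hduser/hadoop-family/hive-3.1.1' >> ~/.bashrc
echo 'export PATH=${PATH}:${HIVE_HOME}/bin' >> ~/.bashrc
source ~/.bashrc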
- Configure hive-site.xml (5 properties)
There is no hive-site.xml under hive-3.1.1/conf by default; create it from the template first:
cp hive-default.xml.template hive-site.xml
Then edit the following properties. Note that inside XML values, & must be escaped as &amp; (as in the JDBC URL below), and the invalid character &#8; around line 3210 of the file must be removed.
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;useSSL=false</value>
  <description>JDBC connect string for a JDBC metastore.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>localhost</value>
  <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
Create a local scratch directory (it must match the replacement path used in the sed commands below), then replace every occurrence of ${system:java.io.tmpdir} and ${system:user.name} in hive-site.xml:
mkdir -p /home/hduser/hadoop-family/hive-3.1.1/hive_data/tmp/
sed -i 's/${system:java.io.tmpdir}/\/home\/hduser\/hadoop-family\/hive-3.1.1\/hive_data\/tmp/g' hive-site.xml
sed -i 's/${system:user.name}/hduser/g' hive-site.xml
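A quick sanity check that the substitution caught everything; both commands should print 0 after the sed:
grep -c 'system:java.io.tmpdir' hive-site.xml
grep -c 'system:user.name' hive-site.xml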
- Download mysql-connector-java.jar and place it in hive-3.1.1/lib/
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz
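The download is a tar.gz archive rather than the bare jar; a sketch of extracting it and copying the driver into Hive's lib directory (the exact jar name inside the archive is an assumption, so check it after extracting):
tar -xzf mysql-connector-java-5.1.47.tar.gz
cp mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /home/hduser/hadoop-family/hive-3.1.1/lib/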
Configure Hadoop
Edit hadoop-3.2.0/etc/hadoop/core-site.xml and add the following. The reason is Hadoop's impersonation (proxy user) mechanism: an upper-layer system is not allowed to pass its real end user directly down to the Hadoop layer; instead it hands the real user to a designated proxy user, which performs the operations on Hadoop on that user's behalf, preventing arbitrary clients from manipulating Hadoop at will.
<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>
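Changes to core-site.xml only take effect after a restart; assuming the stock sbin scripts from the single-node Hadoop setup:
stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh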
Before creating tables with hive, you need to create Hive's hive.metastore.warehouse.dir in HDFS and grant it g+w permission:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
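You can verify that the directories exist with the expected group-write bit (output will vary with your user and timestamps):
hdfs dfs -ls -d /tmp /user/hive/warehouse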
Start Hive
Since Hive 2.1, you must first run the schematool command to initialize the metastore schema:
schematool -dbType mysql -initSchema
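To confirm the initialization, schematool can report the schema version, and the metastore tables (standard names such as DBS, TBLS, COLUMNS_V2) should now exist in the MySQL hive database:
schematool -dbType mysql -info
mysql -u hive -p -e 'use hive; show tables;'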
- Create a database
hive> create database hive_db;
- Create a table
hive> use hive_db;
hive> create table hive_table (
> id int,
> name varchar(16),
> age int,
> message varchar(128)
> );
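Standard commands to confirm the table was registered (neither launches a MapReduce job):
hive> show tables;
hive> describe hive_table;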
- Insert data
hive converts the sql statement into MapReduce jobs for execution:
hive> insert into hive_table values (1, 'xiaoxin', 16, 'I am wangxin'),(2, 'kunkun', 18, 'I am shikun');
Query ID = hduser_20190625042455_3600e443-663d-45f6-b842-37a3f2a8843b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1561428402986_0002, Tracking URL = http://ubuntu:8088/proxy/application_1561428402986_0002/
Kill Command = /home/hduser/hadoop-family/hadoop-3.2.0/bin/mapred job -kill job_1561428402986_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-06-25 04:25:04,263 Stage-1 map = 0%, reduce = 0%
2019-06-25 04:25:10,502 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.77 sec
2019-06-25 04:25:15,666 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.49 sec
MapReduce Total cumulative CPU time: 5 seconds 490 msec
Ended Job = job_1561428402986_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hive_db.db/hive_table/.hive-staging_hive_2019-06-25_04-24-55_404_2009487560348382612-1/-ext-10000
Loading data to table hive_db.hive_table
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.49 sec HDFS Read: 18587 HDFS Write: 380 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 490 msec
OK
Time taken: 21.773 seconds
- Query data
hive> select * from hive_table;
OK
1 xiaoxin 16 I am wangxin
2 kunkun 18 I am shikun
Time taken: 0.174 seconds, Fetched: 2 row(s)
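Note that a bare select * is served by a simple fetch task without MapReduce (hence the 0.174 seconds above), whereas an aggregation such as count(*) will again launch a job; for example:
hive> select count(*) from hive_table;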
- Inspect the Hadoop files
The newly created database directory is visible in HDFS:
$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hduser supergroup 0 2019-06-25 04:22 /user/hive/warehouse/hive_db.db
Start HiveServer2
HiveServer was deprecated after Hive 2.0 because it lacked features such as support for multiple concurrent client connections; HiveServer2 replaces it and comes with its own CLI tool, beeline.
- Start hiveserver2
$ hiveserver2
2019-06-25 19:20:16: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hive-3.1.1/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-family/hadoop-3.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 2b252413-1672-4aee-943e-b1fac451ffeb
Hive Session ID = 4937ae43-4382-4333-ac1b-458851a55d45
Hive Session ID = ea183319-7ba4-4f3b-9d3b-d6ac6477aeca
Hive Session ID = 00343592-4acb-4d21-a2ec-af4ef765b2af
- Connect to hiveserver2
# Embedded (local) connection
beeline -u jdbc:hive2://
# Remote connection
beeline -u jdbc:hive2://localhost:10000 -n hduser -p 123456
# Interactive connection test
beeline
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hduser
Enter password for jdbc:hive2://localhost:10000: 123456
- Query data
After connecting successfully, the operations are the same as in the Hive CLI above:
0: jdbc:hive2://localhost:10000> select * from hive_db.hive_table;
INFO : Compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:hive_table.id, type:int, comment:null), FieldSchema(name:hive_table.name, type:varchar(16), comment:null), FieldSchema(name:hive_table.age, type:int, comment:null), FieldSchema(name:hive_table.message, type:varchar(128), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.171 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497): select * from hive_db.hive_table
INFO : Completed executing command(queryId=hduser_20190625200727_652bdba6-9b35-4253-9955-9d20e9dca497); Time taken: 0.0 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------------+------------------+-----------------+---------------------+
| hive_table.id | hive_table.name | hive_table.age | hive_table.message |
+----------------+------------------+-----------------+---------------------+
| 1 | xiaoxin | 16 | I am wangxin |
| 2 | kunkun | 18 | I am shikun |
+----------------+------------------+-----------------+---------------------+
2 rows selected (0.214 seconds)