https://mp.weixin.qq.com/s/qtpPM5v27WGBcYyYeBTczQ
About the Hive Metastore
Repost: Why does Hive use a Metastore?
1. The concept of Metadata:
Metadata is the meta-information about the databases, tables, and other objects created in Hive. It is stored in a relational database such as Derby or MySQL.
2. What the Metastore does:
Clients connect to the metastore service, and the metastore in turn connects to the MySQL database to read and write metadata. With a metastore service in place, multiple clients can connect at the same time, and none of them needs to know the MySQL username or password; they only need to be able to reach the metastore service.
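For example, once a standalone metastore service is running, a client's hive-site.xml only needs to point at it; a minimal sketch, assuming a hypothetical host metastore-host and the default Thrift port 9083:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>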
3. The Metastore can be started in three ways:
Default (embedded) mode:
When no metastore is configured, every start of bin/hive or of HiveServer2 launches a metastore inside the same process.
This embedded service is wasteful: open several sessions and you end up with several metastore servers.
Local metastore:
Use this mode when the metastore and the database holding the metadata (MySQL) sit on the same machine.
The metastore service then only needs to be started once, avoiding wasted resources.
Remote metastore:
Use this mode when the metastore and the metadata database (MySQL) sit on different machines.
Again, the metastore service only needs to be started once.
Because metadata is modified and updated constantly, it is not a good fit for HDFS; it is normally kept in an RDBMS.
In embedded mode, the Hive service and the metastore service run in the same process, and the Derby service runs in that process too. Embedded mode stores metadata in the embedded Derby database, so no separate metastore service needs to be started.
This is the default and is simple to configure, but the usual claim that "only one client can connect at a time" is misleading: each hive process you start embeds its own metastore, so you can run multiple clients, it is just that every one of them carries a redundant metastore, which wastes resources. It is fine for experiments but not for production.
Local mode no longer uses embedded Derby as the metadata store; it uses another database such as MySQL instead. The Hive service and the metastore service still run in the same process, while MySQL runs as a separate process, either on the same machine or on a remote one. (A setup I have used before: configure Hive only on the gateway machine, together with the MySQL database, user, and password, while the cluster itself has no Hive configuration and runs no Hive services; that is this mode.)
This is a multi-user mode, with many user clients connecting to one database, and is typical when Hive is shared within a company. Every user must have access to MySQL, i.e., every client user has to know the MySQL username and password. A standalone metastore service in remote mode, started once as sketched below, avoids handing those credentials to every client.
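In remote mode, that single shared metastore service is typically started once with something along these lines (a sketch):
# Start a standalone metastore service (listens on Thrift port 9083 by default)
nohup hive --service metastore &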
This installation uses local mode, so there is no need to start a separate metastore service.
Install machine: hadoop2 (since renamed to cluster-host2); a single machine.
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' IDENTIFIED BY 'Hobe199**' WITH GRANT OPTION;
flush privileges;
mysql -uhive -pHobe19**
# Create the hive database
create database hive;
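A quick sanity check, still inside the mysql session, that the account and database are in place (a sketch):
mysql> SHOW GRANTS;      -- confirm the privileges granted above
mysql> show databases;   -- the new hive database should be listed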
Download Hive from the official archive: https://archive.apache.org/dist/hive/hive-1.2.1/
Reference: https://www.cnblogs.com/dxxblog/p/8193967.html
Unpack:
[hadoop@hadoop2 bigdata]$ pwd
/home/hadoop/bigdata
[hadoop@hadoop2 bigdata]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz
Copy the MySQL JDBC driver into Hive's lib directory:
cp mysql-connector-java-5.1.47-bin.jar /home/hadoop/hive-current/lib/
[hadoop@hadoop2 ~]$ cd /home/hadoop/hive-current/conf/
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j.properties.template hive-log4j.properties
cp hive-exec-log4j.properties.template hive-exec-log4j.properties
(Hive 1.2.1 ships log4j 1.x templates; the hive-log4j2 names belong to Hive 2.x. The startup log below indeed loads hive-log4j.properties.)
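If Hive has trouble locating Hadoop, hive-env.sh can point at it explicitly; a hypothetical sketch, reusing the hadoop-current path that appears later in this post:
# hive-env.sh (optional): tell Hive where the Hadoop install lives
export HADOOP_HOME=/home/hadoop/hadoop-current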
In fact, only one file strictly needs editing: hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Hobe199**</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>
The schema is initialized automatically the first time the hive shell starts, so the explicit schematool -dbType mysql -initSchema step can be skipped.
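If you prefer to run that initialization by hand anyway, the command mentioned above is:
# Optional: initialize the metastore schema in MySQL explicitly
schematool -dbType mysql -initSchema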
[hadoop@hadoop2 bin]$ hive
Logging initialized using configuration in file:/home/hadoop/bigdata/apache-hive-1.2.1-bin/conf/hive-log4j.properties
Mon Apr 01 13:51:48 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
If you see the SSL warning above, see https://blog.csdn.net/u012922838/article/details/73291524 for a fix.
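The usual fix, as the warning itself suggests, is to disable SSL in the JDBC URL in hive-site.xml (a sketch; only the value line changes):
<value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>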
[hadoop@hadoop2 conf]$ hive
Logging initialized using configuration in file:/home/hadoop/bigdata/apache-hive-1.2.1-bin/conf/hive-log4j.properties
hive>show databases;
OK
default
Time taken: 0.269 seconds, Fetched: 1 row(s)
hive>
Looking in MySQL, you will find that the hive database (the one created for user hive) has been populated with many initialization tables.
Create a database named hive_test1, switch to it with USE, and then create a partitioned table:
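Those first two steps in HiveQL:
create database if not exists hive_test1;
use hive_test1;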
-- Queries on a partitioned table should filter on the partition column in WHERE
-- Create the table
create table IF NOT EXISTS employee_partition(
id string,
name string,
age int,
tel string
)
PARTITIONED BY (
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Clear existing data
TRUNCATE TABLE employee_partition;
INSERT OVERWRITE TABLE employee_partition PARTITION (city_id="beijing")
VALUES
("1","wanghongbing",18,"130"),
("2","wangxiaojing",17,"150"),
("3","songweiguang",16,"135");
SELECT * from employee_partition where city_id="beijing";
-- An aggregation
SELECT count(*) from employee_partition where city_id="beijing" AND age>=17;
show partitions employee_partition;
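The INSERT above names its partition statically. The same table could also be filled via dynamic partitioning; a sketch assuming a hypothetical staging table employee_staging whose last selected column supplies city_id:
-- Dynamic-partition insert (employee_staging is hypothetical)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_partition PARTITION (city_id)
SELECT id, name, age, tel, city_id FROM employee_staging;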
Insert the data:
hive> INSERT OVERWRITE TABLE employee_partition PARTITION (city_id="beijing")
> VALUES
> ("1","wanghongbing",18,"130"),
> ("2","wangxiaojing",17,"150"),
> ("3","songweiguang",16,"135");
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190821165406_ad3eb5da-7622-4974-ada0-405eef181159
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1565062438979_0005, Tracking URL = http://cluster-host1:8088/proxy/application_1565062438979_0005/
Kill Command = /home/hadoop/hadoop-current/bin/hadoop job -kill job_1565062438979_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-21 16:54:17,409 Stage-1 map = 0%, reduce = 0%
2019-08-21 16:54:22,678 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.5 sec
MapReduce Total cumulative CPU time: 3 seconds 500 msec
Ended Job = job_1565062438979_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition/city_id=beijing/.hive-staging_hive_2019-08-21_16-54-06_555_1374265165073341927-1/-ext-10000
Loading data to table hive_test1.employee_partition partition (city_id=beijing)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.5 sec HDFS Read: 4774 HDFS Write: 167 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 500 msec
OK
Time taken: 17.935 seconds
The query results:
# select * does not launch a MapReduce job
hive> SELECT * from employee_partition where city_id="beijing";
OK
1 wanghongbing 18 130 beijing
2 wangxx 17 150 beijing
3 songxx 16 135 beijing
Time taken: 0.511 seconds, Fetched: 3 row(s)
# this one does launch a MapReduce job
hive> SELECT count(*) from employee_partition where city_id="beijing" AND age>=17;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190821165443_468f9aa2-5a02-467d-8492-85d0858a2073
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1565062438979_0006, Tracking URL = http://cluster-host1:8088/proxy/application_1565062438979_0006/
Kill Command = /home/hadoop/hadoop-current/bin/hadoop job -kill job_1565062438979_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-08-21 16:54:51,033 Stage-1 map = 0%, reduce = 0%
2019-08-21 16:55:04,578 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.34 sec
2019-08-21 16:55:18,000 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.96 sec
MapReduce Total cumulative CPU time: 6 seconds 960 msec
Ended Job = job_1565062438979_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.96 sec HDFS Read: 9317 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 960 msec
OK
2
Time taken: 35.767 seconds, Fetched: 1 row(s)
Prepare a data file for a Hive table (fields separated by tabs, matching the table's '\t' delimiter):
# vim first_table.txt
1 李晨 50
2 成龙 60
3 王源 40
4 胡歌 50
Create a table:
hive> create table first_table(id int, name string, age int) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.4 seconds
Load the data into the table:
hive> load data local inpath '/home/hadoop/test/first_table.txt' into table first_table;
Loading data to table default.first_table
Table default.first_table stats: [numFiles=1, totalSize=48]
OK
Time taken: 0.914 seconds
hive> select id,name,age from first_table;
OK
1 李晨 50
2 成龙 60
3 王源 40
4 胡歌 50
Time taken: 0.332 seconds, Fetched: 4 row(s)
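For comparison, first_table is a managed table, so dropping it would delete its HDFS data as well. If the data should stay in place, an EXTERNAL table is the alternative; a sketch with a hypothetical location:
-- External table: DROP removes only metadata, not the files (path is hypothetical)
create external table if not exists first_table_ext(id int, name string, age int)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hadoop/external/first_table';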
At this point, the Hive environment is up and running.
Reference: https://my.oschina.net/u/2500254/blog/1439297
Now inspect the metadata stored in MySQL:
use hive;
mysql> select * from DBS;
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
| DB_ID | DESC | DB_LOCATION_URI | NAME | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
| 1 | Default Hive database | hdfs://cluster-host1:9000/user/hive/warehouse | default | public | ROLE |
| 2 | NULL | hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db | hive_test1 | hadoop | USER |
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
2 rows in set (0.00 sec)
There are two databases: the default one and the newly created hive_test1.
Now look at the table information:
mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
| 1 | 1566377621 | 2 | 0 | hadoop | 0 | 1 | employee_partition | MANAGED_TABLE | NULL | NULL | |
| 2 | 1566380354 | 2 | 0 | hadoop | 0 | 3 | first_table | MANAGED_TABLE | NULL | NULL | |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
2 rows in set (0.00 sec)
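The SD_ID column links each table to its storage descriptor; joining TBLS with SDS shows where each table's data lives (a sketch against the standard metastore schema):
mysql> select t.TBL_NAME, s.LOCATION from TBLS t join SDS s on t.SD_ID = s.SD_ID;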
Find that path on HDFS:
[hadoop@cluster-host2 ~]$ hadoop fs -ls hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2019-08-21 16:54 hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition
drwxr-xr-x - hadoop supergroup 0 2019-08-21 17:42 hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/first_table
Under this path sit the table and partition data:
[hadoop@cluster-host2 ~]$ hadoop fs -text hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition/city_id=beijing/000000_0
1,wanghongbing,18,130
2,wangxx,17,150
3,songxx,16,135
As you can see, Hive's actual data is stored under a path on HDFS, while the metadata relationships live in MySQL. From the cluster's point of view, what matters is simply that the data on HDFS is not lost.