A data warehouse solution built on Hadoop
Hive became an Apache top-level project
Strengths and characteristics of Hive
Hive milestones and the mainstream community versions
Hive history and versions
Hive metadata management
Records the model definitions in the warehouse and the mappings between its layers
Stored in a relational database (the metastore)
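Because the metastore lives in an ordinary relational database, it can be inspected directly. A minimal sketch, assuming a MySQL-backed metastore whose database is named `hive` (the `DBS` and `TBLS` tables belong to the standard Hive metastore schema; the database name is an assumption):

```sql
-- List every Hive table together with its database and type,
-- straight from the metastore's own tables.
USE hive;
SELECT d.NAME AS db_name,
       t.TBL_NAME,
       t.TBL_TYPE          -- MANAGED_TABLE or EXTERNAL_TABLE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;
```

In practice the metastore should only be read this way for debugging; normal clients go through the metastore service.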
Hive operation: command-window mode
There are two client tools: Beeline and the Hive command line (CLI)
There are two modes: command-line mode and interactive mode
Command-line mode

| Operation | HiveServer2 Beeline | HiveServer1 CLI |
|---|---|---|
| Server connection | beeline -u | hive -h |
| Help | beeline -h or beeline --help | hive -H |
| Run query | beeline -e | hive -e |
| Define variable | beeline --hivevar key=value | hive --hivevar key=value |
Interactive mode

| Operation | HiveServer2 Beeline | HiveServer1 CLI |
|---|---|---|
| Enter mode | beeline | hive |
| Connect | !connect | N/A |
| List tables | !table | show tables; |
| List columns | !column | desc table_name; |
| Save result | !record | N/A |
| Run shell command | !sh ls | !ls; |
| Run DFS command | dfs -ls | dfs -ls; |
| Run SQL file | !run | source |
| Check version | !dbinfo | !hive --version; |
| Quit | !quit | quit; |
Hive operation: client interactive mode
1. Check that the Hive service has started properly
2.1 Use the Hive interactive shell (just type hive)
2.2 Use beeline
Hive data types: primitive types
Similar to the SQL data types
Hive data types: collection types
ARRAY: stores elements of the same type
MAP: key/value pairs whose keys and values each share a type
STRUCT: encapsulates a named set of fields
| Type | Example literal | Access |
|---|---|---|
| ARRAY | ['Apple','Orange','Mongo'] | a[0] = 'Apple' |
| MAP | {'A':'Apple','O':'Orange'} | b['A'] = 'Apple' |
| STRUCT | {'Apple',2} | c.weight = 2 |
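The three collection types come together in a table definition. A sketch, with illustrative table and column names:

```sql
-- Declare all three collection types and the delimiters that
-- separate their elements in the underlying text file.
CREATE TABLE collection_demo (
  name         STRING,
  work_place   ARRAY<STRING>,
  sex_age      STRUCT<sex:STRING, age:INT>,
  skills_score MAP<STRING, INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- Element access: index into the array, key into the map,
-- dot into the struct.
SELECT name,
       work_place[0]          AS first_place,
       sex_age.age            AS age,
       skills_score['Python'] AS python_score
FROM collection_demo;
```

Note that arrays are zero-indexed and an out-of-range index or missing map key yields NULL rather than an error.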
Hive data structures

| Structure | Description | Logical role | Physical storage (HDFS) |
|---|---|---|---|
| Database | database | collection of tables | directory |
| Table | table | collection of rows | directory |
| Partition | partition | splits the data | directory |
| Buckets | bucket | distributes the data | file |
| Row | row | a row record | a line in a file |
| Columns | column | a column value | a fixed position within each line |
| Views | view | logical concept; may span multiple tables | stores no data |
| Index | index | records statistics | directory |
Database
A collection of tables, represented in HDFS as a directory
create database if not exists myhivebook;
use myhivebook;
show databases;
describe database default; -- shows extended information about the database
alter database myhivebook set owner user dayongd;
drop database if exists myhivebook cascade;
How do you tell which database you are currently in from the hive shell?
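Two common answers, sketched below; `current_database()` is a Hive built-in (available since Hive 0.13):

```sql
-- Ask directly:
SELECT current_database();

-- Or make the shell show it in the prompt, which produces prompts
-- like hive (test)> as seen later in these notes:
set hive.cli.print.current.db=true;
```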
Table
Tables come in two kinds: internal and external
Internal (managed) tables
External tables
-- create an internal (managed) table
create table if not exists student(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/home/hadoop/hive/warehouse/student';
-- check the table's type
desc formatted student;
CREATE EXTERNAL TABLE IF NOT EXISTS employee_external ( -- IF NOT EXISTS is optional; if the table already exists the statement is ignored
name string, -- list every column with its data type
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
COMMENT 'This is an external table' -- optional table comment
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|' -- how columns (fields) are delimited
COLLECTION ITEMS TERMINATED BY ',' -- how collection and map items are delimited
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE -- file storage format
LOCATION '/home/hadoop/hive/warehouse/employee'; -- data location (an HDFS path)
Creating tables: delimiters
Hive's default field delimiter is the non-printing character '\001' (Ctrl-A)
A different delimiter can be specified when the table is created
-- syntax for specifying the column delimiter
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
CTAS: CREATE TABLE ... AS SELECT
CREATE TABLE ctas_people as SELECT * FROM people; -- produces the output below
hive (test)> select * from people;
OK
1 tom 23 2019-03-16
2 jack 12 2019-03-13
3 robin 14 2018-08-13
4 justin 34 2018-10-12
5 jarry 24 2017-11-11
6 jasper 24 2017-12-12
Time taken: 0.038 seconds, Fetched: 6 row(s)
hive (test)> CREATE TABLE ctas_people as SELECT * FROM people;
Query ID = root_20200708193232_081c2128-9d18-42e2-9ee5-404f29e5cf4c
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1594181035887_0005, Tracking URL = http://hadoop01:8088/proxy/application_1594181035887_0005/
Kill Command = /opt/install/hadoop/bin/hadoop job -kill job_1594181035887_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-08 19:32:40,043 Stage-1 map = 0%, reduce = 0%
2020-07-08 19:32:47,383 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_1594181035887_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/hive/warehouse/test.db/.hive-staging_hive_2020-07-08_19-32-28_180_5629274260635048548-1/-ext-10001
Moving data to: hdfs://hadoop01:9000/hive/warehouse/test.db/ctas_people
Table test.ctas_people stats: [numFiles=1, numRows=6, totalSize=131, rawDataSize=125]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.06 sec HDFS Read: 3695 HDFS Write: 204 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
Time taken: 20.451 seconds
hive (test)> select * from ctas_people;
OK
1 tom 23 2019-03-16
2 jack 12 2019-03-13
3 robin 14 2018-08-13
4 justin 34 2018-10-12
5 jarry 24 2017-11-11
6 jasper 24 2017-12-12
Time taken: 0.029 seconds, Fetched: 6 row(s)
CTE (CTAS with Common Table Expression)
CREATE TABLE cte_people AS
WITH
r1 AS (SELECT name FROM r2 WHERE name = 'jarry'),
r2 AS (SELECT name FROM people WHERE age= '24'),
r3 AS (SELECT name FROM people WHERE name='tom' )
SELECT * FROM r1 UNION ALL SELECT * FROM r3;
-- produces the output below:
hive (test)> CREATE TABLE cte_people AS
> WITH
> r1 AS (SELECT name FROM r2 WHERE name = 'jarry'),
> r2 AS (SELECT name FROM people WHERE age= '24'),
> r3 AS (SELECT name FROM people WHERE name='tom' )
> SELECT * FROM r1 UNION ALL SELECT * FROM r3;
Query ID = root_20200708193737_b5225b3b-590f-4605-9d42-7057d82f6a1e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1594181035887_0006, Tracking URL = http://hadoop01:8088/proxy/application_1594181035887_0006/
Kill Command = /opt/install/hadoop/bin/hadoop job -kill job_1594181035887_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-08 19:37:51,123 Stage-1 map = 0%, reduce = 0%
2020-07-08 19:37:59,518 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.61 sec
MapReduce Total cumulative CPU time: 1 seconds 610 msec
Ended Job = job_1594181035887_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/hive/warehouse/test.db/.hive-staging_hive_2020-07-08_19-37-39_588_4606334093878180145-1/-ext-10001
Moving data to: hdfs://hadoop01:9000/hive/warehouse/test.db/cte_people
Table test.cte_people stats: [numFiles=1, numRows=2, totalSize=10, rawDataSize=8]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.61 sec HDFS Read: 5015 HDFS Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 610 msec
OK
Time taken: 21.136 seconds
hive (test)> select * from cte_people;
OK
tom
jarry
Time taken: 0.029 seconds, Fetched: 2 row(s)
LIKE: copies only the table structure (no data)
CREATE TABLE people_like LIKE people;
hive (test)> desc people;
OK
id int
name string
age int
start_date date
Time taken: 0.027 seconds, Fetched: 4 row(s)
hive (test)> desc people_like;
OK
id int
name string
age int
start_date date
Time taken: 0.054 seconds, Fetched: 4 row(s)
hive (test)> select * from people;
OK
1 tom 23 2019-03-16
2 jack 12 2019-03-13
3 robin 14 2018-08-13
4 justin 34 2018-10-12
5 jarry 24 2017-11-11
6 jasper 24 2017-12-12
Time taken: 0.032 seconds, Fetched: 6 row(s)
hive (test)> select * from people_like;
OK
Time taken: 0.025 seconds
hive (test)>
TEMPORARY
Temporary tables, marked with the TEMPORARY keyword, give an application a way to automatically manage the intermediate data produced during complex queries
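A minimal sketch, reusing the `people` table from the earlier examples; the temporary table is visible only to the current session and is dropped automatically when the session ends:

```sql
-- Stage an intermediate result for later steps of the same session.
CREATE TEMPORARY TABLE tmp_adults AS
SELECT * FROM people WHERE age > 20;

SELECT count(*) FROM tmp_adults;

-- If a temporary table shares its name with a permanent table, the
-- temporary one shadows it for the rest of the session.
```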
Dropping tables
DROP TABLE IF EXISTS employee [PURGE]; -- PURGE (optional) deletes the data immediately; without it the data is moved to the .Trash directory, Hive's equivalent of the Windows Recycle Bin
TRUNCATE TABLE employee; -- removes all rows but keeps the table
Altering tables (ALTER operates on metadata)
ALTER TABLE employee RENAME TO new_employee; -- rename the table
ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments');
ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$');
ALTER TABLE c_employee SET FILEFORMAT RCFILE; -- change the table's file format
-- column operations
ALTER TABLE employee_internal CHANGE old_name new_name STRING; -- rename a column
ALTER TABLE c_employee ADD COLUMNS (work string); -- add a column; note there is no comma before the parenthesis
ALTER TABLE c_employee REPLACE COLUMNS (name string); -- replace ALL existing columns with the listed ones; again no comma
LOAD moves data into Hive
LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt'
OVERWRITE INTO TABLE employee;
-- with the LOCAL keyword the source file is on the local Linux filesystem and is copied into the table
LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt'
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2014, month=12);
-- without LOCAL the file must already be in HDFS, and loading moves it rather than copying
LOAD DATA INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2017, month=12);
Partitions are used mainly to improve query performance
create table userinfos(
userid string,
age string,
birthday string)
partitioned by (sex string) -- partitions are declared with PARTITIONED BY
row format delimited fields terminated by ',' stored as textfile;
-- note: a partition column must not repeat a table column; in effect it is just another column
-- add a partition
alter table userinfos add partition(sex='male');
-- drop a partition
alter table userinfos drop partition (sex='male');
-- list the partitions of a partitioned table
show partitions userinfos;
-- load data into one partition of the table
LOAD DATA LOCAL INPATH '/opt/install/hive/tmp/test.csv'
OVERWRITE INTO TABLE userinfos partition(sex='male');
-- this static load is unreasonable here: it simply ignores the sex values in the file, forcing the partition value onto every row at query time even though the underlying data is unchanged
-- query the table and, as a result, every row reports male
Dynamic partitioning
create table myusers( userid string,username string,birthday string) partitioned by (sex string)
row format delimited fields terminated by ',' stored as textfile;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- no partition value is specified; rows that share a value of the partition column end up in the same partition
insert into myusers partition(sex) select * from ods_users;
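A fuller sketch of the dynamic-partition flow, assuming the source table `ods_users` has the columns (userid, username, birthday, sex); Hive assigns the partition value by position, so the partition column must come last in the SELECT list:

```sql
-- Enable dynamic partitioning for the session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last SELECT column (sex) feeds the partition.
INSERT INTO TABLE myusers PARTITION (sex)
SELECT userid, username, birthday, sex FROM ods_users;

-- One partition directory is created per distinct value of sex.
SHOW PARTITIONS myusers;
```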