A Hadoop-based data warehouse solution
Hive history and versions
The metastore records the warehouse's model definitions and the mappings between layers
It is stored in a relational database
The default is Derby, a lightweight embedded SQL database
Derby is well suited to testing and demos
Data is stored in the metastore_db directory
In production the metastore is usually kept in MySQL
Configure this in hive-site.xml
HCatalog
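A hedged sketch of the relevant hive-site.xml properties for a MySQL-backed metastore (the host, database name, user, and password below are placeholders, not values from these notes):

```xml
<configuration>
  <!-- JDBC connection to the MySQL database holding the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

The MySQL JDBC driver jar must also be placed on Hive's classpath (typically the lib directory).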
Operation | HiveServer2 Beeline | HiveServer1 CLI
---|---|---
Server Connection | beeline -u &lt;jdbc_url&gt; | hive -h &lt;hostname&gt; -p &lt;port&gt;
Help | beeline -h or beeline --help | hive -H
Run Query | beeline -e &lt;query&gt; | hive -e &lt;query&gt;
Define Variable | beeline --hivevar key=value | hive --hivevar key=value
Interactive mode

Operation | HiveServer2 Beeline | HiveServer1 CLI
---|---|---
Enter Mode | beeline | hive
Connect | !connect &lt;jdbc_url&gt; | N/A
List Tables | !table | show tables;
List Columns | !column &lt;table_name&gt; | desc table_name;
Save Result | !record &lt;file_name&gt; | N/A
Run Shell CMD | !sh ls | !ls;
Run DFS CMD | dfs -ls | dfs -ls
Run SQL File | !run &lt;file_name&gt; | source &lt;file_name&gt;;
Check Version | !dbinfo | !hive --version;
Quit Mode | !quit | quit;
Check that the Hive services are running
Use the Hive interactive shell (just type hive)
Use Beeline
nohup hive --service metastore & (optional)
nohup hive --service hiveserver2 &
!connect jdbc:hive2://hadoop101:10000
beeline -u "jdbc:hive2://localhost:10000/default"
Hive Web Interface (part of Apache Hive)
Hue (Cloudera)
Ambari Hive View (Hortonworks)
JDBC/ODBC (ETL tools, business intelligence tools, IDEs)
Primitive types, similar to SQL data types

Type | Example
---|---
TINYINT | 10Y
SMALLINT | 10S
INT | 10
BIGINT | 100L
FLOAT | 1.342
DOUBLE | 1.234
DECIMAL | 3.14
BINARY | 1010
BOOLEAN | TRUE
STRING | 'Book' or "Book"
CHAR | 'YES' or "YES"
VARCHAR | 'Book' or "Book"
DATE | '2013-01-31'
TIMESTAMP | '2013-01-31 00:13:00.345'
ARRAY: all elements share the same type
MAP: key-value pairs with uniform key and value types
STRUCT: encapsulates a named set of fields

Type | Example data | Declaration | Access example
---|---|---|---
ARRAY | ['Apple','Orange','Mongo'] | ARRAY&lt;string&gt; | a[0] = 'Apple'
MAP | {'A':'Apple','O':'Orange'} | MAP&lt;string,string&gt; | b['A'] = 'Apple'
STRUCT | {'Apple', 2} | STRUCT&lt;fruit:string,weight:int&gt; | c.weight = 2
Data structure | Logical role | Physical storage (HDFS)
---|---|---
Database | collection of tables | directory
Table | collection of rows | directory
Partition | used to split the data | directory
Buckets | used to distribute the data | file
Row | a row record | a line in a file
Columns | column values | fixed positions within each row
Views | logical concept, may span multiple tables | no data stored
Index | records statistics about the data | directory
A collection of tables; represented in HDFS as a directory
If no database is specified, the default database is used
How to find out and display the current database
select current_database();
# before the setting
hive> use hello;
OK
Time taken: 0.102 seconds
hive>
# after the setting
hive> set hive.cli.print.current.db=true;
hive (hello)>
# restore the original behavior
hive (hello)> set hive.cli.print.current.db=false;
hive>
hive> show databases;
hive> create database testhive2;
hive> use testhive;
Tables are divided into internal and external tables
Internal tables (managed tables)
External tables (External Tables)
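A hedged sketch of the practical difference (the table names below are made up): dropping a managed table removes both the metadata and the data, while dropping an external table removes only the metadata and leaves the files at LOCATION untouched.

```sql
-- managed: data lives under the warehouse directory and is deleted on DROP
CREATE TABLE managed_demo (id int);
DROP TABLE managed_demo;             -- metadata and data are both removed

-- external: Hive only tracks metadata for files at LOCATION
CREATE EXTERNAL TABLE external_demo (id int)
LOCATION '/user/root/external_demo';
DROP TABLE external_demo;            -- metadata removed, HDFS files remain
```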
-- IF NOT EXISTS is optional; if the table already exists, the statement is ignored
CREATE EXTERNAL TABLE IF NOT EXISTS employee_external (
-- list every column and its data type
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
-- COMMENT is optional
COMMENT 'This is an external table'
ROW FORMAT DELIMITED
-- how columns are separated (character)
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
-- how collections and maps are separated
MAP KEYS TERMINATED BY ':'
-- file storage format
STORED AS TEXTFILE
-- file storage path (HDFS)
LOCATION '/user/root/employee';
Michael|Montreal,Toronto|Male,30|DB:80|Product:DeveloperLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89|Sales:Lead
Michael|100|Montreal,Toronto|Male,30|DB:80|Product:DeveloperLead
Will|101|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Steven|102|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|103|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Mike|104|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|105|New York|Female,27|Python:80|Test:Lead,COE:Architect
Luly|106|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Lily|107|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shell|108|New York|Female,27|Python:80|Test:Lead,COE:Architect
Mich|109|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Dayong|110|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Sara|111|New York|Female,27|Python:80|Test:Lead,COE:Architect
Roman|112|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Christine|113|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Eman|114|New York|Female,27|Python:80|Test:Lead,COE:Architect
Alex|115|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Alan|116|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Andy|117|New York|Female,27|Python:80|Test:Lead,COE:Architect
Ryan|118|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Rome|119|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Lym|120|New York|Female,27|Python:80|Test:Lead,COE:Architect
Linm|121|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Dach|122|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Ilon|123|New York|Female,27|Python:80|Test:Lead,COE:Architect
Elaine|124|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Matias McGrirl|1|945-639-8596|2011-11-24
Gabriela Feldheim|2|706-232-4166|2017-12-16
Billy O'Driscoll|3|660-841-7326|2017-02-17
Kevina Rawet|4|955-643-0317|2012-01-05
Patty Entreis|5|571-792-2285|2013-06-11
Claudetta Sanderson|6|350-766-4559|2016-11-04
Bentley Oddie|7|446-519-0975|2016-05-02
Theressa Dowker|8|864-330-9976|2012-09-26
Jenica Belcham|9|347-248-4379|2011-05-02
Reube Preskett|10|918-740-2357|2015-03-26
Mary Skeldon|11|361-159-8710|2016-03-09
Ethelred Divisek|12|995-145-7392|2016-10-18
Field McGraith|13|149-133-9607|2015-10-06
Andeee Wiskar|14|315-207-5431|2012-05-10
Lloyd Nayshe|15|366-495-5398|2014-06-28
Mike Luipold|16|692-803-9373|2011-05-14
Tallie Swaine|17|570-709-6561|2011-08-06
Worth Ledbetter|18|905-586-2348|2012-09-25
Reine Leyborne|19|322-644-5798|2015-01-05
Norby Bellson|20|736-881-5785|2012-12-31
Nellie Jewar|21|551-505-3957|2017-06-18
Hoebart Deeth|22|780-240-0213|2011-09-19
Shel Haddrill|23|623-169-5495|2014-02-04
Christalle Cervantes|24|275-309-7794|2017-01-01
Dorita Miche|25|476-242-9769|2014-10-26
Conny Bowmen|26|398-181-4961|2011-10-21
Sabra O' Donohoe|27|327-773-8515|2015-01-28
Rahal Ashbe|28|561-777-0202|2012-12-13
Tye Greenstreet|29|499-510-1700|2012-01-17
Gordy Cristoforetti|30|955-110-7073|2015-10-09
Marsha Sharkey|31|221-696-5744|2017-01-29
Corbie Cruden|32|979-583-4252|2011-08-20
Anya Easen|33|428-602-5117|2011-08-16
Clea Brereton|34|909-198-4992|2018-01-08
Kimberley Pinnijar|35|608-177-4402|2015-06-03
Wilma Mackriell|36|637-304-3580|2012-06-23
Mitzi Gorman|37|134-675-2460|2017-07-16
Ashlin Rennick|38|816-635-9974|2014-04-20
Whitaker Shedd|39|614-792-6663|2016-05-19
Mandi Stronack|40|753-688-2327|2016-04-24
Niki Driffield|41|225-867-0712|2014-02-15
Regine Agirre|42|784-395-9982|2017-05-01
Evelina Craddy|43|274-850-6569|2017-06-14
Yasmin Ubsdall|44|679-739-9660|2012-03-10
Vivianna Shoreman|45|873-271-7100|2014-09-06
Chance Murra|46|248-160-3759|2017-12-31
Ferdy Adriano|47|735-447-2642|2013-11-11
Nikolos Tichner|48|869-871-9057|2014-02-15
Doro Rushman|49|861-337-3364|2011-08-27
Lela Hinzer|50|147-386-3735|2011-06-03
Hoyt Winspar|51|120-561-6266|2016-05-05
Vicki Rimington|52|257-204-8227|2014-11-21
Louis Dalwood|53|735-885-8087|2014-02-17
Joseph Zohrer|54|178-152-4726|2015-11-04
Kennett Senussi|55|182-904-2652|2017-05-20
Letta Musk|56|534-353-2038|2013-11-04
Giulietta Glentz|57|761-390-2806|2011-09-08
Wright Frostdyke|58|932-838-9710|2015-07-15
Bat Hannay|59|404-841-2981|2015-04-04
Devlen Hutsby|60|830-520-6401|2015-07-12
Lynnea Bembrigg|61|408-264-4116|2013-02-24
Udall Nelle|62|485-420-4327|2011-07-01
Kyle Matheson|63|153-149-2140|2011-07-03
Jarid Sprowell|64|848-408-9569|2017-11-08
Jeanie Griffitt|65|442-599-1231|2018-03-09
Joana Sleith|66|264-979-0388|2017-02-13
Doris Ilyushkin|67|877-472-3918|2015-08-03
Michaelina Rennels|68|949-522-9333|2012-07-05
Onofredo Butchard|69|392-833-3926|2017-11-05
Beatrice Amis|70|963-487-6585|2015-01-24
Joyan O'Hanlon|71|952-969-7279|2017-09-22
Mikaela Cardoo|72|960-275-3958|2015-01-24
Lori Dale|73|530-116-2773|2017-07-05
Stevena Roloff|74|241-314-8328|2015-12-21
Fayth Carayol|75|907-502-3752|2012-12-04
Carita Bruun|76|117-771-8056|2017-05-31
Darnell Hardwell|77|718-247-8505|2012-05-09
Jonathon Grealy|78|136-515-3637|2014-03-29
Laurice Rosini|79|352-594-3238|2017-02-15
Emelia Auten|80|311-899-1782|2014-09-10
Trace Fontelles|81|414-607-8366|2016-03-09
Hope Sket|82|461-595-7667|2017-09-30
Cilka Heijne|83|772-704-7366|2011-08-27
Maurise Gallico|84|546-158-7983|2011-12-21
Casey Greenfield|85|204-108-7707|2012-03-18
Wes Jaffrey|86|848-465-5131|2016-02-14
Jilly Eisikowitz|87|431-355-2777|2017-02-18
Auguste Kobel|88|562-494-1360|2012-02-29
Zackariah Pietrusiak|89|810-738-9846|2012-02-25
Pearline Marcq|90|200-835-9497|2016-02-10
Sayre Osbaldeston|91|340-132-2361|2011-11-30
Floyd Cano|92|133-768-6535|2016-02-27
Ciro Arendt|93|792-967-0588|2015-11-07
Auguste Kares|94|230-184-3438|2014-03-13
Skipp Spurden|95|747-133-1382|2012-03-15
Alyssa Prydden|96|963-170-0545|2014-11-07
Orlando Pallatina|97|354-125-1208|2012-07-12
Zoe Adacot|98|704-987-0702|2015-09-29
Blaine Fawdry|99|477-109-9014|2012-07-14
Cleon Haresnape|100|625-338-3965|2014-12-04
create table employee(
name string,
address array<string>,
personalInfo struct<sex:string,age:int>,
technol map<string,int>,
job map<string,string>)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
hive> select * from employee;
create table employee_id(
name string,
id int,
address array<string>,
`info` struct<sex:string,age:int>,
technol map<string,int>,
jobs map<string,string>)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
hive> select * from employee_id;
hive> select name,id,address[0],`info`.sex,`info`.age,jobs["Sales"] from employee_id;
hive> select name,id,address[0],`info`.sex,`info`.age,jobs["Sales"] from employee_id where jobs["Sales"] is not null;
create external table emp_id(
name string,
id int,
address array<string>,
personalInfo struct<sex:string,age:int>,
technol map<string,int>,
job map<string,string >)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/usr/test/bucket';
Hive's default delimiters (the field delimiter is ^A, i.e. \001)
Note:
Delimiters specified at table creation apply only to non-nested types
For nested types, the delimiter is determined by the nesting level
Array nested inside an array: the outer array uses ^B, the inner array uses ^C
Array nested inside a map: the outer map uses ^C, the inner array uses ^D
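As a small sketch of the defaults (the table name and data are made up): when no ROW FORMAT clause is given, Hive falls back to the control-character delimiters above, so a raw text line mixes ^A between fields and ^B between collection items.

```sql
-- hypothetical table relying entirely on the default delimiters
CREATE TABLE default_delim_demo (
  name string,
  hobbies ARRAY<string>
);
-- a matching raw data line (control characters shown as octal escapes):
--   Alice\001reading\002cycling
-- \001 (^A) separates name from hobbies, \002 (^B) separates array elements
```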
SerDe: Serializer and Deserializer
Hive supports several storage SerDes
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
CREATE TABLE test_serde_hb(
id string,
name string,
sex string,
age string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping"=
":key,info:name,info:sex,info:age"
) TBLPROPERTIES("hbase.table.name" = "test_serde");
CTAS: create a table with CREATE TABLE ... AS SELECT
CREATE TABLE ctas_employee as SELECT * FROM employee;
CTE (CTAS with a Common Table Expression)
CREATE TABLE cte_employee AS
WITH
r1 AS (SELECT name FROM r2 WHERE name = 'Michael'),
r2 AS (SELECT name FROM employee WHERE sex_age.sex= 'Male'),
r3 AS (SELECT name FROM employee WHERE sex_age.sex= 'Female')
SELECT * FROM r1 UNION ALL SELECT * FROM r3;
Like
CREATE TABLE employee_like LIKE employee;
create table tmp_employee as select * from employee_partition;
create table job_detail as
with
tmp as (select job from employee_partition where info.sex='Male'),
t2 as (select job from employee_partition where info.sex='Female')
select tmp.job male_job,t2.job female_job from tmp,t2;
select * from job_detail;
A temporary table is a way for an application to automatically manage intermediate data generated during complex queries
CREATE TEMPORARY TABLE tmp_table_name1 (c1 string);
CREATE TEMPORARY TABLE tmp_table_name2 AS..
CREATE TEMPORARY TABLE tmp_table_name3 LIKE..
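Completing the patterns above with a hedged sketch (tmp_male_jobs is a made-up name; the query assumes the employee_partition table defined later in these notes):

```sql
-- intermediate result held in a temporary table
CREATE TEMPORARY TABLE tmp_male_jobs AS
SELECT job FROM employee_partition WHERE info.sex = 'Male';

SELECT * FROM tmp_male_jobs;
-- temporary tables are dropped automatically when the session ends
```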
Dropping a table
DROP TABLE IF EXISTS employee [WITH PURGE];
TRUNCATE TABLE employee; -- empty the table's data
-- WITH PURGE (optional) deletes the data immediately; otherwise it is moved to the .Trash directory
Altering a table (ALTER operates on metadata)
-- renaming a table is commonly used for data backup
ALTER TABLE employee RENAME TO new_employee;
ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments');
ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$');
ALTER TABLE c_employee SET FILEFORMAT RCFILE; -- change the table's file format
-- 修改表的列操作
ALTER TABLE employee_internal CHANGE old_name new_name STRING; -- rename a column
ALTER TABLE c_employee ADD COLUMNS (work string); -- add a column
ALTER TABLE c_employee REPLACE COLUMNS (name string); -- replace the columns
hive> alter table employee change personalInfo `info` struct<gender:string,age:int>;
hive> desc employee;
hive> select * from employee;
Partitions are mainly used to improve performance
Partitions are either static or dynamic
CREATE TABLE employee_partitioned(
-- partitions are defined at table creation time
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>> )
-- partitions are declared with PARTITIONED BY
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Static partition operations
-- add static partitions with ALTER TABLE; ADD adds partitions, DROP removes them
ALTER TABLE employee_partitioned ADD
PARTITION (year=2019,month=3) PARTITION (year=2019,month=4);
ALTER TABLE employee_partitioned DROP PARTITION (year=2019, month=4);
create table employee_partition(
name string,
address array<string>,
info struct<sex:string,age:int>,
technol map<string,int>,
job map<string,string>)
partitioned by(country string,add string)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
load data local inpath 'root/employee.txt'
into table employee_partition
partition (country="china",add="Liaoning");
select * from employee_partition;
load data inpath '/opt/hive/warehouse/testhive2.db/employee/employee.txt'
into table employee_partition
partition (country="china",add="nanjing");
select * from employee_partition;
create table a_p(
aname string)
partitioned by(test string);
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into a_p partition(test="abc") values("hello");
insert into a_p partition(test) select "def" aname,"sam" xxx;
select * from a_p;
insert into a_p partition(test) select name aname,add from employee_partition;
show partitions a_p;
alter table a_p add partition(test="123") partition(test="234");
alter table a_p drop partition(test="123"),partition(test="234");
select * from employee_partition where country="china";
select * from employee_partition where add="nanjing";
Dynamic partitioning requires setting these properties
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
How to populate dynamic partitions
-- add dynamic partitions via INSERT
insert into table employee_partitioned partition(year, month)
select name,array('Toronto') as work_place,
named_struct("sex","male","age",30) as sex_age,
map("python",90) as skills_score,
map("r&d", array('developer')) as depart_title,
year(start_date) as year,month(start_date) as month
from employee_hr eh ;
-- define partitions at table creation
create table p_test(
pid int,
pname string)
partitioned by (person string)
row format delimited
fields terminated by ','
lines terminated by '\n';
-- insert rows into a partition
insert into p_test partition (person="sam") values(1,'a'),(2,'b'),(3,'c');
-- query the data
select * from p_test;
insert into p_test partition (person="bob") values(4,'d'),(5,'e'),(6,'f');
select * from p_test;
Buckets correspond to files in HDFS
Bucketing is always dynamic
SET hive.enforce.bucketing = true;
Defining buckets
-- the bucketing column must be an existing column of the table
-- the number of buckets is ideally a power of two
CLUSTERED BY (employee_id) INTO 2 BUCKETS
Data must be loaded with INSERT
Random sampling over whole rows
SELECT * FROM table_name TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
Random sampling on a given column (using the bucketing column is more efficient)
SELECT * FROM table_name TABLESAMPLE(BUCKET 3 OUT OF 32 ON id) s;
Block-based sampling
SELECT * FROM table_name TABLESAMPLE(10 PERCENT) s;
SELECT * FROM table_name TABLESAMPLE(1M) s;
SELECT * FROM table_name TABLESAMPLE(10 ROWS) s;
create external table emp_bucket(
name string,
id int,
address array<string>,
personalInfo struct<sex:string,age:int>,
technol map<string,int>,
job map<string,string >)
clustered by(id) into 3 buckets
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/usr/test/bucket';
set hive.enforce.bucketing=true;
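Bucketed tables must be populated with INSERT rather than LOAD DATA. A hedged sketch (it assumes the employee_id table defined earlier holds the source rows, whose columns line up positionally with emp_bucket):

```sql
set hive.enforce.bucketing=true;
-- rows are hashed on id into the table's 3 buckets, producing one file per bucket
INSERT OVERWRITE TABLE emp_bucket
SELECT * FROM employee_id;
```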
Views overview
Use cases
View commands: CREATE, SHOW, DROP, ALTER
CREATE VIEW view_name AS SELECT statement; -- create a view
-- view creation supports CTE, ORDER BY, LIMIT, JOIN, etc.
SHOW TABLES; -- find views (SHOW VIEWS since Hive v2.2.0)
SHOW CREATE TABLE view_name; -- show the view definition
DROP VIEW view_name; -- drop a view
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = 'This is a view');
-- change view properties
ALTER VIEW view_name AS SELECT statement; -- change the view definition
create view v_e_p as select distinct name,address[0],info.age from employee_partition e join a_p a on e.add=a.test;
select * from v_e_p;
Often used with table-generating functions, joining each input row to the function's output
The OUTER keyword: a result row is produced even when the function's output is empty
select name,work_place,loc from employee lateral view outer explode(split(null,',')) a as loc;
Multiple levels are supported
select name,wps,skill,score from employee
lateral view explode(work_place) work_place_single as wps
lateral view explode(skills_score) sks as skill,score;
Commonly used to normalize data or parse JSON
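For the JSON case, a hedged sketch using Hive's built-in get_json_object (the json_src table, its line column, and the JSON fields are made up for illustration):

```sql
-- hypothetical table holding one JSON document per line
CREATE TABLE json_src (line string);

-- pull individual fields out of the JSON text with JSONPath expressions
SELECT get_json_object(line, '$.name') AS name,
       get_json_object(line, '$.age')  AS age
FROM json_src;
```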
Create a directory and put the data file into it
[root@hadoop100 ~]# hdfs dfs -mkdir /wctest
[root@hadoop100 ~]# hdfs dfs -put a.txt /wctest
Create an external table over the directory
create external table wctest(line string) stored as textfile location '/wctest';
select word,count(word) from wctest lateral view explode(split(line,' ')) a as word group by word;