The DDL outline below gives an overall picture of Hive's DDL operations.
Note: SCHEMA and DATABASE are the same concept; only the name differs.
-- Create a database/schema, table, view, function, or index
CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
-- Drop a database/schema, table, view, or index
DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
-- Truncate a table
TRUNCATE TABLE
-- Alter a database/schema, table, or view
ALTER DATABASE/SCHEMA, TABLE, VIEW
-- Repair a table or its partitions
MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
-- Show databases/schemas, tables, table properties, views, partitions, functions, indexes, columns, or a table's CREATE statement
SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
-- Describe a database/schema, table, view, or materialized view
DESCRIBE DATABASE/SCHEMA, table_name, view_name, materialized_view_name
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
LOCATION specifies the storage path for external tables; MANAGEDLOCATION specifies the storage path for managed tables (supported only in Hive 4.0.0 and later). The official recommendation is to keep the defaults, so that all tables live under one root directory.
create database myhive;
create database if not exists myhive;
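A sketch combining the optional clauses from the syntax above; the database name, path, and property below are illustrative, not from the original:

```sql
-- hypothetical names: myhive2, the LOCATION path, and the creator property
CREATE DATABASE IF NOT EXISTS myhive2
COMMENT 'demo database'
LOCATION '/user/hive/warehouse/myhive2.db'
WITH DBPROPERTIES ('creator' = 'demo');
```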
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
If the database contains tables, it cannot be dropped; to drop it anyway, append CASCADE. RESTRICT is the default, which disallows dropping a non-empty database.
drop database if exists myhive;
drop database if exists myhive cascade;
USE database_name;
USE DEFAULT;
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];
(1) CREATE TABLE creates a table with the given name. If a table with that name already exists, an error is thrown; use IF NOT EXISTS to suppress it.
(2) TEMPORARY marks a temporary table: it exists only within the current session and disappears when the session ends.
EXTERNAL marks an external table, created together with a path pointing at the actual data. Dropping an external table removes only the metadata, not the table data.
(3) PARTITIONED BY specifies the columns to partition by.
(4) CLUSTERED BY defines a bucketed table: which columns to bucket by (and, with SORTED BY, sort by within each bucket); INTO ... BUCKETS sets the number of buckets. (Partitioned and bucketed tables will each get a dedicated article later.)
(5) SKEWED BY declares that certain columns contain skewed values; Hive records those values and can use them for better query performance.
(6) STORED AS specifies the file format used to store the table.
CREATE TABLE IF NOT EXISTS test1 (
id INT,name STRING
);
Here row format delimited introduces the delimited row format;
fields terminated by ',' means fields are separated by ',';
LINES TERMINATED BY '\n' means rows are separated by newlines (this is the default and is usually omitted);
location '/user/hdfs/source/hive_test' means the external table's data lives under that directory.
CREATE EXTERNAL TABLE IF NOT EXISTS my_hive (id INT,name STRING)
row format delimited
fields terminated by ','
LINES TERMINATED BY '\n'
stored as textfile
location '/user/hdfs/source/hive_test' ;
partitioned by (dt string) defines the partition column;
stored as specifies that the table is stored as textfile
create table IF NOT EXISTS test_part_table(
word string,
num bigint)
partitioned by(dt string)
row format delimited
fields terminated by '\t'
STORED AS TEXTFILE;
create external table IF NOT EXISTS log_detail(
word string,
num bigint)
partitioned by(dt string)
row format delimited
fields terminated by '\t'
STORED AS TEXTFILE
location '/user/hdfs/source/hive_test' ;
Here we create a page-view table bucketed by userid into 32 buckets: on insert, Hive hashes userid, takes the hash modulo 32, and routes each row into one of the 32 buckets.
CREATE TABLE page_view(
viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
Create a table from the result of a query (CTAS):
CREATE TABLE new_key_value_store
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
Create a table with the same structure as the table named after LIKE, without copying any data:
CREATE TABLE empty_key_value_store
LIKE key_value_store;
Use SKEWED BY to declare skewed values in certain columns and improve query performance:
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78));
A temporary table is valid only within the current user session; its data lives in the user's scratch directory, and both the table and its data are removed when the session ends.
CREATE TEMPORARY TABLE list_bucket_multiple (
col1 STRING,
col2 int,
col3 STRING
);
In the example below we read Apache access logs using a regular expression, defined via SERDEPROPERTIES:
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
Dropping a managed table removes both the table data and the metadata; if the trash is enabled, the data ends up in the trash.
Dropping an external table removes only the metadata; the data is left untouched.
Dropping a table that has views defined on it produces no warning.
PURGE means the data is deleted immediately after the drop instead of going to the trash.
DROP TABLE [IF EXISTS] table_name [PURGE]
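For instance, to drop the test1 table created earlier and bypass the trash (a sketch, assuming test1 still exists):

```sql
DROP TABLE IF EXISTS test1 PURGE;  -- data is deleted immediately, not moved to the trash
```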
Remove all rows from a table, or from one of its partitions:
TRUNCATE [TABLE] table_name [PARTITION partition_spec];
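Using the test_part_table created earlier, a sketch of both forms:

```sql
-- remove all rows in one partition only
TRUNCATE TABLE test_part_table PARTITION (dt='20201020');
-- remove all rows in the whole table
TRUNCATE TABLE test_part_table;
```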
ALTER TABLE table_name RENAME TO new_table_name;
General syntax:
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
For example:
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
(Hive serialization and deserialization will be covered in detail in a separate article.)
General syntax:
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
serde_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE apachelog SET SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES (
"input.regex" = "bduid\\[(.*)\\]uid\\[(\\d+)\\]"
)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
Dynamic partitioning must be enabled first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Note: in the SELECT list, the partition columns must come last, in partition order.
insert overwrite table pt_table partition(dt,pt)
select id,name,dt,pt from origin_table;
insert overwrite table pt_table partition(dt='20201020',pt)
select id,name,pt from origin_table;
FROM S
INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10
INSERT OVERWRITE TABLE R PARTITION (ds='2010-03-03', hr=12)
SELECT key, value, ds, hr from srcpart where ds is not null and hr = 12;
CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS
SELECT key, value, ds, hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;
The examples above all use dynamic partition values; to make one of them static, put the constant directly in the SELECT list:
CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS
SELECT key, value, "2010-03-03", hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;
alter table pt_table drop if exists partition(dt='20201020');
Note: changing a column only rewrites the metadata, not the actual data, so before altering make sure the new definition still matches the data.
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
-- Create a table
CREATE TABLE test_change (a int, b int, c int);
-- Rename column a to a1
ALTER TABLE test_change CHANGE a a1 INT;
-- Rename column a1 to a2, change its type to STRING, and move it after column b
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
-- The schema is now: b int, a2 string, c int.
-- Rename column c to c1 and move it to the first position
ALTER TABLE test_change CHANGE c c1 INT FIRST;
-- The schema is now: c1 int, b int, a2 string.
-- Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';
DROP VIEW IF EXISTS onion_referrers;
General syntax:
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
Example:
SHOW DATABASES LIKE 'my*';
General syntax:
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
Example:
SHOW TABLES IN default;
SHOW VIEWS [IN/FROM database_name] [LIKE 'pattern_with_wildcards'];
Example:
SHOW VIEWS in default LIKE 'view';
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');
SHOW PARTITIONS table_name PARTITION(hr='12');
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');
General syntax:
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
For example:
hive> show table extended like part_table;
OK
tableName:part_table
owner:thejas
location:file:/tmp/warehouse/part_table
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 i}
partitioned:true
partitionColumns:struct partition_columns { string d}
totalNumberFiles:1
totalFileSize:2
maxFileSize:2
minFileSize:2
lastAccessTime:0
lastUpdateTime:1459382233000
SHOW CREATE TABLE ([db_name.]table_name|view_name);
Example:
SHOW CREATE TABLE default.myhive;
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name];
SHOW FUNCTIONS [LIKE "" ];
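A sketch of both statements against the page_view table created earlier (the "date*" pattern is just an illustration):

```sql
SHOW COLUMNS IN page_view;    -- lists viewTime, userid, page_url, ...
SHOW FUNCTIONS LIKE "date*";  -- lists functions whose names match the pattern
```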
show conf 'hive.exec.reducers.max';
All configuration properties are listed in the official Hive configuration documentation.
DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE SCHEMA [EXTENDED] db_name; -- (Note: Hive 1.1.0 and later)
Example:
DESCRIBE DATABASE DEFAULT;
General syntax:
DESCRIBE [EXTENDED|FORMATTED] table_name
Example:
DESCRIBE extended part_table partition (d='abc');
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
(1) filepath can be a relative path, an absolute path, or a full HDFS URI;
(2) With LOCAL, the file is loaded from the local filesystem into the target location; without it, the file at filepath is moved into the target table;
(3) With OVERWRITE, the target's existing contents are cleared before the data is written; without it, the data is appended;
(4) For a partitioned table, a partition must be specified.
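A sketch against the test_part_table created earlier; both file paths are hypothetical:

```sql
-- load a local file into a partition, appending to existing data
LOAD DATA LOCAL INPATH '/tmp/words.txt'
INTO TABLE test_part_table PARTITION (dt='20201020');

-- move an HDFS file into a partition, replacing its existing data
LOAD DATA INPATH '/user/hdfs/words.txt'
OVERWRITE INTO TABLE test_part_table PARTITION (dt='20201020');
```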
-- insert overwrite
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
select_statement1 FROM from_statement;
-- insert into
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
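A sketch using the log_detail and test_part_table tables created earlier:

```sql
-- replace one partition of test_part_table with rows selected from log_detail
INSERT OVERWRITE TABLE test_part_table PARTITION (dt='20201020')
SELECT word, num FROM log_detail WHERE dt = '20201020';
```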
A multi insert writes data from one table into several tables at once; Hive reads the source table only once.
Example:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
A CTE (the WITH clause) names a temporary query result that later query blocks can reference.
For example:
with q1 as ( select key from src where key = '5')
select *
from q1;
with q1 as ( select key from q2 where key = '5'),
q2 as ( select key from src where key = '5')
select * from (select key from q1) a;
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
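For the VALUES statement above to work, students needs a matching schema; a minimal sketch (the column names and types are assumptions):

```sql
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
```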