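Hive is a data warehouse tool built on Hadoop. It maps structured data stored in the distributed file system HDFS to tables, keeps the resulting metadata in a database chosen by the user, and lets you analyze the data with a SQL-like language, Hive QL.
Hive can also store a table's actual data in HDFS in a specific format, according to the parameters set on the table. This is one reason Hive performs well when reading large volumes of data.
Below we look at how to create a table with Hive QL and how to modify the table's structure after it has been created.
This article focuses on what each part of a Hive table definition means. The goal is that, once we understand these parts, we can configure them more flexibly and go on to customize each of them.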
In the LanguageManual DDL on the Hive website, the syntax for creating a table is:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] (Note: Only available starting with Hive 0.6.0)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] (Note: Only available starting with Hive 0.6.0)
[AS select_statement] (Note: Only available starting with Hive 0.5.0, and not supported when creating external tables.)
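To make the grammar above more concrete, here is a minimal sketch of a statement that uses several of the optional clauses together. The table name, column names, bucket count and property are all hypothetical and chosen only for illustration.

```sql
-- Hypothetical table; every name and value here is illustrative.
CREATE TABLE IF NOT EXISTS page_views (
  user_id   BIGINT    COMMENT 'visitor id',
  url       STRING    COMMENT 'requested page',
  view_time TIMESTAMP COMMENT 'time of the request'
)
COMMENT 'one row per page view'
-- Partition columns are declared here, not in the column list above.
PARTITIONED BY (dt STRING COMMENT 'date of the view')
-- Rows are hashed on user_id into 16 buckets, sorted inside each bucket.
CLUSTERED BY (user_id) SORTED BY (view_time ASC) INTO 16 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ('creator' = 'example');
```

Note that the clauses have to appear in the order given by the grammar, for example PARTITIONED BY before CLUSTERED BY, and ROW FORMAT before STORED AS.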
Alternatively, a table can be created from the definition of an existing table or view:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path]
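As a small sketch of the LIKE form, assuming a table named page_views already exists (both table names and the HDFS path are hypothetical):

```sql
-- Copies the definition of page_views (columns, partitioning, formats),
-- but none of its data.
CREATE TABLE IF NOT EXISTS page_views_copy
LIKE page_views;

-- The same form with EXTERNAL and an explicit storage location.
CREATE EXTERNAL TABLE IF NOT EXISTS page_views_ext
LIKE page_views
LOCATION '/data/example/page_views_ext';
```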
, struct_type
, union_type (this type takes effect from Hive 0.7.0 onward)
The last four are complex (structured) types, written as follows:
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > (Note: Only available starting with Hive 0.7.0)
primitive_type covers the basic types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY (Note: Only available starting with Hive 0.8.0), TIMESTAMP (Note: Only available starting with Hive 0.8.0), DECIMAL (Note: Only available starting with Hive 0.11.0), DECIMAL(precision, scale) (Note: Only available starting with Hive 0.13.0), VARCHAR (Note: Only available starting with Hive 0.12.0), CHAR (Note: Only available starting with Hive 0.13.0)
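The complex types are easier to picture in an actual column list. The following is a sketch only; the table, its columns and the chosen delimiters are hypothetical.

```sql
-- Hypothetical table mixing primitive and complex column types.
CREATE TABLE IF NOT EXISTS user_profiles (
  user_id  BIGINT,
  tags     ARRAY<STRING>,
  settings MAP<STRING, STRING>,
  address  STRUCT<city:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','  -- separates array elements, map entries, struct fields
  MAP KEYS TERMINATED BY ':';         -- separates a map key from its value

-- Array elements are read by index, map values by key, struct fields with a dot.
SELECT user_id, tags[0], settings['lang'], address.city
FROM user_profiles;
```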
5. SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)
This clause sets up separate file storage for the values of the listed columns on which the table's data is heavily skewed.
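ALTER syntax: ALTER TABLE table_name SKEWED BY (col_name, ...) ON (col_value, ...) [STORED AS DIRECTORIES], or ALTER TABLE table_name NOT SKEWED to turn skew handling off (Hive 0.10.0 and later).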
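A minimal sketch of a skewed table, assuming a hypothetical click_log table in which a handful of site_id values account for most of the rows:

```sql
-- Hypothetical table: site_id values 1, 5 and 6 are assumed to be the heavy ones.
CREATE TABLE IF NOT EXISTS click_log (
  site_id BIGINT,
  payload STRING
)
SKEWED BY (site_id) ON (1, 5, 6)
-- Optional: keep each listed value in its own directory (list bucketing).
STORED AS DIRECTORIES;
```

Rows whose site_id is not in the ON list are stored together as usual; only the listed values are split out.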
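6. [ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] (Note: Only available starting with Hive 0.6.0)
This part defines how fields are delimited and how rows are laid out in files. It corresponds to the SerDe and to the INPUTFORMAT/OUTPUTFORMAT.
The default SerDe passes data internally as strings; use ROW FORMAT DELIMITED FIELDS TERMINATED BY '?' to set the field delimiter, or use the SERDE keyword to plug in a custom SerDe.
The file formats (file_format) that Hive supports out of the box are TEXTFILE, SEQUENCEFILE, RCFILE and ORC. Users can define their own file format with the INPUTFORMAT and OUTPUTFORMAT keywords. (The two must be specified together; neither can be set on its own.)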
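The two routes described above, delimiters with a built-in file format versus an explicit SerDe with explicit input and output formats, might look like the sketch below. The table names are hypothetical; the classes are the ones Hive itself uses for plain text tables.

```sql
-- Route 1: default SerDe, comma as the field delimiter, built-in text format.
CREATE TABLE IF NOT EXISTS csv_events (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Route 2: SerDe named explicitly, INPUTFORMAT and OUTPUTFORMAT given as a pair.
CREATE TABLE IF NOT EXISTS raw_events (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

Both statements should describe the same on-disk layout; the second simply spells out what the first leaves to defaults.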
ALTER syntax: ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET FILEFORMAT file_format
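As a sketch of these two statements in use, assuming a table named raw_events already exists (the table name and the new delimiter are hypothetical):

```sql
-- Switch the SerDe and change the field delimiter to a tab.
ALTER TABLE raw_events
  SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('field.delim' = '\t');

-- Switch the declared file format.
ALTER TABLE raw_events SET FILEFORMAT SEQUENCEFILE;
```

Keep in mind that these statements only change the table's metadata. Files already written to HDFS are not converted, so existing data has to match the new settings or be rewritten separately.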
7. The LOCATION keyword sets where the table's data is stored.
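LOCATION is most often seen together with EXTERNAL, to put a table on top of files that already sit in HDFS. A minimal sketch, with a hypothetical table name and path:

```sql
-- Hypothetical external table over files that already exist in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS ext_logs (
  line STRING
)
LOCATION '/data/example/ext_logs';
```

Because the table is external, dropping it removes only the metadata and leaves the files under that path in place.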