There are two forms of the CREATE TABLE statement:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES]]
  [
    [ROW FORMAT row_format] [STORED AS file_format]
    | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement]
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE existing_table_or_view_name [LOCATION hdfs_path];
Table names and column names are case-insensitive, but SerDe and property names are case-sensitive. A table name may contain any Unicode characters. Table and column comments are strings enclosed in single quotes. The TBLPROPERTIES clause lets users attach additional key-value metadata to the table; the properties last_modified_user and last_modified_time are added and managed automatically by Hive. Of the syntax above, data types (data_type), partitions (PARTITIONED BY), and buckets (CLUSTERED BY) have already been covered, so they are not repeated here. LOCATION specifies the HDFS path where the table is stored and works the same way as for CREATE DATABASE. The SKEWED BY and ROW FORMAT clauses have not come up before and deserve a closer look, starting with ROW FORMAT.
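As a quick sketch of these clauses, here is a hypothetical table combining column comments, a table comment, and user-defined TBLPROPERTIES (the table name and property keys below are made up for illustration):

CREATE TABLE IF NOT EXISTS learning.demo (
  id   INT    COMMENT 'record id',
  name STRING COMMENT 'user name'
)
COMMENT 'a demo table'
TBLPROPERTIES ('creator'='hadoop', 'created_at'='2014-06-06');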
The syntax of the ROW FORMAT clause is:
DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
          [COLLECTION ITEMS TERMINATED BY char]
          [MAP KEYS TERMINATED BY char]
          [LINES TERMINATED BY char]
          [NULL DEFINED AS char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
A table can be created with either a custom SerDe or a built-in one. SerDe is short for serializer/deserializer and encapsulates the logic for turning unstructured bytes into records. If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is specified, the built-in SerDe org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is used (the same class that is the default value of the hive.script.serde parameter). The DELIMITED clause reads delimited files; the ESCAPED BY clause (e.g. ESCAPED BY '\') enables escaping of the delimiter characters, which is required if the data itself contains the delimiter. A custom NULL format can be set with the NULL DEFINED AS clause; the default is '\N'. The default field delimiter is Ctrl-A, and the default line delimiter is the newline character.
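As an example, a minimal sketch of a delimited table that exercises most of these sub-clauses (the table and its file layout are hypothetical):

CREATE TABLE contacts (
  name   STRING,
  phones ARRAY<STRING>,
  props  MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'NULL'
STORED AS TEXTFILE;

Here fields are separated by commas, the items of the phones array (and the entries of the props map) by '|', and each map key from its value by ':'.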
The file formats available in the STORED AS file_format clause are SEQUENCEFILE, TEXTFILE, RCFILE, ORC, and INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname. The default is TEXTFILE, controlled by the configuration parameter hive.default.fileformat. TEXTFILE stores the data as plain text; use SEQUENCEFILE when the data needs to be compressed. INPUTFORMAT and OUTPUTFORMAT specify the class names of the input and output formats, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. The STORED BY clause creates a non-native table, for example a table backed by HBase.
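For the STORED BY case, the canonical example from the Hive/HBase integration documentation creates a Hive table backed by an HBase table (it assumes the HBase storage handler jars are on the classpath):

CREATE TABLE hbase_table_1(key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");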
The SKEWED BY clause improves performance for tables in which one or more columns have skewed values. By naming the values that occur very frequently (heavy skew), Hive automatically splits those values out into separate files and takes this into account at query time, so that it can skip or read only the relevant files where possible. For example:
CREATE TABLE list_bucket_single (key STRING, value STRING)
  SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];

CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];
With these syntax rules of CREATE TABLE covered, the following examples demonstrate how to create tables.
Partitioned Tables

A table can have one or more partition columns, and Hive creates a separate data directory for each distinct combination of values of the partition columns. Partition columns are virtual columns, so a partition column must not share its name with a data column; queries can filter on partition columns.
hive> create table people(name string, age int, birthday date, telephone string, address string) partitioned by(name string, age int);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
The error above occurs because a partition column has the same name as a data column; to fix it, rename either the data column or the partition column.
hive> create table people(name string, age int, birthday date, telephone string, address string) partitioned by(department string, sex string, howOld int);
OK
Time taken: 0.743 seconds

External Tables
The EXTERNAL keyword together with LOCATION creates a table that is not stored in the default location; LOCATION gives the table's path on HDFS. External tables come in handy when data already exists at LOCATION. When an EXTERNAL table is dropped, the data in the table is not deleted from the file system.
Suppose HDFS already holds the file /user/hadoop/iis/input/iis.log; an external table can be created on top of that location:
hive> create external table iis(log string) location '/user/hadoop/iis/input';
OK
Time taken: 0.153 seconds
hive> select * from iis;
OK
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2014-03-03 00:00:00
#Fields: date time s-ip cs-uri-stem cs-uri-query c-ip cs(Cookie) sc-status sc-win32-status time-taken
2014-03-02 23:59:59 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.208.207 s=Heccw1DqspAR6xVw3wWbo4BWnkY8 200 0 0
2014-03-02 23:59:59 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.96.252 s=qF0qGgHv46amim54wyYuAy8XC1nj 200 0 0
2014-03-02 23:59:59 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.87.155 s=amQlFiegk5epbDWe09tWjXMKijPR 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=queryquota_v3 10.42.140.127 s=qdd5yCNU28tjG2hxunArlCvxVgox 200 0 15
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.242.141 s=53IjfuX2IdugOJKonB10gv0wJANx 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.226.203 s=aGjXqtI85MbgTDmE4O/aS6efsgjT 200 0 15
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.31.233.243 s=jg6Fq8Qcxg8Br4zUIa6y5Yjf6Ebi 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.140.127 s=qdd5yCNU28tjG2hxunArlCvxVgox 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.140.127 s=qdd5yCNU28tjG2hxunArlCvxVgox 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.31.233.188 s=6S6kNSmdEVIDLaTKGO4NHMlUi4vD 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=queryquota_v3 10.34.125.223 s=/J2Fu+YezjnqpjL51vtvcIu9lrMj 200 0 0
2014-03-03 00:00:01 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.208.207 s=Heccw1DqspAR6xVw3wWbo4BWnkY8 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.95.193 s=oYkwbroqu/XpgftiHyesURev2v+2 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.239.247 s=gRt4Gp5aAPj9YybOdMAMIKpMBI+u 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.94.0 s=lmbXhYOfUscdxhaGADQnflPKOBFy 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.231.198 s=Eb3pPggN2gohY0mHi2kzmxG5IN7x 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.241.191 s=7kScMcvsUk+NKSnFG+No4TMcuIr6 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.105.49 s=NxSIfUA5yc/Y1P+ZRzWR3kKYMCLA 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.95.168 s=GAeOuGvfZWiW4W7BQpzvZ8Wbkxww 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.244.54 s=5gw2zbfzu/rxiABla1Y98bit1rGl 200 0 0
2014-03-03 00:00:02 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.94.162 s=GGZdzXOBv71KPtFewxC0c1q6zTbX 200 0 15
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.121.179 s=PuhdGfCdyuTU9Tk2pxrUqTK71AxJ 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.125.223 s=/J2Fu+YezjnqpjL51vtvcIu9lrMj 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 12.33.6.35 s=NSh4BUvGpZcYsbm5KRGmYTz6sgc8 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.42.98.88 s=iLkrC508IIKHBCsm4ldhGrm39ThQ 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.34.125.223 s=/J2Fu+YezjnqpjL51vtvcIu9lrMj 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.33.241.99 s=9wQKZfqdYllbslYbA13nDTcYfnPv 200 0 0
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=connect 10.34.237.157 - 200 0 15
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=login 10.34.237.157 s=uOi/7qMVbj7aPRVRg2XWZLT4upJy 200 0 15
2014-03-03 00:00:03 3.16.30.11 /aws/cmd.srf c=getmessage 10.31.11.99 s=6Cuj/ITM5Zu8lg6KBEGxfsd/oo1k 200 0 15
Time taken: 0.091 seconds, Fetched: 34 row(s)
As the output shows, Hive reads the data already sitting on HDFS directly through the table.
Create Table As Select (CTAS)

A table can be created and populated with the results of a query in one create-table-as-select (CTAS) statement. The operation is atomic: the table is not visible to other users until all query results have been loaded, so other users either see the table with its complete results or do not see it at all.
CTAS has two parts, CREATE and SELECT. The SELECT part can be any SELECT statement that Hive supports. The CREATE part takes its schema from the SELECT part and creates the target table with the remaining table properties, such as the SerDe and storage format. If the SELECT part does not give aliases to the selected columns, the target table's columns are automatically named _col0, _col1, _col2, and so on; otherwise the aliases become the column names. CTAS has a few restrictions: the target table cannot be a partitioned table, an external table, or a list-bucketing table.
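For instance, here is a sketch adapted from the Hive documentation in which the CREATE part chooses its own SerDe and storage format, while the target schema (new_key, key_value_pair) comes from the column aliases in the SELECT part (key_value_store is a hypothetical source table):

CREATE TABLE new_key_value_store
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
  STORED AS RCFILE
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;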
The following console session also demonstrates CTAS, including the external-table restriction:
hive> create external table iis2 as select log from iis;
FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
hive> create table iis2 as select * from iis;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201406061427_0002, Tracking URL = http://hadoop:50030/jobdetails.jsp?jobid=job_201406061427_0002
Kill Command = /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201406061427_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-06 15:47:35,586 Stage-1 map = 0%, reduce = 0%
2014-06-06 15:47:40,634 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.49 sec
2014-06-06 15:47:43,806 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.49 sec
MapReduce Total cumulative CPU time: 1 seconds 490 msec
Ended Job = job_201406061427_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/tmp/hive-hadoop/hive_2014-06-06_15-47-25_302_2650279919157474703-1/-ext-10001
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/learning.db/iis2
Table learning.iis2 stats: [numFiles=1, numRows=34, totalSize=3456, rawDataSize=3422]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.49 sec   HDFS Read: 3690 HDFS Write: 3527 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 490 msec
OK
Time taken: 18.783 seconds

Creating Tables Using LIKE
The syntax for creating a table with LIKE is:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE existing_table_or_view_name [LOCATION hdfs_path];
The LIKE form copies the definition of an existing table exactly to the target table without copying its data, for example:
hive> create table iis4 like iis;
OK
Time taken: 0.37 seconds
In Hive 0.8.0 and later, CREATE TABLE LIKE view_name creates a table using the view's schema, with the default SerDe and file format.
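A minimal sketch, assuming a view over the iis table created earlier (the view and target table names here are made up):

CREATE VIEW iis_view AS SELECT log FROM iis;
CREATE TABLE iis5 LIKE iis_view;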
Bucketed Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
In the example above, the page_view table is bucketed by userid, and within each bucket the data is sorted by viewTime in ascending order. Bucketing a table allows efficient sampling on the clustered column (here userid), and the sort order lets internal operators exploit the known structure while evaluating queries, improving efficiency. The MAP KEYS and COLLECTION ITEMS keywords are used when a column is a list or a map. Note that CLUSTERED BY and SORTED BY do not affect how data is inserted into the table, only how it is read. This means data must be inserted carefully and correctly, by setting the number of reducers equal to the number of buckets and using CLUSTER BY and SORT BY clauses in the insert query.
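A sketch of such an insert, assuming a hypothetical unbucketed staging table raw_page_view with the same data and partition columns (newer Hive versions can take care of this automatically when hive.enforce.bucketing is set to true):

set mapred.reduce.tasks = 32;
FROM raw_page_view
INSERT OVERWRITE TABLE page_view PARTITION(dt='2014-03-03', country='US')
SELECT viewTime, userid, page_url, referrer_url, ip
WHERE dt='2014-03-03' AND country='US'
DISTRIBUTE BY userid SORT BY viewTime;

DISTRIBUTE BY userid routes each row to the reducer (bucket) determined by the hash of userid, and SORT BY viewTime orders the rows within each bucket, matching the CLUSTERED BY/SORTED BY declaration above.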
The syntax for dropping a table is:
DROP TABLE [IF EXISTS] table_name
Dropping a table removes not only the table's data but also its metadata. If HDFS Trash is configured, the deleted data is moved to the .Trash/Current directory, but the metadata is lost for good. When an EXTERNAL table is dropped, the data in the table is not deleted from the file system; that is, the data does not disappear along with the table. Dropping a table that is referenced by a view produces no warning (the view becomes invalid and must be dropped or recreated by the user). As of Hive 0.7.0, DROP returns an error if the table does not exist, unless IF EXISTS is used or the configuration parameter hive.exec.drop.ignorenonexistent is set to true.
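For example, one of the tables created earlier can be removed without raising an error even if it is already gone:

DROP TABLE IF EXISTS iis4;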
The syntax of TRUNCATE TABLE is:
TRUNCATE TABLE table_name [PARTITION partition_spec];

partition_spec:
  : (partition_col = partition_col_value, partition_col = partition_col_value, ...)
A TRUNCATE operation removes all rows from a table or from one or more partitions. The user can truncate multiple partitions at once by specifying a partition_spec; omitting partition_spec truncates all partitions in the table.
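For example, using the managed tables created earlier (the partition value is hypothetical; note that a partial partition_spec such as dt alone covers every country partition under that date):

TRUNCATE TABLE iis2;
TRUNCATE TABLE page_view PARTITION (dt='2014-03-03');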