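1. All Hive data is stored in HDFS. Hive has no dedicated storage format of its own; it supports Text, SequenceFile, ParquetFile, RCFILE, and others.
2. You only need to tell Hive the column delimiter and the row delimiter when creating a table; Hive can then parse the data.
3. Hive has the following data models: DB, Table, External Table, Partition, and Bucket.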
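(3) external table: similar to a table, except that its data can live at any path you specify.
Regular (managed) table: dropping the table also deletes its files on HDFS.
External table: dropping the table leaves the files on HDFS in place; only the metadata is deleted.
(5) bucket: appears in HDFS as multiple files under the same table directory, produced by hashing; each row is written to one of these files according to its hash value.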
# Create:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=value, name=value, ...)]
# Show description:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] database_name
# Drop:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE]
# Use:
USE database_name;
[(col_name data_type [COMMENT col_comment], ...)]   --- the table name and its column definitions.
[COMMENT table_comment]   --- the table's description.
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]   --- the table's partition columns.
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]   --- the table's bucketing.
[ROW FORMAT row_format]   --- how rows and fields are delimited and formatted.
[STORED AS file_format]   --- the storage (file) format of the table data.
[LOCATION hdfs_path]   --- the directory where the data is stored.
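1. CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; use the IF NOT EXISTS option to suppress it. Hive tables are either internal (managed) tables or external tables. The difference is that Hive does not manage an external table's data: dropping an external table does not delete its data from HDFS. Hive does manage an internal table's data, so dropping the table deletes the data as well. External tables are therefore usually created with their data path outside Hive's warehouse directory. There are three main ways to create a table: (1) a plain CREATE TABLE statement; (2) CREATE TABLE ... AS SELECT ... (which also populates the table with data); (3) CREATE TABLE tablename LIKE exist_tablename.
2. The EXTERNAL keyword creates an external table and lets you specify a path (LOCATION) to the actual data at creation time. For an internal table, Hive moves the data into the path under its data warehouse; for an external table, it only records where the data is and does not move it. When a table is dropped, an internal table's metadata and data are both deleted, whereas an external table loses only its metadata; the data stays.
3. LIKE copies the structure of an existing table but not its data.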
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
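You can supply a custom SerDe or use a built-in one when creating a table. If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is specified, a built-in SerDe is used. You also declare the table's columns at creation time; when a custom SerDe is specified along with the columns, Hive uses the SerDe to determine the data of each column.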
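Each table or partition can be further organized into buckets; a bucket is a finer-grained division of the data. Bucketing is done on a column: Hive hashes the column value and takes the remainder of dividing by the number of buckets to decide which bucket a record goes into.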
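The "hash, then modulo the bucket count" rule can be modeled in a few lines of Python. This is only a sketch, not Hive code: the function name bucket_for is hypothetical, and it assumes that the hash of an integer column is the integer itself and that the hash is made non-negative before the modulo.

```python
def bucket_for(value: int, num_buckets: int) -> int:
    """Model of: hash the column value, then take it modulo the bucket count."""
    # Assumption: for an INT column the hash is the integer itself; masking with
    # 0x7FFFFFFF keeps the value non-negative before the modulo.
    return (value & 0x7FFFFFFF) % num_buckets


# Sample rows (id, name), bucketed on id into 2 buckets.
rows = [(1, "张三"), (2, "lisi"), (3, "wangwu"), (4, "zhaoliu"), (5, "libai")]
buckets = {b: [] for b in range(2)}
for row in rows:
    buckets[bucket_for(row[0], 2)].append(row)

for b, members in buckets.items():
    print(b, [r[0] for r in members])
```

Under these assumptions, even ids land in bucket 0 and odd ids in bucket 1, which is why rows with equal values of the bucketing column always end up in the same file.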
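(1) More efficient query processing. Buckets impose extra structure on a table, which Hive can exploit for some queries. In particular, a join of two tables that are bucketed on the same columns (which include the join columns) can be implemented efficiently as a map-side join. Take a JOIN on a column the two tables share: if both tables are bucketed on it, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data the JOIN has to handle.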
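As a rough illustration of why bucketing helps a join, the Python sketch below joins two tables bucket by bucket, so each step only compares rows from the matching bucket of the other table. It is a model of the idea, not Hive's implementation; the helper names are hypothetical, and keys are assumed to be non-negative integers that are bucketed by key modulo the bucket count.

```python
from collections import defaultdict


def split_into_buckets(rows, num_buckets):
    """Group (key, value) rows by key % num_buckets (assumed bucketing rule)."""
    buckets = defaultdict(list)
    for key, val in rows:
        buckets[key % num_buckets].append((key, val))
    return buckets


def bucketed_join(left, right, num_buckets):
    """Inner join on key, touching only one pair of buckets at a time."""
    left_buckets = split_into_buckets(left, num_buckets)
    right_buckets = split_into_buckets(right, num_buckets)
    out = []
    for b in range(num_buckets):
        # Build a small lookup for this bucket only, as a map-side join would.
        lookup = defaultdict(list)
        for key, val in right_buckets[b]:
            lookup[key].append(val)
        for key, val in left_buckets[b]:
            for right_val in lookup.get(key, []):
                out.append((key, val, right_val))
    return out


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(bucketed_join(left, right, 2))
```

Each lookup table holds one bucket of the right-hand table rather than the whole table, which is the source of the savings the paragraph describes.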
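7. The CREATE TABLE command, part 2
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE existing_table_or_view_name   --- the table to create and the existing table or view to copy from.
[LOCATION hdfs_path]   --- the HDFS path where the data files are stored.
[db_name.]table_name   --- the name of the table to create.
[AS select_statement]   --- the query whose result populates the table.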
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
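-- page_view is the table name. Note that Hive's data types resemble Java's and differ from the column types of databases such as MySQL and Oracle.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
-- COMMENT is a description and is optional.
COMMENT 'This is the page view table'
-- PARTITIONED BY declares the table's partitions; it can be ignored for now.
PARTITIONED BY(dt STRING, country STRING)
-- ROW FORMAT DELIMITED means each line is one record: the columns declared above map onto the fields of one line in the file.
ROW FORMAT DELIMITED
-- FIELDS TERMINATED BY '\001' names the character that separates the fields within a line, so each declared column can be matched to a field of the record.
FIELDS TERMINATED BY '\001'
-- STORED AS SEQUENCEFILE names the file type. Common choices are SEQUENCEFILE (stored as key-value pairs) and TEXTFILE.
STORED AS SEQUENCEFILE;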
-- create & load (after creating a table, load data into it, for example):
hive> create table tb_order(id int,name string,memory string,price double)
> row format delimited
> fields terminated by '\t';
load data local inpath '/home/hadoop/ip.txt' into table <table_name>;
load data inpath 'hdfs://ns1/aa/bb/data.log' into table <table_name>;
insert overwrite table tab_ip_seq select * from <table_name>;
hive> load data local inpath '/home/hadoop/hivetest/' into table tb_order;
[root@slaver3 hivetest]# hadoop fs -put /user/hive/warehouse/tb_order
-- external table
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/user';
-- external table: use the EXTERNAL keyword
CREATE EXTERNAL TABLE <table_name>(id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- LOCATION specifies where the data lives. This is the key point to remember.
LOCATION '/external/user';
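Tip: if you are unsure how a keyword is spelled, press Tab at the prompt to see completions (typing "partition" and pressing Tab lists partition, partitioned, partitions). This does not affect the statement being entered.
hive> create table tb_part(sNo int,sName string,sAge int,sDept string)
    > partitioned by (part string)
    > row format delimited
    > fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.351 seconds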
hive> load data local inpath '/home/hadoop/data_hadoop/tb_part' overwrite into table tb_part partition (part='20171210');
Loading data to table test.tb_part partition (part=20171210)
Partition test.tb_part{part=20171210} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
OK
Time taken: 2.984 seconds
hive> load data local inpath '/home/hadoop/data_hadoop/tb_part' overwrite into table tb_part partition (part='20171211');
Loading data to table test.tb_part partition (part=20171211)
Partition test.tb_part{part=20171211} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
OK
Time taken: 0.566 seconds
hive> show par
parse_url(         parse_url_tuple(   partition          partitioned        partitions
hive> show partition
partition     partitioned   partitions
hive> show partitions tb_part;
OK
part=20171210
part=20171211
Time taken: 0.119 seconds, Fetched: 2 row(s)
hive>
Bucketing: the CLUSTERED BY clause below buckets the rows by id into 2 buckets.
hive> create table if not exists tb_stud(id int,name string,age int)
    > partitioned by(clus string)
    > clustered by(id) sorted by(age) into 2 buckets
    > row format delimited
    > fields terminated by ',';
OK
Time taken: 0.194 seconds
hive> load data local inpath '/home/hadoop/data_hadoop/tb_clustered' overwrite into table tb_stud partition (clus='20171211');
Loading data to table test.tb_stud partition (clus=20171211)
Partition test.tb_stud{clus=20171211} stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
OK
Time taken: 0.594 seconds
hive>
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
ALTER TABLE table_name DROP partition_spec, partition_spec,...
alter table student_p add partition(part='a') partition(part='b');
hive> alter table tb_stud add partition(clus='20171215') location '/user/hive/warehouse/test.db' partition(clus='20171216');
OK
Time taken: 1.289 seconds
hive> alter table tb_stud add partition
partition     partitioned   partitions
hive> alter table tb_stud add partition(clus='20171217');
OK
Time taken: 0.097 seconds
hive> dfs -ls /user/hive/warehouse/test.db
    > ;
Found 4 items
drwxr-xr-x   - root supergroup          0 2017-12-09 23:32 /user/hive/warehouse/test.db/tb_log
drwxr-xr-x   - root supergroup          0 2017-12-10 00:14 /user/hive/warehouse/test.db/tb_part
drwxr-xr-x   - root supergroup          0 2017-12-10 00:43 /user/hive/warehouse/test.db/tb_stud
drwxr-xr-x   - root supergroup          0 2017-12-09 21:28 /user/hive/warehouse/test.db/tb_user
hive> show partitions tb_stud;
OK
clus=20171211
clus=20171215
clus=20171216
clus=20171217
Time taken: 0.119 seconds, Fetched: 4 row(s)
hive> alter table tb_stud drop partition
partition     partitioned   partitions
hive> alter table tb_stud drop partition(clus='20171217');
Dropped the partition clus=20171217
OK
Time taken: 1.433 seconds
hive> show partitions tb_stud;
OK
clus=20171211
clus=20171215
clus=20171216
Time taken: 0.092 seconds, Fetched: 3 row(s)
hive> alter table tb_stud drop partition(clus='20171215'),partition(clus='20171216');
Dropped the partition clus=20171215
Dropped the partition clus=20171216
OK
Time taken: 0.271 seconds
hive> show partitions tb_stud;
OK
clus=20171211
Time taken: 0.094 seconds, Fetched: 1 row(s)
hive>
hive> show tables;
OK
tb_log
tb_part
tb_stud
tb_user
Time taken: 0.026 seconds, Fetched: 4 row(s)
hive> alter table tb_user rename to tb_user_copy;
OK
Time taken: 0.19 seconds
hive> show tables;
OK
tb_log
tb_part
tb_stud
tb_user_copy
Time taken: 0.05 seconds, Fetched: 4 row(s)
hive>
hive> desc tb_user;
OK
id                      int
name                    string
Time taken: 0.148 seconds, Fetched: 2 row(s)
hive> alter table tb_user add columns(age int);
OK
Time taken: 0.238 seconds
hive> desc tb_user;
OK
id                      int
name                    string
age                     int
Time taken: 0.088 seconds, Fetched: 3 row(s)
hive> alter table tb_user replace columns(id int,name string,birthday string);
OK
Time taken: 0.132 seconds
hive> desc tb_user;
OK
id                      int
name                    string
birthday                string
Time taken: 0.083 seconds, Fetched: 3 row(s)
hive>
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
Multiple inserts:
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...
Dynamic partition inserts:
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement
hive> load data local inpath '/home/hadoop/data_hadoop/tb_stud' overwrite into table tb_stud partition (clus='20171211');
Loading data to table test.tb_stud partition (clus=20171211)
Partition test.tb_stud{clus=20171211} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
OK
Time taken: 4.336 seconds
hive> select * from tb_stud where clus='20171211';
OK
1       张三    NULL    20171211
2       lisi    NULL    20171211
3       wangwu  NULL    20171211
4       zhaoliu NULL    20171211
5       libai   NULL    20171211
Time taken: 0.258 seconds, Fetched: 5 row(s)
hive> insert overwrite table tb_stud partition(clus='20171218')
    > select id,name,age from tb_stud where clus='20171211';
Query ID = root_20171210012734_721f76d9-f670-42ad-bf68-bfb94baf5cda
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1512874725514_0005, Tracking URL = http://master:8088/proxy/application_1512874725514_0005/
Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job -kill job_1512874725514_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-12-10 01:28:00,125 Stage-1 map = 0%, reduce = 0%
2017-12-10 01:28:34,514 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.44 sec
MapReduce Total cumulative CPU time: 1 seconds 440 msec
Ended Job = job_1512874725514_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171218/.hive-staging_hive_2017-12-10_01-27-34_894_112189266881641464-1/-ext-10000
Loading data to table test.tb_stud partition (clus=20171218)
Partition test.tb_stud{clus=20171218} stats: [numFiles=1, numRows=5, totalSize=58, rawDataSize=53]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.44 sec HDFS Read: 3816 HDFS Write: 140 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 440 msec
OK
Time taken: 64.66 seconds
hive> select * from tb_stud where clus='20171218';
OK
1       张三    NULL    20171218
2       lisi    NULL    20171218
3       wangwu  NULL    20171218
4       zhaoliu NULL    20171218
5       libai   NULL    20171218
Time taken: 0.085 seconds, Fetched: 5 row(s)
hive>

hive> show partitions tb_stud;
OK
clus=20171211
clus=20171218
Time taken: 0.153 seconds, Fetched: 2 row(s)
hive> alter table tb_stud add partition(clus='20171212');
OK
Time taken: 0.143 seconds
hive> alter table tb_stud add partition(clus='20171213');
OK
Time taken: 0.399 seconds
hive> alter table tb_stud add partition(clus='20171214');
OK
Time taken: 0.139 seconds
hive> from tb_stud
    > insert overwrite table tb_stud partition(clus='20171213')
    > select id,name,age where clus='20171211'
    > insert overwrite table tb_stud partition(clus='20171214')
    > select id,name,age where clus='20171211';
Query ID = root_20171210013655_0c4a1d78-88e2-4de0-99ca-074c9eed81a4
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1512874725514_0007, Tracking URL = http://master:8088/proxy/application_1512874725514_0007/
Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job -kill job_1512874725514_0007
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2017-12-10 01:37:05,501 Stage-2 map = 0%, reduce = 0%
2017-12-10 01:38:06,089 Stage-2 map = 0%, reduce = 0%
2017-12-10 01:38:08,363 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.46 sec
MapReduce Total cumulative CPU time: 1 seconds 460 msec
Ended Job = job_1512874725514_0007
Stage-5 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
Stage-12 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171213/.hive-staging_hive_2017-12-10_01-36-55_602_8039889333976698612-1/-ext-10000
Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171214/.hive-staging_hive_2017-12-10_01-36-55_602_8039889333976698612-1/-ext-10002
Loading data to table test.tb_stud partition (clus=20171213)
Loading data to table test.tb_stud partition (clus=20171214)
Partition test.tb_stud{clus=20171213} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
Partition test.tb_stud{clus=20171214} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Cumulative CPU: 1.57 sec HDFS Read: 4798 HDFS Write: 280 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 570 msec
OK
Time taken: 81.536 seconds
hive> select * from tb_stud where clus='20171213';
OK
1       张三    NULL    20171213
2       lisi    NULL    20171213
3       wangwu  NULL    20171213
4       zhaoliu NULL    20171213
5       libai   NULL    20171213
Time taken: 0.138 seconds, Fetched: 5 row(s)
hive> select * from tb_stud where clus='20171214';
OK
1       张三    NULL    20171214
2       lisi    NULL    20171214
3       wangwu  NULL    20171214
4       zhaoliu NULL    20171214
5       libai   NULL    20171214
Time taken: 0.075 seconds, Fetched: 5 row(s)
hive>
multiple inserts:
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
1. Export to a local file.
Note: data written to the file system is serialized as text, with columns separated by ^A and \n as the line terminator. The separator is hard to see with more; to make it visible, use: sed -e 's/\x01/|/g' filename

hive> insert overwrite local directory '/home/hadoop/data_hadoop/get_tb_stud'
    > select * from tb_stud;
Query ID = root_20171210014640_4c499323-760e-4494-946b-5ffad8fb3789
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1512874725514_0008, Tracking URL = http://master:8088/proxy/application_1512874725514_0008/
Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job -kill job_1512874725514_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2017-12-10 01:46:50,400 Stage-1 map = 0%, reduce = 0%
2017-12-10 01:47:25,696 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.13 sec
MapReduce Total cumulative CPU time: 7 seconds 130 msec
Ended Job = job_1512874725514_0008
Copying data to local directory /home/hadoop/data_hadoop/get_tb_stud
Copying data to local directory /home/hadoop/data_hadoop/get_tb_stud
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 7.13 sec HDFS Read: 10392 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 130 msec
OK
Time taken: 47.258 seconds
hive>

hive> insert overwrite directory 'hdfs://'
    > select * from tb_stud;
Query ID = root_20171210015229_b0a323b1-b1dc-4f31-b932-cb8126bac2ff
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1512874725514_0009, Tracking URL = http://master:8088/proxy/application_1512874725514_0009/
Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job -kill job_1512874725514_0009
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2017-12-10 01:53:52,773 Stage-1 map = 0%, reduce = 0%
2017-12-10 01:54:07,829 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.65 sec
MapReduce Total cumulative CPU time: 2 seconds 650 msec
Ended Job = job_1512874725514_0009
Stage-3 is filtered out by condition resolver.
Stage-2 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1512874725514_0010, Tracking URL = http://master:8088/proxy/application_1512874725514_0010/
Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job -kill job_1512874725514_0010
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2017-12-10 01:54:55,697 Stage-2 map = 0%, reduce = 0%
2017-12-10 01:55:45,607 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.43 sec
MapReduce Total cumulative CPU time: 1 seconds 430 msec
Ended Job = job_1512874725514_0010
Moving data to: hdfs://
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 2.65 sec HDFS Read: 10412 HDFS Write: 515 SUCCESS
Stage-Stage-2: Map: 1 Cumulative CPU: 1.43 sec HDFS Read: 2313 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 80 msec
OK
Time taken: 202.508 seconds
hive> dfs -ls /user/hive/warehouse/tb_stud_get;
Found 1 items
-rwxr-xr-x   2 root supergroup        515 2017-12-10 01:55 /user/hive/warehouse/tb_stud_get/000000_0
hive>
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
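Notes: 1. ORDER BY sorts the whole input globally, so it uses a single reducer; with large inputs this can take a long time.
2. SORT BY is not a global sort; it sorts the data before it enters each reducer. So with SORT BY and mapred.reduce.tasks > 1, each reducer's output is sorted, but there is no global order.
3. DISTRIBUTE BY sends rows with the same value of the specified column(s) to the same reducer.
4. CLUSTER BY does what DISTRIBUTE BY does and also sorts by that column, so it is commonly regarded as CLUSTER BY = DISTRIBUTE BY + SORT BY.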
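The difference between the four clauses can be modeled in Python by treating each reducer as a separate list. This is a sketch of the semantics only, not Hive code; the function names are hypothetical, and rows are assumed to be routed to a reducer by key modulo the reducer count.

```python
def distribute(rows, key, num_reducers):
    """DISTRIBUTE BY: rows with the same key go to the same reducer."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        parts[key(row) % num_reducers].append(row)
    return parts


def sort_by(parts, key):
    """SORT BY: each reducer sorts only its own rows."""
    return [sorted(part, key=key) for part in parts]


def order_by(rows, key):
    """ORDER BY: one reducer sees everything, so the order is global."""
    return sorted(rows, key=key)


def cluster_by(rows, key, num_reducers):
    """CLUSTER BY modeled as DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute(rows, key, num_reducers), key)


rows = [5, 2, 4, 1, 3]
identity = lambda r: r
per_reducer = cluster_by(rows, identity, 2)
concatenated = [r for part in per_reducer for r in part]
print(per_reducer, concatenated, order_by(rows, identity))
```

Each reducer's list is sorted, yet concatenating the reducer outputs does not give a globally sorted result; only the single-reducer ORDER BY model does.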
select,,b.addr from a join b on =;
hive> create table tb_order_new
    > as
    > select id,name,memory,price
    > from tb_order;
-- create a table with the same structure as tab_ip
create table tab_ip_like like tab_ip;
-- bulk insert into the existing table
insert overwrite table tab_ip_like select * from tab_ip;
hive> create table tb_order_append(id int,name string,memory string,price double)
    > row format delimited
    > fields terminated by '\t';
hive> insert overwrite table tb_order_append
    > select * from tb_order;
hive> select * from tb_order_append;
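hive> create table tb_order_part(id int,name string,memory string,salary double)
    > partitioned by (month string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 0.13 seconds
Then load data into a partition of the new table. A partition is simply a subdirectory created under the table's directory, and the partition's data is placed in that subdirectory.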
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
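Hive supports equality joins, outer joins, and left semi joins. Hive does not support non-equality join conditions, because they are very hard to convert into map/reduce jobs.
Hive also supports joining more than two tables.
Keep these key points in mind when writing join queries:
1. Only equality joins are supported.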
ON ( = AND a.department = b.department)
2. More than two tables can be joined.
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
If every table in the join uses the same join key, the join is converted into a single map/reduce job, for example:
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c
ON (c.key = b.key1)
This is converted into a single map/reduce job, because only b.key1 is used as the join key for b.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
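This join, by contrast, is converted into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.
3. What each map/reduce job does during a join:
The reducer buffers the rows of every table in the join sequence except the last, and streams the last table through to produce the results that are written to the file system. This reduces the memory needed on the reduce side. In practice, put the largest table last (otherwise buffering it wastes a lot of memory). For example: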
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
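All tables use the same join key, so the join is computed in one map/reduce job. The reduce side buffers the rows of tables a and b, then computes join results for each row of table c as it is read. Similarly: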
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
This uses two map/reduce jobs. The first buffers table a and streams table b; the second buffers the result of the first job and streams table c.
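The buffer-then-stream logic can be sketched in Python as follows. This is a simplified single-process model, not Hive's reducer: the function name reduce_side_join is hypothetical, and all tables are assumed to be lists of (key, value) pairs joined on the same key.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(buffered_tables, streamed_table):
    """Buffer every table except the last in memory, then stream the last one."""
    buffers = []
    for table in buffered_tables:
        by_key = defaultdict(list)
        for key, val in table:
            by_key[key].append(val)
        buffers.append(by_key)
    # Only the streamed table is read row by row; nothing from it is retained.
    for key, val in streamed_table:
        groups = [buf.get(key, []) for buf in buffers]
        for combo in product(*groups):
            yield combo + (val,)


a = [(1, "a1"), (2, "a2")]
b = [(1, "b1")]
c = [(1, "c1"), (1, "c2"), (2, "c3")]  # the largest table goes last
result = list(reduce_side_join([a, b], c))
print(result)
```

Memory use grows with the size of the buffered tables only, which is why the text recommends writing the largest table last.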
4. The LEFT, RIGHT, and FULL OUTER keywords control how rows with no match are handled in a join.
SELECT a.val, b.val FROM
a LEFT OUTER JOIN b ON (a.key=b.key)
outputs a row for every row in table a. The output row is a.val, b.val when a.key=b.key; when no b.key equals a.key, the output is:
a.val, NULL
so every row of table a is kept;
"a RIGHT OUTER JOIN b" keeps every row of table b.
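A small Python model makes the NULL-padding behavior concrete. It is a sketch of standard outer-join semantics rather than anything Hive-specific; the function names are hypothetical, tables are lists of (key, val) pairs, and None stands in for NULL.

```python
def left_outer_join(a, b):
    """Return (a_val, b_val) pairs; b_val is None when no b.key equals a.key."""
    out = []
    for a_key, a_val in a:
        matches = [b_val for b_key, b_val in b if b_key == a_key]
        if matches:
            out.extend((a_val, m) for m in matches)
        else:
            out.append((a_val, None))  # every row of a is kept
    return out


def right_outer_join(a, b):
    """Same idea with the roles swapped: every row of b is kept."""
    return [(a_val, b_val) for b_val, a_val in left_outer_join(b, a)]


a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (3, "b3")]
print(left_outer_join(a, b))
print(right_outer_join(a, b))
```

Key 2 exists only in a, so the left outer join pads it with None; key 3 exists only in b, so the right outer join pads the a side instead.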
Joins are evaluated before the WHERE clause. To restrict the output of a join, put the filter in the WHERE clause, or in the join clause. A common source of confusion is partitioned tables:
SELECT a.val, b.val FROM a
LEFT OUTER JOIN b ON (a.key=b.key)
WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
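joins a to b (an OUTER JOIN) and lists a.val and b.val. Other columns can be used as filters in the WHERE clause. However, as described above, when no row of b matches a row of a, all columns of b are NULL, including ds. The WHERE condition on b.ds therefore filters out every joined row for which b had no matching join key, and the LEFT OUTER part of the join has no effect on the result. The fix is to use the following syntax in the OUTER JOIN: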
ON (a.key=b.key AND
    b.ds='2009-07-07' AND
    a.ds='2009-07-07')
The result of this query is filtered at join time, so the problem above does not arise. The same logic applies to RIGHT and FULL joins.
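The effect of moving the ds filter from WHERE into ON can be checked with a small Python model of a left outer join. This is a sketch of the SQL semantics, not Hive code; the helper name and sample rows are made up for illustration, and None stands in for a NULL-padded b row.

```python
def left_outer_join(a, b, on):
    """Pair each row of a with its matches in b, or with None if there are none."""
    out = []
    for ra in a:
        matches = [rb for rb in b if on(ra, rb)]
        if matches:
            out.extend((ra, rb) for rb in matches)
        else:
            out.append((ra, None))
    return out


DS = "2009-07-07"
a = [{"key": 1, "val": "a1", "ds": DS}, {"key": 2, "val": "a2", "ds": DS}]
b = [{"key": 1, "val": "b1", "ds": DS}]

# Filter in WHERE: a NULL b.ds can never equal the date, so padded rows vanish.
joined = left_outer_join(a, b, lambda ra, rb: ra["key"] == rb["key"])
where_result = [(ra["val"], rb["val"]) for ra, rb in joined
                if ra["ds"] == DS and rb is not None and rb["ds"] == DS]

# Filter in ON: the date test is part of matching, so unmatched a rows survive.
joined_on = left_outer_join(
    a, b, lambda ra, rb: ra["key"] == rb["key"] and rb["ds"] == DS and ra["ds"] == DS)
on_result = [(ra["val"], rb["val"] if rb else None) for ra, rb in joined_on]

print(where_result, on_result)
```

With the filter in WHERE the query behaves like an inner join; with the filter in ON, the row for key 2 is still returned with a NULL on the b side.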
Joins are not commutative. Joins are left-associative, whether they are LEFT or RIGHT joins.
SELECT a.val1, a.val2, b.val, c.val
FROM a
JOIN b ON (a.key = b.key)
LEFT OUTER JOIN c ON (a.key = c.key)
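first joins a to b, discarding every row whose join key has no match, then joins that intermediate result to c. This has a non-obvious consequence when a key exists in both a and c but not in b: the whole row (including a.val1, a.val2, and a.key) is dropped in the first join, a JOIN b. The intermediate result no longer contains that a.key, so the LEFT OUTER JOIN with c produces no row for it and its c.val never appears. If the last join were a RIGHT OUTER JOIN instead, that key would yield: NULL, NULL, NULL, c.val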
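The left-to-right evaluation can be traced with a small Python model. It is a sketch under simplifying assumptions (keys are unique within each table, tables are plain dicts, None stands in for NULL) and the sample data is made up. Under standard SQL semantics, a key present in a and c but missing from b disappears entirely when the last join is LEFT OUTER, and shows up as NULL, NULL, NULL, c.val when the last join is RIGHT OUTER.

```python
a = {1: ("a1v1", "a1v2"), 2: ("a2v1", "a2v2")}  # key -> (val1, val2)
b = {1: "b1"}                                   # key 2 is missing from b
c = {1: "c1", 2: "c2"}

# Step 1: a JOIN b (inner join) drops key 2 because b has no such key.
ab = [(k, a[k][0], a[k][1], b[k]) for k in a if k in b]
ab_keys = {k for k, *_ in ab}

# Step 2a: (a JOIN b) LEFT OUTER JOIN c keeps only rows that survived step 1.
left_result = [(v1, v2, bv, c.get(k)) for k, v1, v2, bv in ab]

# Step 2b: (a JOIN b) RIGHT OUTER JOIN c keeps every row of c instead.
right_result = [(v1, v2, bv, c[k]) for k, v1, v2, bv in ab if k in c]
right_result += [(None, None, None, cv) for k, cv in c.items() if k not in ab_keys]

print(left_result)
print(right_result)
```

The a values for key 2 are lost in step 1, so even the right outer join cannot bring them back; only c.val for that key remains.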