Hadoop: Basic Usage of Hive

I. Hive Storage Modes

Tables created in Hive are often referred to as metastore tables: they do not store the data themselves, but define the mapping between the real data and Hive, much like a table's meta information in a traditional database, hence the name metastore.

In terms of actual storage, four table types can be defined:

Internal (managed) table (the default)
Partitioned table
Bucketed table
External table

For details, see the Chinese reference manual: https://www.docs4dev.com/docs/zh/apache-hive/3.1.1/reference

II. Internal Tables

Creating an internal table

hive> CREATE TABLE worker(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

This creates an internal table called worker. Internal is the default table type, so no storage mode needs to be specified, and a comma is used as the field delimiter.

'\054' is the octal ASCII code for a comma. Hive has no dedicated data storage format and builds no indexes on the data, so you are free to organize Hive tables however you like; you only need to tell Hive the column delimiter and row delimiter when creating the table, and Hive can then parse the data.

For example:

hive> create table user_info (user_id int, cid string, ckid string, username string)  row format delimited  fields terminated by '\t' lines terminated by '\n';

The format of the data to be imported: fields are separated by tab characters and rows by newlines.

The file content looks like this:

100636  100890  c5c86f4cddc15eb7        yyyvybtvt
100612  100865  97cc70d411c18b6f        gyvcycy
100078  100087  ecd6026a15ffddf5        qa000100

Table storage location

> sudo -u hdfs hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt  - root supergroup          0 2020-07-03 03:34 /user/hive/warehouse/worker

Inserting data:

Hive does not support single-row insert statements; data must be loaded in batches, so you cannot write insert into worker values(1,'zhangsan') as you would in a traditional SQL database.
There are two ways to insert data:

The first: load data from a file
The second: insert data selected from another table (insert from select), as sketched below
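
The insert-from-select form also appears later in the bucketing section; a minimal sketch, assuming a source table src_worker with the same two columns (the table name is hypothetical):

hive> INSERT INTO TABLE worker SELECT id, name FROM src_worker;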

Loading data from a file

> cat /worker.txt
1,zhangsan
2,lisi
3,wangwu
4,zhaoliu

Load the data into the table:

hive> LOAD DATA LOCAL INPATH '/worker.txt' INTO TABLE worker;
Loading data to table default.worker
Table default.worker stats: [numFiles=1, totalSize=37]
OK
Time taken: 4.477 seconds

hive> select * from worker;
OK
1 zhangsan
2 lisi
3 wangwu
4 zhaoliu
Time taken: 1.293 seconds, Fetched: 4 row(s)

Check how the data is stored:

> sudo -u hdfs hadoop fs -ls /user/hive/warehouse/worker
Found 1 items
-rwxrwxrwt  3 root supergroup        37 2020-07-03 03:34  /user/hive/warehouse/worker/worker.txt

Loading more data

> cat /worker.abc
5,tianqi
6,wangba
7,pijiu

Note that Hive does not require any particular file extension; the extension can be anything you like.

hive> LOAD DATA LOCAL INPATH '/worker.abc' INTO TABLE worker;

After this load, select * from worker returns all seven rows, and the table directory /user/hive/warehouse/worker contains both worker.txt and worker.abc.

The difference between LOAD DATA LOCAL INPATH and LOAD DATA INPATH is that the former reads the source file from the local disk of the machine running the client, while the latter reads it from HDFS.
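
For example (the HDFS path below is hypothetical):

hive> LOAD DATA LOCAL INPATH '/worker.txt' INTO TABLE worker;    -- reads /worker.txt from the local filesystem of the client machine
hive> LOAD DATA INPATH '/tmp/worker.txt' INTO TABLE worker;      -- reads /tmp/worker.txt from HDFS and moves it into the table directory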

III. Partitioned Tables

Partitioned tables are used to speed up queries, relying mainly on filtering by the specified partition condition.

For example: if we query data by date, we should partition the table by date.

Create a partitioned table:

hive> create table partition_student(id int,name string)
    > partitioned by(daytime string)
    > row format delimited fields TERMINATED BY '\054';
OK
Time taken: 2.492 seconds

hive> show tables;
OK
partition_student
worker
Time taken: 8.952 seconds, Fetched: 2 row(s)

Create the data files:

> vi 2020070301
1,zhangsan
2,lisi
3,wangwu
4,zhaoliu

> vi 2020070302
33,tianqi
44,xiongda
55,xionger

The data files here are comma-separated (matching the '\054' delimiter in the table definition), one record per line. Load each file into its own partition:

hive> LOAD DATA LOCAL INPATH '/2020070301' INTO TABLE partition_student partition(daytime='2020070301');
Loading data to table default.partition_student partition (daytime=2020070301)
Partition default.partition_student{daytime=2020070301} stats: [numFiles=1, numRows=0, totalSize=37, rawDataSize=0]
OK
Time taken: 56.454 seconds

hive> LOAD DATA LOCAL INPATH '/2020070302' INTO TABLE partition_student partition(daytime='2020070302');
Loading data to table default.partition_student partition (daytime=2020070302)
Partition default.partition_student{daytime=2020070302} stats: [numFiles=1, numRows=0, totalSize=32, rawDataSize=0]
OK
Time taken: 9.389 seconds

Note: every time a file is loaded, the partition value (daytime=...) must be specified by hand; it is what later partition queries filter on.
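
The partitions created so far can be listed with SHOW PARTITIONS; after the two loads above it should report daytime=2020070301 and daytime=2020070302:

hive> show partitions partition_student;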

hive> select * from partition_student where daytime='2020070301';
OK
1      zhangsan        2020070301
2      lisi    2020070301
3      wangwu  2020070301
4      zhaoliu 2020070301
Time taken: 15.493 seconds, Fetched: 4 row(s)

hive> select * from partition_student where daytime='2020070302';
OK
33      tianqi  2020070302
44      xiongda 2020070302
55      xionger 2020070302
Time taken: 2.219 seconds, Fetched: 3 row(s)

hive> select * from partition_student;
OK
1      zhangsan        2020070301
2      lisi    2020070301
3      wangwu  2020070301
4      zhaoliu 2020070301
33     tianqi  2020070302
44     xiongda 2020070302
55     xionger 2020070302
Time taken: 4.861 seconds, Fetched: 7 row(s)

Note:
A query such as select * from partition_student where daytime='2020070302' also supports AND in the WHERE clause; this is intended for tables with multiple partition columns, which must be declared when the table is created.

For example:

create table student(id int, name string)
partitioned by(daytime string, telnum string)  -- two partition columns are declared, so AND can be used after WHERE
row format delimited fields TERMINATED BY '\054';
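
With two partition columns, every load must supply both values, and queries can then filter on both; a minimal sketch (the file path and the telnum value are hypothetical):

hive> LOAD DATA LOCAL INPATH '/2020070301' INTO TABLE student partition(daytime='2020070301', telnum='137');
hive> select * from student where daytime='2020070301' and telnum='137';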

Storage layout

> sudo -u hdfs hadoop fs -ls /user/hive/warehouse/partition_student
Found 2 items
drwxrwxrwt  - root supergroup          0 2020-07-03 04:02 /user/hive/warehouse/partition_student/daytime=2020070301
drwxrwxrwt  - root supergroup          0 2020-07-03 04:03 /user/hive/warehouse/partition_student/daytime=2020070302

IV. Bucketing

Bucketing divides data at a finer granularity than partitioning.
Bucketing distributes every row according to the hash of a chosen column's value. For example, to split the data into 3 buckets by the name column, the hash of each name value is taken modulo 3 and each row is assigned to a bucket based on the result.
Rows whose result is 0 are stored in one file, rows with result 1 in another file, and rows with result 2 in a third file.

Unlike bucketing, partitioning does not rely on a real column in the data files but on a pseudo-column that we specify,
whereas bucketing relies on a column that actually exists in the table. This is why the type of a partition column must be specified: the column does not exist in the data files, so it is effectively a new column. A bucketing column, by contrast, already exists in the table and its type is already known, so no type needs to be specified.

1. Create the table

Bucketing is declared with clustered by(column) into bucket_num buckets, meaning the data is split into bucket_num buckets according to that column.

> create table test_bucket (
> id int comment 'ID',
> name string comment '名字'
> )
> comment '测试分桶'
> clustered by(id) into 3 buckets
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

Test data

> for i in {1..10}; do echo $i,name$i >> bucket_data.txt;done
> cat bucket_data.txt
1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
7,name7
8,name8
9,name9
10,name10

Load data

Loading data directly with LOAD DATA does not produce any bucketing effect; the result is the same as an unbucketed table, with only one file on HDFS.

load data local inpath '/bucket_data.txt' into table test_bucket;

An intermediate table is needed instead:

> create table test (
> id int comment 'ID',
> name string comment '名字'
> )
> comment '测试分桶中间表'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
OK
Time taken: 0.483 seconds

hive> load data local inpath '/bucket_data.txt' into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, totalSize=82]
OK
Time taken: 2.077 seconds

Then use the statements below to insert the data from the intermediate table into the bucketed table; this produces three files.

First set hive.enforce.bucketing = true, which forces Hive to honor the bucket definition during the insert:

hive> set hive.enforce.bucketing = true;
hive> insert into test_bucket select * from test;
Query ID = root_20200703042727_918a62fa-bdce-4ca8-8a07-e1db2d085076
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1593650776473_0001, Tracking URL = http://node1.hadoop.com:8088/proxy/application_1593650776473_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593650776473_0001

The distributed job can be viewed at http://172.26.37.245:19888/jobhistory/app.

Check the file layout:

> sudo -u hdfs hadoop fs -ls /user/hive/warehouse/test_bucket
Found 3 items
-rwxrwxrwt  3 hdfs supergroup        24 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000000_0
-rwxrwxrwt  3 hdfs supergroup        34 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000001_0
-rwxrwxrwt  3 hdfs supergroup        24 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000002_0
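
A quick way to spot-check the bucketing is to sample a single bucket with TABLESAMPLE; a minimal sketch against the table above:

hive> select * from test_bucket tablesample(bucket 1 out of 3 on id);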

V. External Tables

An external table's data is not stored by Hive itself; it can rely on HBase for storage, for example, with Hive only maintaining a mapping to it.
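
Besides HBase, an external table can also simply point at a directory on HDFS through a LOCATION clause; dropping such a table removes only the metadata, not the files. A minimal sketch (the path is hypothetical):

hive> CREATE EXTERNAL TABLE ext_worker(id INT, name STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/data/worker';

The example below instead maps an existing HBase table into Hive.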

1. Create the HBase table

> hbase shell
hbase(main):011:0> create 'student','info'
0 row(s) in 2.2390 seconds
=> Hbase::Table - student
hbase(main):012:0> put 'student',1,'info:id',1
0 row(s) in 0.2780 seconds
hbase(main):014:0> put 'student',1,'info:name','zhangsan'
0 row(s) in 0.0200 seconds
hbase(main):016:0> put 'student',2,'info:id',2
0 row(s) in 0.0100 seconds
hbase(main):018:0> put 'student',2,'info:name','lisi'
0 row(s) in 0.0100 seconds
hbase(main):019:0> scan 'student'
ROW         COLUMN+CELL                                                               
1                column=info:id, timestamp=1556032332227, value=1                       
1                column=info:name, timestamp=1556032361655, value=zhangsan    
2                column=info:id, timestamp=1556032380941, value=2                        
2                column=info:name, timestamp=1556032405776, value=lisi                
2 row(s) in 0.0490 seconds

Create the mapping between HBase and Hive:

hive> CREATE EXTERNAL TABLE ex_student(key int, id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name") 
    > TBLPROPERTIES ("hbase.table.name" = "student"); 
OK
Time taken: 5.019 seconds

hive> select * from ex_student;
OK
1 1 zhangsan
2 2 lisi
Time taken: 1.072 seconds, Fetched: 2 row(s)

File layout

> sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 6 items
drwxrwxrwt  - root supergroup          0 2020-07-03 23:19 /user/hive/warehouse/ex_student
drwxrwxrwt  - root supergroup          0 2020-07-03 23:18 /user/hive/warehouse/h_employee
drwxrwxrwt  - root supergroup          0 2020-07-03 22:03 /user/hive/warehouse/partition_student
drwxrwxrwt  - root supergroup          0 2020-07-03 22:49 /user/hive/warehouse/test
drwxrwxrwt  - root supergroup          0 2020-07-03 23:06 /user/hive/warehouse/test_bucket
drwxrwxrwt  - root supergroup          0 2020-07-03 21:42 /user/hive/warehouse/worker
> sudo -u hdfs hadoop fs -ls /user/hive/warehouse/ex_student/

Because this is only a mapping, no stored data files are visible here.

Reposted from: https://www.jianshu.com/p/5857f0a3da61
