Programming Hive (Hive编程指南)

I. Hive "one-shot" commands

1. -S turns on silent mode, which removes lines such as OK and Time taken from the output

hive -e "select * from movie_table limit 3"
...
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 11.631 seconds, Fetched: 3 row(s)


[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" 
...
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
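
When a one-shot query is launched from a script rather than typed by hand, the same flags can be assembled programmatically. The following is a minimal Python sketch, assuming the hive executable is on PATH. The helper name hive_one_shot_args is hypothetical, and the function only builds the argument list; it does not start Hive.

```python
import shlex

def hive_one_shot_args(query, silent=False):
    """Build the argv for `hive [-S] -e <query>` (hypothetical helper)."""
    args = ["hive"]
    if silent:
        args.append("-S")   # silent mode: no OK / Time taken lines
    args += ["-e", query]   # -e runs the given query string, then exits
    return args

args = hive_one_shot_args("select * from movie_table limit 3", silent=True)
# shlex.join shows the command as it would be typed in a shell
print(shlex.join(args))
```

The list can be passed to subprocess.run(args, capture_output=True, text=True) on a machine where Hive is installed.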

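2. Write the query result to the local file test.txt (on the local filesystem, not HDFS). The shell redirection creates test.txt, so the file does not need to exist beforehand.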

[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" > /usr/local/src/test3/hive/test.txt

(py27tf) [root@master hive-1.2.2]# cat /usr/local/src/test3/hive/test.txt
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
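
Once the result sits in a local file it can be post-processed outside Hive. Below is a small Python sketch, assuming the CLI output is tab-separated and that the first line is a header row (as in the file above). The function name parse_hive_tsv is made up for this example, and the sample text is copied from the output shown above.

```python
import csv
import io

def parse_hive_tsv(text):
    """Parse tab-separated Hive CLI output whose first line is a header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]      # skip empty lines
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]

sample = (
    "movieId\ttitle\tgenres\n"
    "1\tToy Story (1995)\tAdventure|Animation|Children|Comedy|Fantasy\n"
    "2\tJumanji (1995)\tAdventure|Children|Fantasy\n"
)
for record in parse_hive_tsv(sample):
    print(record["movieId"], record["title"], record["genres"].split("|"))
```

To read the real file, pass open("/usr/local/src/test3/hive/test.txt").read() instead of sample.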

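The following command looks up the warehouse-related properties, which determine where managed tables are stored (I have not fully understood this yet; reread p. 35 of the book)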

[root@master hive-1.2.2]# hive  -e "seT" | grep warehouse

hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true

3. Run Hive queries from a file: use source inside the Hive shell, or hive -f from the terminal; both give the same result

hive>  source /usr/local/src/test3/select.sql;
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 2.976 seconds, Fetched: 3 row(s)


[root@master test3]# hive -f /usr/local/src/test3/select.sql 
...
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 5.915 seconds, Fetched: 3 row(s)

Loading data into a table

1. Create the table
hive> create table test(line string);
OK
Time taken: 0.82 seconds
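2. Load the data
[root@master hive-1.2.2]# hive -e "load data local inpath '/usr/local/src/test3/test.txt' into table test"
or, from the Hive shell (overwrite replaces the table's existing contents instead of appending to them):
hive> load data local inpath '/usr/local/src/test3/test.txt' overwrite into table test;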

3. Query
hive> select * from test;
OK
one row
Time taken: 0.4 seconds, Fetched: 1 row(s)

Running shell commands

Prefix the command with ! and end it with ;

hive> ! pwd;
/usr/local/src/hive-1.2.2
hive> ! echo "what up dog";
"what up dog"

 

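Using hadoop commands inside Hive: for a command such as hadoop dfs -ls /, simply drop the leading hadoop

Advantage of running hadoop commands inside Hive: the hadoop command starts a new JVM instance every time it is invoked, whereas Hive runs the command inside its own process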

hive> dfs -ls / ;
Found 8 items
drwxr-xr-x   - root supergroup          0 2019-03-04 09:41 /7
-rw-r--r--   3 root supergroup     632207 2018-12-13 14:01 /The_Man_of_Property.txt
-rw-r--r--   3 root supergroup        698 2019-05-13 18:32 /a8a
drwxr-xr-x   - root supergroup          0 2019-05-27 10:29 /hbase
drwxr-xr-x   - root supergroup          0 2019-05-27 14:47 /hbase_test
drwxr-xr-x   - root supergroup          0 2019-05-26 16:49 /hive
drwx-wx-wx   - root supergroup          0 2019-05-26 10:18 /tmp
drwxr-xr-x   - root supergroup          0 2019-05-26 10:43 /user

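Comments in Hive scripts: a line that starts with -- is a comment

Tip: the CLI does not parse comment lines; entering them in the CLI produces an error. They only work in script files run with hive -f, for example hive -f **.hql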

--copyright (c) 2012 Megacorp,LLC.
--This is the best Hive script evar!!

select * from table;
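
If statements from a commented script need to be pasted into the CLI anyway, the comment lines can be removed first. Here is a minimal Python sketch. The helper name strip_comment_lines is hypothetical, and it only handles whole-line comments that start with --; it does not deal with -- appearing after a statement or inside a string literal.

```python
def strip_comment_lines(script):
    """Drop whole-line `--` comments from a Hive script (simplified)."""
    kept = [line for line in script.splitlines()
            if not line.lstrip().startswith("--")]
    return "\n".join(kept).strip()

script = """--copyright (c) 2012 Megacorp,LLC.
--This is the best Hive script evar!!

select * from table;
"""
print(strip_comment_lines(script))
```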

Printing column headers (off by default). To turn this on permanently, add set hive.cli.print.header=true; to the $HOME/.hiverc file.

The column headers are: movie_table.movieid     movie_table.title       movie_table.genres

hive> select * from movie_table limit 2;
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.159 seconds, Fetched: 2 row(s)

hive> set hive.cli.print.header=true;

hive> select * from movie_table limit 2;
OK
movie_table.movieid     movie_table.title       movie_table.genres
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.148 seconds, Fetched: 2 row(s)

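Chapter 4      HiveQL: Data Definition (creating, altering, and dropping databases, tables, views, functions, and indexes)

1. As described in the book, Hive does not support row-level inserts, updates, or deletes, and it does not support transactions. (Hive 0.14 and later add limited support through ACID transactional tables.)

2. A database in Hive is essentially just a catalog or namespace of tables

3. Create the database financials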

hive> create database financials;
OK
Time taken: 0.114 seconds

hive> show databases;
OK
default
financials
Time taken: 0.036 seconds, Fetched: 2 row(s)

If the database financials already exists, creating it again raises an error; the following form suppresses the error:

hive> create database if not exists financials;

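Filter database names with a pattern, where * matches any sequence of characters (here: list all databases whose names start with f)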

hive> show databases like 'f*';
OK
financials
Time taken: 0.051 seconds, Fetched: 1 row(s)
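
To make the matching rule concrete, here is a rough Python model of how a pattern such as 'f*' selects names. It assumes only two special characters, * (any run of characters) and | (alternatives), and treats everything else literally; Hive's real matching may differ in details such as case handling. The function names are invented for this sketch.

```python
import re

def show_pattern_to_regex(pattern):
    """Translate a SHOW ... LIKE style pattern into a regex (simplified model)."""
    alternatives = []
    for alt in pattern.split("|"):                  # | separates alternatives
        pieces = [re.escape(piece) for piece in alt.split("*")]
        alternatives.append(".*".join(pieces))      # * matches any run of characters
    return re.compile("^(?:" + "|".join(alternatives) + ")$")

def filter_names(names, pattern):
    regex = show_pattern_to_regex(pattern)
    return [name for name in names if regex.match(name)]

print(filter_names(["default", "financials"], "f*"))
```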

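Hive creates a directory for each database (the directory name ends in .db), and the tables of that database are stored as subdirectories of it. The default database is the exception: it has no directory of its own.

Changing the database directory:

1. Through the configuration file

The HDFS directory under which databases are stored is set in hive-site.xml by the property hive.metastore.warehouse.dir. Its default value is /user/hive/warehouse, and it can be changed to another directory.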

[root@master Programming_Hive]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-05-28 10:49 /user/hive/warehouse/financials.db
drwxr-xr-x   - root supergroup          0 2019-05-28 10:50 /user/hive/warehouse/human_resources.db
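
The directory layout described above can be written down as a small path rule. This is a Python sketch under two assumptions: no custom location was given when the database or table was created, and tables of the default database sit directly under the warehouse directory. The function names are hypothetical.

```python
import posixpath

def database_dir(warehouse_dir, database):
    """Directory of a database under the warehouse directory."""
    if database == "default":
        return warehouse_dir                     # default has no .db directory
    return posixpath.join(warehouse_dir, database + ".db")

def table_dir(warehouse_dir, database, table):
    """Directory of a managed table created without an explicit location."""
    return posixpath.join(database_dir(warehouse_dir, database), table)

warehouse = "/user/hive/warehouse"
print(database_dir(warehouse, "financials"))
print(table_dir(warehouse, "change", "employees3"))
```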

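2. From the CLI: set the location when creating a database (this changes the location of that one database only). A description of the database can be added in the same statement.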

hive> create database change22
    > comment 'holds all financial tables'
    > location '/hive_test';
OK
Time taken: 0.043 seconds

hive> desc database change22;
OK
change22        holds all financial tables      hdfs://master:9000/hive_test    root    USER
Time taken: 0.038 seconds, Fetched: 1 row(s)

Attach key-value properties to a database.

To view them, use: desc database extended change;

hive> create database change
    > with dbproperties ("creator" = "Jason Chan","data"="2019-05-28");
OK
Time taken: 0.109 seconds

hive> desc database extended change;
OK
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=Jason Chan, data=2019-05-28}
Time taken: 0.02 seconds, Fetched: 1 row(s)

Modify or add database properties (existing properties cannot be deleted)

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=Jason Chan, data=2019-05-28}

hive> alter database change set dbproperties ("creator"="jason");

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=jason, data=2019-05-28}

hive> alter database change set dbproperties ("edited-by"="Joe");

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=jason, data=2019-05-28, edited-by=Joe}

 

 

Switch to a database: use

hive> show databases;
OK
default
financials
Time taken: 0.033 seconds, Fetched: 2 row(s)

hive> use financials;
OK
Time taken: 0.065 seconds

Show the current database in the prompt (that is, which database the tables you see belong to): set hive.cli.print.current.db=true;

hive> set hive.cli.print.current.db=true;

hive (financials)> use default;
OK
Time taken: 0.032 seconds

hive (default)> set hive.cli.print.current.db=false;

Drop a database: drop

hive> drop database if exists human_resources;
OK
Time taken: 0.078 seconds

By default, Hive does not allow dropping a database that still contains tables

hive> use traffic;
OK
Time taken: 0.039 seconds
    //the traffic database contains tables
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.05 seconds, Fetched: 2 row(s)
    //dropping the traffic database fails
hive> drop database traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

Solutions:

1. Drop the tables in the database first:

    //drop the tables
hive> drop table monitor_camera_info;
OK
Time taken: 0.132 seconds
hive> drop table monitor_flow_action;
OK
Time taken: 0.236 seconds
    //now the database can be dropped
hive> drop database traffic;
OK
Time taken: 0.108 seconds

2. Use the cascade keyword:

    //the traffic database contains tables
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.06 seconds, Fetched: 2 row(s)

    //a plain drop fails
hive> drop database if exists traffic; 
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

    //the restrict keyword (the default behavior) cannot drop the database either
hive> drop database if exists traffic restrict; 
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

    //with the cascade keyword the drop succeeds (the tables are dropped first)
hive> drop database if exists traffic cascade; 
OK
Time taken: 0.25 seconds

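Create a table (employees3 in the change database), adding a comment after each column type

What TBLPROPERTIES is for: it attaches extra documentation to the table as key-value pairs (it can also carry metadata needed for database connections). See p. 53.

Hive automatically adds two properties:

1. last_modified_by: the user name of the last user to modify the table

2. last_modified_time: the epoch time, in seconds, of the last modification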

    //table creation SQL
create table if not exists change.employees3 (
    name            string  comment "employee name",
    salary          float   comment "employee salary",
    subordinates    array<string>   comment "names of subordinates",
    deductions      map<string, float>
                    comment "keys are deductions names, values are percentages",
    address         struct<street:string, city:string, state:string, zip:int>
                    comment "home address")
comment "description of the table"
TBLPROPERTIES ('creator'='me','created_at'='2019-05-28')
;


    //check the result

hive> desc  employees3;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
Time taken: 0.116 seconds, Fetched: 5 row(s)

Add a comment to the table itself and define one or more custom table properties

How to view the table's description (comment "description of the table"): show tblproperties employees3;

hive> show tblproperties employees3;
OK
comment description of the table
created_at      2019-05-28
creator me
transient_lastDdlTime   1559027367
Time taken: 0.128 seconds, Fetched: 4 row(s)
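
The transient_lastDdlTime value above, like last_modified_time, is a count of seconds since the Unix epoch. A short Python sketch for turning such a value into a readable timestamp follows; the helper name is made up, and the result is given in UTC, so it can differ from the cluster's local time.

```python
from datetime import datetime, timezone

def epoch_seconds_to_utc(value):
    """Convert an epoch-seconds property value (string or int) to a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)

# transient_lastDdlTime taken from the show tblproperties output above
print(epoch_seconds_to_utc("1559027367").isoformat())
```

This prints 2019-05-28T07:09:27+00:00, which agrees with the created_at date set in TBLPROPERTIES.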

 

 

 

 

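You can also copy the schema of an existing table (without copying its data): create table employees4 in the change database with the same schema as employees3

Put the following statement into a file named employee4.3_2.sql, then run hive -f employee4.3_2.sql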

[root@master Programming_Hive]# cat employee4.3_2.sql 
create table if not exists change.employees4
like change.employees3;

[root@master Programming_Hive]# hive  -f  employee4.3_2.sql 

Inspect the new table employees4

hive> show tables;
OK
employees3
employees4
Time taken: 0.081 seconds, Fetched: 2 row(s)
hive> desc employees4;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
Time taken: 0.136 seconds, Fetched: 5 row(s)

List the tables of the change database while the current database is default: show tables in change;

hive> set hive.cli.print.current.db=true;
hive (default)> show tables in change;
OK
employees3
employees4
Time taken: 0.051 seconds, Fetched: 2 row(s)
hive (default)> set hive.cli.print.current.db=false;
hive> 

Filter tables with a pattern: show tables 'empl*'; lists all tables whose names start with empl

hive> show tables "empl*";
OK
employees3
employees4
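Time taken: 0.034 seconds, Fetched: 2 row(s)

From the default database, view the structure of a table in the change database: desc extended change.employees3; In practice FORMATTED is usually preferred over EXTENDED, because its output is more detailed and easier to read.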

hive (default)> desc extended change.employees3;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
                 
Detailed Table Information      Table(tableName:employees3, dbName:change, owner:root
...
location:hdfs://master:9000/user/hive/warehouse/change.db/employees3,  
...
parameters:{creator=me, transient_lastDdlTime=1559027367, created_at=2019-05-28, comment=description of the table}, ...)
Time taken: 0.091 seconds, Fetched: 7 row(s)

View the information for a single column, here salary (adding extended gives the same result)

hive (change)> describe employees3.salary;
OK
salary                  float                   from deserializer   
Time taken: 0.117 seconds, Fetched: 1 row(s)

4.3.2 External tables

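Create an external table that reads all files under the HDFS directory /hive/programming_hive/data/stocks

Dropping an external table does not delete the table's data; it only deletes the metadata that describes the table

[root@master Programming_Hive]# cat 4.3.2stock.sql
create external table if not exists stocks (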
`exchange` string,
`symbol` string,
`ymd` string,
`price_open` float,
`price_high` float,
`price_low` float,
`price_close` float,
`volume` int,
`price_adj_close` float)
row format delimited fields terminated by ','
location '/hive/programming_hive/data/stocks';

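exchange is a reserved keyword in Hive, so renaming the column avoids the problem. If you must use this name, quote it with backticks (the key above Tab), as in the statement above.

Strictly speaking, although Hive manages the directories and files behind both managed and external tables, it does not have complete control over them, because other tools can still read and modify those files directly

Check whether a table is managed or external: desc formatted movie_table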

hive (default)> desc FORMATTED movie_table;
...  
Table Type:             EXTERNAL_TABLE                    
...    
hive (default)> desc FORMATTED jason;
...    
Table Type:             MANAGED_TABLE            
...

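Create an external table that copies the schema of a managed table (no data is copied): employees5 is the external table, employees3 the managed table

        If the source table is itself external, the external keyword can be omitted and the copy is still an external table

[root@master Programming_Hive]# cat 4.3.2external_table.sql 
create external table if not exists change.employees5 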
like change.employees3
location '/hive/programming_hive/test';

[root@master Programming_Hive]# hive -f 4.3.2external_table.sql 
...
OK
Time taken: 1.939 seconds

employees5 has the same structure as employees3, but its table type is external

hive (change)> desc formatted employees5;
OK
# col_name              data_type               comment             
                 
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
...                   
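Table Type:             EXTERNAL_TABLE

4.4  Partitioned, managed tables

If a table holds a very large amount of data spread over many partitions, a query that spans all partitions can trigger an enormous MapReduce job.

Recommendation: put Hive in strict mode, which refuses to submit a query against a partitioned table when its where clause has no partition filter.
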
hive> set hive.mapred.mode=strict;

hive> select e.name,e.salary from employees e limit 10;

FAILED:Error in semantic analysis: No partition predicate found for Alias  "e" Table  "employees"

hive> set hive.mapred.mode=nonstrict;

hive> select e.name,e.salary from employees e limit 10;

John  Doe    10000.0  

...
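
The rule that strict mode applies to partitioned tables can be summarized in a few lines. This is only a toy Python model of the partition check (strict mode also restricts other kinds of queries, which are not modeled here); the function name and its arguments are invented for illustration, and the dt column is borrowed from the rating_table_p table used in these notes.

```python
def partition_check_passes(mode, partition_columns, where_columns):
    """Return True if a query passes the partition-filter rule (toy model)."""
    if mode != "strict" or not partition_columns:
        return True     # nonstrict mode, or the table is not partitioned
    # strict mode: the where clause must filter on at least one partition column
    return any(column in partition_columns for column in where_columns)

print(partition_check_passes("strict", ["dt"], []))            # no partition filter
print(partition_check_passes("strict", ["dt"], ["dt"]))        # filtered on dt
print(partition_check_passes("nonstrict", ["dt"], []))         # check disabled
```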

 

List the partitions of a table: show partitions

hive (default)> show partitions rating_table_p;
OK
dt=2008-03
dt=2008-08
Time taken: 0.136 seconds, Fetched: 2 row(s)
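
Each line printed by show partitions is a partition spec made of key=value pairs, with multiple partition levels separated by /. Here is a minimal Python sketch for turning such a line into a dictionary. The helper name is hypothetical, and the two-level spec in the example is invented to show the separator; only dt=2008-03 comes from the output above.

```python
def parse_partition_spec(spec):
    """Split a partition spec such as 'dt=2008-03' into {column: value}."""
    return dict(part.split("=", 1) for part in spec.split("/"))

print(parse_partition_spec("dt=2008-03"))
# hypothetical two-level partition, not from the tables in these notes
print(parse_partition_spec("country=US/state=CA"))
```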

To see the partition keys, run desc extended rating_table_p; and look for partitionKeys in the output

hive (default)> desc extended rating_table_p;
OK
userid                  string                                      
movieid                 string                                      
rating                  string                                      
dt                      string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
                 
partitionKeys:[FieldSchema(name:dt, type:string, comment:null)]

Partitioning:

1. Declare the partition columns when creating the table

2. Create the individual partitions when loading data

4.4.1