一、Hive中“一次使用的命令”
1. -S表示静默模式,结果去掉OK和Time taken等行
hive -e "select * from movie_table limit 3"
...
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 11.631 seconds, Fetched: 3 row(s)
[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3"
...
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
2.将查询结果输出到本地文件test.txt中(非HDFS) test.txt是程序执行时自己创建的,无需提前创建
[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" > /usr/local/src/test3/hive/test.txt
(py27tf) [root@master hive-1.2.2]# cat /usr/local/src/test3/hive/test.txt
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
通过以下命令查询管理表的warehouse属性记录(我也还没搞懂,再看看书P35)
[root@master hive-1.2.2]# hive -e "seT" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
3.文件执行hive查询 shell:source 终端使用hive -f 得到的结果相同
hive> source /usr/local/src/test3/select.sql;
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 2.976 seconds, Fetched: 3 row(s)
[root@master test3]# hive -f /usr/local/src/test3/select.sql
...
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 5.915 seconds, Fetched: 3 row(s)
表里写入数据
1.建表
create table test(line string);
OK
Time taken: 0.82 seconds
2.插数据
[root@master hive-1.2.2]# hive -e "load data local inpath '/usr/local/src/test3/test.txt' into table test"
或者
hive> load data local inpath '/usr/local/src/test3/test.txt' overwrite into table test;
3.查询
hive> select * from test;
OK
one row
Time taken: 0.4 seconds, Fetched: 1 row(s)
执行shell命令
在命令前加上! 并以;结尾
hive> ! pwd;
/usr/local/src/hive-1.2.2
hive> ! echo "what up dog";
"what up dog"
在hive内使用hadoop命令 hadoop dfs -ls / 只需要去掉hadoop即可
在hive内使用hadoop命令的优点:hadoop中每次都会启动一个新的JVM实例,而hive在同一个进程中执行
hive> dfs -ls / ;
Found 8 items
drwxr-xr-x - root supergroup 0 2019-03-04 09:41 /7
-rw-r--r-- 3 root supergroup 632207 2018-12-13 14:01 /The_Man_of_Property.txt
-rw-r--r-- 3 root supergroup 698 2019-05-13 18:32 /a8a
drwxr-xr-x - root supergroup 0 2019-05-27 10:29 /hbase
drwxr-xr-x - root supergroup 0 2019-05-27 14:47 /hbase_test
drwxr-xr-x - root supergroup 0 2019-05-26 16:49 /hive
drwx-wx-wx - root supergroup 0 2019-05-26 10:18 /tmp
drwxr-xr-x - root supergroup 0 2019-05-26 10:43 /user
Hive脚本注释 使用--开头的字符串表示注释
Tips:CLI不能解析注释,会产生报错信息,只能在脚本中通过hive -f **.hql 的方式执行
--copyright (c) 2012 Megacorp,LLC.
--This is the best Hive script evar!!
select * from table;
显示字段名称(默认是关闭的): (可在$HOME/.HIVRRC文件中配置修改为默认开启:set hive.cli.print.header=true;)
字段名称为:movie_table.movieid movie_table.title movie_table.genres
hive> select * from movie_table limit 2;
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.159 seconds, Fetched: 2 row(s)
hive> set hive.cli.print.header=true;
hive> select * from movie_table limit 2;
OK
movie_table.movieid movie_table.title movie_table.genres
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.148 seconds, Fetched: 2 row(s)
Chapter 4 HiveQL:数据定义(创建、修改、删除数据库、表、视图、函数和索引)
1.Hive不支持行级插入操作、更新操作和删除操作。Hive也不支持事物。
2.Hive中数据库的概念本质上仅仅只是表的一个目录或者命名空间
3.创建数据库financials
hive> create database financials;
OK
Time taken: 0.114 seconds
hive> show databases;
OK
default
financials
Time taken: 0.036 seconds, Fetched: 2 row(s)
如果数据库financials已经存在,创建时就会报错,以下命令可以不抛出错误信息;
hive> create database if not exists financials;
用正则表达式匹配筛选数据库名(列举出所有f开头的数据库)
hive> show databases like 'f*';
OK
financials
Time taken: 0.051 seconds, Fetched: 1 row(s)
Hive为每一个数据库创建一个目录(目录名以*.db结尾),数据库中的表将会以这个数据库目录的子目录形式存储(default除外,因为default库本身没有自己的目录)
修改数据库目录:
1.配置文件修改
数据库的HDFS目录存储由hive-site.xml文件配置 : set hive.metastore.warehouse.dir=/user/hive/warehouse(这个目录是配置项默认配置,也可自己修改存储目录)
[root@master Programming_Hive]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2019-05-28 10:49 /user/hive/warehouse/financials.db
drwxr-xr-x - root supergroup 0 2019-05-28 10:50 /user/hive/warehouse/human_resources.db
2.CLI修改数据库目录:(只修改当前数据库位置) 还可以在CLI增加数据库的描述
hive> create database change22
> comment 'holds all financial tables'
> location '/hive_test';
OK
Time taken: 0.043 seconds
hive> desc database change22;
OK
change22 holds all financial tables hdfs://master:9000/hive_test root USER
Time taken: 0.038 seconds, Fetched: 1 row(s)
为数据库增加何其相关的键—值对属性信息,
查询时使用语句:desc database extended change;
hive> create database change
> with dbproperties ("creator" = "Jason Chan","data"="2019-05-28");
OK
Time taken: 0.109 seconds
hive> desc database extended change;
OK
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=Jason Chan, data=2019-05-28}
Time taken: 0.02 seconds, Fetched: 1 row(s)
修改或者新增数据库属性(不能删除)
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=Jason Chan, data=2019-05-28}
hive> alter database change set dbproperties ("creator"="jason");
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=jason, data=2019-05-28}
hive> alter database change set dbproperties ("edited-by"="Joe");
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=jason, data=2019-05-28, edited-by=Joe}
使用某个数据库:use
hive> show databases;
OK
default
financials
Time taken: 0.033 seconds, Fetched: 2 row(s)
hive> use financials;
OK
Time taken: 0.065 seconds
查询当前使用的数据库(当前的表属于哪个数据库):set hive.cli.print.current.db=true;
hive> set hive.cli.print.current.db=true;
hive (financials)> use default;
OK
Time taken: 0.032 seconds
hive (default)> set hive.cli.print.current.db=false;
删除数据库:drop
hive> drop database if exists human_resources;
OK
Time taken: 0.078 seconds
Hive不允许删除有表的数据库
hive> use traffic;
OK
Time taken: 0.039 seconds
//traffic数据库下面有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.05 seconds, Fetched: 2 row(s)
//删除traffic数据库会报错
hive> drop database traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
解决方案:
1.先清空数据库下的表:
//删除表
hive> drop table monitor_camera_info;
OK
Time taken: 0.132 seconds
hive> drop table monitor_flow_action;
OK
Time taken: 0.236 seconds
//此时再删除数据库就可以了
hive> drop database traffic;
OK
Time taken: 0.108 seconds
2.使用关键字: cascade
//数据库traffic下有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.06 seconds, Fetched: 2 row(s)
//正常删库报错
hive> drop database if exists traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
//使用关键字restrict也不能删除数据库
hive> drop database if exists traffic restrict;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
//使用关键字cascade可以正常删除
hive> drop database if exists traffic cascade;
OK
Time taken: 0.25 seconds
创建表(change数据库下创建employees表),在每个字段类型后面追加一个注释
TBLPROPERTIES作用:按键值对的格式为表增加额外的文档说明 (还可作为表示关于数据库连接的必要的元数据信息) P53
Hive会自动增加两个属性:
1.last_modified_by,保存最后修改这个表的用户的用户名
2.last_modified_time,保存着最后一次修改的新纪元时间秒
//建表sql语句
create table if not exists change.employees3 (
name string comment "employee name",
salary float comment "employee salary",
subordinates array comment "names of subordinates",
deductions map
comment "keys are deductions names, values are percentages",
address struct
comment "home address")
comment "description of the table"
TBLPROPERTIES ('creator'='me','created_at'='2019-05-28')
;
//查看结果
hive> desc employees3;
OK
name string employee name
salary float employee salary
subordinates array names of subordinates
deductions map keys are deductions names, values are percentages
address struct home address
Time taken: 0.116 seconds, Fetched: 5 row(s)
怎么查看表的描述:comment "description of the table" ??? show tblproperties employees3;
hive> show tblproperties employees3;
OK
comment description of the table
created_at 2019-05-28
creator me
transient_lastDdlTime 1559027367
Time taken: 0.128 seconds, Fetched: 4 row(s)
用户还可以拷贝一张已经存在的表的模式(不拷贝数据): 在change数据库下创建表employees4(要求表模式与employees3相同)
创建employee4.3_2.sql文件输入下面语句,然后执行hive -f employee4.3_2.sql
[root@master Programming_Hive]# cat employee4.3_2.sql
create table if not exists change.employees4
like change.employees3;
[root@master Programming_Hive]# hive -f employee4.3_2.sql
查看表employees4的属性
hive> show tables;
OK
employees3
employees4
Time taken: 0.081 seconds, Fetched: 2 row(s)
hive> desc employees4;
OK
name string employee name
salary float employee salary
subordinates array names of subordinates
deductions map keys are deductions names, values are percentages
address struct home address
Time taken: 0.136 seconds, Fetched: 5 row(s)
在default数据库下查看change数据库的表 show tables in change;
hive> set hive.cli.print.current.db=true;
hive (default)> show tables in change;
OK
employees3
employees4
Time taken: 0.051 seconds, Fetched: 2 row(s)
hive (default)> set hive.cli.print.current.db=false;
hive>
使用正则表达式查询表: show tables 'empl*'; 查询empl开头的所有的表
hive> show tables "empl*";
OK
employees3
employees4
Time taken: 0.034 seconds, Fetched: 2 row(s)
在数据库default下查找数据库change的表的结构信息。desc extended change.employees3; 实际我们更倾向于使用FORMATTED,因为输出内容更详细,且可读性较好
hive (default)> desc extended change.employees3;
OK
name string employee name
salary float employee salary
subordinates array names of subordinates
deductions map keys are deductions names, values are percentages
address struct home address
Detailed Table Information Table(tableName:employees3, dbName:change, owner:root
...
location:hdfs://master:9000/user/hive/warehouse/change.db/employees3,
...
parameters:{creator=me, transient_lastDdlTime=1559027367, created_at=2019-05-28, comment=description of the table}, ...)
Time taken: 0.091 seconds, Fetched: 7 row(s)
查看表的某一列信息:salary列 (加不加extended结果是一样的)
hive (change)> describe employees3.salary;
OK
salary float from deserializer
Time taken: 0.117 seconds, Fetched: 1 row(s)
4.3.2外部表
创建外部表,读取HDFS目录:/hive/programming_hive/data/stocks下所有的文件
删除外表,并不会删掉这个表中的数据,只是删除描述表的元数据
[root@master Programming_Hive]# cat 4.3.2stock.sql create external table if not exists stocks (
`exchange` string,
`symbol` string,
`ymd` string,
`price_open` float,
`price_high` float,
`price_low` float,
`price_close` float,
`volume` int,
`price_adj_close` float)
row format delimited fields terminated by ','
location '/hive/programming_hive/data/stocks';
exchange 字段为hive保留字段,修改之后即可。如果非要用这个字段名,就用反引号(Tab上面的键)引起来:
严格来说:Hive是管理着管理表和外部表的的目录和文件,但是并没有对表具有完全的控制权限
查看表是管理表还是外部表:desc formatted movie_table
hive (default)> desc FORMATTED movie_table;
...
Table Type: EXTERNAL_TABLE
...
hive (default)> desc FORMATTED jason;
...
Table Type: MANAGED_TABLE
...
创建外部表,但是可以复制内部表的结构(不复制数据) employees5外部表,employees3内部表
如果employees为外部表,external可省略,复制的表依然为外部表
[root@master Programming_Hive]# cat 4.3.2external_table.sql
create table if not exists change.employees5
like change.employees3
location '/hive/programming_hive/test';
[root@master Programming_Hive]# hive -f 4.3.2external_table.sql
...
OK
Time taken: 1.939 seconds
查看结构employees3和employees5一样,但employees5表的类型为external
hive (change)> desc formatted employees5;
OK
# col_name data_type comment
name string employee name
salary float employee salary
subordinates array names of subordinates
deductions map keys are deductions names, values are percentages
address struct home address
...
Table Type: EXTERNAL_TABLE
4.4 分区表、管理表
如果表中的数据以及分区个数都非常大的话,执行一个包含所有分区的查询可能会触发一个巨大的MapReduce任务。
建议:将Hive设置为strict模式,(对分区表进行查询而where子句没有加分区过滤器,将会禁止提交这个任务)
hive> set hive.mapred.mode=strict;
hive> select e.name,e.salary from employees e limit 10;
FAILED:Error in semantic analysis: No partition predicate found for Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> select e.name,e.salary from employees e limit 10;
John Doe 10000.0
...
查看分区:show partitions
hive (default)> show partitions rating_table_p;
OK
dt=2008-03
dt=2008-08
Time taken: 0.136 seconds, Fetched: 2 row(s)
查看分区键的方法:desc extended rating_table_p; partitionKeys分区键
hive (default)> desc extended rating_table_p;
OK
userid string
movieid string
rating string
dt string
# Partition Information
# col_name data_type comment
dt string
partitionKeys:[FieldSchema(name:dt, type:string, comment:null)]
分区:
1.建表的时候创建分区
2.加载数据的时候创建分区
4.4.1