Hive Basics (8): Practice Questions

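Exercise:
Requirements: a file named score.csv is stored on the cluster under /scoredatas/month=201806. A new file is generated every day and placed in the folder for its date. Other people also use these files, so they must not be moved. Create a matching Hive table, load the data into it for statistical analysis, and make sure the data is not deleted when the table is dropped.

1) Use an external table
2) Partition the table by the month field
3) Specify the table's storage location with location
After creating the table, repair it so that Hive recognizes the existing partitions: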
msck repair table score4;
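Steps 1) to 3) can be combined into a single create statement. The following is a minimal sketch, not a definitive solution: the columns s_id, c_id and s_score are taken from the queries later in this post, while the column types and the '\t' field delimiter are assumptions that should be matched to the real score.csv.

```sql
-- External table: dropping it leaves the files under /scoredatas untouched.
-- Column types and the '\t' delimiter are assumptions; match them to score.csv.
create external table score4 (s_id string, c_id string, s_score int)
partitioned by (month string)
row format delimited fields terminated by '\t'
location '/scoredatas';

-- Register the existing month=... directories as partitions.
msck repair table score4;

-- Check that month=201806 was picked up and that rows are readable.
show partitions score4;
select * from score4 where month = '201806' limit 10;
```

Pointing location at the parent directory /scoredatas (rather than at month=201806 itself) is what lets the repair command map each month=... subdirectory to a partition.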
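4) Bucketed tables

Rows are distributed into different files according to the bucketing column (comparable to the partitioner in Hadoop MapReduce).
Enable bucketing in Hive (needed on Hive 1.x; on Hive 2.0 and later this setting was removed and bucketing is always enforced):
set hive.enforce.bucketing=true;
Set the number of reducers:
set mapreduce.job.reduces=3;
Create the bucketed table:
create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';
Do not load data into a bucketed table with load data: the file would be copied as-is and the rows would not be split into buckets. Load the data through an intermediate table instead.
Create an ordinary table:
create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';
Load the data into the ordinary table:
load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;
Fill the bucketed table with insert overwrite:
insert overwrite table course select * from course_common cluster by(c_id);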
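To check that the rows really ended up in three buckets, one option is to look at the table metadata and sample a single bucket. A small sketch using the course table from above:

```sql
-- The output includes the bucket count and the bucketing column.
desc formatted course;

-- Read only the rows that hash to the first of the 3 buckets on c_id.
select * from course tablesample(bucket 1 out of 3 on c_id);
```

Running the sample for buckets 1, 2 and 3 in turn should together return every row of course exactly once.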
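5) Altering a table
Rename a table:
alter table score4 rename to score5;
Add and change columns:
(1) Add columns
alter table score5 add columns (mycol string, mysco string);
(2) Change a column
alter table score5 change column mysco mysconew int;
6) Dropping a table
Dropping an external table removes only its metadata; the data files stay in place.
drop table score5;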
7) Loading data in Hive
Load data with load:
load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');
Load data from a query:
create table score4 like score;
insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;
8) Exporting data
(1) Export a query result to the local file system
insert overwrite local directory '/export/servers/exporthive' select * from score;
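Without a row format, the exported files use Hive's default field separator (Ctrl-A, \001), which is awkward to read. A sketch of the same export with an explicit delimiter; the '\t' separator here is an illustrative choice, not something the exercise prescribes:

```sql
-- Same export as above, but with tab-separated fields in the output files.
-- Note: overwrite replaces whatever is already in the target directory.
insert overwrite local directory '/export/servers/exporthive'
row format delimited fields terminated by '\t'
select * from score;
```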
9) Truncating a table
Only managed (internal) tables can be truncated:
truncate table score6;
10) Hive query syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]        # group the rows
[CLUSTER BY col_list                           # split rows across reducers (output files) and sort within each
  | [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]    # ORDER BY: global sort
]                                                   # SORT BY: sort within each reducer
[LIMIT number]                                          # limit the number of rows returned
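As an illustration of how these clauses fit together, here is a sketch of one statistics query on the score table. It assumes s_score holds numeric values; the threshold 60 and the limit 10 are arbitrary example values.

```sql
-- Average score per student for one month, keeping only averages above 60,
-- highest first, top 10 rows.
select s_id, avg(s_score) as avg_score
from score
where month = '201806'
group by s_id
having avg(s_score) > 60
order by avg_score desc
limit 10;
```

Filtering on the partition column month in the where clause lets Hive read only that partition's directory.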

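11) Joins
Older Hive versions support only equi-joins; non-equality join conditions in the ON clause are supported from Hive 2.2.0 on.
Inner join (INNER JOIN): a row is kept only when both tables contain data matching the join condition.
Left outer join: the left table is the base; all of its rows are kept.
Right outer join: the right table is the base; all of its rows are kept.
Full outer join: all rows from both tables are returned; where no match exists, the missing side is shown as NULL.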
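The four join types can be compared side by side on the two tables used in this post, score (s_id, c_id, s_score) and course (c_id, c_name, t_id), joined on c_id. This is only a sketch; the selected columns are chosen for illustration.

```sql
-- Inner join: only scores whose c_id also exists in course.
select s.s_id, s.s_score, c.c_name
from score s inner join course c on s.c_id = c.c_id;

-- Left outer join: every score row; c_name is NULL when no course matches.
select s.s_id, s.s_score, c.c_name
from score s left outer join course c on s.c_id = c.c_id;

-- Right outer join: every course row; score columns are NULL when no score matches.
select s.s_id, s.s_score, c.c_name
from score s right outer join course c on s.c_id = c.c_id;

-- Full outer join: all rows from both sides, NULL where the other side has no match.
select s.s_id, s.s_score, c.c_name
from score s full outer join course c on s.c_id = c.c_id;
```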
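12) Sorting
Global sort (Order By): all rows go through a single reducer.
Local sort (Sort By): rows are sorted within each reducer, so set the number of reducers first.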
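The difference is easiest to see by running both forms against the score table. A minimal sketch, assuming three reducers as in the bucketing example:

```sql
-- Global sort: one reducer produces a single, fully ordered result.
select * from score order by s_score desc;

-- Local sort: each of the 3 reducers sorts only the rows it receives,
-- so the overall result is ordered per reducer, not globally.
set mapreduce.job.reduces=3;
select * from score sort by s_score desc;
```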
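13) Sorting within partitions:
DISTRIBUTE BY decides which reducer each row is sent to; it is usually combined with Sort By so that each reducer's rows are also sorted.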
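For example, the score table can be split by student and sorted by score inside each reducer. A sketch, with three reducers as an assumed setting:

```sql
set mapreduce.job.reduces=3;

-- All rows with the same s_id go to the same reducer;
-- within each reducer the rows are ordered by s_score, highest first.
select * from score distribute by s_id sort by s_score desc;
```

In the statement, distribute by has to be written before sort by.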
14) cluster by:
When DISTRIBUTE BY and Sort By use the same column, cluster by can replace them. cluster by always sorts in ascending order.
select * from score cluster by s_id;
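Written out in full, the cluster by query above corresponds to the following form:

```sql
-- Same distribution and the same ascending per-reducer order as: cluster by s_id
select * from score distribute by s_id sort by s_id;
```

The longer form is needed as soon as the sort column differs from the distribution column or a descending order is wanted.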
