Hive, the Data Warehouse Tool, in Practice

Official language manual links:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

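I. DDL Operations

Data Definition Language (DDL)

(i) Databases

Create, inspect, drop, alter, and switch databases.

    1. Create a database
        create database myhive;
        create database if not exists myhive;
    
    CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    Parentheses mark a required choice, the vertical bar inside them separates the alternatives, and square brackets mark an optional part.
        create database myhive_1 comment "myhive_1 database 1" location "/myhive_1";
        -- location is a directory on HDFS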

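    2. Inspect databases
        show databases;                       -- list all databases
        desc database [extended] myhive_1;    -- show a database's properties; [extended] is optional
        select current_database();            -- show the database currently in use

    3. Drop a database
        Default behavior:
        drop database myhive_1;              -- same as: drop database myhive_1 restrict;
        drop database if exists myhive_1; 
        drop database myhive_1 cascade;      -- forced cascading drop, tables included; use with great care!

    4. Alter a database
        What is supported depends on the Hive version.
        Rarely used.

    5. Switch databases
        use myhive;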

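(ii) Tables

Create, drop, alter, and list tables, and inspect the details of a single table.

    1. List tables
        show tables;
        show tables in myhive;    -- list the tables of another database from the current one
        show tables "stu*";       -- filter by name pattern; only the * and | wildcards are supported, not full regular expressions

    2. Inspect a single table
        desc student;
        desc formatted student;           -- detailed information, formatted for reading
        desc extended student;            -- detailed information, unformatted
        show create table student_ptn;    -- the complete CREATE TABLE statement
        show partitions student_ptn;      -- list the partitions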

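Keyword glossary

CREATE TABLE: creates a table with the given name.

EXTERNAL: creates an external table. Dropping a managed (internal) table deletes both its metadata and its data; dropping an external table deletes only the metadata and leaves the data in place.

PARTITIONED BY: a Hive SELECT normally scans the whole table, which wastes time when only part of the data is of interest. Partitions, declared when the table is created, let a query scan only the relevant subset.

LIKE: copies the structure of an existing table without copying its data.

COMMENT: attaches a description to a table or a column.

ROW FORMAT: lets you specify a custom SerDe or use the built-in one when creating a table. If ROW FORMAT is omitted, or ROW FORMAT DELIMITED is given, the built-in SerDe is used.

STORED AS TEXTFILE | SEQUENCEFILE | RCFILE: use STORED AS TEXTFILE for plain-text data. TEXTFILE is also the default format, which you can check with the command set hive.default.fileformat; Use STORED AS SEQUENCEFILE if the data needs to be compressed. RCFILE is a storage format that combines row and column layouts.

CLUSTERED BY: any table or partition can be further organized into buckets, a finer-grained division of the data. Buckets are also defined on a column: Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record goes into.

LOCATION: the HDFS directory that holds the data files. It can be set for both managed and external tables; if omitted, the table lives under the default warehouse path.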
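The bucket rule described under CLUSTERED BY (hash the column value, then take the remainder modulo the bucket count) can be sketched in a few lines of Python. This is only an illustration: the 31-based string hash is a Java-style stand-in chosen for the example, and the function names are hypothetical, so bucket numbers on a real cluster may differ.

```python
def java_style_hash(text: str) -> int:
    """31-based rolling hash wrapped to a signed 32-bit int (stand-in hash, an assumption)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed one.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets: int) -> int:
    """Return the bucket index: non-negative hash modulo the number of buckets."""
    # Assumption: an int hashes to itself, a string uses the 31-based hash above.
    h = value if isinstance(value, int) else java_style_hash(str(value))
    return (h & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    for dept in ["CS", "IS", "MA"]:
        print(dept, "-> bucket", bucket_for(dept, 3))
    for student_id in [95001, 95002, 95003, 95004]:
        print(student_id, "-> bucket", bucket_for(student_id, 4))
```

Rows whose bucketing column has the same value always land in the same bucket, which is what makes bucketed sampling and bucketed joins possible.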

    3. Create tables
        
        Bucketed table
        Example clause: CLUSTERED BY (department) SORTED BY (age ASC, id DESC) INTO 3 BUCKETS
        
        Create Table ... As Select, CTAS for short
        
        create table studentss like student;    -- copies the table structure, not the data

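        There are six ways to create a table:
        (1) Managed (internal) table
        create table student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";
            
            Table type of a managed table: MANAGED_TABLE
            
        (2) External table
        create external table student_ext_1(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";

        How managed and external tables compare:
            1. An external table is declared with the external keyword.
            2. An external table is usually given an explicit location.
        How they differ:
            Dropping a managed table deletes both metadata and data; dropping an external table deletes only the metadata.
        Which one to choose?
            1. If the data already sits on HDFS, Hive needs to analyze it, and other compute engines may analyze the same data, use an external table.
            2. If the data is used only by Hive for analysis, a managed table is fine.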
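The drop behavior of the two table types can be mimicked with a toy model in Python. This is a sketch under stated assumptions, not Hive's implementation: a dict stands in for the metastore, a temporary local directory stands in for HDFS, and create_table and drop_table are hypothetical helper names.

```python
import os
import shutil
import tempfile

# Stand-in for Hive's metadata: table name -> location and table type.
metastore = {}


def create_table(name, location, external=False):
    """Register the table and make sure its data directory exists."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}


def drop_table(name):
    """Always remove the metadata; remove the data directory only for a managed table."""
    info = metastore.pop(name)
    if not info["external"]:
        shutil.rmtree(info["location"])


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "student")
external_dir = os.path.join(root, "student_ext_1")
create_table("student", managed_dir)
create_table("student_ext_1", external_dir, external=True)

drop_table("student")
drop_table("student_ext_1")
print(os.path.exists(managed_dir), os.path.exists(external_dir))
```

After both drops the metadata is gone for both tables, but only the external table's directory is still on disk, which is why data shared with other engines belongs in an external table.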
        
        -- Location that does not exist yet: Hive creates the table directory automatically
        create external table student_ext_2(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_2";

        -- Location that already exists and already holds data
        -- Run in the Linux shell:
        --   hadoop fs -mkdir -p /student_ext_3
        --   hadoop fs -put /home/bigdata/data/student.txt /student_ext_3
        -- Then run in the Hive CLI:
        create external table student_ext_3(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_3";
        
        
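        (3) Partitioned table
        
        -- A partitioned table with a single partition column:
        create table student_ptn(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field") row format delimited fields terminated by ",";

        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn;    -- ERROR: no partition specified

        -- Loading into a partition that does not exist yet creates it automatically
        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="beijing");    -- OK

        Note: a partition column must not repeat a column declared in the table body.
        A partition column is a virtual column: its value is recorded in the metastore (and reflected in the partition directory name), not stored in the data files.
        In queries, a partition column is used just like an ordinary column.

        -- Loading into a partition that already exists
        alter table student_ptn add partition (city="chongqing");    -- adds an empty partition; query results do not change
        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="chongqing");    -- the table now holds twice as much data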
                

        -- A partitioned table with several partition columns:
        create table student_ptn_date(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field", dt string) row format delimited fields terminated by ",";

        -- Loading data into a partition:
        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing");    -- ERROR: every partition column needs a value

        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-12');    -- OK

        -- A single load cannot name more than one partition
        load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-14') partition(city="beijing" , dt='2012-12-13');    -- ERROR

        -- Adding partitions (several per statement is allowed)
        alter table student_ptn_date add partition(city="beijing", dt='2012-12-14') partition (city="beijing" , dt='2012-12-13');    -- OK
        alter table student_ptn_date add partition(city="chongqing", dt='2012-12-14') partition (city="chongqing" , dt='2012-12-13'); 

        -- List the partitions of a partitioned table
        show partitions student_ptn;
        show partitions student_ptn_date;
        show partitions student;    -- ERROR: student is not a partitioned table
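Because partition values are kept outside the data files, each partition maps to its own directory, one key=value level per partition column. The Python sketch below shows that layout; the table directory is an assumption modeled on the warehouse path used elsewhere in these notes, the helper names are hypothetical, and escaping of special characters in values is ignored.

```python
def partition_path(table_dir: str, spec: dict) -> str:
    """Build the directory of one partition: one key=value level per partition column."""
    levels = [f"{column}={value}" for column, value in spec.items()]
    return "/".join([table_dir.rstrip("/")] + levels)


def partition_value(path: str, column: str) -> str:
    """Recover a partition column's value from the path, not from the file contents."""
    for part in path.split("/"):
        key, sep, value = part.partition("=")
        if sep and key == column:
            return value
    raise KeyError(column)


# Assumed table directory for student_ptn_date.
table_dir = "/user/hive/warehouse/myhive.db/student_ptn_date"
path = partition_path(table_dir, {"city": "beijing", "dt": "2012-12-12"})
print(path)
print(partition_value(path, "dt"))
```

A query that filters on city or dt only needs to read the matching directories, which is where the speedup of partitioning comes from.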
        
        (4) Bucketed table
        
        -- Create a bucketed table
        create table student_bucket (id int, name string, sex string, age int, department string) clustered by (department) sorted by (age desc, id asc) into 3 buckets row format delimited fields terminated by ",";

        -- desc formatted student_bucket;
        -- Num Buckets:          3                        
        -- Bucket Columns:       [department]             
        -- Sort Columns:         [Order(col:age, order:0), Order(col:id, order:1)]   

        Note: the columns in CLUSTERED BY must be columns of the table.
        The bucketing columns and the sort columns may differ, but both must be table columns.

        Data must be put into a bucketed table through an INSERT ... SELECT query that buckets the rows, not through LOAD.
        
        
        (5) Create a table from a query result
        -- General form:
        create table ... as  select ....
        -- Example query:
        select department, count(*) as total from student group by department;
        -- Complete CTAS statement:
        create table dpt_count as select department, count(*) as total from student group by department;
                        
        (6) Create a table by copying an existing table's structure with LIKE
        create table student_like like student;

    4. Drop a table
    drop table student;
    drop table if exists student;

    5. Alter a table
        
        1) Rename a table
        alter table student_like rename to studentss;

        2) Alter columns

            Add columns:
            alter table student2 add columns (city string, dt string);
            Drop a column:
            alter table student2 drop columns (city);    -- ERROR: Hive has no DROP COLUMNS; use REPLACE COLUMNS instead
            Replace the column list:
            alter table student2 replace columns (id int, name string, sex string, age int);
            Change a column definition:
            alter table student2 change id newid string comment "new id";
            Change the column order:
            alter table student2 change sex sex string first;
            alter table student2 change name name string after sex;

        3) Alter partitions

            Add partitions:
            alter table student_ptn add partition(city='tianjin') partition(city='shanghai');

            Drop partitions (one at a time, or several in one statement separated by commas):
            alter table student_ptn drop partition(city='tianjin');  
            alter table student_ptn drop partition(city='tianjin'),partition(city='shanghai'); 

            Change a partition's data directory:
            alter table student_ptn partition(city="beijing") set location "/stu_beijing";    -- ERROR: the location must be a full URI
            alter table student_ptn partition(city="beijing") set location "hdfs://bigdata02:8020/stu_beijing";    -- OK

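    6. Truncate a table
    truncate table student;    -- removes all of the table's data files and keeps the table definition
    -- Roughly the same effect as running this in the shell:
    --   hadoop fs -rm -r /user/hive/warehouse/myhive.db/student/*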

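III. DML Operations

Data Manipulation Language (DML)

    1. Import data
    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
    (1) LOAD is a copy or move operation: it copies or moves the data files into the location Hive has configured for the table.
    With LOCAL the file is copied; without LOCAL it is moved.
    -- Load from an absolute local path:
    load data local inpath "/home/bigdata/data/student.txt" into table student;

    -- Load from a relative local path:
    load data local inpath "./student.txt" into table student;
    load data local inpath './student.txt' overwrite into table student;    -- overwrite: replaces the data already in the table

     Loading local data amounts to a copy, that is, an upload.

    -- Load from HDFS using a short path:
    -- (shell) hadoop fs -put /home/bigdata/data/student.txt /
    load data inpath '/student.txt' into table student;

    -- Load from HDFS using a full URI:
    load data inpath 'hdfs://bigdata02:9000/student.txt' into table student;

    Loading HDFS data into a Hive table is a cut: the file is moved, not copied.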
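The copy-versus-move difference can be reproduced on a local file system. The Python sketch below is an analogy only: a temporary directory plays the role of both the local disk and HDFS, load_data is a hypothetical helper, and the sample row is made up for the example.

```python
import os
import shutil
import tempfile


def load_data(src: str, table_dir: str, local: bool) -> str:
    """Mimic LOAD DATA: copy the file when local is True, move it otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    dest = os.path.join(table_dir, os.path.basename(src))
    if local:
        shutil.copy(src, dest)   # LOCAL: the source file stays where it was
    else:
        shutil.move(src, dest)   # no LOCAL: the source file leaves its old path
    return dest


root = tempfile.mkdtemp()
table_dir = os.path.join(root, "warehouse", "student")

local_file = os.path.join(root, "student_local.txt")
hdfs_file = os.path.join(root, "student_hdfs.txt")
for name in (local_file, hdfs_file):
    with open(name, "w") as f:
        f.write("1,alice,F,20,CS\n")   # made-up sample row

load_data(local_file, table_dir, local=True)
load_data(hdfs_file, table_dir, local=False)

print(os.path.exists(local_file))   # is the copied source still there?
print(os.path.exists(hdfs_file))    # is the moved source still there?
print(sorted(os.listdir(table_dir)))
```

The practical consequence is that a file loaded from HDFS without LOCAL is no longer at its original path afterwards, so other jobs that expect it there will not find it.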

    INSERT
    insert into table student (id, name, sex, age, department) values (101,"huangbo","M",222,"IT");
    
    Create a partitioned table:
    -- This student_ptn is partitioned by department; if the city-partitioned student_ptn from the DDL section still exists, drop it first
    create table student_ptn (id int, name string, sex string, age int) partitioned by (department string) row format delimited fields terminated by ",";


    Single inserts (the select list must follow the target column order: id, name, sex, age):
    insert into table student_ptn partition (department = 'IS') select id,name,sex,age from student where department  = 'IS';
    insert into table student_ptn partition (department = 'CS') select id,name,sex,age from student where department  = 'CS';
    insert into table student_ptn partition (department = 'MA') select id,name,sex,age from student where department  = 'MA';


    Multi-insert:
    from student 
    insert into table student_ptn partition (department = 'IS') select id,name,sex,age where department = 'IS' 
    insert into table student_ptn partition (department = 'CS') select id,name,sex,age where department = 'CS' 
    insert into table student_ptn partition (department = 'MA') select id,name,sex,age where department = 'MA';
    
    The main benefit of a multi-insert is that it combines several similarly shaped statements so the whole batch runs more efficiently: the generated MapReduce job reads the source data only once.
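The saving from reading the source once can be made concrete with a small Python simulation. It is a sketch of the idea rather than of Hive's execution plan: the rows are hypothetical, and a counter tracks how many source rows each approach reads.

```python
from collections import defaultdict

# Hypothetical sample rows: (id, name, sex, age, department).
STUDENTS = [
    (1, "alice", "F", 20, "CS"),
    (2, "bob", "M", 21, "IS"),
    (3, "carol", "F", 19, "MA"),
    (4, "dave", "M", 22, "CS"),
]


def scan(rows, counter):
    """Yield rows from the source table, counting every row that is read."""
    for row in rows:
        counter["rows_read"] += 1
        yield row


def single_inserts(rows, departments):
    """One full scan per target partition, like separate INSERT ... SELECT statements."""
    counter = {"rows_read": 0}
    partitions = {}
    for dept in departments:
        partitions[dept] = [r[:4] for r in scan(rows, counter) if r[4] == dept]
    return partitions, counter["rows_read"]


def multi_insert(rows, departments):
    """A single scan that routes each row to its partition, like FROM ... INSERT ... INSERT."""
    counter = {"rows_read": 0}
    partitions = defaultdict(list)
    wanted = set(departments)
    for r in scan(rows, counter):
        if r[4] in wanted:
            partitions[r[4]].append(r[:4])
    return dict(partitions), counter["rows_read"]


depts = ["IS", "CS", "MA"]
single_result, single_reads = single_inserts(STUDENTS, depts)
multi_result, multi_reads = multi_insert(STUDENTS, depts)
print(single_reads, multi_reads)
```

Both approaches fill the partitions with the same rows, but the separate inserts read the four source rows three times each while the multi-insert reads them once.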

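        Partition insert:
    The partition can be created by hand first (LOAD also creates a missing partition automatically).
    -- Assumes a table partitioned by city, such as the student_ptn of the DDL section; the plain student table has no partitions
    alter table student_ptn add partition (city="zhengzhou");
    load data local inpath '/student.txt' into table student_ptn partition(city='zhengzhou');


    CTAS (create table ... as select ...): stores a query result directly in a newly created table
    The new table is a managed (internal) table.
    -- Fails if a table named student already exists; drop it first or pick another name
    create table student as select id,name,age,department from mingxing;
    Note: the columns of the new table take exactly the names, types, and comments of the columns in the query.

    Restrictions on the target table (in the Hive versions these notes cover):
    1. It cannot be an external table.
    2. It cannot be a partitioned table.
    3. It cannot be a bucketed table.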

    Bucketed insert:
    Create the bucketed table:
    create table mingxing(id int, name string, sex string, age int, department string)
    clustered by(id) sorted by(age desc) into 4 buckets
    row format delimited fields terminated by ',';

    -- LOAD cannot be used to put data directly into a bucketed table
    Insert the data:
    insert into table mingxing select id,name,sex,age,department from mingxing2
    distribute by id sort by age desc;
    Note: the bucketing in the query (distribute by / sort by) must match the bucketing declared on the table.

    2. Export data
    Single export to a local directory:
    insert overwrite local directory '/root/outputdata' select id,name,sex,age,department from mingxing;

    Several exports to local directories in one statement:
    from mingxing
    insert overwrite local directory '/root/outputdata1' select id, name
    insert overwrite local directory '/root/outputdata2' select id, name,age;

    Export to HDFS using a short path:
    insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;

    Export to HDFS using a full URI:
    insert overwrite directory 'hdfs://bigdata02:9000/root/outputdata1' select id,name,sex,age,department from mingxing;

    local: write to a directory on the local file system
    overwrite: replace whatever the target directory already contains

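    3. Query data
     Hive's basic SELECT syntax largely matches standard SQL and supports WHERE, DISTINCT, GROUP BY, ORDER BY, HAVING, LIMIT, subqueries, and so on.

    order by: global sort
    sort by: local sort, within each reducer
    sort by is usually paired with bucketing, for example:
    distribute by id sort by age desc;
    
    distribute by: pure bucketing, no sorting
    When using distribute by, set the number of reduce tasks (for example, set mapreduce.job.reduces=4;).

    cluster by: buckets and sorts on the same column
    cluster by age  is equivalent to  distribute by age sort by age
    
    cluster by and sort by cannot be used together.

    Also available: where, group by, distinct, having, case ... when, ....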
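How the four clauses differ is easiest to see on a handful of rows. The Python sketch below simulates them with plain lists; it is a simplification, since it picks the reducer with key % num_reducers where Hive hashes the key, and the rows and function names are made up for the example.

```python
# Hypothetical sample rows: (id, age).
ROWS = [(1, 30), (2, 25), (3, 41), (4, 19), (5, 33), (6, 27)]


def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key; no ordering is implied."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key, reverse=False):
    """Sort inside each reducer only, so the combined output is not globally ordered."""
    return [sorted(part, key=key, reverse=reverse) for part in reducers]


def order_by(rows, key, reverse=False):
    """Global sort: every row passes through one ordering step."""
    return sorted(rows, key=key, reverse=reverse)


def cluster_by(rows, key, num_reducers):
    """distribute by and sort by on the same key, ascending only."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def by_id(row):
    return row[0]


def by_age(row):
    return row[1]


print(order_by(ROWS, by_age, reverse=True))
print(sort_by(distribute_by(ROWS, by_id, 2), by_age, reverse=True))
print(cluster_by(ROWS, by_age, 2))
```

The second call mirrors distribute by id sort by age desc with two reducers: each reducer's output is sorted by age, yet concatenating the two lists does not give a globally sorted result, which is the difference from order by.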

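IV. Hive Views

Like relational databases, Hive provides views, but note that Hive views differ considerably from views in a relational database:

1. Only logical views exist; there are no materialized views (in the Hive versions these notes cover).
2. A view can only be queried; Load/Insert/Update/Delete against a view is not allowed.
3. Creating a view only stores metadata; the subqueries behind the view run only when the view is queried.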
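The third point, that a view is just a stored query evaluated at read time, can be illustrated with a tiny Python class. This is an analogy, not Hive code: View is a hypothetical class, and the rows in sogou_table are made up for the example.

```python
class View:
    """A logical view: keeps only the query definition, never a copy of the rows."""

    def __init__(self, query):
        self._query = query   # stored "metadata"; nothing runs at creation time

    def select(self):
        # The underlying query runs each time the view is read.
        return self._query()


# Hypothetical base table rows: (uid, rank).
sogou_table = [("u1", 5), ("u2", 2), ("u3", 4)]

sogou_view = View(lambda: [row for row in sogou_table if row[1] > 3])

print(len(sogou_view.select()))
sogou_table.append(("u4", 9))   # change the base table after the view was created
print(len(sogou_view.select()))
```

The second read sees the new row because the view holds no data of its own; the class also offers no way to write through the view, matching the query-only rule.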

1. Create a view
    create view view_name as select * from carss;
    create view carss_view as select * from carss limit 500;

2. Inspect views
    show tables;       -- lists views as well as tables
    desc view_name;    -- show the details of one view
    desc carss_view;

3. Drop a view
    drop view view_name;
    drop view if exists carss_view;

4. Use a view
    create view sogou_view as select * from sogou_table where rank > 3 ; 
    select count(distinct uid) from sogou_vi
