I. What is Hive?
Hive is a data warehouse tool built on Hadoop (for offline/batch workloads). It maps structured data files to database tables and provides an SQL-like query language.
Features:
Hive clusters can be scaled out freely; in most cases this does not require restarting the service.
Hive supports user-defined functions (UDFs), so users can implement functions tailored to their own needs.
Good fault tolerance: a SQL job can still complete even if a node fails.
II. Installing Hive
1: Upload and extract the Hive tarball
2: Install MySQL and enable it to start on boot
3: Configure hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
  <description>password to use against metastore database</description>
</property>
4: Add the MySQL JDBC driver jar (mysql*.jar) to $HIVE_HOME/lib
5: Add $HIVE_HOME/bin to the PATH environment variable
III. Hive Concepts and Operations
1: Start Hive as a background service
nohup hiveserver2 1>/dev/null 2>&1 &
2: Connection methods
1: Start beeline, then connect from its prompt:
!connect jdbc:hive2://hadoop01:10000
and enter the username (root) when prompted
2:beeline -u jdbc:hive2://hadoop01:10000 -n root
3. Basic statements
Create a table and specify the field delimiter:
create table t_user(id string,name string) row format delimited fields terminated by ',';
View a table's structure: desc tablename;
4. Creating an external table
create external table t_access(ip String,url String,accesstime String )
row format delimited fields terminated by ','
location '/data/acc';
What is the difference between external and internal (managed) tables?
Internal tables are stored under /user/hive/warehouse; dropping the table also deletes its data.
External tables point to a directory you specify yourself; dropping the table does not delete the data.
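To see the difference in action, a sketch against the two tables above (don't run it mid-walkthrough if you still need them):
drop table t_user;     ## internal: its directory under /user/hive/warehouse is deleted with it
drop table t_access;   ## external: the files under /data/acc remain on HDFS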
5. Partitioned tables
create table t_access(id String,url String,access_time String)
partitioned by(dt String)
row format delimited
fields terminated by ',';
Import data into a partition
Note: a partition column must not be one of the columns already defined in the table.
Data for different partitions is stored in different directories.
load data local inpath "/root/access.txt" into table t_access partition(dt="2018-8-28");
Multiple partition columns
create table t_user(id String,name String,age String)
partitioned by(birthday String,sex String)
row format delimited
fields terminated by ',';
load data local inpath "/root/test/user.txt" into table t_user partition(birthday="1949-10-1",sex="man");
6. Loading data
1) Manually place files into the table's directory with hdfs commands
2) Load from the local (Linux) filesystem:
load data local inpath "/root/test" into table t_user;
3) Load from HDFS (note: this moves the source files into the table directory rather than copying them)
load data inpath "/" into table t_user;
7. Exporting data
1) Export Hive table data to a file on HDFS
insert overwrite directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access;
2) Export Hive table data to a local disk directory
insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;
8. CTAS-style table creation
1) Create a table with the same schema as an existing one (structure only, no data is copied):
create table t_user2 like t_user;
2) Create a table and populate it from a query in one statement:
create table t_access2
as
select ip,url from t_access;
9. Storage formats
stored as textfile;       ## plain text, the default
stored as sequencefile;
stored as parquetfile;
create table t_table(ip string,url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;
insert into t_table select * from t_access;
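The storage clause can also be combined with CTAS; a minimal sketch (the table name is hypothetical):
create table t_access_parquet stored as parquet as select ip,url,accesstime from t_access;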
10. Complex data types
1) array
create table t_movie(movie_name string,actors array<string>,first_data string)
row format delimited
fields terminated by ","
collection items terminated by ":";
load data local.............
select * from t_movie;
select movie_name,actors[0] from t_movie;
select movie_name,actors from t_movie where array_contains(actors,'吴刚');
select movie_name,size(actors) from t_movie;
2) map
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
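A hypothetical input line for this layout (fields by ',', map entries by '#', key and value by ':'):
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28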
select * from t_person;
## Get the value of a given key from the map column
select id,name,family_members['father'] as father from t_person;
## Get all keys of the map column
select id,name,map_keys(family_members) as relation from t_person;
## Get all values of the map column
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;
## Combined: find users who have a brother
select id,name,brother
from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
3) struct
create table t_person_struct(id int,name string,info struct<age:int,sex:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
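A hypothetical input line (fields by ',', struct members by ':'; the sex member is an assumption, since only age is used in the query below):
1,zhangsan,18:male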
select id,name,info.age from t_person_struct;
11. Local-mode MapReduce (small queries run locally instead of being submitted to the cluster)
set hive.exec.mode.local.auto=true;
12. join ... on ...
select * from t_a a join t_b b on a.name=b.name;
Supported join types:
join (inner)
left join
right join
full join
left semi join (Hive's only semi-join form; there is no right semi join -- see the sketch below)
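left semi join returns the rows of the left table that have a match on the right, like an IN/EXISTS subquery. A sketch using the same tables:
select a.* from t_a a left semi join t_b b on (a.name = b.name);
## equivalent to: select * from t_a where name in (select name from t_b);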
13. Grouping
group by ... having ...
When filtering on an aggregate result, the condition can only go in the having clause (where runs before aggregation).
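For example, counting rows per ip in the t_access table above (a minimal sketch):
select ip,count(*) as cnt from t_access group by ip having count(*) > 2;
## where cannot see count(*); having filters after aggregation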
14. Subqueries
Treat a query result as a table and query it again:
select id,name,brother from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
15. Sorting
order by <column> [asc|desc], e.g. order by age desc
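For example, sorted by age descending (using the t_person table defined earlier):
select id,name,age from t_person order by age desc;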
16. Hive built-in functions
1) Type-conversion functions
select cast("5" as int) ;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);
2) Math functions
select round(5.4);        ## 5, round to the nearest integer
select round(5.1345,3);   ## 5.135, round to 3 decimal places
select ceil(5.4);         ## 6, round up (ceiling(5.4) is equivalent)
select floor(5.4);        ## 5, round down
select abs(-5.4);         ## 5.4, absolute value
select greatest(3,5,6);   ## 6, maximum of the arguments
select least(3,5,6);      ## 3, minimum of the arguments
3) String functions
substr(string str, int start)        ## take a substring from position start
substring(string str, int start)
substr(string, int start, int len)
substring(string, int start, int len)
concat(string A, string B...)        ## concatenate strings
concat_ws(string SEP, string A, string B...)   ## concatenate with a separator
length(string A)                     ## string length
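A few usage sketches (literal inputs; expected results in the comments):
select substr('hive sql',1,4);        ## 'hive' (positions are 1-based)
select concat('t_','user');           ## 't_user'
select concat_ws('-','2017','08');    ## '2017-08'
select length('hive');                ## 4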
4) Date/time functions
select current_timestamp;   ## current timestamp (full date and time)
select current_date;        ## current date
## Current time as a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
select unix_timestamp();
## Convert a Unix timestamp to a string
from_unixtime(bigint unixtime[, string format])
Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
## Convert a string to a Unix timestamp
unix_timestamp(string date, string pattern)
Example: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");
## Convert a string to a date
select to_date("2017-09-17 16:58:32");
5) Conditional functions
5.1 case when
select ...,
case
  when <condition> then <value>
  when <condition> then <value>
  else <default value>
end
from table_name;
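A concrete sketch against t_person (the age thresholds are arbitrary):
select id,name,
case
  when age < 18 then 'minor'
  when age < 60 then 'adult'
  else 'senior'
end as stage
from t_person;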
5.2 if
select id,name,age,if(age>40,'old','young') from t_user;
6) Collection functions
6.1 array_contains(Array<T>, value)   ## does the array contain the value?
6.2 sort_array(Array<T>)              ## sort an array
6.3 size(Array<T>)                    ## number of elements (also accepts Map<K,V>)
6.4 map_keys(Map<K,V>)                ## the map's keys as an array (map_values returns the values)
7) Table-generating functions
7.1 explode(): turns the elements of an array into rows
select distinct tmp.sub from
(select explode(subjects) as sub from t_student) as tmp;
7.2 lateral view
Intuition: lateral view works like a join between two tables:
the left table is the original table;
the right table is the one produced by explode(<collection column>);
and the join only pairs rows derived from the same original row.
select id,name,a.sub from
(select id,name,tmp.sub as sub from t_student lateral view explode(subjects) tmp as sub) a
where sub = '生物';   ## '生物' means "biology"
8) Parsing JSON
select json_tuple(json,'movie','rate','timeStamp','uid') as (movie,rate,ts,uid) from t_json;
Parse the timestamp to get the minute:
minute(from_unixtime(cast(ts as bigint))) as min
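Putting the two together (t_json and its json column are from the statement above):
select movie,rate,minute(from_unixtime(cast(ts as bigint))) as min
from
(select json_tuple(json,'movie','rate','timeStamp','uid') as (movie,rate,ts,uid) from t_json) tmp;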
9) Analytic function row_number() over() -- per-group top-N
select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;
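To keep only the top N rows per group (e.g. the two oldest people of each sex), filter the computed rank in an outer query:
select id,age,name,sex from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank <= 2;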
17. User-defined functions (UDF)
1. Write the function in Java:
the class extends org.apache.hadoop.hive.ql.exec.UDF
and defines an evaluate() method.
2. Package it as a .jar file and upload it to the Linux machine.
3. Register the jar at the Hive command line:
add jar /root/test/hql.jar;
4. Create a Hive function bound to the Java class (use the fully-qualified class name):
create temporary function get_max as 'Test.getMaxIndex';
5. Use it:
select get_max(1,2,3);
18. sqoop
1) Import from MySQL into HDFS
sqoop import --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --table t_user --columns 'id, name, age'
Import into a specified HDFS directory:
sqoop import --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --table t_user --target-dir '/sqoop/td' --fields-terminated-by '\t'
Set the number of map tasks by appending: -m 2
2) Export HDFS data into the database (don't forget to specify the field delimiter):
sqoop export --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --export-dir '/td3' --table td_bak -m 1 --fields-terminated-by ','
19. Operating Hive from Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // load the Hive JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop01:10000", "root", "");  // connect to hiveserver2
        Statement statement = conn.createStatement();        // object for executing SQL
        ResultSet set = statement.executeQuery("select * from t_user");
        while (set.next()) {
            String str = set.getString(2);                   // read the second column (name)
            System.out.println(str);
        }
        set.close();
        statement.close();
        conn.close();
    }
}