https://blog.csdn.net/joob000/article/details/85699213.
1. Go to the download page http://apache.fayea.com/hive/, download apache-hive-1.2.1-bin.tar.gz, and extract it:
tar -xzvf apache-hive-1.2.1-bin.tar.gz
2. Configure the environment variables:
vi /etc/profile
export HIVE_HOME=/home/liqqc/app/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
3. Configure Hive parameters
Copy the template configuration files:
cp hive-default.xml.template hive-default.xml
cp hive-env.sh.template hive-env.sh
Create hive-site.xml:
touch hive-site.xml
Configure hive-env.sh:
export JAVA_HOME=/usr/java/jdk1.8.0_141
export HIVE_HOME=/home/liqqc/app/apache-hive-1.2.1-bin
export HADOOP_HOME=/home/liqqc/app/hadoop-2.7.1
Configure hive-site.xml
Create a temporary directory: create a tmp folder under the apache-hive-1.2.1-bin directory, then add the following properties:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/opt/modules/hive/tmp</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/opt/modules/hive/tmp</value>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/opt/modules/hive/tmp</value>
</property>
<property>
  <name>datanucleus.schema.autoCreateAll</name>
  <value>true</value>
</property>
### Username and password for beeline connections to the Hive thrift service
<property>
  <name>hive.server2.thrift.client.user</name>
  <value>root</value>
  <description>Username to use against thrift client</description>
</property>
<property>
  <name>hive.server2.thrift.client.password</name>
  <value>root</value>
  <description>Password to use against thrift client</description>
</property>
4. Upload the MySQL driver jar
Download the MySQL driver mysql-connector-java-5.1.7-bin.jar and upload it to the apache-hive-1.2.1-bin/lib directory.
5. Initialize Hive
Command: schematool -initSchema -dbType mysql
If it finishes with "schemaTool completed" and no errors, the initialization succeeded.
6. Start Hive
Run the command: hive
1.Permission denied: user=dr.who, access=READ_EXECUTE, inode="/tmp":root:supergroup:drwx------
Fix the permissions:
[root@hadoop01 bin]# ./hdfs dfs -chmod -R 777 /tmp
2. When beeline connects to Hive it fails with: User: xxx is not allowed to impersonate anonymous (state=08S01,code=0)
Fix: add the following properties to Hadoop's core-site.xml and restart HDFS, where "xxx" is the user beeline connects as; replace "xxx" with your own username.
<property>
  <name>hadoop.proxyuser.xxx.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.xxx.groups</name>
  <value>*</value>
</property>
"*" means any user, group, and host may access Hadoop through the superuser proxy "xxx".
On first startup, Hive may complain: ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory
Since Spark 2, the single large JAR under the lib directory was split into many smaller JARs; spark-assembly-*.jar no longer exists, so Hive cannot find it.
Fix
Go to the bin directory of the Hive installation and open the hive script:
cd $HIVE_HOME/bin
vi hive
Find the line that sets sparkAssemblyPath; it reads sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`. Change it to pick up Spark 2's jars directory instead, e.g. sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`.
That resolves the problem.
Hive overview
Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables, offers simple SQL-style querying, and translates SQL statements into MapReduce jobs. Its main advantage is the low learning curve: simple MapReduce statistics can be expressed quickly with SQL-like statements, with no need to develop dedicated MapReduce programs, which makes it a good fit for statistical analysis over a data warehouse.
How Hive works
Diagram
Suppose that, from the Hive command-line client, I create a database myhive and then create a table emp inside it.
create database myhive;
use myhive;
create table emp(id int,name string);
Hive then stores the metadata in a database. Hive metadata includes table names, columns, partitions and their properties, table properties (e.g. whether it is an external table), the directory where the table data lives, and so on.
Because Hive is built on Hadoop, databases and tables show up as directories on HDFS, and the data itself is of course stored on HDFS as well.
For the database and table above, Hive creates a directory such as /user/hive/warehouse/myhive.db on HDFS, and the table data can simply be a file you upload yourself, e.g. emp.data, placed under that directory. You can then run SQL queries (note: the query is written against the table, not against emp.data, e.g. select * from emp, yet what comes back is the content of emp.data; together they behave like a table in a traditional database). All the metadata is kept in an external database (such as MySQL; the embedded Derby also works but is not recommended, since an embedded store cannot be shared).
Now suppose I run another query:
select id,name from emp where id>2 order by id desc;
How is it executed? The query is handed to Hive, which uses its parser and optimizer (the Compiler in the diagram) to build a plan from MapReduce templates. The generated query plan is stored in HDFS, then picked up by the MapReduce program and submitted as a job to run on YARN.
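If you want to see the plan Hive builds before it submits the job, EXPLAIN prints it. A quick sketch against the emp table used above (the exact stage output varies by Hive version):
explain select id,name from emp where id>2 order by id desc;
-- the output lists the generated stages, e.g. a Stage-1 with a Map Operator Tree
-- (TableScan -> Filter -> Select) and a Reduce Operator Tree for the ORDER BY,
-- which is exactly what gets submitted to YARN as a MapReduce job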
Hive and MapReduce
How Hive stores data
1. All Hive data is stored in HDFS; there is no special Hive storage format (Text, SequenceFile, ParquetFile, RCFile and others are supported).
2. You only need to tell Hive the column delimiter and row delimiter when creating a table, and Hive can parse the data.
3. Hive has the following data models: DB, Table, External Table, Partition, Bucket (a combined DDL sketch follows this list).
db: appears on HDFS as a folder under the ${hive.metastore.warehouse.dir} directory
table: appears on HDFS as a folder under its db directory
external table: like a table, except its data can live at any path you specify
regular (managed) table: dropping the table also deletes its files on HDFS
external table: dropping it does not delete the files on HDFS; only the table definition is removed
partition: appears on HDFS as a sub-directory under the table directory
bucket: appears on HDFS as multiple files under the same table directory, produced by hashing; rows go into different files according to the hash
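A minimal HiveQL sketch of how these models appear in DDL (the database, table and column names here are only illustrative; each feature is exercised for real in the sections below):
create database demo_db;
use demo_db;
-- external table: data stays wherever LOCATION points
create external table demo_ext(id int, name string)
row format delimited fields terminated by ','
stored as textfile
location '/demo_ext';
-- partitioned table: one sub-directory per partition value
create table demo_part(id int, name string)
partitioned by (dt string)
row format delimited fields terminated by ',';
-- bucketed table: rows hashed on id into 4 files
create table demo_buck(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';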
Theory alone makes the head spin; the hands-on walkthrough below will make the above concrete.
Using Hive
Hive can be started as an interactive shell:
[root@mini1 ~]# cd apps/hive/bin
[root@mini1 bin]# ll
total 32
-rwxr-xr-x. 1 root root 1031 Apr 30 2015 beeline
drwxr-xr-x. 3 root root 4096 Oct 17 12:38 ext
-rwxr-xr-x. 1 root root 7844 May 8 2015 hive
-rwxr-xr-x. 1 root root 1900 Apr 30 2015 hive-config.sh
-rwxr-xr-x. 1 root root 885 Apr 30 2015 hiveserver2
-rwxr-xr-x. 1 root root 832 Apr 30 2015 metatool
-rwxr-xr-x. 1 root root 884 Apr 30 2015 schematool
[root@mini1 bin]# ./hive
hive>
But the interface is not very friendly. Hive can also be published as a service (the Hive thrift service) and then accessed with the bundled beeline client, as follows.
Window 1: start the service
[root@mini1 ~]# cd apps/hive/bin
[root@mini1 bin]# ll
total 32
-rwxr-xr-x. 1 root root 1031 Apr 30 2015 beeline
drwxr-xr-x. 3 root root 4096 Oct 17 12:38 ext
-rwxr-xr-x. 1 root root 7844 May 8 2015 hive
-rwxr-xr-x. 1 root root 1900 Apr 30 2015 hive-config.sh
-rwxr-xr-x. 1 root root 885 Apr 30 2015 hiveserver2
-rwxr-xr-x. 1 root root 832 Apr 30 2015 metatool
-rwxr-xr-x. 1 root root 884 Apr 30 2015 schematool
[root@mini1 bin]# ./hiveserver2
Window 2: connect as a client
[root@mini1 bin]# ./beeline
Beeline version 1.2.1 by Apache Hive
beeline> [root@mini1 bin]#
[root@mini1 bin]# ./beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: root
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>
Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/tmp":hadoop3:supergroup:drwx------
./hadoop dfs -chmod -R 777 /tmp
Now for some simple usage; enjoy the comfort of plain SQL.
1. Show databases
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
+----------------+--+
1 row selected (1.456 seconds)
2. Create and use a database, then list its tables
0: jdbc:hive2://localhost:10000> create database myhive;
No rows affected (0.576 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| myhive |
+----------------+--+
0: jdbc:hive2://localhost:10000> use myhive;
No rows affected (0.265 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
+-----------+--+
3. Create a table
0: jdbc:hive2://localhost:10000> create table emp(id int,name string);
No rows affected (0.29 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| emp |
+-----------+--+
1 row selected (0.261 seconds)
Upload data into the table's directory. Seen from the web UI it is a directory, as shown below.
There is no file in it yet, hence no data, so we need to upload a file into that directory.
[root@mini1 ~]# cat sz.data
1,zhangsan
2,lisi
3,wangwu
4,furong
5,fengjie
[root@mini1 ~]# hadoop fs -put sz.data /user/hive/warehouse/myhive.db/emp
Check again.
4. Query the table
0: jdbc:hive2://localhost:10000> select * from emp;
+---------+-----------+--+
| emp.id | emp.name |
+---------+-----------+--+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+---------+-----------+--+
The result is of course all NULLs: the table was created without specifying "," as the field delimiter, while the fields in the file are separated by commas. So drop the table, re-upload the file, and recreate the table as follows.
0: jdbc:hive2://localhost:10000> drop table emp;
No rows affected (1.122 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
+-----------+--+
0: jdbc:hive2://localhost:10000> create table emp(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited
0: jdbc:hive2://localhost:10000> fields terminated by ',';
No rows affected (0.265 seconds)
0: jdbc:hive2://localhost:10000>
[root@mini1 ~]# hadoop fs -put sz.data /user/hive/warehouse/myhive.db/emp
0: jdbc:hive2://localhost:10000> select * from emp;
+---------+-----------+--+
| emp.id | emp.name |
+---------+-----------+--+
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
+---------+-----------+--+
6. Conditional query
0: jdbc:hive2://localhost:10000> select id,name from emp where id>2 order by id desc;
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
INFO : number of splits:1
INFO : Submitting tokens for job: job_1508216103995_0004
INFO : The url to track the job: http://mini1:8088/proxy/application_1508216103995_0004/
INFO : Starting Job = job_1508216103995_0004, Tracking URL = http://mini1:8088/proxy/application_1508216103995_0004/
INFO : Kill Command = /root/apps/hadoop-2.6.4/bin/hadoop job -kill job_1508216103995_0004
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2017-10-18 00:35:39,865 Stage-1 map = 0%, reduce = 0%
INFO : 2017-10-18 00:35:46,275 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
INFO : 2017-10-18 00:35:51,487 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.34 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 340 msec
INFO : Ended Job = job_1508216103995_0004
+-----+----------+--+
| id | name |
+-----+----------+--+
| 5 | fengjie |
| 4 | furong |
| 3 | wangwu |
+-----+----------+--+
3 rows selected (18.96 seconds)
Seeing this makes it clear: the SQL we write ends up parsed into a MapReduce program and run on YARN; Hive essentially provides a large set of MapReduce templates.
7. Create an external table
0: jdbc:hive2://localhost:10000> create external table emp2(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ','  // comma as the field delimiter
0: jdbc:hive2://localhost:10000> stored as textfile  // stored as plain text
0: jdbc:hive2://localhost:10000> location '/company';
No rows affected (0.101 seconds)  // stored under the /company directory
0: jdbc:hive2://localhost:10000> dfs -ls /;
+----------------------------------------------------------------------------------------+--+
| DFS Output |
+----------------------------------------------------------------------------------------+--+
| Found 16 items |
| -rw-r--r-- 2 angelababy mygirls 7 2017-10-01 20:22 /canglaoshi_wuma.avi |
| -rw-r--r-- 2 root supergroup 22 2017-10-03 21:12 /cangmumayi.avi |
| drwxr-xr-x - root supergroup 0 2017-10-18 00:55 /company |
| drwxr-xr-x - root supergroup 0 2017-10-13 04:44 /flowcount |
| drwxr-xr-x - root supergroup 0 2017-10-17 03:44 /friends |
| drwxr-xr-x - root supergroup 0 2017-10-17 06:19 /gc |
| drwxr-xr-x - root supergroup 0 2017-10-07 07:28 /liushishi.log |
| -rw-r--r-- 3 12706 supergroup 60 2017-10-04 21:58 /liushishi.love |
| drwxr-xr-x - root supergroup 0 2017-10-17 07:32 /logenhance |
| -rw-r--r-- 2 root supergroup 26 2017-10-16 20:49 /mapjoin |
| drwxr-xr-x - root supergroup 0 2017-10-16 21:16 /mapjoincache |
| drwxr-xr-x - root supergroup 0 2017-10-13 13:15 /mrjoin |
| drwxr-xr-x - root supergroup 0 2017-10-16 23:35 /reverse |
| drwx------ - root supergroup 0 2017-10-17 13:10 /tmp |
| drwxr-xr-x - root supergroup 0 2017-10-17 13:13 /user |
| drwxr-xr-x - root supergroup 0 2017-10-14 01:33 /wordcount |
+----------------------------------------------------------------------------------------+--+
0: jdbc:hive2://localhost:10000> create external table t_sz_ext(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by '\t'
0: jdbc:hive2://localhost:10000> stored as textfile
0: jdbc:hive2://localhost:10000> location '/company';
No rows affected (0.135 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| emp |
| emp2 |
| t_sz_ext |
+-----------+--+
You can see the new /company directory and the two new tables, although at this point there is nothing under /company yet.
8. Load a file into a table
Earlier we used a hadoop command to upload a file into the table's directory, but a file can also be loaded directly from the command line.
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table emp2;  (you can also upload the file directly with hadoop)
INFO : Loading data to table myhive.emp2 from file:/root/sz.data
INFO : Table myhive.emp2 stats: [numFiles=0, totalSize=0]
No rows affected (0.414 seconds)
0: jdbc:hive2://localhost:10000> select * from emp2;
+----------+------------+--+
| emp2.id | emp2.name |
+----------+------------+--+
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
+----------+------------+--+
9. Partitioned table: the partition column is school; load data into 2 different partitions
0: jdbc:hive2://localhost:10000> create table stu(id int,name string)
0: jdbc:hive2://localhost:10000> partitioned by(school string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.319 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| emp |
| emp2 |
| stu |
| t_sz_ext |
+-----------+--+
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table stu partition(school='scu');
INFO : Loading data to table myhive.stu partition (school=scu) from file:/root/sz.data
INFO : Partition myhive.stu{school=scu} stats: [numFiles=1, numRows=0, totalSize=46, rawDataSize=0]
No rows affected (0.607 seconds)
0: jdbc:hive2://localhost:10000> select * from stu;
+---------+-----------+-------------+--+
| stu.id | stu.name | stu.school |
+---------+-----------+-------------+--+
| 1 | zhangsan | scu |
| 2 | lisi | scu |
| 3 | wangwu | scu |
| 4 | furong | scu |
| 5 | fengjie | scu |
+---------+-----------+-------------+--+
5 rows selected (0.286 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz2.data' into table stu partition(school='hfut');
INFO : Loading data to table myhive.stu partition (school=hfut) from file:/root/sz2.data
INFO : Partition myhive.stu{school=hfut} stats: [numFiles=1, numRows=0, totalSize=46, rawDataSize=0]
No rows affected (0.671 seconds)
0: jdbc:hive2://localhost:10000> select * from stu;
+---------+-----------+-------------+--+
| stu.id | stu.name | stu.school |
+---------+-----------+-------------+--+
| 1 | Tom | hfut |
| 2 | Jack | hfut |
| 3 | Lucy | hfut |
| 4 | Kitty | hfut |
| 5 | Lucene | hfut |
| 6 | Sakura | hfut |
| 1 | zhangsan | scu |
| 2 | lisi | scu |
| 3 | wangwu | scu |
| 4 | furong | scu |
| 5 | fengjie | scu |
+---------+-----------+-------------+--+
Note: Hive does not follow the three normal forms; forget about primary keys (a quick illustration follows).
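A small sketch against the stu table above (INSERT ... VALUES needs Hive 0.14 or later; the values are illustrative): inserting a row whose id already exists simply adds another row, and Hive raises no error.
insert into table stu partition(school='scu') values (1,'zhangsan');
select * from stu where school='scu';
-- the result now contains two rows with id = 1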
10. Add a partition
0: jdbc:hive2://localhost:10000> alter table stu add partition (school='Tokyo');
To make it more intuitive, check the web UI (or list the partitions straight from the client, as below).
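From the client, show partitions lists them directly; for the stu table above it should now print something like:
show partitions stu;
-- school=Tokyo
-- school=hfut
-- school=scu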
Hive metadata tables explained
https://blog.csdn.net/haozhugogo/article/details/73274832
Hive bucketing, put simply, splits a table (or a partition, i.e. an HDFS directory whose real data sits in files under it) into several files. Say the table buck (a directory containing some file such as sz.data) holds 1,000,000 rows: when working with large data sets it is very convenient, while developing and revising queries, to try them first against a small subset of the data, so we can split the data across 4 files.
Below is the full sequence of steps, including the problems hit along the way.
Connect, create database myhive2, and use it.
[root@mini1 ~]# cd apps/hive/bin
[root@mini1 bin]# ./beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: root
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| myhive |
+----------------+--+
2 rows selected (1.795 seconds)
0: jdbc:hive2://localhost:10000> create database myhive2;
No rows affected (0.525 seconds)
0: jdbc:hive2://localhost:10000> use myhive2;
No rows affected (0.204 seconds)
Create the bucketed table, load data, and view the table contents.
0: jdbc:hive2://localhost:10000> create table buck(id string,name string)
0: jdbc:hive2://localhost:10000> clustered by (id) sorted by (id) into 4 buckets
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.34 seconds)
0: jdbc:hive2://localhost:10000> desc buck;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | string | |
| name | string | |
+-----------+------------+----------+--+
2 rows selected (0.55 seconds)
load data local inpath '/root/sz.data' into table buck;
INFO : Loading data to table myhive2.buck from file:/root/sz.data
INFO : Table myhive2.buck stats: [numFiles=1, totalSize=91]
No rows affected (1.411 seconds)
0: jdbc:hive2://localhost:10000> select * from buck;
+----------+------------+--+
| buck.id | buck.name |
+----------+------------+--+
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+----------+------------+--+
If the table were bucketed there should be 4 files under the buck directory; check the web UI.
There are not: it is still just the single file we loaded.
That is because neither Hive nor Hadoop splits the files into buckets for us automatically on load; the data has to be written in already bucketed.
We need to enable bucketing and set the number of reduce tasks to match the number of buckets.
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing = true;
No rows affected (0.063 seconds)
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing ;
+------------------------------+--+
| set |
+------------------------------+--+
| hive.enforce.bucketing=true |
+------------------------------+--+
1 row selected (0.067 seconds)
0: jdbc:hive2://localhost:10000> set mapreduce.job.reduces=4;
So we create another table tp and move its data into buck (select it out and insert it in), specifying bucketing on the way in. That produces four buckets, each sorted internally. In other words buck gets bucketed at insert time, not by splitting the files ourselves.
Next, empty the buck table, create table tp, and insert into buck what we select from tp.
0: jdbc:hive2://localhost:10000> truncate table buck;
No rows affected (0.316 seconds)
0: jdbc:hive2://localhost:10000> create table tp(id string,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.112 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table tp;
INFO : Loading data to table myhive2.tp from file:/root/sz.data
INFO : Table myhive2.tp stats: [numFiles=1, totalSize=91]
No rows affected (0.419 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| buck |
| tp |
+-----------+--+
2 rows selected (0.128 seconds)
0: jdbc:hive2://localhost:10000> select * from tp;
+--------+-----------+--+
| tp.id | tp.name |
+--------+-----------+--+
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+--------+-----------+--+
12 rows selected (0.243 seconds)
0: jdbc:hive2://localhost:10000> insert into buck
0: jdbc:hive2://localhost:10000> select id,name from tp distribute by (id) sort by (id);
INFO : Number of reduce tasks determined at compile time: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
INFO : number of splits:1
INFO : Submitting tokens for job: job_1508216103995_0028
INFO : The url to track the job: http://mini1:8088/proxy/application_1508216103995_0028/
INFO : Starting Job = job_1508216103995_0028, Tracking URL = http://mini1:8088/proxy/application_1508216103995_0028/
INFO : Kill Command = /root/apps/hadoop-2.6.4/bin/hadoop job -kill job_1508216103995_0028
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
INFO : 2017-10-19 03:57:23,631 Stage-1 map = 0%, reduce = 0%
INFO : 2017-10-19 03:57:29,349 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.18 sec
INFO : 2017-10-19 03:57:40,096 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 2.55 sec
INFO : 2017-10-19 03:57:41,152 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 5.29 sec
INFO : 2017-10-19 03:57:42,375 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.61 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 610 msec
INFO : Ended Job = job_1508216103995_0028
INFO : Loading data to table myhive2.buck from hdfs://192.168.25.127:9000/user/hive/warehouse/myhive2.db/buck/.hive-staging_hive_2017-10-19_03-57-14_624_1985499545258899177-1/-ext-10000
INFO : Table myhive2.buck stats: [numFiles=4, numRows=12, totalSize=91, rawDataSize=79]
No rows affected (29.238 seconds)
0: jdbc:hive2://localhost:10000> select * from buck;
+----------+------------+--+
| buck.id | buck.name |
+----------+------------+--+
| 11 | fff |
| 4 | furong |
| 8 | ccc |
| 1 | zhangsan |
| 12 | ggg |
| 5 | fengjie |
| 9 | ddd |
| 2 | lisi |
| 6 | aaa |
| 10 | eee |
| 3 | wangwu |
| 7 | bbb |
+----------+------------+--+
At this point it is clear the data has been bucketed; otherwise the ids would come out 1-12 in order. Each of the 4 buckets was sorted on its own, rather than globally as an order by would do. Have a look at the web UI.
You can see there really are 4 buckets; let's inspect their contents from the client (it can run hdfs commands directly).
0: jdbc:hive2://localhost:10000> dfs -ls /user/hive/warehouse/myhive2.db/buck;
+-----------------------------------------------------------------------------------------------------------+--+
| DFS Output |
+-----------------------------------------------------------------------------------------------------------+--+
| Found 4 items |
| -rwxr-xr-x 2 root supergroup 22 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000000_0 |
| -rwxr-xr-x 2 root supergroup 34 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000001_0 |
| -rwxr-xr-x 2 root supergroup 13 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000002_0 |
| -rwxr-xr-x 2 root supergroup 22 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000003_0 |
+-----------------------------------------------------------------------------------------------------------+--+
5 rows selected (0.028 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/myhive2.db/buck/000000_0;
+-------------+--+
| DFS Output |
+-------------+--+
| 11,fff |
| 4,furong |
| 8,ccc |
+-------------+--+
3 rows selected (0.02 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/myhive2.db/buck/000001_0;
+-------------+--+
| DFS Output |
+-------------+--+
| 1,zhangsan |
| 12,ggg |
| 5,fengjie |
| 9,ddd |
+-------------+--+
4 rows selected (0.08 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/myhive2.db/buck/000002_0;
+-------------+--+
| DFS Output |
+-------------+--+
| 2,lisi |
| 6,aaa |
+-------------+--+
2 rows selected (0.088 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/myhive2.db/buck/000003_0;
+-------------+--+
| DFS Output |
+-------------+--+
| 10,eee |
| 3,wangwu |
| 7,bbb |
+-------------+--+
3 rows selected (0.062 seconds)
Note: in select id,name from tp distribute by (id) sort by (id), the distribute by (id) sort by (id) part means bucket by id (hash on id) and sort by id, ascending by default. When both clauses use the same column you can use cluster by (id) instead; that is, it can be written as
insert into buck select id,name from tp cluster by (id);
with the same effect.
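One practical payoff of bucketing is sampling, which is exactly the "try the query on a small slice of the data" motivation mentioned earlier. A small sketch against the buck table:
-- read only the first of the 4 buckets, hashed on id
select * from buck tablesample(bucket 1 out of 4 on id);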
Now look at the following statement.
select a.id,a.name,b.addr from a join b on a.id = b.id;
If tables a and b are already bucketed, and bucketed on the id column, this join only has to look for matching ids inside the corresponding buckets, which saves time.
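A hedged sketch of how that is switched on, assuming a and b really were created as bucketed tables on id (the addr column comes from the example above): with hive.optimize.bucketmapjoin enabled, Hive can join bucket against bucket instead of scanning whole tables.
set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr from a join b on a.id = b.id;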
1. Creating tables
Table creation syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Create a database myhive3 for these tests and switch to it.
1) Create a regular table
0: jdbc:hive2://localhost:10000> create database myhive3;
No rows affected (0.204 seconds)
0: jdbc:hive2://localhost:10000> use myhive3;
No rows affected (0.13 seconds)
0: jdbc:hive2://localhost:10000> create table t1(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';  // comma-delimited; see the earlier section for details
No rows affected (0.117 seconds)
0: jdbc:hive2://localhost:10000> show tables ;
+-----------+--+
| tab_name |
+-----------+--+
| t1 |
+-----------+--+
0: jdbc:hive2://localhost:10000> desc t1;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| name | string | |
+-----------+------------+----------+--+
2) Create an external table
The EXTERNAL keyword lets you create an external table and point it at the actual data with LOCATION at creation time. When Hive creates a managed (internal) table it moves the data into the warehouse path; for an external table it only records where the data lives and does not move it. When a table is dropped, a managed table's metadata and data are deleted together, while for an external table only the metadata is removed and the data stays (see the sketch below).
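A small sketch of the difference (the table names here are illustrative, not the t1/t2 created below):
create table t_managed_demo(id int, name string);
create external table t_external_demo(id int, name string) location '/demo_ext_dir';
-- managed table: the metadata and the warehouse directory holding its files are both removed
drop table t_managed_demo;
-- external table: only the metadata goes away; /demo_ext_dir and its files remain on HDFS
drop table t_external_demo;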
STORED AS
SEQUENCEFILE|TEXTFILE|RCFILE
If the file is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
location, naturally, specifies where the table lives on HDFS.
0: jdbc:hive2://localhost:10000> create external table t2(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ','
0: jdbc:hive2://localhost:10000> stored as textfile
0: jdbc:hive2://localhost:10000> location '/mytable2';
No rows affected (0.133 seconds)
Check on the web UI that the table was created.
It is created directly under the root directory, unlike a regular table, which goes under /user/hive/warehouse.
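This can also be confirmed from the client; desc formatted shows the table type and its HDFS location (output trimmed, the namenode address depends on your cluster):
desc formatted t2;
-- among the output:
-- Table Type:           EXTERNAL_TABLE
-- Location:             hdfs://<namenode>:9000/mytable2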
3) Create a partitioned table
Create a partitioned table with partition column fields string; describing the table will list all of its partitions.
0: jdbc:hive2://localhost:10000> create table t3(id int,name string)
0: jdbc:hive2://localhost:10000> partitioned by(fields string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.164 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table t3 partition (fields ='Chengdu');
INFO : Loading data to table myhive3.t3 partition (fields=Chengdu) from file:/root/sz.data
INFO : Partition myhive3.t3{fields=Chengdu} stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
No rows affected (0.738 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table t3 partition (fields ='Wuhan');
INFO : Loading data to table myhive3.t3 partition (fields=Wuhan) from file:/root/sz.data
INFO : Partition myhive3.t3{fields=Wuhan} stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
No rows affected (0.608 seconds)
0: jdbc:hive2://localhost:10000> select * from t3;
+--------+-----------+------------+--+
| t3.id | t3.name | t3.fields |
+--------+-----------+------------+--+
| 1 | zhangsan | Chengdu |
| 2 | lisi | Chengdu |
| 3 | wangwu | Chengdu |
| 4 | furong | Chengdu |
| 5 | fengjie | Chengdu |
| 6 | aaa | Chengdu |
| 7 | bbb | Chengdu |
| 8 | ccc | Chengdu |
| 9 | ddd | Chengdu |
| 10 | eee | Chengdu |
| 11 | fff | Chengdu |
| 12 | ggg | Chengdu |
| 1 | zhangsan | Wuhan |
| 2 | lisi | Wuhan |
| 3 | wangwu | Wuhan |
| 4 | furong | Wuhan |
| 5 | fengjie | Wuhan |
| 6 | aaa | Wuhan |
| 7 | bbb | Wuhan |
| 8 | ccc | Wuhan |
| 9 | ddd | Wuhan |
| 10 | eee | Wuhan |
| 11 | fff | Wuhan |
| 12 | ggg | Wuhan |
+--------+-----------+------------+--+
Check the web UI.
Both partition directories contain the file sz.data (and a query with a partition filter only reads the matching directory, as below).
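A quick sketch against t3: filtering on the partition column makes Hive read just the matching sub-directory.
-- only the fields=Chengdu sub-directory is scanned; the Wuhan partition is skipped
select * from t3 where fields='Chengdu';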
2. Altering tables
1) Add and drop table partitions
Syntax
Add:
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
Drop:
ALTER TABLE table_name DROP partition_spec, partition_spec,...
Still working with the partitioned table t3 above.
Add a partition fields='Hefei'; its location stays consistent with the other partitions (so LOCATION can be omitted).
Since the Hive client can run hadoop filesystem commands (dfs), we will skip the web UI from here on.
0: jdbc:hive2://localhost:10000> alter table t3 add partition (fields='Hefei');
No rows affected (0.198 seconds)
0: jdbc:hive2://localhost:10000> dfs -ls /user/hive/warehouse/myhive3.db/t3;
+---------------------------------------------------------------------------------------------------------------+--+
| DFS Output |
+---------------------------------------------------------------------------------------------------------------+--+
| Found 3 items |
| drwxr-xr-x - root supergroup 0 2017-10-19 05:17 /user/hive/warehouse/myhive3.db/t3/fields=Chengdu |
| drwxr-xr-x - root supergroup 0 2017-10-19 05:28 /user/hive/warehouse/myhive3.db/t3/fields=Hefei |
| drwxr-xr-x - root supergroup 0 2017-10-19 05:18 /user/hive/warehouse/myhive3.db/t3/fields=Wuhan |
+---------------------------------------------------------------------------------------------------------------+--+
0: jdbc:hive2://localhost:10000> alter table t3 drop partition (fields='Hefei');
INFO : Dropped the partition fields=Hefei
No rows affected (0.536 seconds)
0: jdbc:hive2://localhost:10000> dfs -ls /user/hive/warehouse/myhive3.db/t3;
+---------------------------------------------------------------------------------------------------------------+--+
| DFS Output |
+---------------------------------------------------------------------------------------------------------------+--+
| Found 2 items |
| drwxr-xr-x - root supergroup 0 2017-10-19 05:17 /user/hive/warehouse/myhive3.db/t3/fields=Chengdu |
| drwxr-xr-x - root supergroup 0 2017-10-19 05:18 /user/hive/warehouse/myhive3.db/t3/fields=Wuhan |
+---------------------------------------------------------------------------------------------------------------+--+
2) Rename a table
Syntax
alter table old_name rename to new_name
Rename t1 to t4.
0: jdbc:hive2://localhost:10000> alter table t1 rename to t4;
No rows affected (0.183 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| t2 |
| t3 |
| t4 |
+-----------+--+
3 rows selected (0.127 seconds)
3) Add and replace columns
Syntax
alter table table_name add|replace columns(col_name data_type ...)
Note: ADD appends a new column after all existing columns; REPLACE replaces all of the table's columns.
0: jdbc:hive2://localhost:10000> desc t4;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| name | string | |
+-----------+------------+----------+--+
2 rows selected (0.315 seconds)
0: jdbc:hive2://localhost:10000> alter table t4 add columns (age int);
No rows affected (0.271 seconds)
0: jdbc:hive2://localhost:10000> desc t4;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| name | string | |
| age | int | |
+-----------+------------+----------+--+
3 rows selected (0.199 seconds)
0: jdbc:hive2://localhost:10000> alter table t4 replace columns (no string,name string,scores int);
No rows affected (0.406 seconds)
0: jdbc:hive2://localhost:10000> desc t4;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| no | string | |
| name | string | |
| scores | int | |
+-----------+------------+----------+--+
Common show/describe commands
show tables
show databases
show partitions
show functions
desc formatted table_name;  // like desc table_name, but shows much more detail
3. Data manipulation
1) Loading data with load
Above we already loaded the local file sz.data into table t3.
load copies the file into the table's directory. With the local keyword, the path given by inpath is looked up on the local filesystem; without it, the path is resolved on HDFS, in a form such as
hdfs://namenode:9000/user/hive/project/data1.
If the OVERWRITE keyword is used, the existing contents of the target table (or partition) are removed first, and then the files/directory pointed to by filepath are added to the table/partition.
If the target table (partition) already has a file whose name collides with a file name in filepath, the existing file is replaced by the new one.
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' overwrite into table t4 ;
INFO : Loading data to table myhive3.t4 from file:/root/sz.data
INFO : Table myhive3.t4 stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
No rows affected (0.7 seconds)
0: jdbc:hive2://localhost:10000> select * from t4;
+--------+-----------+------------+--+
| t4.no | t4.name | t4.scores |
+--------+-----------+------------+--+
| 1 | zhangsan | NULL |
| 2 | lisi | NULL |
| 3 | wangwu | NULL |
| 4 | furong | NULL |
| 5 | fengjie | NULL |
| 6 | aaa | NULL |
| 7 | bbb | NULL |
| 8 | ccc | NULL |
| 9 | ddd | NULL |
| 10 | eee | NULL |
| 11 | fff | NULL |
| 12 | ggg | NULL |
+--------+-----------+------------+--+
2) Insert statements
For inserting data into a table there are:
plain inserts; inserting the result of a query on another table (the column counts must match); and saving a query result to a directory (the directory is created automatically, handled by the OutputFormat).
insert into table t4 values('13','zhangsan',99);
0: jdbc:hive2://localhost:10000> truncate table t4;  // empty the table
0: jdbc:hive2://localhost:10000> insert into t4
0: jdbc:hive2://localhost:10000> select id,name from t3;
0: jdbc:hive2://localhost:10000> select * from t4;
+--------+-----------+--+
| t4.no | t4.name |
+--------+-----------+--+
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+--------+-----------+--+
Recreate table t5 and save its contents to the local directory /root/insertDir/test.
0: jdbc:hive2://localhost:10000> insert overwrite local directory '/root/insertDir/test'
0: jdbc:hive2://localhost:10000> select * from t5;
Check locally:
[root@mini1 ~]# cd insertDir/test/
[root@mini1 test]# ll
total 4
-rw-r--r--. 1 root root 91 Oct 19 06:15 000000_0
[root@mini1 test]# cat 000000_0
1zhangsan
2lisi
3wangwu
4furong
5fengjie
6aaa
7bbb
8ccc
9ddd
10eee
11fff
12ggg
4. Querying data with SELECT
The syntax is basically the same as MySQL; just keep the bucketing clauses in mind.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
Plenty of querying was already done above, so there is no need to repeat it; if you can query MySQL you can handle this.
What deserves attention is the difference between order by and sort by (sketch below):
1. order by performs a global sort of the input, so there is only one reducer, which can take a long time when the input is large.
2. sort by is not a global sort; it sorts data before it enters each reducer. So with sort by and mapred.reduce.tasks>1, each reducer's output is sorted, but the overall result is not globally sorted.
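A small side-by-side sketch of the two, using the emp table from earlier (the reducer count is only for illustration):
-- order by: single reducer, one globally sorted result
select id, name from emp order by id desc;
-- sort by: each reducer sorts only its own slice of the data
set mapreduce.job.reduces=2;
select id, name from emp sort by id desc;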
Next, the main topic: joins.
5. Join queries
Joins work much the same as in MySQL.
Prepare some data.
In a.txt:
1,a
2,b
3,c
4,d
7,y
8,u
In b.txt:
2,bb
3,cc
7,yy
9,pp
Create tables a and b, load a.txt into a and b.txt into b.
1) Inner join
0: jdbc:hive2://localhost:10000> create table a(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.19 seconds)
0: jdbc:hive2://localhost:10000> create table b(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.071 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/a.txt' into table a;
0: jdbc:hive2://localhost:10000> load data local inpath '/root/b.txt' into table b;
0: jdbc:hive2://localhost:10000> select * from a;
+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
| 7 | y |
| 8 | u |
+-------+---------+--+
6 rows selected (0.218 seconds)
0: jdbc:hive2://localhost:10000> select * from b;
+-------+---------+--+
| b.id | b.name |
+-------+---------+--+
| 2 | bb |
| 3 | cc |
| 7 | yy |
| 9 | pp |
+-------+---------+--+
4 rows selected (0.221 seconds)
0: jdbc:hive2://localhost:10000> select * from a inner join b on a.id = b.id;
...
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 7 | y | 7 | yy |
+-------+---------+-------+---------+--+
Rows are matched on id; rows that match are joined together.
2) Left outer join (outer can be omitted)
0: jdbc:hive2://localhost:10000> select * from a left outer join b on a.id = b.id;
...
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
+-------+---------+-------+---------+--+
6 rows selected (16.453 seconds)
All rows from the left table are listed; matching rows from the right table are shown, otherwise NULL.
A right outer join is the mirror image.
3) Full outer join
0: jdbc:hive2://localhost:10000> select * from a full outer join b on a.id = b.id;
...
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
| NULL | NULL | 9 | pp |
+-------+---------+-------+---------+--+
Equivalent to the left join plus the right join.
4) Semi join
0: jdbc:hive2://localhost:10000> select * from a left semi join b on a.id = b.id;
+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| 2 | b |
| 3 | c |
| 7 | y |
+-------+---------+--+
3 rows selected (17.511 seconds)
This is the left half of what the left outer join returned.
Note: think of it as IN/EXISTS (...); older Hive versions do not support that syntax, so LEFT SEMI JOIN is used in place of IN/EXISTS, and it is an efficient implementation of them.
For example,
the following subquery rewritten with LEFT SEMI JOIN:
SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key
FROM B);
can be rewritten as:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
Hive ships with many built-in functions, such as case conversion and substring functions; they are all in the official documentation. But the built-ins cannot cover every need, so Hive also lets us define our own functions.
1. How to try out the functions Hive provides
MySQL and Oracle both have a pseudo table for this, but Hive does not, so the following trick works (the commands are sketched right after this list):
1) Create a table dual: create table dual(id string)
2) Create a local file dual.data whose content is a space or a single blank line
3) load the dual.data file into the dual table
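Those three steps as commands (assuming the file was created locally at /root/dual.data):
create table dual(id string);
-- dual.data is a local file containing just one blank line
load data local inpath '/root/dual.data' into table dual;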
Then test, for example a substring:
0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
+------+--+
| _c0 |
+------+--+
| sic |
+------+--+
You can also just run select substr('sichuan',1,3) without a FROM clause, but from dual is the old habit.
2. Custom functions (UDFs)
Add the Maven dependencies:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-common</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.2.1</version>
</dependency>
1) Upper case to lower case
Create a Java class that extends UDF and overloads the evaluate method.
import org.apache.hadoop.hive.ql.exec.UDF;
/**
 * Convert upper case to lower case
 * @author 12706
 */
public class UpperToLowerCase extends UDF {
    /*
     * overload evaluate
     * it must be public
     */
    public String evaluate(String word) {
        String lowerWord = word.toLowerCase();
        return lowerWord;
    }
}
Build the jar and upload it to the Hadoop cluster (the jar is named hive.jar).
0: jdbc:hive2://localhost:10000> select * from t5;
+--------+-----------+--+
| t5.id | t5.name |
+--------+-----------+--+
| 13 | BABY |
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+--------+-----------+--+
13 rows selected (0.221 seconds)
Put the jar on Hive's classpath:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
Create a temporary function, giving the fully qualified class name:
0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.scu.hive.UpperToLowerCase';
Now the custom temporary function tolower() can be used; test it by printing the name column of t5 in lower case.
0: jdbc:hive2://localhost:10000> select id,tolower(name) from t5;
+-----+-----------+--+
| id | _c1 |
+-----+-----------+--+
| 13 | baby |
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+-----+-----------+--+
Show the home location based on a phone number
Java class:
import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;
/**
 * Look up the home location from the first three digits of a phone number
 * @author 12706
 *
 */
public class PhoneNumParse extends UDF{
    static HashMap<String, String> phoneMap = new HashMap<String, String>();
    static{
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }
    public static String evaluate(int phoneNum) {
        String num = String.valueOf(phoneNum);
        String province = phoneMap.get(num.substring(0, 3));
        return province==null?"foreign":province;
    }
    // quick test
    public static void main(String[] args) {
        String string = evaluate(136666);
        System.out.println(string);
    }
}
Build the project and upload it to Linux. Note: if the jar keeps the same name as before, you have to reconnect to the Hive server, otherwise the old jar will not be overwritten; it is better to give each jar a different name.
Create the data file: vi prov.data
Create the table flow(phonenum int, flow int)
load the file into the flow table
[root@mini1 ~]# vi prov.data;
1367788,1
1367788,10
1377788,80
1377788,97
1387788,98
1387788,99
1387788,100
1555118,99
0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/prov.data' into table flow;
INFO : Loading data to table myhive3.flow from file:/root/prov.data
INFO : Table myhive3.flow stats: [numFiles=1, totalSize=88]
No rows affected (0.316 seconds)
0: jdbc:hive2://localhost:10000> select * from flow;
+----------------+------------+--+
| flow.phonenum | flow.flow |
+----------------+------------+--+
| 1367788 | 1 |
| 1367788 | 10 |
| 1377788 | 80 |
| 1377788 | 97 |
| 1387788 | 98 |
| 1387788 | 99 |
| 1387788 | 100 |
| 1555118 | 99 |
+----------------+------------+--+
Add the jar to the classpath, create a temporary function, and test:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
INFO : Added [/root/hive.jar] to class path
INFO : Added resources: [/root/hive.jar]
No rows affected (0.236 seconds)
0: jdbc:hive2://localhost:10000> create temporary function getprovince as 'com.scu.hive.PhoneNumParse';
No rows affected (0.038 seconds)
0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
+-----------+-----------+-------+--+
| phonenum | _c1 | flow |
+-----------+-----------+-------+--+
| 1367788 | beijing | 1 |
| 1367788 | beijing | 10 |
| 1377788 | shanghai | 80 |
| 1377788 | shanghai | 97 |
| 1387788 | shenzhen | 98 |
| 1387788 | shenzhen | 99 |
| 1387788 | shenzhen | 100 |
| 1555118 | foreign | 99 |
+-----------+-----------+-------+--+
Developing a UDF that parses JSON data
We have a file, part of which is shown below, consisting entirely of JSON strings. We want to load it into a table and parse each line into the 4 corresponding fields.
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
Java classes:
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;
import org.apache.hadoop.hive.ql.exec.UDF;
public class JsonParse extends UDF{
    //{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
    //output string: 1193 5 978300760 1
    public static String evaluate(String line){
        MovieRateBean movieRateBean = JSON.parseObject(line, new TypeReference<MovieRateBean>() {});
        return movieRateBean.toString();
    }
}
public class MovieRateBean {
    private String movie;
    private String rate;// rating
    private String timeStamp;
    private String uid;
    @Override
    public String toString() {
        return this.movie+"\t"+this.rate+"\t"+this.timeStamp+"\t"+this.uid;
    }
    // getters and setters
}
Build the project and upload it to Linux.
Create the table json:
create table json(line string);
Load the file into the json table:
load data local inpath '/root/json.data' into table json;
0: jdbc:hive2://localhost:10000> select * from json limit 10;
+----------------------------------------------------------------+--+
| json.line |
+----------------------------------------------------------------+--+
| {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} |
| {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"} |
| {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"} |
| {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"} |
| {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"} |
| {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"} |
| {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"} |
| {"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"} |
| {"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"} |
| {"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"} |
+----------------------------------------------------------------+--+
0: jdbc:hive2://localhost:10000> add jar /root/hive3.jar;
INFO : Added [/root/hive3.jar] to class path
INFO : Added resources: [/root/hive3.jar]
No rows affected (0.023 seconds)
0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.scu.hive.JsonParse';
No rows affected (0.07 seconds)
0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
+---------------------+--+
| _c0 |
+---------------------+--+
| 1193 5 978300760 1 |
| 661 3 978302109 1 |
| 914 3 978301968 1 |
| 3408 4 978300275 1 |
| 2355 5 978824291 1 |
| 1197 3 978302268 1 |
| 1287 5 978302039 1 |
| 2804 5 978300719 1 |
| 594 4 978302268 1 |
| 919 4 978301368 1 |
+---------------------+--+
At this point there is still a shortcoming: no proper column names. We can fix that by combining the function with a CREATE TABLE AS SELECT that names the columns.
0: jdbc:hive2://localhost:10000> create table t_rating as
0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0]as movieid,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid
0: jdbc:hive2://localhost:10000> from json limit 10;
0: jdbc:hive2://localhost:10000> select * from t_rating;
+-------------------+----------------+----------------------+---------------+--+
| t_rating.movieid | t_rating.rate | t_rating.timestring | t_rating.uid |
+-------------------+----------------+----------------------+---------------+--+
| 919 | 4 | 978301368 | 1 |
| 594 | 4 | 978302268 | 1 |
| 2804 | 5 | 978300719 | 1 |
| 1287 | 5 | 978302039 | 1 |
| 1197 | 3 | 978302268 | 1 |
| 2355 | 5 | 978824291 | 1 |
| 3408 | 4 | 978300275 | 1 |
| 914 | 3 | 978301968 | 1 |
| 661 | 3 | 978302109 | 1 |
| 1193 | 5 | 978300760 | 1 |
+-------------------+----------------+----------------------+---------------+--+
Using the transform keyword
Requirement: create a new table with the same content as t_rating, except that the third field (the timestamp) should be turned into the day of the week.
Hive's TRANSFORM keyword lets you call your own scripts from SQL.
It suits cases where Hive lacks a feature and you do not want to write a UDF.
1. Write a Python script (first check the machine has Python) that converts the timestamp column to a weekday
2. Add the .py file
3. Create the new table whose column values are what the py script emits for each t_rating row
[root@mini1 ~]# python
Python 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'hello';
hello
>>> quit()
[root@mini1 ~]# vi weekday_mapper.py;
import sys
import datetime
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
Switch to the Hive client:
0: jdbc:hive2://localhost:10000> add FILE /root/weekday_mapper.py;
1
0: jdbc:hive2://localhost:10000> create TABLE u_data_new as
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> TRANSFORM (movieid, rate, timestring,uid)
0: jdbc:hive2://localhost:10000> USING 'python weekday_mapper.py'
0: jdbc:hive2://localhost:10000> AS (movieid, rate, weekday,uid)
0: jdbc:hive2://localhost:10000> FROM t_rating;
...
0: jdbc:hive2://localhost:10000> select * from u_data_new;
+---------------------+------------------+---------------------+-----------------+--+
| u_data_new.movieid | u_data_new.rate | u_data_new.weekday | u_data_new.uid |
+---------------------+------------------+---------------------+-----------------+--+
| 919 | 4 | 1 | 1 |
| 594 | 4 | 1 | 1 |
| 2804 | 5 | 1 | 1 |
| 1287 | 5 | 1 | 1 |
| 1197 | 3 | 1 | 1 |
| 2355 | 5 | 7 | 1 |
| 3408 | 4 | 1 | 1 |
| 914 | 3 | 1 | 1 |
| 661 | 3 | 1 | 1 |
| 1193 | 5 | 1 | 1 |
+---------------------+------------------+---------------------+-----------------+--+
The goal of this exercise is to analyse the Tomcat access logs of this technical forum site and compute some key metrics for the site operators to use in decision making.
PS: the point of building this is to obtain business metrics that cannot be obtained from third-party tools;
The forum data comes in two parts:
(1) Historical data, about 56GB, up to 2012-05-29. Before that date all log records were appended to a single file.
(2) From 2013-05-30 onwards, one data file is generated per day, about 150MB each. From that date the logs are no longer kept in one file.
Figure 2 shows the record format of the log data; each line has 5 parts: visitor IP, access time, requested resource, access status (HTTP status code), and traffic for the request.
Figure 2: Log record data format
(1) Definition: page views (PV, Page View) is the total number of pages viewed by all users; each page opened by an individual user is counted once.
(2) Analysis: total page views measure how interesting the site is to users, much like ratings do for a TV series.
Formula: count the records, i.e. take the number of accesses from the log.
The forum's registration page is member.php, and when a user clicks register the requested URL is member.php?mod=register.
Formula: count the accesses whose URL is member.php?mod=register.
(1) Definition: the number of distinct independent IPs that visit the site within one day. The same IP counts as one independent IP no matter how many pages it visits.
(2) Analysis: this is the most familiar metric. Regardless of how many computers or users sit behind a single IP, the number of independent IPs is, to some degree, the most direct measure of how well site promotion is working.
Formula: count the distinct visitor IPs.
(1) Definition: the share of visits that viewed only one page before leaving the site, i.e. single-page visits / total visits.
(2) Analysis: bounce rate is a very important visitor stickiness indicator; it shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality and the more interested visitors are in the content, i.e. the more likely they are to become effective, loyal users.
PS: the metric can also measure the effect of online marketing, showing how many visitors were drawn to a landing page or the site and then lost again, the cooked duck that flew away. For example, when a site advertises in some channel, the bounce rate of visitors coming from that source tells you whether the channel was a good choice, whether the ad copy was well written, and whether the landing page offers a good user experience.
Formula: (1) count the IPs that appear in only one record within a day, called the bounce count; (2) bounce count / PV.
1 Data cleaning
Use MapReduce to clean the raw data in HDFS so it can be analysed later;
2 Statistical analysis
Use Hive to run the statistics over the cleaned data;
The forum data comes in two parts:
(1) Historical data, about 56GB, up to 2012-05-29. Before that date all log records were appended to a single file.
(2) From 2013-05-30 onwards, one data file is generated per day, about 150MB each. From that date the logs are no longer kept in one file.
Figure 1 shows the record format of the log data; each line has 5 parts: visitor IP, access time, requested resource, access status (HTTP status code), and traffic for the request.
Figure 1: Log record data format
This exercise uses two log files from 2013, access_2013_05_30.log and access_2013_05_31.log, available at: http://pan.baidu.com/s/1pJE7XR9
(1) Based on the key-metric analysis in the previous part, none of the statistics involve the access status (HTTP status code) or the traffic, so we can strip those two fields first;
(2) Given the log record format, we need to convert the date into the usual plain format such as 20150426, so we write a class that converts the logged date;
(3) Requests for static resources are meaningless for the analysis, so records beginning with "GET /staticsource/" can be filtered out; and since the GET and POST strings carry no information for us either, they can be dropped as well;
First, upload the log data into HDFS for processing. There are a few situations:
(1) If the log servers hold relatively little data and are under light load, shell commands are enough to upload the data to HDFS;
(2) If there are many log servers and large data volumes, use Flume for the data transfer;
Our experimental data files are small, so we simply use the first approach, plain shell commands.
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
110.52.250.126 20130530173820 data/cache/style_1_widthauto.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wsh_zk.css
110.52.250.126 20130530173820 data/cache/style_1_forum_index.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wx_jqr.gif
27.19.74.143 20130530173820 data/attachment/common/c8/common_2_verify_icon.png
27.19.74.143 20130530173820 data/cache/common_smilies_var.js?y7a
package com.neuedu;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LogCleanJob {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage:Merge and duplicate removal ");
System.exit(2);
}
Job job = Job.getInstance(conf, "LogCleanJob");
job.setJarByClass(LogCleanJob.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
LogParser logParser = new LogParser();
Text outputValue = new Text();
protected void map(
LongWritable key,
Text value,
Context context)
throws java.io.IOException, InterruptedException {
final String[] parsed = logParser.parse(value.toString());
// step1. filter out requests for static resources
if (parsed[2].startsWith("GET /static/")
|| parsed[2].startsWith("GET /uc_server")) {
return;
}
// step2. strip the leading "GET /" or "POST /"
if (parsed[2].startsWith("GET /")) {
parsed[2] = parsed[2].substring("GET /".length());
} else if (parsed[2].startsWith("POST /")) {
parsed[2] = parsed[2].substring("POST /".length());
}
// step3. strip the trailing " HTTP/1.1"
if (parsed[2].endsWith(" HTTP/1.1")) {
parsed[2] = parsed[2].substring(0, parsed[2].length()
- " HTTP/1.1".length());
}
// step4. write out only the first three fields
outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
context.write(key, outputValue);
}
}
static class MyReducer extends
Reducer<LongWritable, Text, Text, NullWritable> {
protected void reduce(
LongWritable k2,
Iterable<Text> values,
Context context)
throws java.io.IOException, InterruptedException {
context.write(values.iterator().next(), NullWritable.get());
}
}
/*
* Log parsing class
*/
static class LogParser {
public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
"d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
"yyyyMMddHHmmss");
public static void main(String[] args) throws ParseException {
final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
LogParser parser = new LogParser();
final String[] array = parser.parse(S1);
System.out.println("Sample data: " + S1);
System.out.format(
"Parse result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
array[0], array[1], array[2], array[3], array[4]);
}
/**
* Parse the English-format time string
*
* @param string
* @return
*/
private Date parseDateFormat(String string) {
Date parse = null;
try {
parse = FORMAT.parse(string);
} catch (ParseException e) {
e.printStackTrace();
}
return parse;
}
/**
* Parse one log line record
*
* @param line
* @return an array of 5 elements: ip, time, url, status, traffic
*/
public String[] parse(String line) {
String ip = parseIP(line);
String time = parseTime(line);
String url = parseURL(line);
String status = parseStatus(line);
String traffic = parseTraffic(line);
return new String[] { ip, time, url, status, traffic };
}
private String parseTraffic(String line) {
final String trim = line.substring(line.lastIndexOf("\"") + 1)
.trim();
String traffic = trim.split(" ")[1];
return traffic;
}
private String parseStatus(String line) {
final String trim = line.substring(line.lastIndexOf("\"") + 1)
.trim();
String status = trim.split(" ")[0];
return status;
}
private String parseURL(String line) {
final int first = line.indexOf("\"");
final int last = line.lastIndexOf("\"");
String url = line.substring(first + 1, last);
return url;
}
private String parseTime(String line) {
final int first = line.indexOf("[");
final int last = line.indexOf("+0800]");
String time = line.substring(first + 1, last).trim();
Date date = parseDateFormat(time);
return dateformat1.format(date);
}
private String parseIP(String line) {
String ip = line.split("- -")[0].trim();
return ip;
}
}
}
HIVE
To run the statistics with Hive we first need to load the cleaned data into Hive, which means creating a table. We choose a partitioned table, partitioned by date; the create statement is below. (The key point is the mapped HDFS location; here it is /project/techbbs/cleaned, where the cleaned data was stored.)
hive> dfs -mkdir -p /project/techbbs/cleaned
hive>CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';
After creating the partitioned table we need to add a partition; the statements are below (here we create the partition for the 20150425 logs):
hive>ALTER TABLE techbbs ADD PARTITION(logdate='2015_04_25') LOCATION '/project/techbbs/cleaned/2015_04_25';
hive> load data local inpath '/root/cleaned' into table techbbs3 partition(logdate='2015_04_25');
(1) Key metric 1: PV
Page views (PV, Page View) are the total pages viewed by all users; each page opened by an individual user is counted once. Here we only need to count the records in the log. HQL:
hive>CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS PV FROM techbbs WHERE logdate='2015_04_25';
(2) Key metric 2: number of registered users
The forum's registration page is member.php, and clicking register requests member.php?mod=register. So we only need to count the log entries whose URL is member.php?mod=register. HQL:
hive>CREATE TABLE techbbs_reguser_2015_04_25 AS SELECT COUNT(1) AS REGUSER FROM techbbs WHERE logdate='2015_04_25' AND INSTR(url,'member.php?mod=register')>0;
(3) Key metric 3: number of independent IPs
The number of distinct IPs that visited the site within one day; the same IP counts once regardless of how many pages it viewed. So we only need to count the distinct IPs in the log, using the DISTINCT keyword just as in ordinary SQL:
hive>CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS IP FROM techbbs WHERE logdate='2015_04_25';
(4) Key metric 4: number of bouncing users
The number of visits that viewed only one page and then left the site. Here we group by visitor IP; if a group contains only one record, that visitor bounced. Summing those users gives the bounce count. HQL:
hive>select count(*) from (select ip,count(ip) as num from techbbs group by ip) as tmpTable where tmpTable.num = 1;
PS: the bounce rate is the share of visits that viewed only one page before leaving, i.e. single-page visits / total visits. Dividing the bounce count obtained here by the PV count gives the bounce rate (a sketch follows).
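A hedged sketch of that final division, reusing the techbbs_pv_2015_04_25 table from metric (1); the intermediate table name here is only illustrative:
hive>CREATE TABLE techbbs_jumper_2015_04_25 AS SELECT COUNT(1) AS jumper FROM (SELECT ip, COUNT(ip) AS num FROM techbbs WHERE logdate='2015_04_25' GROUP BY ip HAVING COUNT(ip)=1) e;
hive>SELECT j.jumper/p.pv AS jump_rate FROM techbbs_jumper_2015_04_25 j CROSS JOIN techbbs_pv_2015_04_25 p;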