Homework for the class on November 3, 2017
Hive, Day 3
[toc]
Review of Day 2
Where to find the Hive documentation
https://cwiki.apache.org/confluence/display/Hive/Home
Hive SQL Language Manual: Commands, CLIs, Data Types,
DDL (create/drop/alter/truncate/show/describe), Statistics (analyze), Indexes, Archiving,
DML (load/insert/update/delete/merge, import/export, explain plan),
Queries (select), Operators and UDFs, Locks, Authorization
File Formats and Compression: RCFile, Avro, ORC, Parquet; Compression, LZO
Procedural Language: Hive HPL/SQL
Hive Configuration Properties
Hive Clients
Hive Client (JDBC, ODBC, Thrift)
HiveServer2: Overview, HiveServer2 Client and Beeline, Hive Metrics
DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
database
- create
- drop
- alter
- Use
Table
Create
CREATE [TEMPORARY] [EXTERNAL] TABLE
Create Table Like / Create Table As Select (CTAS)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
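A minimal sketch of both variants, assuming hypothetical table names (pv_users as an existing table):
-- CTAS: create and populate a new table from a query
CREATE TABLE pv_copy AS SELECT * FROM pv_users;
-- LIKE: copy only the table definition (no data), per the syntax above
CREATE TABLE pv_empty LIKE pv_users;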
Three types of tables
Temporary tables: TEMPORARY
Their lifetime matches the Hive session: when the Hive client closes or exits, the table is dropped with it
A temporary table takes precedence over other tables: if it has the same name as an existing table, operations go to the temporary table
until we drop the temporary table (or rename it with ALTER), we cannot access the other table
External tables: EXTERNAL
Hive manages only the metadata; dropping the table deletes just the metadata, and the data on HDFS is not deleted
A LOCATION needs to be specified
Managed (internal) tables: no modifier keyword
Hive manages everything; when the table is dropped, both the metadata and the data on HDFS are gone
Be especially careful: don't drop data unless you really mean to. The sketch below shows all three.
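A minimal sketch of the three table types, assuming hypothetical table names and an HDFS path:
-- temporary: disappears automatically when the session ends
CREATE TEMPORARY TABLE tmp_users (id INT, name STRING);
-- external: DROP removes only the metadata, the files under LOCATION stay on HDFS
CREATE EXTERNAL TABLE ext_users (id INT, name STRING)
LOCATION '/data/ext_users';
-- managed (internal): DROP removes both the metadata and the HDFS data
CREATE TABLE inner_users (id INT, name STRING);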
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
ROW FORMAT
Specifies how the raw data is parsed into the columns of the Hive table
the raw data itself is not changed when it is loaded
PARTITIONED BY
Partitions the data
STORED AS
The file format used to store the data
LOCATION
The HDFS directory where the table's data is stored
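A minimal sketch combining the four clauses above, with hypothetical table and path names:
CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
PARTITIONED BY (dept STRING)                    -- partition column, kept out of the column list
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- how the raw text is parsed
STORED AS TEXTFILE                              -- file format on HDFS
LOCATION '/user/hive/warehouse/employees';      -- HDFS directory for the table's data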
Drop
Truncate
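For reference, a minimal sketch of both statements on a hypothetical table t:
DROP TABLE IF EXISTS t;   -- removes the table; for a managed table the HDFS data goes too
TRUNCATE TABLE t;         -- keeps the definition, deletes all the data (managed tables only)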
DML
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
LOAD
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL means the local filesystem
LOCAL combined with INPATH decides whether the data is read from HDFS or from the client's local filesystem
When we load data, the data file is actually moved into the table's directory under the Hive warehouse directory
A file already on HDFS is simply moved there
With LOCAL, the file is first uploaded to a temporary directory and then moved to the target location
A quick check: is anyone still unclear about the warehouse directory versus the local path?
hive.metastore.warehouse.dir
/user/hive/warehouse
Files on the local Linux filesystem require the LOCAL keyword
For files on HDFS, just write the filepath directly
OVERWRITE
Whether to overwrite the existing data
Without OVERWRITE, a file with the same name is copied into the Hive data directory again and ends up duplicated as xxx_copy
PARTITION
Partitions, e.g. PARTITION (gender='male', age='35'); a few examples follow
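A few hedged examples of the variants above, with hypothetical paths and a hypothetical users table:
-- from the local Linux filesystem (a copy is uploaded, the local file stays)
LOAD DATA LOCAL INPATH '/root/users.txt' INTO TABLE users;
-- from HDFS (the file is moved into the table's directory)
LOAD DATA INPATH '/tmp/users.txt' INTO TABLE users;
-- replace existing data instead of accumulating xxx_copy files
LOAD DATA LOCAL INPATH '/root/users.txt' OVERWRITE INTO TABLE users;
-- load into a specific partition
LOAD DATA LOCAL INPATH '/root/male_35.txt' INTO TABLE users
PARTITION (gender='male', age='35');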
INSERT
Inserting data into Hive tables from queries
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Example
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
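Note: dynamic partition inserts like the one above generally need dynamic partitioning enabled first; these are the standard Hive properties (nonstrict mode is only required when every partition column is dynamic):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;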
Inserting values into Hive tables from SQL
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
Examples
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp)
VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
Today's topics
HiveServer2 Client and Beeline
HiveServer2
Beeline
Operators and UDFs
Operators
UDFs: User-Defined Functions
An overview of all Hive topics
HiveServer2 Client and Beeline
HiveServer2 must be started before running Beeline
Dependencies of HS2 (HiveServer2)
Before starting HS2, the following must be running:
Metastore
The metastore needs to be started:
hive --service metastore &
The metastore can be configured as embedded (in the same process as HS2) or as a remote server (which is a Thrift-based service as well). HS2 talks to the metastore for the metadata required for query compilation.
Hadoop cluster
start-all.sh
HS2 prepares physical execution plans for various execution engines (MapReduce/Tez/Spark) and submits jobs to the Hadoop cluster for execution.
These properties can be configured in hive-site.xml:
hive.server2.thrift.min.worker.threads – Minimum number of worker threads, default 5.
hive.server2.thrift.max.worker.threads – Maximum number of worker threads, default 500.
hive.server2.thrift.port – TCP port number to listen on, default 10000.
hive.server2.thrift.bind.host – TCP interface to bind to.
Two ways to start it (How to Start):
$HIVE_HOME/bin/hiveserver2
$HIVE_HOME/bin/hive --service hiveserver2
Write a JDBC program that connects to Hive and operates on Hive tables
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC
Ctrl + F, search the page for:
JDBC Client Sample Code
Create a Java project
Create the package
Create the class
Paste in the sample code
Run As (run the program)
Problem: the JDBC classes are missing
Add the jars from Hive's lib directory to our build path
Also add the Hadoop jars to the build path
(put all the jars from Hadoop's share directories together)
Problem: the JDBC URL is wrong and needs fixing
The username needs to be changed
Find the right data format
https://baike.baidu.com/item/ASCII
The default field delimiter is SOH
binary 0000 0001, decimal 1, hex 01
SOH (Start of Heading), an ASCII control character
Shell script that creates a data file with the default delimiter and runs the JDBC client
#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-core*.jar)
CLASSPATH=.:$HIVE_HOME/conf:$(hadoop classpath)
for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH HiveJdbcClient
Operators and UDFs
The differences between UDF, UDAF, and UDTF
A question from 一叶知秋: what exactly is the difference between UDF, UDAF, and UDTF?
User-Defined Functions (UDFs)
One row in, one value out
e.g. mask()
Aggregate Functions (UDAF)
Many rows in, one value out
Table-Generating Functions (UDTF)
One row in, multiple rows out; can produce more complex types (see the sketch below)
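A hedged sketch using built-in functions and the students table from the INSERT ... VALUES example earlier in these notes:
-- UDF: one row in, one value out
SELECT lower(name) FROM students;
-- UDAF: many rows in, one value out per group
SELECT age, count(*), avg(gpa) FROM students GROUP BY age;
-- UDTF: one row in, multiple rows out
SELECT explode(array(1, 2, 3)) AS n FROM students;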
https://www.iteblog.com/archives/2258.html
todo
Where to find the example
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
Paste the program into Eclipse and resolve all the little red error markers
Implement the mask logic with substring
Step 1. First, you need to create a new class that extends UDF, with one or more methods named evaluate.
package com.youxiaoxueyuan.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Mask extends UDF {
  // Mask everything except the first and last character, e.g. "beijing" -> "b*****g"
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    String str = s.toString();
    if (str.length() < 2) {
      // too short to mask meaningfully; return it unchanged
      return new Text(str);
    }
    return new Text(str.substring(0, 1) + "*****" + str.substring(str.length() - 1));
  }
}
Step 2. After compiling your code to a jar, you need to add it to the Hive classpath (see the wiki's section on deploying jars).
ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath1> [<filepath2>]*
ADD JAR /root/Mask.jar
Step 3. Once Hive is started up with your jars in the classpath, the final step is to register your function (Create Function):
create temporary function mask as 'com.youxiaoxueyuan.udf.Mask';
Step 4. Now you can start using it:
The wiki's example (with its my_lower UDF and titles table):
select my_lower(title), sum(freq) from titles group by my_lower(title);
Our example with mask:
select key, mask(value), value from testhivedrivertable;
ROW FORMAT with a regular expression (RegexSerDe)
Purpose and significance
Data sources
Two kinds:
- collected by ourselves
we can collect it in exactly the format we require
- obtained from others (other departments in the company, bought externally, collected by sensors, crawled from the internet)
ETL (Extract, Transform, Load): cleansing the data
A good habit from 水若清寒: review the words you come across.
https://baike.baidu.com/item/ETL/1251949
MapReduce
take each record
split it by the delimiter
process each record and strip out the unwanted content
write the wanted content to HDFS
then run data analysis, data mining, AI, or BI on the cleansed data
A tool for practicing regular expressions
http://tool.chinaz.com/regex/
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Finally, the example:
CREATE TABLE logtbl1 (
host STRING,
identity STRING,
t_user STRING,
time STRING,
request STRING,
status STRING,
size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
The SERDE keyword
is followed by 'org.apache.hadoop.hive.serde2.RegexSerDe'
RegEx: regular expressions
https://baike.baidu.com/item/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F
The source log file:
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
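A hedged example of loading this log and querying the parsed columns, assuming the file was saved locally as /root/access.log (hypothetical path):
LOAD DATA LOCAL INPATH '/root/access.log' INTO TABLE logtbl1;
-- the RegexSerDe maps each capture group to a column, in order
SELECT host, time, request FROM logtbl1 LIMIT 5;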
Day 1: basics plus environment setup
VMware version: 12.5.2 build-4638234
Students who can't get the environment working can download hh.rar from the course materials,
open it with VMware 12.5.2, start it up, and it is ready to use
Startup steps:
start-all.sh
service mysqld restart
hive --service metastore &
hive --service hiveserver2 &   (equivalent to simply running: hiveserver2)
hive
An installation video covering the various problems encountered and how to solve them
The "bumpy" version of the environment-installation video is in this directory
I won't repeat the summary here; the summary of Day 1 is at the beginning of the Day 2 video,
and the summary of Day 2 is at the beginning of the Day 3 video
Today's summary
[TOC]