Hive UDF functions, implementing WordCount in Hive, and enabling recursive file support

1. UDF Functions

UDFs
Built-ins such as sum and count cover the basics,
but production workloads cannot be handled with built-ins alone
==> we extend Hive with the functions we need.
Migration: RDBMS ==> cloud / big data (Hive/Spark)
    rework the existing business logic in Hive syntax,
    implement functions with the same names as the RDBMS ones,
    then DIFF the results to verify.

UDF: User-Defined Function
    UDF: one-to-one, e.g. upper, substr(ename, ...)
    UDAF: User-Defined Aggregation Function, e.g. sum, count, max
        many-to-one
    UDTF: User-Defined Table-Generating Function, e.g. explode
        one-to-many

Steps to write a custom UDF

1. Create a new class that extends UDF,
   with one or more methods named evaluate (method overloading is supported).
2. Compile your code into a jar.
3. Add the jar to the Hive classpath (TODO…)
   add jar …
4. CREATE FUNCTION

UDF example code
package com.ccj.pxj.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFRemoveRandomPrefix extends UDF {
    // Strips everything up to and including the last "_",
    // e.g. "1_pxj" -> "pxj"; if there is no "_", returns the input unchanged.
    public String evaluate(String name) {
        return name.substring(name.lastIndexOf("_") + 1);
    }

    public static void main(String[] args) {
        UDFRemoveRandomPrefix udf = new UDFRemoveRandomPrefix();
        System.out.println(udf.evaluate("1_pxj")); // pxj
    }
}
package com.ccj.pxj.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import java.util.Random;

public class UDFAddRandomPrefix extends UDF {
    // Reuse one Random instance instead of allocating a new one per call.
    private final Random random = new Random();

    // Prepends a random prefix in [0, 10), e.g. "pxj" -> "7_pxj".
    public String evaluate(String name) {
        int prefix = random.nextInt(10);
        return prefix + "_" + name;
    }

    public static void main(String[] args) {
        UDFAddRandomPrefix udf = new UDFAddRandomPrefix();
        for (int i = 0; i < 10; i++) {
            System.out.println(udf.evaluate("pxj"));
        }
    }
}
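These two UDFs are a common salting trick for mitigating data skew: add a random prefix to the key before a first partial aggregation, then strip it for the final aggregation. The round trip can be checked in plain Java (same string logic as the UDFs above, with the Hive UDF base class omitted so it runs standalone):

```java
import java.util.Random;

public class PrefixRoundTrip {
    private static final Random RANDOM = new Random();

    // Same logic as UDFAddRandomPrefix.evaluate: prepend "<n>_" with n in [0, 10)
    static String addRandomPrefix(String name) {
        return RANDOM.nextInt(10) + "_" + name;
    }

    // Same logic as UDFRemoveRandomPrefix.evaluate: strip up to the last "_"
    static String removePrefix(String name) {
        return name.substring(name.lastIndexOf("_") + 1);
    }

    public static void main(String[] args) {
        String salted = addRandomPrefix("pxj");
        System.out.println(salted);               // e.g. "7_pxj"
        System.out.println(removePrefix(salted)); // "pxj"
    }
}
```

Whatever random prefix gets added, removing it always restores the original key, so the final aggregation is unaffected.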
Registration method 1: temporary function
hive (default)> add jar /home/pxj/lib/hivecode-1.0-SNAPSHOT.jar;
Added [/home/pxj/lib/hivecode-1.0-SNAPSHOT.jar] to class path
Added resources: [/home/pxj/lib/hivecode-1.0-SNAPSHOT.jar]
hive (default)> create temporary function pxj as 'com.ccj.pxj.udf.PxjUDF';
OK
Time taken: 0.01 seconds
hive (default)> select pxj(ename) from emp;
OK
_c0
smith
allen
ward
jones
martin
blake
clark
scott
king
turner
adams
james
ford
miller
hive
hive (default)> show functions;
OK
tab_name
pxj
This method only registers the function for the current session; it disappears when the session ends.
Registration method 2: permanent function from a jar on HDFS
[pxj@pxj /home/pxj/lib]$hadoop fs -mkdir -p  /pxj60/lib
20/02/02 00:58:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[pxj@pxj /home/pxj/lib]$ll
total 856
-rw-r--r--. 1 pxj pxj   2647 Jan 31 21:57 hivecode-1.0-SNAPSHOT.jar
-rw-r--r--. 1 pxj pxj 872303 Jan 18 11:46 mysql-connector-java-5.1.27-bin.jar
[pxj@pxj /home/pxj/lib]$hadoop fs -put hivecode-1.0-SNAPSHOT.jar /pxj60/lib
20/02/02 01:04:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[pxj@pxj /home/pxj/lib]$
hive (default)> CREATE FUNCTION pxj AS 'com.ccj.pxj.udf.PxjUDF' USING JAR 'hdfs://pxj:9000/pxj60/lib/hivecode-1.0-SNAPSHOT.jar';
converting to local hdfs://pxj:9000/pxj60/lib/hivecode-1.0-SNAPSHOT.jar
Added [/tmp/c1c645e9-1770-468c-8fef-dcc874c260a1_resources/hivecode-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs://pxj:9000/pxj60/lib/hivecode-1.0-SNAPSHOT.jar]
OK
Time taken: 1.527 seconds
hive (default)> select pxj(ename) from emp;
OK
_c0
smith
allen
ward
jones
martin
blake
clark
scott
king
turner
adams
james
ford
miller
hive
Time taken: 3.848 seconds, Fetched: 15 row(s)
Registration method 3: modify the Hive source
Edit FunctionRegistry in the Hive source and add:
    system.registerUDF("pxj", PxjUDF.class, false);
Rebuild Hive:
    download the 1.1.0-cdh5.16.2 source
    mvn clean package
    start hive and verify with
        show functions

2. Implementing WordCount in Hive

1. Create the table

hive (default)> create  table pxj_word(
              > word string
              > );
OK
Time taken: 0.097 seconds
hive (default)> select * from pxj_word;
OK
pxj_word.word
pxj,pxj,pxj
wfy,wfy
ccj
hive (default)> select
              > word,
              > count(1) cnt
              > from
              > (
              > select explode(split(word,',')) word from pxj_word
              > ) t
              > group by word;
Total MapReduce CPU Time Spent: 3 seconds 600 msec
OK
word    cnt
ccj 1
pxj 3
wfy 2
Time taken: 28.574 seconds, Fetched: 3 row(s)
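The query above is the Hive idiom for word count: split turns each row into an array, explode flattens the arrays into one row per word, and group by counts the words. The same pipeline can be sketched in plain Java:

```java
import java.util.Map;
import java.util.TreeMap;

public class HiveWordCountSketch {
    // Mirrors: select word, count(1) cnt
    //          from (select explode(split(word, ',')) word from pxj_word) t
    //          group by word
    static Map<String, Long> wordCount(String[] lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {               // each row of pxj_word
            for (String word : line.split(",")) { // split + explode
                counts.merge(word, 1L, Long::sum); // group by word, count(1)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] rows = {"pxj,pxj,pxj", "wfy,wfy", "ccj"}; // the sample data above
        System.out.println(wordCount(rows)); // {ccj=1, pxj=3, wfy=2}
    }
}
```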

3. Enabling recursive subdirectory support in Hive and Hadoop

/data/hive/mulit_file/1.txt
/data/hive/mulit_file/sub_dir/2.txt
==> the directory holding the data may itself contain subdirectories

Input: /data/hive/mulit_file
Goal: run wordcount (wc) over the whole tree

[pxj@pxj /home/pxj/app/hadoop/etc/hadoop]$vim mapred-site.xml
<property>
  <name>mapreduce.input.fileinputformat.input.dir.recursive</name>
  <value>true</value>
</property>
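The section title mentions the Hive side too; recursion can also be switched on per session from the Hive CLI. The two properties below are standard Hive/Hadoop settings (shown here as session-level set commands; adding them to hive-site.xml / mapred-site.xml makes them permanent):

```sql
-- session-level switches; persist them in the config files if needed
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
```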
[pxj@pxj /home/pxj/app/hadoop/etc/hadoop]$hadoop fs  -mkdir -p /data/hive/mulit_file
20/02/02 01:47:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[pxj@pxj /home/pxj/app/hadoop/etc/hadoop]$hadoop fs  -mkdir -p /data/hive/mulit_file/sub_dir
20/02/02 01:48:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[pxj@pxj /home/pxj/app/hadoop]$hadoop jar \
> ./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
> wordcount \
>  /data/hive/mulit_file  /data/hive/mulit_file/output
[pxj@pxj /home/pxj/app/hadoop]$hadoop fs -text /data/hive/mulit_file/output/part-r-00000
20/02/02 01:56:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ccj 1
pxj 3
pxj,ccj 1
pxj,wfy 1
wfy 3
wi  1
wo  1
woo 1

Author: pxj (潘陈)
Date: 2020-02-02, 2:02 AM
