Hive User-Defined Functions: UDF, UDAF, UDTF

1 What is a UDF

UDF is short for User-Defined Function. Beyond the functions Hive ships with, we can define our own when the built-ins do not cover our needs.

Besides UDFs, we can also define aggregate functions (UDAF) and table-generating functions (UDTF).

 

 

2 How to create a UDF

2.1 Write a Java class extending UDF or GenericUDF

If the return type is a simple primitive, extending UDF and implementing an evaluate method is enough. For more complex types, extend GenericUDF and implement the initialize, evaluate, and getDisplayString methods.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.io.Text;

public class UDFStripDoubleQuotes extends UDF {

       private static final String DOUBLE_QUOTES = "\"";

       private static final String BLANK_SYMBOL = "";

       public Text evaluate(Text text) throws UDFArgumentException {
              if (null == text || BLANK_SYMBOL.equals(text.toString())) {
                     throw new UDFArgumentException("The function STRIP_DOUBLE_QUOTES(s) takes exactly 1 argument.");
              }

              String temp = text.toString().trim();
              if (temp.startsWith(DOUBLE_QUOTES) || temp.endsWith(DOUBLE_QUOTES)) {
                     // Drop every double quote when the value is wrapped in them.
                     temp = temp.replace(DOUBLE_QUOTES, BLANK_SYMBOL);
              }
              return new Text(temp);
       }
}
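The core string logic can be exercised outside Hive before packaging the jar. Below is a minimal plain-Java sketch of the same quote-stripping rule (the StripQuotesSketch class and stripQuotes helper are hypothetical names for illustration, not part of the Hive API):

```java
public class StripQuotesSketch {
    private static final String DOUBLE_QUOTES = "\"";

    // Same rule as evaluate(): trim, then drop all double quotes
    // if the value starts or ends with one.
    static String stripQuotes(String text) {
        if (text == null || text.isEmpty()) {
            throw new IllegalArgumentException("takes exactly 1 non-empty argument");
        }
        String temp = text.trim();
        if (temp.startsWith(DOUBLE_QUOTES) || temp.endsWith(DOUBLE_QUOTES)) {
            temp = temp.replace(DOUBLE_QUOTES, "");
        }
        return temp;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"NEW YORK\""));  // NEW YORK
        System.out.println(stripQuotes("'CHICAGO'"));     // 'CHICAGO' (single quotes untouched)
    }
}
```

Note that, as in the UDF above, single-quoted values pass through unchanged, which explains the 'CHICAGO' and 'BOSTON' rows in the test output later in this article.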

2.2 Compile the class and package it into a jar

2.3 Add the jar in Hive

hive (hadoop)> add jar /opt/data/UDFStripDoubleQuotes.jar;

Added /opt/data/UDFStripDoubleQuotes.jar to class path

Added resource: /opt/data/UDFStripDoubleQuotes.jar

 

2.4 Create a temporary or permanent function

2.4.1 Create a temporary function

Syntax: CREATE TEMPORARY FUNCTION strip_double_quotes

AS 'com.hive.udf.UDFStripDoubleQuotes';

 

2.4.2 Create a permanent function

CREATE FUNCTION [db_name.]function_name AS class_name

[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];

In plain terms: put the jar on HDFS, specify which database the function belongs to, and point the USING JAR clause at the HDFS path where the jar lives.

For example:

CREATE FUNCTION hadoop09.lu_str AS 'com.hive.udf.LowerAndUpperUDF'

USING JAR 'hdfs:/var/hive/udf/lowerOrUpper.jar';

 

2.5 Test

Prepare the test data: /opt/data/quotes.txt

"10"    "ACCOUNTING"    "NEW YORK"

"20"    "RESEARCH"      "DALLAS"

"30"    "SALES" 'CHICAGO'

"40"    "OPERATIONS"    'BOSTON'

 

On the Hive side:

CREATE TABLE t_dept LIKE dept;

LOAD DATA LOCAL INPATH '/opt/data/quotes.txt' INTO TABLE t_dept;

 

SELECT strip_double_quotes(dname) name, strip_double_quotes(loc) loc FROM t_dept;

Result:

name          loc

ACCOUNTING    NEW YORK

RESEARCH      DALLAS

SALES         'CHICAGO'

OPERATIONS    'BOSTON'

 

3 How to create a UDAF

Extend AbstractGenericUDAFResolver, and implement an evaluator that extends GenericUDAFEvaluator. GenericUDAFEvaluator runs different methods depending on the stage of the job; Hive uses GenericUDAFEvaluator.Mode to determine the current stage.

So what are the stages?

PARTIAL1: from raw data to partial aggregation; calls iterate and terminatePartial (map input to map output)

PARTIAL2: from partial aggregation to partial aggregation; calls merge and terminatePartial (map output to reduce input)

FINAL: from partial aggregation to full aggregation; calls merge and terminate (reduce input to reduce output)

COMPLETE: from raw data to full aggregation; calls iterate and terminate (map-only, no reduce phase)

A few caveats:

If the volume of data being aggregated is large, keep an eye on memory; out-of-memory errors are easy to hit.

Reuse objects wherever possible and avoid allocating new ones, to keep JVM garbage-collection pressure low.
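The stage flow above can be simulated without Hive. The following is a minimal plain-Java sketch (UdafLifecycleSketch, SumBuffer, mapSide, and reduceSide are hypothetical names for illustration, not Hive API) that pushes a sum through a PARTIAL1-style map side and a FINAL-style reduce side:

```java
public class UdafLifecycleSketch {
    // Stand-in for an AggregationBuffer: holds the running sum.
    static class SumBuffer {
        boolean empty = true;
        long sum = 0;
    }

    // PARTIAL1: iterate over raw rows, then emit a partial result.
    static long mapSide(long[] rows) {
        SumBuffer buf = new SumBuffer();
        for (long row : rows) {          // iterate()
            buf.empty = false;
            buf.sum += row;
        }
        return buf.sum;                  // terminatePartial()
    }

    // FINAL: merge partials from all mappers, then emit the final result.
    static long reduceSide(long[] partials) {
        SumBuffer buf = new SumBuffer();
        for (long p : partials) {        // merge()
            buf.empty = false;
            buf.sum += p;
        }
        return buf.sum;                  // terminate()
    }

    public static void main(String[] args) {
        long p1 = mapSide(new long[] {1, 2, 3});   // one mapper's partial: 6
        long p2 = mapSide(new long[] {4, 5});      // another mapper's partial: 9
        System.out.println(reduceSide(new long[] {p1, p2}));  // 15
    }
}
```

The COMPLETE stage simply collapses both halves into one call chain (iterate then terminate), which is why a map-only job needs no merge step.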

 

public class UDAFAdd extends AbstractGenericUDAFResolver {

       static final Log LOG = LogFactory.getLog(UDAFAdd.class.getName());

       @Override
       public GenericUDAFEvaluator getEvaluator(TypeInfo[] arguments) throws SemanticException {
              // check argument count
              if (arguments.length != 1) {
                     throw new UDFArgumentException("Exactly one argument is expected.");
              }
              // check that the argument's data type is primitive
              if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
                     throw new UDFArgumentException("Argument is not expected.");
              }

              switch (((PrimitiveTypeInfo) arguments[0]).getPrimitiveCategory()) {
                     case BYTE:
                     case SHORT:
                     case INT:
                     case LONG:
                            return new UDAFAddLong();
                     case FLOAT:
                     case DOUBLE:
                            return new UDAFAddDouble();
                     default:
                            throw new UDFArgumentException("Only numeric type arguments are accepted but "
                                          + arguments[0].getTypeName() + " is passed.");
              }
       }

      

       public static class UDAFAddDouble extends GenericUDAFEvaluator {
              private PrimitiveObjectInspector inputOI;
              private DoubleWritable result;

              // invoked at the start of each stage
              @Override
              public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
                     super.init(mode, arguments);
                     // initialize the double result
                     result = new DoubleWritable(0);
                     inputOI = (PrimitiveObjectInspector) arguments[0];
                     return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
              }

              /**
               * Stores the running aggregation result during aggregation.
               * @author nickyzhang
               */
              static class AddDoubleAgg extends AbstractAggregationBuffer {
                     boolean empty;
                     double sum;

                     @Override
                     public int estimate() {
                            return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
                     }
              }

              /** Get a new aggregation object. */
              @Override
              public AggregationBuffer getNewAggregationBuffer() throws HiveException {
                     AddDoubleAgg addDoubleAgg = new AddDoubleAgg();
                     reset(addDoubleAgg);
                     return addDoubleAgg;
              }

              /** Reset the aggregation; useful when reusing the same buffer. */
              @Override
              public void reset(AggregationBuffer agg) throws HiveException {
                     AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
                     addDoubleAgg.empty = true;
                     addDoubleAgg.sum = 0;
              }

              /** Iterate over the original data. */
              @Override
              public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
                     if (arguments.length != 1) {
                            throw new UDFArgumentException("Just one argument expected!");
                     }
                     merge(agg, arguments[0]);
              }

              /** Get the partial aggregation result. */
              @Override
              public Object terminatePartial(AggregationBuffer agg) throws HiveException {
                     return terminate(agg);
              }

              /** Merge a partial result (combiner or reducer side). */
              @Override
              public void merge(AggregationBuffer agg, Object partial) throws HiveException {
                     if (partial == null) {
                            return;
                     }
                     AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
                     addDoubleAgg.empty = false;
                     addDoubleAgg.sum += PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
              }

              /** Get the final aggregation result. */
              @Override
              public Object terminate(AggregationBuffer agg) throws HiveException {
                     AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
                     if (addDoubleAgg.empty) {
                            return null;
                     }
                     result.set(addDoubleAgg.sum);
                     return result;
              }
       }

      

       public static class UDAFAddLong extends GenericUDAFEvaluator {
              private PrimitiveObjectInspector inputOI;
              private LongWritable result;

              // invoked at the start of each stage
              @Override
              public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
                     super.init(mode, arguments);
                     // initialize the long result
                     result = new LongWritable(0);
                     inputOI = (PrimitiveObjectInspector) arguments[0];
                     return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
              }

              /**
               * Stores the running aggregation result during aggregation.
               * @author nickyzhang
               */
              static class AddLongAgg extends AbstractAggregationBuffer {
                     boolean empty;
                     long sum;

                     @Override
                     public int estimate() {
                            return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
                     }
              }

              /** Get a new aggregation object. */
              @Override
              public AggregationBuffer getNewAggregationBuffer() throws HiveException {
                     AddLongAgg addLongAgg = new AddLongAgg();
                     reset(addLongAgg);
                     return addLongAgg;
              }

              /** Reset the aggregation; useful when reusing the same buffer. */
              @Override
              public void reset(AggregationBuffer agg) throws HiveException {
                     AddLongAgg addLongAgg = (AddLongAgg) agg;
                     addLongAgg.empty = true;
                     addLongAgg.sum = 0;
              }

              /** Iterate over the original data. */
              @Override
              public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
                     if (arguments.length != 1) {
                            throw new UDFArgumentException("Just one argument expected!");
                     }
                     merge(agg, arguments[0]);
              }

              /** Get the partial aggregation result. */
              @Override
              public Object terminatePartial(AggregationBuffer agg) throws HiveException {
                     return terminate(agg);
              }

              /** Merge a partial result (combiner or reducer side). */
              @Override
              public void merge(AggregationBuffer agg, Object partial) throws HiveException {
                     if (partial == null) {
                            return;
                     }
                     AddLongAgg addLongAgg = (AddLongAgg) agg;
                     addLongAgg.empty = false;
                     addLongAgg.sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
              }

              /** Get the final aggregation result. */
              @Override
              public Object terminate(AggregationBuffer agg) throws HiveException {
                     AddLongAgg addLongAgg = (AddLongAgg) agg;
                     if (addLongAgg.empty) {
                            return null;
                     }
                     result.set(addLongAgg.sum);
                     return result;
              }
       }

}

 

4 How to create a UDTF

UDTFs are typically used for parsing work, such as parsing a URL and extracting its components. Extend GenericUDTF.

public class UDTFEmail extends GenericUDTF {

       @Override
       public StructObjectInspector initialize(StructObjectInspector inspector) throws UDFArgumentException {
              if (inspector == null) {
                     throw new UDFArgumentException("arguments is null");
              }
              List<? extends StructField> args = inspector.getAllStructFieldRefs();
              if (CollectionUtils.isEmpty(args) || args.size() != 1) {
                     throw new UDFArgumentException("This UDTF takes exactly one argument");
              }
              // Output rows have two string columns: name and email.
              List<String> fields = new ArrayList<String>();
              fields.add("name");
              fields.add("email");
              List<ObjectInspector> fieldOIList = new ArrayList<ObjectInspector>();
              fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
              fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
              return ObjectInspectorFactory.getStandardStructObjectInspector(fields, fieldOIList);
       }

       @Override
       public void process(Object[] args) throws HiveException {
              if (ArrayUtils.isEmpty(args) || args.length != 1) {
                     return;
              }
              String name = args[0].toString();
              String email = name + "@163.com";
              super.forward(new String[] {name, email});
       }

       @Override
       public void close() throws HiveException {
              super.forward(new String[] {"complete", "finish"});
       }
}
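The one-row-in, many-rows-out behavior of process() and forward() can be sketched in plain Java. In the sketch below, the Emitter interface and UdtfForwardSketch class are hypothetical stand-ins for illustration; in Hive, GenericUDTF.forward() plays the Emitter role:

```java
import java.util.ArrayList;
import java.util.List;

public class UdtfForwardSketch {
    // Stand-in for GenericUDTF.forward(): receives each emitted row.
    interface Emitter {
        void forward(String[] row);
    }

    // Mirrors process(): one input name emits one (name, email) row;
    // a real UDTF may call forward() any number of times per input row.
    static void process(String name, Emitter out) {
        out.forward(new String[] {name, name + "@163.com"});
    }

    // Runs process() over several inputs and collects every forwarded row.
    static List<String[]> run(String... names) {
        List<String[]> rows = new ArrayList<>();
        for (String n : names) {
            process(n, rows::add);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] r : run("alice", "bob")) {
            System.out.println(r[0] + "\t" + r[1]);
        }
    }
}
```

Because each call to forward() produces an independent output row, two input names yield two output rows here, which is exactly the table-generating behavior summarized in the next section.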

 

5 Differences between UDF, UDAF, and UDTF

UDF: one row in, one row out.

UDAF: many rows in, one row out; used for aggregation.

UDTF: one row in, many rows out.

 

 
