UDF is short for User-Defined Function. Beyond the functions Hive provides natively, we can define our own whenever the built-ins do not cover a requirement.
Besides plain UDFs, we can also define aggregate functions (UDAF) and table-generating functions (UDTF).
If the function takes and returns simple data types, extending UDF and implementing an evaluate method is enough; for more complex types, extend GenericUDF and implement the initialize, getDisplayString, and evaluate methods.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.io.Text;

public class UDFStripDoubleQuotes extends UDF {

    private static final String DOUBLE_QUOTES = "\"";
    private static final String BLANK_SYMBOL = "";

    public Text evaluate(Text text) throws UDFArgumentException {
        // compare against the String form; comparing a String to a Text is always false
        if (null == text || BLANK_SYMBOL.equals(text.toString())) {
            throw new UDFArgumentException("The function STRIP_DOUBLE_QUOTES(s) takes exactly 1 argument.");
        }
        String temp = text.toString().trim();
        if (temp.startsWith(DOUBLE_QUOTES) || temp.endsWith(DOUBLE_QUOTES)) {
            temp = temp.replace(DOUBLE_QUOTES, BLANK_SYMBOL);
        }
        return new Text(temp);
    }
}
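The GenericUDF path mentioned above looks roughly like this. The class below is our own minimal sketch (a trim function), not part of the original example:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Minimal GenericUDF sketch: trims whitespace from a single string argument.
public class GenericUDFTrimSketch extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("trim_sketch(s) takes exactly 1 argument.");
        }
        // declare that evaluate() returns a Java String
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object value = arguments[0].get();
        return value == null ? null : value.toString().trim();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "trim_sketch(" + children[0] + ")";
    }
}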
hive(hadoop)> add jar /opt/data/UDFStripDoubleQuotes.jar;
Added /opt/data/UDFStripDoubleQuotes.jar to class path
Added resource: /opt/data/UDFStripDoubleQuotes.jar
Syntax:
CREATE TEMPORARY FUNCTION strip_double_quotes AS 'com.hive.udf.UDFStripDoubleQuotes';
CREATE FUNCTION [db_name.]function_name AS class_name
[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
In plain terms: put the jar on HDFS, specify which database the function belongs to, and follow with a URI pointing at the HDFS directory where the jar lives.
For example:
CREATE FUNCTION hadoop09.lu_str AS 'com.hive.udf.LowerAndUpperUDF'
USING JAR 'hdfs:/var/hive/udf/lowerOrUpper.jar';
Prepare some test data in /opt/data/quotes.txt:
"10" "ACCOUNTING" "NEW YORK"
"20" "RESEARCH" "DALLAS"
"30" "SALES" 'CHICAGO'
"40" "OPERATIONS" 'BOSTON'
On the Hive side:
CREATE TABLE t_dept LIKE dept;
LOAD DATA LOCAL INPATH '/opt/data/quotes.txt' INTO TABLE t_dept;
SELECT strip_double_quotes(dname) name, strip_double_quotes(loc) loc FROM t_dept;
Result (only double quotes are stripped, so the single-quoted values pass through unchanged):
name loc
ACCOUNTING NEW YORK
RESEARCH DALLAS
SALES 'CHICAGO'
OPERATIONS 'BOSTON'
To write a UDAF, extend AbstractGenericUDAFResolver and provide a GenericUDAFEvaluator. The GenericUDAFEvaluator executes different methods depending on the stage of the job, and Hive uses GenericUDAFEvaluator.Mode to determine which stage it is in.
So what are the stages?
PARTIAL1: from raw data to partial aggregate; calls iterate and terminatePartial --> map input to map output
PARTIAL2: from one partial aggregate to another partial aggregate; calls merge and terminatePartial --> map output to reduce input
FINAL: from partial aggregate to full aggregate; calls merge and terminate --> reduce input to reduce output
COMPLETE: from raw data directly to full aggregate; calls iterate and terminate; map-only, with no reduce stage (see the sketch after this list)
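The full example below keeps its init() simple, but a mode-aware evaluator typically branches right there. This abstract skeleton is our own illustration of where that check goes, not Hive source:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical skeleton: the point is only the Mode check inside init().
public abstract class ModeAwareEvaluator extends GenericUDAFEvaluator {

    protected PrimitiveObjectInspector inputOI;    // raw column (PARTIAL1 / COMPLETE)
    protected PrimitiveObjectInspector partialOI;  // partial aggregate (PARTIAL2 / FINAL)

    @Override
    public ObjectInspector init(Mode mode, ObjectInspector[] parameters) throws HiveException {
        super.init(mode, parameters);
        if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {
            // iterate() will be fed original rows
            inputOI = (PrimitiveObjectInspector) parameters[0];
        } else {
            // merge() will be fed whatever terminatePartial() emitted
            partialOI = (PrimitiveObjectInspector) parameters[0];
        }
        return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
    }
}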
A few points to watch:
If the volume of data being aggregated is large, keep an eye on memory; out-of-memory errors are easy to hit.
Reuse objects wherever possible and avoid new allocations, to ease JVM garbage collection; note how the evaluators below reuse a single Writable via result.set() instead of allocating one per row.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.ql.util.JavaDataModel;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class UDAFAdd extends AbstractGenericUDAFResolver {

    static final Log LOG = LogFactory.getLog(UDAFAdd.class.getName());

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] arguments) throws SemanticException {
        // check argument count
        if (arguments.length != 1) {
            throw new UDFArgumentException("Exactly one argument is expected.");
        }
        // the argument must be a primitive type
        if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("Only primitive type arguments are accepted.");
        }
        // pick an evaluator based on the concrete numeric type
        switch (((PrimitiveTypeInfo) arguments[0]).getPrimitiveCategory()) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
                return new UDAFAddLong();
            case FLOAT:
            case DOUBLE:
                return new UDAFAddDouble();
            default:
                throw new UDFArgumentException("Only numeric type arguments are accepted but "
                        + arguments[0].getTypeName() + " is passed.");
        }
    }

    public static class UDAFAddDouble extends GenericUDAFEvaluator {

        private PrimitiveObjectInspector inputOI;
        private DoubleWritable result;

        // init is invoked at the beginning of each stage
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // a single reusable holder for the double result
            result = new DoubleWritable(0);
            // keep the generic primitive inspector; getDouble() below converts as needed
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
        }

        /**
         * Stores the running total during aggregation.
         */
        static class AddDoubleAgg extends AbstractAggregationBuffer {
            boolean empty;
            double sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation buffer. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddDoubleAgg addDoubleAgg = new AddDoubleAgg();
            reset(addDoubleAgg);
            return addDoubleAgg;
        }

        /** Reset the aggregation; useful when the same buffer is reused. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = true;
            addDoubleAgg.sum = 0;
        }

        /** Iterate over the original rows. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            // fold the single raw value into the buffer, same as merging a partial sum
            this.merge(agg, arguments[0]);
        }

        /** Get the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** Combiner or reducer merging mapper output. */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = false;
            addDoubleAgg.sum += PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
        }

        /** Get the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            if (addDoubleAgg.empty) {
                return null;
            }
            // reuse the Writable instead of allocating a new one per group
            result.set(addDoubleAgg.sum);
            return result;
        }
    }

    public static class UDAFAddLong extends GenericUDAFEvaluator {

        private PrimitiveObjectInspector inputOI;
        private LongWritable result;

        // init is invoked at the beginning of each stage
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // a single reusable holder for the long result
            result = new LongWritable(0);
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        /**
         * Stores the running total during aggregation.
         */
        static class AddLongAgg extends AbstractAggregationBuffer {
            boolean empty;
            long sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation buffer. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddLongAgg addLongAgg = new AddLongAgg();
            reset(addLongAgg);
            return addLongAgg;
        }

        /** Reset the aggregation; useful when the same buffer is reused. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = true;
            addLongAgg.sum = 0;
        }

        /** Iterate over the original rows. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            this.merge(agg, arguments[0]);
        }

        /** Get the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** Combiner or reducer merging mapper output. */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = false;
            // use getLong here, not getDouble, since we accumulate a long
            addLongAgg.sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
        }

        /** Get the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            if (addLongAgg.empty) {
                return null;
            }
            result.set(addLongAgg.sum);
            return result;
        }
    }
}
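Packaging and registering the UDAF works exactly like the UDF above. The jar path, package name, and query below are only illustrative (they assume the class sits in com.hive.udf and that an emp table with deptno and sal columns exists):

hive(hadoop)> add jar /opt/data/UDAFAdd.jar;
hive(hadoop)> CREATE TEMPORARY FUNCTION add_all AS 'com.hive.udf.UDAFAdd';
hive(hadoop)> SELECT deptno, add_all(sal) FROM emp GROUP BY deptno;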
A UDTF is typically used for parsing work, such as parsing a URL and extracting its parts. It requires extending GenericUDTF.
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UDTFEmail extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector inspector) throws UDFArgumentException {
        if (inspector == null) {
            throw new UDFArgumentException("arguments is null");
        }
        List<? extends StructField> args = inspector.getAllStructFieldRefs();
        if (CollectionUtils.isEmpty(args) || args.size() != 1) {
            throw new UDFArgumentException("UDTF takes only one argument");
        }
        // declare the two output columns and their types
        List<String> fields = new ArrayList<String>();
        fields.add("name");
        fields.add("email");
        List<ObjectInspector> fieldIOList = new ArrayList<ObjectInspector>();
        fieldIOList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldIOList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fields, fieldIOList);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (ArrayUtils.isEmpty(args) || args.length != 1) {
            return;
        }
        String name = args[0].toString();
        String email = name + "@163.com";
        // each forward() emits one output row
        super.forward(new String[] { name, email });
    }

    @Override
    public void close() throws HiveException {
        // emit one last row when the operator closes
        super.forward(new String[] { "complete", "finish" });
    }
}
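Registration follows the same pattern; the package name is again an assumption. A UDTF can be called directly in the SELECT list, or joined back to its source rows with LATERAL VIEW, e.g. against the t_dept table from earlier:

hive(hadoop)> CREATE TEMPORARY FUNCTION gen_email AS 'com.hive.udf.UDTFEmail';
hive(hadoop)> SELECT gen_email(dname) AS (name, email) FROM t_dept;
hive(hadoop)> SELECT d.dname, e.email FROM t_dept d LATERAL VIEW gen_email(d.dname) e AS name, email;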
UDF: one row in, one row out
UDAF: many rows in, one row out; generally used for aggregation
UDTF: one row in, many rows out