Hive UDF教程(一)

Hive UDF教程(一)

Hive UDF教程(二)

Hive UDF教程(三)

1.Hive UDF简介

在Hive中,用户可以自定义一些函数,用于扩展HiveQL的功能,而这类函数叫做UDF(用户自定义函数)。UDF分为两大类:UDAF(用户自定义聚合函数)和UDTF(用户自定义表生成函数)。在介绍UDAF和UDTF实现之前,我们先在本章介绍简单点的UDF实现——UDF和GenericUDF,然后以此为基础在下一章介绍UDAF和UDTF的实现。

Hive有两个不同的接口编写UDF程序。一个是基础的UDF接口,一个是复杂的GenericUDF接口。

  1. org.apache.hadoop.hive.ql. exec.UDF 基础UDF的函数读取和返回基本类型,即Hadoop和Hive的基本类型。如,Text、IntWritable、LongWritable、DoubleWritable等。
  2. org.apache.hadoop.hive.ql.udf.generic.GenericUDF 复杂的GenericUDF可以处理Map、List、Set类型。
@Describtion注解是可选的,用于对函数进行说明,其中的_FUNC_字符串表示函数名,当使用DESCRIBE FUNCTION命令时,替换成函数名。@Describtion包含三个属性:
  • name:用于指定Hive中的函数名。
  • value:用于描述函数的参数。
  • extended:额外的说明,如,给出示例。当使用DESCRIBE FUNCTION EXTENDED name的时候打印。
而且,Hive要使用UDF,需要把Java文件编译、打包成jar文件,然后将jar文件加入到CLASSPATH中,最后使用CREATE FUNCTION语句定义这个Java类的函数:
  1. hive> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
  2. hive> CREATE TEMPORARY FUNCTION hello AS "edu.wzm.hive. HelloUDF";
  3. hive> DROP TEMPORARY FUNCTION IF EXIST hello;

2.UDF

本章采用的数据如下:

hive (mydb)> SELECT * FROM employee;          
OK
John Doe	100000.0	["Mary Smith","Todd Jones"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}	US	CA
Mary Smith	80000.0	["Bill King"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}	US	CA
Todd Jones	70000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}	US	CA
Bill King	60000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}	US	CA
Boss Man	200000.0	["John Doe","Fred Finance"]	{"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}	{"street":"1 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}	US	CA
Fred Finance	150000.0	["Stacy Accountant"]	{"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}	{"street":"2 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}	US	CA
Stacy Accountant	60000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"300 Main St.","city":"Naperville","state":"IL","zip":60563}	US	CA
Time taken: 0.093 seconds, Fetched: 7 row(s)
hive (mydb)> DESCRIBE employee;          
OK
name                	string              	                    
salary              	float               	                    
subordinates        	array<string>       	                    
deductions          	map<string,float>   	                    
address             	struct<street:string,city:string,state:string,zip:int>	                    


简单UDF的实现很简单,只需要继承UDF,然后实现evaluate()方法就行了。

@Description(
	name = "hello",
	value = "_FUNC_(str) - from the input string"
		+ "returns the value that is \"Hello $str\" ",
	extended = "Example:\n"
		+ " > SELECT _FUNC_(str) FROM src;"
)
public class HelloUDF extends UDF{
	
	public String evaluate(String str){
		try {
			return "Hello " + str;
		} catch (Exception e) {
			// TODO: handle exception
			e.printStackTrace();
			return "ERROR";
		}
	}
}

把jar文件添加后,创建函数hello,然后执行结果如下:

hive (mydb)> SELECT hello(name) FROM employee;
OK
Hello John Doe
Hello Mary Smith
Hello Todd Jones
Hello Bill King
Hello Boss Man
Hello Fred Finance
Hello Stacy Accountant
Time taken: 0.198 seconds, Fetched: 7 row(s)

3.GenericUDF

GenericUDF实现比较复杂,需要先继承GenericUDF。这个API需要操作Object Inspectors,并且要对接收的参数类型和数量进行检查。GenericUDF需要实现以下三个方法:

//这个方法只调用一次,并且在evaluate()方法之前调用。该方法接受的参数是一个ObjectInspectors数组。该方法检查接受正确的参数类型和参数个数。
abstract ObjectInspector initialize(ObjectInspector[] arguments);

//这个方法类似UDF的evaluate()方法。它处理真实的参数,并返回最终结果。
abstract Object evaluate(GenericUDF.DeferredObject[] arguments);

//这个方法用于当实现的GenericUDF出错的时候,打印出提示信息。而提示信息就是你实现该方法最后返回的字符串。
abstract String getDisplayString(String[] children);


下面是实现GenericUDF,判断一个数组或列表中是否包含某个元素的例子:

class ComplexUDFExample extends GenericUDF {

  ListObjectInspector listOI;
  StringObjectInspector elementsOI;
  StringObjectInspector argOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementsOI = (StringObjectInspector) this.listOI.getListElementObjectInspector();
    this.argOI = (StringObjectInspector) b;
    
    // 2. Check that the list contains strings
    if(!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }
    
    // the return type of our function is a boolean, so we provide the correct object inspector
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }
  
  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    
    // get the list and string from the deferred objects using the object inspectors
//    List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
    int elemNum = this.listOI.getListLength(arguments[0].get());
//    LazyListObjectInspector llst = (LazyListObjectInspector) arguments[0].get();
//    List<String> lst = llst.
    
    LazyString larg = (LazyString) arguments[1].get();
    String arg = argOI.getPrimitiveJavaObject(larg);
    
//    System.out.println("Length: =======================================================>>>" + elemNum);
//    System.out.println("arg: =======================================================>>>" + arg);
    // see if our list contains the value we need
    for(int i = 0; i < elemNum; i++) {
    	LazyString lelement = (LazyString) this.listOI.getListElement(arguments[0].get(), i);
    	String element = elementsOI.getPrimitiveJavaObject(lelement);
    	if(arg.equals(element)){
    		return new Boolean(true);
    	}
    }
    return new Boolean(false);
  }
  
}


注意:在Hive-1.0.1估计之后的版本也是,evaluate()方法中从Object Inspectors取出的值,需要先保存为Lazy包中的数据类型(org.apache.hadoop.hive.serde2.lazy),然后才能转换成Java的数据类型进行处理。否则会报错,解决方案可以参考Hive报错集锦中的第5个。

把jar文件添加后,创建函数contains,然后执行结果如下:

hive (mydb)> select contains(subordinates, subordinates[0]), subordinates from employee;
OK
true	["Mary Smith","Todd Jones"]
true	["Bill King"]
false	[]
false	[]
true	["John Doe","Fred Finance"]
true	["Stacy Accountant"]
false	[]
Time taken: 0.169 seconds, Fetched: 7 row(s)

现在我们在回头看看GenericUDF的模型:

  1. 这个UDF使用默认的构造方法初始化。
  2. initialize()和一个Object Inspectors数组(ListObjectInspector、StringObjectInspector)参数一起被调用。

    • 先检查参数个数(2个),和这些参数的类型;
    • 为evaluate()方法保存Object Inspectors(listOI、argOI、elementsOI)
    • 返回一个ObjectInspector(BooleanObjectInspector),且是Hive可以读取的方法结果。 

  1. 对于查询的每一行都调用evaluate()(如,contains(subordinates, subordinates[0]))

    • 取出存储在Object Inspectors中的值;
    • 处理完initialize()方法返回的Object Inspectors之后,返回一个值(如,list.contains(elemement) ? true : false)。

源代码托管在GitHub上:https://github.com/GatsbyNewton/hive_udf



你可能感兴趣的:(hive,udf,GenericUDF)