Hive UDF Tutorial (1)
Hive UDF Tutorial (2)
Hive UDF Tutorial (3)
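In Hive, users can define their own functions to extend HiveQL; such functions are called UDFs (user-defined functions). Besides ordinary UDFs there are two further categories: UDAFs (user-defined aggregate functions) and UDTFs (user-defined table-generating functions). Before covering how to implement UDAFs and UDTFs, this chapter introduces the simpler UDF implementations, UDF and GenericUDF, and the next chapter builds on them to cover UDAFs and UDTFs.
Hive offers two different APIs for writing a UDF: the basic UDF class and the more complex GenericUDF class.
The data used in this chapter is as follows: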
hive (mydb)> SELECT * FROM employee;
OK
John Doe    100000.0    ["Mary Smith","Todd Jones"]    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}    {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}    US    CA
Mary Smith    80000.0    ["Bill King"]    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}    {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}    US    CA
Todd Jones    70000.0    []    {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}    {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}    US    CA
Bill King    60000.0    []    {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}    {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}    US    CA
Boss Man    200000.0    ["John Doe","Fred Finance"]    {"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}    {"street":"1 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}    US    CA
Fred Finance    150000.0    ["Stacy Accountant"]    {"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}    {"street":"2 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}    US    CA
Stacy Accountant    60000.0    []    {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}    {"street":"300 Main St.","city":"Naperville","state":"IL","zip":60563}    US    CA
Time taken: 0.093 seconds, Fetched: 7 row(s)

hive (mydb)> DESCRIBE employee;
OK
name            string
salary          float
subordinates    array<string>
deductions      map<string,float>
address         struct<street:string,city:string,state:string,zip:int>
Implementing a simple UDF is straightforward: extend the UDF class and implement an evaluate() method.
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(
    name = "hello",
    value = "_FUNC_(str) - from the input string "
        + "returns the value that is \"Hello $str\" ",
    extended = "Example:\n"
        + " > SELECT _FUNC_(str) FROM src;"
)
public class HelloUDF extends UDF {

    // Hive calls evaluate() once per row with the column value.
    // String concatenation cannot throw, so no try/catch is needed here.
    public String evaluate(String str) {
        return "Hello " + str;
    }
}
hive (mydb)> SELECT hello(name) FROM employee;
OK
Hello John Doe
Hello Mary Smith
Hello Todd Jones
Hello Bill King
Hello Boss Man
Hello Fred Finance
Hello Stacy Accountant
Time taken: 0.198 seconds, Fetched: 7 row(s)
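The UDF base class declares no abstract evaluate() method; Hive locates evaluate() by reflection, which is also why a simple UDF may offer several overloads. The sketch below imitates that lookup with plain JDK reflection so it runs without Hive on the classpath. HelloLike, EvaluateDispatchSketch and call() are hypothetical names made up for this illustration, and the matching rule is deliberately simplified; it is not Hive's actual resolver.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a simple UDF; no Hive classes are involved.
class HelloLike {
    public String evaluate(String s) { return "Hello " + s; }
    public String evaluate(Integer n) { return "Hello #" + n; }
}

class EvaluateDispatchSketch {
    // Looks for a public one-argument evaluate() whose parameter type
    // accepts the given argument, then invokes it reflectively.
    static Object call(Object udf, Object arg) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate")
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0].isInstance(arg)) {
                try {
                    return m.invoke(udf, arg);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no matching evaluate() for argument " + arg);
    }

    public static void main(String[] args) {
        HelloLike udf = new HelloLike();
        System.out.println(call(udf, "John Doe")); // picks evaluate(String)
        System.out.println(call(udf, 42));         // picks evaluate(Integer)
    }
}
```

Because the method is chosen from the runtime argument type, adding another overload to the UDF class is enough to support another column type.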
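A GenericUDF is more involved. You extend the GenericUDF class, work with Object Inspectors, and check the types and number of the arguments yourself. A GenericUDF must implement the following three methods: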
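// Called once, before any call to evaluate(). It receives an array of
// ObjectInspectors and checks that the argument types and the argument count are correct.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;

// Similar to evaluate() in a simple UDF. It processes the actual arguments
// and returns the final result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;

// Returns the string Hive displays for this function, for example in
// EXPLAIN output and diagnostic messages. The text shown is whatever
// string your implementation returns.
abstract String getDisplayString(String[] children);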
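To make the calling order concrete, here is a miniature model of that contract in plain Java: initialize() runs once and validates the arguments, then evaluate() runs once per row. MiniGenericUdf, MiniConcat and LifecycleSketch are hypothetical names invented for this sketch, and Class<?> objects stand in for ObjectInspectors; none of this is Hive's API, it only mirrors the shape of the three methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the GenericUDF contract (NOT Hive's classes).
abstract class MiniGenericUdf {
    abstract Class<?> initialize(Class<?>[] argTypes); // returns the result type
    abstract Object evaluate(Object[] args);
    abstract String getDisplayString(String[] children);
}

class MiniConcat extends MiniGenericUdf {
    int initCalls = 0; // lets us observe that initialize() runs only once

    @Override
    Class<?> initialize(Class<?>[] argTypes) {
        initCalls++;
        if (argTypes.length != 2) {
            throw new IllegalArgumentException("concat2 takes exactly 2 arguments");
        }
        for (Class<?> t : argTypes) {
            if (t != String.class) {
                throw new IllegalArgumentException("concat2 only accepts strings");
            }
        }
        return String.class;
    }

    @Override
    Object evaluate(Object[] args) {
        // SQL-style behaviour: a NULL input gives a NULL result.
        if (args[0] == null || args[1] == null) {
            return null;
        }
        return (String) args[0] + (String) args[1];
    }

    @Override
    String getDisplayString(String[] children) {
        return "concat2(" + String.join(", ", children) + ")";
    }
}

class LifecycleSketch {
    // Mimics the engine: validate once, then evaluate every row.
    static List<Object> run(MiniGenericUdf udf, Class<?>[] types, List<Object[]> rows) {
        udf.initialize(types);
        List<Object> out = new ArrayList<>();
        for (Object[] row : rows) {
            out.add(udf.evaluate(row));
        }
        return out;
    }
}
```

Doing the type checks in initialize() keeps evaluate() free of per-row validation, which matters because evaluate() is the hot path.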
The following GenericUDF example checks whether an array or list contains a given element:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ComplexUDFExample extends GenericUDF {

    private ListObjectInspector listOI;
    private StringObjectInspector elementsOI;
    private StringObjectInspector argOI;

    @Override
    public String getDisplayString(String[] children) {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.argOI = (StringObjectInspector) b;
        // 2. Check that the list contains strings BEFORE casting the element
        //    inspector; casting first would fail with a ClassCastException
        //    instead of the readable error below.
        ObjectInspector elementOI = this.listOI.getListElementObjectInspector();
        if (!(elementOI instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        this.elementsOI = (StringObjectInspector) elementOI;
        // The return type of our function is a boolean, so we provide the matching object inspector.
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Get the list and the string from the deferred objects.
        Object list = arguments[0].get();
        int elemNum = this.listOI.getListLength(list);
        // The raw value may be a lazy type (e.g. LazyString), so it is never
        // cast directly; the Object Inspector converts it to a Java String.
        String arg = this.argOI.getPrimitiveJavaObject(arguments[1].get());
        if (arg == null) {
            return Boolean.FALSE; // a NULL search value matches nothing
        }
        // See if our list contains the value we need.
        for (int i = 0; i < elemNum; i++) {
            Object rawElement = this.listOI.getListElement(list, i);
            String element = this.elementsOI.getPrimitiveJavaObject(rawElement);
            if (arg.equals(element)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
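Note: in Hive 1.0.1 (and probably in later versions as well), the values obtained in evaluate() are not plain Java objects. They are typically lazy types from the org.apache.hadoop.hive.serde2.lazy package (for example LazyString), so handle them as such and convert them to Java types through the Object Inspector before processing them. Casting them directly to a Java type raises an error; for the fix, see item 5 in the "Hive error collection" post.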
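The reason for this rule is easier to see in a toy model. Below, LazyTextLike is a hypothetical stand-in for a lazily decoded value (it keeps raw bytes and decodes only on request) and TextInspectorLike is a hypothetical stand-in for a string Object Inspector. Neither is a Hive class; the sketch only assumes that a lazy value wraps undecoded bytes, which is why a direct cast to String cannot work while conversion through the inspector does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a lazily decoded string (not Hive's LazyString).
final class LazyTextLike {
    private final byte[] raw;

    LazyTextLike(String s) {
        this.raw = s.getBytes(StandardCharsets.UTF_8);
    }

    String decode() {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for a string Object Inspector.
final class TextInspectorLike {
    // Accepts any supported representation and returns a Java String.
    String getPrimitiveJavaObject(Object o) {
        if (o == null) {
            return null;
        }
        if (o instanceof LazyTextLike) {
            return ((LazyTextLike) o).decode();
        }
        if (o instanceof String) {
            return (String) o;
        }
        throw new IllegalArgumentException("unsupported representation: " + o.getClass().getName());
    }
}

class LazyConversionSketch {
    // Returns true when casting the raw value straight to String fails.
    static boolean directCastFails(Object o) {
        try {
            String s = (String) o;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object raw = new LazyTextLike("Mary Smith");
        System.out.println("direct cast fails: " + directCastFails(raw));
        System.out.println("via inspector: " + new TextInspectorLike().getPrimitiveJavaObject(raw));
    }
}
```

The inspector hides which representation a value arrives in, so code that always goes through it keeps working whether the value is lazy or already a Java object.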
After adding the jar file and creating the function contains, running it gives the following result:

hive (mydb)> select contains(subordinates, subordinates[0]), subordinates from employee;
OK
true    ["Mary Smith","Todd Jones"]
true    ["Bill King"]
false    []
false    []
true    ["John Doe","Fred Finance"]
true    ["Stacy Accountant"]
false    []
Time taken: 0.169 seconds, Fetched: 7 row(s)
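The false rows come from the empty arrays: subordinates[0] is NULL when the array has no elements, and a NULL search value matches nothing. The same logic can be checked outside Hive with a plain-Java sketch; ContainsCheck, contains() and firstOrNull() are hypothetical helpers written for this illustration, with firstOrNull() imitating Hive's behaviour of returning NULL for an out-of-range index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ContainsCheck {
    // Same comparison the UDF performs, on ordinary Java collections.
    static boolean contains(List<String> list, String arg) {
        if (list == null || arg == null) {
            return false;
        }
        for (String element : list) {
            if (arg.equals(element)) {
                return true;
            }
        }
        return false;
    }

    // Imitates subordinates[0]: NULL when the array is empty.
    static String firstOrNull(List<String> list) {
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    public static void main(String[] args) {
        // The subordinates column of the seven employee rows, in table order.
        List<List<String>> subordinates = Arrays.asList(
            Arrays.asList("Mary Smith", "Todd Jones"),
            Arrays.asList("Bill King"),
            Collections.<String>emptyList(),
            Collections.<String>emptyList(),
            Arrays.asList("John Doe", "Fred Finance"),
            Arrays.asList("Stacy Accountant"),
            Collections.<String>emptyList());
        for (List<String> s : subordinates) {
            System.out.println(contains(s, firstOrNull(s)) + "\t" + s);
        }
    }
}
```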
Now let's look back at the GenericUDF model:
The source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf