/**
 * A Generic User-defined function (GenericUDF) for the use with Hive.
 *
 * New GenericUDF classes need to inherit from this GenericUDF class.
 *
 * The GenericUDF are superior to normal UDFs in the following ways: 1. It can
 * accept arguments of complex types, and return complex types. 2. It can accept
 * variable length of arguments. 3. It can accept an infinite number of function
 * signature - for example, it's easy to write a GenericUDF that accepts
 * array<int>, array<array<int>> and so on (arbitrary levels of nesting). 4. It
 * can do short-circuit evaluations using DeferedObject.
 */
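The differences come down to four points:
It can accept and return complex types (arrays, maps, structs, unions).
It can accept a variable number of arguments.
It can accept an unlimited number of function signatures, for example arrays nested to any depth.
It can short-circuit evaluation through DeferredObject.
Points 1 to 3 can be summed up as flexible handling of complex data types. Point 4 means that arguments arrive wrapped in a DeferredObject and are evaluated only when get() is called, so a function can skip the arguments it does not need.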
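To make point 4 concrete, here is a minimal sketch of short-circuit evaluation in plain Java. The `Deferred` interface and the `coalesce` and `counting` helpers are hypothetical stand-ins written for this illustration; they are not Hive's classes, and only mimic the idea that a value is computed when `get()` is called.

```java
class ShortCircuitSketch {
    /** Hypothetical stand-in for a deferred argument: nothing is computed until get() is called. */
    interface Deferred {
        Object get();
    }

    /** Counts how many arguments were actually evaluated. */
    static int evaluations = 0;

    /** Wraps a value so that every evaluation is counted. */
    static Deferred counting(Object value) {
        return () -> {
            evaluations++;
            return value;
        };
    }

    /** Returns the first non-null argument and never evaluates the ones after it. */
    static Object coalesce(Deferred... args) {
        for (Deferred arg : args) {
            Object value = arg.get();
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Object result = coalesce(counting(null), counting("b"), counting("c"));
        // The third argument is never evaluated.
        System.out.println(result + " after " + evaluations + " evaluations");
    }
}
```

Running `main` prints `b after 2 evaluations`: the third argument is wrapped but never computed.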
Next, two examples take the mystery out of UDF and GenericUDF.
First set up the environment. The JDK is version 1.8, and the Maven configuration is as follows:
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.3.3</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
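Writing a UDF only requires implementing an evaluate method. Note that this is not an Override: the UDF base class declares no evaluate method to override. That raises a question: why must the method be called evaluate, rather than cat or dog? Opening the source code, we find the following comment: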
 * Implement one or more methods named {@code evaluate} which will be called by Hive (the exact
 * way in which Hive resolves the method to call can be configured by setting a custom {@link
 * UDFMethodResolver}). The following are some examples:
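The comment shows that the method Hive calls is chosen by a UDFMethodResolver. The default implementation of the UDFMethodResolver interface is DefaultUDFMethodResolver, and its getEvalMethod looks the method up by the hard-coded name "evaluate". That is where the name comes from, and the question is answered.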
public Method getEvalMethod(List<TypeInfo> argClasses) throws UDFArgumentException {
    return FunctionRegistry.getMethodInternal(udfClass, "evaluate", false, argClasses);
}
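The lookup by name can be pictured with ordinary Java reflection. The sketch below is a simplification under stated assumptions: the `Upper` class and the `call` helper are hypothetical, and the match on parameter types is exact, whereas Hive's own resolution also considers type conversions.

```java
import java.lang.reflect.Method;

class EvaluateLookupSketch {
    /** Hypothetical UDF-like class: plain overloads named evaluate, nothing to override. */
    public static class Upper {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }

        public String evaluate(String s, int n) {
            return s == null ? null : s.substring(0, n).toUpperCase();
        }
    }

    /** Finds the public method named "evaluate" with exactly these parameter types and invokes it. */
    static Object call(Object udf, Object[] args, Class<?>... argTypes) {
        try {
            Method method = udf.getClass().getMethod("evaluate", argTypes);
            return method.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        // Picks the two-argument overload purely from the name and the argument types.
        System.out.println(call(new Upper(), new Object[] {"hive", 2}, String.class, int.class));
    }
}
```

Renaming `evaluate` to `cat` in `Upper` would make the lookup fail, which mirrors why the method name matters to the default resolver.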
At work we often need to extract a range of levels from a hierarchical path, which calls for a function like this:
eg: get_levels('A/B/C/D','/',1,3) -> 'A/B/C'
get_levels('A/B/C/D','/',2) -> 'B/C/D'
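package com.practice.hive.udf;

import java.util.regex.Pattern;

// Assumed imports: the original listing omitted them.
import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * eg: get_levels('A/B/C/D','/',1,3) -> 'A/B/C'
 *     get_levels('A/B/C/D','/',2)   -> 'B/C/D'
 */
public class UDFGetLevels extends UDF {

    /**
     * Returns levels start..end of source.
     *
     * @param source the source string
     * @param sep    the separator string
     * @param start  first level to keep, counting from 1
     * @param end    last level to keep (inclusive)
     * @return the selected levels joined by sep, or null if source or sep is null
     */
    public String evaluate(String source, String sep, int start, int end) {
        if (source == null || sep == null) {
            return null;
        }
        // String.split takes a regex, so quote sep to treat it as a literal (e.g. "." or "|").
        String[] arr = source.split(Pattern.quote(sep));
        return StringUtils.join(ArrayUtils.subarray(arr, start - 1, end), sep);
    }

    /**
     * Same as above, with the end position at the last level.
     */
    public String evaluate(String source, String sep, int start) {
        if (source == null || sep == null) {
            return null;
        }
        String[] arr = source.split(Pattern.quote(sep));
        return StringUtils.join(ArrayUtils.subarray(arr, start - 1, arr.length), sep);
    }

    /**
     * Same as the four-argument version, with "/" as the default separator.
     */
    public String evaluate(String source, int start, int end) {
        return evaluate(source, "/", start, end);
    }
}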
The result of running it is as follows:
hive> add jar /Users/liufeifei/hive/jar/hive.jar;
Added [/Users/liufeifei/hive/jar/hive.jar] to class path
Added resources: [/Users/liufeifei/hive/jar/hive.jar]
hive> create temporary function get_levels as 'com.practice.hive.udf.UDFGetLevels';
OK
Time taken: 0.002 seconds
hive> select get_levels('A/B/C/D','/',1,3);
OK
A/B/C
Time taken: 0.041 seconds, Fetched: 1 row(s)
hive> select get_levels('A/B/C/D','/',2);
OK
B/C/D
Time taken: 0.034 seconds, Fetched: 1 row(s)
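The slicing logic of get_levels can also be checked outside Hive. The following is a plain-Java sketch, not the UDF itself: the class name `GetLevelsSketch` and the index clamping are assumptions made for this illustration. It also shows why the separator should be treated as a literal, since `String.split` interprets its argument as a regular expression.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class GetLevelsSketch {
    /** Keeps levels start..end (1-based, inclusive) of source, split on the literal separator sep. */
    static String getLevels(String source, String sep, int start, int end) {
        if (source == null || sep == null) {
            return null;
        }
        String[] parts = source.split(Pattern.quote(sep));
        // Clamp the range so out-of-bounds positions yield fewer levels instead of an exception.
        int from = Math.max(start - 1, 0);
        int to = Math.min(end, parts.length);
        if (from >= to) {
            return "";
        }
        return String.join(sep, Arrays.copyOfRange(parts, from, to));
    }

    public static void main(String[] args) {
        System.out.println(getLevels("A/B/C/D", "/", 1, 3));                 // A/B/C
        System.out.println(getLevels("A/B/C/D", "/", 2, Integer.MAX_VALUE)); // B/C/D
        System.out.println(getLevels("a.b.c", ".", 1, 2));                   // a.b
        // Without quoting, "." is a regex matching every character, so nothing is left.
        System.out.println("a.b.c".split(".").length);                       // 0
    }
}
```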
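Writing a GenericUDF mainly means implementing three methods: initialize, evaluate, and getDisplayString.
initialize is called once, at initialization. It receives the ObjectInspectors of the arguments, checks the argument types and count, and returns the ObjectInspector of the return value. evaluate holds the actual logic and receives the arguments as DeferredObjects. getDisplayString produces the text shown by explain. The source comments read as follows: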
/**
 * Initialize this GenericUDF. This will be called once and only once per
 * GenericUDF instance.
 *
 * @param arguments
 *          The ObjectInspector for the arguments
 * @throws UDFArgumentException
 *           Thrown when arguments have wrong types, wrong length, etc.
 * @return The ObjectInspector for the return value
 */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
    throws UDFArgumentException;

/**
 * Evaluate the GenericUDF with the arguments.
 *
 * @param arguments
 *          The arguments as DeferedObject, use DeferedObject.get() to get the
 *          actual argument Object. The Objects can be inspected by the
 *          ObjectInspectors passed in the initialize call.
 * @return The
 */
public abstract Object evaluate(DeferredObject[] arguments)
    throws HiveException;

/**
 * Get the String to be displayed in explain.
 */
public abstract String getDisplayString(String[] children);
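The division of labour between the three methods can be sketched without Hive. Everything in the block below is hypothetical: `MiniGenericUdf` is a reduced imitation of the contract that uses type names as plain strings instead of ObjectInspectors, and `Concat` is an example written for this illustration.

```java
class LifecycleSketch {
    /** Hypothetical, reduced imitation of the three-method contract; not Hive's real API. */
    interface MiniGenericUdf {
        String initialize(String[] argTypes);       // validate once, return the result type
        Object evaluate(Object[] args);             // called for every row
        String getDisplayString(String[] children); // text for the query plan
    }

    static final class Concat implements MiniGenericUdf {
        public String initialize(String[] argTypes) {
            for (String type : argTypes) {
                if (!"string".equals(type)) {
                    throw new IllegalArgumentException("concat expects strings, got " + type);
                }
            }
            return "string";
        }

        public Object evaluate(Object[] args) {
            StringBuilder sb = new StringBuilder();
            for (Object arg : args) {
                sb.append(arg);
            }
            return sb.toString();
        }

        public String getDisplayString(String[] children) {
            return "concat(" + String.join(", ", children) + ")";
        }
    }

    public static void main(String[] args) {
        Concat udf = new Concat();
        udf.initialize(new String[] {"string", "string"}); // type check happens once
        Object[][] rows = {{"a", "b"}, {"c", "d"}};
        for (Object[] row : rows) {
            System.out.println(udf.evaluate(row));          // logic runs per row
        }
        System.out.println(udf.getDisplayString(new String[] {"col1", "col2"}));
    }
}
```

Doing the type checks in `initialize` keeps them out of the per-row path, which is the main reason the contract separates the two steps.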
ObjectInspector is the interesting part here, so let us look at its source comment:
/**
 * ObjectInspector helps us to look into the internal structure of a complex
 * object.
 *
 * A (probably configured) ObjectInspector instance stands for a specific type
 * and a specific way to store the data of that type in the memory.
 *
 * For native java Object, we can directly access the internal structure through
 * member fields and methods. ObjectInspector is a way to delegate that
 * functionality away from the Object, so that we have more control on the
 * behavior of those actions.
 *
 * An efficient implementation of ObjectInspector should rely on factory, so
 * that we can make sure the same ObjectInspector only has one instance. That
 * also makes sure hashCode() and equals() methods of java.lang.Object directly
 * works for ObjectInspector as well.
 */
public interface ObjectInspector extends Cloneable
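So an ObjectInspector describes a data type and the way values of that type are stored in memory, and it is the means of reading those values. Instances are obtained from factories, so each distinct inspector exists only once.
eg: select array_to_map(array('zhangsan','18','90')); -> {"score":"90","name":"zhangsan","age":"18"}
The input is an array of length 3 holding a user's name, age, and score; the function returns the corresponding map.
The code is as follows: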
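The idea of moving access logic out of the object and into a shared inspector can be sketched in a few lines of plain Java. The names here (`ListInspector`, `inspectorFor`, and the two implementations) are hypothetical and far simpler than Hive's interfaces; the sketch only illustrates "one type, several in-memory layouts, one cached inspector per layout".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InspectorSketch {
    /** Hypothetical, much-reduced analogue of a list inspector. */
    interface ListInspector {
        int size(Object data);
        Object element(Object data, int index);
    }

    /** Reads lists stored as java.util.List. */
    static final class JavaListInspector implements ListInspector {
        public int size(Object data) { return ((List<?>) data).size(); }
        public Object element(Object data, int index) { return ((List<?>) data).get(index); }
    }

    /** Reads lists stored as Object[]. */
    static final class ArrayInspector implements ListInspector {
        public int size(Object data) { return ((Object[]) data).length; }
        public Object element(Object data, int index) { return ((Object[]) data)[index]; }
    }

    private static final Map<String, ListInspector> CACHE = new ConcurrentHashMap<>();

    /** Factory: hands out one shared inspector per storage kind ("array" or anything else for List). */
    static ListInspector inspectorFor(String kind) {
        return CACHE.computeIfAbsent(kind,
                k -> k.equals("array") ? new ArrayInspector() : new JavaListInspector());
    }

    /** Caller code stays the same regardless of how the list is stored. */
    static String first(Object data, ListInspector inspector) {
        return inspector.size(data) == 0 ? null : String.valueOf(inspector.element(data, 0));
    }

    public static void main(String[] args) {
        System.out.println(first(Arrays.asList("zhangsan", "18"), inspectorFor("list")));
        System.out.println(first(new Object[] {"zhangsan", "18"}, inspectorFor("array")));
        System.out.println(inspectorFor("list") == inspectorFor("list")); // same cached instance
    }
}
```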
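package com.practice.hive.udf.generic;

import java.util.List;
import java.util.Map;

import com.google.common.collect.Maps;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

/**
 * eg: select array_to_map(array('zhangsan','18','90')); -> {"score":"90","name":"zhangsan","age":"18"}
 */
public class GenericUDFArrayToMap extends GenericUDF {

    private ListObjectInspector listInspector;
    private MapObjectInspector mapInspector;

    /**
     * Validates the input arguments and declares the return type.
     */
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentException("Must have one parameter");
        // Check that the argument is an array (only StandardListObjectInspector-backed arrays pass).
        } else if (!(arguments[0] instanceof StandardListObjectInspector)) {
            throw new UDFArgumentException("Must be array");
        }
        // ListObjectInspector for array<string>
        listInspector = ObjectInspectorFactory.getStandardListObjectInspector(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // MapObjectInspector for map<string,string>
        mapInspector = ObjectInspectorFactory.getStandardMapObjectInspector(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // The return type is map<string,string>
        return mapInspector;
    }

    /**
     * Builds the map from the three array elements.
     */
    @Override
    @SuppressWarnings("unchecked")
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Map<String, String> resMap = Maps.newHashMap();
        // Use listInspector to read the array out of the deferred argument
        List<String> arr = (List<String>) listInspector.getList(arguments[0].get());
        resMap.put("name", arr.get(0));
        resMap.put("age", arr.get(1));
        resMap.put("score", arr.get(2));
        return resMap;
    }

    /**
     * The string shown in explain output.
     */
    @Override
    public String getDisplayString(String[] children) {
        return "function ArrayToMap";
    }
}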
Package it and run it in Hive:
hive> add jar /Users/liufeifei/hive/jar/hive.jar;
Added [/Users/liufeifei/hive/jar/hive.jar] to class path
Added resources: [/Users/liufeifei/hive/jar/hive.jar]
hive> create temporary function array_to_map as 'com.practice.hive.udf.generic.GenericUDFArrayToMap';
OK
Time taken: 0.002 seconds
hive> select array_to_map(array('zhangsan','18','90'));
OK
{"score":"90","name":"zhangsan","age":"18"}
Time taken: 0.039 seconds, Fetched: 1 row(s)
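The conversion itself is independent of Hive and can be sketched in plain Java. This is an illustration under assumptions, not the UDF's code: the class name `ArrayToMapSketch`, the explicit length check, and the use of `LinkedHashMap` (to keep the keys in insertion order) are choices made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ArrayToMapSketch {
    /** Field names, in the order the values are expected in the input list. */
    private static final String[] KEYS = {"name", "age", "score"};

    /** Pairs each positional value with its field name; rejects lists of the wrong length. */
    static Map<String, String> toMap(List<String> values) {
        if (values == null) {
            return null;
        }
        if (values.size() != KEYS.length) {
            throw new IllegalArgumentException(
                    "expected " + KEYS.length + " elements, got " + values.size());
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < KEYS.length; i++) {
            result.put(KEYS[i], values.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toMap(Arrays.asList("zhangsan", "18", "90")));
    }
}
```

With the sample input, `main` prints `{name=zhangsan, age=18, score=90}`. A length check of this kind gives a clearer error than the index-out-of-bounds failure a short array would otherwise cause.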
The output of getDisplayString is shown below. The line expressions: function ArrayToMap is exactly the string we returned from that method.
hive> explain select array_to_map(array('zhangsan','18','90'));
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: _dummy_table
          Row Limit Per Split: 1
          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: function ArrayToMap (type: map<string,string>)
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 772 Basic stats: COMPLETE Column stats: COMPLETE
            ListSink

Time taken: 0.044 seconds, Fetched: 18 row(s)
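Plenty of UDF tutorials can be found online, but many of them are neither comprehensive nor detailed, so this article analyzes UDFs from personal study and hands-on experience. It quotes many English source comments verbatim, partly to show that much of this knowledge is already documented, and partly because a direct translation would distort the meaning.
The above was written from reading the documentation and may contain misunderstandings. To learn more about the underlying principles, consult the official documentation and the source code.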