Hive UDF Tutorial (Part 1)
Hive UDF Tutorial (Part 2)
Hive UDF Tutorial (Part 3)
The previous post covered the basic UDF interfaces, UDF and GenericUDF. This post turns to the more complex user-defined table-generating function (UDTF). A UDTF accepts zero or more inputs and produces multiple columns or multiple rows of output, as explode() does. To implement a UDTF, extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implement three methods:
// Specifies the input and output: the input ObjectInspectors and the output struct.
abstract StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException;

// Processes one input record and emits output rows via forward().
abstract void process(Object[] record) throws HiveException;

// Notifies the UDTF that there are no more rows to process.
// Use it to clean up or to emit any remaining output.
abstract void close() throws HiveException;
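Each row passed to forward() must match the struct returned by initialize(): one value per declared field, in the declared order. Hive calls initialize() once, process() once per input row, and close() after the last row.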
Now let's look at an example that splits a string into two fields:
import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(
        name = "explode_name",
        value = "_FUNC_(col) - The parameter is a column name."
                + " The return value is two strings.",
        extended = "Example:\n"
                + " > SELECT _FUNC_(col) FROM src;\n"
                + " > SELECT _FUNC_(col) AS (name, surname) FROM src;\n"
                + " > SELECT adTable.name, adTable.surname\n"
                + " > FROM src LATERAL VIEW _FUNC_(col) adTable AS name, surname;")
public class ExplodeNameUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        if (argOIs.length != 1) {
            throw new UDFArgumentException("ExplodeNameUDTF takes exactly one argument.");
        }
        // Reject anything that is not a primitive string. (Check the category
        // first with ||, not &&, so a non-primitive argument is never cast.)
        if (argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) argOIs[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(0, "ExplodeNameUDTF takes a string as a parameter.");
        }

        // Declare the output struct: two string columns, name and surname.
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("name");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("surname");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Split the input on a space and emit the pieces as one output row.
        String input = args[0].toString();
        String[] name = input.split(" ");
        forward(name);
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }
}
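Before deploying to the cluster, the UDTF can be smoke-tested in a plain JVM. The driver below is a minimal sketch that is not part of the original post: it assumes the ExplodeNameUDTF class above is on the classpath and uses GenericUDTF.setCollector() to capture the rows that process() forwards.

import java.util.Arrays;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.Collector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical local test driver, not part of the original tutorial.
public class ExplodeNameUDTFDriver {

    public static void main(String[] args) throws HiveException {
        ExplodeNameUDTF udtf = new ExplodeNameUDTF();

        // Capture every row the UDTF forwards and print it.
        udtf.setCollector(new Collector() {
            @Override
            public void collect(Object input) {
                System.out.println(Arrays.toString((Object[]) input));
            }
        });

        // Mirror what Hive does: initialize with a string inspector,
        // feed records to process(), then signal end-of-input.
        udtf.initialize(new ObjectInspector[] {
                PrimitiveObjectInspectorFactory.javaStringObjectInspector });
        udtf.process(new Object[] { "John Doe" });   // prints [John, Doe]
        udtf.close();
    }
}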
Build the jar, add it to Hive, register the function, and run it:

hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION explode_name AS "edu.wzm.hive.udtf.ExplodeNameUDTF";
hive (mydb)> SELECT explode_name(name) FROM employee;
Query ID = root_20160118000909_c2052a8b-dc3b-4579-931e-d9059c00c25b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453096763931_0005, Tracking URL = http://master:8088/proxy/application_1453096763931_0005/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-18 00:09:08,643 Stage-1 map = 0%, reduce = 0%
2016-01-18 00:09:14,152 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
MapReduce Total cumulative CPU time: 1 seconds 30 msec
Ended Job = job_1453096763931_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.03 sec   HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 30 msec
OK
John	Doe
Mary	Smith
Todd	Jones
Bill	King
Boss	Man
Fred	Finance
Stacy	Accountant
Time taken: 13.765 seconds, Fetched: 7 row(s)
A UDTF can be used in two ways:
1. Directly with SELECT
SELECT explode_name(name) FROM employee;

-- The AS clause renames the UDTF's output columns:
SELECT explode_name(name) AS (name, surname) FROM employee;
However, the following variants are not supported: a UDTF cannot be mixed with other expressions in the select list, cannot be nested, and cannot be combined with GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY:

-- Not supported: other expressions in the same select list
SELECT name, explode_name(name) AS (name, surname) FROM employee;

-- Not supported: nested UDTF calls
SELECT explode_name(explode_name(name)) FROM employee;

-- Not supported: GROUP BY with a UDTF
SELECT explode_name(name) FROM employee GROUP BY name;
2. With LATERAL VIEW
SELECT adTable.name, adTable.surname
FROM employee LATERAL VIEW explode_name(name) adTable AS name, surname;
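LATERAL VIEW applies explode_name to each row of employee and joins the input row to the generated rows, with adTable as the virtual table alias and name, surname as its column aliases. This is what lifts the restrictions listed above: other columns of employee can appear in the select list alongside the UDTF's output.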