Hive UDF Tutorial (Part 2)

Hive UDF Tutorial (Part 1)

Hive UDF Tutorial (Part 2)

Hive UDF Tutorial (Part 3)

1. UDTF

The previous post covered the basic UDF interfaces, UDF and GenericUDF. This post covers the more complex user-defined table-generating function (UDTF). A UDTF takes zero or more inputs and produces output with multiple columns or multiple rows, explode() being the classic example. To implement a UDTF, extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implement three methods:

// Specifies the input and output: the ObjectInspectors of the input arguments
// and the struct describing the rows the UDTF will generate.
abstract StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException;

// Processes one input record and emits output rows via forward().
abstract void process(Object[] record) throws HiveException;

// Notifies the UDTF that there are no more rows to process. Use it to clean up
// or to emit any remaining output.
abstract void close() throws HiveException;
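
Note that process() may call forward() any number of times, so a single input record can fan out into several output rows. A hypothetical process() body that emits one row per word (not the example built below, which emits one two-column row) might look like this:

	public void process(Object[] record) throws HiveException {
		// One forward() call per token: each word becomes its own output row.
		for (String word : record[0].toString().split(" ")) {
			forward(new Object[]{word});
		}
	}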


2. Example

Let's look at an example that splits a string into two columns:

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(
	name = "explode_name",
	value = "_FUNC_(col) - The parameter is a column name."
		+ " The return value is two strings.",
	extended = "Example:\n"
		+ " > SELECT _FUNC_(col) FROM src;\n"
		+ " > SELECT _FUNC_(col) AS (name, surname) FROM src;\n"
		+ " > SELECT adTable.name, adTable.surname\n"
		+ " > FROM src LATERAL VIEW _FUNC_(col) adTable AS name, surname;"
)
public class ExplodeNameUDTF extends GenericUDTF {

	@Override
	public StructObjectInspector initialize(ObjectInspector[] argOIs)
			throws UDFArgumentException {

		// Exactly one argument of type string is accepted.
		if (argOIs.length != 1) {
			throw new UDFArgumentException("ExplodeNameUDTF takes exactly one argument.");
		}
		if (argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
				|| ((PrimitiveObjectInspector) argOIs[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
			throw new UDFArgumentTypeException(0, "ExplodeNameUDTF takes a string as a parameter.");
		}

		// The output struct has two string columns: name and surname.
		ArrayList<String> fieldNames = new ArrayList<String>();
		ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
		fieldNames.add("name");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		fieldNames.add("surname");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

		return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
	}

	@Override
	public void process(Object[] args) throws HiveException {
		// Split the input on the space and emit the pieces as one two-column row.
		String input = args[0].toString();
		String[] name = input.split(" ");
		forward(name);
	}

	@Override
	public void close() throws HiveException {
		// Nothing to clean up.
	}

}
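
Before packaging the jar, the UDTF can be exercised locally from a plain main() method. The following is only a test sketch (not part of the original post); it assumes the Hive libraries are on the local classpath and uses GenericUDTF.setCollector() to capture what forward() emits:

import java.util.Arrays;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.Collector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ExplodeNameUDTFTest {
	public static void main(String[] args) throws Exception {
		ExplodeNameUDTF udtf = new ExplodeNameUDTF();
		udtf.initialize(new ObjectInspector[]{
				PrimitiveObjectInspectorFactory.javaStringObjectInspector});
		// Print every row the UDTF forwards instead of handing it to Hive.
		udtf.setCollector(new Collector() {
			@Override
			public void collect(Object input) throws HiveException {
				System.out.println(Arrays.toString((Object[]) input));
			}
		});
		udtf.process(new Object[]{"John Doe"});   // prints [John, Doe]
		udtf.close();
	}
}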

Next, compile and package the code, add the resulting jar to the CLASSPATH, create the function explode_name(), and run it against the employee table from the previous post:

hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION explode_name AS "edu.wzm.hive.udtf.ExplodeNameUDTF";
hive (mydb)> SELECT explode_name(name) FROM employee;                                         
Query ID = root_20160118000909_c2052a8b-dc3b-4579-931e-d9059c00c25b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453096763931_0005, Tracking URL = http://master:8088/proxy/application_1453096763931_0005/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453096763931_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-18 00:09:08,643 Stage-1 map = 0%,  reduce = 0%
2016-01-18 00:09:14,152 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.03 sec
MapReduce Total cumulative CPU time: 1 seconds 30 msec
Ended Job = job_1453096763931_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.03 sec   HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 30 msec
OK
John	Doe
Mary	Smith
Todd	Jones
Bill	King
Boss	Man
Fred	Finance
Stacy	Accountant
Time taken: 13.765 seconds, Fetched: 7 row(s)
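
Because the class carries an @Description annotation, the registered function's usage text can also be checked from the CLI (output omitted here, as it varies with the Hive version):

hive (mydb)> DESCRIBE FUNCTION explode_name;
hive (mydb)> DESCRIBE FUNCTION EXTENDED explode_name;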

3. Using a UDTF

A UDTF can be used in two ways:

1. Directly in a SELECT statement

SELECT explode_name(name) AS (name, surname) FROM employee
SELECT explode_name(name) FROM employee

However, no other columns may appear in the SELECT list:

SELECT name,explode_name(name) AS (name, surname) FROM employee

It cannot be nested inside another function call:

SELECT explode_name(explode_name(name)) FROM employee

And it cannot be combined with GROUP BY / SORT BY / DISTRIBUTE BY / CLUSTER BY:

SELECT explode_name(name) FROM employee GROUP BY name


2. With LATERAL VIEW

SELECT adTable.name,adTable.surname 
FROM employee 
LATERAL VIEW explode_name(name) adTable AS name, surname;
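
Unlike the plain SELECT form, LATERAL VIEW also lets the base table's own columns appear next to the generated ones, for example (a sketch reusing only the name column shown above):

SELECT employee.name, adTable.surname
FROM employee
LATERAL VIEW explode_name(name) adTable AS name, surname;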

The source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf

