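Now let's learn how to write a custom function with the Hive API. To define a custom Hive function, extend the org.apache.hadoop.hive.ql.exec.UDF class and define one or more evaluate methods in the implementing class. During query processing, Hive creates one instance of the function class for each use of the function in the query, and calls the evaluate method once for each input row. We first look at how Hive's built-in sin function is defined, and then define a function of our own.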
/**
 * UDFSin.
 *
 */
@Description(name = "sin",
    value = "_FUNC_(x) - returns the sine of x (x is in radians)",
    extended = "Example:\n "
    + "  > SELECT _FUNC_(0) FROM src LIMIT 1;\n" + "  0")
@VectorizedExpressions({FuncSinLongToDouble.class, FuncSinDoubleToDouble.class})
public class UDFSin extends UDFMath {
  private final DoubleWritable result = new DoubleWritable();

  public UDFSin() {
  }

  /**
   * Take Sine of a.
   */
  public DoubleWritable evaluate(DoubleWritable a) {
    if (a == null) {
      return null;
    } else {
      result.set(Math.sin(a.get()));
      return result;
    }
  }
}
package learning;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// The @Description text is what DESCRIBE FUNCTION [EXTENDED] prints;
// _FUNC_ is replaced by the name the function is registered under.
@Description(name = "castStatusToDes",
    value = "_FUNC_(x) - returns the description of x (x is status code of HTTP)",
    extended = "Example:\n "
    + "  > SELECT _FUNC_(200) FROM src LIMIT 1;\n" + "  OK")
public class CastStatusToDes extends UDF {

  // Called once per input row: maps an HTTP status code to a short description.
  public Text evaluate(IntWritable status) {
    if (status == null) {
      return null;            // NULL in, NULL out
    } else if (status.get() == 200) {
      return new Text("OK");
    } else {
      return new Text("Others");
    }
  }
}
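The two rules stated at the start (a class may define one or more evaluate methods, and evaluate is called once per input row) can be tried out without a Hive installation. The following is only a rough sketch in plain Java, not Hive's actual resolver code: EvaluateDispatchSketch, StatusToDescription, callEvaluate and applyToRows are hypothetical names made up for illustration, and ordinary Integer and String values stand in for Hive's Writable types. It assumes that the evaluate overload is chosen by the runtime type of the argument, which is a simplification.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvaluateDispatchSketch {

    /** Hypothetical stand-in for a UDF class: plain Java, no Hive types. */
    public static class StatusToDescription {
        // Overload for integer status codes.
        public String evaluate(Integer status) {
            if (status == null) {
                return null;
            }
            return status == 200 ? "OK" : "Others";
        }

        // Overload for status codes that arrive as strings.
        public String evaluate(String status) {
            if (status == null) {
                return null;
            }
            return evaluate(Integer.valueOf(status.trim()));
        }
    }

    /**
     * Looks up the public evaluate overload whose parameter type equals the
     * argument's runtime class and invokes it. The argument must not be null
     * in this simplified sketch.
     */
    static Object callEvaluate(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable evaluate method", e);
        }
    }

    /** Creates one function instance, then calls evaluate once per row. */
    static List<Object> applyToRows(List<Object> rows) {
        Object udf = new StatusToDescription();
        List<Object> out = new ArrayList<>();
        for (Object row : rows) {
            out.add(callEvaluate(udf, row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> rows = Arrays.asList(200, 404, "200");
        System.out.println(applyToRows(rows)); // prints [OK, Others, OK]
    }
}
```

The integer rows go to evaluate(Integer) and the string row goes to evaluate(String), while a single instance serves all three rows. That single shared instance is also why UDFSin above can safely keep one reusable result object as a field.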
hive> CREATE TEMPORARY FUNCTION cast_http as 'learning.CastStatusToDes';
OK
Time taken: 0.054 seconds
hive> describe function cast_http;
OK
cast_http(x) - returns the description of x (x is status code of HTTP)
Time taken: 0.762 seconds, Fetched: 1 row(s)
hive> describe function extended cast_http;
OK
cast_http(x) - returns the description of x (x is status code of HTTP)
Example:
   > SELECT cast_http(200) FROM src LIMIT 1;
  OK
Time taken: 0.097 seconds, Fetched: 4 row(s)
hive> SELECT cast_http(200) FROM ccp LIMIT 1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201409091422_0002, Tracking URL = http://hadoop:50030/jobdetails.jsp?jobid=job_201409091422_0002
Kill Command = /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201409091422_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2014-09-09 15:04:15,097 Stage-1 map = 0%, reduce = 0%
2014-09-09 15:04:25,203 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.93 sec
2014-09-09 15:04:29,250 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.93 sec
MapReduce Total cumulative CPU time: 6 seconds 930 msec
Ended Job = job_201409091422_0002
MapReduce Jobs Launched:
Job 0: Map: 2 Cumulative CPU: 6.93 sec HDFS Read: 9242 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 930 msec
OK
OK
Time taken: 31.337 seconds, Fetched: 1 row(s)