Shark UDFs are claimed to be fully compatible with Hive. I tried this today and have not run into any problems so far.
Reference [1] solves this problem on Hadoop 1; here we solve the same problem on top of Hadoop 2.
Problem
This is the classic top-k problem. An example:
Source data (key value pairs):

```
100 10
200 12
300 33
100 4
100 8
200 20
300 31
300 3
400 4
200 2
```

Expected output (the top 2 values per key, values in descending order):

```
100 10
100 8
200 20
200 12
300 33
300 31
400 4
```
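To make the expected transformation concrete before bringing Hive into it, here is a minimal standalone sketch in plain Java (the class and method names are my own, not from the post) that groups the pairs by key, sorts each group's values in descending order, and keeps the top k:

```java
import java.util.*;

public class TopK {
    // For each key, keep its k largest values; keys iterate in ascending order.
    static Map<Integer, List<Integer>> topKPerKey(int[][] pairs, int k) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int[] p : pairs) {
            groups.computeIfAbsent(p[0], key -> new ArrayList<>()).add(p[1]);
        }
        for (List<Integer> vals : groups.values()) {
            vals.sort(Comparator.reverseOrder());               // value desc
            if (vals.size() > k) vals.subList(k, vals.size()).clear();
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] data = {{100, 10}, {200, 12}, {300, 33}, {100, 4}, {100, 8},
                        {200, 20}, {300, 31}, {300, 3}, {400, 4}, {200, 2}};
        for (Map.Entry<Integer, List<Integer>> e : topKPerKey(data, 2).entrySet()) {
            for (int v : e.getValue()) {
                System.out.println(e.getKey() + " " + v);
            }
        }
    }
}
```

Running this on the source data above prints exactly the expected output. The Hive query below computes the same thing, but distributed across reducers.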
The approach: distribute by and sort by. `distribute by key` routes all rows with the same key to the same reducer, and `sort by key, val desc` orders the rows within each reducer, so a stateful rank UDF sees each key's rows consecutively and in descending value order.
Writing the UDF
Source code
```java
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;
    private String last_key;

    // Returns 1, 2, 3, ... for consecutive rows with the same key,
    // resetting to 1 whenever the key changes.
    public int evaluate(final String key) {
        if (!key.equalsIgnoreCase(this.last_key)) {
            this.counter = 1;
            this.last_key = key;
        }
        return this.counter++;
    }
}
```

Next, write a Makefile so that a single `make` compiles the code and builds the jar. Instead of the popular mvn, a Makefile feels simpler here.
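The evaluate logic can be exercised outside Hive. Below is a hypothetical stand-in class (same fields, same method body, just without extending `UDF`, since the base class only matters for how Hive discovers the function) showing the counter resetting at each key change on input already sorted by key:

```java
// Stand-in for the Rank UDF above, minus the Hive dependency.
public class RankDemo {
    private int counter;
    private String last_key;

    public int evaluate(final String key) {
        if (!key.equalsIgnoreCase(this.last_key)) {
            this.counter = 1;     // new key: restart the per-key counter
            this.last_key = key;
        }
        return this.counter++;    // 1, 2, 3, ... within one key
    }

    public static void main(String[] args) {
        RankDemo r = new RankDemo();
        for (String k : new String[]{"100", "100", "100", "200", "200", "400"}) {
            System.out.println(k + " -> " + r.evaluate(k));
        }
        // prints 100 -> 1, 100 -> 2, 100 -> 3, 200 -> 1, 200 -> 2, 400 -> 1
    }
}
```

Note the first call works because `equalsIgnoreCase(null)` returns false, so the counter is initialized on the very first key. This UDF is only correct when the rows for one key arrive consecutively and on one reducer, which is exactly what the distribute by / sort by combination guarantees.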
```makefile
UDF_CLASSPATH := $(addprefix ${HADOOP_HOME}/share/hadoop/mapreduce/,hadoop-mapreduce-client-core-2.2.0.jar \
	hadoop-mapreduce-client-common-2.2.0.jar \
	hadoop-mapreduce-client-jobclient-2.2.0.jar)
UDF_CLASSPATH += $(addprefix ${HADOOP_HOME}/share/hadoop/common/,hadoop-common-2.2.0.jar)
UDF_CLASSPATH += $(addprefix ${HIVE_HOME}/lib/,hive-exec-0.11.0.jar)
UDF_CLASSPATH := $(shell echo ${UDF_CLASSPATH} | tr -s ' ' ':')

JAR := jar
JAVAC := javac
CLASSDIR := class

__mkdir := $(shell for i in ${CLASSDIR}; do [ -d $$i ] || mkdir -p $$i; done)

SRC := $(wildcard *.java)
TARGET := $(patsubst %.java,%.jar,${SRC})

.PHONY: all clean

all: ${TARGET}

${TARGET}:%.jar:%.java
	${JAVAC} -cp ${UDF_CLASSPATH} -d ${CLASSDIR} $<
	${JAR} -cf $@ -C ${CLASSDIR} .

clean:
	${RM} -r ${TARGET} ${CLASSDIR}
```
```sql
add jar Rank.jar;
create temporary function myrank as "com.example.hive.udf.Rank";

select key, val
from (select key, val, myrank(key) r
      from (select key, val from rank
            distribute by key
            sort by key, val desc) t1) t2
where t2.r < 3;
```

```
OK
100 10
100 8
200 20
200 12
300 33
300 31
400 4
Time taken: 1.093 seconds
```
References
[1] http://www.cnblogs.com/Torstan/p/3423859.html