Hive 学习笔记(三)

自定义函数

当写hive UDF时,有两个选择:一是继承 UDF类,二是继承抽象类GenericUDF。这两种实现不同之处是:GenericUDF 可以处理复杂类型参数,并且继承GenericUDF更加有效率,因为UDF class 需要HIve使用反射的方式去实现。

Hive 学习笔记(三)_第1张图片

UDF

一个UDF 必须满足两个条件:
1. 必须继承 org.apache.hadoop.hive.ql.exec.UDF类
2. 至少实现了一个 evaluate() 方法

/**
 * A User-defined function (UDF) for the use with Hive.
 *
 * New UDF classes need to inherit from this UDF class.
 *
 * Required for all UDF classes: 1. Implement one or more methods named
 * "evaluate" which will be called by Hive. The following are some examples:
 * public int evaluate(); public int evaluate(int a); public double evaluate(int
 * a, double b); public String evaluate(String a, int b, String c);
 *
 * "evaluate" should never be a void method. However it can return "null" if
 * needed.
 */
@UDFType(deterministic = true)
public class UDF {

  /**
   * The resolver to use for method resolution.
   */
  private UDFMethodResolver rslv;

  /**
   * The constructor.
   */
  public UDF() {
    rslv = new DefaultUDFMethodResolver(this.getClass());
  }

  /**
   * The constructor with user-provided UDFMethodResolver.
   */
  protected UDF(UDFMethodResolver rslv) {
    this.rslv = rslv;
  }

  /**
   * Sets the resolver.
   *
   * @param rslv
   *          The method resolver to use for method resolution.
   */
  public void setResolver(UDFMethodResolver rslv) {
    this.rslv = rslv;
  }

  /**
   * Get the method resolver.
   */
  public UDFMethodResolver getResolver() {
    return rslv;
  }

  /**
   * These can be overriden to provide the same functionality as the
   * correspondingly named methods in GenericUDF.
   */
  public String[] getRequiredJars() {
    return null;
  }

  public String[] getRequiredFiles() {
    return null;
  }
}

UDF 函数只能处理简单的java标量类型,如int,double,String等类型

GenericUDF

使用Geoip 根据IP定位国家,实现GenericUDF。

public class GeoLocUDF extends GenericUDF {

    private LookupService service;
    //用于将输入类型转换为目标类型
    private ObjectInspectorConverters.Converter[] converters;


    /**
     * 初始化
     * @param arguments
     * @return
     * @throws UDFArgumentException
     */
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {

        if (arguments.length !=2){
            throw new UDFArgumentLengthException(
                    "The function COUNTRY(ip, geolocfile) takes exactly 2 arguments.");
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            //创建一个Converter,可以在evaluate方法中使用它 将参数的原生数据类型转存未Java Strings。
            converters[i] = ObjectInspectorConverters.getConverter(arguments[i],
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        // 指定UDF (evaluate函数)返回类型为String
        return PrimitiveObjectInspectorFactory
                .getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.STRING);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments[0].get() == null || arguments[1].get() == null) {
            return null;
        }

        String ip = (String) converters[0].convert(arguments[0].get());
        String filename = (String) converters[1].convert(arguments[1].get());

        return lookup(ip, filename);
    }
    //在出现异常时,会调用这个方法,输出调用信息
    @Override
    public String getDisplayString(String[] children) {
        return "country(" + children[0] + ", " + children[1] + ")";
    }


    protected String lookup(String ip, String filename) throws HiveException {
        try {
            if (service == null) {
                URL u = getClass().getClassLoader().getResource(filename);
                if (u == null) {
                    throw new HiveException("Couldn't find geolocation file '" + filename + "'");
                }
                service =
                        new LookupService(u.getFile(), LookupService.GEOIP_MEMORY_CACHE);
            }

            String countryCode = service.getCountry(ip).getCode();

            if ("--".equals(countryCode)) {
                return null;
            }

            return countryCode;
        } catch (IOException e) {
            throw new HiveException("Caught IO exception", e);
        }
    }
}

在Hive中调用UDF

ADD JAR /path/hive-test.jar;
CREATE temporary FUNCTION geoIp AS 'com.hadoop2.hive.GeoLocUDF';

temporary 关键字表示UDF只是为这个Hive会话过程定义的。

深入学习:https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy

你可能感兴趣的:(hadoop,hive)