Hive中常被忽视的利器——宏

我们都知道Hive中有UDF(user defined function)——即用户自定义函数,但是由于UDF是Java编写的,代码中堆变量的内存回收完全不受开发者控制,而UDF程序又是嵌套在Hive SQL中执行的,对规模较大的表,就往往会出现由于UDF内存回收不及时造成的out-of-memory错误。因此,在生产环境中,UDF是严格受限的。那么,怎么办呢?

幸好,大多数情况下我们并不是真正需要(不得不用)UDF,大多数时候我们仅仅只是需要把一些虽然繁琐但其实结构简单的逻辑封装起来以便重复使用。举个栗子:

SELECT
        if(birthday is not null and length(birthday)=4,
            case 
                when birthday >= '0321' and birthday <= '0420' then '白羊座'
                when birthday >= '0421' and birthday <= '0520' then '金牛座'
                when birthday >= '0521' and birthday <= '0621' then '双子座'
                when birthday >= '0622' and birthday <= '0722' then '巨蟹座'
                when birthday >= '0723' and birthday <= '0822' then '狮子座'
                when birthday >= '0823' and birthday <= '0922' then '处女座'
                when birthday >= '0923' and birthday <= '1022' then '天秤座'
                when birthday >= '1023' and birthday <= '1121' then '天蝎座'
                when birthday >= '1122' and birthday <= '1221' then '射手座'
                when birthday >= '1222' and birthday <= '1231' then '摩羯座'
                when birthday >= '0101' and birthday <= '0119' then '摩羯座'
                when birthday >= '0120' and birthday <= '0218' then '水瓶座'
                when birthday >= '0219' and birthday <= '0320' then '双鱼座'
                else null
            end, null) as zodiac
FROM
    employee;

这么长一大段其实只是完成了一个很简单的逻辑——把生日转换成星座,如果这个逻辑需要在多个地方重复使用,那代码就会变得非常难看,几乎没法维护。这时往往我们就想到要去定义一个UDF,但其实没必要,杀鸡焉用牛刀。现在轮到本文的主角隆重出场了。我们可以创建一个宏:
--根据生日推算星座

DROP TEMPORARY MACRO IF EXISTS getZodiacFromBirth;
CREATE TEMPORARY MACRO getZodiacFromBirth(birthday string)
    if(birthday is not null and length(birthday)=4,
        case 
            when birthday >= '0321' and birthday <= '0420' then '白羊座'
            when birthday >= '0421' and birthday <= '0520' then '金牛座'
            when birthday >= '0521' and birthday <= '0621' then '双子座'
            when birthday >= '0622' and birthday <= '0722' then '巨蟹座'
            when birthday >= '0723' and birthday <= '0822' then '狮子座'
            when birthday >= '0823' and birthday <= '0922' then '处女座'
            when birthday >= '0923' and birthday <= '1022' then '天秤座'
            when birthday >= '1023' and birthday <= '1121' then '天蝎座'
            when birthday >= '1122' and birthday <= '1221' then '射手座'
            when birthday >= '1222' and birthday <= '1231' then '摩羯座'
            when birthday >= '0101' and birthday <= '0119' then '摩羯座'
            when birthday >= '0120' and birthday <= '0218' then '水瓶座'
            when birthday >= '0219' and birthday <= '0320' then '双鱼座'
            else null
        end, null);

然后,我们前面这个SQL就可以简化成下面这个版本了。

select getZodiacFromBirth(birthday) from employee;

是不是很清爽呢?其实大多数不需要循环结构的逻辑,基本上都可以用宏来处理。它不仅可以用来做字段值的转换映射,也可以做逻辑校验。比如:

--判断身份证号是否合法

DROP TEMPORARY MACRO IS_VALID_IDNO;
CREATE TEMPORARY MACRO IS_VALID_IDNO(idno string)
IF(idno rlike '^[1-9]\\d{5}(19|20)\\d{2}(0[1-9]|1[0-2])([0-2]\\d|3[0-1])\\d{4}$'
    OR
    idno rlike '^[1-9]\\d{5}\\d{2}(0[1-9]|1[0-2])([0-2]\\d|3[0-1])\\d{3}$',true,false);

聪明如你,现在应该已经可以举一反三的运用宏这柄利器了,打字很累,我也就不再举例了。
最后,说一下宏的局限性。宏只能是临时宏,只在本次会话中可见、有效。因此你需要将宏脚本放在SQL脚本的头部。

--------------------------------------------------end--------------------------------------------------

你可能感兴趣的:(大数据,hadoop,hive)