Bloom Filter on Redis in Practice

Bloom Filter

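A Bloom filter is a probabilistic lookup structure that maps each element through several hash functions. It is typically used for deduplication over large data sets and under high concurrency, in scenarios that do not require 100% accuracy.
A Bloom filter works as follows:

Allocate a BitSet of m bits, all initially 0.
Hash the object to be deduplicated K times and check whether the bit at each resulting index in the BitSet is 1.
If every bit checked in the previous step is 1, the object is judged to have been stored already and can be deduplicated or filtered out (with a small chance of a false positive). If any bit is 0, the object has definitely not been stored, and inserting it sets all K bits to 1.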
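
The steps above can be sketched with `java.util.BitSet`. This is a minimal in-memory illustration, not the Redis-backed version built later in this post. The class name `SimpleBloomFilter` and the way the K indexes are derived (two halves of a 64-bit FNV-1a hash combined by double hashing) are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical in-memory Bloom filter, for illustration only. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** 64-bit FNV-1a hash; any well-mixed hash would do here. */
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** i-th index derived by double hashing: h1 + i * h2, reduced modulo m. */
    private int index(String value, int i) {
        long h = fnv1a64(value);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, m);
    }

    /** Insert: set all k bits to 1. */
    void add(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    /** Query: false means definitely absent; true means probably present. */
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("user:1001");
        // An inserted value always reports true (no false negatives).
        System.out.println(filter.mightContain("user:1001"));
        // A value that was never inserted reports false, or rarely a false positive.
        System.out.println(filter.mightContain("user:2002"));
    }
}
```

The asymmetry is the point: a `false` answer is always reliable, while a `true` answer is only probable.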

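Parameter design

With the description above you should have a rough picture of the Bloom filter. Here are the parameters it needs:

Number of objects to insert: n
False positive rate of the Bloom filter: P
Number of hash functions: K
Number of bits in the BitSet: m

In a real project, the number of objects to insert and the false positive rate can be estimated and fixed in advance, so the key question is how to choose m and k. Each of these parameters has a formula: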

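False positive rate P:

$$P = \left(1 - e^{-kn/m}\right)^{k}$$

Number of hash functions K (the value of k that minimizes P for fixed m and n):

$$\ln P = k \ln\left(1 - e^{-kn/m}\right)$$

Setting the derivative with respect to k to zero gives:

$$k = \frac{m}{n} \ln 2$$

Size of the bit array m:

Combining the false positive rate formula with the formula for k gives:

$$P = \left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n} \ln 2}, \qquad \ln P = -\frac{m}{n} (\ln 2)^2$$

Solving for m:

$$m = -\frac{n \ln P}{(\ln 2)^2}$$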
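
As a worked example (the numbers are illustrative and not from the original post), take n = 1,000,000 expected insertions and a target false positive rate P = 0.01, using the standard sizing formulas m = -n ln P / (ln 2)^2 and k = (m/n) ln 2:

```latex
\begin{aligned}
m &= -\frac{n \ln P}{(\ln 2)^2}
   = \frac{10^6 \times 4.6052}{0.4805}
   \approx 9.59 \times 10^6 \text{ bits} \approx 1.14\ \text{MiB} \\
k &= \frac{m}{n} \ln 2 \approx 9.585 \times 0.6931 \approx 6.64
   \;\Rightarrow\; k = 7
\end{aligned}
```

That is roughly 9.6 bits per element for a 1% false positive rate, regardless of how large each element is.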

Redis

Those familiar with Redis know that it supports bit-level storage (bitmaps), which can greatly reduce Redis memory usage. The commonly used bit commands are:

SETBIT KEY OFFSET VALUE 
GETBIT KEY OFFSET

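Combining this structure with a Bloom filter yields a distributed Bloom filter.
The approach is:

Hash the object being checked K times to obtain K bit offsets.
Call GETBIT for each offset and check whether every returned value is 1.
If all K values are 1, the object is considered to have been stored already.
Otherwise, store the object by setting those K bits with SETBIT.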
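
The flow above can be sketched against a tiny abstraction of the two commands. `BitStore`, `InMemoryBitStore` and `BitStoreBloomFlow` are hypothetical names introduced for this illustration. The in-memory class only mimics GETBIT/SETBIT behavior (unset bits read as 0, SETBIT returns the previous bit) so the sketch runs without a Redis server; in a real setup the interface would be backed by a Redis client.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction over the two Redis bitmap commands. */
interface BitStore {
    int getBit(String key, long offset);            // like GETBIT key offset
    int setBit(String key, long offset, int value); // like SETBIT key offset value; returns the old bit
}

/** In-memory stand-in so the sketch runs without a Redis server. */
final class InMemoryBitStore implements BitStore {
    private final Map<String, BitSet> data = new HashMap<>();

    @Override
    public int getBit(String key, long offset) {
        BitSet bits = data.get(key);
        return (bits != null && bits.get((int) offset)) ? 1 : 0;
    }

    @Override
    public int setBit(String key, long offset, int value) {
        BitSet bits = data.computeIfAbsent(key, k -> new BitSet());
        int old = bits.get((int) offset) ? 1 : 0;
        bits.set((int) offset, value == 1);
        return old;
    }
}

final class BitStoreBloomFlow {
    /** Returns true if every offset already holds 1 (object probably seen before). */
    static boolean allBitsSet(BitStore store, String key, long[] offsets) {
        for (long offset : offsets) {
            if (store.getBit(key, offset) == 0) {
                return false; // one 0 bit is enough: the object was never stored
            }
        }
        return true;
    }

    /** Checks the object and records it when it is new; returns true if it was new. */
    static boolean addIfAbsent(BitStore store, String key, long[] offsets) {
        if (allBitsSet(store, key, offsets)) {
            return false;
        }
        for (long offset : offsets) {
            store.setBit(key, offset, 1);
        }
        return true;
    }

    public static void main(String[] args) {
        BitStore store = new InMemoryBitStore();
        long[] offsets = {17L, 4242L, 99991L}; // pretend these came from K hash functions
        System.out.println(addIfAbsent(store, "bloom:demo", offsets)); // true: first time seen
        System.out.println(addIfAbsent(store, "bloom:demo", offsets)); // false: all K bits already 1
    }
}
```

In this sketch the check and the set are separate steps. Against a real server each call would be its own round trip and the sequence would not be atomic, which is one reason to move the loop into a server-side script.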

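With the flow and logic covered, it is time to write the code.
Before using the Bloom filter we can estimate the false positive rate P and the number of insertions n, so m and k can be computed up front.

Compute the preallocated bitmap length

The formula above gives the number of bits, but note that it produces a floating-point value, which has to be converted to an integer.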

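    /**
     * Computes the length of the bit array:
     * m = -n * ln(p) / (ln 2)^2
     * @param n number of insertions
     * @param p false positive probability
     */
    private int numOfBits(int n, double p) {
        if (p == 0) {
            // Avoid ln(0) = -Infinity.
            p = Double.MIN_VALUE;
        }
        // The formula yields a double; the cast truncates it to an int.
        int sizeOfBitArray = (int) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        return sizeOfBitArray;
    }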
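
One caveat with the `int` return type: for large n the formula exceeds `Integer.MAX_VALUE`, and a Java narrowing cast from `double` to `int` silently saturates at that limit. The sketch below, under the hypothetical name `BitSizing.numOfBitsLong`, computes the size as a `long` and rounds up. Whether to reject or cap oversized values is left as a design choice.

```java
/** Hypothetical helper illustrating bit-array sizing with a long result. */
final class BitSizing {
    /** m = ceil(-n * ln(p) / (ln 2)^2), computed in double and returned as long. */
    static long numOfBitsLong(long n, double p) {
        if (n <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("require n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        // One million items at 1% fits comfortably in an int.
        System.out.println(numOfBitsLong(1_000_000L, 0.01));
        // Ten billion items at 0.1% needs far more than Integer.MAX_VALUE bits.
        long big = numOfBitsLong(10_000_000_000L, 0.001);
        System.out.println(big > Integer.MAX_VALUE);        // true
        // An int cast of the same value saturates instead of reporting the real size.
        int truncated = (int) (-1e10 * Math.log(0.001) / (Math.log(2) * Math.log(2)));
        System.out.println(truncated == Integer.MAX_VALUE); // true
    }
}
```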

Compute the number of hash functions

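    /**
     * Computes the number of hash functions:
     * k = m / n * ln 2
     * @param n number of inserted items
     * @param m number of bits
     */
    private int numberOfHashFunctions(long n, long m) {
        // Round to the nearest integer, with a minimum of one hash function.
        int countOfHash = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        return countOfHash;
    }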

Compute the set of offsets produced by the hash functions

The hash function used here is the murmur hash (murmur3_128) from Guava.

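    // Members of the BloomFilterHelper<T> class; requires com.google.common.base.Preconditions,
    // com.google.common.hash.Funnel, com.google.common.hash.Hashing, java.util.ArrayList, java.util.List.

    // number of hash functions
    private int numHashFunctions;
    // length of the bit array
    private int bitSize;

    private Funnel<T> funnel;

    // upper bound on the bitmap length (2^30 bits)
    private static final int MAX_BIT_SIZE = 1 << 30;

    // upper bound on the number of hash functions
    private static final int DEFAULT_HASH = 3;

    public BloomFilterHelper(Funnel<T> funnel, int expectedInsertions, double fpp) {
        Preconditions.checkArgument(funnel != null, "funnel must not be null");
        this.funnel = funnel;
        bitSize = Math.min(numOfBits(expectedInsertions, fpp), MAX_BIT_SIZE);
        numHashFunctions = Math.min(numberOfHashFunctions(expectedInsertions, bitSize), DEFAULT_HASH);
    }

    /**
     * Computes the bit offsets for a value.
     * @param value the object to hash
     * @return one offset in [0, bitSize) per hash function
     */
    public List<Long> murmurHashOffset(T value) {
        List<Long> offsetList = new ArrayList<>(numHashFunctions);
        long hash64 = Hashing.murmur3_128().hashObject(value, funnel).asLong();
        long hash2 = (hash64 >>> 32);
        // Double hashing: derive the i-th hash from the 64-bit hash and its upper half.
        for (int i = 1; i <= numHashFunctions; i++) {
            long nextHash = hash64 + i * hash2;
            if (nextHash < 0) {
                // Flip the bits to make the value non-negative.
                nextHash = ~nextHash;
            }
            offsetList.add(nextHash % bitSize);
        }
        return offsetList;
    }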
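
Note that the constructor caps the number of hash functions at `DEFAULT_HASH` (3) via `Math.min`. This keeps the number of Redis bit operations small, but it raises the false positive rate above the target whenever the optimal k is larger. The sketch below estimates that effect with the standard approximation P ≈ (1 - e^(-kn/m))^k; the class name `FppEstimate` and the sample numbers are assumptions for illustration.

```java
/** Hypothetical helper: expected false positive rate for given n, m, k. */
final class FppEstimate {
    /** Standard approximation P = (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // expected insertions
        long m = 9_585_059L; // bits sized for a 1% target
        // With the optimal k = 7 the rate is close to the 1% target.
        System.out.printf("k=7: %.4f%n", falsePositiveRate(n, m, 7));
        // With k capped at 3 the rate is roughly twice the target.
        System.out.printf("k=3: %.4f%n", falsePositiveRate(n, m, 3));
    }
}
```

If the target rate matters more than the number of bit operations, the cap can be raised or the bitmap enlarged to compensate.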

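The standalone part of the Bloom filter is now in place; the next step is to integrate it with Redis. Each check or insertion may need several GETBIT or SETBIT operations, which would mean several network round trips, so the operations are executed in Lua scripts: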

    // Returns 0 as soon as any bit is 0, otherwise 1 (all bits set).
    private static final String GET_BIT_LUA = "for i=1,#ARGV\n" +
            "do\n" +
            "    local value =  redis.call(\"GETBIT\", KEYS[1], ARGV[i])\n" +
            "    if value == 0\n" +
            "    then\n" +
            "        return 0\n" +
            "    end\n" +
            "end\n" +
            "return 1";

    // Sets every offset passed in ARGV to 1.
    private static final String SET_BIT_LUA = "for i=1, #ARGV\n" +
            "do\n" +
            "    redis.call(\"SETBIT\",KEYS[1], ARGV[i],1)\n" +
            "end\n";

The insert and lookup operations of the Bloom filter are as follows:

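    // Note: IRedisHelper, routeKey and key.getRawKey() come from the author's project and are not shown here.
    /**
     * Adds a value to the given Bloom filter.
     */
    public static <T> void addByBloomFilter(IRedisHelper redisHelper, BloomFilterHelper<T> bloomFilterHelper, Object key, T value) {
        Preconditions.checkArgument(bloomFilterHelper != null, "bloomFilterHelper must not be null");
        List<Long> offsetList = bloomFilterHelper.murmurHashOffset(value);
        if (CollectionUtils.isEmpty(offsetList)) {
            return;
        }
        redisHelper.eval(routeKey, SET_BIT_LUA, Lists.newArrayList(key.getRawKey()), offsetList);
    }

    /**
     * Checks whether a value exists according to the given Bloom filter.
     */
    public static <T> boolean includeByBloomFilter(IRedisHelper redisHelper, BloomFilterHelper<T> bloomFilterHelper, Object key, T value) {
        Preconditions.checkArgument(bloomFilterHelper != null, "bloomFilterHelper must not be null");
        List<Long> offsetList = bloomFilterHelper.murmurHashOffset(value);
        if (CollectionUtils.isEmpty(offsetList)) {
            return false;
        }
        // This call is missing from the original listing; reconstructed by analogy with addByBloomFilter.
        Object eval = redisHelper.eval(routeKey, GET_BIT_LUA, Lists.newArrayList(key.getRawKey()), offsetList);
        String result = String.valueOf(eval);
        return "1".equalsIgnoreCase(result);
    }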

Redis bitmaps have one issue to be aware of: memory allocation on first use.
The official documentation says:

When setting the last possible bit (offset equal to 2^32 -1) and the string value stored at key does not yet hold a string value, or holds a small string value, Redis needs to allocate all intermediate memory which can block the server for some time. On a 2010 MacBook Pro, setting bit number 2^32 -1 (512MB allocation) takes ~300ms, setting bit number 2^30 -1 (128MB allocation) takes ~80ms, setting bit number 2^28 -1 (32MB allocation) takes ~30ms and setting bit number 2^26 -1 (8MB allocation) takes ~8ms.

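With a bitmap of 2^32 bits, allocating the memory may take about 300ms; 2^30 bits takes about 80ms, 2^28 about 30ms, and 2^26 only about 8ms. If the project is sensitive to performance and latency, how the bitmap is allocated needs careful consideration.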
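
To make the numbers concrete: Redis stores a bitmap as a string, so setting the bit at `offset` requires the value to be at least `offset / 8 + 1` bytes long. The sketch below (hypothetical class `BitmapMemory`) reproduces the allocation sizes quoted above for a given highest offset. It only does the arithmetic and does not talk to Redis.

```java
/** Hypothetical helper: bytes Redis must allocate so that a given bit offset exists. */
final class BitmapMemory {
    /** A bitmap whose highest bit is at `offset` occupies offset / 8 + 1 bytes. */
    static long bytesForOffset(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be non-negative");
        }
        return offset / 8 + 1;
    }

    public static void main(String[] args) {
        int[] exponents = {32, 30, 28, 26};
        for (int e : exponents) {
            long lastOffset = (1L << e) - 1;
            long mb = bytesForOffset(lastOffset) / (1024 * 1024);
            System.out.println("2^" + e + " - 1 -> " + mb + " MB");
        }
        // Prints 512, 128, 32 and 8 MB, matching the figures in the Redis documentation.
    }
}
```

Running the same calculation on the bit size computed for a planned filter gives a quick estimate of how large the first allocation will be.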

References

1. https://zhuanlan.zhihu.com/p/140545941
2. https://redis.io/commands/setbit