BloomFilter:大量数据快速去重

布隆过滤器 用于大量数据快速去重

先说结果

一共有4348524条数据, 不重复的有55888
数据库使用 select count(DISTINCT(address)) 花了 13.644s
从数据库读取数据,再用bloom_filter处理,花了 1m26.710s, 统计出不重复的有55888条,
准确率一致
4348524条数据从文本读取,去重时间达到 0m12.150s

目前400w的数量级 bloom filter 跟数据库的DISTINCT 处理时间基本一样

更新

总数据 13766256, 去重后的数据 363202,
数据库 select count(DISTINCT(address)) 花了 42.433s,
bloom_filter 处理花了 0m50.983s
准确率一致

python code

import mmh3
from bitarray import bitarray

BIT_SIZE = 1 << 30

class BloomFilter:

    def __init__(self):
        # Initialize bloom filter, set size and all bits to 0
        bit_array = bitarray(BIT_SIZE)
        bit_array.setall(0)

        self.bit_array = bit_array

    def add(self, val):
        point_list = self.get_postions(val)

        for b in point_list:
            self.bit_array[b] = 1

    def get_postions(self, val):
        # Get points positions in bit vector.
        # 提供不同的hash种子得到多个hash函数, seed最好为质数

        point1 = mmh3.hash(val, 5)  % BIT_SIZE
        point2 = mmh3.hash(val, 7)  % BIT_SIZE
        point3 = mmh3.hash(val, 11) % BIT_SIZE
        point4 = mmh3.hash(val, 13) % BIT_SIZE
        point7 = mmh3.hash(val, 19) % BIT_SIZE
        point5 = mmh3.hash(val, 23) % BIT_SIZE
        point6 = mmh3.hash(val, 31) % BIT_SIZE

        return [point1, point2, point3, point4, point5, point6]

    def is_contains(self, val):
        point_list = self.get_postions(val)

        result = True
        for b in point_list:
            result = result and self.bit_array[b]

        return result


if __name__ == '__main__':

    bf = BloomFilter()

    # 第一次会显示 not exists

    if bf.is_contains('zqw'):
        print('exists')
    else:
        print('not exists')
        bf.add('zqw')

    if bf.is_contains('shooter'):
        print('exists!')
    else:
        bf.add('shooter')

    if bf.is_contains('zqw'):
        print('exists!')
    else:
        bf.add('zqw')

Hash算法

计算散列值使用了mmh3算法,因为 mmh3 是非加密哈希算法,计算速度非常快。据说xxHash计算速度更快

Note

Bloom Filter 可以知道某个key “可能”存在于集合中,
也能确定整个key肯定不存在于集合中,但无法算集合中总共有多少个不同的key。


参考:
https://github.com/luw2007/bloomfilter/blob/master/doc/bloomfilter_principle.md

http://www.zhiwenli.com/wordpress/?p=752
https://llimllib.github.io/bloomfilter-tutorial/
https://segmentfault.com/a/1190000010990136
http://www.alloyteam.com/2017/05/hash-functions-introduction/

https://coolshell.cn/articles/17225.html
https://yq.aliyun.com/articles/136154
https://blog.csdn.net/jiaomeng/article/details/1495500
https://github.com/LiuXingMing/Scrapy_Redis_Bloomfilter
https://china.googleblog.com/2007/07/bloom-filter_7469.html
https://www.cnblogs.com/cpselvis/p/6265825.html
http://beginman.cn/redis/2015/11/03/redis-bit/
https://en.wikipedia.org/wiki/MurmurHash
https://github.com/jaybaird/python-bloomfilter/
https://github.com/andymccurdy/redis-py
海量大数据处理单机方案

你可能感兴趣的:(BloomFilter:大量数据快速去重)