场景
- 爬虫时判断某个URL是否已经被爬取过
- 黑名单过滤
- 防止缓存穿透
- ...
实现原理
定义一个长度为m的bit型数组flag[] (用来添加元素以及判断元素是否存在,因为Integer的最大值为2147483647,所以m取该值即可)
定义n个不同的hash函数(在添加元素时,需要设置flag[]哪些位为1; 在判断元素是否存在时,需要取flag[]哪些位来判断)
添加某个元素时,通过n个hash函数算出该元素的n个hash值(整型值),把flag[]对应的位置1
判断某个元素是否存在时,通过n个hash函数算出该元素的n个hash值(整型值),在flag[]取出对应的值,只要有一个不为1 ,即可判断为不存在.否则就任务元素存在
布隆过滤器优缺点
优点: 大大节省空间
场景: 在10亿数据中判断某个数据是否存在
如果使用HashSet/HashMap来实现的话
查找的时间复杂度是O(1),但是我们来算一下存储空间,Hash值为Integer类型,占四个字节,那10亿条数据占用的空间就是:10亿*4/1024/1024/1024约等于3.7G......这个实现方案很明显不现实
如果使用布隆过滤器实现
占用的空间大约为2147483647/8/1024/1024=256M
缺点: 误差
由上面的分析可知, hash函数是存在hash冲突的, 所以布隆过滤器是会有误判的情况.
表现为:
如果某条记录被判断为不存在,则该记录必然不存在
如果某条记录被判断存在,则该记录可能会不存在
代码实现
JDK版
public class BloomFilterDemo {
private static final int insertions = 1000000;
@Test
public void bfTest1(){
//初始化一个存储string数据的布隆过滤器,初始化大小100w,不能设置为0
BloomFilter bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions,0.001);
//初始化一个存储string数据的set,初始化大小100w
Set sets = new HashSet<>(insertions);
//初始化一个存储string数据的set,初始化大小100w
List lists = new ArrayList<>(insertions);
//向三个容器初始化100万个随机并且唯一的字符串---初始化操作
for (int i = 0; i < insertions; i++) {
String uuid = UUID.randomUUID().toString();
bf.put(uuid);
sets.add(uuid);
lists.add(uuid);
}
//布隆过滤器错误判断的次数
int wrong = 0;
//布隆过滤器正确判断的次数
int right = 0;
for (int i = 0; i < 10000; i++) {
//按照一定比例选择bf中肯定存在的字符串
String test = i%100==0?lists.get(i/100):UUID.randomUUID().toString();
if(bf.mightContain(test)){
if(sets.contains(test)){
right ++;
}else{
wrong ++;
}
}
}
//100
System.out.println("=================right====================="+right);
System.out.println("=================wrong====================="+wrong);
}
@Test
public void bfTest2() {
//预计要插入多少数据
int size = 1000000;
//期望的误判率
double fpp = 0.01;
BloomFilter bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);
//插入数据
for (int i = 0; i < 1000000; i++) {
bloomFilter.put(i);
}
int count = 0;
for (int i = 0; i < 1000000; i++) {
if (!bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "误判了");
}
}
for (int i = 1000000; i < 2000000; i++) {
if (bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "误判了");
}
}
System.out.println("总共的误判数:" + count);
}
}
redis版
/**
*
* 原理是和 JDK 自带的BloomFilter类似的,我们看add方法,它先 Redis 缓存中是否有指定 key(如:orderBloomFilter) 的值,如果没有,则在 offset = 0 处,添加一个值为false,即为 0;
* 然后调用createHashes(byte[] data, int hashes)方法,根据字节数组的内容生成digest,并将结果分割成 4 字节的整数并存储在数组中,数组中的整数可以理解为每次hash所得的hashcode的值。
* 最后,遍历hashcode数组,将hashcode%sizeOfBloomFilter取模所得下标所对应的值设为true,即为 1。
*
* contains方法,同样调用createHashes(byte[] data, int hashes)得到字节数组内容所对应的hashcode数组。
* 遍历hashcode数组,如果有一个hashcode所对应的下标的值不为1,则该数据不存在。反之,只有所有的hashcode所对应的下标的值都为1,才能说明该数据已经存在。
*
* @author: ZENGQINGXUN178
* @Date: 2019-8-27 10:30
* @Description:
*/
@Component
public class BloomFilterService {
@Autowired
private JedisCluster jedisCluster;
/**
* total length of the Bloom filter
*/
private static int sizeOfBloomFilter;
/**
* number of hash functions
*/
private static int numberOfHashFunctions;
/**
* 误差率
*/
private static final double falsePositiveProbability = 0.01;
/**
* 预计容量
*/
private static final int expectedNumberOfElements = 10000;
private static String bloom_name = "bloom_name_1";
/**
* encoding used for storing hash values as strings
*/
private final Charset charset = Charset.forName("UTF-8");
/**
* MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed
*/
private static final String hashName = "MD5";
private static final MessageDigest digestFunction;
/**
* The digest method is reused between instances
*/
static {
MessageDigest messageDigest;
try {
messageDigest = java.security.MessageDigest.getInstance(hashName);
} catch (NoSuchAlgorithmException e) {
messageDigest = null;
}
digestFunction = messageDigest;
// numberOfHashFunctions = ceil(-ln(falsePositiveProbability)/ln2)
numberOfHashFunctions = (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)));
// sizeOfBloomFilter = ceil(numberOfHashFunctions*expectedNumberOfElements/ln2)
sizeOfBloomFilter = (int) Math.ceil(numberOfHashFunctions * expectedNumberOfElements / Math.log(2));
}
@PostConstruct
public void init(){
// 1. 获取数据
List datas = new ArrayList<>();
for (int i = 0; i < expectedNumberOfElements; i++) {
datas.add(i + "");
}
if (jedisCluster.get(bloom_name) == null) {
jedisCluster.setbit(bloom_name, 0, false);
}
// 2. 把数据放进bloomFilter
JedisClusterPipeline pipelined = JedisClusterPipeline.pipelined(jedisCluster);
try {
datas.forEach(e -> add(e.getBytes(charset),pipelined));
}finally {
pipelined.syncAndReturnAll();
}
}
/**
* Adds an object to the Bloom filter. The output from the object's
* toString() method is used as input to the hash functions.
*
* @param element is an element to register in the Bloom filter.
*/
public void add(E element) {
add(element.toString().getBytes(charset));
}
private void add(byte[] bytes) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
jedisCluster.setbit(bloom_name, Math.abs(hash % sizeOfBloomFilter), true);
}
}
/**
* Adds an array of bytes to the Bloom filter.
*
* @param bytes array of bytes to add to the Bloom filter.
*/
private void add(byte[] bytes,JedisClusterPipeline pipelined) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
pipelined.setbit(bloom_name, Math.abs(hash % sizeOfBloomFilter), true);
}
}
/**
* Adds all elements from a Collection to the Bloom filter.
*
* @param c Collection of elements.
*/
public void addAll(Collection extends E> c) {
for (E element : c) {
add(element);
}
}
/**
* Returns true if the element could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param element element to check.
* @return true if the element could have been inserted into the Bloom filter.
*/
public boolean contains(E element) {
return contains(element.toString().getBytes(charset));
}
public int failCount(List elements) {
JedisClusterPipeline pipelined = JedisClusterPipeline.pipelined(jedisCluster);
int count = 0;
try {
for (E e : elements) {
if (!contains(e.toString().getBytes(charset), pipelined)) {
count++;
}
}
}finally {
pipelined.close();
}
return count;
}
/**
* Returns true if the array of bytes could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param bytes array of bytes to check.
* @return true if the array could have been inserted into the Bloom filter.
*/
private boolean contains(byte[] bytes) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
if (!jedisCluster.getbit(bloom_name, Math.abs(hash % sizeOfBloomFilter))) {
return false;
}
}
return true;
}
private boolean contains(byte[] bytes,JedisClusterPipeline pipelined) {
int[] hashes = createHashes(bytes, numberOfHashFunctions);
for (int hash : hashes) {
pipelined.getbit(bloom_name, Math.abs(hash % sizeOfBloomFilter));
}
List