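HBase has a Bloom filter feature that can, in some cases, rule out HFiles that do not need to be read and so save IO. But when does the Bloom filter actually take effect? According to the HBase documentation, only get operations use it.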
Quote:
In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number of disk reads for a given Get operation (Bloom filters do not work with Scans) to only the StoreFiles likely to contain the desired Row. The potential performance gain increases with the number of parallel reads.
The code confirms this.
The shouldUseScanner method of StoreFileScanner tests whether a given store file should be read:
public boolean shouldUseScanner(Scan scan, SortedSet<byte[]> columns, long oldestUnexpiredTS) {
  return reader.passesTimerangeFilter(scan, oldestUnexpiredTS)
      && reader.passesKeyRangeFilter(scan) && reader.passesBloomFilter(scan, columns);
}
At the very start of passesBloomFilter, every operation other than a get is passed straight through:
if (!scan.isGetScan()) {
  return true;
}
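So passesBloomFilter only has an effect for gets. For scans it always returns true, and no store file is filtered out by the Bloom filter.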
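To make the early return concrete, here is a minimal sketch of the same idea outside HBase. ToyStoreFile, its two hash functions, and the String row keys are all made up for illustration and are not HBase classes; the sketch also assumes that a "get" is simply a read whose start row equals its stop row.

```java
import java.util.BitSet;

// Toy stand-in for a store file with a row-level Bloom filter (hypothetical, not HBase code).
class ToyStoreFile {
    private static final int BITS = 1024;
    private final BitSet bloom = new BitSet(BITS);

    // Record a row key in the filter by setting two bit positions.
    void addRow(String row) {
        bloom.set(hash1(row));
        bloom.set(hash2(row));
    }

    // May report a false positive, but never a false negative.
    boolean mightContain(String row) {
        return bloom.get(hash1(row)) && bloom.get(hash2(row));
    }

    // Mirrors the early return quoted above: only a single-row read consults the filter.
    boolean passesBloomFilter(String startRow, String stopRow) {
        boolean isGet = startRow.equals(stopRow); // assumption: get == one-row range
        if (!isGet) {
            return true; // a per-row membership test cannot exclude a whole key range
        }
        return mightContain(startRow);
    }

    private static int hash1(String s) {
        return Math.floorMod(s.hashCode(), BITS);
    }

    private static int hash2(String s) {
        return Math.floorMod(s.hashCode() * 31 + 17, BITS);
    }
}
```

With an empty ToyStoreFile, a get for any row is rejected and the file can be skipped, while a range read over the same file always passes and has to open it.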
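[url]http://www.quora.com/How-are-bloom-filters-used-in-HBase[/url]
notes that if the puts for a row are issued together in one concentrated batch, the row ends up in only a few HFiles. If the puts are instead spread evenly across different columns, the row's KeyValues may be scattered over every HFile. In that case every file's filter reports that it may contain the row, and a row-level Bloom filter no longer helps.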
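The effect of the write pattern can be sketched with a small simulation. This is only an illustration under simplifying assumptions: each HFile is modeled as the set of row keys it holds (an idealized row-level Bloom filter with no false positives), and the class name, method names, and row keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RowSpreadSketch {
    // Count the files a get for `row` cannot skip, given an idealized row-level filter.
    static int filesToRead(List<Set<String>> hfiles, String row) {
        int count = 0;
        for (Set<String> rowsInFile : hfiles) {
            if (rowsInFile.contains(row)) {
                count++;
            }
        }
        return count;
    }

    // Batched writes: all columns of a row are put together, so each row lands in one flush.
    static List<Set<String>> batched(int rows, int flushes) {
        List<Set<String>> hfiles = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            hfiles.add(new HashSet<>());
        }
        for (int r = 0; r < rows; r++) {
            hfiles.get(r % flushes).add("row-" + r);
        }
        return hfiles;
    }

    // Spread-out writes: every flush receives one column of every row.
    static List<Set<String>> spread(int rows, int flushes) {
        List<Set<String>> hfiles = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            Set<String> file = new HashSet<>();
            for (int r = 0; r < rows; r++) {
                file.add("row-" + r);
            }
            hfiles.add(file);
        }
        return hfiles;
    }

    public static void main(String[] args) {
        System.out.println(filesToRead(batched(100, 10), "row-42")); // prints 1
        System.out.println(filesToRead(spread(100, 10), "row-42"));  // prints 10
    }
}
```

In the batched case the filter lets a get skip 9 of the 10 files; in the spread-out case it cannot skip any, which is the situation the Quora answer describes.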