bitcoin bloom filter

20180901
Ran Xiaolong (冉小龙, xiaolong.ran)
Acknowledgments: Qi Wei (齐巍)

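Introduction

Question: in a big-data setting, how do we tell whether a given item is present in a massive pool of data?

Two key points:

Lookup
A massive data pool

Approach: to build such a data pool, the storage options come down to the following

Store it in one huge array and look items up by index.
Store it in a linked list or a Map.
Store it in a tree structure.
Store it in a hash table and fetch v directly by k.
At worst, drop the items in with no organization at all and scan everything on every lookup.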

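Why a Bloom filter

In general, a set is stored in a hash table. A hash table is fast and exact, but it costs a lot of storage. With a small set this hardly matters; with a huge set the poor space efficiency becomes obvious, as a quick calculation from the number and size of the elements shows. A distributed K-V system (such as Redis) can carry the load, but the cost is still high. A Bloom filter solves the same problem with only 1/8 to 1/4 of the space of a hash table, at the price of a certain false positive rate. The memory it needs can be computed precisely from a formula: see Bloom Filter Calculator.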

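Overview

Proposed by Burton Howard Bloom in 1970.
A probabilistic data structure.
It answers either "possibly present" or "definitely not present".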

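How it works

[Figure: image.png]

Explanation:

Let the bit array have length m and let there be k hash functions; here the set M has 3 elements {x, y, z} and k = 3.
Initialize the bit array by setting every bit to 0.
Map each element of M through the k = 3 hash functions in turn, and set the bit at each resulting position to 1.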

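Why do we need multiple hash functions?

To reduce the false positive rate.

With more hash functions, a query for an element that is not in the set has more chances of hitting a 0 bit; on the other hand, with fewer hash functions, more bits in the array stay 0. So there is a trade-off. For a bit array of m bits holding n elements, the standard approximation of the false positive rate is (1 - e^(-kn/m))^k, which is minimized at k = (m/n) ln 2.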
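The trade-off can be made concrete with the standard approximation for a Bloom filter of m bits, n inserted elements and k hash functions. The sketch below is illustrative only; the names `falsePositiveRate` and `optimalHashCount` are made up for this example and do not come from any library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Approximate false positive rate: (1 - e^(-k*n/m))^k.
double falsePositiveRate(std::size_t m, std::size_t n, unsigned k) {
    double exponent = -static_cast<double>(k) * n / m;
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}

// The k that minimizes the rate above: (m / n) * ln 2, rounded, at least 1.
unsigned optimalHashCount(std::size_t m, std::size_t n) {
    double k = static_cast<double>(m) / n * std::log(2.0);
    return k < 1.0 ? 1u : static_cast<unsigned>(std::lround(k));
}
```

For m = 9586 bits and n = 1000 elements, `optimalHashCount` gives 7 and the estimated false positive rate is about 1%. Both k = 3 and k = 15 do worse for the same m and n.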

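Adding

Feed the element to be added to the k hash functions.
This yields k positions in the bit array.
Set those k positions to 1.

Querying

Feed the element being queried to the k hash functions.
This yields k positions in the bit array (one position per hash).
If any of the k positions is 0, the element is definitely not in the set.
If all k positions are 1, the element may be in the set (subject to some false positive rate).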
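The add and query steps can be sketched in a few lines. This is a minimal illustration, not a production filter: the class name `SimpleBloomFilter` is hypothetical, and the k hash functions are simulated with a seeded FNV-1a hash chosen only because it is short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class SimpleBloomFilter {
public:
    SimpleBloomFilter(std::size_t mBits, unsigned kHashes)
        : bits_(mBits, false), k_(kHashes) {}

    // Add: set the k positions chosen by the k hash functions.
    void add(const std::string &item) {
        for (unsigned i = 0; i < k_; ++i) bits_[position(i, item)] = true;
    }

    // Query: any 0 bit means "definitely absent";
    // all 1 bits means "possibly present".
    bool possiblyContains(const std::string &item) const {
        for (unsigned i = 0; i < k_; ++i)
            if (!bits_[position(i, item)]) return false;
        return true;
    }

private:
    // Stand-in hash: FNV-1a seeded with the hash index i
    // (an assumption of this sketch, not a recommendation).
    std::size_t position(unsigned i, const std::string &item) const {
        std::uint32_t h = 2166136261u ^ (i * 0x9E3779B9u);
        for (unsigned char c : item) {
            h ^= c;
            h *= 16777619u;
        }
        return h % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned k_;
};
```

Anything that was added is always reported as possibly present, and an empty filter reports every item as absent. Items that were never added are usually, but not always, reported as absent.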

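What about deletion?

A plain Bloom filter does not allow deletion, because deleting one element can make another element that is still in the set look absent. When two elements hash to the same position, the second insertion finds the bit already 1 and changes nothing, so the bit keeps no record of how many elements rely on it. Clearing that bit to delete one element therefore also erases the mark for the other.

Optimization

Widen each slot of the array from 1 bit to a small counter that records how many times that position has been set. Increment the counters on insertion and decrement them on deletion. This variant is known as a counting Bloom filter.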
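A counting variant along these lines might look as follows. It is a sketch under assumptions: the class name `CountingBloomFilter` is hypothetical, each slot is an 8-bit counter, and the hash is the same seeded FNV-1a stand-in used purely for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class CountingBloomFilter {
public:
    CountingBloomFilter(std::size_t slots, unsigned kHashes)
        : counts_(slots, 0), k_(kHashes) {}

    void add(const std::string &item) {
        for (unsigned i = 0; i < k_; ++i) {
            std::uint8_t &c = counts_[position(i, item)];
            if (c < 255) ++c;  // saturate instead of wrapping around
        }
    }

    // Only call this for items that were actually added; removing
    // anything else corrupts the counters of other items.
    void remove(const std::string &item) {
        for (unsigned i = 0; i < k_; ++i) {
            std::uint8_t &c = counts_[position(i, item)];
            if (c > 0) --c;
        }
    }

    bool possiblyContains(const std::string &item) const {
        for (unsigned i = 0; i < k_; ++i)
            if (counts_[position(i, item)] == 0) return false;
        return true;
    }

private:
    // Stand-in hash: seeded FNV-1a (an assumption of this sketch).
    std::size_t position(unsigned i, const std::string &item) const {
        std::uint32_t h = 2166136261u ^ (i * 0x9E3779B9u);
        for (unsigned char c : item) {
            h ^= c;
            h *= 16777619u;
        }
        return h % counts_.size();
    }

    std::vector<std::uint8_t> counts_;
    unsigned k_;
};
```

The price is memory: with 8-bit counters the structure is eight times larger than the plain bit array, which is why the width of the counter is usually kept small.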

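Why false positives occur

Look at the figure above and suppose some element hashes to the three positions 4, 5 and 6. All three bits are 1, but they were clearly set by different elements. So an element that is not in the set can still find all of its bits set to 1, and this is where false positives come from.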
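A false positive can be reproduced by hand on a tiny filter. In the sketch below everything is an assumption made for illustration: a 10-bit array, integer elements, and three toy hash functions picked so that the arithmetic is easy to follow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBits = 10;

// Three toy hash functions over a 10-bit array. They are chosen by hand
// so the collision is easy to trace; they are not meant for real use.
std::array<std::size_t, 3> positions(unsigned v) {
    return {v % kBits, (3 * v + 1) % kBits, (7 * v + 2) % kBits};
}

struct TinyFilter {
    std::array<bool, kBits> bits{};  // all bits start at 0

    void add(unsigned v) {
        for (std::size_t p : positions(v)) bits[p] = true;
    }
    bool possiblyContains(unsigned v) const {
        for (std::size_t p : positions(v))
            if (!bits[p]) return false;
        return true;
    }
};

// Builds the filter for the set {1, 2, 3}:
//   1 -> bits 1, 4, 9
//   2 -> bits 2, 7, 6
//   3 -> bits 3, 0, 3
TinyFilter makeExample() {
    TinyFilter f;
    f.add(1);
    f.add(2);
    f.add(3);
    return f;
}
```

Querying 4, which was never added, checks bits 4, 3 and 0. Bit 4 was set by element 1 and bits 3 and 0 by element 3, so `possiblyContains(4)` returns true: a false positive. Querying 5 checks bit 5 first, which is still 0, so 5 is correctly reported as absent.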

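Use cases

LevelDB: speeds up lookups that miss. Before loading data from disk, the Bloom filter is checked first; if it says the key is absent, the lookup returns immediately. This reduces disk access and improves response time.
Spam filtering.
Web crawlers.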

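Summary

It answers a yes-or-no question. You need to decide whether an element belongs to a set, and nothing more. If you want the value associated with the element or any other payload, a Bloom filter is not for you; you need a hash table.
False positives must be acceptable, that is, a false positive must not be fatal. In a search engine crawler, for example, if a URL is not in the set but the Bloom filter reports it as present and filters it out, the worst outcome is that the URL is not crawled, which is no great loss.
Space matters. As a probabilistic data structure, a Bloom filter does not store the original data (the URLs, say), which is the fundamental reason it is space efficient.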

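In Bitcoin, Bloom filters were introduced by BIP-0037. They are mainly for SPV nodes, which use them to filter the transactions that peers send to them.

The Bitcoin network has two main types of node

Full node: stores all block data and transactions
SPV node: stores only the block headers (Block Header)

Here the Bloom filter does what its name says: it screens out the UTXOs that do not belong to the wallet. With it, a wallet can protect the user's privacy while also saving storage and bandwidth.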

Code walkthrough:

A Bloom filter may use at most 50 hash functions and may be at most 36,000 bytes (about 36 KB)

static const unsigned int MAX_BLOOM_FILTER_SIZE = 36000; // bytes
static const unsigned int MAX_HASH_FUNCS = 50;
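As a rough illustration of how these two limits interact with the usual sizing formulas, the sketch below derives a filter's byte size and hash count from an expected element count and a target false positive rate, then clamps both. It is modeled on the general approach of sizing from n and p, but the helper names `filterSizeBytes` and `hashFuncCount` are hypothetical, and the code should be read as an approximation rather than the client's exact implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

static const unsigned int MAX_BLOOM_FILTER_SIZE = 36000; // bytes
static const unsigned int MAX_HASH_FUNCS = 50;

static const double LN2 = 0.6931471805599453;
static const double LN2SQUARED = 0.4804530139182014;

// Bytes needed for nElements at false positive rate fpRate:
// bits = -n * ln(p) / (ln 2)^2, capped at MAX_BLOOM_FILTER_SIZE bytes.
// Assumes nElements > 0 and 0 < fpRate < 1.
unsigned int filterSizeBytes(unsigned int nElements, double fpRate) {
    double bits = -1.0 / LN2SQUARED * nElements * std::log(fpRate);
    // Clamp in floating point so the integer conversion cannot overflow.
    bits = std::min(bits, MAX_BLOOM_FILTER_SIZE * 8.0);
    return static_cast<unsigned int>(bits) / 8;
}

// Hash count k = (bits / n) * ln 2, capped at MAX_HASH_FUNCS.
// Assumes nElements > 0.
unsigned int hashFuncCount(unsigned int sizeBytes, unsigned int nElements) {
    unsigned int k =
        static_cast<unsigned int>(sizeBytes * 8 / nElements * LN2);
    return std::min(k, MAX_HASH_FUNCS);
}
```

For 3 elements at a 1% target rate this gives a 3-byte filter with 5 hash functions. Very large element counts simply hit the caps: 36,000 bytes and 50 hash functions.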

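Basic structure:

    std::vector<uint8_t> vData; // the filter's bit array, packed 8 bits per byte
    bool isFull;
    bool isEmpty;
    unsigned int nHashFuncs; // number of hash functions used by this filter; the maximum allowed value is 50
    unsigned int nTweak; // random value added to the seed of the hash functions used by the filter
    uint8_t nFlags; // set of flags that control how matched items are added to the filter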

nFlags:

enum bloomflags {
    BLOOM_UPDATE_NONE = 0, // do not adjust the filter when a match is found
    BLOOM_UPDATE_ALL = 1, // if the filter matches any data element in a scriptPubKey, serialize the corresponding outpoint and insert it into the filter
    // Only adds outpoints to the filter if the output is a
    // pay-to-pubkey/pay-to-multisig script
    BLOOM_UPDATE_P2PUBKEY_ONLY = 2, // insert the outpoint only if a data element in the scriptPubKey matches and the script has the standard "pay to pubkey" or "pay to multisig" form
    BLOOM_UPDATE_MASK = 3,
};

Serialization:

template <typename Stream, typename Operation>
    inline void SerializationOp(Stream &s, Operation ser_action) {
        READWRITE(vData); // the filter's bit array
        READWRITE(nHashFuncs); // how many hash rounds to perform
        READWRITE(nTweak);
        READWRITE(nFlags);
    }
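On the wire, these four fields follow one another in the same order. The sketch below assumes the usual Bitcoin encoding: a CompactSize length prefix for the byte vector, little-endian 32-bit integers, and one byte of flags. The function names are hypothetical and only the two CompactSize forms a filter of at most 36,000 bytes can need are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value in little-endian byte order.
void writeLE32(std::vector<std::uint8_t> &out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>(v >> shift));
}

// CompactSize length prefix: one byte below 253, otherwise 0xfd followed
// by a 16-bit little-endian length (enough for lengths up to 65535).
void writeCompactSize(std::vector<std::uint8_t> &out, std::size_t n) {
    if (n < 253) {
        out.push_back(static_cast<std::uint8_t>(n));
    } else {
        out.push_back(0xfd);
        out.push_back(static_cast<std::uint8_t>(n & 0xff));
        out.push_back(static_cast<std::uint8_t>((n >> 8) & 0xff));
    }
}

// Serialize the filter fields in the order vData, nHashFuncs, nTweak, nFlags.
std::vector<std::uint8_t> serializeFilter(
    const std::vector<std::uint8_t> &vData, std::uint32_t nHashFuncs,
    std::uint32_t nTweak, std::uint8_t nFlags) {
    std::vector<std::uint8_t> out;
    writeCompactSize(out, vData.size());
    out.insert(out.end(), vData.begin(), vData.end());
    writeLE32(out, nHashFuncs);
    writeLE32(out, nTweak);
    out.push_back(nFlags);
    return out;
}
```

A 3-byte filter with 5 hash functions, tweak 0 and flags 1 therefore serializes to 13 bytes: one length byte, three data bytes, two 4-byte integers and one flag byte.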

Deriving the different hash functions:

inline unsigned int
CBloomFilter::Hash(unsigned int nHashNum,
                   const std::vector<uint8_t> &vDataToHash) const {
    // 0xFBA4C795 chosen as it guarantees a reasonable bit difference between
    // nHashNum values.
    return MurmurHash3(nHashNum * 0xFBA4C795 + nTweak, vDataToHash) %
           (vData.size() * 8);
}
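To make the seeding trick tangible, here is a self-contained sketch of the same idea: one seeded hash function, called with a different seed for each hash number. The `murmur3` function follows the publicly documented MurmurHash3 x86_32 algorithm; `bloomHash` and its parameters are hypothetical names for this example, and neither function is copied from the client.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline std::uint32_t rotl32(std::uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

// MurmurHash3 (x86_32 variant) over a byte vector with a 32-bit seed.
std::uint32_t murmur3(std::uint32_t seed,
                      const std::vector<std::uint8_t> &data) {
    const std::uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
    std::uint32_t h = seed;
    const std::size_t nblocks = data.size() / 4;

    for (std::size_t i = 0; i < nblocks; ++i) {  // body: 4 bytes at a time
        const std::uint8_t *p = data.data() + i * 4;
        std::uint32_t k = std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
                          (std::uint32_t(p[2]) << 16) |
                          (std::uint32_t(p[3]) << 24);
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64;
    }

    const std::uint8_t *tail = data.data() + nblocks * 4;  // 0 to 3 bytes left
    const std::size_t rem = data.size() & 3;
    std::uint32_t k = 0;
    if (rem == 3) k ^= std::uint32_t(tail[2]) << 16;
    if (rem >= 2) k ^= std::uint32_t(tail[1]) << 8;
    if (rem >= 1) {
        k ^= tail[0];
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;
    }

    h ^= static_cast<std::uint32_t>(data.size());  // finalization mix
    h ^= h >> 16; h *= 0x85ebca6b;
    h ^= h >> 13; h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

// Bit position for hash number hashNum: the seed changes with hashNum,
// so one function plays the role of nHashFuncs different ones.
std::uint32_t bloomHash(std::uint32_t hashNum, std::uint32_t tweak,
                        const std::vector<std::uint8_t> &key,
                        std::size_t filterBytes) {
    std::uint32_t seed = hashNum * 0xFBA4C795u + tweak;
    return static_cast<std::uint32_t>(murmur3(seed, key) % (filterBytes * 8));
}
```

The result is always a valid bit index, that is, a value in the range 0 to filterBytes * 8 - 1. Because nTweak enters the seed, two filters holding the same elements but using different tweaks set different bits.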

Insertion: following the Bloom filter algorithm, hash the new key nHashFuncs times and set the corresponding bits:

void CBloomFilter::insert(const std::vector<uint8_t> &vKey) {
    if (isFull) return;
    // n different hashes do not require n different hash functions;
    // changing the hash seed according to the index i is enough.
    for (unsigned int i = 0; i < nHashFuncs; i++) {
        unsigned int nIndex = Hash(i, vKey);
        // Sets bit nIndex of vData.
        // Each hash of the key yields the index of one bit in the bit array.
        // vData holds bytes of 8 bits each, so nIndex >> 3 selects the byte
        // and 1 << (7 & nIndex) selects the bit within that byte.
        vData[nIndex >> 3] |= (1 << (7 & nIndex));
    }
    isEmpty = false;
}
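The byte and bit arithmetic is easier to see in isolation. The two helpers below are hypothetical and exist only to illustrate the indexing on a plain byte vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Set bit nIndex in a byte-packed bit array:
// byte index = nIndex / 8, bit within that byte = nIndex % 8.
void setBit(std::vector<std::uint8_t> &bytes, unsigned int nIndex) {
    bytes[nIndex >> 3] |= (1 << (7 & nIndex));
}

// Test bit nIndex using the same addressing.
bool testBit(const std::vector<std::uint8_t> &bytes, unsigned int nIndex) {
    return (bytes[nIndex >> 3] & (1 << (7 & nIndex))) != 0;
}
```

For example, bit 10 lives in byte 1 (10 >> 3) at bit position 2 (10 & 7), so setting it on a zeroed array turns byte 1 into 0x04.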

How the filter matches a transaction

bool CBloomFilter::IsRelevantAndUpdate(const CTransaction &tx) {
    bool fFound = false;
    // Match if the filter contains the hash of tx for finding tx when they
    // appear in a block
    if (isFull) return true;
    if (isEmpty) return false;
    // Get the transaction id and check whether it is in the Bloom filter
    const uint256 &txid = tx.GetId();
    if (contains(txid)) fFound = true;

    for (unsigned int i = 0; i < tx.vout.size(); i++) {
        const CTxOut &txout = tx.vout[i];
        // Match if the filter contains any arbitrary script data element in any
        // scriptPubKey in tx. If this matches, also add the specific output
        // that was matched. This means clients don't have to update the filter
        // themselves when a new relevant tx is discovered in order to find
        // spending transactions, which avoids round-tripping and race
        // conditions.
        CScript::const_iterator pc = txout.scriptPubKey.begin();
        std::vector<uint8_t> data;
        while (pc < txout.scriptPubKey.end()) {
            opcodetype opcode;
            // Extract the data elements of the locking script so they can be checked against the Bloom filter
            if (!txout.scriptPubKey.GetOp(pc, opcode, data)) break;
            // Check whether this element is in the Bloom filter
            if (data.size() != 0 && contains(data)) {
                fFound = true;
                if ((nFlags & BLOOM_UPDATE_MASK) == BLOOM_UPDATE_ALL)
                    insert(COutPoint(txid, i));
                else if ((nFlags & BLOOM_UPDATE_MASK) ==
                         BLOOM_UPDATE_P2PUBKEY_ONLY) {
                    txnouttype type;
                    std::vector<std::vector<uint8_t>> vSolutions;
                    if (Solver(txout.scriptPubKey, type, vSolutions) &&
                        (type == TX_PUBKEY || type == TX_MULTISIG))
                        insert(COutPoint(txid, i));
                }
                break;
            }
        }
    }

    if (fFound) return true;

    for (const CTxIn &txin : tx.vin) {
        // Match if the filter contains an outpoint tx spends
        // i.e. check whether txin.prevout is in the Bloom filter
        if (contains(txin.prevout)) return true;

        // Match if the filter contains any arbitrary script data element in any
        // scriptSig in tx
        CScript::const_iterator pc = txin.scriptSig.begin();
        std::vector<uint8_t> data;
        while (pc < txin.scriptSig.end()) {
            opcodetype opcode;
            // Extract the data elements of the unlocking script
            if (!txin.scriptSig.GetOp(pc, opcode, data)) break;
            // Check whether this element is in the Bloom filter
            if (data.size() != 0 && contains(data)) return true;
        }
    }

    return false;
}
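The calls to `insert(COutPoint(txid, i))` add a whole outpoint to the filter as a single byte string. The sketch below shows what that key plausibly looks like, assuming the outpoint is serialized in network format: the 32-byte transaction id in its internal byte order, followed by the output index as a little-endian 32-bit integer. `outpointKey` is a hypothetical helper, not a function from the client.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// 36-byte filter key for an outpoint: 32 bytes of transaction id followed
// by the output index as a little-endian 32-bit integer.
std::vector<std::uint8_t> outpointKey(
    const std::array<std::uint8_t, 32> &txid, std::uint32_t index) {
    std::vector<std::uint8_t> key(txid.begin(), txid.end());
    for (int shift = 0; shift < 32; shift += 8)
        key.push_back(static_cast<std::uint8_t>(index >> shift));
    return key;
}
```

Once this key is in the filter, a later transaction that spends the output matches through the `contains(txin.prevout)` check, without the client having to send a new filter.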

Returning the parts of a requested block (identified by its block hash) that match the Bloom filter

else if (inv.type == MSG_FILTERED_BLOCK) {
                        bool sendMerkleBlock = false;
                        CMerkleBlock merkleBlock;
                        {
                            LOCK(pfrom->cs_filter);
                            if (pfrom->pfilter) {
                                sendMerkleBlock = true;
                                merkleBlock =
                                    CMerkleBlock(block, *pfrom->pfilter);
                            }
                        }
                        if (sendMerkleBlock) {
                            // Send the merkleBlock
                            connman.PushMessage(
                                pfrom, msgMaker.Make(NetMsgType::MERKLEBLOCK,
                                                     merkleBlock));
                            // CMerkleBlock just contains hashes, so also push
                            // any transactions in the block the client did not
                            // see. This avoids hurting performance by
                            // pointlessly requiring a round-trip. Note that
                            // there is currently no way for a node to request
                            // any single transactions we didn't send here -
                            // they must either disconnect and retry or request
                            // the full block. Thus, the protocol spec specified
                            // allows for us to provide duplicate txn here,
                            // however we MUST always provide at least what the
                            // remote peer needs.
                            typedef std::pair<unsigned int, uint256> PairType;
                            for (PairType &pair : merkleBlock.vMatchedTxn) {
                                // Send each transaction that matched the filter
                                connman.PushMessage(
                                    pfrom,
                                    msgMaker.Make(NetMsgType::TX,
                                                  *block.vtx[pair.first]));
                            }
                        }

ProcessMessage:

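// The local node does not offer the Bloom filter service (NODE_BLOOM), yet the peer sent a Bloom filter command.
    if (!(pfrom->GetLocalServices() & NODE_BLOOM) &&
        (strCommand == NetMsgType::FILTERLOAD ||
         strCommand == NetMsgType::FILTERADD)) {
        // A peer at or above NO_BLOOM_VERSION should know better: penalize it and stop processing.
        if (pfrom->nVersion >= NO_BLOOM_VERSION) {
            LOCK(cs_main);
            Misbehaving(pfrom, 100, "no-bloom-version");
            return false;
        } else {
            // An older peer may simply not know about the service bit: just disconnect it.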
            pfrom->fDisconnect = true;
            return false;
        }
    }

load filter:

else if (strCommand == NetMsgType::FILTERLOAD) {
        CBloomFilter filter;
        vRecv >> filter;

        if (!filter.IsWithinSizeConstraints()) {
            // There is no excuse for sending a too-large filter
            LOCK(cs_main);
            Misbehaving(pfrom, 100, "oversized-bloom-filter");
        } else {
            LOCK(pfrom->cs_filter);
            delete pfrom->pfilter;
            pfrom->pfilter = new CBloomFilter(filter);
            pfrom->pfilter->UpdateEmptyFull();
            pfrom->fRelayTxes = true;
        }
    }

add filter:

else if (strCommand == NetMsgType::FILTERADD) {
        std::vector<uint8_t> vData;
        vRecv >> vData;

        // Nodes must NEVER send a data item > 520 bytes (the max size for a
        // script data object, and thus, the maximum size any matched object can
        // have) in a filteradd message.
        bool bad = false;
        if (vData.size() > MAX_SCRIPT_ELEMENT_SIZE) {
            bad = true;
        } else {
            LOCK(pfrom->cs_filter);
            if (pfrom->pfilter) {
                pfrom->pfilter->insert(vData);
            } else {
                bad = true;
            }
        }
        if (bad) {
            LOCK(cs_main);
            // The structure of this code doesn't really allow for a good error
            // code. We'll go generic.
            Misbehaving(pfrom, 100, "invalid-filteradd");
        }
    }
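The rules enforced by the filterload and filteradd handlers can be restated as a small state model. Everything in this sketch is a stand-in: `ToyFilter`, `ToyPeer`, `onFilterLoad` and `onFilterAdd` are hypothetical names, and the filter merely records what was added. It also uses `std::unique_ptr` instead of the raw `new`/`delete` seen in the excerpts, purely as a design choice for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

static const std::size_t MAX_SCRIPT_ELEMENT_SIZE = 520; // bytes

// Stand-in for CBloomFilter: it only remembers what was added.
struct ToyFilter {
    std::vector<std::vector<std::uint8_t>> items;
    void insert(const std::vector<std::uint8_t> &v) { items.push_back(v); }
};

struct ToyPeer {
    std::unique_ptr<ToyFilter> filter;  // null until a filterload arrives
    bool relayTxes = false;
};

// filterload: replace whatever filter the peer had before.
void onFilterLoad(ToyPeer &peer) {
    peer.filter.reset(new ToyFilter());
    peer.relayTxes = true;
}

// filteradd: returns false when the message should be penalized, that is,
// when the element is larger than 520 bytes or no filter has been loaded.
bool onFilterAdd(ToyPeer &peer, const std::vector<std::uint8_t> &data) {
    if (data.size() > MAX_SCRIPT_ELEMENT_SIZE) return false;
    if (!peer.filter) return false;
    peer.filter->insert(data);
    return true;
}
```

In this model a filteradd that arrives before any filterload is rejected, an element of exactly 520 bytes is accepted, and one of 521 bytes is rejected.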

clear filter

else if (strCommand == NetMsgType::FILTERCLEAR) {
        LOCK(pfrom->cs_filter);
        if (pfrom->GetLocalServices() & NODE_BLOOM) {
            delete pfrom->pfilter;
            pfrom->pfilter = new CBloomFilter();
        }
        pfrom->fRelayTxes = true;
    }

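How the Bloom filter on SPV nodes protects user privacy:

Suppose there were no Bloom filter. User A runs an SPV node and holds two pubKeys, so A would ask to receive only the transactions related to those two pubKeys. Because traffic on the network is sent in plaintext, anyone monitoring it could easily work out A's balance and other account information. With a Bloom filter the picture changes, because the filter's weakness turns out to protect the user's privacy. The false positive rate is tunable, so A can raise it on purpose and thereby also receive transactions that have nothing to do with A's pubKeys. With real and decoy matches mixed together, an observer with malicious intent can no longer tell which transactions are really A's.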