上一篇SPV钱包里utxos同步提到了bloom filter,这一章节我们将从源码分析角度来个深度解剖
Bloom filter基本原理
An example of a Bloom filter, representing the set {
x
,
y
,
z
}. The colored arrows show the positions in the bit array that each set element is mapped to. The element
w
is not in the set {
x
,
y
,
z
}, because it hashes to one bit-array position containing 0. For this figure,
m
= 18 and
k
= 3.
下面的bitarray是一个m位的位数组。
filter的key集合是{x,y,z},
x添加到filter时,算法会将x进行三次不同的hash而生成3个值,将这个值当做bitarray的index,并将对应index的内容置位1, 蓝色的3个箭头代表x的3个index,w验证时同样经过3次hash, 最后会生成3个index,然后从bitarray中查找这3个index的内容,如果都为1,则证明存在,有一个不为1,说明不存在.
hash算法的特点是,相同输入产生固定的输出(index),不同的输入可能会得到相同的输出(index), 所以bloom filter能完全确定不属于集合的Key值,但是可能会错误的将不属于集合的key值认为是属于集合的。
为了降低错误率,其实就是要降低不同key值再几次hash产生相同输出的概率。Bitmap的长度我们定为m,几次hash我们定义为k. m增大能降低一次hash不同输入产生相同输出的概率,k增大能降低所有hash都相同的概率。所以合适的m和k值对降低错误率很关键.具体怎么选取m, k值有相关的数学公式,大家可以参阅
https://en.wikipedia.org/wiki/Bloom_filter
Bitcoin bloom filter流程
1)load filter (net_processing.cpp)
else if (strCommand == NetMsgType::FILTERLOAD)
{
CBloomFilter filter;
vRecv >> filter;
if (!filter.IsWithinSizeConstraints())
{
// There is no excuse for sending a too-large filter
LOCK(cs_main);
Misbehaving(pfrom->GetId(), 100);
}
else
{
LOCK(pfrom->cs_filter);
pfrom->pfilter.reset(new CBloomFilter(filter));
pfrom->pfilter->UpdateEmptyFull();
pfrom->fRelayTxes = true;
}
}
filter的数据序列化
template
inline void SerializationOp(Stream& s, Operation ser_action) {
//vData是bloom filter的集合key
READWRITE(vData);
//几次hash函数
READWRITE(nHashFuncs);
READWRITE(nTweak);
READWRITE(nFlags);
}
2)新增filter
else if (strCommand == NetMsgType::FILTERADD)
{
std::vector vData;
vRecv >> vData;
// Nodes must NEVER send a data item > 520 bytes (the max size for a script data object,
// and thus, the maximum size any matched object can have) in a filteradd message
bool bad = false;
if (vData.size() > MAX_SCRIPT_ELEMENT_SIZE) {
bad = true;
} else {
LOCK(pfrom->cs_filter);
if (pfrom->pfilter) {
pfrom->pfilter->insert(vData);
} else {
bad = true;
}
}
if (bad) {
LOCK(cs_main);
Misbehaving(pfrom->GetId(), 100);
}
}
其实就是按照bloom filter的算法对新增的key做几次hash然后修改bitArray
void CBloomFilter::insert(const std::vector& vKey)
{
if (isFull)
return;
//n次不同hash,不代表需要n个不同的hash函数,直接根据index更改hash seed即可实现
for (unsigned int i = 0; i < nHashFuncs; i++)
{
unsigned int nIndex = Hash(i, vKey);
// Sets bit nIndex of vData
vData[nIndex >> 3] |= (1 << (7 & nIndex));
}
isEmpty = false;
}
上面的
vData[nIndex >> 3] |= (1 << (7 & nIndex)); 每一次key hash生成的结果对应到bitArray的1bit的index, 而vData是char对象,总共有4 bit,所以nIndex >> 3先找到对一个char的index, 1 << (7 & nIndex) 找到index对应4位中的哪一位
class CBloomFilter
{
private:
std::vector vData;
unsigned int nHashFuncs;
unsigned int nTweak;
}
nHashFuncs是int, 说好的不同的hash函数呢?
inline unsigned int CBloomFilter::Hash(unsigned int nHashNum, const std::vector& vDataToHash) const
{
// 0xFBA4C795 chosen as it guarantees a reasonable bit difference between nHashNum values.
return MurmurHash3(
nHashNum * 0xFBA4C795 + nTweak
, vDataToHash) % (vData.size() * 8);
}
从这里可以看出,n个不同的hash函数,其实确实可以通过n个不同int即可实现,这里直接通过‘nHashNum * 0xFBA4C795 + nTweak’就达到了不同hash的效果
3)filter应用场景
我们以FILTERED_BLOCK消息为例,该消息的意思是获取指定blockhash中满足bloom filter的block 内容
else if (inv.type == MSG_FILTERED_BLOCK)
{
bool sendMerkleBlock = false;
CMerkleBlock merkleBlock;
{
LOCK(pfrom->cs_filter);
if (pfrom->pfilter) {
sendMerkleBlock = true;
//merkleBlock只包含包头,符合条件的娥txhash及partial merklepath
//是一种被过滤掉的block content
merkleBlock = CMerkleBlock(*pblock, *pfrom->pfilter);
}
}
if (sendMerkleBlock) {
//返回merkleBlock
connman->PushMessage(pfrom, msgMaker.Make(NetMsgType::MERKLEBLOCK, merkleBlock));
// CMerkleBlock just contains hashes, so also push any transactions in the block the client did not see
// This avoids hurting performance by pointlessly requiring a round-trip
// Note that there is currently no way for a node to request any single transactions we didn't send here -
// they must either disconnect and retry or request the full block.
// Thus, the protocol spec specified allows for us to provide duplicate txn here,
// however we MUST always provide at least what the remote peer needs
typedef std::pair PairType;
for (PairType& pair : merkleBlock.vMatchedTxn)
//返回符合filter条件的transaction 数据
connman->PushMessage(pfrom, msgMaker.Make(SERIALIZE_TRANSACTION_NO_WITNESS, NetMsgType::TX, *pblock->vtx[pair.first]));
}
// else
// no response
}
}
filter具体过滤过程
CMerkleBlock::CMerkleBlock(const CBlock& block, CBloomFilter* filter, const std::set* txids)
{
header = block.GetBlockHeader();
std::vector vMatch;
std::vector vHashes;
vMatch.reserve(block.vtx.size());
vHashes.reserve(block.vtx.size());
for (unsigned int i = 0; i < block.vtx.size(); i++)
{
const uint256& hash = block.vtx[i]->GetHash();
if (txids && txids->count(hash)) {
vMatch.push_back(true);
} else if (filter && filter->IsRelevantAndUpdate(*block.vtx[i])) {
vMatch.push_back(true);
vMatchedTxn.emplace_back(i, hash);
} else {
vMatch.push_back(false);
}
vHashes.push_back(hash);
}
txn = CPartialMerkleTree(vHashes, vMatch);
}
bool CBloomFilter::
IsRelevantAndUpdate
(const CTransaction& tx)
{
bool fFound = false;
// Match if the filter contains the hash of tx
// for finding tx when they appear in a block
if (isFull)
return true;
if (isEmpty)
return false;
//获取txhash,看是否在bloom filter集合中
const uint256& hash = tx.GetHash();
if (contains(hash))
fFound = true;
for (unsigned int i = 0; i < tx.vout.size(); i++)
{
const CTxOut& txout = tx.vout[i];
// Match if the filter contains any arbitrary script data element in any scriptPubKey in tx
// If this matches, also add the specific output that was matched.
// This means clients don't have to update the filter themselves when a new relevant tx
// is discovered in order to find spending transactions, which avoids round-tripping and race conditions.
CScript::const_iterator pc = txout.scriptPubKey.begin();
std::vector data;
while (pc < txout.scriptPubKey.end())
{
opcodetype opcode;
//获取锁定脚本中的数据,以用于验证这些数据是否在bloom filter集合中
if (!txout.scriptPubKey.GetOp(pc, opcode, data))
break;
//验证是否在
在bloom filter集合中
if (data.size() != 0 && contains(data))
{
fFound = true;
break;
}
}
}
if (fFound)
return true;
for (const CTxIn& txin : tx.vin)
{
// Match if the filter contains an outpoint tx spends
//txin.prevout是否在bloom filter集合中
if (contains(txin.prevout))
return true;
// Match if the filter contains any arbitrary script data element in any scriptSig in tx
CScript::const_iterator pc = txin.scriptSig.begin();
std::vector data;
while (pc < txin.scriptSig.end())
{
opcodetype opcode;
//获取解锁脚本以验证是否在在bloom filter集合中
if (!txin.scriptSig.GetOp(pc, opcode, data))
break;
//验证是否在
在bloom filter集合中
if (data.size() != 0 && contains(data))
return true;
}
}
return false;
}
bool CBloomFilter::contains(const std::vector& vKey) const
{
for (unsigned int i = 0; i < nHashFuncs; i++)
{
unsigned int nIndex = Hash(i, vKey);
// Checks bit nIndex of vData
if (!(vData[nIndex >> 3] & (1 << (7 & nIndex))))
return false;
}
return true;
}
总结,用来filter的数据可以是tx.hash,也可以是txout.scriptPubKey中的data,也可以是txin.scriptSig中的data
比如根据交易的publicKey来过滤交易,就可以在transaction的txin, txout的上做文章.想P2PK的解锁和锁定脚本中都有pubKey,可以用来filter.
/********************************
* 本文来自CSDN博主"爱踢门"
* 转载请标明出处
: http://blog.csdn.net/itleaks
******************************************/