介绍完BoltDB后,我们回到btcd/database的源代码。了解了BoltDB的实现后,btcd/database的接口定义和其调用方法将变得容易理解。然而,包database并未实现一个数据库,它实际上是btcd中的存储框架,使btcd支持多种数据库,其中,ffldb是database包中提供的默认数据库。在clone完代码后,可以发现包database主要包含的文件有:
- cmd/dbtool: 实现了一个从db文件中读写block的工具。
- ffldb: 实现了一个默认的数据库驱动,它参考BoltDB实现了DB、Bucket、Tx等;
- internal/treap:一个树堆的实现,用于缓存元数据;
- testdata: 包含用于测试的db文件;
- driver.go: 定义了Driver类型及注册、打开数据库的方法;
- interface.go: 定义了DB、Bucket、Tx、Cursor等接口,几乎与BoltDB中的定义一致;
- error.go: 定义了包database中的错误代码及对应的提示字符;
- doc.go: 包database的描述;
- driver_test.go、error_test.go、example_test.go、export_test.go: 对应的测试文件;
需要说明的是,ffldb并不是真正意义上的数据库,它利用leveldb来存储元数据,用文件来存区块。对元数据的存储,ffldb参考BoltDB的实现,支持Bucket及嵌套子Bucket;对区块或者元数据的读写,它也实现了类似的Transaction。特别地,ffldb通过leveldb存储元数据时,增加了一层缓存以提高读写效率。它的基本框架如下图所示:
我们先来看看包database中的接口DB的定义:
//btcd/database/interface.go
type DB interface {
// Type returns the database driver type the current database instance
// was created with.
Type() string
......
Begin(writable bool) (Tx, error)
......
View(fn func(tx Tx) error) error
......
Update(fn func(tx Tx) error) error
......
Close() error
}
可以看出,其中的接口定义与BoltDB中的定义几乎一样。事实上,Bucket及Cursor等接口均与BoltDB类似,Tx接口由于增加了对metadata和block的操作,有所不同:
//btcd/database/interface.go
// Tx represents a database transaction. It can either by read-only or
// read-write. The transaction provides a metadata bucket against which all
// read and writes occur.
//
// As would be expected with a transaction, no changes will be saved to the
// database until it has been committed. The transaction will only provide a
// view of the database at the time it was created. Transactions should not be
// long running operations.
type Tx interface {
// Metadata returns the top-most bucket for all metadata storage.
Metadata() Bucket
......
StoreBlock(block *btcutil.Block) error
......
HasBlock(hash *chainhash.Hash) (bool, error)
......
HasBlocks(hashes []chainhash.Hash) ([]bool, error)
......
FetchBlockHeader(hash *chainhash.Hash) ([]byte, error)
......
FetchBlockHeaders(hashes []chainhash.Hash) ([][]byte, error)
......
FetchBlock(hash *chainhash.Hash) ([]byte, error)
......
FetchBlocks(hashes []chainhash.Hash) ([][]byte, error)
......
FetchBlockRegion(region *BlockRegion) ([]byte, error)
......
FetchBlockRegions(regions []BlockRegion) ([][]byte, error)
// ******************************************************************
// Methods related to both atomic metadata storage and block storage.
// ******************************************************************
......
Commit() error
......
Rollback() error
}
由于篇幅原因,我们略去了各接口的注释,读者可以从源文件中阅读。从Tx接口的定义中可以看出,它主要定义了三类方法:
- Metadata(), 通过它可以获得根Bucket,所有的元数据均归属于Bucket,Bucket及其中的K/V对最终存于leveldb中。在一个Transaction中,对元数据的操作均是通过Metadata()得到Bucket后,再在Bucket中进行操作的;
- XxxBlockXxx,与Block操作相关的接口,它们主要是通过读写文件来读写Block;
- Commit()和Rollback(),在可写Tx中写入元数据或者区块后,需要通过Commit()提交修改并关闭Tx,或者通过Rollback()丢弃修改并关闭Tx;只读Tx结束时也要调用Rollback()来关闭,作用与BoltDB中的一致;
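在分析ffldb之前,先给出一个使用database包接口的简单示例,帮助读者建立直观印象。下面是一段示意性代码(假设ffldb驱动已通过匿名导入注册,数据库路径、Bucket名与Key均为举例),并非btcd中的实际调用:
package main

import (
    "fmt"
    "os"
    "path/filepath"

    "github.com/btcsuite/btcd/chaincfg"
    "github.com/btcsuite/btcd/database"
    _ "github.com/btcsuite/btcd/database/ffldb" // 匿名导入以注册ffldb驱动
    "github.com/btcsuite/btcd/wire"
    "github.com/btcsuite/btcutil"
)

func main() {
    // 创建一个ffldb数据库,路径与网络类型仅为示例
    dbPath := filepath.Join(os.TempDir(), "exampledb")
    db, err := database.Create("ffldb", dbPath, wire.MainNet)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer db.Close()

    // 通过可写transaction写入元数据和区块
    err = db.Update(func(tx database.Tx) error {
        // 元数据: 在根Bucket下创建子Bucket并写入K/V
        bucket, err := tx.Metadata().CreateBucketIfNotExists([]byte("mybucket"))
        if err != nil {
            return err
        }
        if err := bucket.Put([]byte("mykey"), []byte("myvalue")); err != nil {
            return err
        }
        // 区块: 存入主网创世区块
        return tx.StoreBlock(btcutil.NewBlock(chaincfg.MainNetParams.GenesisBlock))
    })
    if err != nil {
        fmt.Println(err)
        return
    }

    // 通过只读transaction读取元数据
    _ = db.View(func(tx database.Tx) error {
        value := tx.Metadata().Bucket([]byte("mybucket")).Get([]byte("mykey"))
        fmt.Printf("mykey: %s\n", value)
        return nil
    })
}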
ffldb提供了对上述各接口的实现,我们接下来着重分析它的代码。我们先来看看它的db类型定义:
//btcd/database/ffldb/db.go
// db represents a collection of namespaces which are persisted and implements
// the database.DB interface. All database access is performed through
// transactions which are obtained through the specific Namespace.
type db struct {
writeLock sync.Mutex // Limit to one write transaction at a time.
closeLock sync.RWMutex // Make database close block while txns active.
closed bool // Is the database closed?
store *blockStore // Handles read/writing blocks to flat files.
cache *dbCache // Cache layer which wraps underlying leveldb DB.
}
其中各字段意义是:
- writeLock: 互斥锁,保证同时只有一个可写transaction;
- closeLock: 保证数据库Close时所有已经打开的transaction均已结束;
- closed: 指示数据库是否已经关闭;
- store: 指向blockStore,用于读写区块;
- cache: 指向dbCache,用于读写元数据;
db实现了database.DB接口,其中各方法的实现与BoltDB中基本类似,也是通过View()或者Update()的回调方法获取Tx对象或其引用,然后调用Tx中接口进行数据库操作,故我们不再分析db的各方法实现,重点分析Tx的实现。ffldb中transaction的定义如下,它实现了database.Tx接口:
//btcd/database/ffldb/db.go
// transaction represents a database transaction. It can either be read-only or
// read-write and implements the database.Bucket interface. The transaction
// provides a root bucket against which all read and writes occur.
type transaction struct {
managed bool // Is the transaction managed?
closed bool // Is the transaction closed?
writable bool // Is the transaction writable?
db *db // DB instance the tx was created from.
snapshot *dbCacheSnapshot // Underlying snapshot for txns.
metaBucket *bucket // The root metadata bucket.
blockIdxBucket *bucket // The block index bucket.
// Blocks that need to be stored on commit. The pendingBlocks map is
// kept to allow quick lookups of pending data by block hash.
pendingBlocks map[chainhash.Hash]int
pendingBlockData []pendingBlock
// Keys that need to be stored or deleted on commit.
pendingKeys *treap.Mutable
pendingRemove *treap.Mutable
// Active iterators that need to be notified when the pending keys have
// been updated so the cursors can properly handle updates to the
// transaction state.
activeIterLock sync.RWMutex
activeIters []*treap.Iterator
}
其中各字段意义:
- managed: transaction是否被db托管,托管状态的transaction不能再主动调用Commit()或者Rollback();
- closed: 指示当前transaction是否已经结束;
- writable: 指示当前transaction是否可写;
- db: 指向与当前transaction绑定的db对象;
- snapshot: 当前transaction读到的元数据缓存的一个快照,在transaction打开的时候对dbCache进行快照得到的,也是元数据存储中MVCC机制的一部分,类似于BoltDB中读meta page;
- metaBucket: 存储元数据的根Bucket;
- blockIdxBucket: 存储区块hash与其序号的Bucket,它是metaBucket的第一个子Bucket,且只在ffldb内部使用;
- pendingBlocks: 记录待提交Block的哈希与其在pendingBlockData中的位置的对应关系;
- pendingBlockData: 顺序记录所有待提交Block的字节序列;
- pendingKeys: 待添加或者更新的元数据集合,请注意,它指向一个树堆;
- pendingRemove: 待删除的元数据集合,它也指向一个树堆,与pendingKeys一样,它们均通过dbCache向leveldb中更新;
- activeIterLock: 对activeIters的保护锁;
- activeIters: 用于记录当前transaction中查找dbCache的Iterators,当向dbCache中更新Key时,树堆旋转会更新节点间关系,故需将所有活跃的Iterator复位;
我们说transaction中主要有三类方法,我们先来看看它的Metadata()方法:
//btcd/database/ffldb/db.go
// Metadata returns the top-most bucket for all metadata storage.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) Metadata() database.Bucket {
return tx.metaBucket
}
可以看出它仅仅是返回根Bucket,剩下的操作均通过它来进行。我们来看看bucket的定义,它实现了database.Bucket:
//btcd/database/ffldb/db.go
// bucket is an internal type used to represent a collection of key/value pairs
// and implements the database.Bucket interface.
type bucket struct {
tx *transaction
id [4]byte
}
需要注意的是,ffldb中的bucket与BoltDB中的Bucket虽然有着相同的接口定义,但它们底层实际存储K/V对的数据结构并不相同,所以bucket的定义和查找方法大不相同。ffldb利用leveldb来存储K/V,leveldb底层数据结构为LSM树(log-structured merge-tree),而BoltDB采用B+Tree。ffldb利用leveldb提供的接口来读写K/V,而leveldb中没有Bucket的概念,也没有对Key进行分层管理的方法,那ffldb中是如何实现bucket的呢?我们可以通过CreateBucket()来分析:
//btcd/database/ffldb/db.go
// CreateBucket creates and returns a new nested bucket with the given key.
//
// Returns the following errors as required by the interface contract:
// - ErrBucketExists if the bucket already exists
// - ErrBucketNameRequired if the key is empty
// - ErrIncompatibleValue if the key is otherwise invalid for the particular
// implementation
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) CreateBucket(key []byte) (database.Bucket, error) {
......
// Ensure bucket does not already exist.
bidxKey := bucketIndexKey(b.id, key)
......
// Find the appropriate next bucket ID to use for the new bucket. In
// the case of the special internal block index, keep the fixed ID.
var childID [4]byte
if b.id == metadataBucketID && bytes.Equal(key, blockIdxBucketName) {
childID = blockIdxBucketID
} else {
var err error
childID, err = b.tx.nextBucketID()
if err != nil {
return nil, err
}
}
// Add the new bucket to the bucket index.
if err := b.tx.putKey(bidxKey, childID[:]); err != nil {
str := fmt.Sprintf("failed to create bucket with key %q", key)
return nil, convertErr(str, err)
}
return &bucket{tx: b.tx, id: childID}, nil
}
上面代码主要包含:
- 通过bucketIndexKey()创建子Bucket的Key;
- 为子Bucket指定或者选择一个id;
- 将子Bucket的Key和id作为K/V记录存入父Bucket中,这一点与BoltDB相似;
与BoltDB中通过K/V的flag来标记Bucket不同,ffldb中通过Key的格式来标记Bucket:
//btcd/database/ffldb/db.go
// bucketIndexKey returns the actual key to use for storing and retrieving a
// child bucket in the bucket index. This is required because additional
// information is needed to distinguish nested buckets with the same name.
func bucketIndexKey(parentID [4]byte, key []byte) []byte {
// The serialized bucket index key format is:
//   &lt;bucketindexprefix&gt;&lt;parentbucketid&gt;&lt;bucketname&gt;
indexKey := make([]byte, len(bucketIndexPrefix)+4+len(key))
copy(indexKey, bucketIndexPrefix)
copy(indexKey[len(bucketIndexPrefix):], parentID[:])
copy(indexKey[len(bucketIndexPrefix)+4:], key)
return indexKey
}
可以看出,一个子Bucket的Key总是“&lt;bucketIndexPrefix&gt;&lt;父Bucket的id&gt;&lt;子Bucket名&gt;”的形式,通过固定前缀和父Bucket的id就能区分不同层次、不同父Bucket下的同名子Bucket。Bucket中普通K/V的读写同样要对Key进行转换,我们来看看bucket的Put()方法:
//btcd/database/ffldb/db.go
// Put saves the specified key/value pair to the bucket. Keys that do not
// already exist are added and keys that already exist are overwritten.
//
// Returns the following errors as required by the interface contract:
// - ErrKeyRequired if the key is empty
// - ErrIncompatibleValue if the key is the same as an existing bucket
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Bucket interface implementation.
func (b *bucket) Put(key, value []byte) error {
......
return b.tx.putKey(bucketizedKey(b.id, key), value)
}
其中的关键也在于Key,在Bucket中添加记录时,会通过bucketizedKey()对key进行处理:
//btcd/database/ffldb/db.go
// bucketizedKey returns the actual key to use for storing and retrieving a key
// for the provided bucket ID. This is required because bucketizing is handled
// through the use of a unique prefix per bucket.
func bucketizedKey(bucketID [4]byte, key []byte) []byte {
// The serialized block index key format is:
//   &lt;bucketid&gt;&lt;key&gt;
bKey := make([]byte, 4+len(key))
copy(bKey, bucketID[:])
copy(bKey[4:], key)
return bKey
}
也就是说,在向bucket中添加K/V时,Key会被转换成“&lt;bucket id&gt;&lt;原Key&gt;”的形式再存入leveldb,同一个Bucket中的所有Key因此拥有相同的4字节前缀。Put()最终调用transaction的putKey()方法:
//btcd/database/ffldb/db.go
// putKey adds the provided key to the list of keys to be updated in the
// database when the transaction is committed.
//
// NOTE: This function must only be called on a writable transaction. Since it
// is an internal helper function, it does not check.
func (tx *transaction) putKey(key, value []byte) error {
// Prevent the key from being deleted if it was previously scheduled
// to be deleted on transaction commit.
tx.pendingRemove.Delete(key)
// Add the key/value pair to the list to be written on transaction
// commit.
tx.pendingKeys.Put(key, value)
tx.notifyActiveIters()
return nil
}
类似地,bucket的Delete()方法也是调用transaction的deleteKey()方法来实现,deleteKey()会将要删除的Key添加到pendingRemove中;待transaction Commit时,pendingKeys中的Key被写入leveldb,pendingRemove中的Key从leveldb中删除。bucket的Get()方法则最终调用transaction的fetchKey()方法来查询,fetchKey()先从pendingRemove或者pendingKeys中查找,如果找不到,再从dbCache的一个快照中查找。
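putKey()、deleteKey()与fetchKey()共同构成了transaction一侧的元数据读写缓冲,fetchKey()的查找顺序大致可以用下面的示意代码来理解(简化写法,方法名为示意,并非逐行对应源码):
// fetchKey的大致逻辑: 先查transaction自身的待提交状态,再查dbCache快照
func (tx *transaction) fetchKeySketch(key []byte) []byte {
    // 可写transaction需要先检查本事务中待删除、待添加的Key
    if tx.writable {
        if tx.pendingRemove.Has(key) {
            return nil // 该Key将在Commit时被删除,视为不存在
        }
        if value := tx.pendingKeys.Get(key); value != nil {
            return value
        }
    }
    // 否则从transaction打开时获取的dbCache快照中查找,
    // 快照内部会依次查cachedRemove、cachedKeys,最后查leveldb
    return tx.snapshot.Get(key)
}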
transaction中第二类是读写Block相关的方法,我们主要分析StoreBlock()和FetchBlock(),先来看看StoreBlock():
//btcd/database/ffldb/db.go
// StoreBlock stores the provided block into the database. There are no checks
// to ensure the block connects to a previous block, contains double spends, or
// any additional functionality such as transaction indexing. It simply stores
// the block in the database.
//
// Returns the following errors as required by the interface contract:
// - ErrBlockExists when the block hash already exists
// - ErrTxNotWritable if attempted against a read-only transaction
// - ErrTxClosed if the transaction has already been closed
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) StoreBlock(block *btcutil.Block) error {
......
// Reject the block if it already exists.
blockHash := block.Hash()
......
blockBytes, err := block.Bytes()
......
// Add the block to be stored to the list of pending blocks to store
// when the transaction is committed. Also, add it to pending blocks
// map so it is easy to determine the block is pending based on the
// block hash.
if tx.pendingBlocks == nil {
tx.pendingBlocks = make(map[chainhash.Hash]int)
}
tx.pendingBlocks[*blockHash] = len(tx.pendingBlockData)
tx.pendingBlockData = append(tx.pendingBlockData, pendingBlock{
hash: blockHash,
bytes: blockBytes,
})
log.Tracef("Added block %s to pending blocks", blockHash)
return nil
}
可以看出,StoreBlock()主要是把block先放入pendingBlockData,等待Commit时写入文件。我们再来看看FetchBlock():
//btcd/database/ffldb/db.go
// FetchBlock returns the raw serialized bytes for the block identified by the
// given hash. The raw bytes are in the format returned by Serialize on a
// wire.MsgBlock.
//
// Returns the following errors as required by the interface contract:
// - ErrBlockNotFound if the requested block hash does not exist
// - ErrTxClosed if the transaction has already been closed
// - ErrCorruption if the database has somehow become corrupted
//
// In addition, returns ErrDriverSpecific if any failures occur when reading the
// block files.
//
// NOTE: The data returned by this function is only valid during a database
// transaction. Attempting to access it after a transaction has ended results
// in undefined behavior. This constraint prevents additional data copies and
// allows support for memory-mapped database implementations.
//
// This function is part of the database.Tx interface implementation.
func (tx *transaction) FetchBlock(hash *chainhash.Hash) ([]byte, error) {
......
// When the block is pending to be written on commit return the bytes
// from there.
if idx, exists := tx.pendingBlocks[*hash]; exists {
return tx.pendingBlockData[idx].bytes, nil
}
// Lookup the location of the block in the files from the block index.
blockRow, err := tx.fetchBlockRow(hash)
if err != nil {
return nil, err
}
location := deserializeBlockLoc(blockRow)
// Read the block from the appropriate location. The function also
// performs a checksum over the data to detect data corruption.
blockBytes, err := tx.db.store.readBlock(hash, location)
if err != nil {
return nil, err
}
return blockBytes, nil
}
读Block时先从pendingBlocks中查找,如果有则直接从pendingBlockData中返回;否则,通过db中的blockStore读出区块。我们先不深入blockStore,待介绍完transaction的Commit后再来分析它。关键地,我们可以发现,通过transaction读写元数据或者Block时,均会先对pendingBlocks、pendingKeys与pendingRemove进行读写,它们可以看作transaction的缓冲,在Commit时被同步到文件或者leveldb中。Commit()最终调用writePendingAndCommit()进行实际操作:
//btcd/database/ffldb/db.go
// writePendingAndCommit writes pending block data to the flat block files,
// updates the metadata with their locations as well as the new current write
// location, and commits the metadata to the memory database cache. It also
// properly handles rollback in the case of failures.
//
// This function MUST only be called when there is pending data to be written.
func (tx *transaction) writePendingAndCommit() error {
......
// Loop through all of the pending blocks to store and write them.
for _, blockData := range tx.pendingBlockData {
log.Tracef("Storing block %s", blockData.hash)
location, err := tx.db.store.writeBlock(blockData.bytes)
if err != nil {
rollback()
return err
}
// Add a record in the block index for the block. The record
// includes the location information needed to locate the block
// on the filesystem as well as the block header since they are
// so commonly needed.
blockHdr := blockData.bytes[0:blockHdrSize]
blockRow := serializeBlockRow(location, blockHdr)
err = tx.blockIdxBucket.Put(blockData.hash[:], blockRow)
if err != nil {
rollback()
return err
}
}
// Update the metadata for the current write file and offset.
writeRow := serializeWriteRow(wc.curFileNum, wc.curOffset)
if err := tx.metaBucket.Put(writeLocKeyName, writeRow); err != nil {
rollback()
return convertErr("failed to store write cursor", err)
}
// Atomically update the database cache. The cache automatically
// handles flushing to the underlying persistent storage database.
return tx.db.cache.commitTx(tx)
}
writePendingAndCommit()中,主要包含:
- 通过blockStore将pendingBlockData中的区块写入文件,同时将区块的hash与它在文件中的位置写入blockIdxBucket,以便后续查找;
- 更新metaBucket中记录当前文件读写位置的K/V;
- 通过dbCache的commitTx()将待提交的K/V写入树堆缓存,必要时写入leveldb;
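其中,blockIdxBucket中每条记录的Value(即blockRow)由区块位置信息和80字节的区块头拼接而成,区块位置blockLocation的序列化方式大致可以用下面的示意代码来理解(假设已导入encoding/binary,字段按小端字节序,简化写法,非逐行源码):
// blockLocation描述一个区块存于哪个文件、文件内偏移及封装后的长度
type blockLocation struct {
    blockFileNum uint32 // 区块文件序号,对应blocks目录下形如000000000.fdb的文件
    fileOffset   uint32 // 区块记录在文件内的起始偏移
    blockLen     uint32 // 封装后的区块记录长度(含网络号、长度值和校验和)
}

// 序列化为12字节: <文件序号4字节><文件内偏移4字节><区块长度4字节>
func serializeBlockLocSketch(loc blockLocation) []byte {
    var buf [12]byte
    binary.LittleEndian.PutUint32(buf[0:4], loc.blockFileNum)
    binary.LittleEndian.PutUint32(buf[4:8], loc.fileOffset)
    binary.LittleEndian.PutUint32(buf[8:12], loc.blockLen)
    return buf[:]
}

// blockRow = 序列化后的blockLocation + 区块头
func serializeBlockRowSketch(loc blockLocation, blockHdr []byte) []byte {
    return append(serializeBlockLocSketch(loc), blockHdr...)
}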
blockStore
transaction中读写元数据或者区块时,最终会通过blockStore读写文件或者dbCache读写树堆或者leveldb。所以接下来,我们主要分析blockStore和dbCache。我们先来看blockStore的定义:
//btcd/database/ffldb/blockio.go
// blockStore houses information used to handle reading and writing blocks (and
// part of blocks) into flat files with support for multiple concurrent readers.
type blockStore struct {
// network is the specific network to use in the flat files for each
// block.
network wire.BitcoinNet
// basePath is the base path used for the flat block files and metadata.
basePath string
// maxBlockFileSize is the maximum size for each file used to store
// blocks. It is defined on the store so the whitebox tests can
// override the value.
maxBlockFileSize uint32
// The following fields are related to the flat files which hold the
// actual blocks. The number of open files is limited by maxOpenFiles.
//
// obfMutex protects concurrent access to the openBlockFiles map. It is
// a RWMutex so multiple readers can simultaneously access open files.
//
// openBlockFiles houses the open file handles for existing block files
// which have been opened read-only along with an individual RWMutex.
// This scheme allows multiple concurrent readers to the same file while
// preventing the file from being closed out from under them.
//
// lruMutex protects concurrent access to the least recently used list
// and lookup map.
//
// openBlocksLRU tracks how the open files are refenced by pushing the
// most recently used files to the front of the list thereby trickling
// the least recently used files to end of the list. When a file needs
// to be closed due to exceeding the the max number of allowed open
// files, the one at the end of the list is closed.
//
// fileNumToLRUElem is a mapping between a specific block file number
// and the associated list element on the least recently used list.
//
// Thus, with the combination of these fields, the database supports
// concurrent non-blocking reads across multiple and individual files
// along with intelligently limiting the number of open file handles by
// closing the least recently used files as needed.
//
// NOTE: The locking order used throughout is well-defined and MUST be
// followed. Failure to do so could lead to deadlocks. In particular,
// the locking order is as follows:
// 1) obfMutex
// 2) lruMutex
// 3) writeCursor mutex
// 4) specific file mutexes
//
// None of the mutexes are required to be locked at the same time, and
// often aren't. However, if they are to be locked simultaneously, they
// MUST be locked in the order previously specified.
//
// Due to the high performance and multi-read concurrency requirements,
// write locks should only be held for the minimum time necessary.
obfMutex sync.RWMutex
lruMutex sync.Mutex
openBlocksLRU *list.List // Contains uint32 block file numbers.
fileNumToLRUElem map[uint32]*list.Element
openBlockFiles map[uint32]*lockableFile
// writeCursor houses the state for the current file and location that
// new blocks are written to.
writeCursor *writeCursor
// These functions are set to openFile, openWriteFile, and deleteFile by
// default, but are exposed here to allow the whitebox tests to replace
// them when working with mock files.
openFileFunc func(fileNum uint32) (*lockableFile, error)
openWriteFileFunc func(fileNum uint32) (filer, error)
deleteFileFunc func(fileNum uint32) error
}
其各字段意义如下:
- network: 指示当前Block网络类型,比如MainNet、TestNet或SimNet,在向文件中写入区块时会指定该区块来自哪类网络;
- basePath: 存储Block的文件在磁盘上的存储路径;
- maxBlockFileSize: 存储Block文件的最大的Size;
- obfMutex: 对openBlockFiles进行保护的读写锁;
- lruMutex:对openBlocksLRU和fileNumToLRUElem进行保护的互斥锁;
- openBlocksLRU: 已打开文件的序号的LRU列表,默认的最大打开文件数是25;
- fileNumToLRUElem: 记录文件序号与openBlocksLRU中元素的对应关系;
- openBlockFiles: 记录所有打开的只读文件的序号与文件指针的对应关系;
- writeCursor: 指向当前写入的文件,记录其文件序号和写偏移;
- openFileFunc、openWriteFileFunc以及deleteFileFunc: openFile、openWriteFile和deleteFile的接口方法,主要用于测试,它们的默认实现就是blockStore的对应方法。
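列表中多次提到的lockableFile和writeCursor,其定义大致如下(为便于阅读省略了源文件中的注释,具体以源码为准):
//btcd/database/ffldb/blockio.go
// lockableFile将文件句柄与读写锁封装在一起,以支持多个并发读者
type lockableFile struct {
    sync.RWMutex
    file filer // filer是与*os.File类似的接口,便于测试时注入mock文件
}

// writeCursor记录当前正在写入的区块文件及文件内偏移
type writeCursor struct {
    sync.RWMutex
    curFile    *lockableFile // 当前写入的文件
    curFileNum uint32        // 当前写入文件的序号
    curOffset  uint32        // 下一个区块将被写入的文件内偏移
}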
我们还是通过blockStore的readBlock()和writeBlock()方法来了解blockStore的工作机制。我们先来看看readBlock():
//btcd/database/ffldb/blockio.go
// readBlock reads the specified block record and returns the serialized block.
// It ensures the integrity of the block data by checking that the serialized
// network matches the current network associated with the block store and
// comparing the calculated checksum against the one stored in the flat file.
// This function also automatically handles all file management such as opening
// and closing files as necessary to stay within the maximum allowed open files
// limit.
//
// Returns ErrDriverSpecific if the data fails to read for any reason and
// ErrCorruption if the checksum of the read data doesn't match the checksum
// read from the file.
//
// Format: &lt;network&gt;&lt;block length&gt;&lt;serialized block&gt;&lt;checksum&gt;
func (s *blockStore) readBlock(hash *chainhash.Hash, loc blockLocation) ([]byte, error) {
// Get the referenced block file handle opening the file as needed. The
// function also handles closing files as needed to avoid going over the
// max allowed open files.
blockFile, err := s.blockFile(loc.blockFileNum)
if err != nil {
return nil, err
}
serializedData := make([]byte, loc.blockLen)
n, err := blockFile.file.ReadAt(serializedData, int64(loc.fileOffset))
blockFile.RUnlock()
if err != nil {
str := fmt.Sprintf("failed to read block %s from file %d, "+
"offset %d: %v", hash, loc.blockFileNum, loc.fileOffset,
err)
return nil, makeDbErr(database.ErrDriverSpecific, str, err)
}
// Calculate the checksum of the read data and ensure it matches the
// serialized checksum. This will detect any data corruption in the
// flat file without having to do much more expensive merkle root
// calculations on the loaded block.
serializedChecksum := binary.BigEndian.Uint32(serializedData[n-4:])
calculatedChecksum := crc32.Checksum(serializedData[:n-4], castagnoli)
if serializedChecksum != calculatedChecksum {
str := fmt.Sprintf("block data for block %s checksum "+
"does not match - got %x, want %x", hash,
calculatedChecksum, serializedChecksum)
return nil, makeDbErr(database.ErrCorruption, str, nil)
}
// The network associated with the block must match the current active
// network, otherwise somebody probably put the block files for the
// wrong network in the directory.
serializedNet := byteOrder.Uint32(serializedData[:4])
if serializedNet != uint32(s.network) {
str := fmt.Sprintf("block data for block %s is for the "+
"wrong network - got %d, want %d", hash, serializedNet,
uint32(s.network))
return nil, makeDbErr(database.ErrDriverSpecific, str, nil)
}
// The raw block excludes the network, length of the block, and
// checksum.
return serializedData[8 : n-4], nil
}
其主要步骤为:
- 通过blockFile()查询已经打开的文件或者新打开一个文件;
- 通过file.ReadAt()方法从文件中的loc.fileOffset位置读出区块数据,它的格式是“&lt;网络号&gt;&lt;区块长度&gt;&lt;区块数据&gt;&lt;校验和&gt;”;
- 校验checksum和网络号后,从区块数据中解析出block的字节流;
其中比较重要的是通过blockFile()得到一个文件句柄,我们来看看它的实现:
//btcd/database/ffldb/blockio.go
// blockFile attempts to return an existing file handle for the passed flat file
// number if it is already open as well as marking it as most recently used. It
// will also open the file when it's not already open subject to the rules
// described in openFile.
//
// NOTE: The returned block file will already have the read lock acquired and
// the caller MUST call .RUnlock() to release it once it has finished all read
// operations. This is necessary because otherwise it would be possible for a
// separate goroutine to close the file after it is returned from here, but
// before the caller has acquired a read lock.
func (s *blockStore) blockFile(fileNum uint32) (*lockableFile, error) {
// When the requested block file is open for writes, return it.
wc := s.writeCursor
wc.RLock()
if fileNum == wc.curFileNum && wc.curFile.file != nil {
obf := wc.curFile
obf.RLock()
wc.RUnlock()
return obf, nil
}
wc.RUnlock()
// Try to return an open file under the overall files read lock.
s.obfMutex.RLock()
if obf, ok := s.openBlockFiles[fileNum]; ok {
s.lruMutex.Lock()
s.openBlocksLRU.MoveToFront(s.fileNumToLRUElem[fileNum])
s.lruMutex.Unlock()
obf.RLock()
s.obfMutex.RUnlock()
return obf, nil
}
s.obfMutex.RUnlock()
// Since the file isn't open already, need to check the open block files
// map again under write lock in case multiple readers got here and a
// separate one is already opening the file.
s.obfMutex.Lock() (1)
if obf, ok := s.openBlockFiles[fileNum]; ok {
obf.RLock()
s.obfMutex.Unlock()
return obf, nil
}
// The file isn't open, so open it while potentially closing the least
// recently used one as needed.
obf, err := s.openFileFunc(fileNum)
if err != nil {
s.obfMutex.Unlock()
return nil, err
}
obf.RLock()
s.obfMutex.Unlock()
return obf, nil
}
它的主要步骤是:
- 检查要查找的文件是否是writeCursor指向的文件,如果是则直接返回。请注意,对writeCursor的访问通过其读锁保护;同时,blockFile()返回的lockableFile对象已经获取了自己的读锁,由调用方负责释放。如果返回的是writeCursor指向的文件,说明正有区块写入该文件,当它被写满时会被关闭,持有读锁可以保证该文件在读操作结束之前不会被关闭;
- 接着,从blockStore记录的openBlockFiles中查找文件,如果找到,将文件移至LRU列表的首位置,同时获得文件读锁后返回;
- 代码(1)处获取s.obfMutex的写锁并再次从openBlockFiles中查找文件,这是为了防止刚完成第一次查找后,目标文件就被其他线程打开并添加到了openBlockFiles中;如果不作此保护,在openBlockFiles中未找到就直接打开新文件,可能出现同一个文件被多次打开的情况。有读者可能会问: 为什么不在第一次查找openBlockFiles时就用s.obfMutex的写锁保护呢?这也是为了提高对openBlockFiles的读写并发: openBlockFiles中存的均是最近打开过的文件,有较大概率在第一次查找时就能找到目标文件,用s.obfMutex的读锁保护,能提高从openBlockFiles查找的并发量;
- 如果openBlockFiles中找不到目标文件,就调用openFile()打开新文件,请注意整个openFile()调用均在s.obfMutex的写锁保护下;
//btcd/database/ffldb/blockio.go
// openFile returns a read-only file handle for the passed flat file number.
// The function also keeps track of the open files, performs least recently
// used tracking, and limits the number of open files to maxOpenFiles by closing
// the least recently used file as needed.
//
// This function MUST be called with the overall files mutex (s.obfMutex) locked
// for WRITES.
func (s *blockStore) openFile(fileNum uint32) (*lockableFile, error) {
// Open the appropriate file as read-only.
filePath := blockFilePath(s.basePath, fileNum)
file, err := os.Open(filePath)
if err != nil {
return nil, makeDbErr(database.ErrDriverSpecific, err.Error(),
err)
}
blockFile := &lockableFile{file: file}
// Close the least recently used file if the file exceeds the max
// allowed open files. This is not done until after the file open in
// case the file fails to open, there is no need to close any files.
//
// A write lock is required on the LRU list here to protect against
// modifications happening as already open files are read from and
// shuffled to the front of the list.
//
// Also, add the file that was just opened to the front of the least
// recently used list to indicate it is the most recently used file and
// therefore should be closed last.
s.lruMutex.Lock()
lruList := s.openBlocksLRU
if lruList.Len() >= maxOpenFiles {
lruFileNum := lruList.Remove(lruList.Back()).(uint32)
oldBlockFile := s.openBlockFiles[lruFileNum]
// Close the old file under the write lock for the file in case
// any readers are currently reading from it so it's not closed
// out from under them.
oldBlockFile.Lock()
_ = oldBlockFile.file.Close()
oldBlockFile.Unlock()
delete(s.openBlockFiles, lruFileNum)
delete(s.fileNumToLRUElem, lruFileNum)
}
s.fileNumToLRUElem[fileNum] = lruList.PushFront(fileNum)
s.lruMutex.Unlock()
// Store a reference to it in the open block files map.
s.openBlockFiles[fileNum] = blockFile
return blockFile, nil
}
openFile()中主要执行:
- 直接通过os.Open()调用以只读模式打开目标文件;
- 检测openBlocksLRU是否已满,如果已满,则将列表末尾元素移除,同时将对应的文件关闭并从openBlockFiles中移除,然后将新打开的文件添加到列表首位置;其中对openBlocksLRU和fileNumToLRUElem的访问均在s.lruMutex保护下;
- 将新打开的文件放入openBlockFiles中;
从openFile()中可以看出,blockStore通过openBlockFiles和openBlocksLRU及fileNumToLRUElem维护了一个已经打开的只读文件的LRU缓存列表,可以加快从文件中读区块的速度。接下来,我们再来看看writeBlock():
//btcd/database/ffldb/blockio.go
// writeBlock appends the specified raw block bytes to the store's write cursor
// location and increments it accordingly. When the block would exceed the max
// file size for the current flat file, this function will close the current
// file, create the next file, update the write cursor, and write the block to
// the new file.
//
// The write cursor will also be advanced the number of bytes actually written
// in the event of failure.
//
// Format: &lt;network&gt;&lt;block length&gt;&lt;serialized block&gt;&lt;checksum&gt;
func (s *blockStore) writeBlock(rawBlock []byte) (blockLocation, error) {
// Compute how many bytes will be written.
// 4 bytes each for block network + 4 bytes for block length +
// length of raw block + 4 bytes for checksum.
blockLen := uint32(len(rawBlock))
fullLen := blockLen + 12
// Move to the next block file if adding the new block would exceed the
// max allowed size for the current block file. Also detect overflow
// to be paranoid, even though it isn't possible currently, numbers
// might change in the future to make it possible.
//
// NOTE: The writeCursor.offset field isn't protected by the mutex
// since it's only read/changed during this function which can only be
// called during a write transaction, of which there can be only one at
// a time.
wc := s.writeCursor
finalOffset := wc.curOffset + fullLen
if finalOffset < wc.curOffset || finalOffset > s.maxBlockFileSize {
// This is done under the write cursor lock since the curFileNum
// field is accessed elsewhere by readers.
//
// Close the current write file to force a read-only reopen
// with LRU tracking. The close is done under the write lock
// for the file to prevent it from being closed out from under
// any readers currently reading from it.
wc.Lock()
wc.curFile.Lock() (1)
if wc.curFile.file != nil {
_ = wc.curFile.file.Close()
wc.curFile.file = nil
}
wc.curFile.Unlock()
// Start writes into next file.
wc.curFileNum++ (2)
wc.curOffset = 0 (3)
wc.Unlock()
}
// All writes are done under the write lock for the file to ensure any
// readers are finished and blocked first.
wc.curFile.Lock()
defer wc.curFile.Unlock()
// Open the current file if needed. This will typically only be the
// case when moving to the next file to write to or on initial database
// load. However, it might also be the case if rollbacks happened after
// file writes started during a transaction commit.
if wc.curFile.file == nil {
file, err := s.openWriteFileFunc(wc.curFileNum) (4)
if err != nil {
return blockLocation{}, err
}
wc.curFile.file = file
}
// Bitcoin network.
origOffset := wc.curOffset (5)
hasher := crc32.New(castagnoli)
var scratch [4]byte
byteOrder.PutUint32(scratch[:], uint32(s.network))
if err := s.writeData(scratch[:], "network"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(scratch[:])
// Block length.
byteOrder.PutUint32(scratch[:], blockLen)
if err := s.writeData(scratch[:], "block length"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(scratch[:])
// Serialized block.
if err := s.writeData(rawBlock[:], "block"); err != nil {
return blockLocation{}, err
}
_, _ = hasher.Write(rawBlock)
// Castagnoli CRC-32 as a checksum of all the previous.
if err := s.writeData(hasher.Sum(nil), "checksum"); err != nil {
return blockLocation{}, err
}
loc := blockLocation{ (6)
blockFileNum: wc.curFileNum,
fileOffset: origOffset,
blockLen: fullLen,
}
return loc, nil
}
其主要步骤为:
- 检测写入区块后是否超过文件大小限制,如果超过,则关闭当前文件,新创建一个文件; 否则,直接在当前文件的wc.curOffset偏移处开始写区块;
- 代码(1)处关闭writeCursor指向的文件,在调用Close()之前,获取了lockableFile的写锁,以防其他线程正在读该文件;
- 代码(2)将writeCursor指向下一个文件,代码(3)处将文件内偏移复位;
- 代码(4)处调用openWriteFile()以可读写方式打开或者创建一个新的文件,同时将writeCursor指向该文件;
- 代码(5)处记录下写区块的文件内起始偏移位置,随后开始向文件中写区块数据;
- 依次向文件中写入网络号、区块长度值、区块数据和前三项的crc32校验和,可以看出存于文件上的区块封装格式为: “&lt;网络号&gt;&lt;区块长度&gt;&lt;区块数据&gt;&lt;校验和&gt;”;
- 代码(6)处创建被写入区块对应的blockLocation对象,它由存储区块的文件的序号、区块在该文件内的起始偏移及封装后的区块长度构成,最后返回该blockLocation对象;
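结合readBlock()和writeBlock(),可以用一段假设的示意代码归纳文件中一条区块记录的解析方式(parseBlockRecord为本文虚构的辅助函数,仅用于说明格式,假设已导入encoding/binary、hash/crc32与errors包):
// 一条区块记录的格式: <网络号4字节><区块长度4字节><区块数据><校验和4字节>
func parseBlockRecord(record []byte) (network uint32, rawBlock []byte, err error) {
    if len(record) < 12 {
        return 0, nil, errors.New("record too short")
    }
    network = binary.LittleEndian.Uint32(record[0:4])   // 网络号
    blockLen := binary.LittleEndian.Uint32(record[4:8]) // 区块长度
    if int(blockLen) != len(record)-12 {
        return 0, nil, errors.New("block length mismatch")
    }
    rawBlock = record[8 : 8+blockLen] // 区块数据

    // 校验和覆盖前三项,使用Castagnoli多项式的CRC-32
    want := binary.BigEndian.Uint32(record[len(record)-4:])
    got := crc32.Checksum(record[:len(record)-4], crc32.MakeTable(crc32.Castagnoli))
    if got != want {
        return 0, nil, errors.New("checksum mismatch")
    }
    return network, rawBlock, nil
}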
dbCache
通过readBlock()和writeBlock()我们基本上可以了解blockStore的整个工作机制,它主要是通过一个LRU列表来管理已经打开的只读文件,并通过writeCursor来记录当前写入的文件及文件内偏移,在写入区块时,如果写入后超过了设置的最大文件Size,则另起一个新的文件写入。理解了这一点后,blockStore的其他代码均不难理解。接下来,我们主要分析dbCache的代码,先来看看它的定义:
//btcd/database/ffldb/dbcache.go
// dbCache provides a database cache layer backed by an underlying database. It
// allows a maximum cache size and flush interval to be specified such that the
// cache is flushed to the database when the cache size exceeds the maximum
// configured value or it has been longer than the configured interval since the
// last flush. This effectively provides transaction batching so that callers
// can commit transactions at will without incurring large performance hits due
// to frequent disk syncs.
type dbCache struct {
// ldb is the underlying leveldb DB for metadata.
ldb *leveldb.DB
// store is used to sync blocks to flat files.
store *blockStore
// The following fields are related to flushing the cache to persistent
// storage. Note that all flushing is performed in an opportunistic
// fashion. This means that it is only flushed during a transaction or
// when the database cache is closed.
//
// maxSize is the maximum size threshold the cache can grow to before
// it is flushed.
//
// flushInterval is the threshold interval of time that is allowed to
// pass before the cache is flushed.
//
// lastFlush is the time the cache was last flushed. It is used in
// conjunction with the current time and the flush interval.
//
// NOTE: These flush related fields are protected by the database write
// lock.
maxSize uint64
flushInterval time.Duration
lastFlush time.Time
// The following fields hold the keys that need to be stored or deleted
// from the underlying database once the cache is full, enough time has
// passed, or when the database is shutting down. Note that these are
// stored using immutable treaps to support O(1) MVCC snapshots against
// the cached data. The cacheLock is used to protect concurrent access
// for cache updates and snapshots.
cacheLock sync.RWMutex
cachedKeys *treap.Immutable
cachedRemove *treap.Immutable
}
其中各字段意义如下:
- ldb: 指向leveldb的DB对象,用于向leveldb中存取K/V;
- store: 指向当前db下的blockStore,在向leveldb中写元数据之前,先通过blockStore将区块缓存强制写入磁盘;
- maxSize: 简单地讲,它是缓存的待添加和删除的元数据的总大小限制,默认值为100M;
- flushInterval: 向leveldb中写数据的时间间隔;
- lastFlush: 上次向leveldb中写数据的时间戳;
- cacheLock: 对cachedKeys和cachedRemove进行读写保护,它们会在dbCache向leveldb写数据时更新,在dbCache快照时被读取;
- cachedKeys: 缓存待添加的Key,它指向一个树堆;
- cachedRemove: 缓存待删除的Key,它也指向一个树堆,请注意,cachedKeys和cachedRemove与transaction中的pendingKeys和pendingRemove有区别,pendingKeys和pendingRemove是可修改树堆(*treap.Mutable),而cachedKeys和cachedRemove是不可修改树堆(*treap.Immutable),且通常情况下(不满足needsFlush()时)pendingKeys和pendingRemove先向cachedKeys和cachedRemove同步,再向leveldb中更新,我们将在dbCache的commitTx()中更清楚地了解这一点。treap.Mutable和treap.Immutable将在本文最后介绍。
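结合maxSize、flushInterval和lastFlush,dbCache判断是否需要立即写入leveldb的条件大致如下(简化自needsFlush()的逻辑,示意写法,非逐行源码):
// 满足任一条件即需要flush: 距上次flush超过刷新周期,或缓存总量超过maxSize
func (c *dbCache) needsFlushSketch(tx *transaction) bool {
    if time.Since(c.lastFlush) > c.flushInterval {
        return true
    }
    // 缓存大小 = 快照中已缓存的Key总量 + 本次transaction待提交的Key总量
    snap := tx.snapshot
    totalSize := snap.pendingKeys.Size() + snap.pendingRemove.Size()
    totalSize += tx.pendingKeys.Size() + tx.pendingRemove.Size()
    return totalSize > c.maxSize
}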
我们在transaction的writePendingAndCommit()方法中看到,transaction Commit的最后一步就是调用dbCache的commitTx()来提交元数据的更新,所以我们先来看看commitTx()方法:
//btcd/database/ffldb/dbcache.go
// commitTx atomically adds all of the pending keys to add and remove into the
// database cache. When adding the pending keys would cause the size of the
// cache to exceed the max cache size, or the time since the last flush exceeds
// the configured flush interval, the cache will be flushed to the underlying
// persistent database.
//
// This is an atomic operation with respect to the cache in that either all of
// the pending keys to add and remove in the transaction will be applied or none
// of them will.
//
// The database cache itself might be flushed to the underlying persistent
// database even if the transaction fails to apply, but it will only be the
// state of the cache without the transaction applied.
//
// This function MUST be called during a database write transaction which in
// turn implies the database write lock will be held.
func (c *dbCache) commitTx(tx *transaction) error {
// Flush the cache and write the current transaction directly to the
// database if a flush is needed.
if c.needsFlush(tx) { (1)
if err := c.flush(); err != nil { (2)
return err
}
// Perform all leveldb updates using an atomic transaction.
err := c.commitTreaps(tx.pendingKeys, tx.pendingRemove) (3)
if err != nil {
return err
}
// Clear the transaction entries since they have been committed.
tx.pendingKeys = nil
tx.pendingRemove = nil
return nil
}
// At this point a database flush is not needed, so atomically commit
// the transaction to the cache.
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
newCachedKeys := c.cachedKeys
newCachedRemove := c.cachedRemove
c.cacheLock.RUnlock()
// Apply every key to add in the database transaction to the cache.
tx.pendingKeys.ForEach(func(k, v []byte) bool { (5)
newCachedRemove = newCachedRemove.Delete(k)
newCachedKeys = newCachedKeys.Put(k, v)
return true
})
tx.pendingKeys = nil
// Apply every key to remove in the database transaction to the cache.
tx.pendingRemove.ForEach(func(k, v []byte) bool { (6)
newCachedKeys = newCachedKeys.Delete(k)
newCachedRemove = newCachedRemove.Put(k, nil)
return true
})
tx.pendingRemove = nil
// Atomically replace the immutable treaps which hold the cached keys to
// add and delete.
c.cacheLock.Lock()
c.cachedKeys = newCachedKeys (7)
c.cachedRemove = newCachedRemove
c.cacheLock.Unlock()
return nil
}
其中的主要步骤为:
- 如果离上一次flush已经超过一个刷新周期,或者dbCache缓存的数据量已超过maxSize(即needsFlush()返回true),则调用flush()将树堆中的缓存写入leveldb,并将transaction中待添加和移除的Keys通过commitTreaps()方法直接写入leveldb,写完后清空pendingKeys和pendingRemove;
- 如果不需要flush,则代码(5)和(6)处将transaction中的pendingKeys添加到newCachedKeys中,将pendingRemove添加到newCachedRemove中,即将tx中待添加和删除的Keys写入dbCache。这里要注意两点: 1). 将pendingKeys中的Key添加到newCachedKeys时,得先将相同的Key从newCachedRemove中移除,以免写入leveldb时该Key被删除;向newCachedRemove添加Key时也须将相同的Key从newCachedKeys中移除,以免本来要删除的Key又被写入leveldb;2). cachedKeys和cachedRemove均是treap.Immutable指针,相应地,newCachedKeys和newCachedRemove也是treap.Immutable指针。treap.Immutable类型的树堆实现了类似于写时复制(COW)的机制来提高读写并发: 当通过Put()或者Delete()更新树堆的节点时,需要更新的节点会被复制一份,与不需要更新的老节点组成一棵新的树堆返回。代码(5)和(6)处newCachedKeys和newCachedRemove重新指向Delete()或者Put()调用的返回值,实际上是指向了一棵新的树堆,而c.cachedKeys和c.cachedRemove仍然指向修改之前的树堆,所以这时如果通过Snapshot()获取dbCache的快照,快照中的cachedKeys和cachedRemove并不包含transaction的pendingKeys和pendingRemove。这可以看成是dbCache的MVCC实现。
- 最后,代码(7)处更新dbCache中的cachedKeys和cachedRemove。请注意,更新操作通过c.cacheLock的写锁保护。更新c.cachedKeys和c.cachedRemove后,再通过Snapshot()拿到的dbCache快照中就包含了transaction提交的pendingKeys和pendingRemove;
接下来,我们看看flush的实现:
//btcd/database/ffldb/dbcache.go
// flush flushes the database cache to persistent storage. This involes syncing
// the block store and replaying all transactions that have been applied to the
// cache to the underlying database.
//
// This function MUST be called with the database write lock held.
func (c *dbCache) flush() error {
c.lastFlush = time.Now()
// Sync the current write file associated with the block store. This is
// necessary before writing the metadata to prevent the case where the
// metadata contains information about a block which actually hasn't
// been written yet in unexpected shutdown scenarios.
if err := c.store.syncBlocks(); err != nil { (1)
return err
}
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
cachedKeys := c.cachedKeys
cachedRemove := c.cachedRemove
c.cacheLock.RUnlock()
// Nothing to do if there is no data to flush.
if cachedKeys.Len() == 0 && cachedRemove.Len() == 0 {
return nil
}
// Perform all leveldb updates using an atomic transaction.
if err := c.commitTreaps(cachedKeys, cachedRemove); err != nil { (2)
return err
}
// Clear the cache since it has been flushed.
c.cacheLock.Lock()
c.cachedKeys = treap.NewImmutable() (3)
c.cachedRemove = treap.NewImmutable()
c.cacheLock.Unlock()
return nil
}
其中主要步骤为:
- 调用blockStore的syncBlocks()强制将文件缓冲写入磁盘文件,以防止meta数据与区块文件中的状态不一致;
- 通过commitTreaps()将dbCache中的缓存写入leveldb;
- 将cachedKeys和cachedRemove置为空的树堆,实际上是清空dbCache;
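flush()和commitTx()中都会调用commitTreaps()把树堆中缓存的Key同步到leveldb,其核心逻辑大致如下(基于goleveldb的事务接口的示意写法,实际实现通过一个小接口同时接受Mutable和Immutable树堆,并对错误作了封装):
// 在一个leveldb事务中,写入待添加的Key、删除待移除的Key,保证原子性
func (c *dbCache) commitTreapsSketch(pendingKeys, pendingRemove *treap.Immutable) error {
    ldbTx, err := c.ldb.OpenTransaction()
    if err != nil {
        return err
    }
    pendingKeys.ForEach(func(k, v []byte) bool {
        err = ldbTx.Put(k, v, nil)
        return err == nil
    })
    if err == nil {
        pendingRemove.ForEach(func(k, v []byte) bool {
            err = ldbTx.Delete(k, nil)
            return err == nil
        })
    }
    if err != nil {
        ldbTx.Discard()
        return err
    }
    return ldbTx.Commit()
}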
dbCache的commitTreaps()的实际实现与上面的示意基本一致,它主要是在leveldb的一个事务中调用Put和Delete,依次将待添加的Key写入leveldb、将待删除的Key从leveldb中移除,我们就不作专门分析了,读者可以自行阅读其源代码。我们来看看dbCache的Snapshot():
//btcd/database/ffldb/dbcache.go
// Snapshot returns a snapshot of the database cache and underlying database at
// a particular point in time.
//
// The snapshot must be released after use by calling Release.
func (c *dbCache) Snapshot() (*dbCacheSnapshot, error) {
dbSnapshot, err := c.ldb.GetSnapshot()
if err != nil {
str := "failed to open transaction"
return nil, convertErr(str, err)
}
// Since the cached keys to be added and removed use an immutable treap,
// a snapshot is simply obtaining the root of the tree under the lock
// which is used to atomically swap the root.
c.cacheLock.RLock()
cacheSnapshot := &dbCacheSnapshot{
dbSnapshot: dbSnapshot,
pendingKeys: c.cachedKeys,
pendingRemove: c.cachedRemove,
}
c.cacheLock.RUnlock()
return cacheSnapshot, nil
}
可以看到,它实际上就是通过leveldb的Snapshot、c.cachedKeys和c.cachedRemove构建一个dbCacheSnapshot对象,在dbCacheSnapshot中查找Key时,先从cachedKeys或cachedRemove查找,再从leveldb的Snapshot查找。transaction中的snapshot就是指向该对象。
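dbCacheSnapshot的Get()方法体现了这一查找顺序,其大致逻辑如下(示意写法,非逐行源码):
// 先查快照持有的树堆缓存,再回落到leveldb的快照
func (snap *dbCacheSnapshot) getSketch(key []byte) []byte {
    if snap.pendingRemove.Has(key) {
        return nil // 该Key已被标记删除,即使leveldb中还存在也视为不存在
    }
    if value := snap.pendingKeys.Get(key); value != nil {
        return value
    }
    value, err := snap.dbSnapshot.Get(key, nil)
    if err != nil {
        return nil
    }
    return value
}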
treap
通过上面几个方法的分析,我们就清楚了dbCache缓存Key、刷新缓存及读缓存的过程。dbCache中用于实际缓存的数据结构是treap.Immutable,它是dbCache的核心。Btcd中的treap既实现了Immutable版本,同时也提供了Mutable版本。接下来,我们就开始分析treap的实现。对于不了解treap的读者,可以阅读BYVoid同学写的《随机平衡二叉查找树Treap的分析与应用》。简单地讲,树堆是二叉查找树与堆的结合体,为了实现动态平衡,在二叉查找树的节点中引入一个随机值,用于对节点进行堆排序,让二叉查找树同时形成最大堆或者最小堆,从而保证其平衡性。树堆查找的时间复杂度为O(logN)。由于篇幅限制,我们不打算完整分析treap的代码,将主要通过分析Mutable和Immutable的Put()方法来了解treap的构建、添加节点后的旋转及Immutable的写时复制等过程。
我们先来看看Immutable、Mutable的定义:
//btcd/database/internal/treap/mutable.go
// Mutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance. Search, insert, and delete
// operations are all O(log n).
type Mutable struct {
root *treapNode
count int
// totalSize is the best estimate of the total size of of all data in
// the treap including the keys, values, and node sizes.
totalSize uint64
}
//btcd/database/internal/treap/immutable.go
// Immutable represents a treap data structure which is used to hold ordered
// key/value pairs using a combination of binary search tree and heap semantics.
// It is a self-organizing and randomized data structure that doesn't require
// complex operations to maintain balance. Search, insert, and delete
// operations are all O(log n). In addition, it provides O(1) snapshots for
// multi-version concurrency control (MVCC).
//
// All operations which result in modifying the treap return a new version of
// the treap with only the modified nodes updated. All unmodified nodes are
// shared with the previous version. This is extremely useful in concurrent
// applications since the caller only has to atomically replace the treap
// pointer with the newly returned version after performing any mutations. All
// readers can simply use their existing pointer as a snapshot since the treap
// it points to is immutable. This effectively provides O(1) snapshot
// capability with efficient memory usage characteristics since the old nodes
// only remain allocated until there are no longer any references to them.
type Immutable struct {
root *treapNode
count int
// totalSize is the best estimate of the total size of of all data in
// the treap including the keys, values, and node sizes.
totalSize uint64
}
Immutable和Mutable的定义完全一样,它们的区别在于Immutable提供了写时复制,这一点我们将在Put()方法中看到。其中的root字段指向树堆的根节点,节点的定义为:
//btcd/database/internal/treap/common.go
// treapNode represents a node in the treap.
type treapNode struct {
key []byte
value []byte
priority int
left *treapNode
right *treapNode
}
treapNode中的key和value就是树堆节点的值,priority是用于构建堆的随机修正值,也叫节点的优先级,left和right分别指向左右子树根节点。我们先来看看Mutable的Put()方法,来了解树堆的构建和插入节点后的旋转过程:
//btcd/database/internal/treap/mutable.go
// Put inserts the passed key/value pair.
func (t *Mutable) Put(key, value []byte) {
// Use an empty byte slice for the value when none was provided. This
// ultimately allows key existence to be determined from the value since
// an empty byte slice is distinguishable from nil.
if value == nil {
value = emptySlice
}
// The node is the root of the tree if there isn't already one.
if t.root == nil { (1)
node := newTreapNode(key, value, rand.Int())
t.count = 1
t.totalSize = nodeSize(node)
t.root = node
return
}
// Find the binary tree insertion point and construct a list of parents
// while doing so. When the key matches an entry already in the treap,
// just update its value and return.
var parents parentStack
var compareResult int
for node := t.root; node != nil; {
parents.Push(node)
compareResult = bytes.Compare(key, node.key)
if compareResult < 0 {
node = node.left (2)
continue
}
if compareResult > 0 {
node = node.right (3)
continue
}
// The key already exists, so update its value.
t.totalSize -= uint64(len(node.value))
t.totalSize += uint64(len(value))
node.value = value (4)
return
}
// Link the new node into the binary tree in the correct position.
node := newTreapNode(key, value, rand.Int()) (5)
t.count++
t.totalSize += nodeSize(node)
parent := parents.At(0)
if compareResult < 0 {
parent.left = node (6)
} else {
parent.right = node (7)
}
// Perform any rotations needed to maintain the min-heap.
for parents.Len() > 0 {
// There is nothing left to do when the node's priority is
// greater than or equal to its parent's priority.
parent = parents.Pop()
if node.priority >= parent.priority { (8)
break
}
// Perform a right rotation if the node is on the left side or
// a left rotation if the node is on the right side.
if parent.left == node {
node.right, parent.left = parent, node.right (9)
} else {
node.left, parent.right = parent, node.left (10)
}
t.relinkGrandparent(node, parent, parents.At(0))
}
}
......
// relinkGrandparent relinks the node into the treap after it has been rotated
// by changing the passed grandparent's left or right pointer, depending on
// where the old parent was, to point at the passed node. Otherwise, when there
// is no grandparent, it means the node is now the root of the tree, so update
// it accordingly.
func (t *Mutable) relinkGrandparent(node, parent, grandparent *treapNode) {
// The node is now the root of the tree when there is no grandparent.
if grandparent == nil {
t.root = node (11)
return
}
// Relink the grandparent's left or right pointer based on which side
// the old parent was.
if grandparent.left == parent {
grandparent.left = node (12)
} else {
grandparent.right = node (13)
}
}
其中的主要步骤为:
- 对于空树,添加的第一个节点直接成为根节点,如代码(1)处所示,可以看到,节点的priority是由rand.Int()生成的随机整数;
- 对于非空树,根据Key来查找待插入的位置,并通过parentStack来记录查找路径。从根节点开始,如果待插入的Key小于当前节点的Key,则进入左子树继续查找,如代码(2)处所示;如果待插入的Key大于当前节点的Key,则进入右子树继续查找,如代码(3)处所示;如果待插入的Key正好是当前节点的Key,则直接更新其Value,如代码(4)处所示;
- 当树中没有找到Key,则应插入新的节点,此时parents中的最后一个节点就是新节点的父节点,请注意,parents.At(0)是查找路径上的最后一个节点。如果待插入的Key小于父节点的Key,则新节点变成父节点的左子节点,如代码(6)处所示;否则,成为右子节点,如代码(7)处所示;
- 由于新节点的priority是随机产生的,它插入树中后,树可能不再满足最小堆性质,所以接下来需要进行旋转。旋转过程需要向上递归进行,直到整棵树满足最小堆序。代码(8)处,如果新节点的优先级大于或者等于父节点的优先级,则不用旋转,树已经满足最小堆序;如果新节点的优先级小于父节点的优先级,则需要旋转,将父节点变成新节点的子节点。如果新节点是父节点的左子节点,则需要进行右旋,如代码(9)所示;如果新节点是父节点的右子节点,则需要进行左旋,如代码(10)所示;
- 进行左旋或右旋后,原父节点变成新节点的子节点,但祖节点(原父节点的父节点)的子节点还指向原父节点,relinkGrandparent()将继续完成旋转过程。如果祖节点是空,则说明原父节点就是树的根,不需要调整直接将新节点变成树的根即可,如代码(11)处所示;代码(12)和(13)实际上是将新节点替代原父节点,变成祖节点的左子节点或者右子节点;
- 新节点、原父节点、祖节点完成旋转后,新节点取代原父节点的位置,原父节点变成新节点的子节点,祖节点不变;但此时新节点的优先级可能仍然小于祖节点的优先级,这时新节点、祖节点及祖节点的父节点还要继续旋转。这一过程向上递归直到根节点,保证查找路径上的节点均满足最小堆序,才完成整个旋转过程及新节点的插入过程。
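除Put()外,treap包还提供了Get()、Has()、Delete()、ForEach()等方法,Mutable的基本用法大致如下(treap是btcd的internal包,外部无法直接导入,这里仅作演示):
t := treap.NewMutable()
t.Put([]byte("0001"), []byte("value1"))
t.Put([]byte("0002"), []byte("value2"))

fmt.Println(t.Len())                       // 2
fmt.Println(t.Has([]byte("0001")))         // true
fmt.Println(string(t.Get([]byte("0002")))) // value2

// 按Key的字典序遍历所有节点,回调返回false可提前结束遍历
t.ForEach(func(k, v []byte) bool {
    fmt.Printf("%s=%s\n", k, v)
    return true
})

t.Delete([]byte("0001"))
fmt.Println(t.Len()) // 1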
从Mutable的Put()方法中,我们可以完整地了解treap的构建、插入及涉及到的子树旋转过程。Immutable的Put()与Mutable的Put()实现步骤大致一致,不同的是,Immutable没有直接修改原节点或旋转原树,而是将查找路径上的所有节点均复制一份出来与原树的其它节点一起形成一颗新的树,在新树上进行更新或者旋转后返回新树。它的实现如下:
//btcd/database/internal/treap/immutable.go
// Put inserts the passed key/value pair.
func (t *Immutable) Put(key, value []byte) *Immutable {
// Use an empty byte slice for the value when none was provided. This
// ultimately allows key existence to be determined from the value since
// an empty byte slice is distinguishable from nil.
if value == nil {
value = emptySlice
}
// The node is the root of the tree if there isn't already one.
if t.root == nil {
root := newTreapNode(key, value, rand.Int())
return newImmutable(root, 1, nodeSize(root)) (1)
}
// Find the binary tree insertion point and construct a replaced list of
// parents while doing so. This is done because this is an immutable
// data structure so regardless of where in the treap the new key/value
// pair ends up, all ancestors up to and including the root need to be
// replaced.
//
// When the key matches an entry already in the treap, replace the node
// with a new one that has the new value set and return.
var parents parentStack
var compareResult int
for node := t.root; node != nil; {
// Clone the node and link its parent to it if needed.
nodeCopy := cloneTreapNode(node)
if oldParent := parents.At(0); oldParent != nil {
if oldParent.left == node {
oldParent.left = nodeCopy (2)
} else {
oldParent.right = nodeCopy (3)
}
}
parents.Push(nodeCopy) (4)
// Traverse left or right depending on the result of comparing
// the keys.
compareResult = bytes.Compare(key, node.key)
if compareResult < 0 {
node = node.left
continue
}
if compareResult > 0 {
node = node.right
continue
}
// The key already exists, so update its value.
nodeCopy.value = value (5)
// Return new immutable treap with the replaced node and
// ancestors up to and including the root of the tree.
newRoot := parents.At(parents.Len() - 1) (6)
newTotalSize := t.totalSize - uint64(len(node.value)) + (7)
uint64(len(value))
return newImmutable(newRoot, t.count, newTotalSize) (8)
}
// Link the new node into the binary tree in the correct position.
node := newTreapNode(key, value, rand.Int())
parent := parents.At(0)
if compareResult < 0 {
parent.left = node
} else {
parent.right = node
}
// Perform any rotations needed to maintain the min-heap and replace
// the ancestors up to and including the tree root.
newRoot := parents.At(parents.Len() - 1)
for parents.Len() > 0 {
// There is nothing left to do when the node's priority is
// greater than or equal to its parent's priority.
parent = parents.Pop()
if node.priority >= parent.priority {
break
}
// Perform a right rotation if the node is on the left side or
// a left rotation if the node is on the right side.
if parent.left == node {
node.right, parent.left = parent, node.right
} else {
node.left, parent.right = parent, node.left
}
// Either set the new root of the tree when there is no
// grandparent or relink the grandparent to the node based on
// which side the old parent the node is replacing was on.
grandparent := parents.At(0)
if grandparent == nil {
newRoot = node
} else if grandparent.left == parent {
grandparent.left = node
} else {
grandparent.right = node
}
}
return newImmutable(newRoot, t.count+1, t.totalSize+nodeSize(node)) (9)
}
其与Mutable的Put()方法的主要区别在于:
- 如果向空树中插入一个节点,与直接将新节点变成原树的根不同,它将以新节点为根创建一个新的树堆并返回,如代码(1)所示;
- 在查找待插入的Key时,查找路径上的所有节点被复制一份出来,如代码(2)、(3)和(4)处所示。如果找到了待插入的Key,则是在复制的节点上更新Value,而不是在原节点上更新,如代码(5)处所示;节点更新后,将以复制出来的根节点来创建一棵新的树并返回,如代码(6)、(7)和(8)处所示;
- 接下来,如果待插入的Key不在树中,将添加一个新的节点,且新的节点被加入到复制出来的父节点中,然后在复制的新树上进行旋转,最后返回新的树,如代码(9)所示。需要注意的是,原树上节点均没有更新,原树与新树共享查找路径以外的其他节点。
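可以用一小段示意代码直观感受Immutable的写时复制效果(仅作演示):
t1 := treap.NewImmutable()
t2 := t1.Put([]byte("k1"), []byte("v1")) // 返回新树,t1仍为空树
t3 := t2.Put([]byte("k2"), []byte("v2")) // 返回新树,t2中不包含k2

fmt.Println(t1.Len(), t2.Len(), t3.Len()) // 0 1 2

// t2相当于t3修改之前的一个快照,这正是dbCache实现O(1)快照(MVCC)的基础
fmt.Println(t2.Has([]byte("k2"))) // false
fmt.Println(t3.Has([]byte("k2"))) // true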
Immutable的Put()方法通过复制查找路径上的节点并返回新的树根实现了写时复制,进而支持了dbCache的MVCC。到此,ffldb的整个工作机制我们就介绍完了,其中的blockStore和dbCache及dbCache使用的数据结构treap我们也作了详细分析,相信大家对Bitcoin节点查找区块和将区块存入磁盘的过程有了完整而清晰的认识。接下来的文章,我们将介绍Btcd中网络协议的实现,揭示区块在P2P网络中的传递过程。