/*
* References:
* [1]. Alexandrescu, Andrei. "Modern C++ Design: Generic Programming and Design
* Patterns Applied". Copyright (c) 2001. Addison-Wesley.
*
* The source code studied in this article comes from SmallObj.h and SmallObj.cpp in the Loki library.
*/
#include <boost/progress.hpp>
#include <iostream>
#include <string>
#include <vector>
using namespace boost;
using namespace std;

struct SmallObj { char c; };

int main()
{
    progress_timer t;
    SmallObj *s = new SmallObj();           // case 1
    vector<char> vc;                        // case 2
    vector<string> vs;                      // case 3
    vector<int> *vi = new vector<int>(1);   // case 4
    std::cout << t.elapsed() << std::endl;
}
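I wrote a small test with Boost's progress_timer to get a rough idea of how long C++'s default new takes.
Machine: 2.26 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, Mac OS X 10.7.3.
1. Case 1 alone: about 3 microseconds on average. This is the cost of dynamically allocating one small object.
2. Cases 2 and 3 together: between 3 and 4 microseconds. Small-object memory comes from std::alloc; in this situation case 3 is usually served from the segregated free list that case 2 has already set up, so its cost should be very low.
3. Case 4 alone: between 4 and 6 microseconds.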
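According to [1], the default free store allocator is slow, roughly an order of magnitude slower than the small-object allocator written for Loki. Let's look at how Loki's Small-Object Allocator is implemented and which optimization tricks it uses.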
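Loki's small-object allocator has four layers. At the bottom is the Chunk struct. Each Chunk object contains and manages one large piece of memory, which itself consists of a whole number of fixed-size blocks. A Chunk holds the bookkeeping needed to allocate and return blocks; when no blocks remain in the Chunk, allocation fails and returns zero.
The second layer is FixedAllocator, which is built from Chunks. Its main job is to satisfy requests whose running total exceeds the capacity of a single Chunk, and it does so by collecting Chunks in a vector. If a new request arrives and every Chunk in the vector is full, FixedAllocator creates a new Chunk, appends it to the vector, and serves the request from it.
The third layer, SmallObjAllocator, provides general-purpose allocation and deallocation functions. It owns several FixedAllocator objects, each responsible for one specific object size. Depending on the number of bytes requested, SmallObjAllocator dispatches the request to one of its FixedAllocators. If the request is too large, it is forwarded to the system's ::operator new.
The last layer is SmallObject, which wraps SmallObjAllocator to offer well-encapsulated allocation services to C++ classes. SmallObject overloads operator new and operator delete and hands the work to a SmallObjAllocator object.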
Chunk is defined as follows:
struct Chunk
{
    void Init(std::size_t blockSize, unsigned char blocks);
    void* Allocate(std::size_t blockSize);
    void Deallocate(void* p, std::size_t blockSize);
    void Reset(std::size_t blockSize, unsigned char blocks);
    void Release();
    unsigned char* pData_;
    unsigned char firstAvailableBlock_, blocksAvailable_;
};
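pData_ points to the managed memory itself. Besides that, a Chunk stores two integer values: firstAvailableBlock_, the index of the first available block in the chunk, and blocksAvailable_, the number of blocks still available in the chunk.
Because firstAvailableBlock_ and blocksAvailable_ are both unsigned char, a chunk on a machine with 8-bit chars cannot hold more than 255 blocks. While a block is unused, its first byte holds the index of the next unused block. Since firstAvailableBlock_ already holds the index of the first available block, we get a free list of available blocks without using any extra memory.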
void FixedAllocator::Chunk::Init(std::size_t blockSize, unsigned char blocks)
{
    assert(blockSize > 0);
    assert(blocks > 0);
    // Overflow check
    assert((blockSize * blocks) / blockSize == blocks);

    pData_ = new unsigned char[blockSize * blocks];
    Reset(blockSize, blocks);
}

void FixedAllocator::Chunk::Reset(std::size_t blockSize, unsigned char blocks)
{
    assert(blockSize > 0);
    assert(blocks > 0);
    // Overflow check
    assert((blockSize * blocks) / blockSize == blocks);

    firstAvailableBlock_ = 0;
    blocksAvailable_ = blocks;

    unsigned char i = 0;
    unsigned char* p = pData_;
    for (; i != blocks; p += blockSize)
    {
        *p = ++i;
    }
}
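Two tricks in Init and Reset deserve some thought:
1. What is the benefit of recording an index in the first byte of each available block and combining it with firstAvailableBlock_ to form a free list?
2. Why is the number of blocks a chunk manages limited to the upper bound of unsigned char (255)?
Answers:
1. The free list lives inside the data structure itself, so it needs no extra memory, and it gives an efficient way to find an available block in the chunk.
2. Suppose Chunk were changed into a class template, as follows: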
template<typename T>
struct Chunk
{
    void Init(std::size_t blockSize, T blocks);
    void Release();
    void* Allocate(std::size_t blockSize);
    void Deallocate(void* p, std::size_t blockSize);
    unsigned char* pData_;
    T firstAvailableBlock_, blocksAvailable_;
};

template<typename T>
void Chunk<T>::Init(std::size_t blockSize, T blocks)
{
    pData_ = new unsigned char[blockSize * blocks];
    firstAvailableBlock_ = 0;
    blocksAvailable_ = blocks;
    T i = 0;
    unsigned char* p = pData_;  // walk the blocks from the start of the chunk
    for (; i != blocks; p += blockSize)
        *p = ++i;  // stores one byte only; a full T index needs sizeof(T) bytes per block
}
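With this change, if T is unsigned short, the upper limit becomes 65535. However, we can then no longer allocate blocks smaller than sizeof(unsigned short). That is not a big problem: smaller requests can be rounded up to sizeof(unsigned short), at the cost of some internal fragmentation.
The other problem is alignment. Suppose the block size is 5 bytes: what type should the block index be, unsigned short or unsigned int? Converting a pointer to such a 5-byte block into a pointer to unsigned int and dereferencing it leads to undefined behavior. Even reading 4 of the 5 bytes in a specific byte order and assembling them into an unsigned int takes extra work, and that overhead may well cancel out the efficiency the design was supposed to gain.
So capping the number of blocks at 255, the upper limit of unsigned char, is a sensible choice. For one thing, chunks do not get very large. For another, a char is 1 byte and has no alignment problem, and the raw memory pointer is itself a pointer to unsigned char.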
Allocate() takes the block that firstAvailableBlock_ refers to, then updates firstAvailableBlock_ so that it refers to the next available block.
void* FixedAllocator::Chunk::Allocate(std::size_t blockSize)
{
    if (!blocksAvailable_) return 0;

    assert((firstAvailableBlock_ * blockSize) / blockSize == firstAvailableBlock_);

    unsigned char* pResult = pData_ + (firstAvailableBlock_ * blockSize);
    firstAvailableBlock_ = *pResult;
    --blocksAvailable_;

    return pResult;
}
Note that this Allocate is very cheap, because no search is needed.
The deallocation function, Deallocate:
void FixedAllocator::Chunk::Deallocate(void* p, std::size_t blockSize)
{
    assert(p >= pData_);

    unsigned char* toRelease = static_cast<unsigned char*>(p);
    // Alignment check
    assert((toRelease - pData_) % blockSize == 0);

    *toRelease = firstAvailableBlock_;
    firstAvailableBlock_ = static_cast<unsigned char>(
        (toRelease - pData_) / blockSize);
    // Truncation check
    assert(firstAvailableBlock_ == (toRelease - pData_) / blockSize);

    ++blocksAvailable_;
}
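When reading Allocate and Deallocate together, keep a few points in mind:
1. pData_ always points to the start of the memory the chunk manages.
2. firstAvailableBlock_ always records the index of the first available block.
3. Most important of all: after any number of allocations and deallocations, the first byte of every available block holds the index of the next available block.
4. One request yields exactly one block; several blocks cannot be allocated in one call.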
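The second layer of the small-object allocator is FixedAllocator. FixedAllocator allocates and returns blocks of one specific size, and its capacity is not limited by a single chunk, because it keeps the chunks it creates in a vector. Any request is served by a suitable chunk; if all chunks are full, a new one is created and added to the vector. Here is the definition of FixedAllocator: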
class FixedAllocator
{
private:
    void DoDeallocate(void *p);
    bool MakeNewChunk(void);
    Chunk * VicinityFind(void *p) const;

    FixedAllocator(const FixedAllocator&);            // copy constructor, not implemented
    FixedAllocator& operator=(const FixedAllocator&); // copy assignment, not implemented

    typedef std::vector<Chunk> Chunks;
    typedef Chunks::iterator ChunkIter;
    typedef Chunks::const_iterator ChunkCIter;

    static unsigned char MinObjectsPerChunk_; // fewest # of objects managed by a chunk
    static unsigned char MaxObjectsPerChunk_; // maximum # of objects managed by a chunk

    std::size_t blockSize_;   // this FixedAllocator only manages chunks whose blocks are blockSize_ bytes
    unsigned char numBlocks_;
    Chunks chunks_;           // container of Chunks
    Chunk * allocChunk_;      // ptr to chunk used for last/next allocation
    Chunk * deallocChunk_;    // ptr to chunk used for last/next deallocation
    Chunk * emptyChunk_;      // ptr to the only empty chunk if there is one, else null

public:
    FixedAllocator();
    ~FixedAllocator();

    void Initialize(std::size_t blockSize, std::size_t pageSize);
    void * Allocate(void);
    bool Deallocate(void *p, Chunk *hint);
    inline std::size_t BlockSize() const { return blockSize_; }

    bool TrimEmptyChunk(void);  // releases the memory used by the empty chunk
    bool TrimChunkList(void);
    // Returns count of empty Chunks held by this allocator.
    std::size_t CountEmptyChunks(void) const;
    bool IsCorrupt(void) const;

    const Chunk * HasBlock(void *p) const;
    inline Chunk * HasBlock(void *p)
    {
        return const_cast<Chunk *>(
            const_cast<const FixedAllocator *>( this )->HasBlock(p) );
    }
};
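Here allocChunk_ points to the chunk used by the most recent allocation. Every allocation request first checks that chunk; if it still has free space, the request is served from it. Otherwise a linear search is triggered. In either case allocChunk_ is updated to point to the chunk just found or just added, which speeds up the next allocation. Below is the implementation of FixedAllocator::Allocate(void) from loki_0.1.7: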
void * FixedAllocator::Allocate( void )
{
    // Initial state, or the chunk used for the last allocation is full.
    if ( ( NULL == allocChunk_ ) || allocChunk_->IsFilled() )
    {
        if ( NULL != emptyChunk_ )
        {
            // A completely empty chunk is on hand; allocate from it.
            allocChunk_ = emptyChunk_;
            emptyChunk_ = NULL;
        }
        else
        {
            // Otherwise scan chunks_ for a chunk with a free block.
            for ( ChunkIter i( chunks_.begin() ); ; ++i )
            {
                if ( chunks_.end() == i )
                {
                    // None found: create a new chunk and append it to chunks_.
                    // allocChunk_ and deallocChunk_ are updated as well.
                    if ( !MakeNewChunk() )
                        return NULL;
                    break;
                }
                if ( !i->IsFilled() )
                {
                    // First fit: this chunk has a free block, allocate from it.
                    allocChunk_ = &*i;
                    break;
                }
            }
        }
    }
    else if ( allocChunk_ == emptyChunk_ )
        // allocChunk_ is about to stop being empty, so clear emptyChunk_.
        emptyChunk_ = NULL;

    void * place = allocChunk_->Allocate( blockSize_ );
    return place;
}
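Deallocation is more awkward, because we do not know which chunk the returned block belongs to. We could walk through chunks_ and check whether the pointer falls between pData_ and pData_ + blockSize_ * numBlocks_, then deallocate in the chunk found. But then returning memory takes linear time. So Loki adds an optimization: a member pointer deallocChunk_ that points to the chunk used by the last deallocation. Every deallocation first checks the chunk deallocChunk_ points to; if it is the wrong chunk, a linear search follows, deallocChunk_ is updated to the chunk found, and the block is deallocated there.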
bool FixedAllocator::Deallocate( void * p, Chunk * hint )
{
    Chunk * foundChunk = ( NULL == hint ) ? VicinityFind( p ) : hint;
    if ( NULL == foundChunk )
        return false;

    deallocChunk_ = foundChunk;
    DoDeallocate(p);
    return true;
}

Chunk * FixedAllocator::VicinityFind( void * p ) const
{
    if ( chunks_.empty() ) return NULL;
    assert(deallocChunk_);

    const std::size_t chunkLength = numBlocks_ * blockSize_;
    Chunk * lo = deallocChunk_;
    Chunk * hi = deallocChunk_ + 1;
    const Chunk * loBound = &chunks_.front();
    const Chunk * hiBound = &chunks_.back() + 1;

    // Special case: deallocChunk_ is the last in the array
    if (hi == hiBound) hi = NULL;  // boundary condition

    for (;;)
    {
        if (lo)
        {
            if ( lo->HasBlock( p, chunkLength ) ) return lo;
            if ( lo == loBound )
            {
                lo = NULL;
                if ( NULL == hi ) break;
            }
            else --lo;
        }

        if (hi)
        {
            if ( hi->HasBlock( p, chunkLength ) ) return hi;
            if ( ++hi == hiBound )
            {
                hi = NULL;
                if ( NULL == lo ) break;
            }
        }
    }

    return NULL;
}

void FixedAllocator::DoDeallocate(void* p)
{
    // Show that deallocChunk_ really owns the block at address p.
    assert( deallocChunk_->HasBlock( p, numBlocks_ * blockSize_ ) );
    // Either of the next two assertions may fail if somebody tries to
    // delete the same block twice.
    assert( emptyChunk_ != deallocChunk_ );
    assert( !deallocChunk_->HasAvailable( numBlocks_ ) );
    // prove either emptyChunk_ points nowhere, or points to a truly empty Chunk.
    assert( ( NULL == emptyChunk_ ) || ( emptyChunk_->HasAvailable( numBlocks_ ) ) );

    // call into the chunk, will adjust the inner list but won't release memory
    deallocChunk_->Deallocate(p, blockSize_);

    if ( deallocChunk_->HasAvailable( numBlocks_ ) )  // deallocChunk_ is now completely empty
    {
        assert( emptyChunk_ != deallocChunk_ );
        // deallocChunk_ is empty, but a Chunk is only released if there are 2
        // empty chunks. Since emptyChunk_ may only point to a previously
        // cleared Chunk, if it points to something else besides deallocChunk_,
        // then FixedAllocator currently has 2 empty Chunks.
        if ( NULL != emptyChunk_ )
        {
            // If last Chunk is empty, just change what deallocChunk_
            // points to, and release the last. Otherwise, swap an empty
            // Chunk with the last, and then release it.
            Chunk * lastChunk = &chunks_.back();
            if ( lastChunk == deallocChunk_ )
                deallocChunk_ = emptyChunk_;
            else if ( lastChunk != emptyChunk_ )
                std::swap( *emptyChunk_, *lastChunk );
            assert( lastChunk->HasAvailable( numBlocks_ ) );
            lastChunk->Release();
            chunks_.pop_back();
            if ( ( allocChunk_ == lastChunk ) || allocChunk_->IsFilled() )
                allocChunk_ = deallocChunk_;
        }
        emptyChunk_ = deallocChunk_;
    }

    // prove either emptyChunk_ points nowhere, or points to a truly empty Chunk.
    assert( ( NULL == emptyChunk_ ) || ( emptyChunk_->HasAvailable( numBlocks_ ) ) );
}
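Note the VicinityFind function. It keeps two cursors, lo and hi (I do not see why they were not wrapped as standard iterators), which start at deallocChunk_ and move one step down and one step up on each iteration. It is still a linear search, but one that spreads outward from the chunk used by the last deallocation (it has a slight divide-and-conquer feel to it).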
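In DoDeallocate, when the chunk deallocChunk_ points to has just become empty and emptyChunk_ is also set, the FixedAllocator most likely holds two empty chunks at once (the exception being that deallocChunk_ and emptyChunk_ point to the same chunk), so one of them has to be released. From the statements lastChunk->Release(); chunks_.pop_back(); we can tell that the release always happens on the last chunk object in the chunks_ vector (lastChunk). Before that, deallocChunk_ and emptyChunk_ have to be adjusted for two cases:
1. If deallocChunk_ points to the last chunk, point deallocChunk_ at the old emptyChunk_.
2. If lastChunk is not the chunk emptyChunk_ points to, swap the two chunks (the chunk objects, not the pointers).
The reason for keeping one empty chunk around is to avoid a boundary condition in which allocChunk_ and deallocChunk_ point to the same chunk and that chunk has no free space left. Following the normal Allocate() logic, a new chunk is created, added to chunks_, and the allocation is served from it. But suppose the block just allocated is returned right away. deallocChunk_ finds the new chunk, the block is returned, the chunk turns out to be empty, and it is released. If this pattern repeats, the performance loss is considerable. So a chunk is released only when two empty chunks are found.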
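The third layer of Loki's allocator is SmallObjAllocator, which can allocate objects of any size. SmallObjAllocator aggregates N FixedAllocators. When it receives an allocation request, it passes the request to the best-matching FixedAllocator, or otherwise forwards it to the default ::operator new. SmallObjAllocator is declared as follows: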
class LOKI_EXPORT SmallObjAllocator
{
protected:
    SmallObjAllocator( std::size_t pageSize, std::size_t maxObjectSize,
        std::size_t objectAlignSize );
    ~SmallObjAllocator( void );

public:
    void * Allocate( std::size_t size, bool doThrow );
    void Deallocate( void * p, std::size_t size );  // extra parameter: size of the block being returned
    void Deallocate( void * p );

    inline std::size_t GetMaxObjectSize() const { return maxSmallObjectSize_; }
    /// Returns # of bytes between allocation boundaries.
    inline std::size_t GetAlignment() const { return objectAlignSize_; }

    bool TrimExcessMemory( void );
    bool IsCorrupt( void ) const;

private:
    SmallObjAllocator( void );
    SmallObjAllocator( const SmallObjAllocator & );
    SmallObjAllocator & operator = ( const SmallObjAllocator & );

    Loki::FixedAllocator * pool_;  // array instead of vector. (modification)
    /// Largest object size supported by allocators.
    /// Anything larger is forwarded to the default ::operator new.
    const std::size_t maxSmallObjectSize_;
    /// Size of alignment boundaries.
    const std::size_t objectAlignSize_;
};
void * SmallObjAllocator::Allocate( std::size_t numBytes, bool doThrow )
{
    if ( numBytes > GetMaxObjectSize() )
        return DefaultAllocator( numBytes, doThrow );

    assert( NULL != pool_ );
    if ( 0 == numBytes ) numBytes = 1;
    const std::size_t index = GetOffset( numBytes, GetAlignment() ) - 1;
    const std::size_t allocCount = GetOffset( GetMaxObjectSize(), GetAlignment() );
    (void) allocCount;
    assert( index < allocCount );

    FixedAllocator & allocator = pool_[ index ];
    assert( allocator.BlockSize() >= numBytes );
    assert( allocator.BlockSize() < numBytes + GetAlignment() );
    void * place = allocator.Allocate();

    if ( ( NULL == place ) && TrimExcessMemory() )
        place = allocator.Allocate();

    if ( ( NULL == place ) && doThrow )
    {
#ifdef _MSC_VER
        throw std::bad_alloc( "could not allocate small object" );
#else
        // GCC did not like a literal string passed to std::bad_alloc.
        // so just throw the default-constructed exception.
        throw std::bad_alloc();
#endif
    }
    return place;
}
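pool_ is an array of FixedAllocators. To find the FixedAllocator responsible for a given size simply and efficiently, the array index is computed directly from the requested size: the size is rounded up to a multiple of the alignment, and pool_[index] handles blocks of that rounded size. But nothing is ever perfect, is it? An application might only create objects of two sizes, say 4 bytes and 32 bytes, and nothing else, yet pool_ still has to hold FixedAllocators for all the other sizes as well.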
void SmallObjAllocator::Deallocate( void * p, std::size_t numBytes )
{
    if ( NULL == p ) return;
    if ( numBytes > GetMaxObjectSize() )
    {
        DefaultDeallocator( p );
        return;
    }
    assert( NULL != pool_ );
    if ( 0 == numBytes ) numBytes = 1;
    const std::size_t index = GetOffset( numBytes, GetAlignment() ) - 1;
    const std::size_t allocCount = GetOffset( GetMaxObjectSize(), GetAlignment() );
    (void) allocCount;
    assert( index < allocCount );
    FixedAllocator & allocator = pool_[ index ];
    assert( allocator.BlockSize() >= numBytes );
    assert( allocator.BlockSize() < numBytes + GetAlignment() );
    const bool found = allocator.Deallocate( p, NULL );
    (void) found;
    assert( found );
}

// This overload only receives the pointer to the block being returned, so it
// has no choice but to scan the FixedAllocators to find the one that owns it.
void SmallObjAllocator::Deallocate( void * p )
{
    if ( NULL == p ) return;
    assert( NULL != pool_ );
    FixedAllocator * pAllocator = NULL;
    const std::size_t allocCount = GetOffset( GetMaxObjectSize(), GetAlignment() );
    Chunk * chunk = NULL;

    for ( std::size_t ii = 0; ii < allocCount; ++ii )
    {
        chunk = pool_[ ii ].HasBlock( p );
        if ( NULL != chunk )
        {
            pAllocator = &pool_[ ii ];
            break;
        }
    }
    if ( NULL == pAllocator )
    {
        DefaultDeallocator( p );
        return;
    }

    assert( NULL != chunk );
    const bool found = pAllocator->Deallocate( p, chunk );
    (void) found;
    assert( found );
}
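SmallObjAllocator is just a thin wrapper around FixedAllocator.
The fourth layer wraps the functionality of the third layer in a form that is easier to use. SmallObject overloads operator new and operator delete at class level, so they are used for its objects instead of the default ::operator new and ::operator delete. Whenever a SmallObject is created, the overloaded operators send the allocation request down to the underlying FixedAllocator. It is defined as follows: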
// Note: in a real header SmallObjectBase (shown second here) has to be
// declared before SmallObject.
template
<
    template <class, class> class ThreadingModel = LOKI_DEFAULT_THREADING_NO_OBJ_LEVEL,
    std::size_t chunkSize = LOKI_DEFAULT_CHUNK_SIZE,
    std::size_t maxSmallObjectSize = LOKI_MAX_SMALL_OBJECT_SIZE,
    std::size_t objectAlignSize = LOKI_DEFAULT_OBJECT_ALIGNMENT,
    template <class> class LifetimePolicy = LOKI_DEFAULT_SMALLOBJ_LIFETIME,
    class MutexPolicy = LOKI_DEFAULT_MUTEX
>
class SmallObject : public SmallObjectBase< ThreadingModel, chunkSize,
        maxSmallObjectSize, objectAlignSize, LifetimePolicy, MutexPolicy >
{
public:
    virtual ~SmallObject() {}
protected:
    inline SmallObject( void ) {}
private:
    /// Copy-constructor is not implemented.
    SmallObject( const SmallObject & );
    /// Copy-assignment operator is not implemented.
    SmallObject & operator = ( const SmallObject & );
}; // end class SmallObject

template
<
    template <class, class> class ThreadingModel,
    std::size_t chunkSize,
    std::size_t maxSmallObjectSize,
    std::size_t objectAlignSize,
    template <class> class LifetimePolicy,
    class MutexPolicy
>
class SmallObjectBase
{
public:
    typedef AllocatorSingleton< ThreadingModel, chunkSize,
        maxSmallObjectSize, objectAlignSize, LifetimePolicy > ObjAllocatorSingleton;

private:
    typedef ThreadingModel< ObjAllocatorSingleton, MutexPolicy > MyThreadingModel;
    typedef typename ObjAllocatorSingleton::MyAllocatorSingleton MyAllocatorSingleton;

public:
    static void * operator new ( std::size_t size, const std::nothrow_t & ) throw ()
    {
        typename MyThreadingModel::Lock lock;
        (void)lock; // get rid of warning
        return MyAllocatorSingleton::Instance().Allocate( size, false );
    }

    /// Placement single-object new merely calls global placement new.
    inline static void * operator new ( std::size_t size, void * place )
    {
        return ::operator new( size, place );
    }

    static void operator delete ( void * p, std::size_t size ) throw ()
    {
        typename MyThreadingModel::Lock lock;
        (void)lock; // get rid of warning
        MyAllocatorSingleton::Instance().Deallocate( p, size );
    }

    static void operator delete ( void * p, const std::nothrow_t & ) throw()
    {
        typename MyThreadingModel::Lock lock;
        (void)lock; // get rid of warning
        MyAllocatorSingleton::Instance().Deallocate( p );
    }

    /// Placement single-object delete merely calls global placement delete.
    inline static void operator delete ( void * p, void * place )
    {
        ::operator delete ( p, place );
    }

protected:
    inline SmallObjectBase( void ) {}
    inline SmallObjectBase( const SmallObjectBase & ) {}
    inline SmallObjectBase & operator = ( const SmallObjectBase & ) { return *this; }
    inline ~SmallObjectBase() {}
}; // end class SmallObjectBase
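In this implementation the author, Andrei, uses another C++ compiler trick. The overloaded operator delete(void *p, size_t size) needs the size of the object being destroyed. The compiler generates code that works out this size just before the object is deleted, and that is where the size argument of SmallObject's operator delete comes from. For the size to be right when a derived object is deleted through a base-class pointer, SmallObject provides a virtual destructor, and every class derived from SmallObject inherits it.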
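Another point is that the whole program needs only one SmallObjAllocator, so SmallObject uses the Singleton pattern when wrapping the third layer. (Singleton is outside the scope of this article and is not described here.)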
Summary:
I am writing a test program to compare performance; to be continued...