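This section walks through, at the source level, how postgres creates its shared memory and semaphores. The entry point is CreateSharedMemoryAndSemaphores. When the postmaster (or a standalone backend) calls this function, it creates and initializes the shared memory segment and the semaphores. When any other process calls it (a backend in the EXEC_BACKEND case), nothing is initialized again; the call only sets that process's global variables, namely the shared memory pointers and the semaphore pointers. For a review of the key data structures, see the earlier article "postgres 源码解析33 进程间通信–1".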
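1) Compute the total size of shared memory
2) Allocate a region of the appropriate size for each module
3) Initialize the shared memory header pointer
4) Initialize each module's region
5) Start the dynamic shared memory facility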
This article covers the logic of the steps that were highlighted in blue in the original post.
1 Computing the total shared memory size
/*
* CreateSharedMemoryAndSemaphores
* Creates and initializes shared memory and semaphores.
*
* This is called by the postmaster or by a standalone backend.
* It is also called by a backend forked from the postmaster in the
* EXEC_BACKEND case. In the latter case, the shared memory segment
* already exists and has been physically attached to, but we have to
* initialize pointers in local memory that reference the shared structures,
* because we didn't inherit the correct pointer values from the postmaster
* as we do in the fork() scenario. The easiest way to do that is to run
* through the same code as before. (Note that the called routines mostly
* check IsUnderPostmaster, rather than EXEC_BACKEND, to detect this case.
* This is a bit code-wasteful and could be cleaned up.)
*/
void
CreateSharedMemoryAndSemaphores(void)
{
PGShmemHeader *shim = NULL;
if (!IsUnderPostmaster)
{
PGShmemHeader *seghdr;
Size size;
int numSemas;
/* Compute number of semaphores we'll need */
numSemas = ProcGlobalSemas(); // per-process semaphores: every PGPROC holds one
numSemas += SpinlockSemas(); // semaphores used to emulate spinlocks
/*
* Size of the Postgres shared-memory block is estimated via
* moderately-accurate estimates for the big hogs, plus 100K for the
* stuff that's too small to bother with estimating.
*
* We take some care during this phase to ensure that the total size
* request doesn't overflow size_t. If this gets through, we don't
* need to be so careful during the actual allocation phase.
*/
// Sum the shared memory needed by each module, plus 100K to cover items too small to estimate
size = 100000;
size = add_size(size, PGSemaphoreShmemSize(numSemas)); // semaphores
size = add_size(size, SpinlockSemaSize()); // spinlock emulation semaphores
size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE, // shmem index hash table (quick lookup of each module's region)
sizeof(ShmemIndexEnt)));
size = add_size(size, dsm_estimate_size()); // dynamic shared memory
size = add_size(size, BufferShmemSize()); // buffer pool
size = add_size(size, LockShmemSize()); // lock manager
size = add_size(size, PredicateLockShmemSize()); // predicate locks
size = add_size(size, ProcGlobalShmemSize()); // global ProcGlobal structure
size = add_size(size, XLOGShmemSize()); // XLOG module
size = add_size(size, CLOGShmemSize()); // CLOG module
size = add_size(size, CommitTsShmemSize()); // CommitTs module
size = add_size(size, SUBTRANSShmemSize()); // subtransaction module
size = add_size(size, TwoPhaseShmemSize()); // two-phase commit module
size = add_size(size, BackgroundWorkerShmemSize()); // bgworker module
size = add_size(size, MultiXactShmemSize()); // MultiXact module
size = add_size(size, LWLockShmemSize()); // LWLock module
size = add_size(size, ProcArrayShmemSize()); // Proc array
size = add_size(size, BackendStatusShmemSize()); // backend status module
size = add_size(size, SInvalShmemSize()); // shared invalidation messages
size = add_size(size, PMSignalShmemSize()); // postmaster signal module
size = add_size(size, ProcSignalShmemSize()); // process signal module
size = add_size(size, CheckpointerShmemSize()); // checkpointer module
size = add_size(size, AutoVacuumShmemSize()); // autovacuum module
size = add_size(size, ReplicationSlotsShmemSize()); // replication slots
size = add_size(size, ReplicationOriginShmemSize()); // replication origins (logical replication)
size = add_size(size, WalSndShmemSize()); // walsender module
size = add_size(size, WalRcvShmemSize()); // walreceiver module
size = add_size(size, PgArchShmemSize()); // archiver module
size = add_size(size, ApplyLauncherShmemSize()); // apply launcher module
size = add_size(size, SnapMgrShmemSize()); // snapshot manager
size = add_size(size, BTreeShmemSize()); // BTree module
size = add_size(size, SyncScanShmemSize()); // synchronized scans
size = add_size(size, AsyncShmemSize()); // asynchronous notification: listen/notify/unlisten
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
/* freeze the addin request size and include it */
addin_request_allowed = false;
size = add_size(size, total_addin_request);
/* might as well round it off to a multiple of a typical page size */
size = add_size(size, 8192 - (size % 8192)); // pad up to the next multiple of 8K
elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
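The two arithmetic ideas in the listing above, overflow-checked accumulation and padding to a page multiple, can be sketched in isolation. This is a minimal sketch, not the postgres implementation: `checked_add_size` and `round_to_page` are hypothetical names, and the real add_size reports an error instead of returning a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add two sizes and report overflow instead of
 * silently wrapping around size_t.
 */
bool
checked_add_size(size_t a, size_t b, size_t *result)
{
    if (b > SIZE_MAX - a)
        return false;           /* a + b would not fit in size_t */
    *result = a + b;
    return true;
}

/*
 * Same arithmetic as the rounding line in the listing:
 * size + (page - size % page). A size that is already a multiple
 * of the page size still grows by one full page.
 */
size_t
round_to_page(size_t size, size_t page)
{
    return size + (page - (size % page));
}
```

With a page of 8192, the initial 100000 bytes alone would pad to 106496 (13 pages), and an exact 8192 would become 16384, which shows that the formula always adds between 1 and 8192 bytes.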
2 PGSharedMemoryCreate
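1) First, perform sanity checks: stat the DataDir directory to obtain its ID info, and verify that the requested size is larger than the PGShmemHeader struct;
2) Allocate shared memory according to shared_memory_type; for SHMEM_TYPE_MMAP, call CreateAnonymousSegment to map an anonymous memory segment;
3) Next, search for a free IPC key, create a new segment under that key, attach it to the current process, and register on_shmem_exit callbacks to release the storage;
4) Initialize the standard header PGShmemHeader and return the starting address of the allocated memory.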
/*
* PGSharedMemoryCreate
*
* Create a shared memory segment of the given size and initialize its
* standard header. Also, register an on_shmem_exit callback to release
* the storage.
*
* Dead Postgres segments pertinent to this DataDir are recycled if found, but
* we do not fail upon collision with foreign shmem segments. The idea here
* is to detect and re-use keys that may have been assigned by a crashed
* postmaster or backend.
*/
PGShmemHeader *
PGSharedMemoryCreate(Size size,
PGShmemHeader **shim)
{
IpcMemoryKey NextShmemSegID;
void *memAddress;
PGShmemHeader *hdr;
struct stat statbuf;
Size sysvsize;
/*
* We use the data directory's ID info (inode and device numbers) to
* positively identify shmem segments associated with this data dir, and
* also as seeds for searching for a free shmem key.
*/
if (stat(DataDir, &statbuf) < 0)
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not stat data directory \"%s\": %m",
DataDir)));
/* Complain if hugepages demanded but we can't possibly support them */
#if !defined(MAP_HUGETLB)
if (huge_pages == HUGE_PAGES_ON)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("huge pages not supported on this platform")));
#endif
/* Room for a header? */
Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
if (shared_memory_type == SHMEM_TYPE_MMAP)
{
AnonymousShmem = CreateAnonymousSegment(&size);
AnonymousShmemSize = size;
/* Register on-exit routine to unmap the anonymous segment */
on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
/* Now we need only allocate a minimal-sized SysV shmem block. */
sysvsize = sizeof(PGShmemHeader);
}
else
sysvsize = size;
/*
* Loop till we find a free IPC key. Trust CreateDataDirLockFile() to
* ensure no more than one postmaster per data directory can enter this
* loop simultaneously. (CreateDataDirLockFile() does not entirely ensure
* that, but prefer fixing it over coping here.)
*/
NextShmemSegID = statbuf.st_ino;
for (;;)
{
IpcMemoryId shmid;
PGShmemHeader *oldhdr;
IpcMemoryState state;
/* Try to create new segment */
memAddress = InternalIpcMemoryCreate(NextShmemSegID, sysvsize);
if (memAddress)
break; /* successful create and attach */
/* Check shared memory and possibly remove and recreate */
/*
* shmget() failure is typically EACCES, hence SHMSTATE_FOREIGN.
* ENOENT, a narrow possibility, implies SHMSTATE_ENOENT, but one can
* safely treat SHMSTATE_ENOENT like SHMSTATE_FOREIGN.
*/
shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);
if (shmid < 0)
{
oldhdr = NULL;
state = SHMSTATE_FOREIGN;
}
else
state = PGSharedMemoryAttach(shmid, NULL, &oldhdr);
switch (state)
{
case SHMSTATE_ANALYSIS_FAILURE:
case SHMSTATE_ATTACHED:
ereport(FATAL,
(errcode(ERRCODE_LOCK_FILE_EXISTS),
errmsg("pre-existing shared memory block (key %lu, ID %lu) is still in use",
(unsigned long) NextShmemSegID,
(unsigned long) shmid),
errhint("Terminate any old server processes associated with data directory \"%s\".",
DataDir)));
break;
case SHMSTATE_ENOENT:
/*
* To our surprise, some other process deleted since our last
* InternalIpcMemoryCreate(). Moments earlier, we would have
* seen SHMSTATE_FOREIGN. Try that same ID again.
*/
elog(LOG,
"shared memory block (key %lu, ID %lu) deleted during startup",
(unsigned long) NextShmemSegID,
(unsigned long) shmid);
break;
case SHMSTATE_FOREIGN:
NextShmemSegID++;
break;
case SHMSTATE_UNATTACHED:
/*
* The segment pertains to DataDir, and every process that had
* used it has died or detached. Zap it, if possible, and any
* associated dynamic shared memory segments, as well. This
* shouldn't fail, but if it does, assume the segment belongs
* to someone else after all, and try the next candidate.
* Otherwise, try again to create the segment. That may fail
* if some other process creates the same shmem key before we
* do, in which case we'll try the next key.
*/
if (oldhdr->dsm_control != 0)
dsm_cleanup_using_control_segment(oldhdr->dsm_control);
if (shmctl(shmid, IPC_RMID, NULL) < 0)
NextShmemSegID++;
break;
}
if (oldhdr && shmdt(oldhdr) < 0)
elog(LOG, "shmdt(%p) failed: %m", oldhdr);
}
/* Initialize new segment. */
hdr = (PGShmemHeader *) memAddress;
hdr->creatorPID = getpid();
hdr->magic = PGShmemMagic;
hdr->dsm_control = 0;
/* Fill in the data directory ID info, too */
hdr->device = statbuf.st_dev;
hdr->inode = statbuf.st_ino;
/*
* Initialize space allocation status for segment.
*/
hdr->totalsize = size;
hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
*shim = hdr;
/* Save info for possible future use */
UsedShmemSegAddr = memAddress;
UsedShmemSegID = (unsigned long) NextShmemSegID;
/*
* If AnonymousShmem is NULL here, then we're not using anonymous shared
* memory, and should return a pointer to the System V shared memory
* block. Otherwise, the System V shared memory block is only a shim, and
* we must return a pointer to the real block.
*/
if (AnonymousShmem == NULL)
return hdr;
memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
return (PGShmemHeader *) AnonymousShmem;
}
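The key-probing loop in PGSharedMemoryCreate is easier to follow with the system calls taken out. The sketch below simulates the key space as an array of states. `KeyState` and `find_free_key` are hypothetical names invented for illustration, and the real code reports FATAL rather than returning false when a segment is still in use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states for one slot in a simulated IPC key space. */
typedef enum
{
    KEY_FREE,                   /* no segment exists under this key */
    KEY_FOREIGN,                /* segment belongs to someone else */
    KEY_UNATTACHED,             /* stale segment of ours, no users left */
    KEY_ATTACHED                /* segment of ours, still in use */
} KeyState;

/*
 * Probe keys starting at seed. Returns true and stores the chosen key
 * on success; returns false if a live segment of ours is found or the
 * simulated key space is exhausted.
 */
bool
find_free_key(KeyState *keys, size_t nkeys, size_t seed, size_t *chosen)
{
    size_t      key = seed;

    while (key < nkeys)
    {
        switch (keys[key])
        {
            case KEY_FREE:
                keys[key] = KEY_ATTACHED;   /* "create and attach" */
                *chosen = key;
                return true;
            case KEY_FOREIGN:
                key++;                      /* not ours: try the next key */
                break;
            case KEY_UNATTACHED:
                keys[key] = KEY_FREE;       /* recycle, retry the same key */
                break;
            case KEY_ATTACHED:
                return false;               /* old server still running */
        }
    }
    return false;
}
```

Given the states foreign, unattached, free and a seed of 0, the sketch skips key 0, recycles key 1, and then succeeds on key 1 without ever reaching key 2. This mirrors how a stale segment left by a crashed postmaster is reused rather than leaked.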
3 InitShmemAccess
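This function initializes the global shared memory pointers (ShmemSegHdr, ShmemBase, ShmemEnd) so that the later per-module initialization code can use them.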
/*
* InitShmemAccess() --- set up basic pointers to shared memory.
*
* Note: the argument should be declared "PGShmemHeader *seghdr",
* but we use void to avoid having to include ipc.h in shmem.h.
*/
void
InitShmemAccess(void *seghdr)
{
PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
ShmemSegHdr = shmhdr;
ShmemBase = (void *) shmhdr;
ShmemEnd = (char *) ShmemBase + shmhdr->totalsize;
}
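To see why these three pointers are enough to start handing out memory, here is a minimal sketch of a header-plus-bump-pointer arena over an ordinary buffer. Everything prefixed with `toy_` or `Toy` is hypothetical; the struct is a much-reduced stand-in for PGShmemHeader, and 8-byte alignment is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for PGShmemHeader. */
typedef struct
{
    size_t      totalsize;      /* total bytes in the segment */
    size_t      freeoffset;     /* offset of the first free byte */
} ToyShmemHeader;

static ToyShmemHeader *toy_hdr;
static char *toy_base;
static char *toy_end;

/* Round up to a multiple of 8 (assumed alignment for this sketch). */
#define TOY_ALIGN(n) (((size_t) (n) + 7) & ~(size_t) 7)

/* Write a header at the start of seg and record base/end pointers. */
void
toy_init_access(void *seg, size_t size)
{
    toy_hdr = (ToyShmemHeader *) seg;
    toy_hdr->totalsize = size;
    toy_hdr->freeoffset = TOY_ALIGN(sizeof(ToyShmemHeader));
    toy_base = (char *) seg;
    toy_end = toy_base + size;
}

/* Bump allocation with no locking; returns NULL when space runs out. */
void *
toy_alloc_unlocked(size_t size)
{
    size_t      start = toy_hdr->freeoffset;

    size = TOY_ALIGN(size);
    if (size > toy_hdr->totalsize - start)
        return NULL;
    toy_hdr->freeoffset = start + size;
    return toy_base + start;
}

/* True if addr lies inside the segment. */
bool
toy_addr_is_valid(const void *addr)
{
    return (const char *) addr >= toy_base && (const char *) addr < toy_end;
}
```

Allocation never frees: the free offset only moves forward, which suits structures that live for the whole lifetime of the server.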
4 PGReserveSemaphores
This function sets up the shared memory used by the semaphore module, which supports inter-process communication.
/*
* PGReserveSemaphores --- initialize semaphore support
*
* This is called during postmaster start or shared memory reinitialization.
* It should do whatever is needed to be able to support up to maxSemas
* subsequent PGSemaphoreCreate calls. Also, if any system resources
* are acquired here or in PGSemaphoreCreate, register an on_shmem_exit
* callback to release them.
*
* In the SysV implementation, we acquire semaphore sets on-demand; the
* maxSemas parameter is just used to size the arrays. There is an array
* of PGSemaphoreData structs in shared memory, and a postmaster-local array
* with one entry per SysV semaphore set, which we use for releasing the
* semaphore sets when done. (This design ensures that postmaster shutdown
* doesn't rely on the contents of shared memory, which a failed backend might
* have clobbered.)
*/
void
PGReserveSemaphores(int maxSemas)
{
struct stat statbuf;
/*
* We use the data directory's inode number to seed the search for free
* semaphore keys. This minimizes the odds of collision with other
* postmasters, while maximizing the odds that we will detect and clean up
* semaphores left over from a crashed postmaster in our own directory.
*/
if (stat(DataDir, &statbuf) < 0)
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not stat data directory \"%s\": %m",
DataDir)));
/*
* We must use ShmemAllocUnlocked(), since the spinlock protecting
* ShmemAlloc() won't be ready yet. (This ordering is necessary when we
* are emulating spinlocks with semaphores.)
*/
// Allocate a semaphore area of the requested size in the shared memory segment and initialize the global semaphore bookkeeping fields
sharedSemas = (PGSemaphore)
ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
numSharedSemas = 0;
maxSharedSemas = maxSemas;
maxSemaSets = (maxSemas + SEMAS_PER_SET - 1) / SEMAS_PER_SET;
mySemaSets = (IpcSemaphoreId *)
malloc(maxSemaSets * sizeof(IpcSemaphoreId));
if (mySemaSets == NULL)
elog(PANIC, "out of memory");
numSemaSets = 0;
nextSemaKey = statbuf.st_ino;
nextSemaNumber = SEMAS_PER_SET; /* force sema set alloc on 1st call */
// Register an on_shmem_exit callback that releases the semaphores
on_shmem_exit(ReleaseSemaphores, 0);
}
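The bookkeeping in PGReserveSemaphores (a ceiling division to size the set array, and a "next number" that starts at the set size so the first request opens a new set) can be sketched without any SysV calls. This is a minimal sketch: the `toy_` names are hypothetical, and the set size of 16 is an assumption made for the example.

```c
#define TOY_SEMAS_PER_SET 16    /* assumed set size for this sketch */

typedef struct
{
    int         num_sets;       /* semaphore sets "created" so far */
    int         next_in_set;    /* next free slot in the current set */
} ToySemaState;

/* Ceiling division: how many sets are needed for max_semas semaphores. */
int
toy_max_sema_sets(int max_semas)
{
    return (max_semas + TOY_SEMAS_PER_SET - 1) / TOY_SEMAS_PER_SET;
}

/* Start with the current set "full" so the first create opens a set. */
void
toy_reserve(ToySemaState *st)
{
    st->num_sets = 0;
    st->next_in_set = TOY_SEMAS_PER_SET;
}

/* Hand out one semaphore; returns its global index. */
int
toy_sema_create(ToySemaState *st)
{
    if (st->next_in_set >= TOY_SEMAS_PER_SET)
    {
        st->num_sets++;         /* on-demand: open a new set */
        st->next_in_set = 0;
    }
    return (st->num_sets - 1) * TOY_SEMAS_PER_SET + st->next_in_set++;
}
```

Under these assumptions, 40 semaphores need 3 sets, and the 17th create call is the one that opens the second set.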
5 SpinlockSemaInit
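This function initializes the semaphores used to emulate spinlocks. It only matters on platforms without native spinlock support, where spinlocks are emulated with semaphores.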
/*
* Initialize spinlock emulation.
*
* This must be called after PGReserveSemaphores().
*/
void
SpinlockSemaInit(void)
{
PGSemaphore *spinsemas;
int nsemas = SpinlockSemas();
int i;
/*
* We must use ShmemAllocUnlocked(), since the spinlock protecting
* ShmemAlloc() obviously can't be ready yet.
*/
spinsemas = (PGSemaphore *) ShmemAllocUnlocked(SpinlockSemaSize());
for (i = 0; i < nsemas; ++i)
spinsemas[i] = PGSemaphoreCreate();
SpinlockSemaArray = spinsemas;
}
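One way to picture spinlock emulation is a fixed pool of binary semaphores shared by any number of spinlocks, where each spinlock stores only the index of the pool slot it uses. The sketch below is a single-threaded model of that idea and is not the postgres implementation: all `toy_` names, the pool size of 4, and the round-robin assignment are assumptions for illustration, and a real semaphore would block instead of returning false.

```c
#include <stdbool.h>

#define TOY_NUM_SPIN_SEMAS 4    /* assumed pool size for this sketch */

typedef struct
{
    int         count[TOY_NUM_SPIN_SEMAS];  /* 1 = available, 0 = held */
    unsigned    next;                       /* round-robin counter */
} ToySpinPool;

typedef int toy_slock_t;        /* holds a 1-based index into the pool */

/* Make every semaphore in the pool available. */
void
toy_pool_init(ToySpinPool *pool)
{
    for (int i = 0; i < TOY_NUM_SPIN_SEMAS; i++)
        pool->count[i] = 1;
    pool->next = 0;
}

/* Assign the next pool slot, wrapping around when the pool is used up. */
void
toy_spin_init(ToySpinPool *pool, toy_slock_t *lock)
{
    *lock = (int) (pool->next++ % TOY_NUM_SPIN_SEMAS) + 1;
}

/* Try to take the semaphore behind this lock without blocking. */
bool
toy_spin_trylock(ToySpinPool *pool, const toy_slock_t *lock)
{
    int        *c = &pool->count[*lock - 1];

    if (*c == 0)
        return false;
    *c = 0;
    return true;
}

/* Release the semaphore behind this lock. */
void
toy_spin_unlock(ToySpinPool *pool, const toy_slock_t *lock)
{
    pool->count[*lock - 1] = 1;
}
```

In this model the fifth spinlock lands on the same slot as the first, so holding one makes the other unavailable. Sharing keeps the semaphore count bounded at the cost of occasional false contention.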
6 InitShmemAllocation
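This function initializes the global ShmemLock, the spinlock that every module must hold (via ShmemAlloc) when it later requests shared memory. It also initializes ShmemVariableCache, the transaction manager's global structure for OID/XID state.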
/*
* InitShmemAllocation() --- set up shared-memory space allocation.
*
* This should be called only in the postmaster or a standalone backend.
*/
void
InitShmemAllocation(void)
{
PGShmemHeader *shmhdr = ShmemSegHdr;
char *aligned;
Assert(shmhdr != NULL);
/*
* Initialize the spinlock used by ShmemAlloc. We must use
* ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
*/
ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
SpinLockInit(ShmemLock);
/*
* Allocations after this point should go through ShmemAlloc, which
* expects to allocate everything on cache line boundaries. Make sure the
* first allocation begins on a cache line boundary.
*/
aligned = (char *)
(CACHELINEALIGN((((char *) shmhdr) + shmhdr->freeoffset)));
shmhdr->freeoffset = aligned - (char *) shmhdr;
/* ShmemIndex can't be set up yet (need LWLocks first) */
shmhdr->index = NULL;
ShmemIndex = (HTAB *) NULL;
/*
* Initialize ShmemVariableCache for transaction manager. (This doesn't
* really belong here, but not worth moving.)
*/
ShmemVariableCache = (VariableCache)
ShmemAlloc(sizeof(*ShmemVariableCache));
memset(ShmemVariableCache, 0, sizeof(*ShmemVariableCache));
}
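The alignment step in InitShmemAllocation rounds the address of the first free byte up to a cache line boundary and then converts it back to an offset from the header. A minimal sketch of that address arithmetic follows; `toy_aligned_freeoffset` is a hypothetical name and the 128-byte line size is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE_SIZE 128 /* assumed line size for this sketch */

/*
 * Given the segment base and the current free offset, return the
 * smallest offset >= freeoffset whose address is cache-line aligned.
 * The address is aligned, not the offset, so the result depends on
 * how the base itself is aligned.
 */
size_t
toy_aligned_freeoffset(const char *base, size_t freeoffset)
{
    uintptr_t   addr = (uintptr_t) (base + freeoffset);
    uintptr_t   mask = (uintptr_t) TOY_CACHE_LINE_SIZE - 1;
    uintptr_t   aligned = (addr + mask) & ~mask;

    return freeoffset + (size_t) (aligned - addr);
}
```

The bytes skipped by the adjustment are simply left unused, a one-time cost of fewer than 128 bytes under this assumption.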