Preface
The allocation and freeing of disk blocks and of file inode entries are handled in essentially the same way, so they are covered together here. A previous article introduced the overall on-disk layout of the ext2 file system; to keep the discussion simple, assume the disk contains only a single block group.
Block group layout
Group descriptor
: similar in purpose to the superblock; it records the basic attribute information of this block group
Data block bitmap / inode bitmap
: operates on individual bits, each representing the state of one data block or inode entry; 0 means free, 1 means occupied
Inode table
: stores the inode entries contiguously
Data blocks
: store file data and metadata
Question 1: The superblock holds information about the disk as a whole, so why does every block group contain a copy of it?
This is done to keep multiple backups, which helps with data recovery when part of the disk is damaged. Some disk file systems do not use this design; UFS, for example, stores backup copies of the superblock at a few fixed disk blocks instead (three copies, as I recall).
Question 2: How is the size of each region determined?
Generally by estimation. Based on expected usage, the user presets the average number of disk blocks each file will occupy. Since every file corresponds to exactly one inode, this yields an approximate count of inode entries, from which the sizes of the inode bitmap and the inode table follow. The superblock and the group descriptor typically occupy one or more whole disk blocks; once those are fixed, the sizes of the data block bitmap and data block regions can be computed.
Each region's size should preferably be an integer multiple of the disk block size. If storage space is plentiful, it is also worth making the disk block size an integer multiple of the data structure size (at the cost of some wasted space); the benefit is lower implementation complexity.
Analysis of the main functions
We mainly look at two files under freebsd/usr/src/sys/fs/ext2fs: ext2_alloc.c and ext2_balloc.c.
int ext2_alloc(...)
: allocates a free data block from the disk
/*
* Allocate a block in the filesystem.
*
* A preference may be optionally specified. If a preference is given
* the following hierarchy is used to allocate a block:
* 1) allocate the requested block.
* 2) allocate a rotationally optimal block in the same cylinder.
* 3) allocate a block in the same cylinder group.
* 4) quadradically rehash into other cylinder groups, until an
* available block is located.
* If no block preference is given the following hierarchy is used
* to allocate a block:
* 1) allocate a block in the cylinder group that contains the
* inode for the file.
* 2) quadradically rehash into other cylinder groups, until an
* available block is located.
*/
int
ext2_alloc(struct inode *ip, daddr_t lbn, e4fs_daddr_t bpref, int size,
struct ucred *cred, e4fs_daddr_t *bnp)
{
struct m_ext2fs *fs;
struct ext2mount *ump;
e4fs_daddr_t bno;
int cg;
*bnp = 0;
fs = ip->i_e2fs;
ump = ip->i_ump;
mtx_assert(EXT2_MTX(ump), MA_OWNED);
#ifdef INVARIANTS
if ((u_int)size > fs->e2fs_bsize || blkoff(fs, size) != 0) {
vn_printf(ip->i_devvp, "bsize = %lu, size = %d, fs = %s\n",
(long unsigned int)fs->e2fs_bsize, size, fs->e2fs_fsmnt);
panic("ext2_alloc: bad size");
}
if (cred == NOCRED)
panic("ext2_alloc: missing credential");
#endif /* INVARIANTS */
if (size == fs->e2fs_bsize && fs->e2fs_fbcount == 0)
goto nospace;
if (cred->cr_uid != 0 &&
fs->e2fs_fbcount < fs->e2fs_rbcount)
goto nospace;
if (bpref >= fs->e2fs_bcount)
bpref = 0;
if (bpref == 0)
cg = ino_to_cg(fs, ip->i_number);
else
cg = dtog(fs, bpref);
bno = (daddr_t)ext2_hashalloc(ip, cg, bpref, fs->e2fs_bsize,
ext2_alloccg);
if (bno > 0) {
/* set next_alloc fields as done in block_getblk */
ip->i_next_alloc_block = lbn;
ip->i_next_alloc_goal = bno;
ip->i_blocks += btodb(fs->e2fs_bsize);
ip->i_flag |= IN_CHANGE | IN_UPDATE;
*bnp = bno;
return (0);
}
nospace:
EXT2_UNLOCK(ump);
ext2_fserr(fs, cred->cr_uid, "filesystem full");
uprintf("\n%s: write failed, filesystem is full\n", fs->e2fs_fsmnt);
return (ENOSPC);
}
Parameter analysis:
daddr_t lbn
: file logical block number. How should this be understood? Suppose a text file is 4 KB and the disk block size is 512 bytes; the file then spans 8 blocks. When we open this file with vim, it must read blocks 0 through 7 of the file in order, otherwise the content shown on screen would be scrambled; 0-7 can be understood as the file's logical block numbers. The actual disk block backing a logical block is assigned arbitrarily, but the logical block's offset from the start of the file is fixed. If the user wants to append some content to the end of the file, a ninth logical block must be allocated for it, i.e. lbn = 8.
e4fs_daddr_t bpref
: preferred block number. To speed up block allocation, ext2 stores a likely-usable disk block number in the i_next_alloc_goal member of struct inode (generally the previously allocated block number + 1); the next allocation checks this block first. If it is free, it is used directly; otherwise the allocation routine is called again to search for a free block.
e4fs_daddr_t *bnp
: disk block number pointer, i.e. the actual disk block number backing logical block 8 in the example above.
The function also computes the block group number cg, which is not expanded on here. The idea is locality: prefer allocating a block in the group that already holds the file (this shortens head movement during reads and writes and improves efficiency).
This function mostly performs pre-allocation checks on file attributes and stores the block number once obtained; the real allocation logic lives in the ext2_alloccg(...) function.
/*
* Determine whether a block can be allocated.
*
* Check to see if a block of the appropriate size is available,
* and if it is, allocate it.
*/
static daddr_t
ext2_alloccg(struct inode *ip, int cg, daddr_t bpref, int size)
{
struct m_ext2fs *fs;
struct buf *bp;
struct ext2mount *ump;
daddr_t bno, runstart, runlen;
int bit, loc, end, error, start;
char *bbp;
/* XXX ondisk32 */
fs = ip->i_e2fs;
ump = ip->i_ump;
if (e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) == 0)
return (0);
EXT2_UNLOCK(ump);
/* read the bitmap block from disk into memory */
error = bread(ip->i_devvp, fsbtodb(fs,
e2fs_gd_get_b_bitmap(&fs->e2fs_gd[cg])),
(int)fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
if (EXT2_HAS_RO_COMPAT_FEATURE(fs, EXT2F_ROCOMPAT_GDT_CSUM) ||
EXT2_HAS_RO_COMPAT_FEATURE(fs, EXT2F_ROCOMPAT_METADATA_CKSUM)) {
error = ext2_cg_block_bitmap_init(fs, cg, bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
ext2_gd_b_bitmap_csum_set(fs, cg, bp);
}
error = ext2_gd_b_bitmap_csum_verify(fs, cg, bp);
if (error) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
if (e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) == 0) {
/*
* Another thread allocated the last block in this
* group while we were waiting for the buffer.
*/
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
bbp = (char *)bp->b_data;
if (dtog(fs, bpref) != cg)
bpref = 0;
if (bpref != 0) {
bpref = dtogd(fs, bpref);
/*
* if the requested block is available, use it
*/
if (isclr(bbp, bpref)) {
bno = bpref;
goto gotit;
}
}
/*
* no blocks in the requested cylinder, so take next
* available one in this cylinder group.
* first try to get 8 contigous blocks, then fall back to a single
* block.
*/
if (bpref)
start = dtogd(fs, bpref) / NBBY;
else
start = 0;
end = howmany(fs->e2fs->e2fs_fpg, NBBY) - start;
/* scan the data block bitmap for a free block */
retry:
runlen = 0;
runstart = 0;
for (loc = start; loc < end; loc++) {
if (bbp[loc] == (char)0xff) {
runlen = 0;
continue;
}
/* Start of a run, find the number of high clear bits. */
if (runlen == 0) {
bit = fls(bbp[loc]);
runlen = NBBY - bit;
runstart = loc * NBBY + bit;
} else if (bbp[loc] == 0) {
/* Continue a run. */
runlen += NBBY;
} else {
/*
* Finish the current run. If it isn't long
* enough, start a new one.
*/
bit = ffs(bbp[loc]) - 1;
runlen += bit;
if (runlen >= 8) {
bno = runstart;
goto gotit;
}
/* Run was too short, start a new one. */
bit = fls(bbp[loc]);
runlen = NBBY - bit;
runstart = loc * NBBY + bit;
}
/* If the current run is long enough, use it. */
if (runlen >= 8) {
bno = runstart;
goto gotit;
}
}
if (start != 0) {
end = start;
start = 0;
goto retry;
}
bno = ext2_mapsearch(fs, bbp, bpref);
if (bno < 0) {
brelse(bp);
EXT2_LOCK(ump);
return (0);
}
gotit:
#ifdef INVARIANTS
if (isset(bbp, bno)) {
printf("ext2fs_alloccgblk: cg=%d bno=%jd fs=%s\n",
cg, (intmax_t)bno, fs->e2fs_fsmnt);
panic("ext2fs_alloccg: dup alloc");
}
#endif
setbit(bbp, bno); /* set bit bno to 1: this data block in the group is now in use */
EXT2_LOCK(ump);
ext2_clusteracct(fs, bbp, cg, bno, -1);
/* update the free-block counters in the in-memory superblock (protected by the mount lock) */
fs->e2fs_fbcount--;
e2fs_gd_set_nbfree(&fs->e2fs_gd[cg],
e2fs_gd_get_nbfree(&fs->e2fs_gd[cg]) - 1);
fs->e2fs_fmod = 1;
EXT2_UNLOCK(ump);
ext2_gd_b_bitmap_csum_set(fs, cg, bp);
bdwrite(bp); /* schedule the updated in-memory bitmap to be written back to disk */
return (((uint64_t)cg) * fs->e2fs->e2fs_fpg + fs->e2fs->e2fs_first_dblock + bno);
}
bread(...) / bwrite(...) touch on some operating-system fundamentals. The CPU does not operate on disk data directly, because disks are far too slow; data is always read from disk into memory first. Disk reads are themselves optimized, for example by whole-block transfers: even when we only need to modify 4 bytes inside some disk block, the driver still reads the whole block (say, 512 bytes) into memory. In terms of the principle of locality, adjacent data is likely to be needed soon, and combined with other driver-level features the overall efficiency is quite good. Another example is sorting pending requests by disk block number, ascending or descending, so that the head keeps moving in one direction for as long as possible within each pass. Disk file systems may additionally introduce the notion of a cluster, which as I understand it exists for similar reasons.
FreeBSD manages the block data it reads from disk through buffers (struct buf), with each disk block mapping to exactly one buffer. This is done to guarantee the atomicity of file data reads and writes (put simply, the buffer region carries a read/write lock). This touches on memory management internals that I have not studied closely; something to revisit another time.
int ext2_valloc()
: allocates a free inode
/*
* Allocate an inode in the filesystem.
*
*/
int
ext2_valloc(struct vnode *pvp, int mode, struct ucred *cred, struct vnode **vpp)
{
struct timespec ts;
struct inode *pip;
struct m_ext2fs *fs;
struct inode *ip;
struct ext2mount *ump;
ino_t ino, ipref;
int error, cg;
*vpp = NULL;
pip = VTOI(pvp);
fs = pip->i_e2fs;
ump = pip->i_ump;
EXT2_LOCK(ump);
if (fs->e2fs->e2fs_ficount == 0)
goto noinodes;
/*
* If it is a directory then obtain a cylinder group based on
* ext2_dirpref else obtain it using ino_to_cg. The preferred inode is
* always the next inode.
*/
if ((mode & IFMT) == IFDIR) {
cg = ext2_dirpref(pip);
if (fs->e2fs_contigdirs[cg] < 255)
fs->e2fs_contigdirs[cg]++;
} else {
cg = ino_to_cg(fs, pip->i_number);
if (fs->e2fs_contigdirs[cg] > 0)
fs->e2fs_contigdirs[cg]--;
}
ipref = cg * fs->e2fs->e2fs_ipg + 1;
ino = (ino_t)ext2_hashalloc(pip, cg, (long)ipref, mode, ext2_nodealloccg);
if (ino == 0)
goto noinodes;
error = VFS_VGET(pvp->v_mount, ino, LK_EXCLUSIVE, vpp);
if (error) {
ext2_vfree(pvp, ino, mode);
return (error);
}
ip = VTOI(*vpp);
/*
* The question is whether using VGET was such good idea at all:
* Linux doesn't read the old inode in when it is allocating a
* new one. I will set at least i_size and i_blocks to zero.
*/
ip->i_flag = 0;
ip->i_size = 0;
ip->i_blocks = 0;
ip->i_mode = 0;
ip->i_flags = 0;
if (EXT2_HAS_INCOMPAT_FEATURE(fs, EXT2F_INCOMPAT_EXTENTS)
&& (S_ISREG(mode) || S_ISDIR(mode)))
ext4_ext_tree_init(ip);
else
memset(ip->i_data, 0, sizeof(ip->i_data));
/*
* Set up a new generation number for this inode.
* Avoid zero values.
*/
do {
ip->i_gen = arc4random();
} while (ip->i_gen == 0);
vfs_timestamp(&ts);
ip->i_birthtime = ts.tv_sec;
ip->i_birthnsec = ts.tv_nsec;
/*
printf("ext2_valloc: allocated inode %d\n", ino);
*/
return (0);
noinodes:
EXT2_UNLOCK(ump);
ext2_fserr(fs, cred->cr_uid, "out of inodes");
uprintf("\n%s: create/symlink failed, no inodes free\n", fs->e2fs_fsmnt);
return (ENOSPC);
}
The processing logic is similar to the functions above, so it is not repeated here. One note: if the inode is for a directory and there are multiple block groups, the allocator tries to place it in the same block group as its parent directory.
The inode design shows that once a file grows past a certain size, the direct block pointers no longer suffice and indirect index blocks must be brought in. The data blocks allocated to the file then hold not only content data but also the metadata used to index the file's blocks. ext2 provides a corresponding routine for this:
/*
* Balloc defines the structure of filesystem storage
* by allocating the physical blocks on a device given
* the inode and the logical block number in a file.
*/
int
ext2_balloc(struct inode *ip, e2fs_lbn_t lbn, int size, struct ucred *cred,
struct buf **bpp, int flags)
{
struct m_ext2fs *fs;
struct ext2mount *ump;
struct buf *bp, *nbp;
struct vnode *vp = ITOV(ip);
struct indir indirs[EXT2_NIADDR + 2];
e4fs_daddr_t nb, newb;
e2fs_daddr_t *bap, pref;
int osize, nsize, num, i, error;
*bpp = NULL;
if (lbn < 0)
return (EFBIG);
fs = ip->i_e2fs;
ump = ip->i_ump;
/*
* check if this is a sequential block allocation.
* If so, increment next_alloc fields to allow ext2_blkpref
* to make a good guess
*/
if (lbn == ip->i_next_alloc_block + 1) {
ip->i_next_alloc_block++;
ip->i_next_alloc_goal++;
}
if (ip->i_flag & IN_E4EXTENTS)
return (ext2_ext_balloc(ip, lbn, size, cred, bpp, flags));
/*
* The first EXT2_NDADDR blocks are direct blocks
*/
if (lbn < EXT2_NDADDR) {
/* direct block pointer */
nb = ip->i_db[lbn];
/*
* no new block is to be allocated, and no need to expand
* the file
*/
if (nb != 0 && ip->i_size >= (lbn + 1) * fs->e2fs_bsize) {
error = bread(vp, lbn, fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bp->b_blkno = fsbtodb(fs, nb);
*bpp = bp;
return (0);
}
if (nb != 0) {
/*
* Consider need to reallocate a fragment.
*/
osize = fragroundup(fs, blkoff(fs, ip->i_size));
nsize = fragroundup(fs, size);
if (nsize <= osize) {
error = bread(vp, lbn, osize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bp->b_blkno = fsbtodb(fs, nb);
} else {
/*
* Godmar thinks: this shouldn't happen w/o
* fragments
*/
printf("nsize %d(%d) > osize %d(%d) nb %d\n",
(int)nsize, (int)size, (int)osize,
(int)ip->i_size, (int)nb);
panic(
"ext2_balloc: Something is terribly wrong");
/*
* please note there haven't been any changes from here on -
* FFS seems to work.
*/
}
} else {
if (ip->i_size < (lbn + 1) * fs->e2fs_bsize)
nsize = fragroundup(fs, size);
else
nsize = fs->e2fs_bsize;
EXT2_LOCK(ump);
error = ext2_alloc(ip, lbn,
ext2_blkpref(ip, lbn, (int)lbn, &ip->i_db[0], 0),
nsize, cred, &newb);
if (error)
return (error);
/*
* If the newly allocated block exceeds 32-bit limit,
* we can not use it in file block maps.
*/
if (newb > UINT_MAX)
return (EFBIG);
bp = getblk(vp, lbn, nsize, 0, 0, 0);
bp->b_blkno = fsbtodb(fs, newb);
if (flags & BA_CLRBUF)
vfs_bio_clrbuf(bp);
}
ip->i_db[lbn] = dbtofsb(fs, bp->b_blkno);
ip->i_flag |= IN_CHANGE | IN_UPDATE;
*bpp = bp;
return (0);
}
/*
* Determine the number of levels of indirection.
*/
pref = 0;
if ((error = ext2_getlbns(vp, lbn, indirs, &num)) != 0)
return (error);
#ifdef INVARIANTS
if (num < 1)
panic("ext2_balloc: ext2_getlbns returned indirect block");
#endif
/*
* Fetch the first indirect block allocating if necessary.
*/
--num;
nb = ip->i_ib[indirs[0].in_off];
if (nb == 0) {
EXT2_LOCK(ump);
pref = ext2_blkpref(ip, lbn, indirs[0].in_off +
EXT2_NDIR_BLOCKS, &ip->i_db[0], 0);
if ((error = ext2_alloc(ip, lbn, pref, fs->e2fs_bsize, cred,
&newb)))
return (error);
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
bp = getblk(vp, indirs[1].in_lbn, fs->e2fs_bsize, 0, 0, 0);
bp->b_blkno = fsbtodb(fs, newb);
vfs_bio_clrbuf(bp);
/*
* Write synchronously so that indirect blocks
* never point at garbage.
*/
if ((error = bwrite(bp)) != 0) {
ext2_blkfree(ip, nb, fs->e2fs_bsize);
return (error);
}
ip->i_ib[indirs[0].in_off] = newb;
ip->i_flag |= IN_CHANGE | IN_UPDATE;
}
/*
* Fetch through the indirect blocks, allocating as necessary.
*/
for (i = 1;;) {
error = bread(vp,
indirs[i].in_lbn, (int)fs->e2fs_bsize, NOCRED, &bp);
if (error) {
brelse(bp);
return (error);
}
bap = (e2fs_daddr_t *)bp->b_data;
nb = bap[indirs[i].in_off];
if (i == num)
break;
i += 1;
if (nb != 0) {
bqrelse(bp);
continue;
}
EXT2_LOCK(ump);
if (pref == 0)
pref = ext2_blkpref(ip, lbn, indirs[i].in_off, bap,
bp->b_lblkno);
error = ext2_alloc(ip, lbn, pref, (int)fs->e2fs_bsize, cred, &newb);
if (error) {
brelse(bp);
return (error);
}
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
nbp = getblk(vp, indirs[i].in_lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
vfs_bio_clrbuf(nbp);
/*
* Write synchronously so that indirect blocks
* never point at garbage.
*/
if ((error = bwrite(nbp)) != 0) {
ext2_blkfree(ip, nb, fs->e2fs_bsize);
EXT2_UNLOCK(ump);
brelse(bp);
return (error);
}
bap[indirs[i - 1].in_off] = nb;
/*
* If required, write synchronously, otherwise use
* delayed write.
*/
if (flags & IO_SYNC) {
bwrite(bp);
} else {
if (bp->b_bufsize == fs->e2fs_bsize)
bp->b_flags |= B_CLUSTEROK;
bdwrite(bp);
}
}
/*
* Get the data block, allocating if necessary.
*/
if (nb == 0) {
EXT2_LOCK(ump);
pref = ext2_blkpref(ip, lbn, indirs[i].in_off, &bap[0],
bp->b_lblkno);
if ((error = ext2_alloc(ip,
lbn, pref, (int)fs->e2fs_bsize, cred, &newb)) != 0) {
brelse(bp);
return (error);
}
if (newb > UINT_MAX)
return (EFBIG);
nb = newb;
nbp = getblk(vp, lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
if (flags & BA_CLRBUF)
vfs_bio_clrbuf(nbp);
bap[indirs[i].in_off] = nb;
/*
* If required, write synchronously, otherwise use
* delayed write.
*/
if (flags & IO_SYNC) {
bwrite(bp);
} else {
if (bp->b_bufsize == fs->e2fs_bsize)
bp->b_flags |= B_CLUSTEROK;
bdwrite(bp);
}
*bpp = nbp;
return (0);
}
brelse(bp);
if (flags & BA_CLRBUF) {
int seqcount = (flags & BA_SEQMASK) >> BA_SEQSHIFT;
if (seqcount && (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
error = cluster_read(vp, ip->i_size, lbn,
(int)fs->e2fs_bsize, NOCRED,
MAXBSIZE, seqcount, 0, &nbp);
} else {
error = bread(vp, lbn, (int)fs->e2fs_bsize, NOCRED, &nbp);
}
if (error) {
brelse(nbp);
return (error);
}
} else {
nbp = getblk(vp, lbn, fs->e2fs_bsize, 0, 0, 0);
nbp->b_blkno = fsbtodb(fs, nb);
}
*bpp = nbp;
return (0);
}
The function is structured as one big if/else, with separate branches for the direct-pointer and indirect-pointer cases. Note that during block allocation the metadata (indirect) blocks must be updated as well. bread() is built on getblk(): getblk() locates or creates the buffer for a block, and bread() additionally fills it from disk when the buffer is not already valid. The indirect-indexing algorithm is left for the reader; it took me repeated reading and hands-on testing to understand the author's approach.
An earlier article raised a question: do the metadata blocks used for indirect indexing count toward the file size? My understanding is that they do not. We locate a file's logical block number by dividing the file offset by the disk block size (possibly plus one). If metadata were included, there would be no way to tell whether a given file offset maps to content data or to metadata (short of constraining how the file's blocks are laid out). So in my view the metadata only assists in locating data and should not be counted in the file size.
Conclusion
Only some of the functions are covered here; for the remaining details you will need to read the source yourself. The code is fairly dense, so I recommend plenty of hands-on testing, and keep an eye out for bugs while you are at it (wink).