Summer Vacation Begins
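Exams were finally over, and everyone's nerves slowly began to unwind. With the pandemic flaring up on and off, Nanami and Nanase had to call off their travel plans. Remembering the question they had not finished discussing, Nanami turned to Nanase and asked, "Shall we pick up where we left off last time?" "We can't go anywhere right now anyway, and sitting around the dorm is boring, so sure, haha," Nanase answered with a smile. "Great, I'll message Douyiya now. See you at the library."
Meanwhile, Douyiya and his four dorm buddies were lined up in front of their computers, itching to get started: "A full five-stack is rare, so today we climb the ranked ladder!" Then a WeChat message arrived. Douyiya glanced down at it, immediately started packing up his laptop, and said softly, "Guys, this is a major life matter. I'm heading out." Before he had even finished speaking he was rushing out of the dorm, leaving his bewildered roommates behind.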
Douyiya's Talk
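"OK, today I'll walk you through the classic ext2 file system. Its implementation is fairly simple, so it suits beginners like us." Nanami looked at the screen packed with code and asked quietly, "That looks like quite a lot of code. Where do we start?" Sensing the hint of self-doubt in Nanami's voice, Douyiya smiled and said, "Let me go over a few basic data structures first, so you get a rough picture of the whole file system. The implementation will be much easier to follow after that." Nanami and Nanase smiled at each other and nodded.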
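Douyiya: Remember Nanami saying last time that we could bundle a file's attributes into one data structure and manage them together? ext2 has a structure dedicated to exactly that: struct inode.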
From here on, all code is excerpted from FreeBSD 12.0.
#define EXT2_NDADDR 12 /* Direct addresses in inode. */
#define EXT2_NIADDR 3 /* Indirect addresses in inode. */
/*
* The inode is used to describe each active (or recently active) file in the
* EXT2FS filesystem. It is composed of two types of information. The first
* part is the information that is needed only while the file is active (such
* as the identity of the file and linkage to speed its lookup). The second
* part is the permanent meta-data associated with the file which is read in
* from the permanent dinode from long term storage when the file becomes
* active, and is put back when the file is no longer being used.
*/
struct inode {
// VFS (virtual filesystem) layer data structures; ignore these for now
struct vnode *i_vnode;/* Vnode associated with this inode. */
struct ext2mount *i_ump;
uint32_t i_flag; /* flags, see below */
ino_t i_number; /* The identity of the inode. */
struct m_ext2fs *i_e2fs; /* EXT2FS */
u_quad_t i_modrev; /* Revision level for NFS lease. */
/*
* Side effects; used during directory lookup.
* The four members below are used when looking up entries in a directory; more on them later.
*/
int32_t i_count; /* Size of free slot in directory. */
doff_t i_endoff; /* End of useful stuff in directory. */
doff_t i_diroff; /* Offset in dir, where we found last entry. */
doff_t i_offset; /* Offset of free space in directory. */
uint32_t i_block_group; // block group number
uint32_t i_next_alloc_block;
uint32_t i_next_alloc_goal;
/* Fields from struct dinode in UFS. */
uint16_t i_mode; /* IFMT, permissions; see below. */
int32_t i_nlink; /* File link count. */
uint32_t i_uid; /* File owner. */
uint32_t i_gid; /* File group. */
uint64_t i_size; /* File byte count. */
uint64_t i_blocks; /* Blocks actually held. */
int32_t i_atime; /* Last access time. */
int32_t i_mtime; /* Last modified time. */
int32_t i_ctime; /* Last inode change time. */
int32_t i_birthtime; /* Inode creation time. */
int32_t i_mtimensec; /* Last modified time. */
int32_t i_atimensec; /* Last access time. */
int32_t i_ctimensec; /* Last inode change time. */
int32_t i_birthnsec; /* Inode creation time. */
uint32_t i_gen; /* Generation number. */
uint64_t i_facl; /* EA block number: disk block holding the file's extended attributes. */
uint32_t i_flags; /* Status flags (chflags). */
union {
struct {
uint32_t i_db[EXT2_NDADDR]; /* Direct disk blocks. */
uint32_t i_ib[EXT2_NIADDR]; /* Indirect disk blocks. */
};
uint32_t i_data[EXT2_NDADDR + EXT2_NIADDR];
};
struct ext4_extent_cache i_ext_cache; /* cache for ext4 extent */
};
Nanami: Hmm, i_size and i_blocks both describe the file's size, right? Isn't that redundant?
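Douyiya: Haha, not quite. Remember what I told you last time: a file lives on disk as data blocks, so the last block will often not be completely filled, and the leftover space cannot be given to another file. That is why the developers use two members. i_size is the file's actual size in bytes, while i_blocks is the number of disk blocks the file actually occupies. So, as long as every block of the file is actually allocated:
i_blocks * disk_block_size >= i_size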
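The rounding behind Douyiya's inequality can be sketched in a few lines of C. The function names blocks_needed and tail_slack are made up for this illustration and do not come from FreeBSD; the sketch only shows the ceiling arithmetic and makes no claim about how the kernel maintains i_blocks.

```c
#include <stdint.h>

/* Hypothetical helper: whole blocks needed to hold `size` bytes.
 * Ceiling division written so it cannot overflow; block_size must be non-zero. */
uint64_t blocks_needed(uint64_t size, uint32_t block_size)
{
    return size / block_size + (size % block_size != 0 ? 1 : 0);
}

/* Hypothetical helper: bytes left unused at the tail of the last block. */
uint64_t tail_slack(uint64_t size, uint32_t block_size)
{
    return blocks_needed(size, block_size) * block_size - size;
}
```

For a 1000-byte file on 512-byte blocks, blocks_needed gives 2 and tail_slack gives 24, so 2 * 512 >= 1000 holds with 24 bytes to spare.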
Nanase: Then what is i_number for?
Douyiya: Think of i_number as the file's number. The file system assigns every file a unique number, and we can use it to locate the file.
Nanase: So every file has exactly one inode of its own, then.
Douyiya: Haha, exactly. Every file has different attributes, so things would be a mess if files shared one inode.
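As a rough illustration of how a number can "locate" an inode, here is a minimal sketch of the conventional ext2 arithmetic. It assumes inode numbers start at 1 and that every block group holds the same number of inodes (the per-group count recorded in the superblock). The struct ino_location and the function locate_inode are hypothetical names for this example, not FreeBSD code.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct ino_location {
    uint32_t group;  /* block group whose inode table holds the inode */
    uint32_t index;  /* slot inside that group's inode table */
};

/* Assumes ino >= 1 and inodes_per_group > 0. */
struct ino_location locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
    struct ino_location loc;

    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    return loc;
}
```

With 8192 inodes per group, inode 1 lands in group 0 at slot 0, and inode 8193 is the first slot of group 1.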
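Nanami: Then the i_data array in the inode must hold the block numbers of the disk blocks that belong to the file, right? The other attributes wouldn't need that much space.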
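Douyiya: Yes, that's right. It is actually split into two parts: direct indexes and indirect indexes. Direct indexes are easy to understand. They handle small files: the numbers of the occupied disk blocks are written straight into the array, and reading the file just means reading those blocks in order. Large files are where the indirect indexes come in.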
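The "two parts" are two views of the same 15 slots, which is what the union in struct inode expresses. The sketch below copies only that union into a hypothetical struct block_map (the name and the fill_map helper are invented here) to show that i_data[12] and i_ib[0] are the same storage. It relies on C11 anonymous struct and union members.

```c
#include <stdint.h>

#define EXT2_NDADDR 12 /* Direct addresses in inode. */
#define EXT2_NIADDR 3  /* Indirect addresses in inode. */

/* Hypothetical cut-down struct holding only the block-map union. */
struct block_map {
    union {
        struct {
            uint32_t i_db[EXT2_NDADDR]; /* direct block numbers */
            uint32_t i_ib[EXT2_NIADDR]; /* indirect block numbers */
        };
        uint32_t i_data[EXT2_NDADDR + EXT2_NIADDR]; /* flat view of both */
    };
};

/* Fill all 15 slots through the flat view with consecutive numbers. */
void fill_map(struct block_map *m, uint32_t first_block)
{
    for (int i = 0; i < EXT2_NDADDR + EXT2_NIADDR; i++)
        m->i_data[i] = first_block + (uint32_t)i;
}
```

After fill_map(&m, 100), m.i_db[11] reads 111 and m.i_ib[0] reads 112, the value written through i_data[12].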
At this point, Douyiya opened the diagram he had prepared in advance:
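An indirect index allocates disk blocks to store block numbers rather than actual file data. Such a block is also called an index block. Assuming a disk block size of 512 bytes and 4-byte block numbers, each index block holds 128 block numbers, so the single indirect index maps another 128 disk blocks. Likewise, a double indirect block stores the numbers of single indirect blocks, and a triple indirect block stores the numbers of double indirect blocks, so the number of blocks they can map grows exponentially with the level.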
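To make the levels concrete, the sketch below decides which level of index a given logical block number (the n-th block of the file, counting from 0) falls under. It assumes 4-byte block numbers and the 12 direct plus 3 indirect slots shown in struct inode; index_level is a hypothetical helper written for this post, not the lookup routine FreeBSD actually uses.

```c
#include <stdint.h>

#define EXT2_NDADDR 12 /* Direct addresses in inode. */

/*
 * Returns 0 if logical block `lbn` is reached through a direct slot,
 * 1, 2 or 3 for the single, double or triple indirect index,
 * and -1 if it lies beyond what the inode can map.
 */
int index_level(uint64_t lbn, uint32_t block_size)
{
    uint64_t per_block = block_size / sizeof(uint32_t); /* numbers per index block */
    uint64_t span = 1;

    if (lbn < EXT2_NDADDR)
        return 0;
    lbn -= EXT2_NDADDR;

    for (int level = 1; level <= 3; level++) {
        span *= per_block;      /* 128, 128^2, 128^3 for 512-byte blocks */
        if (lbn < span)
            return level;
        lbn -= span;
    }
    return -1;
}
```

With 512-byte blocks, logical blocks 0 to 11 are direct, 12 to 139 go through the single indirect block, and block 140 is the first one that needs the double indirect index.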
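Nanami: Oh, I see. So file size has an upper limit too: once every mappable data block is full, no more data can be written to the file.
Douyiya: Yes, that's right, which is why the ext4 file system reworked this part of the design.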
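The ceiling Nanami describes can be estimated from the index layout alone. The sketch below adds up the blocks reachable through 12 direct slots and the three indirect levels, again assuming 4-byte block numbers. The function names are hypothetical, and a real file system may hit other limits first, so treat the result as the bound imposed by the mapping scheme only.

```c
#include <stdint.h>

/* Blocks reachable via 12 direct slots plus single, double and triple
 * indirect index blocks. */
uint64_t max_mappable_blocks(uint32_t block_size)
{
    uint64_t n = block_size / sizeof(uint32_t); /* block numbers per index block */

    return 12 + n + n * n + n * n * n;
}

/* Upper bound on file size, in bytes, implied by the mapping scheme alone. */
uint64_t max_file_bytes(uint32_t block_size)
{
    return max_mappable_blocks(block_size) * block_size;
}
```

For the 512-byte blocks used in the example above this gives 2113676 blocks, or 1082202112 bytes, which is only slightly more than 1 GiB.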
Exercise: do index blocks count toward inode->i_blocks?
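Nanase: I remember you saying last time that the state of the whole file system has to be managed as well. There must be a structure for that too, right?
Douyiya: Haha, I was just about to get to that. It is a data structure called the superblock (struct ext2fs):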
/*
* Super block for an ext2fs file system.
*/
struct ext2fs {
uint32_t e2fs_icount; /* Inode count */
uint32_t e2fs_bcount; /* blocks count */
uint32_t e2fs_rbcount; /* reserved blocks count */
uint32_t e2fs_fbcount; /* free blocks count */
uint32_t e2fs_ficount; /* free inodes count */
uint32_t e2fs_first_dblock; /* first data block */
uint32_t e2fs_log_bsize; /* block size = 1024*(2^e2fs_log_bsize) */
uint32_t e2fs_log_fsize; /* fragment size */
uint32_t e2fs_bpg; /* blocks per group */
uint32_t e2fs_fpg; /* frags per group */
uint32_t e2fs_ipg; /* inodes per group */
uint32_t e2fs_mtime; /* mount time */
uint32_t e2fs_wtime; /* write time */
uint16_t e2fs_mnt_count; /* mount count */
uint16_t e2fs_max_mnt_count; /* max mount count */
uint16_t e2fs_magic; /* magic number */
uint16_t e2fs_state; /* file system state */
uint16_t e2fs_beh; /* behavior on errors */
uint16_t e2fs_minrev; /* minor revision level */
uint32_t e2fs_lastfsck; /* time of last fsck */
uint32_t e2fs_fsckintv; /* max time between fscks */
uint32_t e2fs_creator; /* creator OS */
uint32_t e2fs_rev; /* revision level */
uint16_t e2fs_ruid; /* default uid for reserved blocks */
uint16_t e2fs_rgid; /* default gid for reserved blocks */
/* EXT2_DYNAMIC_REV superblocks */
uint32_t e2fs_first_ino; /* first non-reserved inode */
uint16_t e2fs_inode_size; /* size of inode structure */
uint16_t e2fs_block_group_nr; /* block grp number of this sblk*/
uint32_t e2fs_features_compat; /* compatible feature set */
uint32_t e2fs_features_incompat; /* incompatible feature set */
uint32_t e2fs_features_rocompat; /* RO-compatible feature set */
uint8_t e2fs_uuid[16]; /* 128-bit uuid for volume */
char e2fs_vname[16]; /* volume name */
char e2fs_fsmnt[64]; /* name mounted on */
uint32_t e2fs_algo; /* For compression */
uint8_t e2fs_prealloc; /* # of blocks for old prealloc */
uint8_t e2fs_dir_prealloc; /* # of blocks for old prealloc dirs */
uint16_t e2fs_reserved_ngdb; /* # of reserved gd blocks for resize */
char e3fs_journal_uuid[16]; /* uuid of journal superblock */
uint32_t e3fs_journal_inum; /* inode number of journal file */
uint32_t e3fs_journal_dev; /* device number of journal file */
uint32_t e3fs_last_orphan; /* start of list of inodes to delete */
uint32_t e3fs_hash_seed[4]; /* HTREE hash seed */
char e3fs_def_hash_version;/* Default hash version to use */
char e3fs_jnl_backup_type;
uint16_t e3fs_desc_size; /* size of group descriptor */
uint32_t e3fs_default_mount_opts;
uint32_t e3fs_first_meta_bg; /* First metablock block group */
uint32_t e3fs_mkfs_time; /* when the fs was created */
uint32_t e3fs_jnl_blks[17]; /* backup of the journal inode */
uint32_t e4fs_bcount_hi; /* high bits of blocks count */
uint32_t e4fs_rbcount_hi; /* high bits of reserved blocks count */
uint32_t e4fs_fbcount_hi; /* high bits of free blocks count */
uint16_t e4fs_min_extra_isize; /* all inodes have some bytes */
uint16_t e4fs_want_extra_isize;/* inodes must reserve some bytes */
uint32_t e4fs_flags; /* miscellaneous flags */
uint16_t e4fs_raid_stride; /* RAID stride */
uint16_t e4fs_mmpintv; /* seconds to wait in MMP checking */
uint64_t e4fs_mmpblk; /* block for multi-mount protection */
uint32_t e4fs_raid_stripe_wid; /* blocks on data disks (N * stride) */
uint8_t e4fs_log_gpf; /* FLEX_BG group size */
uint8_t e4fs_chksum_type; /* metadata checksum algorithm used */
uint8_t e4fs_encrypt; /* versioning level for encryption */
uint8_t e4fs_reserved_pad;
uint64_t e4fs_kbytes_written; /* number of lifetime kilobytes */
uint32_t e4fs_snapinum; /* inode number of active snapshot */
uint32_t e4fs_snapid; /* sequential ID of active snapshot */
uint64_t e4fs_snaprbcount; /* reserved blocks for active snapshot */
uint32_t e4fs_snaplist; /* inode number for on-disk snapshot */
uint32_t e4fs_errcount; /* number of file system errors */
uint32_t e4fs_first_errtime; /* first time an error happened */
uint32_t e4fs_first_errino; /* inode involved in first error */
uint64_t e4fs_first_errblk; /* block involved of first error */
uint8_t e4fs_first_errfunc[32];/* function where error happened */
uint32_t e4fs_first_errline; /* line number where error happened */
uint32_t e4fs_last_errtime; /* most recent time of an error */
uint32_t e4fs_last_errino; /* inode involved in last error */
uint32_t e4fs_last_errline; /* line number where error happened */
uint64_t e4fs_last_errblk; /* block involved of last error */
uint8_t e4fs_last_errfunc[32]; /* function where error happened */
uint8_t e4fs_mount_opts[64];
uint32_t e4fs_usrquota_inum; /* inode for tracking user quota */
uint32_t e4fs_grpquota_inum; /* inode for tracking group quota */
uint32_t e4fs_overhead_clusters;/* overhead blocks/clusters */
uint32_t e4fs_backup_bgs[2]; /* groups with sparse_super2 SBs */
uint8_t e4fs_encrypt_algos[4];/* encryption algorithms in use */
uint8_t e4fs_encrypt_pw_salt[16];/* salt used for string2key */
uint32_t e4fs_lpf_ino; /* location of the lost+found inode */
uint32_t e4fs_proj_quota_inum; /* inode for tracking project quota */
uint32_t e4fs_chksum_seed; /* checksum seed */
uint32_t e4fs_reserved[98]; /* padding to the end of the block */
uint32_t e4fs_sbchksum; /* superblock checksum */
};
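This structure is fairly large and covers all kinds of file system state. Let's start with a few of the most commonly used fields:
e2fs_icount: total number of inodes
e2fs_bcount: total number of disk blocks
e2fs_rbcount: number of reserved blocks, which ext2fs holds back from ordinary users so that the owner named by e2fs_ruid and e2fs_rgid (root by default) can keep working when the disk is nearly full
e2fs_fbcount: number of free disk blocks remaining
e2fs_ficount: number of free inodes
e2fs_bpg: number of disk blocks in each block group (the whole disk is divided into block groups so that blocks can be managed and used more efficiently)
e2fs_fpg: number of clusters in each block group (a cluster can be seen as a set of a fixed number of disk blocks, used to make reads and writes more efficient); the struct comment calls these frags
e2fs_first_ino: the first usable inode number in the file system (ext2 reserves some inode numbers for special purposes)
e2fs_inode_size: the size of the inode structure on disk
Together, these fields give a good overall picture of how the disk is being used.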
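As a small illustration of how these fields describe overall usage, the sketch below derives a few numbers from them. struct sb_summary is a hypothetical cut-down copy of a handful of struct ext2fs fields, and the helper functions are invented for this example. The block size formula comes from the comment on e2fs_log_bsize; the group count is the usual ceiling of data blocks over blocks per group and is an assumption here, not something the excerpt above states.

```c
#include <stdint.h>

/* Hypothetical subset of struct ext2fs, same field names. */
struct sb_summary {
    uint32_t e2fs_icount;       /* Inode count */
    uint32_t e2fs_bcount;       /* blocks count */
    uint32_t e2fs_fbcount;      /* free blocks count */
    uint32_t e2fs_ficount;      /* free inodes count */
    uint32_t e2fs_first_dblock; /* first data block */
    uint32_t e2fs_log_bsize;    /* block size = 1024*(2^e2fs_log_bsize) */
    uint32_t e2fs_bpg;          /* blocks per group */
};

/* Block size in bytes, per the comment on e2fs_log_bsize. */
uint32_t sb_block_size(const struct sb_summary *sb)
{
    return 1024u << sb->e2fs_log_bsize;
}

/* Assumed formula: ceil((bcount - first_dblock) / bpg). */
uint32_t sb_group_count(const struct sb_summary *sb)
{
    uint32_t blocks = sb->e2fs_bcount - sb->e2fs_first_dblock;

    return (blocks + sb->e2fs_bpg - 1) / sb->e2fs_bpg;
}

uint32_t sb_used_blocks(const struct sb_summary *sb)
{
    return sb->e2fs_bcount - sb->e2fs_fbcount;
}

uint32_t sb_used_inodes(const struct sb_summary *sb)
{
    return sb->e2fs_icount - sb->e2fs_ficount;
}
```

For a made-up file system with 8193 blocks, 8000 of them free, e2fs_first_dblock of 1 and 8192 blocks per group, this reports 1 KiB blocks, a single block group and 193 blocks in use.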
To Be Continued
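Suddenly Douyiya's phone rang: his roommates were calling him back to climb the ladder. "Let's stop here for today. Go back and get more familiar with the source code, and you'll get a fuller picture," Douyiya said. Nanami and Nanase agreed, and the three of them left the library together. On the way back, to help them read the code with more focus, Douyiya posed two more small questions for them to answer next time:
How is e2fs_icount obtained?
How does the file system tell whether an inode or a disk block is in use or free?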