FreeBSD - An Analysis of the Basic Data Structures of the ext2 File System

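Summer Vacation Begins

The exams were finally over, and everyone's nerves slowly began to unwind. With the pandemic flaring up on and off, Nanami and Nanase had to cancel their travel plans. Remembering the questions they had not finished discussing, Nanami turned to Nanase and asked, "Shall we pick up where we left off last time?" "We can't go anywhere right now anyway, and sitting around in the dorm is boring, so sure, haha," Nanase answered with a smile. "Great, I'll message Douyiya now. See you at the library."
Meanwhile, Douyiya and his four dorm buddies were already lined up in front of their computers, itching to get started: "It's rare to get a full five-man squad, so today we have to climb the ranked ladder!" Then a WeChat message arrived. Douyiya glanced down at it, immediately started packing up his laptop, and said quietly, "Brothers, this is a matter of lifelong importance, so I'm off." Before the words had faded he had rushed out of the dorm, leaving his bewildered roommates behind.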

Douyiya's Walkthrough

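"OK, today I'll walk you through the classic ext2 file system. Its implementation is fairly simple, which makes it a good starting point for beginners like us." Nanami looked at the dense code on the screen and asked quietly, "That still looks like a lot of code. Where do we start?" Douyiya noticed the hint of uncertainty in Nanami's voice and said with a smile, "Let me start with a few basic data structures so you get a rough picture of the whole file system. The implementation will be much easier to follow after that." Nanami and Nanase exchanged a smile and nodded.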

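Douyiya: Remember how Nanami said last time that a file's attributes can be packed into one data structure and managed together? The ext2 file system has exactly such a structure for this job: struct inode.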

All code from here on is excerpted from FreeBSD 12.0.
#define    EXT2_NDADDR    12        /* Direct addresses in inode. */
#define    EXT2_NIADDR    3        /* Indirect addresses in inode. */

/*
 * The inode is used to describe each active (or recently active) file in the
 * EXT2FS filesystem. It is composed of two types of information. The first
 * part is the information that is needed only while the file is active (such
 * as the identity of the file and linkage to speed its lookup). The second
 * part is the permanent meta-data associated with the file which is read in
 * from the permanent dinode from long term storage when the file becomes
 * active, and is put back when the file is no longer being used.
 */
struct inode {
    // Virtual filesystem (VFS) layer data structure; ignore it for now
    struct    vnode  *i_vnode;/* Vnode associated with this inode. */
    struct    ext2mount *i_ump;
    uint32_t i_flag;    /* flags, see below */
    ino_t      i_number;    /* The identity of the inode. */

    struct    m_ext2fs *i_e2fs;    /* EXT2FS */
    u_quad_t i_modrev;    /* Revision level for NFS lease. */
    /*
     * Side effects; used during directory lookup.
     * The four members below are used when looking up an entry in a directory; details later.
     */
    int32_t     i_count;    /* Size of free slot in directory. */
    doff_t     i_endoff;    /* End of useful stuff in directory. */
    doff_t     i_diroff;    /* Offset in dir, where we found last entry. */
    doff_t     i_offset;    /* Offset of free space in directory. */

    uint32_t i_block_group;    // block group number
    uint32_t i_next_alloc_block;
    uint32_t i_next_alloc_goal;

    /* Fields from struct dinode in UFS. */
    uint16_t    i_mode;        /* IFMT, permissions; see below. */
    int32_t        i_nlink;    /* File link count. */
    uint32_t    i_uid;        /* File owner. */
    uint32_t    i_gid;        /* File group. */
    uint64_t    i_size;        /* File byte count. */
    uint64_t    i_blocks;    /* Blocks actually held. */
    int32_t        i_atime;    /* Last access time. */
    int32_t        i_mtime;    /* Last modified time. */
    int32_t        i_ctime;    /* Last inode change time. */
    int32_t        i_birthtime;    /* Inode creation time. */
    int32_t        i_mtimensec;    /* Last modified time. */
    int32_t        i_atimensec;    /* Last access time. */
    int32_t        i_ctimensec;    /* Last inode change time. */
    int32_t        i_birthnsec;    /* Inode creation time. */
    uint32_t    i_gen;        /* Generation number. */
    uint64_t    i_facl;        /* EA block number: disk block holding the file's extended attributes. */
    uint32_t    i_flags;    /* Status flags (chflags). */
    union {
        struct {
            uint32_t i_db[EXT2_NDADDR]; /* Direct disk blocks. */
            uint32_t i_ib[EXT2_NIADDR]; /* Indirect disk blocks. */
        };
        uint32_t i_data[EXT2_NDADDR + EXT2_NIADDR];
    };

    struct ext4_extent_cache i_ext_cache; /* cache for ext4 extent */
};

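Nanami: Hmm, i_size and i_blocks both describe the size of the file, don't they? Isn't that redundant?

Douyiya: Haha, not quite. Remember what I told you last time: a file lives on disk in the form of data blocks, so the last data block will often not be completely filled, and the leftover space in it cannot be given to any other file. That is why the developers use two members to describe the situation. i_size is the actual size of the file in bytes, while i_blocks is the number of disk blocks the file actually occupies. So, for an ordinary file without holes:

i_blocks * disk_block_size >= i_size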
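
The relationship Douyiya describes can be tried out with a small sketch. The helper names blocks_for_size and slack_bytes are hypothetical and do not come from the FreeBSD sources. The sketch follows the simplification used in the inequality above: blocks are counted in units of the file system block size and the file has no holes, which may differ from how the kernel actually accounts for i_blocks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from FreeBSD): the number of whole blocks
 * needed to store `size` bytes when each block holds `block_size` bytes.
 */
uint64_t
blocks_for_size(uint64_t size, uint32_t block_size)
{
    assert(block_size > 0);
    /* Round up: a partly filled last block still occupies a whole block. */
    return (size / block_size + (size % block_size != 0 ? 1 : 0));
}

/*
 * Hypothetical helper: bytes left unused in the last block, which no
 * other file can use.
 */
uint64_t
slack_bytes(uint64_t size, uint32_t block_size)
{
    return (blocks_for_size(size, block_size) * block_size - size);
}
```

Under these assumptions, a 1025-byte file on 1024-byte blocks needs two blocks and leaves 1023 bytes of the second block unused.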

Nanase: Then what is i_number for?

Douyiya: You can think of i_number as the file's number. The file system assigns every file a unique number, and we can use that number to locate the file.

Nanase: So every file has exactly one inode that corresponds to it.

Douyiya: Haha, exactly. Every file has different attributes, so things would be a mess if files shared a single inode.
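
As a rough illustration of "locating a file by its number", the sketch below shows the conventional ext2 arithmetic: inode numbers start at 1, and each block group holds a fixed number of inodes (e2fs_ipg in the superblock shown later). The type ino_location and the function locate_inode are hypothetical names made up for this example, not FreeBSD APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this example, not a FreeBSD structure. */
struct ino_location {
    uint32_t group;    /* block group that holds the inode */
    uint32_t index;    /* slot of the inode inside that group's inode table */
};

/*
 * Split an inode number into (block group, slot in group), assuming
 * inode numbers start at 1 and every group holds `inodes_per_group` inodes.
 */
struct ino_location
locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
    struct ino_location loc;

    assert(ino >= 1);
    assert(inodes_per_group > 0);
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    return (loc);
}
```

With 8192 inodes per group, inode 8192 is the last slot of group 0 and inode 8193 is the first slot of group 1.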

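Nanami: Then the i_data array in the inode must hold the block numbers of the disk blocks that belong to the file, right? None of the other attributes seem to need that much space.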

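Douyiya: Yes, that's right. It is actually split into two parts: the direct index and the indirect index. The direct index is easy to understand. It handles fairly small files: the numbers of the occupied disk blocks are written into the array, and when the user accesses the file they are simply read in order. When the user needs to work with a large file, the indirect index comes into play.
As he spoke, Douyiya opened a diagram he had prepared in advance: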
[Figure 1: FreeBSD - ext2 file system basic data structures]

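With indirect indexing, disk blocks are allocated to hold block numbers rather than real file data; such blocks are also called index blocks. Suppose the disk block size is 512 bytes. Since each block number is a 4-byte uint32_t, each index block can hold 128 block numbers, so the single indirect index can map another 128 disk blocks. In the same way, a double indirect block holds the numbers of single indirect blocks, and a triple indirect block holds the numbers of double indirect blocks, so the number of disk blocks that can be mapped grows by a factor of 128 with each level.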
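
To make the layering concrete, here is a minimal sketch that decides which part of the index a given logical block of a file falls into. EXT2_NDADDR and EXT2_NIADDR are taken from the excerpt above; the function indirection_level is a hypothetical helper written for this example and is not how the FreeBSD sources name or structure the lookup.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_NDADDR 12    /* Direct addresses in inode. */
#define EXT2_NIADDR 3     /* Indirect addresses in inode. */

/*
 * Hypothetical helper: return 0 if logical block `lbn` is reached through
 * a direct slot, 1, 2 or 3 if it is reached through the single, double or
 * triple indirect block, and -1 if it lies beyond what the inode can map.
 */
int
indirection_level(uint64_t lbn, uint32_t block_size)
{
    uint64_t ptrs, span;
    int level;

    assert(block_size >= sizeof(uint32_t));
    if (lbn < EXT2_NDADDR)
        return (0);
    lbn -= EXT2_NDADDR;

    /* Number of 4-byte block numbers that fit in one index block. */
    ptrs = block_size / sizeof(uint32_t);
    span = 1;
    for (level = 1; level <= EXT2_NIADDR; level++) {
        span *= ptrs;    /* data blocks reachable at this level */
        if (lbn < span)
            return (level);
        lbn -= span;
    }
    return (-1);
}
```

With the 512-byte block size from the example above, logical blocks 0 to 11 are direct, 12 to 139 go through the single indirect block, and block 140 is the first one that needs the double indirect block.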

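Nanami: Oh, so that's how it works. Then the file size has an upper limit too: once every data block that can be mapped is full, no more data can be written to the file.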

Douyiya: Yes, that's right, which is why the ext4 file system changed the design to address this problem.
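
The limit Nanami mentions can be estimated from the index layout alone. The sketch below adds up the blocks reachable through the 12 direct slots and the three indirect levels. The function names are hypothetical, and the result is only the bound imposed by the index tree; other fields, such as the 32-bit block numbers, may impose a lower limit in practice.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_NDADDR 12    /* Direct addresses in inode. */
#define EXT2_NIADDR 3     /* Indirect addresses in inode. */

/*
 * Hypothetical helper: the number of data blocks one inode can map
 * through its direct and indirect indexes for a given block size.
 */
uint64_t
max_mapped_blocks(uint32_t block_size)
{
    uint64_t ptrs, span, total;
    int level;

    assert(block_size >= sizeof(uint32_t));
    ptrs = block_size / sizeof(uint32_t);    /* block numbers per index block */
    total = EXT2_NDADDR;
    span = 1;
    for (level = 1; level <= EXT2_NIADDR; level++) {
        span *= ptrs;     /* ptrs, ptrs^2, ptrs^3 */
        total += span;
    }
    return (total);
}

/* The same bound expressed in bytes. */
uint64_t
max_mapped_bytes(uint32_t block_size)
{
    return (max_mapped_blocks(block_size) * block_size);
}
```

For the 512-byte blocks used in the example above, this gives 12 + 128 + 128^2 + 128^3 = 2113676 blocks, a little over 1 GB.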

Exercise: do index blocks count toward inode->i_blocks?

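Nanase: I remember you said last time that the state of the whole file system also has to be managed. There should be a structure for that too, right?
Douyiya: Haha, I was just about to get to that. It is a data structure called the superblock (struct ext2fs):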

/*
 * Super block for an ext2fs file system.
 */
struct ext2fs {
    uint32_t  e2fs_icount;        /* Inode count */
    uint32_t  e2fs_bcount;        /* blocks count */
    uint32_t  e2fs_rbcount;        /* reserved blocks count */
    uint32_t  e2fs_fbcount;        /* free blocks count */
    uint32_t  e2fs_ficount;        /* free inodes count */
    uint32_t  e2fs_first_dblock;    /* first data block */
    uint32_t  e2fs_log_bsize;    /* block size = 1024*(2^e2fs_log_bsize) */
    uint32_t  e2fs_log_fsize;    /* fragment size */
    uint32_t  e2fs_bpg;        /* blocks per group */
    uint32_t  e2fs_fpg;        /* frags per group */
    uint32_t  e2fs_ipg;        /* inodes per group */
    uint32_t  e2fs_mtime;        /* mount time */
    uint32_t  e2fs_wtime;        /* write time */
    uint16_t  e2fs_mnt_count;    /* mount count */
    uint16_t  e2fs_max_mnt_count;    /* max mount count */
    uint16_t  e2fs_magic;        /* magic number */
    uint16_t  e2fs_state;        /* file system state */
    uint16_t  e2fs_beh;        /* behavior on errors */
    uint16_t  e2fs_minrev;        /* minor revision level */
    uint32_t  e2fs_lastfsck;    /* time of last fsck */
    uint32_t  e2fs_fsckintv;    /* max time between fscks */
    uint32_t  e2fs_creator;        /* creator OS */
    uint32_t  e2fs_rev;        /* revision level */
    uint16_t  e2fs_ruid;        /* default uid for reserved blocks */
    uint16_t  e2fs_rgid;        /* default gid for reserved blocks */
    /* EXT2_DYNAMIC_REV superblocks */
    uint32_t  e2fs_first_ino;    /* first non-reserved inode */
    uint16_t  e2fs_inode_size;    /* size of inode structure */
    uint16_t  e2fs_block_group_nr;    /* block grp number of this sblk*/
    uint32_t  e2fs_features_compat;    /* compatible feature set */
    uint32_t  e2fs_features_incompat; /* incompatible feature set */
    uint32_t  e2fs_features_rocompat; /* RO-compatible feature set */
    uint8_t      e2fs_uuid[16];    /* 128-bit uuid for volume */
    char      e2fs_vname[16];    /* volume name */
    char      e2fs_fsmnt[64];    /* name mounted on */
    uint32_t  e2fs_algo;        /* For compression */
    uint8_t   e2fs_prealloc;    /* # of blocks for old prealloc */
    uint8_t   e2fs_dir_prealloc;    /* # of blocks for old prealloc dirs */
    uint16_t  e2fs_reserved_ngdb;    /* # of reserved gd blocks for resize */
    char      e3fs_journal_uuid[16]; /* uuid of journal superblock */
    uint32_t  e3fs_journal_inum;    /* inode number of journal file */
    uint32_t  e3fs_journal_dev;    /* device number of journal file */
    uint32_t  e3fs_last_orphan;    /* start of list of inodes to delete */
    uint32_t  e3fs_hash_seed[4];    /* HTREE hash seed */
    char      e3fs_def_hash_version;/* Default hash version to use */
    char      e3fs_jnl_backup_type;
    uint16_t  e3fs_desc_size;    /* size of group descriptor */
    uint32_t  e3fs_default_mount_opts;
    uint32_t  e3fs_first_meta_bg;    /* First metablock block group */
    uint32_t  e3fs_mkfs_time;    /* when the fs was created */
    uint32_t  e3fs_jnl_blks[17];    /* backup of the journal inode */
    uint32_t  e4fs_bcount_hi;    /* high bits of blocks count */
    uint32_t  e4fs_rbcount_hi;    /* high bits of reserved blocks count */
    uint32_t  e4fs_fbcount_hi;    /* high bits of free blocks count */
    uint16_t  e4fs_min_extra_isize; /* all inodes have some bytes */
    uint16_t  e4fs_want_extra_isize;/* inodes must reserve some bytes */
    uint32_t  e4fs_flags;        /* miscellaneous flags */
    uint16_t  e4fs_raid_stride;    /* RAID stride */
    uint16_t  e4fs_mmpintv;        /* seconds to wait in MMP checking */
    uint64_t  e4fs_mmpblk;        /* block for multi-mount protection */
    uint32_t  e4fs_raid_stripe_wid; /* blocks on data disks (N * stride) */
    uint8_t   e4fs_log_gpf;        /* FLEX_BG group size */
    uint8_t   e4fs_chksum_type;    /* metadata checksum algorithm used */
    uint8_t   e4fs_encrypt;        /* versioning level for encryption */
    uint8_t   e4fs_reserved_pad;
    uint64_t  e4fs_kbytes_written;    /* number of lifetime kilobytes */
    uint32_t  e4fs_snapinum;    /* inode number of active snapshot */
    uint32_t  e4fs_snapid;        /* sequential ID of active snapshot */
    uint64_t  e4fs_snaprbcount;    /* reserved blocks for active snapshot */
    uint32_t  e4fs_snaplist;    /* inode number for on-disk snapshot */
    uint32_t  e4fs_errcount;    /* number of file system errors */
    uint32_t  e4fs_first_errtime;    /* first time an error happened */
    uint32_t  e4fs_first_errino;    /* inode involved in first error */
    uint64_t  e4fs_first_errblk;    /* block involved of first error */
    uint8_t   e4fs_first_errfunc[32];/* function where error happened */
    uint32_t  e4fs_first_errline;    /* line number where error happened */
    uint32_t  e4fs_last_errtime;    /* most recent time of an error */
    uint32_t  e4fs_last_errino;    /* inode involved in last error */
    uint32_t  e4fs_last_errline;    /* line number where error happened */
    uint64_t  e4fs_last_errblk;    /* block involved of last error */
    uint8_t   e4fs_last_errfunc[32]; /* function where error happened */
    uint8_t   e4fs_mount_opts[64];
    uint32_t  e4fs_usrquota_inum;    /* inode for tracking user quota */
    uint32_t  e4fs_grpquota_inum;    /* inode for tracking group quota */
    uint32_t  e4fs_overhead_clusters;/* overhead blocks/clusters */
    uint32_t  e4fs_backup_bgs[2];    /* groups with sparse_super2 SBs */
    uint8_t   e4fs_encrypt_algos[4];/* encryption algorithms in use */
    uint8_t   e4fs_encrypt_pw_salt[16];/* salt used for string2key */
    uint32_t  e4fs_lpf_ino;        /* location of the lost+found inode */
    uint32_t  e4fs_proj_quota_inum;    /* inode for tracking project quota */
    uint32_t  e4fs_chksum_seed;    /* checksum seed */
    uint32_t  e4fs_reserved[98];    /* padding to the end of the block */
    uint32_t  e4fs_sbchksum;    /* superblock checksum */
};

This structure is fairly large and covers all kinds of file system state. Let's start with a few of the more commonly used members:

  • e2fs_icount: the total number of inodes
  • e2fs_bcount: the total number of disk blocks
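  • e2fs_rbcount: the number of reserved blocks. ext2fs holds some blocks back from ordinary use (by default for the superuser; see e2fs_ruid and e2fs_rgid) so that the system can keep working in exceptional situations such as a nearly full disk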
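  • e2fs_fbcount: the number of free disk blocks remaining
  • e2fs_ficount: the number of free inodes
  • e2fs_bpg: the number of disk blocks in each block group (the disk is divided into block groups so that disk blocks can be managed and used more efficiently)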
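  • e2fs_fpg: the number of fragments ("frags" in the source comment) in each block group; a fragment can be viewed as a set of a certain number of disk blocks, used to improve read and write efficiency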
  • e2fs_first_ino: the first usable (non-reserved) inode number in the file system (ext2 reserves some inode numbers for special purposes)
  • e2fs_inode_size: the size of the inode structure on disk

Together, these members are basically enough to show how the disk device is being used overall.
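
A few derived values show how these members fit together. The sketch below works on a hypothetical struct sb_summary that copies only a handful of fields from struct ext2fs; the helper names are made up for this example. The block size formula comes from the comment on e2fs_log_bsize, and the group count uses the conventional ext2 rounding (blocks after the first data block, divided by blocks per group, rounded up).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct ext2fs, used only for this example. */
struct sb_summary {
    uint32_t e2fs_bcount;          /* blocks count */
    uint32_t e2fs_fbcount;         /* free blocks count */
    uint32_t e2fs_first_dblock;    /* first data block */
    uint32_t e2fs_log_bsize;       /* block size = 1024*(2^e2fs_log_bsize) */
    uint32_t e2fs_bpg;             /* blocks per group */
};

/* Block size in bytes, as described by the e2fs_log_bsize comment. */
uint32_t
sb_block_size(const struct sb_summary *sb)
{
    assert(sb->e2fs_log_bsize < 22);    /* keep the shift inside 32 bits */
    return ((uint32_t)1024 << sb->e2fs_log_bsize);
}

/* Number of block groups, rounding up so a partial last group counts. */
uint32_t
sb_group_count(const struct sb_summary *sb)
{
    uint64_t blocks;

    assert(sb->e2fs_bpg > 0);
    assert(sb->e2fs_bcount >= sb->e2fs_first_dblock);
    blocks = (uint64_t)sb->e2fs_bcount - sb->e2fs_first_dblock;
    return ((uint32_t)((blocks + sb->e2fs_bpg - 1) / sb->e2fs_bpg));
}

/* Blocks currently in use: total minus free. */
uint32_t
sb_used_blocks(const struct sb_summary *sb)
{
    assert(sb->e2fs_bcount >= sb->e2fs_fbcount);
    return (sb->e2fs_bcount - sb->e2fs_fbcount);
}
```

For a reader who wants to experiment, filling sb_summary from a real superblock and printing these three values is a quick way to cross-check what tools report for the same disk.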

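To Be Continued

Suddenly Douyiya's phone rang: his roommates were calling him back to climb the ladder. "Let's stop here for today," he said. "You can go back and get more familiar with the source code, which will give you a more complete picture." Nanami and Nanase agreed, and the three of them left the library together. On the way back, to help them read the code with more focus, Douyiya posed two more small questions and asked them to answer them next time:

How is e2fs_icount obtained?
How does the file system determine whether an inode or a disk block is in use or free?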
