前面我们详细讨论了将一个磁盘分区格式化成ext2文件系统后,一个分区的布局,重点介绍了超级快、块组、位图和索引节点等内容。那么,内核如何跟这些ext2文件系统的对象打交道呢?这个,才是我们研究存储的人应该重点关注的对象,从本篇博文开始,我们就来重点讨论。
首先,当安装Ext2文件系统时(执行诸如mount -t ext2 /dev/sda2 /mnt/test的命令),存放在Ext2分区的磁盘数据结构中的大部分信息将被拷贝到RAM中,从而使内核避免了后来的很多读操作。那么一些数据结构如何经常更新呢?让我们考虑一些基本的操作:
1、当一个新文件被创建时,必须减少磁盘中Ext2超级块中s_free_inodes_count字段的值和相应的组描述符中bg_free_inodes_count字段的值。
2、如果内核给一个现有的文件追加一些数据,以使分配给它的数据块数因此也增加,那么就必须修改Ext2超级块中s_free_blocks_count字段的值和组描述符中bg_free_blocks_count字段的值。
3、即使仅仅重写一个现有文件的部分内容,也要对Ext2超级块的s_wtime字段进行更新。
因为所有的Ext2磁盘数据结构都存放在Ext2磁盘分区的块中,因此,内核利用页高速缓存来保持它们最新(参见“磁盘高速缓存”博文)。下面我们就一项一项的来讲解内核如何跟他们打交道,本篇博文首先介绍ext2超级块对象。
在“VFS文件对象”一博我们介绍过,VFS超级块的s_fs_info字段指向一个文件系统信息的数据结构:
struct super_block {
……
void *s_fs_info; /* Filesystem private info */
……
};
对于Ext2文件系统,该字段指向ext2_sb_info类型的结构:
struct ext2_sb_info {
unsigned long s_frag_size; /* Size of a fragment in bytes */
unsigned long s_frags_per_block; /* Number of fragments per block */
unsigned long s_inodes_per_block; /* Number of inodes per block */
unsigned long s_frags_per_group; /* Number of fragments in a group */
unsigned long s_blocks_per_group; /* Number of blocks in a group */
unsigned long s_inodes_per_group; /* Number of inodes in a group */
unsigned long s_itb_per_group; /* Number of inode table blocks per group */
unsigned long s_gdb_count; /* Number of group descriptor blocks */
unsigned long s_desc_per_block; /* Number of group descriptors per block */
unsigned long s_groups_count; /* Number of groups in the fs */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext2_super_block * s_es; /* Pointer to the super block in the buffer */
struct buffer_head ** s_group_desc;
unsigned long s_mount_opt;
uid_t s_resuid;
gid_t s_resgid;
unsigned short s_mount_state;
unsigned short s_pad;
int s_addr_per_block_bits;
int s_desc_per_block_bits;
int s_inode_size;
int s_first_ino;
spinlock_t s_next_gen_lock;
u32 s_next_generation;
unsigned long s_dir_count;
u8 *s_debts;
struct percpu_counter s_freeblocks_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
struct blockgroup_lock s_blockgroup_lock;
};
其含如下信息:
- 磁盘超级块中的大部分字段
- s_sbh指针,指向包含磁盘超级块的缓冲区的缓冲区首部
- s_es指针,指向磁盘超级块所在的缓冲区
- 组描述符的个数s_desc_per_block,可以放在一个块中
- s_group_desc指针,指向一个缓冲区(包含组描述符的缓冲区)首部数组(只用一项就够了)
- 其他与安装状态、安装选项等有关的数据
下图表示的是与Ext2超级块和组描述符有关的缓冲区与缓冲区首部和ext2_sb_info数据结构之间的关系。
要看懂上面那个图以及后面的文字,建议大家先去看一下“页高速缓存”的相关内容。当内核安装Ext2文件系统时(mount命令),它调用ext2_fill_super()函数来为数据结构分配空间,并写入从磁盘读取的数据(再看看“文件系统安装”博文吧):
static int ext2_fill_super(struct super_block *sb, void *data, int silent)
{
struct buffer_head * bh;
struct ext2_sb_info * sbi;
struct ext2_super_block * es;
struct inode *root;
unsigned long block;
unsigned long sb_block = get_sb_block(&data);
unsigned long logic_sb_block;
unsigned long offset = 0;
unsigned long def_mount_opts;
int blocksize = BLOCK_SIZE;
int db_count;
int i, j;
__le32 features;
sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
return -ENOMEM;
sb->s_fs_info = sbi;
memset(sbi, 0, sizeof(*sbi));
/*
* See what the current blocksize for the device is, and
* use that as the blocksize. Otherwise (or if the blocksize
* is smaller than the default) use the default.
* This is important for devices that have a hardware
* sectorsize that is larger than the default.
*/
blocksize = sb_min_blocksize(sb, BLOCK_SIZE);
if (!blocksize) {
printk ("EXT2-fs: unable to set blocksize/n");
goto failed_sbi;
}
/*
* If the superblock doesn't start on a hardware sector boundary,
* calculate the offset.
*/
if (blocksize != BLOCK_SIZE) {
logic_sb_block = (sb_block*BLOCK_SIZE) / blocksize;
offset = (sb_block*BLOCK_SIZE) % blocksize;
} else {
logic_sb_block = sb_block;
}
if (!(bh = sb_bread(sb, logic_sb_block))) {
printk ("EXT2-fs: unable to read superblock/n");
goto failed_sbi;
}
/*
* Note: s_es must be initialized as soon as possible because
* some ext2 macro-instructions depend on its value
*/
es = (struct ext2_super_block *) (((char *)bh->b_data) + offset);
sbi->s_es = es;
sb->s_magic = le16_to_cpu(es->s_magic);
if (sb->s_magic != EXT2_SUPER_MAGIC)
goto cantfind_ext2;
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
if (def_mount_opts & EXT2_DEFM_DEBUG)
set_opt(sbi->s_mount_opt, DEBUG);
if (def_mount_opts & EXT2_DEFM_BSDGROUPS)
set_opt(sbi->s_mount_opt, GRPID);
if (def_mount_opts & EXT2_DEFM_UID16)
set_opt(sbi->s_mount_opt, NO_UID32);
if (def_mount_opts & EXT2_DEFM_XATTR_USER)
set_opt(sbi->s_mount_opt, XATTR_USER);
if (def_mount_opts & EXT2_DEFM_ACL)
set_opt(sbi->s_mount_opt, POSIX_ACL);
if (le16_to_cpu(sbi->s_es->s_errors) == EXT2_ERRORS_PANIC)
set_opt(sbi->s_mount_opt, ERRORS_PANIC);
else if (le16_to_cpu(sbi->s_es->s_errors) == EXT2_ERRORS_RO)
set_opt(sbi->s_mount_opt, ERRORS_RO);
sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
if (!parse_options ((char *) data, sbi))
goto failed_mount;
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);
ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
EXT2_MOUNT_XIP if not */
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_INCOMPAT_FEATURE(sb, ~0U)))
printk("EXT2-fs warning: feature flags set on rev 0 fs, "
"running e2fsck is recommended/n");
/*
* Check feature flags regardless of the revision level, since we
* previously didn't change the revision level when setting the flags,
* so there is a chance incompat flags are set on a rev 0 filesystem.
*/
features = EXT2_HAS_INCOMPAT_FEATURE(sb, ~EXT2_FEATURE_INCOMPAT_SUPP);
if (features) {
printk("EXT2-fs: %s: couldn't mount because of "
"unsupported optional features (%x)./n",
sb->s_id, le32_to_cpu(features));
goto failed_mount;
}
if (!(sb->s_flags & MS_RDONLY) &&
(features = EXT2_HAS_RO_COMPAT_FEATURE(sb, ~EXT2_FEATURE_RO_COMPAT_SUPP))){
printk("EXT2-fs: %s: couldn't mount RDWR because of "
"unsupported optional features (%x)./n",
sb->s_id, le32_to_cpu(features));
goto failed_mount;
}
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
(sb->s_blocksize != blocksize))) {
if (!silent)
printk("XIP: Unsupported blocksize/n");
goto failed_mount;
}
/* If the blocksize doesn't match, re-read the thing.. */
if (sb->s_blocksize != blocksize) {
brelse(bh);
if (!sb_set_blocksize(sb, blocksize)) {
printk(KERN_ERR "EXT2-fs: blocksize too small for device./n");
goto failed_sbi;
}
logic_sb_block = (sb_block*BLOCK_SIZE) / blocksize;
offset = (sb_block*BLOCK_SIZE) % blocksize;
bh = sb_bread(sb, logic_sb_block);
if(!bh) {
printk("EXT2-fs: Couldn't read superblock on "
"2nd try./n");
goto failed_sbi;
}
es = (struct ext2_super_block *) (((char *)bh->b_data) + offset);
sbi->s_es = es;
if (es->s_magic != cpu_to_le16(EXT2_SUPER_MAGIC)) {
printk ("EXT2-fs: Magic mismatch, very weird !/n");
goto failed_mount;
}
}
sb->s_maxbytes = ext2_max_size(sb->s_blocksize_bits);
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) {
sbi->s_inode_size = EXT2_GOOD_OLD_INODE_SIZE;
sbi->s_first_ino = EXT2_GOOD_OLD_FIRST_INO;
} else {
sbi->s_inode_size = le16_to_cpu(es->s_inode_size);
sbi->s_first_ino = le32_to_cpu(es->s_first_ino);
if ((sbi->s_inode_size < EXT2_GOOD_OLD_INODE_SIZE) ||
(sbi->s_inode_size & (sbi->s_inode_size - 1)) ||
(sbi->s_inode_size > blocksize)) {
printk ("EXT2-fs: unsupported inode size: %d/n",
sbi->s_inode_size);
goto failed_mount;
}
}
sbi->s_frag_size = EXT2_MIN_FRAG_SIZE <<
le32_to_cpu(es->s_log_frag_size);
if (sbi->s_frag_size == 0)
goto cantfind_ext2;
sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;
sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
sbi->s_frags_per_group = le32_to_cpu(es->s_frags_per_group);
sbi->s_inodes_per_group = le32_to_cpu(es->s_inodes_per_group);
if (EXT2_INODE_SIZE(sb) == 0)
goto cantfind_ext2;
sbi->s_inodes_per_block = sb->s_blocksize / EXT2_INODE_SIZE(sb);
if (sbi->s_inodes_per_block == 0 || sbi->s_inodes_per_group == 0)
goto cantfind_ext2;
sbi->s_itb_per_group = sbi->s_inodes_per_group /
sbi->s_inodes_per_block;
sbi->s_desc_per_block = sb->s_blocksize /
sizeof (struct ext2_group_desc);
sbi->s_sbh = bh;
sbi->s_mount_state = le16_to_cpu(es->s_state);
sbi->s_addr_per_block_bits =
log2 (EXT2_ADDR_PER_BLOCK(sb));
sbi->s_desc_per_block_bits =
log2 (EXT2_DESC_PER_BLOCK(sb));
if (sb->s_magic != EXT2_SUPER_MAGIC)
goto cantfind_ext2;
if (sb->s_blocksize != bh->b_size) {
if (!silent)
printk ("VFS: Unsupported blocksize on dev "
"%s./n", sb->s_id);
goto failed_mount;
}
if (sb->s_blocksize != sbi->s_frag_size) {
printk ("EXT2-fs: fragsize %lu != blocksize %lu (not supported yet)/n",
sbi->s_frag_size, sb->s_blocksize);
goto failed_mount;
}
if (sbi->s_blocks_per_group > sb->s_blocksize * 8) {
printk ("EXT2-fs: #blocks per group too big: %lu/n",
sbi->s_blocks_per_group);
goto failed_mount;
}
if (sbi->s_frags_per_group > sb->s_blocksize * 8) {
printk ("EXT2-fs: #fragments per group too big: %lu/n",
sbi->s_frags_per_group);
goto failed_mount;
}
if (sbi->s_inodes_per_group > sb->s_blocksize * 8) {
printk ("EXT2-fs: #inodes per group too big: %lu/n",
sbi->s_inodes_per_group);
goto failed_mount;
}
if (EXT2_BLOCKS_PER_GROUP(sb) == 0)
goto cantfind_ext2;
sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) -
le32_to_cpu(es->s_first_data_block) +
EXT2_BLOCKS_PER_GROUP(sb) - 1) /
EXT2_BLOCKS_PER_GROUP(sb);
db_count = (sbi->s_groups_count + EXT2_DESC_PER_BLOCK(sb) - 1) /
EXT2_DESC_PER_BLOCK(sb);
sbi->s_group_desc = kmalloc (db_count * sizeof (struct buffer_head *), GFP_KERNEL);
if (sbi->s_group_desc == NULL) {
printk ("EXT2-fs: not enough memory/n");
goto failed_mount;
}
bgl_lock_init(&sbi->s_blockgroup_lock);
sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
GFP_KERNEL);
if (!sbi->s_debts) {
printk ("EXT2-fs: not enough memory/n");
goto failed_mount_group_desc;
}
memset(sbi->s_debts, 0, sbi->s_groups_count * sizeof(*sbi->s_debts));
for (i = 0; i < db_count; i++) {
block = descriptor_loc(sb, logic_sb_block, i);
sbi->s_group_desc[i] = sb_bread(sb, block);
if (!sbi->s_group_desc[i]) {
for (j = 0; j < i; j++)
brelse (sbi->s_group_desc[j]);
printk ("EXT2-fs: unable to read group descriptors/n");
goto failed_mount_group_desc;
}
}
if (!ext2_check_descriptors (sb)) {
printk ("EXT2-fs: group descriptors corrupted!/n");
goto failed_mount2;
}
sbi->s_gdb_count = db_count;
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
percpu_counter_init(&sbi->s_freeblocks_counter,
ext2_count_free_blocks(sb));
percpu_counter_init(&sbi->s_freeinodes_counter,
ext2_count_free_inodes(sb));
percpu_counter_init(&sbi->s_dirs_counter,
ext2_count_dirs(sb));
/*
* set up enough so that it can read an inode
*/
sb->s_op = &ext2_sops;
sb->s_export_op = &ext2_export_ops;
sb->s_xattr = ext2_xattr_handlers;
root = iget(sb, EXT2_ROOT_INO);
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
iput(root);
printk(KERN_ERR "EXT2-fs: get root inode failed/n");
goto failed_mount3;
}
if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size) {
dput(sb->s_root);
sb->s_root = NULL;
printk(KERN_ERR "EXT2-fs: corrupt root inode, run e2fsck/n");
goto failed_mount3;
}
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
ext2_warning(sb, __FUNCTION__,
"mounting ext3 filesystem as ext2");
ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
return 0;
cantfind_ext2:
if (!silent)
printk("VFS: Can't find an ext2 filesystem on dev %s./n",
sb->s_id);
goto failed_mount;
failed_mount3:
percpu_counter_destroy(&sbi->s_freeblocks_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
failed_mount2:
for (i = 0; i < db_count; i++)
brelse(sbi->s_group_desc[i]);
failed_mount_group_desc:
kfree(sbi->s_group_desc);
kfree(sbi->s_debts);
failed_mount:
brelse(bh);
failed_sbi:
sb->s_fs_info = NULL;
kfree(sbi);
return -EINVAL;
}
注意,要读懂上面的代码请好好理解页高速缓存相关的内容,因为ext2磁盘超级快ext2_super_block缓存于bh->b_data的某个位置,而对应的块号就是1(0号是引导块)。这里是对该函数的一个简要说明,只强调缓冲区与描述符的内存分配:
1. 分配一个ext2_sb_info描述符,将其地址当作参数传递并存放在超级块sb的s_fs_info字段:
struct buffer_head * bh;
struct ext2_sb_info * sbi;
struct ext2_super_block * es;
struct inode *root;
unsigned long block;
unsigned long sb_block = get_sb_block(&data);
unsigned long logic_sb_block;
unsigned long offset = 0;
unsigned long def_mount_opts;
int blocksize = BLOCK_SIZE;
int db_count;
int i, j;
__le32 features;
sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
return -ENOMEM;
sb->s_fs_info = sbi;
memset(sbi, 0, sizeof(*sbi));
2. 调用__bread()在缓冲区页中分配一个缓冲区和缓冲区首部。然后从磁盘读入超级块存放在缓冲区中。在“在页高速缓存中搜索块”一博我们讨论过,如果一个块已在页高速缓存的缓冲区页而且是最新的,那么无需再分配。将缓冲区首部地址存放在Ext2超级块对象sbi的s_sbh字段:
if (!(bh = sb_bread(sb, logic_sb_block))) {
printk ("EXT2-fs: unable to read superblock/n");
goto failed_sbi;
}
//这里的offset其实就是0。
es = (struct ext2_super_block *) (((char *)bh->b_data) + offset);
sbi->s_es = es;
……
3. 分配一个数组用于存放缓冲区首部指针,每个组描述符一个,把该数组地址存放在ext2_sb_info的s_group_desc字段。
db_count = (sbi->s_groups_count + EXT2_DESC_PER_BLOCK(sb) - 1) /
EXT2_DESC_PER_BLOCK(sb);
sbi->s_group_desc = kmalloc (db_count * sizeof (struct buffer_head *), GFP_KERNEL);
4. 分配一个字节数组,每组一个字节,把它的地址存放在ext2_sb_info描述符的s_debts字段(参见后面的“管理ext2磁盘空间”博文):
sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
GFP_KERNEL);
if (!sbi->s_debts) {
printk ("EXT2-fs: not enough memory/n");
goto failed_mount_group_desc;
}
memset(sbi->s_debts, 0, sbi->s_groups_count * sizeof(*sbi->s_debts));
5. 重复调用__bread()分配缓冲区,从磁盘读人包含Ext2组描述符的块。把缓冲区首部地址存放在上一步得到的s_group_desc数组中:
for (i = 0; i < db_count; i++) {
block = descriptor_loc(sb, logic_sb_block, i);
sbi->s_group_desc[i] = sb_bread(sb, block);
if (!sbi->s_group_desc[i]) {
for (j = 0; j < i; j++)
brelse (sbi->s_group_desc[j]);
printk ("EXT2-fs: unable to read group descriptors/n");
goto failed_mount_group_desc;
}
}
6. 最后安装好sb->s_op,sb->s_export_op,超级块增强属性。因为准备为根目录分配一个索引节点和目录项对象,必须把s_op字段为超级块建立好,从而能够从磁盘读入根索引节点对象:
sb->s_op = &ext2_sops;
sb->s_export_op = &ext2_export_ops;
sb->s_xattr = ext2_xattr_handlers;
root = iget(sb, EXT2_ROOT_INO);
sb->s_root = d_alloc_root(root);
注意,EXT2_ROOT_INO是2,也就是它的根节点位于第一个块组的第三个位置上。很显然,ext2_fill super()函数返回后,有很多关键的ext2磁盘数据结构的内容都保存在内存里了,例如ext2_super_block、第根索引节点等,只有当Ext2文件系统卸载时才会被释放。当内核必须修改Ext2超级块的字段时,它只要把新值写入相应缓冲区内的相应位置然后将该缓冲区标记为脏即可,有看页高速缓存事情就是这么简单!