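I have been reading APUE recently, and one of its chapters covers file systems, so here is a brief summary of the Linux virtual file system. It includes some kernel source, but does not go very deep.
Continuing from the previous post: http://blog.csdn.net/u012927281/article/details/51711085
In the previous post we met a few data structures for the first time. As before, let's start from what we can observe and work inward. We already know that the process descriptor refers to the file-system-related structure "struct files_struct". Besides that, there is also: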
struct fs_struct {
    int users;
    spinlock_t lock;
    seqcount_t seq;
    int umask;
    int in_exec;
    struct path root, pwd;
};
Another structure is:
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
};
Looking at these two structures, neither has much to do with the basic file operations (read, write and so on), so we have to go back to struct files_struct. Here is its content again:
struct files_struct {
    /*
     * read mostly part
     */
    atomic_t count;
    struct fdtable __rcu *fdt;
    struct fdtable fdtab;
    /*
     * written part on a separate cache line in SMP
     */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
The following is taken from LKD (Linux Kernel Development). I have not been able to verify it by experiment, because all of this lives inside the kernel and I do not yet know how to debug the kernel.
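The fd_array array holds pointers to the opened file objects. It has room for only NR_OPEN_DEFAULT entries (defined as BITS_PER_LONG, so 64 on a 64-bit machine), so once a process opens more files than that, the kernel allocates a new, larger table and makes the fdt pointer point to it. We briefly analyzed "struct fdtable" before; as a reminder: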
struct fdtable {
    unsigned int max_fds;
    struct file __rcu **fd; /* current fd array */
    unsigned long *close_on_exec;
    unsigned long *open_fds;
    struct rcu_head rcu;
};
Here fd plays the same role as fd_array: both point to the opened file objects.
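The switch from the embedded fd_array to a larger, dynamically allocated array can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the names toy_files, toy_files_init, toy_fd_install, toy_files_destroy and TOY_INLINE_FDS are made up for illustration, the inline size of 4 and the doubling policy are arbitrary, and the real kernel additionally handles RCU, locking and the open/close-on-exec bitmaps.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INLINE_FDS 4  /* stand-in for NR_OPEN_DEFAULT (hypothetical value) */

struct toy_files {
    unsigned int max_fds;            /* capacity of the array fd points to */
    void **fd;                       /* current array, like fdtable.fd */
    void *fd_array[TOY_INLINE_FDS];  /* embedded array, like files_struct.fd_array */
};

static void toy_files_init(struct toy_files *f)
{
    memset(f->fd_array, 0, sizeof(f->fd_array));
    f->max_fds = TOY_INLINE_FDS;
    f->fd = f->fd_array;             /* start with the embedded array */
}

/* Store file in the lowest free slot; return its index, or -1 if allocation fails. */
static int toy_fd_install(struct toy_files *f, void *file)
{
    unsigned int i, new_max;
    void **bigger;

    assert(file != NULL);
    for (i = 0; i < f->max_fds; i++) {
        if (f->fd[i] == NULL) {
            f->fd[i] = file;
            return (int)i;
        }
    }

    /* Table is full: allocate one twice as large and copy the old slots. */
    new_max = f->max_fds * 2;
    bigger = calloc(new_max, sizeof(*bigger));
    if (bigger == NULL)
        return -1;
    memcpy(bigger, f->fd, f->max_fds * sizeof(*bigger));
    if (f->fd != f->fd_array)
        free(f->fd);
    i = f->max_fds;                  /* first slot of the new half */
    f->fd = bigger;
    f->max_fds = new_max;
    f->fd[i] = file;
    return (int)i;
}

static void toy_files_destroy(struct toy_files *f)
{
    if (f->fd != f->fd_array)
        free(f->fd);
    f->fd = f->fd_array;
    f->max_fds = TOY_INLINE_FDS;
}
```

In this model the first four installs land in the embedded array; the fifth one triggers the allocation, after which fd no longer points at fd_array.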
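Now that file objects have come up, let's study them in detail. According to the material I have read so far (Linux Kernel Development and Understanding the Linux Kernel), the virtual file system (VFS) has four primary object types: the superblock object, the inode object, the dentry object and the file object.
Here I borrow a figure from Understanding the Linux Kernel to show how these four object types relate to each other.
Let's start with struct file. Its basic definition is: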
struct file {
    union {
        struct llist_node fu_llist;
        struct rcu_head fu_rcuhead;
    } f_u;
    struct path f_path;
    struct inode *f_inode; /* cached value */
    const struct file_operations *f_op;

    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t f_lock;
    atomic_long_t f_count;
    unsigned int f_flags;
    fmode_t f_mode;
    struct mutex f_pos_lock;
    loff_t f_pos;
    struct fown_struct f_owner;
    const struct cred *f_cred;
    struct file_ra_state f_ra;

    u64 f_version;
#ifdef CONFIG_SECURITY
    void *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head f_ep_links;
    struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space *f_mapping;
} __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
A file object is the in-memory representation of an open file. The object (not the physical file) is created by the open system call and destroyed by the close system call once its last reference is dropped; all of these file-related calls are in fact methods defined in the file operations table. Because several processes can open and operate on the same file at the same time, one file may have several corresponding file objects. A file object only represents an open file from a process's point of view; it in turn points to a dentry object, and only the dentry object really represents the open file. Every call to open returns a new file descriptor, and even when one process opens the same file twice the descriptors differ, each pointing to a different file object in fd_array. So the file objects for a file are not unique, but the corresponding inode and dentry certainly are.
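That two open calls produce two separate file objects can be observed from user space, because the file offset (f_pos) lives in the file object. The sketch below opens the same path twice and also duplicates the first descriptor with dup, which by contrast shares one file object. The function name compare_offsets is made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens path twice, duplicates the first descriptor, then reads one byte
 * through the first descriptor. Stores the resulting offset seen through
 * the second open() and through the dup() in the output parameters.
 * Returns 0 on success, -1 on any failure (including an empty file).
 */
static int compare_offsets(const char *path, off_t *second_open, off_t *duplicate)
{
    char byte;
    int fd1, fd2, fd3;
    int rc = -1;

    assert(path != NULL && second_open != NULL && duplicate != NULL);

    fd1 = open(path, O_RDONLY);
    fd2 = open(path, O_RDONLY);           /* new file object, own offset */
    fd3 = (fd1 >= 0) ? dup(fd1) : -1;     /* same file object as fd1 */

    if (fd1 >= 0 && fd2 >= 0 && fd3 >= 0 && read(fd1, &byte, 1) == 1) {
        *second_open = lseek(fd2, 0, SEEK_CUR);
        *duplicate = lseek(fd3, 0, SEEK_CUR);
        rc = 0;
    }

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    if (fd3 >= 0) close(fd3);
    return rc;
}
```

For a non-empty file, the offset seen through the second open stays at 0 while the offset seen through the duplicated descriptor advances to 1 together with the first one.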
Three fields are the most important here:
struct path f_path;
struct inode *f_inode; /* cached value */
const struct file_operations *f_op;
First, the definition of f_path, located in include/linux/path.h:
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};
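Next, the f_inode field. It points to an inode object, which differs from the figure above: here the file object is linked to the inode object directly. This also differs from the descriptions in Linux Kernel Development and Understanding the Linux Kernel, where the file object has no such field; it was probably introduced after 2.6.
Still, the comment allows a simple guess at what f_inode is for: it is probably a cached pointer to the inode, so that the inode can be reached directly without going through the dentry object.
Next comes struct file_operations. The f_op field points to this table, which defines all the operations on a file object. The definition is: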
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iterate) (struct file *, struct dir_context *);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    void (*mremap)(struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
};
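Even this brief look at the file object shows that the VFS implementation is largely object-oriented in spirit: an object carries both the data it operates on and the functions that operate on that data.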
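The "data plus a table of function pointers" pattern can be reproduced in a few lines of ordinary C. The following is a minimal sketch, not kernel code: toy_file, toy_file_ops, mem_read, null_read and toy_read are hypothetical names chosen for illustration, with toy_file_ops playing the role of struct file_operations and the ops pointer playing the role of f_op.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Table of methods, playing the role of struct file_operations. */
struct toy_file_ops {
    size_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* The object: its data plus a pointer to the methods that operate on it. */
struct toy_file {
    const struct toy_file_ops *ops;
    const char *data;   /* backing bytes of a memory-backed file */
    size_t size;
    size_t pos;         /* current offset, in the spirit of f_pos */
};

/* "Method" of a memory-backed file: copy bytes and advance the offset. */
static size_t mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return len;
}

/* "Method" of a file that is always at end-of-file. */
static size_t null_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

static const struct toy_file_ops mem_ops = { .read = mem_read };
static const struct toy_file_ops null_ops = { .read = null_read };

/* Generic entry point: the caller never needs to know which kind of file it has. */
static size_t toy_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, len);
}
```

Callers only ever use toy_read; which implementation runs is decided by the ops pointer stored in the object, which is the same idea that lets one read system call serve every file system.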
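After this brief analysis of the file object, let's move one layer down to the dentry object. VFS treats directories as files, so a given path may contain both directory files and regular files, and every component of the path is represented by an inode object. Although they can all be represented by inodes, VFS often needs to perform directory-specific operations such as pathname lookup. Pathname lookup has to resolve every component of the path, not only making sure it is valid but also going on to find the next component. To make lookup easier, VFS introduces the concept of a directory entry (dentry). Each dentry represents one specific component of a path. One point must be clear: every component of a path, regular files included, is a dentry object. Resolving a path and walking its components is not a trivial exercise: it is dominated by string comparisons, which are expensive to execute and cumbersome to code. The dentry object makes this process easier (I do not really understand this point yet, and cannot say what things would look like without dentry objects).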
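To make "every component of a path" concrete, here is a small user-space sketch that splits a path into its components, the units that would each correspond to one dentry. The function name walk_components and its callback signature are assumptions made for this illustration; the kernel's real path walk is far more involved (mount points, symbolic links, "." and "..", permissions). The root directory "/" also has a dentry of its own, which this sketch does not count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Calls visit(name, len, ctx) once for each component of path, in order.
 * name is not NUL-terminated; len gives its length. Repeated slashes are
 * skipped. visit may be NULL. Returns the number of components found.
 */
static int walk_components(const char *path,
                           void (*visit)(const char *name, size_t len, void *ctx),
                           void *ctx)
{
    int count = 0;

    assert(path != NULL);
    while (*path != '\0') {
        size_t len;

        while (*path == '/')        /* skip separators */
            path++;
        if (*path == '\0')
            break;
        len = strcspn(path, "/");   /* length of this component */
        if (visit != NULL)
            visit(path, len, ctx);
        count++;
        path += len;
    }
    return count;
}
```

For the path /bin/vi the callback is invoked for "bin" and then for "vi"; a real lookup would have to find or build a dentry at each of those steps.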
Back to the topic. The dentry object is defined as follows, in include/linux/dcache.h.
struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;           /* protected by d_lock */
    seqcount_t d_seq;               /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;        /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;          /* Where the name belongs to - NULL is
                                     * negative */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */

    /* Ref lookup also touches following */
    struct lockref d_lockref;       /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;       /* The root of the dentry tree */
    unsigned long d_time;           /* used by d_revalidate */
    void *d_fsdata;                 /* fs-specific data */

    struct list_head d_lru;         /* LRU list */
    struct list_head d_child;       /* child of parent list */
    struct list_head d_subdirs;     /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;  /* inode alias list */
        struct rcu_head d_rcu;
    } d_u;
};
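The following is quoted directly from Linux Kernel Development and Understanding the Linux Kernel.
A dentry object can be in one of three states: in use, unused, or negative. An in-use dentry corresponds to a valid inode (d_inode points to it) and has one or more users. An unused dentry also corresponds to a valid inode, but VFS is not currently using it; it is kept in the dentry cache in case it is needed again. A negative dentry has no valid inode (d_inode is NULL), either because the inode was deleted or because the path name was never correct.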
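The three states can be told apart from just two pieces of information: whether the dentry has an inode, and whether anyone is using it. The sketch below encodes that rule in a simplified user-space model. The types and names (toy_dentry, toy_classify, the TOY_DENTRY_* constants) are hypothetical, and the plain counter stands in for the reference count that the real struct dentry keeps inside d_lockref.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state {
    TOY_DENTRY_IN_USE,    /* valid inode, at least one user */
    TOY_DENTRY_UNUSED,    /* valid inode, no users, kept for reuse */
    TOY_DENTRY_NEGATIVE   /* no inode behind the name */
};

struct toy_dentry {
    void *inode;          /* models d_inode; NULL means negative */
    unsigned int count;   /* models the reference count */
};

/* Classify a dentry according to the three states described in the text. */
static enum toy_state toy_classify(const struct toy_dentry *d)
{
    assert(d != NULL);
    if (d->inode == NULL)
        return TOY_DENTRY_NEGATIVE;
    return d->count > 0 ? TOY_DENTRY_IN_USE : TOY_DENTRY_UNUSED;
}
```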
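The dentry cache was mentioned above; here is a quick look at it.
Reading a directory entry from disk and building the corresponding dentry object takes a lot of time, and the entry may well be needed again after the current operation finishes, so it makes sense to keep it in memory. To make handling dentry objects as efficient as possible, Linux uses a dentry cache, which consists of two kinds of data structures: a set of dentry objects in the in-use, unused, or negative state, and a hash table that quickly yields the dentry object for a given file name and directory.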
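All in-use dentry objects are inserted into a doubly linked list pointed to by the i_dentry field of the corresponding inode object (a list is needed because one inode may be associated with several hard links). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list. In the 2.6 kernel described by the book both fields are of type struct list_head; in the definitions quoted in this post they are struct hlist_head (i_dentry) and struct hlist_node (d_alias).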
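Unused and negative dentry objects are inserted into a doubly linked "least recently used" (LRU) list. Entries are always inserted at the head of the list, so the elements near the head are always newer than those near the tail. Whenever the kernel shrinks the dentry cache it frees elements from the tail; negative dentry objects move toward the tail of the LRU list, so they are gradually freed.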
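The "insert at the head, reclaim from the tail" discipline can be sketched with a plain doubly linked list. This is a user-space model only; toy_node, toy_lru, toy_lru_add and toy_lru_reclaim are hypothetical names, and the kernel's own LRU handling (per-superblock lists, shrinkers, locking) is considerably more elaborate.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *prev, *next;
    int id;
};

struct toy_lru {
    struct toy_node *head;  /* most recently added entry */
    struct toy_node *tail;  /* oldest entry, first to be reclaimed */
};

/* New entries always go to the head of the list. */
static void toy_lru_add(struct toy_lru *lru, struct toy_node *n)
{
    assert(lru != NULL && n != NULL);
    n->prev = NULL;
    n->next = lru->head;
    if (lru->head != NULL)
        lru->head->prev = n;
    else
        lru->tail = n;      /* list was empty */
    lru->head = n;
}

/* Shrinking removes the entry at the tail; returns NULL if the list is empty. */
static struct toy_node *toy_lru_reclaim(struct toy_lru *lru)
{
    struct toy_node *victim;

    assert(lru != NULL);
    victim = lru->tail;
    if (victim == NULL)
        return NULL;
    lru->tail = victim->prev;
    if (lru->tail != NULL)
        lru->tail->next = NULL;
    else
        lru->head = NULL;   /* list became empty */
    victim->prev = victim->next = NULL;
    return victim;
}
```

Entries added in the order 1, 2, 3 are reclaimed in the same order 1, 2, 3, so the most recently added ones survive the longest.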
A hash table and its hash function are used to resolve a given path quickly into the corresponding dentry object.
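As a rough illustration of such a table, the sketch below hashes the pair (parent directory, component name) into a bucket and chains colliding entries. Everything here is an assumption made for the example: the names toy_dentry, toy_hash, toy_d_add and toy_d_lookup, the bucket count of 64, and the simple multiplicative hash are not taken from the kernel, which uses its own hash function and RCU-protected hash lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUCKETS 64

struct toy_dentry {
    const struct toy_dentry *parent;  /* directory that contains this name */
    const char *name;                 /* one path component */
    struct toy_dentry *hash_next;     /* next entry in the same bucket */
};

static struct toy_dentry *toy_table[TOY_BUCKETS];

/* Mix the parent pointer and the name into a bucket index. */
static unsigned int toy_hash(const struct toy_dentry *parent, const char *name)
{
    uintptr_t h = (uintptr_t)parent;

    while (*name != '\0')
        h = h * 31 + (unsigned char)*name++;
    return (unsigned int)(h % TOY_BUCKETS);
}

/* Insert a dentry at the front of its bucket chain. */
static void toy_d_add(struct toy_dentry *d)
{
    unsigned int b;

    assert(d != NULL && d->name != NULL);
    b = toy_hash(d->parent, d->name);
    d->hash_next = toy_table[b];
    toy_table[b] = d;
}

/* Find the dentry called name inside parent, or return NULL. */
static struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                       const char *name)
{
    struct toy_dentry *d;

    assert(name != NULL);
    for (d = toy_table[toy_hash(parent, name)]; d != NULL; d = d->hash_next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

Resolving /bin/vi then becomes two lookups: "bin" inside the root, followed by "vi" inside the result of the first lookup.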
Next, a quick look at the operations of the dentry object:
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *);
    int (*d_compare)(const struct dentry *, const struct dentry *, unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(struct dentry *, bool);
    struct inode *(*d_select_inode)(struct dentry *, unsigned);
} ____cacheline_aligned;
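The analysis so far has given us a rough idea of the design philosophy behind the virtual file system, namely OOP, so let's keep to that approach for the inode object: first the class members, then the class operations. struct inode is defined as follows, in include/linux/fs.h.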
struct inode {
    umode_t i_mode;
    unsigned short i_opflags;
    kuid_t i_uid;
    kgid_t i_gid;
    unsigned int i_flags;

#ifdef CONFIG_FS_POSIX_ACL
    struct posix_acl *i_acl;
    struct posix_acl *i_default_acl;
#endif

    const struct inode_operations *i_op;
    struct super_block *i_sb;
    struct address_space *i_mapping;

#ifdef CONFIG_SECURITY
    void *i_security;
#endif

    /* Stat data, not accessed from path walking */
    unsigned long i_ino;
    /*
     * Filesystems may only read i_nlink directly. They shall use the
     * following functions for modification:
     *
     * (set|clear|inc|drop)_nlink
     * inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink; /* number of hard links */
    };
    dev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_mtime;
    struct timespec i_ctime;
    spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
    unsigned short i_bytes;
    unsigned int i_blkbits;
    blkcnt_t i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t i_size_seqcount;
#endif

    /* Misc */
    unsigned long i_state;
    struct mutex i_mutex;

    unsigned long dirtied_when; /* jiffies of first dirtying */

    struct hlist_node i_hash;
    struct list_head i_wb_list; /* backing dev IO list */
    struct list_head i_lru; /* inode LRU list */
    struct list_head i_sb_list;
    union {
        struct hlist_head i_dentry;
        struct rcu_head i_rcu;
    };
    u64 i_version;
    atomic_t i_count; /* reference counter */
    atomic_t i_dio_count;
    atomic_t i_writecount;
#ifdef CONFIG_IMA
    atomic_t i_readcount; /* struct files open RO */
#endif
    const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
    struct file_lock *i_flock;
    struct address_space i_data;
    struct list_head i_devices;
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
    };

    __u32 i_generation;

#ifdef CONFIG_FSNOTIFY
    __u32 i_fsnotify_mask; /* all events this inode cares about */
    struct hlist_head i_fsnotify_marks;
#endif

    void *i_private; /* fs or device private pointer */
};
Three fields are the most important:
unsigned long i_state;
const struct inode_operations *i_op;
struct super_block *i_sb;
#define I_DIRTY_SYNC        (1 << 0)
#define I_DIRTY_DATASYNC    (1 << 1)
#define I_DIRTY_PAGES       (1 << 2)
#define __I_NEW             3
#define I_NEW               (1 << __I_NEW)
#define I_WILL_FREE         (1 << 4)
#define I_FREEING           (1 << 5)
#define I_CLEAR             (1 << 6)
#define __I_SYNC            7
#define I_SYNC              (1 << __I_SYNC)
#define I_REFERENCED        (1 << 8)
#define __I_DIO_WAKEUP      9
#define I_DIO_WAKEUP        (1 << __I_DIO_WAKEUP)
#define I_LINKABLE          (1 << 10)

/* the inode is "dirty": its on-disk copy must be updated */
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
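The i_state flags quoted above are plain bit masks, so testing for "dirty" is a single AND against the combined I_DIRTY mask. The sketch below shows the idea in user space. The TOY_ prefixed macros are local copies of a few of the values quoted above, and the helper names toy_mark_dirty, toy_clear_dirty and toy_is_dirty are made up for illustration; they are not kernel functions.

```c
#include <assert.h>

/* Local copies of some of the bit values quoted above. */
#define TOY_I_DIRTY_SYNC     (1UL << 0)
#define TOY_I_DIRTY_DATASYNC (1UL << 1)
#define TOY_I_DIRTY_PAGES    (1UL << 2)
#define TOY_I_NEW            (1UL << 3)
#define TOY_I_DIRTY \
    (TOY_I_DIRTY_SYNC | TOY_I_DIRTY_DATASYNC | TOY_I_DIRTY_PAGES)

/* Set one or more dirty bits; other state bits are left untouched. */
static unsigned long toy_mark_dirty(unsigned long state, unsigned long flags)
{
    assert((flags & ~TOY_I_DIRTY) == 0);   /* only dirty bits are accepted */
    return state | flags;
}

/* Clear all dirty bits, for example after the inode has been written back. */
static unsigned long toy_clear_dirty(unsigned long state)
{
    return state & ~TOY_I_DIRTY;
}

/* An inode counts as dirty if any of the three dirty bits is set. */
static int toy_is_dirty(unsigned long state)
{
    return (state & TOY_I_DIRTY) != 0;
}
```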
struct inode_operations {
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
    void * (*follow_link) (struct dentry *, struct nameidata *);
    int (*permission) (struct inode *, int);
    struct posix_acl * (*get_acl)(struct inode *, int);

    int (*readlink) (struct dentry *, char __user *,int);
    void (*put_link) (struct dentry *, struct nameidata *, void *);

    int (*create) (struct inode *,struct dentry *, umode_t, bool);
    int (*link) (struct dentry *,struct inode *,struct dentry *);
    int (*unlink) (struct inode *,struct dentry *);
    int (*symlink) (struct inode *,struct dentry *,const char *);
    int (*mkdir) (struct inode *,struct dentry *,umode_t);
    int (*rmdir) (struct inode *,struct dentry *);
    int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
    int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
    int (*rename2) (struct inode *, struct dentry *, struct inode *, struct dentry *, unsigned int);
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
    int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
    ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*removexattr) (struct dentry *, const char *);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
    int (*update_time)(struct inode *, struct timespec *, int);
    int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened);
    int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);

    /* WARNING: probably going away soon, do not use! */
} ____cacheline_aligned;
struct super_block {
    struct list_head s_list; /* Keep this first */
    dev_t s_dev; /* search index; _not_ kdev_t */
    unsigned char s_blocksize_bits;
    unsigned long s_blocksize;
    loff_t s_maxbytes; /* Max file size */
    struct file_system_type *s_type;
    const struct super_operations *s_op;
    const struct dquot_operations *dq_op;
    const struct quotactl_ops *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long s_flags;
    unsigned long s_iflags; /* internal SB_I_* flags */
    unsigned long s_magic;
    struct dentry *s_root;
    struct rw_semaphore s_umount;
    int s_count;
    atomic_t s_active;
#ifdef CONFIG_SECURITY
    void *s_security;
#endif
    const struct xattr_handler **s_xattr;

    struct list_head s_inodes; /* all inodes */
    struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */
    struct list_head s_mounts; /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info *s_mtd;
    struct hlist_node s_instances;
    unsigned int s_quota_types; /* Bitmask of supported quota types */
    struct quota_info s_dquot; /* Diskquota specific options */

    struct sb_writers s_writers;

    char s_id[32]; /* Informational name */
    u8 s_uuid[16]; /* UUID */

    void *s_fs_info; /* Filesystem private info */
    unsigned int s_max_links;
    fmode_t s_mode;

    /* Granularity of c/m/atime in ns.
       Cannot be worse than a second */
    u32 s_time_gran;

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex; /* Kludge */

    /*
     * Filesystem subtype. If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    /*
     * Saved mount options for lazy filesystems using
     * generic_show_options()
     */
    char __rcu *s_options;
    const struct dentry_operations *s_d_op; /* default d_op for dentries */

    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;

    struct shrinker s_shrink; /* per-sb shrinker handle */

    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;

    /* Being remounted read-only */
    int s_readonly_remount;

    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;

    /*
     * Keep the lru lists last in the structure so they always sit on their
     * own individual cachelines.
     */
    struct list_lru s_dentry_lru ____cacheline_aligned_in_smp;
    struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
    struct rcu_head rcu;

    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;
};
Finally, the superblock operations, also defined in include/linux/fs.h.
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);

    void (*dirty_inode) (struct inode *, int flags);
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    int (*drop_inode) (struct inode *);
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*umount_begin) (struct super_block *);

    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *, int);
    long (*free_cached_objects)(struct super_block *, long, int);
};
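Most of these methods are only called from inside the kernel, but the effect of the statfs method can be observed from user space: as far as I understand, the statfs and statvfs calls end up in the statfs method of the file system that holds the given path. The sketch below uses the POSIX statvfs function to print a few per-file-system figures. The helper name print_fs_usage is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/*
 * Prints block size, block counts and maximum file name length of the
 * file system that contains path. Returns 0 on success, -1 on failure.
 */
static int print_fs_usage(const char *path)
{
    struct statvfs sv;

    assert(path != NULL);
    if (statvfs(path, &sv) != 0)
        return -1;

    printf("%s: block size %lu, total blocks %llu, free blocks %llu, "
           "max name length %lu\n",
           path,
           (unsigned long)sv.f_bsize,
           (unsigned long long)sv.f_blocks,
           (unsigned long long)sv.f_bfree,
           (unsigned long)sv.f_namemax);
    return 0;
}
```

Calling print_fs_usage("/") and print_fs_usage("/proc") on the same machine normally prints quite different numbers, because each path belongs to a different superblock.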
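While studying the file system we came across the "dentry cache"; a similar mechanism is the "inode cache". Both are kinds of "disk cache". A disk cache is a software mechanism that lets the kernel keep in RAM some information that normally lives on disk, so that later accesses to that data are fast and do not have to go to the slow disk.
Related concepts are the "hardware cache" and the "memory cache"; I will analyze them in detail when I run into them.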
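Finally, here is some recommended material, also from the web: http://wenku.baidu.com/link?url=nrZ4fZXU7e8dTtx9rrdrfgdK3hqnw8LEJcWxvvq4yME-SoFflpBRVaVnUYYMwdKXquqF47Twh4DwPuZdxSuGxyrgqBvfWal7MzN6mnAeXb_
Pay particular attention to the figure on page 15, which uses a concrete example to illustrate the relationships among the four file system objects described above.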