When an application writes a file, the data does not become permanent immediately. The write operation first moves the data into the operating system cache in RAM, where it is vulnerable to system crashes and loss of power. The second step is the transfer to the hard disk, which normally has write caching enabled. The disk acknowledges the data straight away, but keeps it in the disk write cache, which is still volatile memory. The data is now safe from a system crash [1], but is not safe from loss of power. On a modern disk, this may be 16 MB or more of data in an unknown state.
As performance enhancements in ext4 have made committing data to disk a contentious issue, I’ve written a note on how different platforms handle data consistency.
Delayed Allocation
The root of the latest problem is an optimisation called delayed allocation. In delayed allocation, the file system does not decide where on the disk to save the file until it is necessary to transfer the data to the disk. Linux users have become accustomed to the ordered data mode of ext3, where file data is written to disk before changes to metadata [2]. Ordered data mode only writes out data when it knows the destination on disk, so with delayed allocation the metadata will go to disk first. If an application creates a file before a system crash, the file may exist after the crash, but with zero length. This caused user complaints when implemented in XFS, and again when implemented in ext4.
ext4 has implemented a workaround for the common case of creating a new file with a temporary name, then renaming it to its final name. This produces an approximation to the ext3 behaviour by allocating blocks when a file is renamed.
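The temporary-name-then-rename pattern mentioned above can be sketched in C roughly as follows. This is a minimal illustration, not code from any particular application: the function name replace_file() and the caller-supplied tmp_path are hypothetical, and the explicit fsync() before rename() is an assumption about what a cautious application would do rather than something the ext4 workaround requires.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` with `len` bytes from `data` by way of a temporary file.
 * `tmp_path` is a hypothetical name chosen by the caller; it must live on
 * the same file system as `path`, because rename() cannot cross file systems.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop until done. */
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Ask the file system to allocate blocks and write the data out
     * before the new name becomes visible. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Swap the new file into place under its final name. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A call such as replace_file("settings.conf", "settings.conf.tmp", buf, n) leaves either the old or the new contents under the final name, provided the file system orders the data ahead of the rename as described above.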
The Platforms
Apple
Apple implements delayed allocation in HFS+. When the application calls fsync() on a file, HFS+ allocates disk blocks for the file data and transfers that data to the disk, but fsync() does not wait for the disk to complete writing the blocks from its cache. A complete flush to disk requires the F_FULLFSYNC operation of fcntl().
Reports of zero length files after crashes are rare on Mac OS X, suggesting that system applications are well behaved and the window of opportunity for corruption is short. It is advisable to call fsync() for safety here.
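As a rough sketch of how an application might request the full flush, the helper below tries the fcntl() full-sync command (spelled F_FULLFSYNC in <fcntl.h> on Mac OS X) and falls back to plain fsync(). The name full_sync() is hypothetical, and the fallback policy is an assumption for illustration; on platforms that do not define F_FULLFSYNC the helper is simply fsync().

```c
#include <fcntl.h>
#include <unistd.h>

/* Flush `fd` as far as the platform allows.
 * Returns 0 on success, -1 on error. */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive to empty its write cache as well. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Not every file system supports F_FULLFSYNC; fall through to fsync(). */
#endif
    return fsync(fd);
}
```

The full flush is considerably slower than fsync(), so it is probably best reserved for data that must survive a power failure.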
Microsoft
The allocation strategy of NTFS is not visible externally, but experiments suggest that it does not implement delayed allocation. Applications can open files with the FILE_FLAG_WRITE_THROUGH flag, which causes all writes to be sent directly to the disk. The FlushFileBuffers() call will ensure that data from a file is written to the disk, then flushed from the disk’s write cache and committed.
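The two Windows mechanisms above might be combined along these lines. This is a hedged sketch: write_through() is a hypothetical helper name, and the non-Windows branch is only a stand-in using fsync() so that the example compiles on other platforms; it does not claim the same guarantees.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <unistd.h>
#endif

/* Write `len` bytes from `data` to `path` and flush them.
 * Returns 0 on success, -1 on error. */
int write_through(const char *path, const void *data, size_t len)
{
#ifdef _WIN32
    /* FILE_FLAG_WRITE_THROUGH sends each write straight to the disk. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    DWORD written = 0;
    BOOL ok = WriteFile(h, data, (DWORD)len, &written, NULL)
              && written == (DWORD)len
              && FlushFileBuffers(h);   /* flush file data to the disk */
    if (!CloseHandle(h))
        ok = FALSE;
    return ok ? 0 : -1;
#else
    /* Stand-in for other platforms (an assumption, not an equivalent). */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int ok = write(fd, data, len) == (ssize_t)len && fsync(fd) == 0;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
#endif
}
```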
Windows Vista and Server 2008 introduce a new mechanism: Transactional NTFS. This allows applications to perform database style transactions in the filesystem, ensuring that a set of file operations either completes or fails entirely.
Linux
The ext3 file system allocates disk blocks immediately on write. When combined with the ordered data mode, this ensures that application data is written consistently. Unfortunately, fsync() on ext3 has developed a reputation as an expensive operation, so developers avoid it. fsync() on ext3 writes all file data to the disk, and waits for the disk to commit the data from its write cache, but only if the file metadata has changed and the file system is not mapped via LVM [3].
fsync() on ext4 allocates disk blocks for the file data, then writes the data to disk and waits for the disk to commit the data from its write cache, with the same limitations as ext3.
Unfortunately, Linux does not offer an equivalent of F_FULLFSYNC or FlushFileBuffers() unless the hard disk write cache is disabled.
Summary
The table below shows how to achieve different levels of consistency on recent versions of each platform covered above.
| Consistency level | Mac OS X | Windows | Linux |
| --- | --- | --- | --- |
| Write to disk without cache flush | fsync() | FILE_FLAG_WRITE_THROUGH | fsync() |
| Write to disk and wait for cache flush | F_FULLFSYNC | FlushFileBuffers() | None [4] |
| Transactions | None | Transactional NTFS | None |
1. Ignoring some worst case scenarios.
2. Metadata is data about the file, such as timestamps and permissions. On most Linux filesystems the file name is not part of the metadata, as the file may have multiple hard links.
3. This should be fixed in kernel 2.6.29 for simple cases.
4. If the hard disk write cache is disabled, fsync() on ext3 and ext4 will provide a complete sync to disk.