IFS FAQ
Q1 How difficult is it to port a Windows 9x based file system or file system filter driver to Windows NT/2000/XP?
Q2 Is there a WDM model for file systems or file system filter drivers?
Q3 How does a file system or file system filter driver handle PnP, Power Management, and WMI in Windows 2000/XP?
Q4 How do file systems get loaded on Windows NT/2000/XP?
Q5 How is the file system's device object found?
Q6 What's the right way to cancel a CREATE request in my filter driver?
Q7 How do I deal with file-sharing issues in my filter driver?
Q8 What's the proper way to install my file system or file system filter driver?
Q9 Does WHQL logo file systems or file system filter drivers?
Q10 When my customers install my file system or filter driver product, will they get that nasty pop-up saying "this is not a signed driver...Microsoft recommends that you do not continue"??
Q11 How is cache coherency handled when a file is opened for "ordinary" (cached) I/O and also opened memory mapped?
Q12 Can I leave integration with the cache manager out of my product to simplify things? What's the impact of doing this?
Q13 Must I support Fast I/O in my file system or filter, and where is Fast I/O documented?
Q14 Filtering file systems doesn't look that hard... I have a free sample I downloaded off the web. What are the limitations of the currently available file system filter driver samples?
Q15 Is Rajeev Nagar's book accurate, and should I use it as a reference?
Q16 Are there any other books on NT/2K/XP file system or file system filter development? What other resources are available to help me??
Q17 What are the primary differences between filtering rdr and filtering a local file system?
Q18 In general, when can I use the IFS kit's mini-rdr model, and when is it best for me to write my own rdr from scratch?
Q19 Does Microsoft offer support for developing file systems or file system filters?
Q20 Does NT/2K/XP have anything like a VFS interface?
Q21 Are the sources for NTFS available for general reference?
Q22 Can any part of a file system or a file system filter driver be pageable?
Q23 How does the defragmentation API work? Is there anything special that a filter driver must do to handle it correctly?
Q24 How are file IDs and Object IDs used in the file systems? In my filter driver, how do I deal with them?
Q25 What is the difference between cached I/O, user non-cached I/O, and paging I/O?
Q26 How do I map devices back to drive letters in Windows NT/2000/XP? Can I use the mount manager to do this? If so, how do I do this?
Q27 An application opened the file using the short name. How do I retrieve the long name?
Q28 I see I/O requests with the IRP_MN_MDL minor function code? What does this mean? How should I handle it in my file system or filter driver?
Q29 Who is responsible for maintaining the 'file pointer' (CurrentByteOffset field)? When an application does an append to the end of the file, how is this presented to the file system?
Q30 I never see any calls to several fast I/O operations. Does this mean I don't need to filter them? What happens if I do need to filter them?
Q31 Issues calling an FSD from a completion routine
Q32 Obtaining a drive letter assignment
Q33 Handling FILE_COMPLETE_IF_OPLOCKED in a filter driver.
Q34 Opening files during IRP_MJ_CREATE processing
Q35 How do I retrieve the "user name" for the user performing a given operation?
Q36 How do I detect reentrancy back into my filter driver?
Q37 How do I call a user-mode function from my kernel driver?
Q38 What is structured exception handling? How should I use it? Why do I get STOP code 0x1E (KMODE_EXCEPTION_NOT_HANDLED)? How do I deal with this?
Q39 I am using a completion routine in my filter. What am I allowed to do? What am I NOT allowed to do? What alternatives do I have to performing the work in my completion routine?
Q40 How do I force files to be closed from my file system/filter driver?
Q41 Can I rely upon the RelatedFileObject field in the FileObject? How should I use this information?
Q42 How do I deal with the "recycle bin"? Is this some special directory in the file system?
Q43 I need to access the file, but it is locked for exclusive access. How do I get around this?
Q44 I need to read a range of the file but it has a byte range lock on it. How can I bypass these byte range locks?
Q45 I need to build my own IRP. How do I do this?
Q46 How do timestamps work on files? What is the "change time" versus the "modify time"?
Q47 How does dismount work? How does this differ from media removal? Device removal?
Q48 What happens if I mix memory mapped I/O with regular file I/O? What if the file I/O is non-cached?
Q49 I am getting a PFN_LIST_CORRUPT STOP code. What does this mean? What could I be doing wrong? How do I work around this problem?
Q50 What are the rules for my file system/filter driver for handling paging I/O? What about paging file I/O?
Q51 I am getting NO_MORE_IRP_STACK_LOCATIONS as a stop code. How do I fix this?
Q52 What are the obsolete calls in Windows 2000? In Windows XP?
Q53 How do I enumerate the contents of a directory from kernel mode?
Q54 I am building a filter driver where I must change the directory information. How do I do that?
Q55 I see the user close the file. My filter receives an IRP_MJ_CLEANUP. But I never see the IRP_MJ_CLOSE? Why not?
Q56 What are the rules for managing MDLs and User Buffers? How do I substitute my own buffer in an IRP?
Q57 What are the issues with respect to IRQL APC_LEVEL? What does it do? Why should I use (or not use) FsRtlEnterFileSystem?
Q58 How do I determine if the FILE_OBJECT represents a file or a directory from my filter driver? Can I rely upon the FILE_DIRECTORY_FILE bit?
Q59 How do I determine if the IRP is coming from a local process or over the network?
Q60 How should I deal with Fast I/O in my file system? In my filter driver?
Q61 I am suffering from stack overflow issues. How do I deal with this?
Q62 What is the difference between EOF and AllocationSize? Why is the AllocationSize the same for a file AFTER it is compressed?
Q63 What is the IFS Kit? How do I get it? I'm not in the US/Canada. Can I still buy it? Can I buy it from a retail distributor? With a purchase order?
Q64 I open a file but later when I try to use the handle I get back an error indicating an invalid handle (or invalid object type) error. What am I doing wrong? How can I use my handle?
Q65 When can I rely upon the file name in the FileObject structure?
Q1 How difficult is it to port a Windows 9x based file system or file system filter driver to Windows NT/2000/XP?
In general one does not "port" a Windows 9x (Windows 95, Windows 98, and Windows Me) based file system to Windows NT, Windows 2000, or Windows XP. This is because the two file system models are quite a bit different. Thus, in general, "porting" consists more of re-implementing the file system.
This is particularly true for "filter drivers" where Windows 9x provides the IFS Manager "hook" mechanism and Windows NT uses the file system filter driver model. The two models are incompatible with one another.
It is possible for a file system to be written in an "OS independent" fashion, but in general such undertakings are substantial projects and are not normally considered "ports" of a file system (or filter driver) from one OS to another OS.
There are numerous differences between a Windows NT and Windows 98 file system that make porting a difficult task. They include:
(a) A substantially different I/O Model for file system operations. Windows 9x uses the "IFS Manager" interface. Windows NT uses the I/O Manager (IRP-based) interface.
(b) Windows NT is a re-entrant operating system, while Windows 98 is not. Thus, calls can re-enter the storage stack in Windows NT. This introduces more complex locking and synchronization semantics than are present in Windows 98.
(c) Windows NT makes considerable use of the demand paged virtual memory system with respect to file systems. Windows 9x file systems do not rely on the virtual memory system in the same way - where they use it at all, such use is entirely different!
Of course, this list isn't intended to be comprehensive, but rather to demonstrate some of the "big" differences that developers find when moving from one OS to the other.
Q2 Is there a WDM model for file systems or file system filter drivers?
The Windows Driver Model (WDM) does not have a model for file systems or file system filter drivers. Thus, it is not possible to construct a file system or file system filter driver that is cross-platform compatible using the standard Windows Driver Model mechanism. This is because the model for file systems is substantially different between the Windows 9x and Windows NT families of operating systems.
Q3 How does a file system or file system filter driver handle PnP, Power Management, and WMI in Windows 2000/XP?
Plug and Play, Power Management, and Windows Management Instrumentation (WMI) are normally handled in a minimal fashion in the Windows 2000/XP file systems. Of course, this does not mean that file systems are prohibited from handling such requests, only that as a matter of course most file systems do not need to be aware of all the PnP, Power Management, and WMI operations available.
For physical file systems, and file system filter drivers for those file systems, the most significant PnP operations are those that handle removable devices (which are not to be confused with removable media). A removable device is a device that can be dynamically removed from the system, such as a storage device connected via a bus such as USB. In this case, the file system follows a rather specific pattern of behavior (which is not the same as the behavior described in the DDK for the physical storage drivers).
When an FSD receives a plug and play request inquiring whether the device may be removed (IRP_MJ_PNP, with IRP_MN_QUERY_REMOVE_DEVICE), it might check that no critical files are located on the device (paging files, registry hives, etc.) and, if there are none, pass the request through to the underlying media device. When the device is removed (IRP_MJ_PNP, with IRP_MN_REMOVE_DEVICE), the file system waits to ensure that the underlying media has been successfully removed. Once that's done, it logically "dismounts" the volume (this is internal to the file system) and cleans up its own internal state. After all, the volume that was contained within the removable device is gone.
Most other plug and play operations would not normally apply to a physical media file system. This can be seen in the FAT file system code distributed as part of the IFS Kit.
A network file system may also be interested in the appearance or removal of specific protocol stacks, and thus may become involved in the monitoring of plug and play events. The physical file systems are tied directly to the media device via the volume parameter block, whereas the network file systems are not directly tied to the underlying protocol stacks and thus cannot rely upon the same IRP passing mechanism. Instead, a network file system would normally utilize a user-mode service to monitor state changes in the protocols by registering for notification of such events. Then, when a protocol has been loaded (or is being unloaded) the user-mode service can indicate that state change to the underlying driver using its private interface into the driver.
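The query-remove and remove handling described above can be modeled in a few lines of ordinary user-mode C. This is only a sketch of the decision logic, not driver code: the `volume` structure, the `on_query_remove` and `on_remove` functions, and the `lower_remove_fn` callback (which stands in for passing the IRP down and waiting for the lower device) are all hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-volume state kept by the model file system. */
struct volume {
    bool mounted;
    bool has_paging_file;
    bool has_registry_hive;
};

enum query_result { QUERY_VETO, QUERY_PASS_DOWN };

/* Query-remove: refuse while critical files live on the volume;
 * otherwise the request would be passed to the underlying device. */
static enum query_result on_query_remove(const struct volume *v)
{
    assert(v != NULL);
    if (v->has_paging_file || v->has_registry_hive)
        return QUERY_VETO;
    return QUERY_PASS_DOWN;
}

/* Stand-in for sending the remove request down and waiting for it. */
typedef bool (*lower_remove_fn)(void);

/* Remove: wait for the lower device first, then dismount and clean up.
 * Returns true once the volume has been logically dismounted. */
static bool on_remove(struct volume *v, lower_remove_fn remove_lower)
{
    assert(v != NULL && remove_lower != NULL);
    if (!remove_lower())
        return false;              /* lower removal failed; keep state */
    v->mounted = false;            /* the volume is gone with the device */
    v->has_paging_file = false;
    v->has_registry_hive = false;
    return true;
}

/* Trivial lower device for the model: removal always succeeds. */
static bool lower_remove_ok(void) { return true; }
```

In a real FSD both steps are driven by IRP_MJ_PNP requests and the wait is on the completion of the forwarded IRP; the model only captures the ordering of the checks.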
Power management is not supported by any of the existing physical media file systems because there is no need to do so. Prior to powering down the system, the OS ensures that the file systems are called to flush any dirty data back to their media (which is, of course, the primary concern of most file systems). It is possible to support power management within a file system, but at the present time there are no examples of this in the IFS Kit.
Windows Management Instrumentation (WMI) for file systems is the same as it is for normal device drivers. The examples in the IFS Kit do not use the WMI mechanism for control, configuration or statistics gathering. Instead, they rely upon the Windows NT 4.0 mechanism (IOCTL calls) for retrieving such information. Thus, there are no standard information formats supported by WMI for file systems.
Q4 How do file systems get loaded on Windows NT/2000/XP?
File systems are loaded via the I/O Manager on Windows NT, Windows 2000 and Windows XP, based upon information stored in the registry, or by explicit requests to load the driver via the ZwLoadDriver API (which is documented in the Windows XP IFS Kit).
While it is possible to build a Windows 2000-style file system driver (which utilizes an AddDevice entry point), none of the existing Windows 2000 file systems function in this manner. Thus, they are neither installed using INF scripts, nor does the plug and play manager start them.
Q5 How is the file system's device object found?
Each time a user performs an open operation (via ZwCreateFile) they must either specify an absolute path name or they must specify a path name relative to an existing open file or directory.
In the case of a relative open, the I/O Manager "knows" to which device it should send the requests, because the existing open file or directory already references the correct file system device instance. Thus, the request can be sent directly to that device.
In the case of an absolute open, the I/O Manager must first start by parsing the name via the Object Manager. The Object Manager resolves a portion of the name that leads to a device object and then passes the balance of the name (the portion that has not yet been resolved) back to the I/O Manager, along with a pointer to the device object it located during its portion of name resolution.
The I/O Manager then examines the device object. If the device object has a valid pointer to a volume parameter block (VPB), the I/O Manager uses the DeviceObject field of the VPB to find the correct file system device object. If the device object does not have a valid VPB pointer, then the I/O Manager sends the IRP_MJ_CREATE request directly to the device driver for the device object passed in by the object manager.
For example, a physical media file system volume is located via the VPB. If, for instance, an application program attempts to open some file, say "c:\fred\bob.txt", this name is translated (by the implementation of the Win32 CreateFile API call) into its Windows NT/2000/XP equivalent name, "\DosDevices\c:\fred\bob.txt". When this name is passed to the I/O Manager, it must be handed off to the object manager in order to locate the correct physical media device object.
The object manager begins by attempting to locate the "DosDevices" entry. This entry is a symbolic link to the "\??" directory (which, in turn, is a special directory whose contents can vary on a per-process basis). The name is therefore reconstructed using the link target, and parsing begins again with the new name, "\??\c:\fred\bob.txt". The parsing for "\??" locates the correct directory (the one that can vary on a per-process basis) and then finds the "c:" entry in that directory. This is yet another symbolic link. Its contents vary from one computer to another, depending upon configuration, but typically it is a name such as "\Device\HarddiskVolume1" (this from a real Windows XP system). Thus, the name being parsed becomes "\Device\HarddiskVolume1\fred\bob.txt". Parsing this name, we find the Device directory and then the HarddiskVolume1 device object within that directory.
This device object, along with the balance of the name, is then passed to the I/O Manager, which examines the device object in question. Using the DeviceTree utility (available from OSR, and also scheduled for inclusion in the Windows XP DDK), we can see that this device object has a VPB structure that in turn points to the correct file system device instance. Thus, the name passed by the object manager to the FSD in this instance is "\fred\bob.txt", and the file system's device object found via the VPB is passed to its IRP_MJ_CREATE dispatch entry point.
Had we performed the same analysis using a different drive letter, such as the "I:" drive, which maps to a network share, we would have followed similar parsing logic until we found the device name, which on our sample system was "\Device\LanmanRedirector\;I:0\server\c$". However, when the I/O Manager examines the device object (for "\Device\LanmanRedirector") it will not find a VPB. Thus, it will pass the balance of the name to the IRP_MJ_CREATE handler of this device directly. Note that the name passed in this case would be quite different: "\;I:0\server\c$\fred\bob.txt".
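The parsing walk described above can be illustrated with a small user-mode model. This is a sketch only: the link and device tables are hard-coded from the example names in the text (written with the backslash separators that NT object names use), matching is assumed to be case-insensitive, and `resolve_name`, `match_prefix`, and the structures are hypothetical names, not Object Manager interfaces.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define PATH_CAP 260

/* Hypothetical symbolic-link table, taken from the example in the text. */
struct symlink { const char *name; const char *target; };
static const struct symlink k_links[] = {
    { "\\DosDevices", "\\??" },
    { "\\??\\c:",     "\\Device\\HarddiskVolume1" },
    { "\\??\\i:",     "\\Device\\LanmanRedirector\\;I:0\\server\\c$" },
};

/* Hypothetical device table; has_vpb marks a device with a mounted volume. */
struct device { const char *name; int has_vpb; };
static const struct device k_devices[] = {
    { "\\Device\\HarddiskVolume1",  1 },
    { "\\Device\\LanmanRedirector", 0 },
};

struct resolved { const struct device *dev; char remainder[PATH_CAP]; };

/* Length of prefix if path starts with it at a component boundary, else 0. */
static size_t match_prefix(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);
    for (size_t i = 0; i < n; i++) {
        if (path[i] == '\0' ||
            tolower((unsigned char)path[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    }
    return (path[n] == '\0' || path[n] == '\\') ? n : 0;
}

/* Follow links until a device is found; returns 1 on success, 0 otherwise. */
static int resolve_name(const char *name, struct resolved *out)
{
    char cur[PATH_CAP], next[PATH_CAP];
    assert(name != NULL && out != NULL);
    if (strlen(name) >= sizeof cur)
        return 0;
    strcpy(cur, name);

    for (int pass = 0; pass < 16; pass++) {      /* bound the reparses */
        int reparsed = 0;
        for (size_t i = 0; i < sizeof k_links / sizeof k_links[0]; i++) {
            size_t n = match_prefix(cur, k_links[i].name);
            if (n == 0)
                continue;
            /* Rebuild the name from the link target and parse again. */
            int len = snprintf(next, sizeof next, "%s%s",
                               k_links[i].target, cur + n);
            if (len < 0 || (size_t)len >= sizeof next)
                return 0;
            strcpy(cur, next);
            reparsed = 1;
            break;
        }
        if (reparsed)
            continue;
        for (size_t i = 0; i < sizeof k_devices / sizeof k_devices[0]; i++) {
            size_t n = match_prefix(cur, k_devices[i].name);
            if (n != 0) {
                out->dev = &k_devices[i];
                strcpy(out->remainder, cur + n);  /* balance of the name */
                return 1;
            }
        }
        return 0;                                 /* nothing matched */
    }
    return 0;
}
```

With the result in hand, the model's caller would do what the I/O Manager does: if `dev->has_vpb` is set, the create goes to the file system device found through the VPB; otherwise it goes directly to the driver that owns the device, together with the longer remainder.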
Q6 What's the right way to cancel a CREATE request in my filter driver?
A file system filter driver will sometimes need to pass an IRP_MJ_CREATE request to the underlying file system and determine (in its completion routine for IRP_MJ_CREATE) whether or not the operation should be successful. The difficulty is that by then the file object has been declared to the underlying FSD. If the filter driver simply fails the IRP_MJ_CREATE operation it will cause a memory leak, because the FSD will not know the file object is no longer in use.
For Windows 2000 and Windows XP a file system filter driver can use the IoCancelFileOpen API. For earlier versions of Windows NT, equivalent functionality is achieved by:
- Setting the FO_HANDLE_CREATED flag in the Flags field of the file object; and
- Sending an IRP_MJ_CLEANUP to the underlying FSD; and
- Sending an IRP_MJ_CLOSE operation to the underlying FSD.
These I/O operations should be performed synchronously. Once these have all been completed, the status of the original IRP must be set to an error value and the I/O operation completed.
Note: this technique does not undo the side effects of the IRP_MJ_CREATE. To determine if an IRP_MJ_CREATE will have side effects you must examine the create disposition. The FILE_SUPERSEDE, FILE_OPEN_IF, FILE_OVERWRITE and FILE_OVERWRITE_IF operations can have definite side effects, some of which are extremely difficult to "undo" (e.g., FILE_SUPERSEDE). If your filter needs to "undo" such operations, you must first determine the state of the underlying file system before allowing the operation to proceed.
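As a rough illustration of the disposition check, the sketch below classifies a create request in plain user-mode C. The disposition constants carry the values defined in the Windows headers, and the disposition is taken from the high 8 bits of the options value, as it is in the Parameters.Create.Options field of an IRP_MJ_CREATE stack location. The helper names (`make_options`, `create_disposition`, `create_may_have_side_effects`) are hypothetical, and the classification follows only the four dispositions named in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Create disposition values as defined in the Windows headers. */
enum {
    FILE_SUPERSEDE    = 0,
    FILE_OPEN         = 1,
    FILE_CREATE       = 2,
    FILE_OPEN_IF      = 3,
    FILE_OVERWRITE    = 4,
    FILE_OVERWRITE_IF = 5
};

/* Pack a disposition (high 8 bits) with create options (low 24 bits). */
static uint32_t make_options(uint32_t disposition, uint32_t create_options)
{
    assert(disposition <= 0xFFu);
    assert((create_options & 0xFF000000u) == 0);
    return (disposition << 24) | create_options;
}

/* Extract the disposition from a packed options value. */
static uint32_t create_disposition(uint32_t options)
{
    return (options >> 24) & 0xFFu;
}

/* Nonzero if the disposition is one the text lists as having side effects.
 * FILE_CREATE is left out only because the text does not list it. */
static int create_may_have_side_effects(uint32_t options)
{
    switch (create_disposition(options)) {
    case FILE_SUPERSEDE:
    case FILE_OPEN_IF:
    case FILE_OVERWRITE:
    case FILE_OVERWRITE_IF:
        return 1;
    default:
        return 0;
    }
}
```

A filter that intends to veto creates after the fact could use a check like this in its dispatch routine to decide whether it needs to inspect the file's prior state before sending the request down.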
Q7 How do I deal with file-sharing issues in my filter driver?
File sharing semantics are often an unforeseen complication for file system filter drivers. This is because filter drivers often wish to create separate handles against the file so that they can perform operations against the file, independent of the application-initiated request to access the file. For example, an on-access virus scanner might wish to read an executable in order to scan it - this requires read access, while execution does not require read access.
Regardless of the reason, the "problem" that many file system filter drivers have is that when they attempt to open the file, they must be able to handle file-sharing issues. Specifically, if another application opens the file, it might specify exclusive access to the file. It may be critical to the function of the filter driver that it be allowed to access the file even though normal file sharing semantics would reject the access request. There are a few techniques that a file system filter driver can use to "circumvent" file sharing semantics:
- The filter driver can implement the file sharing internally in the filter driver; in this case it always asks the underlying FSD for full sharing of the file and enforces the sharing at its own level; or
- The filter driver can use memory mapping to access the file; or
- The filter driver can build IRPs and access the file directly.
While it is possible for a file system filter driver to implement the file sharing semantics internally, this is often the most difficult approach because of the complex nature of the sharing semantics. To properly track sharing state, the filter driver must track (on a per-file basis) the current sharing state of the file. Typically, this is done to mirror the behavior of the FAT file system example in the IFS Kit. Thus, when the file is opened, the sharing is managed by using IoCheckShareAccess and/or IoUpdateShareAccess. When the user closes the file (as indicated by the IRP_MJ_CLEANUP operation) the filter removes the share access (IoRemoveShareAccess) so that subsequent operations to access the file succeed.
In computing share access there are two important elements:
- The access requested for the specific open instance of the file, which is recorded in the file object; and
- The share access allowed by existing applications using the file.
The actual implementation of this can be somewhat complicated, which is why it is generally best to allow the FSD to manage this, or to use the I/O Manager routines.
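To make the two elements concrete, here is a user-mode model of the bookkeeping. It is a sketch written in the spirit of the I/O Manager's share access routines, not their implementation: the structures and the `share_check`, `share_update`, and `share_remove` functions are hypothetical, and, as an assumption of the model, opens that request none of read, write, or delete access are not counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-file sharing state: how many openers hold each access,
 * and how many openers permit each kind of sharing. */
struct share_state {
    unsigned open_count;
    unsigned readers, writers, deleters;
    unsigned shared_read, shared_write, shared_delete;
};

/* What one open instance asks for. */
struct open_request {
    bool read, write, del;                        /* desired access */
    bool share_read, share_write, share_delete;   /* sharing allowed */
};

static bool counts(const struct open_request *r)
{
    return r->read || r->write || r->del;
}

/* True if the new open is compatible with all existing openers. */
static bool share_check(const struct share_state *s,
                        const struct open_request *r)
{
    assert(s != NULL && r != NULL);
    /* Every existing opener must permit the access being requested. */
    if ((r->read  && s->shared_read   < s->open_count) ||
        (r->write && s->shared_write  < s->open_count) ||
        (r->del   && s->shared_delete < s->open_count))
        return false;
    /* The new opener must permit the access others already hold. */
    if ((s->readers  != 0 && !r->share_read)  ||
        (s->writers  != 0 && !r->share_write) ||
        (s->deleters != 0 && !r->share_delete))
        return false;
    return true;
}

/* Record a successful open. */
static void share_update(struct share_state *s, const struct open_request *r)
{
    assert(s != NULL && r != NULL);
    if (!counts(r))
        return;
    s->open_count++;
    s->readers       += r->read;
    s->writers       += r->write;
    s->deleters      += r->del;
    s->shared_read   += r->share_read;
    s->shared_write  += r->share_write;
    s->shared_delete += r->share_delete;
}

/* Undo share_update when the handle is cleaned up. */
static void share_remove(struct share_state *s, const struct open_request *r)
{
    assert(s != NULL && r != NULL);
    if (!counts(r))
        return;
    assert(s->open_count > 0);
    s->open_count--;
    s->readers       -= r->read;
    s->writers       -= r->write;
    s->deleters      -= r->del;
    s->shared_read   -= r->share_read;
    s->shared_write  -= r->share_write;
    s->shared_delete -= r->share_delete;
}
```

For example, once a reader that shares everything has been recorded, a second request for read access that shares nothing fails the check, and it succeeds again after the first opener has been removed.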
A second approach relies on the fact that the section object for a file can always be used to memory map the file.
A third approach is to build read and write IRPs directly. Because security checks (including whether or not a read/write is allowed) are done in the I/O Manager before the operation is passed to the file system, the filter driver can perform I/O operations against a file object that was not opened for read or write access, simply by constructing read and write IRPs within the filter driver itself.
Q8 What's the proper way to install my file system or file system filter driver?
Typically, a file system driver is a legacy driver. As such, it is installed by using the Service Control Manager, or by directly making changes to the registry. A file system filter driver normally installs in a similar fashion, although this will change as the Windows XP team begins to add WHQL certification mechanisms for certain types of file system filter drivers. Such drivers will be required to utilize INF scripts.
In addition, it is possible to build a plug-and-play "type" file system. In such a case, it is possible to build an INF file for such a driver. However, there are no existing samples of how to achieve this.
Q9 Does WHQL logo file systems or file system filter drivers?
Not at the present time. However, WHQL is working on a plan for the future that will provide testing and certification for certain types of file system filter drivers. The first target file system filter drivers for certification will be anti-virus filters.
Q10 When my customers install my file system or filter driver product, will they get that nasty pop-up saying "this is not a signed driver...Microsoft recommends that you do not continue"??
Not at the present time. Once Microsoft begins to implement a protocol for signing file system filter drivers this pop-up would be displayed for any file system filter driver that registered using one of the defined class GUIDs but was not signed.
This pop-up dialog box does not apply for devices that register using a GUID for which Microsoft does not have a signing program, which includes file systems and file system filter drivers.
Currently, Microsoft is working on developing a signing program for the filter part of anti-virus products containing a file system filter driver. Once that program is in-place, a driver developer will need to submit their driver to Microsoft so that the results can be certified and the resulting driver signed by Microsoft. In the future, Microsoft may extend this program to include other file system filter drivers as well, although it seems unlikely there will ever be a "generic" file system filter driver certification process, given the broad range of possible behaviors from file system filter drivers.
File systems do not presently have a program for certification. However, in theory, it should be possible to submit a file system for certification, provided that it passes all of the relevant tests (that is, those tests used to ensure proper behavior of FAT and NTFS). We note that this is "in theory" because such certification is not a standard program and would require acceptance under a case-by-case review by Microsoft.
Q11 How is cache coherency handled when a file is opened for "ordinary" (cached) I/O and also opened memory mapped?
The Windows NT/2000/XP file systems all utilize the virtual memory system for caching the file system data. The task of managing the file system data is the responsibility of the cache manager. The cache manager accomplishes this by memory mapping the files - which in turn is done by the Memory Manager.
An application (such as Notepad) that uses memory-mapped files does so by memory mapping the files, again via the Memory Manager.
What this means is that if an application program reads data from the file system (using the standard Win32 ReadFile API call) the file system satisfies this request by calling the Cache Manager (via CcCopyRead, for example) and the Cache Manager then, in turn, copies the data from virtual memory.
The advantage of this approach is that there is only a single copy of the file - the one in virtual memory. Thus, both read/write and memory mapped access to the file are coherent.
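The single-copy idea can be shown with a toy user-mode model in which the "cached" read and write paths and the "mapped" view all operate on one buffer. This is an illustration only: `file_section`, `map_view`, `cached_read`, and `cached_write` are hypothetical names, and the copy routines merely stand in for what the Cache Manager does on the file system's behalf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FILE_CAP 64

/* One file's data as held in virtual memory: the only copy there is. */
struct file_section {
    unsigned char data[FILE_CAP];
    size_t length;
};

/* Memory-mapped access: the caller gets a pointer into the section. */
static unsigned char *map_view(struct file_section *sec)
{
    assert(sec != NULL);
    return sec->data;
}

/* Cached read: copy out of the same section (loosely like CcCopyRead). */
static size_t cached_read(const struct file_section *sec, size_t offset,
                          void *buf, size_t len)
{
    assert(sec != NULL && buf != NULL);
    if (offset >= sec->length)
        return 0;
    if (len > sec->length - offset)
        len = sec->length - offset;       /* clamp to end of file */
    memcpy(buf, sec->data + offset, len);
    return len;
}

/* Cached write: copy into the same section, extending the file if needed. */
static size_t cached_write(struct file_section *sec, size_t offset,
                           const void *buf, size_t len)
{
    assert(sec != NULL && buf != NULL);
    if (offset >= FILE_CAP)
        return 0;
    if (len > FILE_CAP - offset)
        len = FILE_CAP - offset;          /* clamp to capacity */
    memcpy(sec->data + offset, buf, len);
    if (offset + len > sec->length)
        sec->length = offset + len;
    return len;
}
```

Because there is no second copy to fall out of date, a byte stored through the mapped view is what the next cached read returns, and a cached write is immediately visible through the view.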
Q12 Can I leave integration with the cache manager out of my product to simplify things? What's the impact of doing this?
While it is possible to construct a file system that does not integrate with the cache manager, in general there are a number of potential problems of which a developer should be aware:
(a) Integration with the Memory Manager (versus the Cache Manager) is seldom optional. This is because any memory-mapped file access requires supporting the Memory Manager. A file system that does not support the Memory Manager will not work with Notepad in Windows 2000, nor will program execution function properly.
(b) A file system that does integrate with the Memory Manager but does not integrate with the Cache Manager must be aware that it may not be possible to maintain cache coherency between the Memory Manager and the file system. If a file is memory mapped by an application program, it is not possible for a file system to indicate that the cached (memory mapped) data has changed. Thus, the application program will continue to use the old information.
To implement a file system so that it does not "integrate" with the cache manager, it is sufficient for the file system to refrain from calling any of the Cc* routines. However, to support memory mapped files (executables, Notepad, etc.) it is necessary for the FSD to create the storage for the SECTION_OBJECT_POINTERS structure (to which the SectionObjectPointer field of the FileObject points). The Memory Manager will utilize this storage to track the section objects used as part of supporting a file-backed section object.
An FSD then controls the contents of the memory manager state by using MmFlushSection and MmForceSectionClosed in order to advise the memory manager of changes to the section, or to force the memory manager to tear down references to the section. Note that neither of these calls is guaranteed to lead to the desired results - thus, the FSD must be written to handle cases where these calls return errors.
Q13 Must I support Fast I/O in my file system or filter, and where is Fast I/O documented?
Fast I/O is optional for a file system. For a file system filter driver, it is required if any file system it filters supports fast I/O. Because all of the standard Windows NT/2000/XP file systems support fast I/O, essentially it is required of all filter drivers.
Fast I/O is a mechanism that was introduced into Windows NT in order to optimize the handling of certain I/O operations. In addition, fast I/O routines are also used for a variety of communications channels between the FSD and various kernel components, including the I/O Manager, the File System Runtime Library, and the CIFS File Server. Unfortunately, at the present time Fast I/O is not documented, although future versions of the IFS Kit will include documentation on how the Fast I/O routines are used.
Q14 Filtering file systems doesn't look that hard... I have a free sample I downloaded off the web. What are the limitations of the currently available file system filter driver samples?
The examples we have seen often serve as a good "starting point" for a file system filter driver project. Unfortunately, most of these examples are "passive" filters - that is, they examine an I/O operation but do not attempt to modify the I/O or the flow of control to the underlying FSD.
Thus, the more closely the samples fit your intended file system filter driver, the more applicable they will be. If, however, your primary goal is to modify the behavior of the underlying file systems, such as might be the case with an on-access virus scanner or an on-the-fly encryption program, then these samples serve as little more than that - a sample of how to implement a file system filter driver.
Q15 Is Rajeev Nagar's book accurate, and should I use it as a reference?
Rajeev Nagar's book can serve as a good reference text for background, as well as for a discussion of many of the issues involved in writing a file system and/or a file system filter driver for Windows NT. The most significant limitation to the book is that much of it was done against Windows NT 3.51, although it was released just after Windows NT 4.0 became available. Thus, some of the detailed information is no longer applicable. However, since it is the only text available on this subject, it still plays an important role in the reference shelf of a file systems developer.
Q16 Are there any other books on NT/2K/XP file system or file system filter development? What other resources are available to help me??
Obviously, one important source of information for those developing Windows NT/2000/XP file systems is the NTFSD mailing list (for which you can sign up at the OSR Website. List members may submit questions to [email protected].) In addition, there are books on Windows NT device drivers that can provide much of the background for drivers in general:
· Viscarola, Peter G. and W. Anthony Mason, Windows NT Device Driver Development, Macmillan Technical Publishing, 1998.
· Dekker, Edward N and Joseph M. Newcomer, Developing Windows NT Device Drivers: A Programmer's Handbook, Addison-Wesley Publishing Company, 1999.
· Baker, Art, The Windows NT Device Driver Book: A Guide for Programmers, Prentice Hall, 1996.
· Cant, Chris, Writing Windows WDM Device Drivers: Covers NT 4, Win 98, and Win 2000, CMP Books, 1999.
· Oney, Walter and Forrest Foltz, Programming the Microsoft Windows Driver Model, Microsoft Press, 1999.
· McDowell, Steven, Windows 2000 Kernel Debugging, Prentice Hall Computer Books, 2001.
· Solomon, David A and Mark Russinovich, Inside Microsoft Windows 2000 (Third Edition), September 2000.
· Nebbett, Gary, Windows NT/2000 Native API Reference, New Riders Publishing, 2000.
· Schreiber, Sven, Undocumented Windows 2000 Secrets, Addison-Wesley, 2001.
There are other resources as well, including the Microsoft IFS Kit and OSR's The NT Insider.
Q17 What are the primary differences between filtering rdr and filtering a local file system?
In general, filtering a local file system differs somewhat from filtering a redirector in a number of key areas:
(a) File state tracking. Network file systems frequently defer operations in order to minimize the network traffic. For example, CIFS defers opening files until necessary. This can lead to changing the "FsContext" value of the file object, for instance, which complicates the tracking for file system filter drivers.
(c) Security. Network file systems routinely restrict operations that can be performed by the system process using standard system credentials. Thus file system filter drivers may need to utilize advanced techniques, such as impersonation, etc. in order to circumvent these security considerations.
(d) Private APIs between user mode service and kernel mode driver. Some of the processing needed by the network redirector is implemented in a user mode service. The user mode service and kernel mode driver interact using a private IOCTL-based API.
(e) Support for UNC naming. The Universal Naming Convention (UNC) consists of names that begin with a prefix (e.g., "server" and "share" in CIFS) and then the balance of the name. A filter must be able to handle both UNC names and "volume based" names.
(f) Complex naming issues. Names are not unique for redirectors. For example, the volume based name and the UNC name are quite different, yet they can each be used to reach the same file.
(g) Some operations do not translate to redirectors. For example, "open by ID" works with CIFS, but the file ID cannot be obtained directly from all versions of the CIFS redirector - instead, a "dummy value" is returned by the CIFS redirector.
Many operations are identical between the two, but there are subtle behavior differences that can dramatically impact the development of a file system filter driver - beware!
Q18 In general, when can I use the IFS kit's mini-rdr model, and when is it best for me to write my own rdr from scratch?
The IFS Kit's mini redirector model is most suitable for CIFS-like network file system clients, although it can be adapted to a broad range of file systems, including file systems that are not "redirectors" in the traditional sense. The primary advantage of the mini-redirector model is that it simplifies the interactions with the virtual memory system.
Writing any file system, including a redirector, "from scratch" is a substantial undertaking and will typically involve many months of diligent work. It is most appropriate when the file system being developed does not fit into a standard model (such as the mini-redirector model) or when the file system requires the ability to control the file system/virtual memory interactions.
Q19 Does Microsoft offer support for developing file systems or file system filters?
Microsoft's support offerings for file systems currently include making the IFS Kit available. Newer versions of the IFS Kit include documentation on how to use the kit. At the present time (October, 2001) Microsoft does not offer development support for developing file systems or file system filter drivers, although that may be subject to change.
Q20 Does NT/2K/XP have anything like a VFS interface?
The Virtual File System (VFS) interface is a UNIX abstraction originally developed in order to simplify the porting of the Network File System (NFS) to various UNIX platforms. It provides a model where there is a "standard" set of operations that are supported by a file system. The operating system then, in turn, works with the file system to maintain a "vnode" cache (where a "vnode" is a virtual file system information element) and to provide core OS management services for working with the OS.
The Windows NT/2000/XP platform also has a "standard" set of operations that it expects to be supported by file systems. The operating system then invokes the relevant file system services by calling upon those standard services.
However, the details of the implementation of the UNIX VFS interface are quite different from the IRP major function dispatch entry points of a Windows NT file system. Because of this, a UNIX VFS-based file system does not "port" in any trivial fashion to Windows NT/2000/XP.
Q21 Are the sources for NTFS available for general reference?
The Installable File Systems (IFS) Kit contains a number of real file systems that are distributed with Windows NT/2000/XP. This includes the FAT file system and the CDFS file system. However, Microsoft does not distribute the NTFS file system as part of the IFS Kit. While it may be possible to license other components of Windows NT/2000/XP from Microsoft, this is not something that is available via the IFS Kit mechanism.
Q22 Can any part of a file system or a file system filter driver be pageable?
Depending upon the type of file system, some, or all of the file system can be pageable. In addition, there are some special issues for physical media file systems that support paging files (these are special files used by the OS to store pageable data when it is not in use).
First, it is worth noting the underlying issue that motivates these rules about paging: the Windows operating system cannot handle an arbitrary number of stacked page faults. The stack in the kernel is limited - typically 12KB - and any reentrant behavior, including page faults, can cause stack overflow conditions. Thus, in general the rule is that at most two page faults may be "stacked". The second page fault will occur when attempting to resolve the first page fault. That second fault is subject to some stringent rules.
For any arbitrary code within a file system, or file system filter driver, that can be invoked while handling paging activity (that is any I/O operation where the IRP_PAGING_IO bit is present in the IRP) the file system must ensure that its own code necessary for handling that I/O can always be fetched safely. For example, the NTFS and FAT file systems must ensure their code is locked in memory because if it were paged in from the disk it might be paged in from an NTFS or FAT volume. However, since the code necessary to handle that paging I/O would not be present, there would be no way for the page fault to be resolved. This, of course, applies to file system filter drivers that filter NTFS or FAT, as well. If you are writing a file system, and your file system code is never stored on your file system, then you can make your code paths pageable.
For all file systems (even NTFS or FAT) you can make data accessed while handling paging activity pageable. Thus, you can store file system data in pageable memory, as well as use file-backed sections (which are "demand paged"). Even if the file system is processing a page fault, a second page fault can be handled. In that case, however, processing that second page fault must not generate yet another page fault because that would then constitute the dreaded "three faults". Thus, there is SOME data within the file system that must be in non-paged memory. Typically, this is any data required to process the paging activity to satisfy an IRP_MJ_READ operation, since that is required for retrieving data from a file.
For the paging file, there are special rules because access to the paging file cannot safely perform any operation that will generate a page fault. This is because the memory manager may be retrieving the contents of paged pool from the file system. Since we have already described the paths where the file system may be handling a stacked page fault, any new page fault would constitute a triple-fault situation, which is catastrophic. Thus, the rules for paging file access are very restrictive: no access to pageable code or data is allowed; only OS routines that are safe at DISPATCH_LEVEL may be used (even though the OS is only running at APC_LEVEL when processing these page faults); and the Memory Manager controls all serialization for paging operations (to eliminate potential deadlock conditions).
Because of these restrictions, the Memory Manager makes certain guarantees to file systems. Perhaps the most important is that it does not perform "extending write" operations. Thus, the portion of a file system that handles space allocation need not be resident in memory.
Note that the rules for paging files do not apply to file systems that do not support paging files. A file system can recognize an attempt to open or create a paging file because the SL_OPEN_PAGING_FILE bit will be indicated in the Flags field of the I/O stack location for the IRP_MJ_CREATE operation.
In Windows XP and Windows Server 2003 the routine FsRtlIsPagingFile may be used to determine if a given file object represents the paging file.
Note that a file system filter driver that filters any file system supporting paging files must obey the same rules.
Q23 How does the defragmentation API work? Is there anything special that a filter driver must do to handle it correctly?
The defragmentation API is a mechanism that was introduced in Windows NT 4.0 to allow application programs to defragment files on the local disks. Perhaps the best source of information about how defragmentation works in Windows 2000 is the Platform SDK, which describes the use of FSCTL_GET_VOLUME_BITMAP, FSCTL_GET_RETRIEVAL_POINTERS, and FSCTL_MOVE_FILE. These three operations are used to defragment a given file. In addition, the IFS Kit includes the FAT file system source code and its implementation of defragmentation.
A file system filter driver will typically not be concerned about the defragmentation API because it is implemented internally within the file system. Under certain circumstances a filter may need to be aware that the physical location of a file has changed, such as might be the case for a SAN file system filter driver that uses these APIs to track the logical storage locations for one (or more) files.
Q24 How are file IDs and Object IDs used in the file systems? In my filter driver, how do I deal with them?
A file ID is a 64-bit identifier created by the file system to provide a volume-level unique identifier for a given file. The initial purpose behind creating such IDs was to allow certain types of applications (e.g., file servers) to associate a numeric identifier with the file. For example, the stateless NFS file system provides a handle to its clients. If the handle incorporates the file ID, it is possible for the file server to open the file at an arbitrary point in time by using the file ID, rather than by using a path to the file. In Windows 2000, Microsoft introduced the concept of an object ID. Object IDs are created by application programs and are optionally assigned to files by using the FSCTL_SET_OBJECT_ID, FSCTL_GET_OBJECT_ID, and FSCTL_DELETE_OBJECT_ID file system control operations.
To retrieve the file ID for a file, an application program can query the "internal ID" of the file (the FileInternalInformation class). The application then opens the file using this ID by using the ZwCreateFile API. The Object Attributes structure must specify the handle of an existing (and open) file or directory on the volume where the file is located. The name of the file is then the file ID or Object ID and the FILE_OPEN_BY_FILE_ID option must be set. The file system can then use this ID to open the file.
However, not all file systems support opening a file using a file ID or object ID. For example, the FAT file system will generate and return a file ID, but it does not support "open by file ID":
//
// If this is an open by file ID operation, just fail it explicitly. FAT's
// source of fileids is not reversible for open operations.
//
if (BooleanFlagOn( Options, FILE_OPEN_BY_FILE_ID )) {
    FatCompleteRequest( IrpContext, Irp, STATUS_NOT_IMPLEMENTED );
    return STATUS_NOT_IMPLEMENTED;
}
However, the CDFS example in the IFS Kit does support open by file ID, although it does not support object IDs:
//
// For the open by file ID case we verify the name really contains
// a 64 bit value.
//
} else {
    //
    // Check for validity of the buffer.
    //
    if (FileName->Length != sizeof( FILE_ID )) {
        return STATUS_INVALID_PARAMETER;
    }
}
Thus, this check rejects a 128-bit ID (such as an object ID), since it accepts only a 64-bit value. Only the NTFS file system supports opening a file using both its file ID and its object ID.
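To make the length check concrete, here is a minimal user-mode model of how a 64-bit file ID travels in the "name" of an open-by-ID request. It is a sketch under stated assumptions: model_name stands in for a counted string whose length is in bytes, the helper names are hypothetical, and MODEL_FILE_OPEN_BY_FILE_ID is defined locally on the assumption that it matches the DDK value of FILE_OPEN_BY_FILE_ID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match FILE_OPEN_BY_FILE_ID in the DDK headers. */
#define MODEL_FILE_OPEN_BY_FILE_ID 0x00002000u

/* Models a counted string: length is in bytes, not characters. */
typedef struct {
    uint16_t      length;
    unsigned char buffer[16];
} model_name;

/* Caller side: place the raw 64-bit ID in the name buffer. */
void make_file_id_name(uint64_t file_id, model_name *name)
{
    assert(name != NULL);
    memcpy(name->buffer, &file_id, sizeof file_id);
    name->length = (uint16_t)sizeof file_id;
}

/* File system side: accept only an open-by-ID request whose name is
   exactly 64 bits long. Returns 1 and stores the ID on success. */
int parse_file_id_name(uint32_t create_options, const model_name *name,
                       uint64_t *file_id)
{
    assert(name != NULL && file_id != NULL);
    if (!(create_options & MODEL_FILE_OPEN_BY_FILE_ID))
        return 0;                       /* ordinary open by name */
    if (name->length != sizeof *file_id)
        return 0;                       /* e.g. a 16-byte object ID */
    memcpy(file_id, name->buffer, sizeof *file_id);
    return 1;
}
```

A 16-byte name, as would be used for an object ID, fails the length test in this model, which mirrors the behavior of the CDFS fragment.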
The most substantial impact of the use of file IDs for file system filter drivers is the inability to extract a name for the file. For CDFS, a filter can query the file system for the name of the file (using IRP_MJ_QUERY_INFORMATION or IoQueryInformationFile and querying the FileNameInformation attribute of the file) and the file system will always return a name. However, for NTFS there are certain cases when it cannot return a path to the file: when the caller that opened the file does not have traverse privileges, NTFS has no mechanism for determining whether it is allowed to return the path. Further, even in those cases where it does return a path to the file, it is important to note that there may be multiple paths to the file (via hard links) and that the path name returned is only one of the possible names.
Q25 What is the difference between cached I/O, user non-cached I/O, and paging I/O?
In a file system or file system filter driver, read and write operations fall into several different categories. For the purpose of discussing them, we normally consider the following types:
- Cached I/O. This includes normal user I/O, both via the Fast I/O path as well as via the IRP_MJ_READ and IRP_MJ_WRITE path. It also includes the MDL operations (where the caller requests the FSD return an MDL pointing to the data in the cache).
- Non-cached user I/O. This includes all non-cached I/O operations that originate outside the virtual memory system.
- Paging I/O. These are I/O operations initiated by the virtual memory system in order to satisfy the needs of the demand paging system.
Cached I/O is any I/O that can be satisfied by the file system data cache. In such a case, the operation is normally to copy the data from the virtual cache buffer into the user buffer. If the virtual cache buffer contents are resident in memory, the copy is fast and the results returned to the application quickly. If the virtual cache buffer contents are not all resident in memory, then the copy process will trigger a page fault, which generates a second re-entrant I/O operation via the paging mechanism.
Non-cached user I/O is I/O that must bypass the cache - even if the data is present in the cache. For read operations, the FSD can retrieve the data directly from the storage device without making any changes to the cache. For write operations, however, an FSD must ensure that the cached data is properly invalidated (if this is even possible, which it will not be if the file is also memory mapped).
Paging I/O is I/O that must be satisfied from the storage device (whether local to the system or located on some "other" computer system) and it is being requested by the virtual memory system as part of the paging mechanism (and hence has special rules that apply to its behavior as well as its serialization).
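A filter usually tells these categories apart by inspecting the flags carried in the IRP. The following is a simplified user-mode model rather than driver code: classify_io, io_kind, and io_kind_name are hypothetical names, and the flag values are defined locally on the assumption that they mirror IRP_NOCACHE, IRP_PAGING_IO, and IRP_SYNCHRONOUS_PAGING_IO in the DDK headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins, assumed to match the DDK definitions. */
#define MODEL_IRP_NOCACHE               0x00000001u
#define MODEL_IRP_PAGING_IO             0x00000002u
#define MODEL_IRP_SYNCHRONOUS_PAGING_IO 0x00000040u

typedef enum { IO_CACHED, IO_NONCACHED_USER, IO_PAGING } io_kind;

/* Classify a read or write request by its IRP flags. */
io_kind classify_io(uint32_t irp_flags)
{
    /* Test for paging I/O first: paging requests typically carry the
       no-cache flag as well, so the order of the tests matters. */
    if (irp_flags & (MODEL_IRP_PAGING_IO | MODEL_IRP_SYNCHRONOUS_PAGING_IO))
        return IO_PAGING;
    if (irp_flags & MODEL_IRP_NOCACHE)
        return IO_NONCACHED_USER;
    return IO_CACHED;
}

/* Human-readable label, handy for trace output. */
const char *io_kind_name(io_kind kind)
{
    switch (kind) {
    case IO_CACHED:         return "cached";
    case IO_NONCACHED_USER: return "non-cached user";
    case IO_PAGING:         return "paging";
    }
    assert(!"unknown io_kind");
    return "";
}
```

Note that this model only covers the IRP path; cached I/O arriving through the Fast I/O entry points has no IRP to inspect.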
Q26 How do I map devices back to drive letters in Windows NT/2000/XP? Can I use the mount manager to do this? If so, how do I do this?
In Windows NT there is no mechanism for mapping from a device back to a drive letter. Further, because the mapping of drive letters to devices is not one-to-one and can change, any mechanism for mapping drive letters must be able to handle the various issues involved. The simplest mechanism for physical media volumes is to open each possible drive letter ("\DosDevices\A:", then "\DosDevices\B:", etc.) and determine if that drive letter yields the proper device object. For a network drive, however, the drive letter is embedded within the name, and the FSD can extract the drive letter from that name; the name must also be adjusted to account for the server/share name that is typically part of the embedded name.
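The probing approach for physical media volumes can be sketched in portable C. This is only a model of the search loop: the resolver callback is a hypothetical stand-in for "open the name and see which device object it yields", and table_resolve is a toy resolver included so the loop can be exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical resolver: returns an opaque device identity for a link
   name, or NULL if the name does not resolve. */
typedef const void *(*resolve_fn)(const char *link_name, void *ctx);

/* Collect every drive letter whose link resolves to target_device.
   Writes the letters to out (NUL-terminated) and returns the count. */
size_t find_drive_letters(const void *target_device, resolve_fn resolve,
                          void *ctx, char out[27])
{
    size_t count = 0;
    char name[32];

    assert(target_device != NULL && resolve != NULL && out != NULL);
    for (char letter = 'A'; letter <= 'Z'; letter++) {
        snprintf(name, sizeof name, "\\DosDevices\\%c:", letter);
        if (resolve(name, ctx) == target_device)
            out[count++] = letter;   /* the mapping need not be one-to-one */
    }
    out[count] = '\0';
    return count;
}

/* Toy resolver: ctx is a 26-entry table indexed by drive letter. */
const void *table_resolve(const char *link_name, void *ctx)
{
    const void **table = ctx;
    size_t len = strlen(link_name);

    assert(len >= 2);
    return table[link_name[len - 2] - 'A'];   /* letter precedes the ':' */
}
```

Because the loop keeps going after the first match, it reports every letter that currently maps to the device; the result is only a snapshot, since the assignments can change.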
In Windows 2000 and Windows XP, volume mount points (the generalization of drive letters on Windows 2000 and Windows XP) are managed by the mount manager, which is a kernel mode driver that is responsible for managing volume mount points. The IOCTL operations to directly access the mount manager are included in the DDK (in the mountmgr.h header file) but the meanings of these volume mount point operations are not directly documented in the DDK.
The volume mount point API is documented in the Platform SDK in great detail, including various calls for creating, deleting, and locating volume mount points. This API in turn relies upon the mount manager IOCTL operations to retrieve the current list of mount points within the system. Again, we note that this includes drive letters, but also accommodates other volume mount points (and we note that drive letters are not required as mount points!).
Q27 An application opened the file using the short name. How do I retrieve the long name?
The one mechanism that works on all versions of Windows NT, 2000, and XP is to look into the directory that contains the entry. In Windows XP, this can be done using NtQueryDirectoryFile, which is part of the Windows XP IFS Kit. However, in earlier versions this call is not exposed by the IFS Kit, and thus a filter driver must implement equivalent functionality itself by sending IRP_MJ_DIRECTORY_CONTROL requests with the IRP_MN_QUERY_DIRECTORY minor function.
Using NtQueryDirectoryFile, the filter driver can specify a search of the directory using the short file name. The FSD will then return one or more entries that have a matching short file name, along with the corresponding long file name. For the standard file systems, this would only return a single matching entry. From this, both the short name and long name of the file can be retrieved.
Q28 I see I/O requests with the IRP_MN_MDL minor function code. What does this mean? How should I handle it in my file system or filter driver?
Kernel mode callers of the read and write interface (IRP_MJ_READ and IRP_MJ_WRITE) can utilize an interface that allows retrieval of a pointer to the data as it is located in the file system data cache. This allows the kernel mode caller to retrieve the data for the file without an additional data copy.
For example, the AFD file system driver has an API function it exports that takes a socket handle and a file handle. The file contents are "copied" directly to the corresponding communications socket. The AFD driver accomplishes this task by sending an IRP_MJ_READ with the IRP_MN_MDL minor operation. The FSD then retrieves an MDL describing the cached data (at Irp->MdlAddress) and completes the request. When AFD has completed processing the operation it must return the MDL to the FSD by sending an IRP_MJ_READ with the IRP_MN_MDL_COMPLETE minor operation specified.
For a file system filter driver, the write operation may be a bit more confusing. When a caller specifies IRP_MJ_WRITE/IRP_MN_MDL the returned MDL may point to uninitialized data regions within the cache. That is because the cache manager will refrain from reading the current data in from disk (unless necessary) in anticipation of the caller replacing the data. When the caller has updated the data, it releases the buffer by sending IRP_MJ_WRITE/IRP_MN_MDL_COMPLETE. At that point the data has been written back to the cache.
An FSD that is integrated with the cache manager can implement these minor functions by calling CcMdlRead and CcPrepareMdlWrite. The corresponding functions for completing these are CcMdlReadComplete and CcMdlWriteComplete. An FSD that is not integrated with the cache manager can either indicate these operations are not supported (in which case the caller must send a buffer and support the "standard" read/write mechanism) or it can implement them in some other manner that is appropriate.
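The pairing of minor function codes with cache manager routines can be summarized in a small decision function. This is a user-mode model, not an FSD dispatch routine: select_mdl_action and the enum are hypothetical, and the minor function values are defined locally on the assumption that they match IRP_MN_NORMAL, IRP_MN_MDL, IRP_MN_COMPLETE, and IRP_MN_MDL_COMPLETE in the DDK headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins, assumed to match the DDK values. */
#define MODEL_IRP_MN_NORMAL       0x00
#define MODEL_IRP_MN_MDL          0x02
#define MODEL_IRP_MN_COMPLETE     0x04
#define MODEL_IRP_MN_MDL_COMPLETE (MODEL_IRP_MN_MDL | MODEL_IRP_MN_COMPLETE)

typedef enum {
    ACTION_STANDARD_IO,     /* ordinary buffered read/write path */
    ACTION_MDL_READ,        /* FSD would call CcMdlRead */
    ACTION_PREPARE_WRITE,   /* FSD would call CcPrepareMdlWrite */
    ACTION_READ_COMPLETE,   /* FSD would call CcMdlReadComplete */
    ACTION_WRITE_COMPLETE   /* FSD would call CcMdlWriteComplete */
} mdl_action;

/* Choose the handling for a read (is_write == 0) or write (is_write == 1)
   request based on its minor function code. */
mdl_action select_mdl_action(int is_write, uint8_t minor)
{
    assert(is_write == 0 || is_write == 1);

    /* The "complete" form returns an MDL obtained by an earlier request. */
    if ((minor & MODEL_IRP_MN_MDL_COMPLETE) == MODEL_IRP_MN_MDL_COMPLETE)
        return is_write ? ACTION_WRITE_COMPLETE : ACTION_READ_COMPLETE;

    /* The plain MDL form asks for an MDL describing the cached data. */
    if (minor & MODEL_IRP_MN_MDL)
        return is_write ? ACTION_PREPARE_WRITE : ACTION_MDL_READ;

    return ACTION_STANDARD_IO;
}
```

A filter that modifies file data needs to account for all four MDL actions, since the data never passes through a caller-supplied buffer on these paths.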
Q29 Who is responsible for maintaining the 'file pointer' (CurrentByteOffset field)? When an application does an append to the end of the file, how is this presented to the file system?
The "file pointer" is a field (the CurrentByteOffset field) within the file object that is maintained by the FSD and used by the I/O Manager. Each time a synchronous I/O operation is performed, the file system updates this field to reflect the current byte offset. For example, here is code from CDFS that accomplishes this task:
//
// Update the current file position in the user file object.
//
if (SynchronousIo && !PagingIo && NT_SUCCESS( Status )) {
    IrpSp->FileObject->CurrentByteOffset.QuadPart = ByteRange;
}
Note that this field is not updated for asynchronous I/O or for paging I/O. The SynchronousIo value used here was computed based upon the bits of the file object:
SynchronousIo = FlagOn( IrpSp->FileObject->Flags, FO_SYNCHRONOUS_IO );
The I/O Manager then uses this CurrentByteOffset field when specifying the offset to use for a read or write operation.
In addition, an application can modify the CurrentByteOffset field by making an appropriate IRP_MJ_SET_INFORMATION call (setting the FilePositionInformation for a given open file instance). This is also handled by the FSD.
An application can append to the end of a file by utilizing the manifest constant FILE_WRITE_TO_END_OF_FILE. This 64 bit numeric value indicates that the data is being appended to the file. An FSD should perform the I/O operation relative to the current end of the file. Thus, two applications can interleave I/O and it will always be appended to the end of the file.
Q30 I never see any calls to several fast I/O operations. Does this mean I don't need to filter them? What happens if I do need to filter them?
There are a number of fast I/O operations that are currently not used by the OS. For example, the compressed MDL operations have been reserved for future use. However, the risk a file system filter driver runs in not implementing these is that a subsequent release or update (even a service pack) might begin using these APIs. It is also possible that a third party product might utilize these APIs as well. Thus, by not filtering them, it is possible the file system filter driver will not handle these operations and as a result its function may be compromised.
If a file system filter driver does not implement a particular fast I/O routine the default behavior for the OS is to call the underlying file system directly. This ensures that even if an old file system filter driver is used with a newer system, the behavior of the file system will be correct, although the file system filter driver may not function properly.
There are six fast I/O routines that are not invoked in a file system filter driver, even if they are implemented by the file system filter driver. They are:
· AcquireFileForNtCreateSection
· ReleaseFileForNtCreateSection
· AcquireForModWrite
· ReleaseForModWrite
· AcquireForCcFlush
· ReleaseForCcFlush
Each of these six fast I/O operations deals with serialization between the virtual memory system and the file system. As such, most file system filter drivers do not need to filter these operations in any case. In Windows XP, a file system filter driver can register to filter these six fast I/O operations by utilizing the FsRtlRegisterFileSystemFilterCallbacks API (note that this API is not available prior to Windows XP). In this case, the filter registers a set of twelve functions that will be called before, and after, the FSD's fast I/O entry point is called.
Q31 Issues calling an FSD from a completion routine
When calling back into a file system from a completion routine it is possible that per-thread state information used by the file systems can cause problems for the new operation.
Frequently, a file system filter driver attempts to perform operations in its completion routine that may involve calling down to the underlying file system. However, this can cause problems because the underlying file system utilizes thread-local storage for its own operations. That thread local storage may indicate incorrect state to the underlying file system.
Logically, this is because a file system may perform a series of operations when completing an I/O operation, such as the following example:
IoCompleteRequest(Irp, IO_NO_INCREMENT);
IoSetTopLevelIrp(NULL);
return STATUS_SUCCESS;
The "problem" here is that the call to IoCompleteRequest will, in turn, generate a call to your filter driver's completion routine. Of course, if your completion routine then calls the file system again, it will be using the same "top level IRP" information. The FSD may then (in turn) use this information, believing it is related to the operation you are performing. Unfortunately, this can lead to erroneous behavior of the underlying file system.
Further, while the Windows file systems have systematically been changed to free their state before calling IoCompleteRequest (precisely because of this problem) this issue arises when you implement "layered" or stacked file system functionality. In such a case, your driver might receive a request from one file system and then redirect it to another file system. To implement this functionality, your driver must save the upper file system's state before calling into the lower file system. This would be done using the following mechanism:
PVOID oldTopLevelIrp;
oldTopLevelIrp = IoGetTopLevelIrp();
IoSetTopLevelIrp(NULL);
status = CallLowerFsd(Device, NewIrp);
/* omit status handling */
IoSetTopLevelIrp(oldTopLevelIrp);
This ensures that the state of the "upper file system" does not interfere with the state of the "lower file system".
Q32 Obtaining a drive letter assignment
A file system that is not associated with a physical media volume may wish to have a drive letter assignment for its volume so that Win32 applications are able to access it.
The correct way to achieve this in Windows is to utilize the services of the mount manager. There is a documented, exported API for achieving this - SetVolumeMountPoint.
Of course, this mechanism is implemented in terms of I/O control operations sent to the kernel level mount manager (\Device\MountPointManager, defined as MOUNTMGR_DEVICE_NAME in mountmgr.h), and the IOCTL values are exported in the DDK header file mountmgr.h. The relevant IOCTL for creating a new mount point is IOCTL_MOUNTMGR_CREATE_POINT, which is documented in the DDK.
The mount manager is not responsible for creating NTFS mount points (which are implemented using reparse points), although the SDK routine will use NTFS mount points as necessary. NTFS mount points can be created from kernel level software by using the FSCTL_SET_REPARSE_POINT operation. For this call, you must describe the mount point information using the REPARSE_DATA_BUFFER structure. This will then store the relevant reparse point with the corresponding file or directory on the NTFS volume.
When creating mount points within a file system, keep in mind the restrictions:
· Of the Windows file systems, only NTFS supports reparse points;
· NTFS requires that a directory be empty if you are going to create a mount point for the directory;
· A volume mounted using a reparse point need not have any drive letters.
Of course, the simplest solution is to create your mount points within the user mode environment with the SDK API, rather than within the kernel mode environment.
For those file systems that must function in a pre-Windows 2000 environment (prior to the existence of the mount manager API) either the driver, or an application program, must create the drive letter. A driver does so by creating a symbolic link using IoCreateSymbolicLink and a Win32 application does this using DefineDosDevice. Similarly, the drive letter may be removed by using IoDeleteSymbolicLink (for the driver environment) or DefineDosDevice.
Q33 Handling FILE_COMPLETE_IF_OPLOCKED in a filter driver.
A filter driver may be called by an application that has specified the FILE_COMPLETE_IF_OPLOCKED option. If the filter in turn calls ZwCreateFile, it may cause the thread to deadlock.
A common problem for file systems is that of reentrancy. The Windows operating system supports numerous reentrant operations. For example, an asynchronous procedure call (APC) can be invoked within a given thread context as needed. However, suppose an APC is delivered to a thread while it is processing a file system request. Imagine that this APC in turn issues another call into the file system. Recall that file systems utilize resource locks internally to ensure correct operation. The file systems must also ensure the correct order of lock acquisition in order to eliminate the possibility of deadlocks arising. However it is not possible for the file system to define a locking order in the face of arbitrary reentrancy!
To resolve this problem, the file systems disable certain types of reentrancy that would not be safe. They do this by calling FsRtlEnterFileSystem when they enter a region of code that is not reentrant. When they leave that region of code, they call FsRtlExitFileSystem to enable reentrancy.
This is important to an understanding of this problem because the CIFS file server uses oplocks as part of its cache consistency mechanism between remote clients and local clients. This is done using a "callback" mechanism, which is implemented using APCs.
Normally, the FSD will block waiting for the completion of the APC that breaks an oplock. Under certain circumstances, however, the CIFS server thread that issued the operation requiring an oplock break is also the thread that must process the APC. Since the file system has blocked APC delivery, and now the thread is blocked awaiting completion of the APC, this approach leads to deadlock. Because of this, the Windows file system developers introduced an additional option that advises the file system that if an oplock break is required to process the IRP_MJ_CREATE operation, it should not block, but instead should return the special status code STATUS_OPLOCK_BREAK_IN_PROGRESS. This return value tells the caller that the file is not completely opened. Instead, a subsequent call to the file system, using the FSCTL_OPLOCK_BREAK_NOTIFY operation, must be made to ensure that the oplock break has been completed.
Of course, this works because by returning this status code the APC can be delivered, once the thread exits the file system driver.
Note that FSCTL_OPLOCK_BREAK_NOTIFY, and the other calls for the oplock protocol, are documented in the Windows Platform SDK.
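One detail that is easy to miss on the caller's side is that STATUS_OPLOCK_BREAK_IN_PROGRESS is a success code, so a plain success test does not distinguish it from a fully opened file. The following user-mode model illustrates the check. The names are hypothetical, and the status values and the success macro are defined locally on the assumption that they match ntstatus.h and NT_SUCCESS.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t model_status;

/* Local stand-ins, assumed to match ntstatus.h and NT_SUCCESS. */
#define MODEL_STATUS_SUCCESS                  ((model_status)0x00000000)
#define MODEL_STATUS_OPLOCK_BREAK_IN_PROGRESS ((model_status)0x00000108)
#define MODEL_NT_SUCCESS(s)                   ((s) >= 0)

typedef enum {
    CREATE_FAILED,             /* no file was opened */
    CREATE_READY,              /* file is open and usable */
    CREATE_AWAIT_OPLOCK_BREAK  /* open is incomplete until the break finishes */
} create_outcome;

/* Decide what a caller that passed FILE_COMPLETE_IF_OPLOCKED should do
   with the status returned from the create. */
create_outcome interpret_create_status(model_status status)
{
    /* The oplock status is itself a success code, so it must be tested
       explicitly rather than through the success check alone. */
    assert(MODEL_NT_SUCCESS(MODEL_STATUS_OPLOCK_BREAK_IN_PROGRESS));

    if (!MODEL_NT_SUCCESS(status))
        return CREATE_FAILED;
    if (status == MODEL_STATUS_OPLOCK_BREAK_IN_PROGRESS)
        return CREATE_AWAIT_OPLOCK_BREAK;
    return CREATE_READY;
}
```

In the CREATE_AWAIT_OPLOCK_BREAK case the caller would issue FSCTL_OPLOCK_BREAK_NOTIFY on the returned handle and wait for it to complete before using the file.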
Q34 Opening files during IRP_MJ_CREATE processing
While processing an IRP_MJ_CREATE a filter may need to open the file with different attributes, rights, etc. This is often done by using a second call to ZwCreateFile. This will then generate a call back into the filter driver. Thus, a common filter issue is being able to detect this reentrancy.
There are several ways of dealing with reentrancy during an IRP_MJ_CREATE operation, and the appropriate solution for your particular driver will depend upon the circumstances. In addition, there are a number of techniques that might work for a single file system filter driver, but that fail when used in a multi-filter environment.
For Windows XP and newer versions of Windows, the best mechanism for opening a file within the filter is to use IoCreateFileSpecifyDeviceObjectHint. A filter driver can call this function and specify a given device object. The IRP_MJ_CREATE that is built will be passed to the specified device object. This technique avoids reentrancy issues and is the best mechanism available for a filter to open a file.
For versions of Windows prior to Windows XP, this mechanism is not available. The best mechanism in this environment is to implement your own functional equivalent of IoCreateFileSpecifyDeviceObjectHint. This can be done by creating a second device object for each device you are filtering.
For example, suppose you decide to filter some given file system device object, FSDVolumeDeviceObject. You then create a device object MyFilterDeviceObject and attach it using IoAttachDeviceToDeviceStack (of course, in Windows XP you would use IoAttachDeviceToDeviceStackSafe instead). In addition, you create a second device object MyFilterShadowDeviceObject. This device object must be assigned a name ("\Device\MyFilterDevice27", for example). The name can be anything, but it must obviously be unique. In your device extension for your two device objects, you need to track this name, and you need to maintain pointers to the respective device objects (that is, the device extension for MyFilterShadowDeviceObject should point to MyFilterDeviceObject and the device extension for MyFilterDeviceObject should point to MyFilterShadowDeviceObject). Don't forget to set the StackSize field of the device object correctly!
Now, an IRP_MJ_CREATE request arrives in your filter, specifying MyFilterDeviceObject. To open the file without experiencing reentrancy problems, you call IoCreateFile (or ZwCreateFile). Since you must pass the name of the file being opened, you construct that by using both the name you gave MyFilterShadowDeviceObject and the name that is in the FileObject of the I/O stack location (IoGetCurrentIrpStackLocation(Irp)->FileObject).
Since you are passing a name in that points to your second device object, the I/O Manager will build the IRP_MJ_CREATE and pass the resulting I/O request packet to your driver, but specifying MyFilterShadowDeviceObject.
In your IRP_MJ_CREATE dispatch handler you must detect that this is a "shadow" device object, rather than a typical filter device object. In this case, you should send the IRP_MJ_CREATE operation down to the device being filtered by MyFilterDeviceObject. Indeed, since you should not need to do any further processing, you can use IoSkipCurrentIrpStackLocation (rather than IoCopyCurrentIrpStackLocationToNext).
The original filter (where the IoCreateFile call was made) will receive back a file handle that can then be used for subsequent operations (using the Zw API routines).
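The name construction and the "shadow" check described above can be modeled in portable C. This is a user-mode sketch rather than driver code: the filter_ext structure, the function names, and the use of narrow C strings in place of UNICODE_STRING are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the device extension described above. */
typedef struct filter_ext {
    int is_shadow;              /* nonzero for MyFilterShadowDeviceObject */
    struct filter_ext *peer;    /* shadow -> filter, filter -> shadow */
    const char *shadow_name;    /* e.g. "\\Device\\MyFilterDevice27" */
} filter_ext;

/* Build "<shadow device name><file name>" into out.
 * Returns 0 on success, -1 if the buffer is too small. */
int build_shadow_path(const filter_ext *ext, const char *file_name,
                      char *out, size_t out_len)
{
    size_t dev_len, file_len;

    assert(ext != NULL && file_name != NULL && out != NULL);
    dev_len = strlen(ext->shadow_name);
    file_len = strlen(file_name);
    if (dev_len + file_len + 1 > out_len)
        return -1;
    memcpy(out, ext->shadow_name, dev_len);
    memcpy(out + dev_len, file_name, file_len + 1); /* includes the NUL */
    return 0;
}

/* What the create dispatch handler should do with a request. */
typedef enum { CREATE_FILTER_NORMALLY, CREATE_PASS_THROUGH } create_action;

/* A create that arrives on the shadow device is passed straight down;
 * one that arrives on the ordinary filter device is filtered as usual. */
create_action classify_create(const filter_ext *ext)
{
    assert(ext != NULL);
    return ext->is_shadow ? CREATE_PASS_THROUGH : CREATE_FILTER_NORMALLY;
}
```

In a real filter the pass-through case corresponds to the IoSkipCurrentIrpStackLocation path, with the request sent to the device that the peer filter device object is attached to.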
Typically, filter drivers that attempt to use IoCreateFile or ZwCreateFile with the same file/device name as the original request experience reentrancy into their driver. A number of techniques for dealing with this have been tried in the past, but they exhibit various problems when combined with other filters. These include:
· Appending a "special string" to the end of the file name. This will not work when two filters using this technique are loaded onto the same system (since each one appends its "special string" onto the previous filter's "special string").
· Tracking thread identifiers to detect reentrancy. This technique fails when combined with a filter that utilizes a separate service for opening the file; filters sometimes must switch to a different thread context in order to eliminate stack overflow conditions.
· Building create IRPs within a filter. This technique does work properly, but is completely unsupported and quite difficult to implement correctly. Because of this, it is a fragile solution that should not be used given the availability of alternative mechanisms.
· Re-using the file object from the current IRP_MJ_CREATE. In this sequence, the filter allows the create operation to complete and then uses the file object subsequently. When done, the filter then sends a cleanup and close operation to the underlying file system. It then sends the original IRP_MJ_CREATE operation to the underlying FSD. There are several potential issues with this approach. First, in this technique the filter does not have a file handle for the file object and thus cannot use the Zw API calls. Second, the file object must be restored to its original state - otherwise, there are fields within it that are not set up properly. Third, because the file object has not yet been "properly referenced" the filter may find that the OS deletes the object because its reference count drops to zero during its use. Used carefully, this technique has been successful in the past.
Regardless of the technique used, the filter driver must be cognizant of the issues involving oplocks (and the FILE_COMPLETE_IF_OPLOCKED option). If an oplock break occurs during an IRP_MJ_CREATE, whether from an original application, or from a filter driver attempting to open the file, the filter must be able to handle it gracefully.
Finally, the filter must be aware of the fact that even though it is calling from kernel mode, the underlying FSD will continue to enforce sharing access against the file.
Q35 How do I retrieve the "user name" for the user performing a given operation?
User names, per se, are not a concept of the core OS. Rather, users are tracked internally as "security identifiers" or SIDs. It is possible to extract the SID of the current thread. If a "user name" is needed, a user mode service can be used to convert from the SID to the corresponding text user name. This is done using the Win32 function LookupAccountSid, which is documented in the Platform SDK.
The SID of the calling thread can be extracted from its token. This is done by first attempting to open the thread token (ZwOpenThreadTokenEx or NtOpenThreadToken or NtOpenThreadTokenEx). If this fails because the thread has no token, the filter should open the process token (ZwOpenProcessTokenEx or NtOpenProcessToken or NtOpenProcessTokenEx). In either case, the filter will have a handle for a token.
The SID can be retrieved from the given token using NtQueryInformationToken or ZwQueryInformationToken. The filter should specify TokenUser as the TOKEN_INFORMATION_CLASS value. The call will return a buffer that contains the TOKEN_USER structure. This structure contains the SID of the token.
Note, however, that obtaining the SID of the current caller is often not precisely what a filter is trying to accomplish. Instead, often the filter wishes to know the SID of the requesting thread. For local calls, this will typically be the same. For remote calls, however, the CIFS server routinely utilizes impersonation during IRP_MJ_CREATE and for some IRP_MJ_SET_INFORMATION operations. Otherwise, the CIFS server uses the local system's credentials. To handle this case, a filter must store away the credential information of the original caller. In the case of IRP_MJ_CREATE the original caller's token is specified as part of the IO_SECURITY_CONTEXT parameter. The IO_SECURITY_CONTEXT refers to an ACCESS_STATE structure, which in turn contains the SECURITY_SUBJECT_CONTEXT, and the filter can retrieve a pointer to the token using SeQuerySubjectContextToken. The SID can then be retrieved from the token using SeQueryInformationToken.
Q36 How do I detect reentrancy back into my filter driver?
In addition to the standard problem of detecting re-entrancy during an IRP_MJ_CREATE operation (discussed in Section 1.34), some filter drivers must also be able to detect re-entrancy for other I/O operations, such as IRP_MJ_READ or IRP_MJ_WRITE. The technique described in the earlier section can also be applied for other I/O operations, but may not be necessary.
Certain operations tend to experience considerable re-entrancy. These include:
· I/O operations related to paging. This is because a normal "user read" operation may trigger subsequent paging I/O operations, which are satisfied in the same thread context.
· Cache Manager and Memory Manager functions. These can cause the reference count on a file object to drop to zero which, in turn, will cause an IRP_MJ_CLOSE to be sent to the file system.
· Closing related file objects. These can cause Cache and Memory Manager releases of memory that in turn trigger IRP_MJ_CLOSE operations to be sent to the file system.
The typical solution to detecting this type of reentrancy is to maintain some "per thread" information. For Windows XP, this can be the filter context information. For Windows 2000 (and earlier) the filter driver should maintain its own lookaside table for tracking that state.
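As a rough illustration of "per thread" tracking, the following portable C sketch keeps a small table of nesting depths keyed by thread identifier. The table layout and the names filter_enter and filter_leave are hypothetical. The sketch also omits locking, which a real driver would need (a spin lock around the table, for example), and takes the thread identifier as a parameter instead of querying the current thread.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED_THREADS 16

/* Hypothetical per-thread record: which thread, and how deeply it is
 * currently nested inside the filter. A depth of 0 marks a free slot. */
typedef struct {
    unsigned long thread_id;
    int depth;
} thread_slot;

static thread_slot g_slots[MAX_TRACKED_THREADS];

/* Find the active slot for thread_id; optionally claim a free one. */
static thread_slot *find_slot(unsigned long thread_id, int create)
{
    thread_slot *free_slot = NULL;
    size_t i;

    for (i = 0; i < MAX_TRACKED_THREADS; i++) {
        if (g_slots[i].depth > 0 && g_slots[i].thread_id == thread_id)
            return &g_slots[i];
        if (g_slots[i].depth == 0 && free_slot == NULL)
            free_slot = &g_slots[i];
    }
    if (create && free_slot != NULL) {
        free_slot->thread_id = thread_id;
        return free_slot;
    }
    return NULL;
}

/* Call on entry to a dispatch routine. Returns the nesting depth seen
 * before this entry: 0 means first entry, >0 means the thread has
 * re-entered the filter. Returns -1 if the table is full. */
int filter_enter(unsigned long thread_id)
{
    thread_slot *slot = find_slot(thread_id, 1);

    if (slot == NULL)
        return -1;
    return slot->depth++;
}

/* Call on every exit path that followed a successful filter_enter. */
void filter_leave(unsigned long thread_id)
{
    thread_slot *slot = find_slot(thread_id, 0);

    assert(slot != NULL && slot->depth > 0);
    slot->depth--;
}
```

A dispatch routine would call filter_enter first and, on a nonzero result, pass the request through without its usual processing.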
Q37 How do I call a user-mode function from my kernel driver?
In general, user mode functions are not available from the kernel mode environment. This is because routines developed to work in the user mode part of the OS rely upon the specifics of that environment - for example, the ability to call Win32 functions. Of course, since those functions are not available in the kernel environment, routines that rely upon them are not, in turn, available in the kernel environment.
However, there are techniques that allow a function, available in user mode, to be exploited by most kernel mode components. The best of these (in that it is straight-forward to use and fits within the Windows architecture) is to couple a user mode service with the kernel mode driver.
The service is typically written to use the Win32 API. This service then creates one or more threads for handling kernel-level requests. Each of these threads then opens the kernel mode driver (via a device object created by the kernel mode driver) and issues a custom IOCTL operation using the Win32 function DeviceIoControl.
The DeviceIoControl call is presented to the kernel mode driver as an IRP_MJ_DEVICE_CONTROL call, and the custom IOCTL is the control code of that call. The kernel mode driver then:
· Marks the IRP as pending (using IoMarkIrpPending)
· Places the IRP on a driver-owned queue specifically dedicated to tracking these requests
· Returns STATUS_PENDING to the caller.
The I/O operation has not been completed at this point and the service's thread blocks waiting for the I/O operation to be processed.
At some time in the future, a kernel mode operation requires invoking the user mode function. Rather than invoking the operation directly, the kernel mode thread removes an IRP from the driver-owned queue. If there are no IRPs on the queue, the kernel mode thread can either block and wait for one to become available or fail the operation - or even combine the two, so that it blocks and waits for some finite period of time and after that point it fails the operation.
If there was an IRP on the queue, the kernel mode thread takes the IRP (that was removed from the queue) and creates the "output" information for this request by describing the desired operation. Typically, the developer will specify some command structure format, so that the format contains an "op code" to indicate the operation being performed, as well as a "request ID" to indicate which kernel level request has constructed this command structure (this is needed when the answer is being provided). Of course, any additional parameter information is also provided at this point. The kernel mode driver then performs normal completion processing on this request by setting the IoStatus.Information and IoStatus.Status fields of the IRP, indicating a successful completion. Then the IRP is completed using IoCompleteRequest.
Since the service thread was blocked awaiting completion of this request, it is scheduled and runs at some point following the call to IoCompleteRequest. The service thread then examines the command described in the "output" data returned from the driver. That data indicates the specific operation to be performed - in other words, the user mode function that is to be invoked. The parameter data should also be present in the output data returned by the driver.
Once the function call returns the necessary information, the thread constructs a response block. While the specific contents of the response block are dependent on the information being returned, it will normally indicate the "request ID" from the original command block. With the response block constructed, the service thread calls DeviceIoControl again, indicating that it is sending a response message.
The driver receives the response message as a new IRP_MJ_DEVICE_CONTROL operation. Examining the message, the driver uses the "request ID" to match up this response information with the original request. Since the original requesting thread is probably blocking, awaiting this response, once the driver has stored the response data in a suitable location, it should signal the other thread (typically, this is done using a notification event).
The original thread is presumably blocked awaiting this response (e.g., the original thread has called KeWaitForSingleObject using a notification event). Once it unblocks, the necessary data is available and it can continue processing.
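The "request ID" bookkeeping at the heart of this exchange can be sketched in portable C. This is a simplified single-threaded model, not driver code: the command_block and response_block layouts, the function names, and the done flag standing in for the notification event are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical command block returned to the service in the IRP buffer. */
typedef struct { unsigned op_code; unsigned request_id; int parameter; } command_block;

/* Hypothetical response block sent back by the service. */
typedef struct { unsigned request_id; int result; } response_block;

/* One kernel-side request that is waiting for its answer. */
typedef struct {
    int in_use;
    unsigned request_id;
    int done;      /* stands in for the notification event */
    int result;
} pending_request;

static pending_request g_pending[MAX_PENDING];
static unsigned g_next_id = 1;

/* Kernel side: fill in the command and register a waiter for the reply.
 * Returns NULL if too many requests are already outstanding. */
pending_request *post_request(unsigned op_code, int parameter, command_block *cmd)
{
    size_t i;

    assert(cmd != NULL);
    for (i = 0; i < MAX_PENDING; i++) {
        pending_request *req = &g_pending[i];
        if (!req->in_use) {
            req->in_use = 1;
            req->request_id = g_next_id++;
            req->done = 0;
            req->result = 0;
            cmd->op_code = op_code;
            cmd->request_id = req->request_id;
            cmd->parameter = parameter;
            return req;
        }
    }
    return NULL;
}

/* Kernel side, response IOCTL: match the reply to its waiter by ID.
 * Returns 0 on a match, -1 if no waiter has that request ID. */
int deliver_response(const response_block *rsp)
{
    size_t i;

    assert(rsp != NULL);
    for (i = 0; i < MAX_PENDING; i++) {
        pending_request *req = &g_pending[i];
        if (req->in_use && !req->done && req->request_id == rsp->request_id) {
            req->result = rsp->result;
            req->done = 1;   /* a driver would signal the event here */
            return 0;
        }
    }
    return -1;
}

/* Original thread, after its wait is satisfied: take the answer. */
int take_result(pending_request *req)
{
    assert(req != NULL && req->in_use && req->done);
    req->in_use = 0;
    return req->result;
}
```

Because responses are matched by ID rather than by arrival order, the service may answer outstanding requests in any order.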
A warning is in order with respect to using this technique, because it is possible to create deadlocks within the system. If the "original thread" owns any resources (e.g., locks) at user level that must also be acquired by the service thread, this technique will deadlock. This becomes an issue because some Win32 operations serialize their activities. Typically, this serialization is internal to the Win32 subsystem and not under the control of the driver or service developer. This becomes an issue, for instance, if the first thread calls some Win32 service (typically this is done by calling a Win32 API). This service then acquires a mutex and calls into the OS to perform some operation. This ultimately resolves into a call to the driver that is blocked. If the driver's service then invokes a Win32 function that attempts to acquire the same mutex the threads will be deadlocked. Thus, this technique must be used cautiously in order to avoid these potential deadlock situations. Since the internal serialization model of the Win32 subsystem is not documented and is subject to change over time, developers must be vigilant for this type of deadlock.
Q38 What is structured exception handling? How should I use it? Why do I get STOP code 0x1E (KMODE_EXCEPTION_NOT_HANDLED)? How do I deal with this?
Typically a function indicates any errors using a return value. However, an alternative mechanism for dealing with errors is provided by the Windows operating system and is known as an exception. A driver can raise an exception by calling ExRaiseStatus, which is described in the Windows DDK documentation. The advantage of the exception mechanism is that it only invokes registered exception handlers to deal with the exception. In doing this, it may be necessary to unwind the calling function stack so that control can be returned to a much earlier point.
Structured exception handling is the mechanism used by a driver to register an exception handler and - when that exception occurs - to verify that the exception is handled by the given exception handler, and if it is handled, to transfer control to that exception handler. Thus, an exception handler consists of three distinct pieces:
· A protected block of code. This is the region of the program where, if the exception occurs, the exception handler may be invoked.
· An exception filter. This is the expression that is evaluated to determine if the exception handler will handle this exception.
· An exception handler. This is the code to which control is given if the exception filter determines that the exception handler should be invoked.
In the Windows kernel environment, structured exception handling is implemented within the operating system itself. Typically, driver programmers take advantage of structured exception handling by using extensions present in the C compiler (which is included with the Windows XP DDK, or is obtained separately for earlier versions of the DDK). These extensions use the keywords __try, __except, __finally, and __leave. For C language programs the inclusion environment defines the pseudo-keywords try, except, finally, and leave in terms of these C language extensions. For C++ language programs, however, try invokes the C++ exception handling mechanism, which is not the same as the Windows kernel level structured exception handling mechanism.
To protect a block of code, you use the __try keyword. The __except keyword follows the protected block and also has an associated expression. That expression evaluates to one of three possible values:
· A negative value, in which case the operation that caused the exception is restarted and execution resumes at that point (a "continuation" of the operation). The exception handler is not invoked.
· A positive value, in which case the operation that caused the exception is handled and execution resumes at the beginning of the exception handler.
· A zero value, in which case the operation that caused the exception is not handled by this exception handler. Other registered exception handlers will then be sought to handle this specific exception. The exception handler is not invoked.
It is important to note that the expression can invoke functions as part of its processing. Also, within the context of that expression there are two special functions that can be invoked: GetExceptionInformation and GetExceptionCode. The function GetExceptionInformation provides access to the EXCEPTION_POINTERS structure, which in turn refers to the EXCEPTION_RECORD and CONTEXT structures specific to the given exception. The function GetExceptionCode returns information about the actual status that was raised (this information can also be extracted from the EXCEPTION_RECORD structure). Note that GetExceptionInformation is only valid within the context of the __except expression, while GetExceptionCode is valid for both the __except expression and the exception handler block.
Thus, to protect a block of code, a driver "wraps" it in a structured exception handler, something like this:
__try {
    /* this is the protected code block */
} __except (MyExceptionFilter(GetExceptionInformation())) {
    /* this is the exception handler */
}
The exception filter function, MyExceptionFilter, is thus responsible for determining the correct course of action for the given exception. This can be to provide debugging assistance, or to ignore the source of the exception, or to convert the exception into an error. Regardless, the return value from this function will determine whether or not the exception handler is invoked. The standard include files define three symbolic constants that are typically used as part of this structured exception handling:
· EXCEPTION_EXECUTE_HANDLER - this value (1) indicates that the exception handler should be invoked.
· EXCEPTION_CONTINUE_SEARCH - this value (0) indicates that the exception is not handled; the exception handler is not invoked and the operating system searches for a different exception handler to process this exception.
· EXCEPTION_CONTINUE_EXECUTION - this value (-1) indicates that the exception has been corrected and execution should resume at the point where the exception occurred.
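To make the three return values concrete, here is a portable C sketch of the decision an exception filter such as MyExceptionFilter might make. It is only a model: my_exception_record is a hypothetical stand-in for the EXCEPTION_RECORD reached through GetExceptionInformation, the MY_STATUS_ names are local copies of two well-known status codes, and the policy (handle access violations and breakpoints, pass everything else on) is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as listed above; guarded so the sketch also compiles where the
 * Windows headers already define them. */
#ifndef EXCEPTION_EXECUTE_HANDLER
#define EXCEPTION_EXECUTE_HANDLER      1
#define EXCEPTION_CONTINUE_SEARCH      0
#define EXCEPTION_CONTINUE_EXECUTION (-1)
#endif

/* Local copies of two status codes (assumed names, real values). */
#define MY_STATUS_ACCESS_VIOLATION 0xC0000005u
#define MY_STATUS_BREAKPOINT       0x80000003u

/* Hypothetical stand-in for the exception record handed to the filter. */
typedef struct { uint32_t code; } my_exception_record;

/* Decide whether the handler block should run for this exception. */
int my_exception_filter(const my_exception_record *rec)
{
    assert(rec != NULL);
    switch (rec->code) {
    case MY_STATUS_ACCESS_VIOLATION:   /* e.g. bad user buffer */
    case MY_STATUS_BREAKPOINT:         /* embedded breakpoint */
        return EXCEPTION_EXECUTE_HANDLER;
    default:
        /* Not ours: let an outer handler, if any, deal with it. */
        return EXCEPTION_CONTINUE_SEARCH;
    }
}
```

Returning EXCEPTION_CONTINUE_SEARCH for unexpected codes keeps the filter from silently swallowing errors it does not understand.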
For a driver, structured exception handling must be used whenever calling a routine that might generate an exception. A failure to do so will cause the default OS-provided exception handler to be invoked. This exception handler calls KeBugCheckEx with the bug code KMODE_EXCEPTION_NOT_HANDLED.
In addition, a driver should use structured exception handling whenever performing an operation on a user virtual address. If the address is invalid, the exception handler will be invoked and the driver can process the access violation as necessary.
Another interesting use is to trap operations that work by raising exceptions. For example, the breakpoint mechanism works by raising a breakpoint exception (STATUS_BREAKPOINT). A driver that has embedded breakpoints invokes the debugger because it raises this exception. If a debugger is not attached, however, the default exception handler is invoked and the operating system halts. If the driver writer uses structured exception handling, the breakpoint exception can be ignored, which happens in the case where there is no attached debugger. This code would look something like:
__try {
    DbgBreakPoint();
} __except(EXCEPTION_EXECUTE_HANDLER) {
    /* ignore */
}
This prevents the driver from crashing, even with embedded breakpoints within it.
For additional information on structured exception handling, including the use of termination handlers (__finally and __leave) we suggest referring to the Windows Platform SDK documentation. For information on how the Windows operating system implements structured exception handling, refer to Matt Pietrek's article "A Crash Course on the Depths of Win32 Structured Exception Handling," from the January 1997 Microsoft Systems Journal and reproduced in the Windows Platform SDK documentation.
Q39 I am using a completion routine in my filter. What am I allowed to do? What am I NOT allowed to do? What alternatives do I have to performing the work in my completion routine?
A completion routine in a file system filter driver has certain restrictions:
· A completion routine may be called at DISPATCH_LEVEL and thus the completion routine may need to post the IRP to a work routine so that it may be further processed.
· A completion routine may be called while the file system still has registered thread-local storage in use (notably, the "top level IRP" field, which is retrieved using the IoGetTopLevelIrp function). A filter driver must be able to handle this situation properly, either by storing this field's value, clearing it, and then restoring it upon completion of the call back into the file system driver, or by posting the request to a different thread (where the thread-local information will not be stored).
· A completion routine may be called while the file system still has ownership of file system internal synchronization mechanisms (e.g., ERESOURCE objects) that restrict the ability of the file system to handle arbitrary calls. In such a case, the work must be completed by a different thread and the completion routine code cannot block waiting for this completion by the separate thread.
· In the completion routine for IRP_MJ_CLEANUP, IRP_MJ_CLOSE, IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_QUERY_INFORMATION, and IRP_MJ_SET_INFORMATION it is possible that the file object passed into the completion routine has already been processed by the file system, I/O Manager, and Object Manager. Because of this, the file object must not be used in any operations above the file system level. This is because the reference count on the file object should not be incremented (e.g., a call to ObQueryNameString results in a reference on the file object for the duration of the operation). Otherwise, the OS behavior will be incorrect.
In general, a conservatively written completion routine will do only minimal processing (processing that is safe at DISPATCH_LEVEL). Should it require additional processing, the completion routine must build a work item and enqueue the work item to a separate worker thread. It should not wait for the completion of that work item (if this is necessary, waiting should be done in the dispatch entry point routine, where the IRQL of the system must be below DISPATCH_LEVEL).
Q40 How do I force files to be closed from my file system/filter driver?
A common problem for file system and file system filter driver writers is that they observe files are "never" closed. They often assume this is some bug in their filter or file system but in fact this is normal system behavior because of the way the OS caches files in the virtual memory system.
A file may be accessed using either the "normal" read/write API or via the alternative memory mapped file API. Typically, user access to a file, even when performed via the read/write API, is actually satisfied using the memory mapping mechanism.
Q41 Can I rely upon the RelatedFileObject field in the FileObject? How should I use this information?
When a file is opened relative to another opened file, the RelatedFileObject field of the FileObject in the I/O stack location will refer to that related file object. This information is typically used to begin a parse operation (for example) within a file system.
A common mistake in a file system filter driver is to use this information recursively. That is, the filter driver will use FileObject->RelatedFileObject->RelatedFileObject. Unfortunately, this does not work reliably because the I/O Manager only maintains a reference on the FileObject and the FileObject->RelatedFileObject. Thus, it is quite possible that the second related object is not valid.
Another possible use of this information is in constructing the name of the file being opened or created. The file system or filter driver must be careful when doing this, however, because the OPEN_BY_ID case in the IRP_MJ_CREATE path may specify a related file object, but that related object is arbitrary. In earlier versions of Windows NT, the related file object was required in this case, but in Windows 2000 and Windows XP it is no longer necessary for the OPEN_BY_ID case. Also note that the open may be on a stream, relative to the containing file. In both of these cases, constructing the name from this information is not straightforward. It may be simpler to query the underlying file system after the IRP_MJ_CREATE has been processed.
Q42 How do I deal with the "recycle bin"? Is this some special directory in the file system?
The recycle bin is a concept of the Windows Explorer, not of the file system itself. Indeed, the name of the recycle bin is locale specific (based upon a registry parameter). If your filter driver must handle the recycle bin, it will need to obtain that name from the registry and then watch for operations against that directory (in the IRP_MJ_CREATE operations).
Q43 I need to access the file, but it is locked for exclusive access. How do I get around this?
There are several techniques that you can use to circumvent this problem. The simplest, on Windows XP, is to specify the IO_IGNORE_SHARE_ACCESS_CHECK option in a call to IoCreateFileSpecifyDeviceObjectHint. This option is not available on Windows 2000 (or earlier versions).
For any version of the operating system there are a few different techniques that can be applied to resolve the problem:
· The filter driver can "take over" share access checks on a file-by-file basis. In this case, the filter handles share access and always passes in a request for full sharing to the underlying file system. The disadvantage of this approach is that it requires the filter driver writer understand the details of how to handle share access (see the FAT file system or CDFS file system for examples of how this is done at the file system level).
· The filter driver can create a new file object, specifying access that does not conflict with any combination of sharing attributes. For example, this can be done by specifying SYNCHRONIZE access. Subsequent I/O to the file is done by the filter using IRPs directly. Since share access management is done during the IRP_MJ_CREATE handling, this alleviates the need to perform any checks after that point.
There may be other similar techniques as well, but these two have been used in a number of different circumstances successfully by file system filter driver writers.
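To see why an open that requests no read, write, or delete access cannot collide with an exclusive opener, consider the following portable C model of share-access bookkeeping. It is loosely patterned on the kind of counting a file system performs during IRP_MJ_CREATE, but the structures and the function check_and_add_share are hypothetical and simplified; this is not the actual I/O Manager implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file share bookkeeping. */
typedef struct {
    unsigned open_count;    /* opens that asked for read/write/delete */
    unsigned readers, writers, deleters;
    unsigned shared_read, shared_write, shared_delete;
} share_state;

/* What a new open asks for (access) and tolerates (sharing). */
typedef struct {
    int read, write, del;
    int share_read, share_write, share_delete;
} open_request;

/* Returns 0 and records the open if it is compatible with the existing
 * opens, or -1 for a sharing violation. */
int check_and_add_share(share_state *s, const open_request *r)
{
    assert(s != NULL && r != NULL);

    /* An open with no data access (e.g. SYNCHRONIZE only) does not
     * take part in share checking at all. */
    if (!r->read && !r->write && !r->del)
        return 0;

    if ((r->read  && s->shared_read   < s->open_count) ||
        (r->write && s->shared_write  < s->open_count) ||
        (r->del   && s->shared_delete < s->open_count) ||
        (s->readers  != 0 && !r->share_read)  ||
        (s->writers  != 0 && !r->share_write) ||
        (s->deleters != 0 && !r->share_delete))
        return -1;

    s->open_count++;
    s->readers       += r->read ? 1 : 0;
    s->writers       += r->write ? 1 : 0;
    s->deleters      += r->del ? 1 : 0;
    s->shared_read   += r->share_read ? 1 : 0;
    s->shared_write  += r->share_write ? 1 : 0;
    s->shared_delete += r->share_delete ? 1 : 0;
    return 0;
}
```

In this model an exclusive opener blocks a later reader, while the access-free open succeeds and leaves the counters untouched.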
Q44 I need to read a range of the file but it has a byte range lock on it. How can I bypass these byte range locks?
Byte range locks are managed by the file system itself and are enforced for user-level IRP_MJ_READ and IRP_MJ_WRITE operations on the file. To avoid these locks, a file system filter driver can perform operations that are not subject to byte range locks. Typically, this is done by accessing the file using paging I/O of some type. The simplest way to achieve this is to memory map the file. Memory mapping a file can be done in a user mode application or in the kernel using the ZwCreateSection, ZwMapViewOfSection, and MmMapViewInSystemSpace operations. Subsequent access to the file contents is done through this memory mapped region; when data must be fetched from the file, it is fetched using paging I/O.
Q45 I need to build my own IRP. How do I do this?
Building IRPs can be either simple (if you use one of the standard routines) or extremely complicated (if you build your own IRPs). This is because the range of options for IRPs is quite broad - there are numerous flags, options, combinations of parameters, etc. that materially impact the way that IRPs are interpreted by the system. Thus, in discussing this issue, we will first look at the simple mechanisms and then discuss the more general problem.
The Windows operating system provides a number of different "standard" routines for building I/O request packets:
· IoBuildAsynchronousFsdRequest
· IoBuildSynchronousFsdRequest
· IoBuildDeviceIoControlRequest
· IoMakeAssociatedIrp
Each of these functions can be used to construct a specific type of IRP that can then, in turn, be used by a file system or filter driver to call lower level drivers. The first two calls (IoBuildAsynchronousFsdRequest and IoBuildSynchronousFsdRequest) are similar. They both are used for IRP_MJ_PNP, IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_FLUSH_BUFFERS, or IRP_MJ_SHUTDOWN processing.
The primary difference between the two is that an IRP built with IoBuildSynchronousFsdRequest is added to the IRP list associated with the current thread - this ensures that if the current thread terminates prematurely, this IRP will be cancelled. If the driver uses IoBuildAsynchronousFsdRequest the IRP is not associated with the current thread. Thus, the driver is responsible (in its completion routine) for freeing the IRP directly (using IoFreeIrp). Another difference is that IoBuildSynchronousFsdRequest requires that the driver provide the address of an event object. This event object will be set when the I/O operation completes.
It is important to note that the names of these two functions might lead a developer to believe that they implement synchronous or asynchronous I/O, but that is not the case - they build IRPs that will work, regardless of whether the underlying driver implements synchronous or asynchronous behavior. If the driver wishes to have synchronous behavior, it must still examine the return code from IoCallDriver and, if the return value is STATUS_PENDING, wait for the completion of the I/O operation. If the IRP is of the "synchronous" variety, the I/O Manager will signal the completion by setting the event. If the IRP is of the "asynchronous" variety, the driver is responsible for signaling the completion in its completion routine (the same point when it frees the IRP using IoFreeIrp).
Device control operations are performed by using IoBuildDeviceIoControlRequest and it can be used to construct essentially any device control operation that one driver wishes to send to another driver. Of particular note, this routine can be used to construct internal device control operations - and such operations can only be constructed within the kernel. Any driver receiving an internal device control (IRP_MJ_INTERNAL_DEVICE_CONTROL) is assured that only another driver is making the call. Thus, these operations are ideal for driver-to-driver communications. Unfortunately, file system control operations cannot be constructed with this call - they must be created manually.
For all other types of operations, if an IRP is required, it must be constructed by the driver directly. The basic steps for this are:
· Allocate the I/O Request Packet; alternatively, the driver may maintain a pool of IRPs itself in which case it initializes the IRP.
· Set up the base fields of the IRP, including flags fields.
· Set up the next stack location with the appropriate properties.
Unfortunately, there are a number of complex issues surrounding the various fields of the IRP itself because these fields are interpreted by the I/O Manager. To simplify this, we suggest that one of the standard routines be used whenever possible. For example, it is possible to use IoBuildDeviceIoControlRequest as a starting point even when sending an IRP_MJ_FILE_SYSTEM_CONTROL operation, because the two IRPs are constructed in similar fashion; the driver then adjusts the major function code in the next stack location before sending the IRP.
Q46 How do timestamps work on files? What is the "change time" versus the "modify time"?
The Windows operating system exposes four different timestamps for files. Note that the file system typically provides the initial value for these, although they can be changed programmatically by applications. These four values are:
· The creation time. This is intended to represent time when the file was first created.
· The last access time. This is intended to represent the last time the file was accessed (that is, the data contents of the file were retrieved). Some file systems eliminate storing this value because updating it is an expensive operation. For example, the NTFS file system experiences a substantial performance improvement if this feature is disabled. (See Microsoft Knowledge Base Article # Q185590 for information about this option - search for "access time"). This is done by setting the value NtfsDisableLastAccessUpdate (a REG_DWORD) to 1. If this value is not present, or if it is zero, the last access time is updated. The value is added in the Control\FileSystem key of the CurrentControlSet in the registry database.
· The last modified time. This is intended to represent the last time the contents of the file were modified.
· The last change time. This is intended to represent the last time any attribute of the file was modified. This is distinct from the last modified time, because changing the timestamps (for example) will update the last change time, while it will not update the last modified time.
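The relationship between the four values can be summarized with a small portable C model. The structure, function names, and policy flag are hypothetical, and the update rules are deliberately simplified. In particular, the model assumes that a data write updates the change time as well as the modified time; the actual policy belongs to the individual file system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the four timestamps described above. */
typedef struct {
    int64_t creation_time;
    int64_t last_access_time;
    int64_t last_modified_time;
    int64_t last_change_time;
} file_times;

/* Models a policy switch such as NtfsDisableLastAccessUpdate. */
typedef struct { int disable_last_access_update; } fs_policy;

/* File creation: the file system supplies all four initial values. */
void times_on_create(file_times *t, int64_t now)
{
    assert(t != NULL);
    t->creation_time = now;
    t->last_access_time = now;
    t->last_modified_time = now;
    t->last_change_time = now;
}

/* Reading data touches only the access time, and only if enabled. */
void times_on_read(file_times *t, const fs_policy *p, int64_t now)
{
    assert(t != NULL && p != NULL);
    if (!p->disable_last_access_update)
        t->last_access_time = now;
}

/* Writing data: modified time, plus change time (ASSUMPTION, see text). */
void times_on_write(file_times *t, int64_t now)
{
    assert(t != NULL);
    t->last_modified_time = now;
    t->last_change_time = now;
}

/* Changing an attribute (including a timestamp): change time only. */
void times_on_attribute_change(file_times *t, int64_t now)
{
    assert(t != NULL);
    t->last_change_time = now;
}
```

The last function captures the distinction drawn above: an attribute change moves the change time while the modified time stays where the last data write left it.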
Timestamps are normally manipulated by archive management and restore utilities. This is because they want the file, as it is stored on the disk, to represent the original attributes of the file, not the attributes assigned to it by the file system.
One point we often note for file system filter drivers - opening a file can cause changes to the timestamps of the file, depending upon the policy and implementation of the given file system.
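The difference between the last modified time and the last change time can be illustrated with a small user-mode model. This is only a sketch under stated assumptions: the FileTimes structure and the Model* helpers are hypothetical names invented for this example, they are not kernel interfaces, and a real file system applies its own update policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the four timestamps; not a kernel structure. */
typedef struct {
    int64_t Creation;
    int64_t LastAccess;
    int64_t LastModified;  /* moves when the data contents change */
    int64_t LastChange;    /* moves when data or any attribute changes */
} FileTimes;

/* A newly created file starts with all four timestamps equal. */
static FileTimes ModelCreate(int64_t now)
{
    FileTimes t = { now, now, now, now };
    return t;
}

/* Reading data moves only the last access time, and only if the
   (modeled) file system has access-time updates enabled. */
static void ModelReadData(FileTimes *t, int64_t now, int accessUpdatesEnabled)
{
    assert(t != NULL);
    if (accessUpdatesEnabled) {
        t->LastAccess = now;
    }
}

/* Writing data changes the contents: modified and change both move. */
static void ModelWriteData(FileTimes *t, int64_t now)
{
    assert(t != NULL);
    t->LastModified = now;
    t->LastChange = now;
}

/* Changing an attribute moves the change time but not the modified time. */
static void ModelSetAttributes(FileTimes *t, int64_t now)
{
    assert(t != NULL);
    t->LastChange = now;
}
```

In this model, a write at time 200 followed by an attribute change at time 300 leaves LastModified at 200 and LastChange at 300.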
Q47 How does dismount work? How does this differ from media removal? Device removal?
In Windows there is a model for removable devices and a separate model for removable media. Thus, a device may be removable but not have removable media, it may not be removable but support removable media, it may be both, or it may be neither. To discuss these issues, we need to distinguish between whether the device is being removed or the media is being removed from the device.
Device removal is achieved using plug and play operations. All plug and play operations are represented by IRP_MJ_PNP IRPs, with specific minor function codes indicating the particular plug and play operation being performed. A device removal is indicated by the operating system when it sends an initial inquiry (specifying IRP_MN_QUERY_REMOVE_DEVICE). This initial inquiry about the device is to determine if the drivers in the device stack will allow removal of the device. Each driver in the stack can independently decide that the remove will - or will not - succeed. Thus, each driver must also be able to handle a decision (by a lower device) not to support removal of the given device.
For example, a file system might choose not to support device removal because the device in question contains a critical file (e.g., paging file) or because there are open references to the device. Of course, a file system filter driver could also make such a determination for its own purposes. Eventually the actual driver for the device and for the bus will determine if the hardware is in a state to allow device removal. Assuming that all of the system components agree to allow this device removal, the IRP will be completed successfully.
The operating system will then either indicate the removal of the device (IRP_MN_REMOVE_DEVICE) or the cancellation of the removal request (IRP_MN_CANCEL_REMOVE_DEVICE). In either case, the associated drivers must be able to process the operation at that point, since these are both indications of a state change and, in general, should not be rejected by any driver in the stack.
Of course, any file system may be in a position where it must handle device removal operations because there is a broad range of removable devices.
Another possibility with removable devices is that the user might remove the device without stopping it first. In this case, the storage stack is sent an IRP_MN_SURPRISE_REMOVAL plug and play notification. There is no mechanism for any driver to "fail" this operation because this is an indication to the drivers in that stack that the device is physically gone - it is no longer connected to the system!
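The removal sequence described above can be summarized as a small state machine. The following is a user-mode sketch only: the enumerations, the ModelDevice structure, and ModelPnp are hypothetical stand-ins for the IRP_MN_* minor codes and for a driver's private state, and the veto conditions (open files, paging file) are simply the examples given in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the plug and play minor function codes. */
typedef enum {
    MinorQueryRemove,     /* IRP_MN_QUERY_REMOVE_DEVICE  */
    MinorCancelRemove,    /* IRP_MN_CANCEL_REMOVE_DEVICE */
    MinorRemove,          /* IRP_MN_REMOVE_DEVICE        */
    MinorSurpriseRemoval  /* IRP_MN_SURPRISE_REMOVAL     */
} PnpMinor;

typedef enum { Started, RemovePending, SurpriseRemoved, Removed } DevState;

typedef struct {
    DevState State;
    int      OpenFiles;
    bool     HoldsPagingFile;
} ModelDevice;

/* Returns true if the request is accepted. Only the query may be vetoed;
   the other three are notifications of a state change. */
static bool ModelPnp(ModelDevice *d, PnpMinor minor)
{
    assert(d != NULL);
    switch (minor) {
    case MinorQueryRemove:
        if (d->State != Started || d->OpenFiles > 0 || d->HoldsPagingFile) {
            return false;              /* veto: device stays as it was */
        }
        d->State = RemovePending;
        return true;
    case MinorCancelRemove:
        if (d->State == RemovePending) {
            d->State = Started;        /* roll back to normal operation */
        }
        return true;
    case MinorSurpriseRemoval:
        d->State = SurpriseRemoved;    /* hardware is already gone */
        return true;
    case MinorRemove:
        d->State = Removed;
        return true;
    }
    return false;
}
```

Note that the surprise removal path succeeds even while files are open, which is why a file system must be prepared to fail subsequent I/O rather than refuse the notification.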
Removable media is handled somewhat differently in Windows file systems because the ability to remove media pre-dates the ability to dynamically remove devices. There are a number of different sequences of media removal, all centered around the FSCTL_DISMOUNT_VOLUME operation. Each of these is presented to the file system as an IRP_MJ_FILE_SYSTEM_CONTROL with the IRP_MN_USER_FS_REQUEST minor function code. The file object provided should be an open instance of the volume (and hence would be subject to normal volume access checks for shared access). The traditional sequence of operations is for an application to lock the volume (FSCTL_LOCK_VOLUME), which ensures that there are no open files accessing the volume. This will also force dirty cached data back to disk, as the file system driver will flush all outstanding file objects. Some applications may perform additional processing at this point (e.g., the format utility that writes directly to the raw volume). Eventually, the volume is dismounted (the FSCTL_DISMOUNT_VOLUME operation). Some applications then unlock the volume (FSCTL_UNLOCK_VOLUME) while others simply close the volume object. Regardless, a volume is implicitly unlocked when the file object used to lock the volume is closed (IRP_MJ_CLEANUP). Some application programs are "sloppy" in this regard and rely upon these semantics.
An alternative to this approach is to forcibly dismount the volume. In this case, the file system receives the FSCTL_DISMOUNT_VOLUME operation without a preceding lock operation. In this case, the volume is forcibly dismounted, even if there are open files on the volume. Typically, the file system will flush data back (for the existing open files) and then dismount the volume.
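The two dismount sequences (lock first, or forced) can be modeled as follows. This is a user-mode sketch with hypothetical names (ModelVolume and the Model* functions); the integer "handle" stands in for the file object that represents an open instance of the volume, and the model ignores flushing entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  OpenFiles;       /* user files currently open on the volume */
    int  LockOwner;       /* volume handle holding the lock; 0 = unlocked */
    bool Mounted;
    bool ForcedDismount;  /* set when dismounted without holding the lock */
} ModelVolume;

/* FSCTL_LOCK_VOLUME: succeeds only if nothing else is open on the volume. */
static bool ModelLockVolume(ModelVolume *v, int handle)
{
    assert(v != NULL && handle != 0);
    if (v->OpenFiles > 0 || v->LockOwner != 0) {
        return false;
    }
    v->LockOwner = handle;
    return true;
}

/* FSCTL_DISMOUNT_VOLUME: always dismounts; without the lock it is forced. */
static void ModelDismountVolume(ModelVolume *v, int handle)
{
    assert(v != NULL && handle != 0);
    v->ForcedDismount = (v->LockOwner != handle);
    v->Mounted = false;
}

/* FSCTL_UNLOCK_VOLUME: releases the lock if this handle owns it. */
static void ModelUnlockVolume(ModelVolume *v, int handle)
{
    assert(v != NULL);
    if (v->LockOwner == handle) {
        v->LockOwner = 0;
    }
}

/* IRP_MJ_CLEANUP on the volume handle: implicit unlock. */
static void ModelCleanupHandle(ModelVolume *v, int handle)
{
    ModelUnlockVolume(v, handle);
}
```

The implicit unlock in ModelCleanupHandle is the behavior that the "sloppy" applications mentioned above rely upon.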
Some physical media devices support forcible dismount using an external mechanism. In this case, the media is removed from the drive without advising the OS of this. Traditionally, this is handled by the file system when it notices that the media device indicates a change (either by returning STATUS_VERIFY_REQUIRED or because the DO_VERIFY_VOLUME bit is set in the Flags field of its DeviceObject). If the file system supports externally removable media (and it is useful to note that NTFS does not do so), it requests a verification of the volume by calling IoVerifyVolume. The I/O Manager calls the file system to verify the volume. If the volume has changed, STATUS_WRONG_VOLUME is returned and the I/O Manager will initiate new mount processing on the volume so that control of the volume can be handed to a new file system driver. If the volume has not changed, STATUS_SUCCESS is returned. The file system performing the verification is responsible for clearing the DO_VERIFY_VOLUME bit in the physical media's device object. The file systems deal with media verification failures in different ways. Some file systems always fail the verification (for example, a CD-ROM file system, where the CD image might have changed even though the signature of the volume has not; this happens when a new session is burned onto the CD). Others do not support this (because they do not support external removal of the media). Some always write data through on removable media so that the removal of the media merely requires updating internal state information with respect to the given volume.
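The verification handshake can be sketched in user-mode C. Everything here is a model: ModelDisk, ModelVcb and the Model* functions are hypothetical, the flag constant is a local stand-in for DO_VERIFY_VOLUME, and comparing a single serial number is an assumption made for brevity (as noted above, some file systems do not trust the volume signature alone).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DO_VERIFY_VOLUME 0x00000002u  /* local stand-in for DO_VERIFY_VOLUME */

typedef enum { ModelStatusSuccess, ModelStatusWrongVolume } ModelStatus;

typedef struct {
    uint32_t Flags;        /* models the Flags field of the media device object */
    uint32_t MediaSerial;  /* identity of the media currently in the drive */
} ModelDisk;

typedef struct {
    uint32_t MountedSerial;  /* identity of the media the volume was mounted on */
    bool     Mounted;
} ModelVcb;

/* The storage stack notices a media change and requests verification. */
static void ModelMediaChanged(ModelDisk *d, uint32_t newSerial)
{
    assert(d != NULL);
    d->MediaSerial = newSerial;
    d->Flags |= MODEL_DO_VERIFY_VOLUME;
}

static bool ModelNeedsVerify(const ModelDisk *d)
{
    assert(d != NULL);
    return (d->Flags & MODEL_DO_VERIFY_VOLUME) != 0;
}

/* The file system's verify handler: clear the bit, then report whether
   the media in the drive is still the mounted volume. */
static ModelStatus ModelVerifyVolume(ModelDisk *d, ModelVcb *vcb)
{
    assert(d != NULL && vcb != NULL);
    d->Flags &= ~MODEL_DO_VERIFY_VOLUME;
    if (d->MediaSerial == vcb->MountedSerial) {
        return ModelStatusSuccess;
    }
    vcb->Mounted = false;  /* a new mount will be attempted */
    return ModelStatusWrongVolume;
}
```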
Some devices support "software" ejection, even though the actual switch is present on the hardware. For these devices, an application and/or driver combination monitors the state of the eject request button. When the button is pressed, the components running on the system perform a software dismount. This allows file systems that do not support "removable media" to function correctly with removable devices.
Q48 What happens if I mix memory mapped I/O with regular file I/O? What if the file I/O is non-cached?
Memory mapped I/O is, by its very nature, inherently cached file access. To ensure coherency between memory mapped I/O and non-cached I/O the file systems do attempt to perform cache invalidation, but because of the implementation model for the Windows VM system it is not possible to guarantee the coherency of non-cached I/O when mixed with memory mapped file access.
The one type of memory mapped file access that does retain its coherency is the file system data cache. Each time a non-cached I/O is performed, the file system must ensure that the non-cached I/O is consistent with any data being cached by the file system data cache. Thus, if an application performs a non-cached read operation, the file system will first flush the data from the cache to disk and then allow the non-cached read to retrieve the data contents from the disk. This ensures that the correct data is read from the disk by the I/O operation. If an application performs a non-cached write operation, the file system will invalidate data in the file system data cache (CcPurgeCacheSection) or within the virtual memory system (MmFlushImageSection) and then allow the non-cached I/O operation to proceed.
Thus, in both of these cases the data within the cache remains coherent with respect to the non-cached I/O. If an application has memory mapped the file, however, the file system cannot invalidate these mappings - the operating system knows there are references to the memory in the cache, but it does not know which processes are using those mappings.
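The flush-before-read and purge-before-write behavior for the file system data cache can be modeled in a few lines. This is a user-mode sketch, not file system code: ModelFile and the Model* functions are hypothetical, the "disk" and "cache" are plain arrays, and flushing dirty data before the purge on a non-cached write is an assumption of this model. The model has no notion of a user mapping, which is exactly the case the text says cannot be kept coherent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_FILE_SIZE 8

typedef struct {
    unsigned char Disk[MODEL_FILE_SIZE];
    unsigned char Cache[MODEL_FILE_SIZE];
    bool CacheValid;  /* the cache holds a copy of the file */
    bool CacheDirty;  /* the cache has changes not yet written to disk */
} ModelFile;

static void ModelFillCache(ModelFile *f)
{
    if (!f->CacheValid) {
        memcpy(f->Cache, f->Disk, MODEL_FILE_SIZE);
        f->CacheValid = true;
        f->CacheDirty = false;
    }
}

static void ModelFlushCache(ModelFile *f)
{
    if (f->CacheValid && f->CacheDirty) {
        memcpy(f->Disk, f->Cache, MODEL_FILE_SIZE);
        f->CacheDirty = false;
    }
}

static void ModelCachedWrite(ModelFile *f, size_t off, unsigned char b)
{
    assert(f != NULL && off < MODEL_FILE_SIZE);
    ModelFillCache(f);
    f->Cache[off] = b;
    f->CacheDirty = true;
}

static unsigned char ModelCachedRead(ModelFile *f, size_t off)
{
    assert(f != NULL && off < MODEL_FILE_SIZE);
    ModelFillCache(f);
    return f->Cache[off];
}

/* Non-cached read: flush first so the disk holds the latest data. */
static unsigned char ModelNonCachedRead(ModelFile *f, size_t off)
{
    assert(f != NULL && off < MODEL_FILE_SIZE);
    ModelFlushCache(f);
    return f->Disk[off];
}

/* Non-cached write: flush (model assumption), purge, then write to disk. */
static void ModelNonCachedWrite(ModelFile *f, size_t off, unsigned char b)
{
    assert(f != NULL && off < MODEL_FILE_SIZE);
    ModelFlushCache(f);
    f->CacheValid = false;  /* purge: the next cached access refills from disk */
    f->Disk[off] = b;
}
```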
Q49 I am getting a PFN_LIST_CORRUPT STOP code. What does this mean? What could I be doing wrong? How do I work around this problem?
PFN_LIST_CORRUPT (0x4E) occurs any time the memory manager detects an incorrect or inconsistent state in its internal memory tracking data structures. Several of these conditions would be symptomatic of an internal OS bug, but the stop code can also be caused by improper behavior within a driver in managing Memory Descriptor Lists (MDLs). Typically, it indicates improper use of the Memory Manager calls used to create or delete page mappings. In our experience, we have observed these problems when partial MDLs are improperly unmapped, or when MDLs are used against non-paged regions of memory (normally non-paged pool).
The typical case we observe is where the first parameter of the stop code is 0x7. In this case, the two reference counts on the page (each one representing a different type of usage) are inconsistent. Resolving this problem requires identifying the MDL being manipulated and ascertaining what operations have been performed. As is often the case with problems of this type, the location where the crash occurs is not the location where the bug occurred, but is rather the location where the bug was finally detected.
To locate this problem we recommend:
· Review any code within your driver that manipulates MDLs. Ensure that you are handling partial MDLs and non-paged memory regions correctly.
· Use the Driver Verifier.
· Use the checked build of the operating system.
Q50 What are the rules for my file system/filter driver for handling paging I/O? What about paging file I/O?
The rules for handling page faults are quite strict because incorrect handling can lead to catastrophic system failure. For this reason, there are specific rules used to ensure correct cooperation between file systems (and file system filter drivers) and the virtual memory system. This is necessary because page faults are trapped by the VM system, but are then ultimately satisfied by the file system (and associated storage stack). Thus, the file system must not generate additional page faults because this may lead to an "infinite" recursion. Normally, hardware platforms have a finite limit to the number of page faults they will handle in a "stacked" fashion.
Thus, the most reliable of the paging paths is that for the paging file. Any access to the paging file must not generate any additional page faults. Further, to avoid serialization problems, the file system is not responsible for serializing this access - the paging file belongs to the VM system, and the VM system is responsible for serializing access to this file (this eliminates some of the re-entrant locking problems that occur with general file access). To achieve this, the drivers involved must ensure they will not generate a page fault. They must not call any routines that could generate a page fault (e.g., code modules that can be paged). The file system may only be called at APC_LEVEL but it should only call those routines that are safe to call at DISPATCH_LEVEL, since such routines are guaranteed not to cause page faults. None of the data being used through this path should be pageable.
For all other files, paging I/O has less stringent rules. For a file system driver the code paths cannot be paged - otherwise, the page fault might be to fetch the very piece of code needed to satisfy the page fault! For any file system, data may be paged, since such a page fault can always be resolved eventually by retrieving the correct contents off disk. Paging activity occurs at APC_LEVEL and thus limits arbitrary re-entrancy in order to prevent calling code paths that could generate yet more page faults.
These rules will prevent the system from processing more than two page faults in a stacked fashion within the thread context. This should be safe for all processors supported by Windows.
Q51 I am getting NO_MORE_IRP_STACK_LOCATIONS as a stop code. How do I fix this?
This error occurs because your driver calls IoCallDriver but there are no additional stack locations available within the IRP. As a result, the IRP is likely to already be damaged. The maximum number of I/O stack locations can be exceeded for several possible reasons:
· The caller created an IRP with an insufficient number of stack locations. In earlier versions of Windows NT this situation would arise with OS components that pre-allocated fixed-size IRPs. Typically, this was resolved by changing some component-specific registry entry to increase the fixed size.
· A driver in the stack has the incorrect value in the StackSize field of the DEVICE_OBJECT. In this case, the driver should be modified to report the correct size. Normally, filter drivers need not worry about this because IoAttachDeviceToDeviceStackSafe and IoAttachDevice set up this field by adding one to the value of the device to which they attach. Drivers that do not filter are responsible for determining the correct value for this field.
· A filter driver can cause this problem by incorrectly copying the stack location. Drivers should use IoCopyCurrentIrpStackLocationToNext. In older versions of Windows NT where this call is not available, it is possible to use the newer implementation (which is a macro) or to call RtlCopyMemory and follow this with a call to IoSetCompletionRoutine. If no completion routine is needed, the three Boolean values should be set to FALSE so that the completion routine is not called.
In newer versions of Windows this has become an unusual error because most filter drivers function properly in this regard. The most common manifestation of this problem remains kernel mode components that utilize fixed-size IRPs, where they detect the improper size and return an error back to the caller, rather than risk running out of stack locations. In such a case, the resolution is normally to change the registry parameters to increase the fixed size.
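The relationship between StackSize and the number of stack locations in an IRP can be shown with a small model. This is a user-mode sketch with hypothetical names (ModelDevice, ModelIrp and the Model* functions); the counting scheme is an assumption that loosely follows the description above, and where the real system would raise the stop code the model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ModelDevice {
    struct ModelDevice *Lower;  /* next device down the stack, or NULL */
    int StackSize;              /* stack locations needed from here down */
} ModelDevice;

typedef struct {
    int StackCount;       /* locations allocated with the IRP */
    int CurrentLocation;  /* counts down as the IRP is passed along */
} ModelIrp;

/* Attaching a filter: its StackSize is the target's value plus one. */
static void ModelAttach(ModelDevice *filter, ModelDevice *target)
{
    assert(filter != NULL && target != NULL);
    filter->Lower = target;
    filter->StackSize = target->StackSize + 1;
}

static ModelIrp ModelAllocateIrp(int stackSize)
{
    ModelIrp irp;
    assert(stackSize > 0);
    irp.StackCount = stackSize;
    irp.CurrentLocation = stackSize + 1;
    return irp;
}

/* Sends the IRP to 'dev' and on down the stack. Each driver consumes one
   location. Returns false at the point the real system would bugcheck. */
static bool ModelCallDriver(const ModelDevice *dev, ModelIrp *irp)
{
    assert(irp != NULL);
    while (dev != NULL) {
        if (--irp->CurrentLocation <= 0) {
            return false;  /* no stack location left for this driver */
        }
        dev = dev->Lower;
    }
    return true;
}
```

An IRP sized from the StackSize of the top device always reaches the bottom of the stack; an IRP sized before the filter attached (the fixed-size case described above) runs out one driver early.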
Q52 What are the obsolete calls in Windows 2000? In Windows XP?
The following calls are obsolete in both Windows 2000 and Windows XP:
· RtlLargeIntegerAdd
· RtlEnlargedIntegerMultiply
· RtlEnlargedUnsignedMultiply
· RtlEnlargedUnsignedDivide
· RtlLargeIntegerNegate
· RtlLargeIntegerSubtract
· RtlExtendedMagicDivide
· RtlExtendedLargeIntegerDivide
· RtlLargeIntegerDivide
· RtlExtendedIntegerMultiply
· RtlLargeIntegerAnd
· RtlConvertLongToLargeInteger
· RtlConvertUlongToLargeInteger
· RtlLargeIntegerShiftLeft
· RtlLargeIntegerShiftRight
· RtlLargeIntegerArithmeticShift
· RtlLargeIntegerGreaterThan
· RtlLargeIntegerGreaterThanOrEqualTo
· RtlLargeIntegerEqualTo
· RtlLargeIntegerNotEqualTo
· RtlLargeIntegerLessThan
· RtlLargeIntegerLessThanOrEqualTo
· RtlLargeIntegerGreaterThanZero
· RtlLargeIntegerGreaterOrEqualToZero
· RtlLargeIntegerEqualToZero
· RtlLargeIntegerNotEqualToZero
· RtlLargeIntegerLessThanZero
· RtlLargeIntegerLessOrEqualToZero
· KeGetDcacheFillSize
· ExInterlockedIncrementLong
· ExInterlockedDecrementLong
· ExInterlockedExchangeUlong
· ExInitializeWorkItem
· ExQueueWorkItem
· ExInitializeZone
· ExExtendZone
· ExInterlockedExtendZone
· ExFreeToZone
· ExIsFullZone
· ExInterlockedAllocateFromZone
· ExInterlockedFreeToZone
· ExIsObjectFirstZoneSegment
· ExReleaseResource
· ExInitializeResource
· ExAcquireResourceShared
· ExAcquireResourceExclusive
· ExReleaseResourceForThread
· ExConvertExclusiveToShared
· ExDeleteResource
· ExIsResourceAcquiredExclusive
· ExIsResourceAcquiredShared
· ExIsResourceAcquired
· IoAttachDeviceByPointer
· IoReportResourceUsage
· HalGetDmaAlignmentRequirement
· COMPUTE_PAGES_SPANNED
· MmIsNonPagedSystemAddressValid
· MmCreateMdl
· MmGetSystemAddressForMdl
· KeAttachProcess
· KeDetachProcess
The following routines are obsolete in Windows XP:
· IoAttachDeviceToDeviceStack
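Most of the Rtl large-integer routines in the list can be replaced by the compiler's native 64-bit arithmetic on the QuadPart member of a LARGE_INTEGER. The sketch below illustrates the idea in portable C; ModelLargeInteger is a local stand-in that models only QuadPart, and the Model* functions are hypothetical names that show what the replacement expressions look like, not the obsolete routines themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for LARGE_INTEGER; only QuadPart is modeled. */
typedef struct {
    int64_t QuadPart;
} ModelLargeInteger;

/* In place of RtlLargeIntegerAdd: plain 64-bit addition. */
static ModelLargeInteger ModelAdd(ModelLargeInteger a, ModelLargeInteger b)
{
    ModelLargeInteger r;
    r.QuadPart = a.QuadPart + b.QuadPart;
    return r;
}

/* In place of RtlEnlargedIntegerMultiply: widen first, then multiply. */
static ModelLargeInteger ModelEnlargedMultiply(int32_t a, int32_t b)
{
    ModelLargeInteger r;
    r.QuadPart = (int64_t)a * b;
    return r;
}

/* In place of RtlLargeIntegerGreaterThan: a native comparison. */
static int ModelGreaterThan(ModelLargeInteger a, ModelLargeInteger b)
{
    return a.QuadPart > b.QuadPart;
}

/* In place of RtlLargeIntegerShiftLeft. The shift is done on the unsigned
   representation to avoid undefined behavior for negative values. */
static ModelLargeInteger ModelShiftLeft(ModelLargeInteger a, unsigned count)
{
    ModelLargeInteger r;
    assert(count < 64);
    r.QuadPart = (int64_t)((uint64_t)a.QuadPart << count);
    return r;
}
```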
Q53 How do I enumerate the contents of a directory from kernel mode?
In Windows XP, the function ZwQueryDirectoryFile is available for retrieving the contents of a directory using a file handle. Thus, this can be combined with ZwCreateFile to open a directory and retrieve its contents. The IFS Kit documentation describes how to use ZwQueryDirectoryFile in greater detail.
In versions prior to Windows XP, this routine is not documented. In all versions, it is possible to obtain the contents of a directory by utilizing the IRP_MJ_DIRECTORY_CONTROL operation, specifying IRP_MN_QUERY_DIRECTORY as the minor function code. This call passes in four parameters:
· Length - this is the length of the data buffer provided to the file system
· FileName - this is used for pattern matching against requests within the directory
· FileInformationClass - this is used to describe the format in which the data should be returned to the caller
· FileIndex - this is used to provide the location from which enumeration should take place (note that this is optional, as described later).
In addition, there are three flags in the I/O stack location that indicate the intention of this operation:
· SL_RESTART_SCAN - this indicates that the scan of the directory should be done relative to the first entry in the directory
· SL_RETURN_SINGLE_ENTRY - this indicates that at most one entry should be returned to the caller in the provided buffer
· SL_INDEX_SPECIFIED - this indicates that the FileIndex parameter was specified in the call and enumeration should begin with the entry specified by this index value.
Note that directory enumeration operations pass data back to the caller based upon the Flags field in the DEVICE_OBJECT structure of the file system driver. Thus, as is typical for a file system driver, this will be METHOD_NEITHER and hence the Length parameter specifies the size of Irp->UserBuffer. As with any user level buffer access, this buffer may not be valid (of course, a kernel component making this call will specify RequestorMode in the IRP as KernelMode so that the buffer will be in the kernel address space). Typically, the FileName is only provided during the first call of the enumeration and the file system will store a copy of that data away for subsequent enumerations. If no name is presented, the enumeration will be treated as an enumeration of all entries in the directory. This is the equivalent of specifying "*" or "*.*" as the search pattern. While there are four different formats for the return data (FileDirectoryInformation, FileFullDirectoryInformation, FileBothDirectoryInformation, and FileNamesInformation), normally the preferred format is FileBothDirectoryInformation as this is the format requested by the Win32 subsystem when querying directories.
The file system will return a set of zero or more directory entries; the return value will indicate either an error (STATUS_BUFFER_OVERFLOW, STATUS_NO_SUCH_FILE, or STATUS_NO_MORE_FILES typically), or a success code (STATUS_SUCCESS or STATUS_PENDING). If STATUS_PENDING is returned, the caller can either block and wait or handle the results in a completion routine (which is safe for kernel mode components, since kernel addresses are valid in the completion routine context). The Information field will indicate the number of bytes returned after completion of the operation and the buffer will contain one (or more) entries from the directory.
This interface is stateful, so that a subsequent call will return additional information from the directory if it is available. When no further information is available, the file system will return STATUS_NO_MORE_FILES.
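Each of the return formats is a chain of variable-length entries linked by a NextEntryOffset field, with zero marking the last entry. The following user-mode sketch walks such a buffer. It is an illustration under stated assumptions: ModelNamesInfo is a local declaration laid out like FILE_NAMES_INFORMATION so that the code builds outside the kernel, ModelBuildBuffer is a hypothetical helper that fabricates a buffer from ASCII names, and ModelFindEntry is a hypothetical walker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local declaration modeled on FILE_NAMES_INFORMATION. */
typedef struct {
    uint32_t NextEntryOffset;  /* bytes to the next entry; 0 on the last */
    uint32_t FileIndex;
    uint32_t FileNameLength;   /* in bytes, not characters */
    uint16_t FileName[1];      /* counted, not NUL-terminated */
} ModelNamesInfo;

#define MODEL_HEADER    offsetof(ModelNamesInfo, FileName)
#define MODEL_ALIGN8(n) (((n) + 7u) & ~(size_t)7u)

/* Hypothetical helper: packs ASCII names into 8-byte aligned entries.
   Returns the number of bytes used. */
static size_t ModelBuildBuffer(unsigned char *buf, size_t cap,
                               const char *const *names, size_t count)
{
    size_t off = 0, prev = 0;
    for (size_t i = 0; i < count; i++) {
        size_t chars = strlen(names[i]);
        size_t size = MODEL_ALIGN8(MODEL_HEADER + chars * sizeof(uint16_t));
        ModelNamesInfo hdr = { 0, (uint32_t)i,
                               (uint32_t)(chars * sizeof(uint16_t)), { 0 } };
        assert(off + size <= cap);
        memcpy(buf + off, &hdr, MODEL_HEADER);
        for (size_t c = 0; c < chars; c++) {
            uint16_t wc = (unsigned char)names[i][c];
            memcpy(buf + off + MODEL_HEADER + c * sizeof wc, &wc, sizeof wc);
        }
        if (i > 0) {  /* link the previous entry to this one */
            uint32_t delta = (uint32_t)(off - prev);
            memcpy(buf + prev, &delta, sizeof delta);
        }
        prev = off;
        off += size;
    }
    return off;
}

static int ModelNameEquals(const unsigned char *wname, uint32_t bytes,
                           const char *name)
{
    size_t chars = bytes / sizeof(uint16_t);
    if (strlen(name) != chars) return 0;
    for (size_t i = 0; i < chars; i++) {
        uint16_t wc;
        memcpy(&wc, wname + i * sizeof wc, sizeof wc);
        if (wc != (uint16_t)(unsigned char)name[i]) return 0;
    }
    return 1;
}

/* Walks the chain; returns the position of 'name' or -1 if absent. */
static int ModelFindEntry(const unsigned char *buf, size_t length,
                          const char *name)
{
    size_t off = 0;
    int index = 0;
    if (length == 0) return -1;
    for (;;) {
        ModelNamesInfo hdr = { 0, 0, 0, { 0 } };
        assert(off + MODEL_HEADER <= length);
        memcpy(&hdr, buf + off, MODEL_HEADER);
        if (ModelNameEquals(buf + off + MODEL_HEADER, hdr.FileNameLength, name))
            return index;
        if (hdr.NextEntryOffset == 0) return -1;
        off += hdr.NextEntryOffset;
        index++;
    }
}
```

The walker copies each header out with memcpy rather than casting the buffer pointer, which keeps the sketch independent of the buffer's alignment.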
Q54 I am building a filter driver where I must change the directory information. How do I do that?
Filter drivers that manipulate the contents of directories must be careful to handle these operations properly because application behavior relies upon these very specific semantics. For example, it is an error to return STATUS_SUCCESS but to indicate that no data is being returned. Thus, if a filter determines that it does not wish to return the one entry within the buffer, it should re-issue the request to the underlying file system, rather than relying upon the application to handle this case correctly.
Normally, a filter driver would do this by intercepting the completion of a directory enumeration operation (IRP_MJ_DIRECTORY_CONTROL, IRP_MN_QUERY_DIRECTORY). In the completion routine it can signal the waiting thread in its dispatch entry point, post the IRP for additional processing, or process it directly in the completion routine.
Processing it directly in the completion routine might seem to be the most appealing solution, but it suffers from the standard problems of processing anything within a completion routine:
· The completion routine can be called at DISPATCH_LEVEL. Thus, the allowed operations are limited.
· The completion routine may be called in arbitrary context. Thus, the buffer may not be valid in the current context.
If a completion routine posts the IRP to a worker thread, it must ensure that either its dispatch routine returned STATUS_PENDING, or its dispatch routine blocks waiting for the worker thread to complete the I/O.
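The buffer manipulation itself can be sketched as follows, separately from the question of where (dispatch routine, worker thread, or completion routine) it runs. This is a user-mode model with simplifying assumptions: ModelEntry is a hypothetical fixed-size entry with a NUL-terminated name (real entries are variable length with counted wide-character names), and ModelHideEntry is a hypothetical routine. A return value of zero is the case discussed above in which the filter re-issues the request instead of completing with success and no data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical directory entry. */
typedef struct {
    uint32_t NextEntryOffset;  /* bytes to the next entry; 0 on the last */
    char     Name[12];
} ModelEntry;

/* Removes every entry named 'hide' from the chain in 'buffer'.
   Returns the number of bytes of valid data left. A result of 0 means
   nothing is left to return, so the caller must ask the file system
   for more entries. */
static size_t ModelHideEntry(void *buffer, size_t length, const char *hide)
{
    unsigned char *base = buffer;
    ModelEntry *prev = NULL;
    ModelEntry *cur = (ModelEntry *)base;

    assert(buffer != NULL && hide != NULL);
    if (length == 0) return 0;

    for (;;) {
        uint32_t next = cur->NextEntryOffset;
        size_t curOff = (size_t)((unsigned char *)cur - base);

        if (strcmp(cur->Name, hide) == 0) {
            if (next == 0) {
                /* The hidden entry is the last one in the buffer. */
                if (prev == NULL) return 0;  /* it was the only entry */
                prev->NextEntryOffset = 0;   /* previous entry becomes last */
                return curOff;
            }
            /* Slide the remaining entries over the hidden one, then
               re-examine whatever now sits at 'cur'. */
            memmove(cur, (unsigned char *)cur + next, length - curOff - next);
            length -= next;
            continue;
        }
        if (next == 0) return length;
        prev = cur;
        cur = (ModelEntry *)((unsigned char *)cur + next);
    }
}
```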
Q55 I see the user close the file. My filter receives an IRP_MJ_CLEANUP. But I never see the IRP_MJ_CLOSE? Why not?
The purpose of IRP_MJ_CLEANUP is to indicate that the last handle reference against the given file object has been released. The purpose of IRP_MJ_CLOSE is to indicate that the last system reference against the given file object has been released. This is because the operating system uses two distinct reference counts for any object, including the file object. These values are stored within the object header, with the HandleCount representing the number of open handles against the object and the PointerCount representing the number of references against the object. Since the HandleCount always implies a reference (from a handle table) to the object, the HandleCount is less than or equal to the PointerCount.
Any kernel mode component may maintain a private reference to the object. Routines such as ObReferenceObject, ObReferenceObjectByHandle, and IoGetDeviceObjectPointer all bump the reference count on a specific object. A kernel mode driver releases that reference by using ObDereferenceObject to decrement the PointerCount on the given object.
A file system, or file system filter driver, will often see a long delay between the IRP_MJ_CLEANUP and IRP_MJ_CLOSE because a component within the operating system is maintaining a reference against the file object. Frequently, this is because the memory manager maintains a reference against a file object that is backing a section object. So long as the section object remains "in use" the file object will be referenced. Section objects, in turn, remain referenced for extended periods of time because they are used by the memory manager in tracking the usage of memory for file-backed shared memory regions (e.g., executables, DLLs, memory mapped files). For example, the cache manager uses the section object as part of its mappings of file system data within the cache. Thus, the period of time between the IRP_MJ_CLEANUP and the IRP_MJ_CLOSE can be arbitrarily long.
The other complication here is that the memory manager uses only a single file object to back the section object. Any subsequent file object created to access that file will not be used to back the section and thus for these new file objects the IRP_MJ_CLEANUP is typically followed by an IRP_MJ_CLOSE. Thus, the first file object may be used for an extended period of time, while subsequent file objects have a relatively short lifespan.
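The two reference counts and the two IRPs they trigger can be modeled in a few lines. This is a user-mode sketch: ModelFileObject and the Model* functions are hypothetical, and the two flags merely record the point at which the real system would send IRP_MJ_CLEANUP and IRP_MJ_CLOSE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    long HandleCount;   /* open handles to the object */
    long PointerCount;  /* all references, including one per handle */
    bool CleanupSent;   /* models IRP_MJ_CLEANUP */
    bool CloseSent;     /* models IRP_MJ_CLOSE */
} ModelFileObject;

/* A kernel component takes a private reference. */
static void ModelReference(ModelFileObject *o)
{
    assert(o != NULL);
    o->PointerCount++;
}

/* Dropping the last reference of any kind triggers the close. */
static void ModelDereference(ModelFileObject *o)
{
    assert(o != NULL && o->PointerCount > o->HandleCount);
    if (--o->PointerCount == 0) {
        o->CloseSent = true;
    }
}

/* Opening a handle counts as a handle and as a reference. */
static void ModelOpenHandle(ModelFileObject *o)
{
    assert(o != NULL);
    o->HandleCount++;
    o->PointerCount++;
}

/* Closing the last handle triggers the cleanup; the reference held
   on behalf of the handle is then released. */
static void ModelCloseHandle(ModelFileObject *o)
{
    assert(o != NULL && o->HandleCount > 0);
    if (--o->HandleCount == 0) {
        o->CleanupSent = true;
    }
    ModelDereference(o);
}
```

If something like the memory manager holds a reference when the application closes its handle, the model shows the cleanup without the close, which is the situation the question describes.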
Q56 What are the rules for managing MDLs and User Buffers? How do I substitute my own buffer in an IRP?
In all fairness, there are no "rules" for managing MDLs and user buffers. There are suggestions that we can offer based upon observed behavior of the file systems. First, we note that there are two basic sources of I/O operations for a file system - the applications layer, and other operating system components.
For application programs, most IRPs are still buffered. The operations for which this is not necessarily the case are those for which larger amounts of data are transferred. These are IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, IRP_MJ_SET_QUOTA, and the per-control code options of IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL. If the Flags field in the device object specifies DO_DIRECT_IO then the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, and IRP_MJ_SET_QUOTA is specified as a memory descriptor list (MDL) pointed to by the MdlAddress field of the IRP. If the Flags field in the device object specifies DO_BUFFERED_IO then the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, and IRP_MJ_SET_QUOTA is a non-paged pool buffer pointed to by the AssociatedIrp.SystemBuffer field of the IRP. The most common case for a file system driver is that neither of these two flags is specified, in which case the buffer for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_SECURITY, and IRP_MJ_SET_SECURITY is a direct pointer to the caller-supplied buffer via the UserBuffer field of the IRP.
Interestingly, for IRP_MJ_QUERY_SECURITY and IRP_MJ_SET_SECURITY the buffer is always passed as METHOD_NEITHER. Thus, the user buffer is pointed to by Irp->UserBuffer. The file system is responsible for validating and managing that buffer directly.
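The selection rules in the two paragraphs above reduce to a short decision function. The sketch below is a user-mode model: ModelIrp is a hypothetical structure with three plain pointers standing in for AssociatedIrp.SystemBuffer, MdlAddress and UserBuffer, and the flag constants are local stand-ins for DO_BUFFERED_IO and DO_DIRECT_IO. The text does not cover a device that sets both flags; testing the buffered flag first is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DO_BUFFERED_IO 0x00000004u  /* stand-in for DO_BUFFERED_IO */
#define MODEL_DO_DIRECT_IO   0x00000010u  /* stand-in for DO_DIRECT_IO */

typedef struct {
    void *SystemBuffer;  /* models AssociatedIrp.SystemBuffer */
    void *MdlAddress;    /* models the MDL describing the buffer */
    void *UserBuffer;    /* models the raw caller-supplied pointer */
} ModelIrp;

typedef enum { ModelSystemBuffer, ModelMdl, ModelUserBuffer } ModelBufferKind;

/* Which IRP field describes the data buffer for read, write, directory
   control, EA and quota requests, given the device object flags.
   Security requests always use the raw user pointer. */
static ModelBufferKind ModelBufferKindFor(unsigned deviceFlags,
                                          int isSecurityRequest)
{
    if (isSecurityRequest) {
        return ModelUserBuffer;
    }
    if (deviceFlags & MODEL_DO_BUFFERED_IO) {
        return ModelSystemBuffer;
    }
    if (deviceFlags & MODEL_DO_DIRECT_IO) {
        return ModelMdl;
    }
    return ModelUserBuffer;  /* the common case for a file system */
}

static void *ModelBufferFor(const ModelIrp *irp, unsigned deviceFlags,
                            int isSecurityRequest)
{
    assert(irp != NULL);
    switch (ModelBufferKindFor(deviceFlags, isSecurityRequest)) {
    case ModelSystemBuffer: return irp->SystemBuffer;
    case ModelMdl:          return irp->MdlAddress;
    case ModelUserBuffer:   return irp->UserBuffer;
    }
    return NULL;
}
```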
For the control operations (IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL) the buffer descriptions are a function of the transfer method encoded in the specified control code. There are four transfer methods: METHOD_BUFFERED, METHOD_IN_DIRECT, METHOD_OUT_DIRECT and METHOD_NEITHER.
· METHOD_BUFFERED - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. Upon completion of the operation the output data is in the same buffer. Transferring data between user and kernel mode is handled by the I/O Manager.
· METHOD_IN_DIRECT - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in MdlAddress. When the I/O Manager probed and locked the memory corresponding to the memory for the original buffer, it probed it for read access (hence the IN part of this transfer description). Typically, this is confused because it is referred to as the output buffer (although, probing it for input would suggest it is being used as a secondary input buffer).
· METHOD_OUT_DIRECT - in this case the input data is in the buffer pointed to by AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in MdlAddress. When the I/O Manager probed and locked the memory corresponding to the memory for the original buffer, it probed it for write access (hence the OUT part of this transfer description).
· METHOD_NEITHER - in this case the input data is described by a pointer in the I/O stack location (Parameters.DeviceIoControl.Type3InputBuffer). The output data is described by a pointer in the IRP (UserBuffer). In both cases, these pointers are direct virtual address references to the original buffer. The memory may, or may not, be valid.
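The transfer method occupies the low two bits of a control code, so a driver can recover it with a mask. The sketch below shows this in portable C. ModelCtlCode reproduces the bit layout of the CTL_CODE macro from the DDK headers as a function so that arguments can be range-checked; it, the enumerations and the two lookup functions are local, hypothetical names, and the locations returned simply restate the list above.

```c
#include <assert.h>
#include <stdint.h>

enum {
    ModelMethodBuffered  = 0,
    ModelMethodInDirect  = 1,
    ModelMethodOutDirect = 2,
    ModelMethodNeither   = 3
};

typedef enum {
    LocSystemBuffer,     /* AssociatedIrp.SystemBuffer */
    LocMdlAddress,       /* MdlAddress */
    LocUserBuffer,       /* UserBuffer */
    LocType3InputBuffer  /* Parameters.DeviceIoControl.Type3InputBuffer */
} ModelLocation;

/* Same bit layout as CTL_CODE: device type in bits 16-31, access in
   bits 14-15, function in bits 2-13, method in bits 0-1. */
static uint32_t ModelCtlCode(uint32_t type, uint32_t function,
                             uint32_t method, uint32_t access)
{
    assert(type <= 0xFFFFu && function <= 0xFFFu);
    assert(method <= 3u && access <= 3u);
    return (type << 16) | (access << 14) | (function << 2) | method;
}

static uint32_t ModelMethodFromCtlCode(uint32_t code)
{
    return code & 3u;
}

/* Where the input data is found for a given control code. */
static ModelLocation ModelInputLocation(uint32_t code)
{
    return ModelMethodFromCtlCode(code) == ModelMethodNeither
               ? LocType3InputBuffer
               : LocSystemBuffer;
}

/* Where the output (or secondary) buffer is found. */
static ModelLocation ModelOutputLocation(uint32_t code)
{
    switch (ModelMethodFromCtlCode(code)) {
    case ModelMethodBuffered:  return LocSystemBuffer;
    case ModelMethodInDirect:  return LocMdlAddress;
    case ModelMethodOutDirect: return LocMdlAddress;
    default:                   return LocUserBuffer;
    }
}
```

As a check of the layout, device type 0x9 with function 8, METHOD_BUFFERED and access 0 yields 0x00090020.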
Regardless of the type of transfer, any pointers stored within these buffers will always be direct virtual memory references. For example, a driver that accepts a further buffer pointer will need to treat them as ordinary direct access to application address space.
For any driver, attempting to access a user buffer directly can lead to an invalid memory reference. Such references cause the memory manager to throw exceptions (such as is done using ExRaiseStatus for instance). If a driver has not protected against such exceptions, the default kernel exception handler will be invoked. This handler will call KeBugCheckEx indicating KMODE_EXCEPTION_NOT_HANDLED. The exception will be indicated as STATUS_ACCESS_VIOLATION (0xC0000005). Protecting against such exceptions requires the use of structured exception handling, which is described elsewhere (See Question Number 1.38 for more information).
A normal driver will associate an MDL it creates to describe the user buffer with the IRP. This is useful because it ensures that when the IRP is completed the I/O Manager will clean up these MDLs, which eliminates the need for the driver to provide special handling for such clean-up. As a result, it is normal that if there is both an MdlAddress and UserBuffer address, the MDL describes the corresponding user address (you can confirm this by comparing the value returned by MmGetMdlVirtualAddress with the value stored in the UserBuffer field). Of course, it is possible that a driver might associate multiple MDLs with a single IRP (using the Next field of the MDL itself). This could be done explicitly (by setting the field directly) or implicitly (by using IoAllocateMdl and indicating TRUE for the SecondaryBuffer parameter). This could be problematic for file system filter drivers, should a file system be implemented to exploit this behavior.
The other source of I/O operations is from other OS components. It is acceptable for such OS components to use the same access mechanism used by user-mode components, in which case the earlier analysis still applies. In addition, kernel mode components may utilize direct I/O - regardless of the value in the Flags field of the DEVICE_OBJECT for the given file system. For example, paging I/O operations are always submitted to the file system by utilizing MDLs that describe the new physical memory that will be used to store the data. File systems should not reference the memory pointed to by Irp->UserBuffer although this address will appear to be valid (and will even be the same as is returned by MmGetMdlVirtualAddress). The addresses are not, in fact, valid, but may be used by the file system when constructing multiple sub-operations. Memory Manager MDLs cannot be used for direct reference to that memory, as those buffers have not been mapped into memory since they do not yet contain the correct data.
For a file system filter driver that wishes to modify the data in some way, it is important to keep in mind the use of that memory. For example, a traditional mistake for an encryption filter is to trap the IRP_MJ_WRITE where the IRP_NOCACHE bit is set (which catches both user non-cached I/O as well as paging I/O) and, using the provided MDL or user buffer, encrypt the data in-place. The risk here is that some other thread will gain access to that memory in its encrypted state. For example, if the file is memory mapped, the application will observe the modified data, rather than the original, cleartext data. Thus, there are a few rules that need to be observed by file system filter drivers that choose to modify the data buffers associated with a given IRP:
· The IRP_PAGING_IO bit changes the completion behavior of the I/O Manager. MDLs in such IRPs are not discarded or cleaned up by the I/O Manager, because they belong to the Memory Manager (see IoPageRead as an example). Thus, filter drivers should be careful when setting this bit (e.g., if they create a new IRP and send it down with the resulting data).
· The Irp->UserBuffer must have the same value as is returned by MmGetMdlVirtualAddress. If it does not and the underlying file system must break up the I/O operation into a series of sub-operations, it will do so incorrectly (see how this is handled in deviosup.c within the FAT file system example in the IFS Kit, where it builds a partial MDL using IoBuildPartialMdl. It uses Irp->UserBuffer as an index reference for Irp->MdlAddress). For example, if substituting a new piece of memory (such as for the encryption driver), make sure that this parameter is set correctly.
· Never modify the buffer provided by the caller unless you are willing to make those changes visible to the caller immediately. Keep in mind that in a multi-threaded shared memory operating system the changes are - literally - available and visible to other threads/processes/processors as you make them. Changes should always be made to a separate buffer. That buffer can then be used in lieu of the original buffer, either within the original IRP, or by using a new IRP for that task.
· Use the correct routine for the type of buffer (e.g., MmBuildMdlForNonPagedPool if the memory is allocated from non-paged pool).
· Any reference to a pointer within the user's address space must be protected using __try and __except in order to prevent invalid user addresses from causing the system to crash.
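The rule about never modifying the caller's buffer can be illustrated with a small user-mode model. This is only a sketch, not driver code: the XOR routine stands in for a real cipher, malloc stands in for a pool allocation, and the names xor_transform and make_encrypted_shadow are hypothetical. In a real filter the shadow buffer would also need its own MDL, and Irp->UserBuffer would have to be set as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a real cipher: XOR every byte with a key. */
void xor_transform(unsigned char *buf, size_t len, unsigned char key)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= key;
    }
}

/*
 * Build an encrypted copy of the caller's data in a separate "shadow"
 * buffer. The caller's buffer is only read, never written, so another
 * thread (or a memory mapped view) never observes the encrypted bytes.
 * Returns NULL on allocation failure; the caller releases with free().
 */
unsigned char *make_encrypted_shadow(const unsigned char *caller_buf,
                                     size_t len, unsigned char key)
{
    assert(caller_buf != NULL);

    unsigned char *shadow = malloc(len ? len : 1);
    if (shadow == NULL) {
        return NULL;
    }
    memcpy(shadow, caller_buf, len);   /* copy the cleartext first      */
    xor_transform(shadow, len, key);   /* transform only the copy       */
    return shadow;
}
```

The shadow buffer is what would be sent to the underlying file system; the original buffer remains cleartext for the whole lifetime of the request.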
Q57 What are the issues with respect to IRQL APC_LEVEL? What does it do? Why should I use (or not use) FsRtlEnterFileSystem?
Windows is designed to be a fully re-entrant operating system. Thus, in general, kernel components may make calls back into the OS without worrying about deadlocks or other potential re-entrancy problems.
Windows also provides out-of-band execution of operations, such as asynchronous procedure calls (APC). An APC is a mechanism that allows the operating system to invoke a given routine within a specific thread context. This is, in turn, done by using a queue of pending APC objects that is stored in the control structure used by the OS for tracking a thread's state. Periodically, the kernel checks to see if there are pending APC objects that need to be processed by the given thread (where "periodically" is an arbitrary decision of the operating system). From a pragmatic programming standpoint, an APC can be "delivered" (that is, the routine can be called) by a thread between any two instructions. The delivery of APCs can be blocked by kernel code using one of two mechanisms:
· Kernel APC delivery may be disabled by using KeEnterCriticalRegion and re-enabled by using KeLeaveCriticalRegion. Note that Special Kernel APCs are not disabled using this mechanism.
· Special Kernel APC delivery may be disabled by raising the IRQL of the thread to APC_LEVEL (or higher).
There are numerous reasons for this design, and some of them follow from the nature of threads and APCs on Windows. First, we note that a given thread is restricted to running on only a single processor at any given time. Thus, the operating system can eliminate multi-processor serialization issues by requiring that an operation be done in one specific thread context. For example, each thread maintains a list of outstanding I/O operations it has initiated (this is ETHREAD->IrpList). This list is only modified in the given thread's context. By exploiting this, and by raising to APC_LEVEL, the list can be safely modified without resorting to more expensive locking mechanisms (such as spin locks).
The primary disadvantage to APC_LEVEL is that it disables I/O completion APC routines from running. This in turn means that the driver must be careful to handle completion correctly (that is, it cannot use the Zw routines, and it must use its own signaling mechanism from its completion routine when sending an IRP so that it may signal completion of the I/O).
Q58 How do I determine if the FILE_OBJECT represents a file or a directory from my filter driver? Can I rely upon the FILE_DIRECTORY_FILE bit?
The determination of whether or not a given FILE_OBJECT represents a directory is the sole domain of the file system driver. Thus, for a file system filter driver to determine if a file is a directory, it must ask the file system. This can be done by querying the attributes of the file (e.g., after it has been successfully opened by the underlying file system) or by examining the attributes within the directory, which can be done before the underlying file has been opened.
Options specified during create are not adequate for determining if a file is, in fact, a directory. For example, an application may optionally specify that the file being opened must be a directory by setting the FILE_DIRECTORY_FILE option as part of create (this is a bit in the I/O Stack location, Parameters.Create.Options, the low 24 bits of which are used for file options). If that option was specified and the file creation is successful, the file system filter driver can conclude that the FILE_OBJECT does represent a directory. If the file creation is successful and the caller did not specify FILE_DIRECTORY_FILE, however, the filter driver cannot conclude either way whether the FILE_OBJECT represents a directory. The FILE_NON_DIRECTORY_FILE bit can similarly be used to determine that the given FILE_OBJECT does not represent a directory.
There is one complication for those writing a file system filter driver - they must keep in mind that some file options now combine these two bits. For example, FILE_COPY_STRUCTURED_STORAGE (which is not used but is still present in ntifs.h for Windows XP) is defined as the combination of FILE_DIRECTORY_FILE and FILE_NON_DIRECTORY_FILE.
Thus, the safest way to determine if a FILE_OBJECT represents a directory remains to ask the underlying file system.
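To make the limits of the create options concrete, here is a minimal sketch of the inference a filter could draw after a successful create. It is plain C rather than driver code; the two option values are repeated from ntifs.h so that it compiles standalone, while CREATE_OPTIONS_MASK, object_kind and classify_after_successful_create are hypothetical names introduced for this illustration. Note how often the answer is "unknown", which is why asking the file system remains the safe approach.

```c
#include <assert.h>
#include <stdint.h>

/* Option values as defined in ntifs.h, repeated for a standalone build. */
#define FILE_DIRECTORY_FILE      0x00000001u
#define FILE_NON_DIRECTORY_FILE  0x00000040u

/* Hypothetical name: the low 24 bits of Parameters.Create.Options. */
#define CREATE_OPTIONS_MASK      0x00FFFFFFu

typedef enum {
    KIND_UNKNOWN,        /* the options alone do not tell us */
    KIND_DIRECTORY,
    KIND_NOT_DIRECTORY
} object_kind;

/*
 * What the create options imply once the create has SUCCEEDED.
 * If neither bit is set nothing can be concluded. If both are set
 * (as in FILE_COPY_STRUCTURED_STORAGE) the bits do not carry their
 * usual meaning, so we also report "unknown".
 */
object_kind classify_after_successful_create(uint32_t create_options)
{
    uint32_t opts   = create_options & CREATE_OPTIONS_MASK;
    int      dir    = (opts & FILE_DIRECTORY_FILE) != 0;
    int      nondir = (opts & FILE_NON_DIRECTORY_FILE) != 0;

    if (dir == nondir) {
        return KIND_UNKNOWN;
    }
    return dir ? KIND_DIRECTORY : KIND_NOT_DIRECTORY;
}
```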
Q59 How do I determine if the IRP is coming from a local process or over the network?
In our experience, it is not possible to ascertain this information for most operations. However, we have found that a solution that works with IRP_MJ_CREATE is to examine the process context. If the process is the system process, we then examine the impersonation state of the given thread. This can be done by trying to open the security token of the current thread (ZwOpenThreadToken or ZwOpenThreadTokenEx). If the thread is impersonating, our experience indicates that it is, in fact, operating on behalf of a remote user. While this is heuristic in nature, it is based upon observations of how the CIFS/SMB file server is implemented (it is a kernel mode driver that uses worker threads for processing requests on behalf of remote systems).
If we need to track this for subsequent operations, we can associate this state information with the given file object, so that subsequent I/O operations on this file object will allow us to determine if the original create operation was done using this impersonation technique. Impersonation is used during IRP_MJ_CREATE so that the underlying file system performs security checks using the correct credentials. Subsequently, the operating system will validate access independent of the thread's credentials, since the security decision has already been made for the given FILE_OBJECT.
Q60 How should I deal with Fast I/O in my file system? In my filter driver?
In general, fast I/O is completely optional for a file system, although there are some important reasons why it is advisable to support it. For a file system filter driver, it is essentially mandatory for it to handle all fast I/O entry points that are handled by a file system it is filtering because failure to do so can lead to incorrect results in other file system filter drivers.
How to handle specific fast I/O entries depends upon the file system being filtered, because different file systems implement fast I/O very differently. Essentially all file system filter drivers need to handle the FAST_IO_DETACH_DEVICE entry point, because this is called within the file system filter driver whenever the device to which the filter is attached is about to be deleted. In Windows 2000 and earlier versions of Windows NT, the filter driver should detach, but may need to refrain from deleting the device object until all outstanding references to it have been resolved. In Windows XP, the operating system has additional logic so that this entry point is not called until it is safe to detach and delete your device object at the same time.
For the other fast I/O entry points, the general rule is that your filter driver may return TRUE or FALSE depending upon whether or not it wishes to allow this fast I/O. Thus, for the fast I/O operation to be performed via this path, each file system filter driver must pass the request along (implicitly accepting a TRUE possibility) until it reaches the file system. Any single filter driver may reject the fast I/O operation, in which case the caller will build an IRP_MJ equivalent of the fast I/O operation and pass it to the underlying file system.
We note some exceptions to this:
· Some of the fast I/O entry points do not return Boolean values. These have special rules for handling, which we describe below in more detail.
· We note that the WebDAV file system in Windows XP will not behave properly if fast I/O device control operations are rejected by a filter driver.
· Some of the fast I/O operations will result in multiple IRP operations (e.g., FAST_IO_QUERY_OPEN which eliminates an IRP_MJ_CREATE, IRP_MJ_QUERY_INFORMATION, IRP_MJ_CLEANUP and IRP_MJ_CLOSE operation).
Of course, keep in mind that the general rule for file system filter drivers applies in this case as well - if it works without the filter and does not work WITH the filter, it is the filter that must be at fault!
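The general pass-along rule for the Boolean fast I/O entry points can be modeled in a few lines of ordinary C. This is a simplified sketch under our own assumptions, not the I/O Manager's actual logic: the layer structure and the names fast_io_attempt and choose_path are hypothetical, and the exceptions listed above are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One device in the stack: a filter, or the file system at the bottom. */
typedef struct layer {
    bool          allows_fast_io;  /* this layer's own TRUE/FALSE answer  */
    struct layer *lower;           /* next lower layer, NULL below the FS */
} layer;

/*
 * The fast path succeeds only if every layer, from the topmost filter
 * down to the file system, passes the request along and returns TRUE.
 */
bool fast_io_attempt(const layer *top)
{
    assert(top != NULL);
    for (const layer *l = top; l != NULL; l = l->lower) {
        if (!l->allows_fast_io) {
            return false;          /* a single rejection is enough */
        }
    }
    return true;
}

typedef enum { PATH_FAST_IO, PATH_IRP } io_path;

/* On rejection the caller builds the equivalent IRP and sends that. */
io_path choose_path(const layer *top)
{
    return fast_io_attempt(top) ? PATH_FAST_IO : PATH_IRP;
}
```

With two filters above a file system, setting allows_fast_io to false in either filter makes choose_path return PATH_IRP, which matches the fallback behavior described above.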
One area that has changed substantially in Windows XP is the handling of the six fast I/O operations dealing with file/VM locking: FAST_IO_ACQUIRE_FILE, FAST_IO_RELEASE_FILE, FAST_IO_ACQUIRE_FOR_MOD_WRITE, FAST_IO_RELEASE_FOR_MOD_WRITE, FAST_IO_ACQUIRE_FOR_CCFLUSH, FAST_IO_RELEASE_FOR_CCFLUSH.
In Windows 2000, these six fast I/O entry points are passed directly to the file system. They do not call into any intervening filter drivers. In Windows XP a new mechanism is available for file system filter drivers that wish to intercept these operations - FsRtlRegisterFileSystemFilterCallbacks.
Q61 I am suffering from stack overflow issues. How do I deal with this?
This problem is one that file system drivers sometimes must handle because of the re-entrant nature of the operating system. Thus, the Windows operating system contains a number of calls that are suitable for use in protecting against stack overflow conditions, such as IoGetRemainingStackSize, which is used to probe the size of the stack remaining, or __try and __except to trap a STATUS_STACK_OVERFLOW exception.
When there is insufficient stack space, the normal technique is to post the operation to a different thread to complete the processing. This new thread will have a "clean" stack so that it can continue the processing. This technique is demonstrated in the FastFat file system example in the IFS Kit (see read.c). FastFat uses the routine FsRtlPostPagingFileStackOverflow (when processing paging file I/O) or an internal work routine to process other requests (see FatPostStackOverflowRead for example).
Q62 What is the difference between EOF and AllocationSize? Why is the AllocationSize the same for a file AFTER it is compressed?
These two values are used by various parts of the operating system to establish the size of other data structures. The notable example here is that the AllocationSize is used by the Memory Manager to establish the size of the section object backed by the file. Since the section object is used when memory mapping the file, the size of the section is quite important - and that size is the AllocationSize of the file.
Of course, this is now confusing because compression was added to the NTFS File system, but in spite of this the AllocationSize does not change. This is because even though the on-disk size is smaller, the size presented to the memory manager (and hence via the section object) must be the size of the uncompressed data.
The EOF of the file is the last readable byte within the file. Anything beyond that has never been defined for the given file.
The rule for file systems is that EOF is always less than or equal to AllocationSize.
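As a small illustration of this rule, the sketch below assumes a file system that allocates space in whole clusters (as FAT does) and derives an AllocationSize from an EOF value. The function names and the cluster size parameter are hypothetical, and real file systems may choose a larger allocation than this minimum.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest allocation, in bytes, that covers `eof` bytes of data when
 * space is handed out in whole clusters (an assumption of this sketch).
 */
uint64_t allocation_size_for(uint64_t eof, uint32_t cluster_size)
{
    assert(cluster_size != 0);
    uint64_t clusters = eof / cluster_size + (eof % cluster_size != 0);
    return clusters * cluster_size;
}

/* The invariant stated above: EOF never exceeds AllocationSize. */
int sizes_are_consistent(uint64_t eof, uint64_t allocation_size)
{
    return eof <= allocation_size;
}
```

For example, with 4096-byte clusters a file whose EOF is 4097 needs an AllocationSize of at least 8192.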
Q63 What is the IFS Kit? How do I get it? I'm not in the US/Canada. Can I still buy it? Can I buy it from a retail distributor? With a purchase order?
The IFS Kit is the "Installable File Systems Kit" for the Windows 2000 and Windows XP operating system platform. It contains the header files, examples, and documentation for constructing file systems and file system filter drivers for the Windows 2000 and Windows XP platforms. It is available for sale directly from Microsoft (see http://www.microsoft.com/ddk/ifskit for ordering information). According to the Microsoft web site:
· The kit is available in most countries (the exception is those countries expressly prohibited by the Bureau of Export Administration, Department of Commerce. This would include, but is not limited to: Cuba, Iraq, Libya, and North Korea).
· The kit has an initial license fee of US$995, plus shipping. Subsequent update versions are available for US$109, plus shipping. These prices are subject to change.
· The kit is only available at the present time from Microsoft. This is for several reasons, one of which is that you must sign a special license agreement applicable to this kit.
· The terms of payment are: a bank check drawn in US Dollars, Visa, Mastercard, or American Express. Purchase Orders are not accepted by Microsoft for purchasing this kit.
While this information is current as of this writing, Microsoft is the ultimate source for this information - please review their web site for the relevant information.
Q64 I open a file but later when I try to use the handle I get back an error indicating an invalid handle (or invalid object type). What am I doing wrong? How can I use my handle?
Object handles are specific to a given process. This is because an object handle is a reference into the object handle table for the given process. Thus, if a driver creates a handle in one process context and then attempts to use it in a different process context, the handle will not work properly because it will refer to the wrong entry.
A kernel mode driver can rectify this problem using one of several techniques:
· When creating a new handle, specify that this handle is a kernel handle (so that it is created in the object handle table for the system process). The returned handle will then have the high-order bit set so that subsequent use of this handle will be done against the system process object handle table.
· When creating a new handle, always do it in a specific process context. This can be done by using KeStackAttachProcess, for instance. Alternatively, this can be done by posting work items to worker threads, either standard system worker threads, or captive worker threads created by the driver.
· A driver may refrain from using handle-based interfaces entirely. In this case, the driver uses IRP operations because they avoid using handles.
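The reason a handle fails in the wrong process context can be seen in a toy model of handle resolution. This is a deliberately simplified sketch: the table layout, the slot count, and the names KERNEL_HANDLE_FLAG, handle_table and resolve_handle are hypothetical, and the real Object Manager encodes handles differently. It only models the two facts described above: a handle indexes the current process's table, unless the high-order bit marks it as a kernel handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_HANDLE_FLAG 0x80000000u  /* hypothetical name: high-order bit */
#define TABLE_SLOTS        8            /* arbitrary size for this model     */

typedef struct {
    const void *objects[TABLE_SLOTS];   /* NULL means "slot not in use" */
} handle_table;

/*
 * Resolve a handle value. Ordinary handles are looked up in the table
 * of whichever process is current; kernel handles are always looked up
 * in the system process table, whatever the current process is.
 */
const void *resolve_handle(uint32_t handle,
                           const handle_table *current_process,
                           const handle_table *system_process)
{
    const handle_table *table =
        (handle & KERNEL_HANDLE_FLAG) ? system_process : current_process;
    uint32_t index = handle & ~KERNEL_HANDLE_FLAG;

    if (index >= TABLE_SLOTS) {
        return NULL;                    /* invalid handle */
    }
    return table->objects[index];
}
```

In this model a handle created in one process resolves to nothing (or to some unrelated object) in another process, while a kernel handle resolves to the same object from any process context.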
Q65 When can I rely upon the file name in the FileObject structure?
The FileName field in the file object passed to a file system may be modified any time after the IRP_MJ_CREATE has been processed by the file system. For a file system filter driver, this means that the FileName field is only useful (and valid) when the create operation has been received in the dispatch entry point of the file system filter driver. Specifically, it is not valid to use the FileName field after the IRP has been processed by the file system driver. Thus, the name cannot be used in the completion routine.
A file system filter driver that needs to know the name of the file object in its completion routine should capture that name during the dispatch processing and pass it to the completion routine, either via some driver-maintained state table (context value or lookup table) or passed in the context pointer to the driver's completion routine. Another alternative is to query the file system (using IRP_MJ_QUERY_INFORMATION) from the completion routine. This particular technique (querying the file system in the completion routine) suffers from some potential problems:
· The file system driver may return a name for the file that is not necessarily the name that was used during the IRP_MJ_CREATE.
· A completion routine may be called at DISPATCH_LEVEL. A file system filter driver must be prepared to post its request and process it in a work routine.
· A file system may continue using the top level Irp field (the contents of which are retrieved using IoGetTopLevelIrp) during the completion routine call, so a file system filter driver must not rely upon this OS-internal field being properly "cleaned up" before the completion routine has been called.
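Capturing the name during dispatch processing amounts to taking a private copy of a counted string before the request is passed down. The sketch below shows the idea in user-mode C. The counted_name structure is a local stand-in modeled on UNICODE_STRING, malloc stands in for a pool allocation, and capture_name and release_name are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in modeled on UNICODE_STRING: counted, not NUL-terminated. */
typedef struct {
    uint16_t  length;          /* bytes currently in use */
    uint16_t  maximum_length;  /* bytes allocated        */
    uint16_t *buffer;          /* 16-bit characters      */
} counted_name;

/*
 * Copy `src` into freshly allocated storage owned by `dst`, so the copy
 * stays intact even if the original is later modified or freed.
 * Returns 1 on success and 0 on allocation failure.
 */
int capture_name(const counted_name *src, counted_name *dst)
{
    assert(src != NULL && dst != NULL);

    dst->buffer = NULL;
    dst->length = 0;
    dst->maximum_length = 0;
    if (src->length == 0) {
        return 1;              /* an empty name needs no storage */
    }
    dst->buffer = malloc(src->length);
    if (dst->buffer == NULL) {
        return 0;
    }
    memcpy(dst->buffer, src->buffer, src->length);
    dst->length = src->length;
    dst->maximum_length = src->length;
    return 1;
}

/* Release a captured name, for example at the end of the completion routine. */
void release_name(counted_name *name)
{
    free(name->buffer);
    name->buffer = NULL;
    name->length = 0;
    name->maximum_length = 0;
}
```

The captured copy is what would be handed to the completion routine through its context pointer, and the completion path is then responsible for releasing it.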