Windows Streams - An Introduction to File System Streams
The NT Insider, Vol 13, Issue 2, March - April 2006 | Published: 17-Apr-06 | Modified: 17-Apr-06
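The concept of file streams arrived with NTFS in 1993, but it rarely shows up in application programs. Windows Server 2003 introduced APIs for it (FindFirstStream and FindNextStream), so streams may well become far more common in future applications. Kernel driver developers, however, need to understand the concept now: filter drivers routinely rely on it, for example through the Stream Context. NTFS explicitly defines a file as consisting of one or more streams, and each stream has a set of attributes.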
For those new to the file systems space, the concept of a stream is likely to be new. While NTFS has implemented and made streams available since Windows NT first shipped in 1993, even today this is a feature that is seldom used by application programs. However, with the addition of new APIs in the Win32 programming interface for detecting and enumerating the presence of streams (FindFirstStream and FindNextStream APIs were introduced with Windows Server 2003), it is increasingly likely that these APIs will become more common in future applications.
For those of us involved in kernel level file systems programming, it is important to understand the concept of a stream because streams are integral to programming interfaces such as the Filter Manager. For example, Filter Manager explicitly uses a concept of a stream context rather than a file context. Indeed, file contexts are not supported in the Filter Manager as of the date this article was written. However, we expect support for file level contexts will become possible in future versions of the Filter Manager.
Historically it has been useful to associate additional information with a file in a way that was transparent to the applications using the data. Ideally, such an association becomes just one more part of the file. So, for example, if a user copies the file, the additional information is moved along with the user's data portion of the file. Early examples of such ancillary information include the resource fork, which is an attribute that was added on the Macintosh in order to associate a specific application with the given data file. In Windows that association is traditionally done based upon the file name (its extension). However, in the Macintosh world it was an inherent part of the file, which meant two files using the same extension name could actually invoke different applications.
In OS/2 Microsoft and IBM added an application-level attribute mechanism known as extended attributes. In keeping with its shared OS/2 heritage, when Windows NT 3.1 first shipped, all three of its native read/write file systems supported extended attributes. NTFS and HPFS had native support in the file structure itself, while FAT uses a hidden file in the root directory for tracking extended attributes. Even today the FAT file system code in the IFS Kit (and now the WDK) includes vestiges of the code needed to support extended attributes in FAT-12 and FAT-16 formatted file systems. The FAT-32 format, however, does not support extended attributes.
The design of NTFS included a modular concept that files were containers of one or more streams. Each stream could then have a collection of attributes that described the characteristics of the specific stream. For NTFS, one benefit of this model was that they could support Macintosh resource forks for use with Services for Macintosh. A careful review of the CDFS source code - also in the IFS Kit and WDK - shows there is support for Macintosh resource forks there as well. This modular concept also provided a generally extensible mechanism for allowing applications and other system components to associate information with the file.
One possible motivation for this might have been the work that was being done in the same timeframe by the OLE team. At that time, the OLE team was embedding a file system within the file to allow transparent support for embedding data elements from different applications. While this scheme was not transparent to applications, it was wrapped in libraries that applications could use, which made it appear transparent. While we're not certain if this was the motivation, it would be a strong temptation for a file systems developer to look at ways to add support for this sort of separation directly at the file systems layer. Indeed, during the development of Windows 2000, there was a short-lived attempt to implement structured storage by using the NTFS streams mechanism.
Having provided a basic motivation for streams, we still have not really described what they are conceptually. The following diagram (See Figure 1) attempts to demonstrate one conceptual model for how a single file might contain more than one stream of data. While loosely based upon the NTFS model, we do not claim this is the NTFS model, nor should the reader infer anything about how NTFS or any other file system implements streams.
Figure 1 - Conceptual Model for Streams
Thus, conceptually a single file consists of multiple distinct components. Some of these are attributes that apply to the entire file. For example, this might include the security descriptor, the extended attributes list for the file, and the timestamps on the file. A stream then has a name, with the default name being a zero length string such as the NULL name used in the figure above. Alternate streams have a non-zero length string name. Each stream name is unique, but there may be many different streams. The file system is responsible for placing any restrictions on the name, such as the size of the stream name. An individual stream then has its own associated data attribute. In the figure above we used the name $DATA because that's easiest when we start explaining the naming used by NTFS. Because a stream has data associated with it, the file system must also maintain additional information about the stream such as its location information, the size of the data elements, any extant lists, and bitmaps or other control structures that might be unique to that given stream.
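The conceptual model can be restated as a small data structure. The following is only a sketch in Python, not how NTFS or any other file system lays out its data: the class and field names (File, Stream, add_stream, reported_size) are ours, and name comparison is case-sensitive here purely for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One stream: a name plus its own data attribute."""
    name: str            # "" stands for the default (NULL-named) stream
    data: bytes = b""    # stands in for the stream's $DATA attribute


@dataclass
class File:
    """File-level attributes plus a set of uniquely named streams."""
    security_descriptor: str = ""
    timestamps: dict = field(default_factory=dict)
    streams: dict = field(default_factory=dict)   # stream name -> Stream

    def add_stream(self, name: str, data: bytes = b"") -> Stream:
        # Each stream name must be unique within the file.
        if name in self.streams:
            raise ValueError(f"stream {name!r} already exists")
        self.streams[name] = Stream(name, data)
        return self.streams[name]

    def reported_size(self) -> int:
        # Models a size that counts only the default data stream.
        default = self.streams.get("")
        return len(default.data) if default else 0
```

A file built with a default stream and an alternate stream named bar holds both sets of data, yet reported_size() reflects only the default stream.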
Streams can be used to store almost anything. Because they are just another element of the file, applications can either ignore the stream or use them if and when they are present. System utilities can be constructed to preserve streams. For example, the Win32 API CopyFileW is aware of the streams of files and copies those streams. Similarly, the Win32 Backup API is aware of streams and "packs them" into a single flat byte stream that consists of <stream header>+<stream data>+ etc., which allows the entire file contents to be backed up in a fashion that is transparent to the backup application. Prior to the introduction of the FindFirstStream API in Windows Server 2003, the backup API was the documented way to learn about the existence of streams. In this context "documented" means there was a Microsoft Knowledge Base article describing how to use the backup API to find the streams of the file.
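To make the <stream header>+<stream data> packing concrete, here is a minimal sketch that builds and walks such a flat byte stream. The header layout is modeled on our understanding of the WIN32_STREAM_ID structure (stream id, attributes, 64-bit size, name size in bytes, then a UTF-16 name); treat the constants and layout as assumptions to check against the SDK headers. The helper names pack_stream and iter_streams are ours, and the sketch does not call the Backup API itself.

```python
import struct

# Assumed header layout, modeled on WIN32_STREAM_ID:
# DWORD stream id, DWORD attributes, 64-bit data size, DWORD name size (bytes).
HEADER = struct.Struct("<IIqI")

BACKUP_DATA = 1            # assumed id for the default data stream
BACKUP_ALTERNATE_DATA = 4  # assumed id for a named (alternate) data stream


def pack_stream(stream_id, name, data):
    """Return header + UTF-16LE name + data for one stream."""
    raw_name = name.encode("utf-16-le")
    header = HEADER.pack(stream_id, 0, len(data), len(raw_name))
    return header + raw_name + data


def iter_streams(blob):
    """Yield (stream_id, name, data) for each packed stream in blob."""
    offset = 0
    while offset < len(blob):
        stream_id, _attrs, size, name_size = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        name = blob[offset:offset + name_size].decode("utf-16-le")
        offset += name_size
        data = blob[offset:offset + size]
        offset += size
        yield stream_id, name, data
```

Walking the headers this way is how a program can discover which streams a file has without knowing their names in advance.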
In order to allow applications to open a specific stream, NTFS extended its name support to include the use of the colon (':') character to separate the name of the file from the name of the stream. This is similar to how the backslash ('\') is used to separate the name of the directory from the name of the file. FAT does not support the use of colon (':') in its file names, and thus it is not possible to open streams on a FAT file system without receiving a program error. Since other file systems do support streams, an application programmer should determine streams support either by:
- Attempting to open a file stream.
- Querying the file system characteristics and checking for the FILE_NAMED_STREAMS attribute.
Application programs should not use the name of the file system since this will create a persistent problem that causes application compatibility issues. Other file systems besides NTFS support streams and the stream concept.
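The second approach, checking the file system characteristics, might look like the following sketch. It assumes the flags come from the Win32 GetVolumeInformationW call and that FILE_NAMED_STREAMS has the value 0x00040000; the helper names are ours, and the query function runs only on Windows.

```python
import ctypes
import sys

FILE_NAMED_STREAMS = 0x00040000  # assumed value of the file system flag


def supports_named_streams(fs_flags):
    """Test the named-streams bit in a file system flags value."""
    return bool(fs_flags & FILE_NAMED_STREAMS)


def query_fs_flags(root):
    """Return the file system flags for a volume root such as 'C:\\'."""
    if sys.platform != "win32":
        raise OSError("GetVolumeInformationW is only available on Windows")
    flags = ctypes.c_ulong(0)
    ok = ctypes.windll.kernel32.GetVolumeInformationW(
        ctypes.c_wchar_p(root),
        None, 0,               # volume name buffer not needed
        None, None,            # serial number, max component length
        ctypes.byref(flags),   # receives the file system flags
        None, 0,               # file system name deliberately not requested
    )
    if not ok:
        raise ctypes.WinError()
    return flags.value
```

Note that the sketch never asks for the file system name, in keeping with the advice above: the decision rests on the reported capability, not on whether the volume happens to be NTFS.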
An application that is not "stream aware" will normally open the file by name. This application then obtains the default data stream. Thus, opening the file foo.txt is the same as opening the stream foo.txt::$DATA (that is, the data attribute of the stream with a NULL name). However, if you try this with many applications you will find that it fails - not because streams don't work right, but rather because applications that are not stream aware may try to validate the existence of the file by looking in the directory. Since the directory does not list the streams of a file, the application will determine the name specified is not present. An example of this behavior is shown in Figure 2 using the command line.
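The equivalence between foo.txt and foo.txt::$DATA follows from how the name is split at the colons. The sketch below is our own simplified parser, not the one NTFS uses: it assumes the form file:stream:type on the final path component, treats a missing stream name as the default stream, treats a missing type as $DATA, and does not handle drive-relative names such as C:foo.txt.

```python
def split_stream_name(path):
    """Split a path into (file path, stream name, attribute type)."""
    directory, sep, leaf = path.rpartition("\\")
    parts = leaf.split(":")
    if len(parts) > 3:
        raise ValueError(f"too many ':' separators in {leaf!r}")
    file_name = parts[0]
    stream = parts[1] if len(parts) > 1 else ""        # "" = default stream
    attr_type = parts[2] if len(parts) > 2 else "$DATA"
    return directory + sep + file_name, stream, attr_type
```

Both split_stream_name("foo.txt") and split_stream_name("foo.txt::$DATA") yield the same triple, which is why opening either name reaches the default data stream.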
Figure 2 - Variations on Stream "Awareness" in Apps
Note that using the more utility with the fully qualified name, including stream suffix, does not work, but redirecting the input from the name does work properly, as does accessing the file without a stream identifier. How a specific application behaves when there are streams on a file depends on the implementation of the specific application program.
Using the filespy utility we were able to "see" that the real issue for the more utility is that it attempts to validate the file in the directory. The relevant part of the trace for the first command is shown in Figure 3.
Figure 3 - Watching "More" with FileSpy
Notice that it is calling IRP_MJ_DIRECTORY_CONTROL (IRP_MN_QUERY_DIRECTORY) with FileBothDirectory Information and the return is STATUS_NO_SUCH_FILE. While not clear in the trace, we assert that this is because the file it is seeking is, in fact, the full name including the stream portion. It does not work the same way if we do this via indirection.
It turns out that the best utility for accessing streams is the lowly notepad utility. Instead of interrogating the directory, notepad determines the presence or absence of streams by attempting to open the named object.
Figure 4 is a simple example that creates a single file with two data streams using the command line utility.
Figure 4 - One File...Two Streams
The first line creates the default data stream "by default" since no stream is specified. The second line creates a named stream called bar. Next we retrieve the contents of the respective streams. Finally, we note that the directory entry only reports the size of the data in the default data stream.
Now that we've described streams on files, it is only a small leap to note that NTFS also supports alternate streams on directories. The ability to embed information within the directory that does not show up in a listing of the directory contents is a terrific way to store ancillary directory information such as directory-level policy for a backup program. In NTFS the default data stream for a directory is not available for use since it is used to store the actual directory contents.
When an application opens a file, the Windows OS creates a corresponding FILE_OBJECT to represent that specific open instance of the file. With the introduction of streams, the names of the data structures do not change, but how they are used does change. This concept can be confusing for those just starting their work in the kernel environment.
Traditionally, a file system associated a specific file state with the file object using the FsContext field of the FILE_OBJECT. This is often referred to either as the File Control Block or the Stream Control Block. The latter name reflects the support of streams as a first-class entity within the file system. This is important because normally (e.g., in a filter driver) we associate two FILE_OBJECTs together if they have the same FsContext pointer value. However, if the underlying file system supports streams, then these two values really represent the same stream and not the same file.
To demonstrate our point, Figure 5 graphically demonstrates one possible model for organizing file system data structures when a file system actually does support streams.
Figure 5 - Model for Organizing Data Structures for Stream-Supported File System
For a file system filter both the SectionObjectPointers and FsContext values are visible from the file object, but the association between the Stream Context and File Context is entirely internal to the file system. Thus, there is no simple way to associate these.
One way to achieve this would be to obtain a file-level attribute - something that is reported to be the same for all streams of a given file. For example, with NTFS we generally use the File ID because this 64-bit value is unique for a given file on a given volume. If we find two streams that report the same File ID value, we know they represent the same file even if the FsContext fields are different. In this case we know these are different streams of the same file.
However, not all file systems support this feature. For example, the CIFS/SMB Redirector does not return the same File ID for different streams of the same file even if the remote file system is NTFS because the redirector generates such a File ID. Thus, trying to associate different streams with the same file via redirector is more of a challenge - and one beyond the scope of this article.
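The distinction between "same stream" and "same file" can be modeled in a few lines. This is a user-mode Python illustration of the bookkeeping only, not filter driver code: OpenInstance and its fields are hypothetical stand-ins for what a filter would record per FILE_OBJECT, and it assumes the file system reports a File ID that is stable across all streams of a file.

```python
from collections import defaultdict
from typing import NamedTuple


class OpenInstance(NamedTuple):
    fs_context: int   # stands in for FILE_OBJECT.FsContext (one per stream)
    volume: str       # identifies the volume the file lives on
    file_id: int      # 64-bit File ID reported by the file system


def same_stream(a, b):
    """Equal FsContext values mean the same stream, not the same file."""
    return a.fs_context == b.fs_context


def same_file(a, b):
    """Equal File IDs on the same volume mean the same file."""
    return (a.volume, a.file_id) == (b.volume, b.file_id)


def group_by_file(instances):
    """Group open instances so all streams of one file end up together."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst.volume, inst.file_id)].append(inst)
    return dict(groups)
```

Keying on the (volume, File ID) pair rather than on FsContext is the design choice that lets two opens of different streams be recognized as belonging to one file.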
Unfortunately, this means there is not a general purpose solution for this problem. While Filter Manager promises a solution in a future release, we anticipate that support will likely be restricted to future Windows versions, which makes this an issue that will eventually "go away" but is likely to remain with us in the foreseeable future.
Having described the basic mechanism of streams and how they are viewed from the kernel file system/filter driver level, we thought it would be useful to provide additional motivation for use of streams. At the present time, searching for information about streams and their uses in NTFS is a bit discouraging since almost every article talks about how streams are potential security hazards. The very features that make streams useful (the fact they aren't immediately visible, they don't affect the reported size of the default data stream, etc.) also make them potential security hazards for users not expecting streams to be present. Hopefully, future versions of Windows will include utilities to monitor streams (e.g., right-click to "enumerate the streams on this file"), which might be a useful extension to Explorer.
For now, listed below are a few existing uses of streams:
- Internet Explorer associates zone information with a given file. This information tracks the origin of the file when it is downloaded from the Internet so subsequent attempts to access the file can take appropriate safeguards.
- Internet Explorer in Windows 2000 embedded a thumbnail for graphic images to speed up rendering of directory contents. In Windows XP this thumbnail was moved to a separate file for performance reasons. In the process, this change created an interesting information exposure security vulnerability.
- Services for Macintosh continues to use streams for resource forks even though SFM usage is deprecated these days.
- SQL Server 2005 stores database information in streams.
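As an illustration of the zone information case, the sketch below parses the kind of text such a stream might hold. The details are assumptions not stated in this article: we assume the stream is named Zone.Identifier and contains INI-style text with a [ZoneTransfer] section and a ZoneId value. The function name parse_zone_id is ours.

```python
import configparser


def parse_zone_id(stream_text):
    """Return the ZoneId from INI-style zone information text."""
    parser = configparser.ConfigParser()
    parser.read_string(stream_text)
    # Assumed layout: a [ZoneTransfer] section holding a ZoneId entry.
    return parser.getint("ZoneTransfer", "ZoneId")
```

On an NTFS volume the text could be obtained by opening a name such as download.exe:Zone.Identifier (a hypothetical file), since the stream is opened like any other named stream.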
However, it isn't terribly difficult to construct other possible usage scenarios for streams. A few examples of possible usage scenarios are listed below.
- An encryption driver could store "encryption header" information in a separate stream from the data, which would preserve the original data size and data layout.
- A partially resident file might have additional control information, such as through an HSM product that describes what is or is not present in the file.
- A file data validation driver could store checksums on blocks of the file in a separate data stream. Such checksums can detect end-to-end data failures, which is similar to a technique used by Exchange to validate its own data.
- Digital Rights information could be associated with the file.
- A backup filter could track individual blocks of a file that have been changed since the last backup, allowing for only incremental updates to be captured to a backup set.
- An auditing application might store auditing policy in a separate stream such as a directory level stream.
- Application programs might find streams useful as a means of providing "backward compatibility" by storing extended information for newer versions in alternate data streams.
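The block checksum scenario is easy to sketch. The code below is a user-mode illustration under our own assumptions, not a description of any shipping driver: it uses CRC-32 from zlib, a hypothetical 4096-byte block size, and a plain bytes value standing in for the contents of the separate checksum stream.

```python
import struct
import zlib

BLOCK_SIZE = 4096  # hypothetical block size chosen for illustration


def build_checksums(data, block_size=BLOCK_SIZE):
    """Return packed CRC-32 values, one per block of data."""
    crcs = [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]
    return struct.pack(f"<{len(crcs)}I", *crcs)


def find_bad_blocks(data, checksum_stream, block_size=BLOCK_SIZE):
    """Return indexes of blocks whose CRC-32 no longer matches."""
    count = len(checksum_stream) // 4
    expected = struct.unpack(f"<{count}I", checksum_stream)
    bad = []
    for index, crc in enumerate(expected):
        block = data[index * block_size:(index + 1) * block_size]
        if zlib.crc32(block) != crc:
            bad.append(index)
    return bad
```

Because the checksums live apart from the data, the default data stream keeps its original size and layout, which is the property the scenarios above depend on.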
Streams provide a highly useful, albeit obscure, mechanism for associating information with the data contents of a file. With the creation of Win32 APIs, and the advent of more file systems that support them in future Windows releases, we expect to see more uses of streams in the future.