Today's computer software constantly breaks new grounds. Consumer software applications offer a sophisticated set of features that enable rich new experiences. Powerful server applications are setting new records in throughput, speed and scale. These improvements have been made possible by rapid progress in hardware technologies and continuous adoption of software advancements in optimization, virtualization, and distributed and parallel computing. However, as a result, software applications have become larger and more complicated. At the same time, users' expectations about software quality are higher than ever. Fundamental characteristics such as performance, reliability and manageability have proved essential in the long-term success of software products, and they are often celebrated as primary features.
Increasing software complexity and higher user expectations on quality thus present a difficult challenge in software development. When an unexpected problem occurs, predicting internal states of all relevant components is nearly impossible. Retracing the history of execution flows is cumbersome and tricky, but often necessary in finding out the root cause of software problems. When users report problems after deployment, they expect the root cause of the problem to be quickly identified and addressed. The overwhelming number of hardware and software combinations, different workload characteristics, and usage patterns of end users make such tasks even tougher. The ability to use a mechanism that enables you to understand system execution in a transparent manner, with minimal overhead, is invaluable.
Instrumentation is one such effective solution in measuring and improving software quality. Software performance counters have provided a convenient way to monitor application execution status and resource usage at an aggregate level. Event instrumentation has also been popular over the years. Events raised by a software component at different stages of execution can significantly reduce the time it takes to diagnose various problems. In addition to scanning for certain events or patterns of events, one can apply data mining and correlation techniques to further analyze the events to produce meaningful statistics and reports on program execution and problematic behavior. The ability to collect events on production systems in real time helps avoid the need to have an unwieldy debugger setup on customer machines.
Introduced in the Windows 2000 operating system, Event Tracing for Windows (ETW) is a general-purpose event-tracing platform on Windows operating systems. Using an efficient buffering and logging mechanism implemented in the kernel, ETW provides a mechanism to persist events raised by both user-mode applications and kernel-mode device drivers. Additionally, ETW gives users the ability to enable and disable logging dynamically, making it easy to perform detailed tracing in production environments without requiring reboots or application restarts.
The operating system itself has been heavily instrumented with ETW events. The ability to analyze and simulate core OS activities based on ETW events in development, as well as on production-mode systems, has been valuable to developers in solving many quality problems. With each subsequent Windows release, the number of ETW events raised by the operating system has increased; Windows 7 is the most instrumented operating system to date. In addition, Windows 7 contains tools that can utilize these operating system ETW events to analyze system performance and reliability, as well as uncover quality problems in software applications.
Many application problems surface as anomalies in OS resource usage, such as unexpected patterns or spikes in the consumption of CPU, memory, network bandwidth, IOs and so on. Because OS events for most system activities can be traced to the originating process and thread, one can make considerable progress in narrowing down possible root causes of many application problems, even without ETW instrumentation in applications. Of course, ETW instrumentation in the application would allow further diagnosis to be significantly more efficient.
In the first article of our two-part series, we present a high-level overview of the ETW technology and core OS instrumentation. Then, we discuss tool support to obtain and consume OS events. Next, we provide more details on the events from various subcomponents in the core OS. We also explain how the different system events can be combined to produce a comprehensive picture of system behavior, which we demonstrate by using a set of Windows PowerShell scripts.
As mentioned earlier, ETW is a logging platform that efficiently records the events sent by software applications or kernel-mode components. Using ETW provider APIs, any application, DLL or driver can become an event provider (a component that raises events) for ETW. A provider first registers with ETW and sends events from various points in the code by inserting ETW logging API calls. Any recordable activity of importance can be an event, and it is represented by a piece of data written by ETW at the time of logging. These logging API calls are ignored when the provider is not enabled. An ETW controller application starts an ETW session and enables certain providers to it. When an enabled event provider makes a logging API call, the event is then directed to the session designated by the controller. Events sent to a session may be stored in a log file, consumed programmatically in real time, or kept in memory until the controller requests a flush of that data to a file. A previous article, "Improve Debugging And Performance Tuning with ETW" (msdn.microsoft.com/en-us/magazine/dvdarchive/cc163437.aspx), has more details about the ETW technology and how to add ETW instrumentation into an application. Over the years, ETW has come to support many different logging modes and features, which are documented on MSDN.
An ETW event consists of a fixed header followed by context-specific data. The header identifies the event and the component raising the event, while the context-specific data ("event payload" hereafter) refers to any additional data that the component raising the event wants to record. When an event raised by a provider is written to an ETW session, ETW adds additional metadata to the header, including thread and process IDs, the current CPU on which the logging thread is running, CPU time usage of the thread, and timestamp. Figure 1 shows an XML representation of an event (a Process event of type Start) as decoded by the tracerpt tool (to be discussed later) in the XML dump file. The <System> section is common to all events and represents the common header that ETW records for each event. This contains timestamp, process and thread ID, provider GUID, CPU time usage, CPU ID, and so on. The <EventData> section displays the logged payload of this event. As shown in Figure 1, a Process Start event from the Windows kernel contains process key (a unique key assigned to each process for identification), process ID, parent process ID, session ID, exit status (valid only for a Process End event), user SID, executable file name of the process, and the command that started the process.
ETW controllers are the applications that use the ETW control API set to start ETW sessions and enable one or more providers to those sessions. They need to give each session a unique name, and on Windows 7 there can be up to 64 sessions running concurrently.
Figure 1 Process Start Event in XML Dump
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Guid="{9e814aad-3204-11d2-9a82-006008a86939}" />
<EventID>0</EventID>
<Version>3</Version>
<Level>0</Level>
<Task>0</Task>
<Opcode>1</Opcode>
<Keywords>0x0</Keywords>
<TimeCreated SystemTime="2009-07-14T16:27:43.441456400Z" />
<Correlation ActivityID="
{00000000-0000-0000-0000-000000000000}" />
<Execution ProcessID="2584" ThreadID="4324"
ProcessorID="1" KernelTime="90" UserTime="15" />
<Channel />
<Computer />
</System>
<EventData>
<Data Name="UniqueProcessKey">0xFFFFFA8005BBC950</Data>
<Data Name="ProcessId">0x1430</Data>
<Data Name="ParentId">0xA18</Data>
<Data Name="SessionId"> 1</Data>
<Data Name="ExitStatus">259</Data>
<Data Name="DirectoryTableBase">0x4E1D6000</Data>
<Data Name="UserSID">guest</Data>
<Data Name="ImageFileName">notepad.exe</Data>
<Data Name="CommandLine">notepad</Data>
</EventData>
<RenderingInfo Culture="en-US">
<Opcode>Start</Opcode>
<Provider>MSNT_SystemTrace</Provider>
<EventName xmlns=
"http://schemas.microsoft.com/win/2004/08/events/trace">
Process</EventName>
</RenderingInfo>
<ExtendedTracingInfo xmlns=
"http://schemas.microsoft.com/win/2004/08/events/trace">
<EventGuid>{3d6fa8d0-fe05-11d0-9dda-00c04fd7ba7c}</EventGuid>
</ExtendedTracingInfo>
</Event>
The events from the kernel and core system components, however, are logged in a manner different from user-mode applications or kernel-mode device drivers. The core system logs events to a special session with a reserved name, "NT Kernel Logger." Only the "NT Kernel Logger" session ("kernel session" hereafter) receives core system events, and it does not accept events from any other regular event provider. Also, core OS events are enabled by specifying appropriate flags when a session is started. Each flag represents event instrumentation in a different core component that can be enabled selectively. This helps reduce the instrumentation overhead in the kernel and reinforces the authenticity of system events. In addition, a new feature in Windows 7 allows users to capture call stacks at the time of logging. If symbols are available, one can trace a chain of function calls that trigger the kernel events to be logged. The benefit from call-stack analysis will be discussed in subsequent sections, when we discuss individual events in more detail.
There are a few ETW control tools available on Windows that allow users to collect events. For instance, the Performance Monitor exposes ETW control in the form of a data collection set. The EventLog service is also capable of starting and stopping ETW sessions and viewing events. For command-line and script interfaces, logman.exe offers options to perform ETW control operations. For event consumption, the command-line tool tracerpt.exe can consume ETW log files and produce dumps in several formats, including CSV and XML. An example of an XML representation of a Process Start event is shown in Figure 1. In this article, we use logman.exe and tracerpt.exe in the samples we present. The "logman query providers" command in Figure 2 lists the different flags that can be used by a controller when enabling the kernel session.
The following command starts the kernel session and enables process, thread, disk, network, image, and registry events. The collected events will be stored in a file called systemevents.etl in the current directory. Controlling the kernel session and collecting core OS events require administrator privileges:
> logman start "NT Kernel Logger" –p "Windows Kernel Trace"
(process,thread,img,disk,net,registry) –o systemevents.etl –ets
To stop the collection, users need to issue the "logman stop
-ets" command:
> logman stop "NT Kernel Logger" –ets
The tracerpt tool can process the events in the log file into a readable format. By default, tracerpt accepts one or more log files and generates an event output file and a summary file. The default format of the output file is XML:
> tracerpt systemevents.etl
The logman and tracerpt help text, Windows Help and Support, and online documentation have more details on the switches and features of these tools.
There are advanced performance analysis tools that consume core OS events for various analysis and tuning scenarios. Windows Performance Toolkit (WPT) is one such notable tool, and is available from the Windows SDK (msdn.microsoft.com/en-us/performance/cc825801.aspx). WPT is useful to a broad audience, including system builders, hardware manufacturers, driver developers and general application developers. Its trace analysis tools, Xperf and XperfView, apply sophisticated techniques (including those introduced here) to aggregate and analyze the core OS events to offer meaningful perspectives into the OS and application behavior. Its flexible GUI provides many rich and customizable presentation options that can help users focus on different aspects of system activities. Figure 3 shows a screenshot of the XperfView tool in action.
The Windows core OS has many components instrumented with ETW. As shown in Figure 2, events are available for activities spanning various subsystems including processes, threads, disk and file IOs, memory, network, registry and so on. In this section, we provide details for each group of events. We also discuss a number of core operating system concepts. More information on these concepts can be found in OS-centric reference materials such as "Windows Internals, 5th Edition," by Russinovitch, Solomon and Ionescu (Microsoft Press, 2009). Core OS events are subject to change in future versions of Windows, as the platform and its instrumentation evolve to meet new requirements.
There are several ways to make use of event data. One can scan for a certain event, such as an error event, or a pattern of events that represent execution flow. Other popular methodologies include statistical analysis (counting and summarizing events), delta analysis (analyzing deltas between pairs of events, such as Start and End), activity analysis (tracking an activity/request through event correlation), and aggregation and pattern analysis based on call-stacks. Over the years, we have found it effective to construct a state machine and simulate OS activities when we look at core system events. When we explain the system events in this section, we will also describe how we use them in building the state machine.
A thread is a basic unit of execution on the Windows OS, and a process acts as a container for threads. Each process (and thread) is assigned an ID that is unique while it is running. IDs for processes and threads share the same number space. That is, while a thread with ID A is active, A is never given to other processes or threads. The only exception is the per-CPU idle threads and the idle process, whose IDs are all identical and have been historically 0. Process and Thread events are the basic building blocks in establishing the state machine, which is helpful in understanding more advanced system activities. There are two types of events for Process and Thread events: Start and End. Process Start and End are logged when a process starts and terminates, respectively. The same applies to Thread Start and End events. The payload for Process events has more details about the process, such as process name and parent process ID, as shown in Figure 1. Likewise, the Thread event payload contains thread-specific information, such as stack base and limit, and start address. It should be noted that the IDs of the starting processes or threads are part of the Process and Thread event payload, although the event header already has those items in it. Process Start events are logged in the context of a parent process that creates the current process. In that case, the process ID in the <System> section (event header) is the parent process. The process ID in the event payload is the one being created. The same applies to thread ID in Thread events.
For processes and threads that started before the event collection, ETW logs state rundown events. For the purpose of analysis, they are used to denote the running processes and threads at the time the event collection begins. ETW also logs rundown events for the remaining processes and threads still running when the collection ends. Process and Thread rundown events enumerate and log events in the same format as Process and Thread Start events for all processes and threads, including system and idle processes. Process and Thread rundowns events use different types, DCStart and DCEnd, to distinguish themselves from real process and thread creation and termination. A separate Defunct event is written for a process that has terminated, but with outstanding references to it.
Every process and thread active during event collection can be tracked using Process and Thread events. When you build a state machine, the processing routine should keep a list of active processes and threads, perhaps as some type of objects in the program. When a thread or process terminates, the corresponding object may be placed in a different ("complete") list. The process objects may also contain more details (such as process name) on the processes or threads that they refer to, picked up from the Process and Thread event payload. It is simpler and a lot more informative if one can refer to a process by name during analysis, rather than by ID, as IDs can be recycled once a process or thread terminates (and all references to it are released). When events from other core components indicate OS activities during the state machine construction and simulation, the threads and processes that initiated them are located in the state machine, and their objects are updated to attribute those activities, using primarily the process and thread IDs in the event headers. At the end, process and thread objects are aggregated and summarized into a report with various metrics.
Image events correspond to image (also known as module) files getting loaded and unloaded into process address space. There are four types of Image events: Load, Unload, DCStart and DCEnd. These events do not directly correlate to LoadLibrary calls, however. If a DLL is already loaded in the process, subsequent LoadLibrary calls for the same DLL simply increment the count of module references but will not map the module again. Like the DCStart and DCEnd types of Process and Thread events, Image DCStart and DCEnd are used to enumerate loaded modules of already running processes. Image events allow for the tracking of loaded modules and the mapping of addresses within a process. They are also important in mapping DPC, ISR, and System Call events, as will be discussed in a later section and in Part 2 of this series. The Image event payload contains information such as the module address base and size, and the name of the binary file loaded (or unloaded). Image events are also required for decoding call-stacks. In a typical state machine construction, Image events trigger the updating of a list of loaded modules into an aforementioned process object.
Process Counter events, when enabled, are logged at process termination and record in its payload a few properties regarding the process execution statistics over the lifetime of the process. They consist of peak memory size, peak working set size, peak paged and nonpaged pool usage, and peak page file usage. These events indicate how a process behaved with respect to memory usage. Like Process events, separate rundown Process Counter events are logged for all active processes at the end of event collection.
Figure 2 NT Kernel Logger Enable Flags
>
logman query providers "Windows Kernel Trace"
Provider GUID
------------------------------------------------------------------------
Windows Kernel Trace {9E814AAD-3204-11D2-9A82-006008A86939}
Value Keyword Description
------------------------------------------------------------------------
0x0000000000000001 process Process creations/deletions
0x0000000000000002 thread Thread creations/deletions
0x0000000000000004 img Image load
0x0000000000000008 proccntr Process counters
0x0000000000000010 cswitch Context switches
0x0000000000000020 dpc Deferred procedure calls
0x0000000000000040 isr Interrupts
0x0000000000000080 syscall System calls
0x0000000000000100 disk Disk IO
0x0000000000000200 file File details
0x0000000000000400 diskinit Disk IO entry
0x0000000000000800 dispatcher Dispatcher operations
0x0000000000001000 pf Page faults
0x0000000000002000 hf Hard page faults
0x0000000000004000 virtalloc Virtual memory allocations
0x0000000000010000 net Network TCP/IP
0x0000000000020000 registry Registry details
0x0000000000100000 alpc ALPC
0x0000000000200000 splitio Split IO
0x0000000000800000 driver Driver delays
0x0000000001000000 profile Sample based profiling
0x0000000002000000 fileiocompletion File IO completion
0x0000000004000000 fileio File IO
The command completed successfully.
Context Switch events are logged every time thread switches occur on a CPU and can be used to construct a very accurate history as to which threads have been running and for how long. They occur very frequently and produce a large amount of data. In each switch, two threads are involved. The old thread will give up its share of execution time and hand the execution to the new thread. Thus, Context Switch events contains old and new thread IDs, old and new thread priorities, wait reason and wait time. Context switches can happen for various reasons, including blocking on kernel synchronization objects (events, timers, semaphores and so on.), preemption by a higher priority thread, quantum expiration, and changes in thread affinity. A certain amount of context switches are always expected. However, excessive context switching can be an indication of inefficient use of synchronization primitives and can lead to poor scaling in performance. Enabling call-stacks on Context Switch events allows in-depth analysis on reasons for threads getting switched out.
Deferred Procedure Call (DPC) events are logged when DPCs are executed. DPC is a kernel-mode function executed at elevated interrupt-level execution mode, and it preempts regular thread execution. The DPC event payload includes DPC entry time and routine address. Interrupt Service Routine (ISR) is a similar mechanism, and it runs at a higher execution level than DPC. ISR events have ISR entry time, routine address, ISR vector and ISR return value. DPC and ISR mechanisms are important elements in a Windows driver, as they are typically used for handling hardware interrupts. Drivers and kernel-mode components have a right to use DPC and ISR, but it is strongly recommended that they spend as little time as possible in these elevated modes. DPC and ISR events are used to monitor and verify the behavior of various drivers and kernel-mode components. By comparing the routine addresses against the range information in Image events, one can locate the kernel component responsible for those DPC and ISR events.
In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization. By setting aside storage for each processor that records its current active thread based on Context Switch, DPC and ISR events, one can monitor -- given any timestamp -- what each CPU was doing at that time and whose code it was executing. In the state machine simulation method, when a context switch takes place, a CPU object is updated with the new thread ID, and so is the object for the old thread with CPU usage up to the switch. Likewise, DPC and ISR events are attributed to the corresponding kernel-mode components, if needed.
In certain cases, such as with IO, Memory or System Call events, ETW does not record the process or thread IDs in the event header, primarily to reduce the overhead of very frequent events. For such events, the values of thread and process IDs in the header show up as 0xFFFFFFFF (= 4294967295). If Context Switch, DPC and ISR events are tracked as described above, those events can be traced to the thread or kernel-mode component by examining the CPU object for the currently running thread or DPC/ISR.
Memory events denote memory manager (MM) operations. Windows offers Page Fault events, Hard Page Fault events and Virtual Memory events. Memory events tend to be very frequent, especially on a busy system.
A page fault occurs when a sought-out page table entry is invalid. If the requested page needs to be brought in from disk, it is called a hard page fault (a very expensive operation), and all other types are considered soft page faults (a less expensive operation). A Page Fault event payload contains the virtual memory address for which a page fault happened and the instruction pointer that caused it. A hard page fault requires disk access to occur, which could be the first access to contents in a file or accesses to memory blocks that were paged out. Enabling Page Fault events causes a hard page fault to be logged as a page fault with a type Hard Page Fault. However, a hard fault typically has a considerably larger impact on performance, so a separate event is available just for a hard fault that can be enabled independently. A Hard Fault event payload has more data, such as file key, offset and thread ID, compared with a Page Fault event.
Virtual Memory events consist of Virtual Memory Allocation and Virtual Memory Free types and correspond to MM calls to allocate and free virtual memory ranges. Their payload contains the process ID, base address, requested size and flags used in the API call. Virtual Memory events were newly added for Windows 7 and are useful for tracking down leaked calls to the VirtualAlloc function and excessive virtual memory usage by applications.
Memory event headers do not contain the IDs of the threads and processes that caused the particular activities. The Page Fault event payload, however, has the thread ID of the thread causing the fault. This allows the correlation of Page Fault events to threads and processes through the state machine. The Virtual Memory event payload contains the ID of the process for which virtual memory operations were performed. To track it to the thread making the API call, you need the context switch accounting, described in the previous section.
Windows 7 features hundreds of event providers from various components. In the first part of this two-part article series, we have presented some of the core OS ETW events available on Windows 7 and the analysis techniques that we have used for many years. Individual events indicate certain activities in the core OS, but if combined through context-sensitive analysis methods, they can be used to produce meaningful reports that provide insights into patterns and anomalies in resource usage. In Part 2, we plan to cover other core OS ETW events as well as present simple scripts to demonstrate a few basic accounting techniques on some of the OS events introduced in these two parts. We hope that many people will take advantage of the content presented here and that it will lead to the promotion of sound engineering practice, greater software quality and better user experiences.
Dr. Insung Parkis a development manager for the Windows Instrumentation and Diagnosis Platform Team. He has published a dozen papers on performance analysis, request tracking, instrumentation technology, and programming methodology and support. His e-mail address is [email protected].
Alex Bendetov is a development lead for the Windows Instrumentation and Diagnosis Platform Team. He works on both Event Tracing for Windows and the Performance Counter technologies. He can be reached at [email protected].