AMD TSC Drift Solutions in Red Hat Enterprise Linux®
AMD Opteron™ and AMD Athlon™ 64 processors are well known for their ACPI-compliant power management capabilities, which allow the processor cores to independently adjust their performance state and power state, resulting in significant power savings.
The performance state is known as the P-state and is defined as a valid operating combination of processor core voltage and frequency; as a result, a P-state change alters the rate at which the Time Stamp Counter (TSC) increments. The power state is known as the C-state, in which the operating system can place a processor in an idle state, with the C1 state being the most useful and interesting power state. The C1 clock ramping feature has been enabled in recent multi-core systems. It significantly reduces the power consumption of an idle core that issues a halt (HLT) instruction, but it also causes the TSC to increment very slowly while the core is halted.
P-state and C-state changes can affect the TSC and can result in TSC drift amongst the processor cores. The drift matters only when the operating system uses the TSC as the time keeping source. TSC drift can occur on K8 AMD multi-processor platforms and single-processor dual-core platforms, as they do not provide a frequency-independent TSC. The drift does not occur on single-processor single-core platforms, because there is only one TSC and nothing for it to drift against.
The Linux operating system gives the user three choices of time source: the High Precision Event Timer (HPET), PMtimer and TSC. To use HPET as the time source, it needs to be exposed and enabled in the BIOS, and not all platforms have this feature. That leaves the user with two options, namely PMtimer or TSC. PMtimer is a very reliable way to keep track of time but it has two problems: it is very slow and it is not scalable. TSC, on the other hand, is very fast and is scalable.
Certain RHEL customers are looking for both speed and scalability, and feel that TSC is the right time source for their applications. Typically, these are database applications that make a large number of gettimeofday() calls.
When the OS makes gettimeofday() queries with TSC as the time source on a system affected by drift, time has been found to fluctuate and appear to go backwards. This can cause several problems on a system, including a single key press resulting in multiple keystrokes displayed on the console, and the cumulative output of gettimeofday() showing time going backwards.
The Linux® operating system does provide mechanisms to turn off certain timers and to specify the timer as part of the boot configuration options. To turn off TSC as the time source, notsc can be used as the kernel parameter at boot time. PMtimer can be specified as the time source by using PMtimer as the kernel parameter at boot time.
RHEL 4 has support for all three timers, with the default being PMtimer. HPET is faster than PMtimer and preferable, but is not available on all platforms. There are some RHEL customers who are actually passing in nopmtimer as a kernel boot configuration option to force the system to use TSC.
There are two different causes of TSC drift in RHEL. RHEL 4 users experienced TSC drift due to P-state changes caused by the PowerNow! driver, and RHEL 3 users experienced TSC drift due to C-state changes caused by C1 clock ramping. Solutions are available for both problems.
Red Hat Enterprise Linux 4
The RHEL 4 operating system experienced severe skew, within minutes of booting the system, when run with TSC as the time source and PowerNow! performing P-state changes. The problem here is that the AMD K8 processors do not provide a frequency-independent TSC. When the PowerNow! driver changes the frequency and voltage of a processor to increase or decrease its speed based on processor workload, TSC drift results.
A workaround was provided to get around this problem by changing all the processors to the same frequency at the same time. This ensures that the TSC on every processor increments at the same rate, never at a rate different from the TSC on another processor.
    if (TSC_drift_exists) {
        if (CPUfamily == 0x0f) {
            // Save the current CPU affinity for future restore
            // after programming the Model Specific Registers (MSRs).
            for_each_online_cpu(i) {
                // Migrate the current task to the appropriate CPU.
                set_cpus_allowed(current, cpumask_of_cpu(i));
                // Write the new FID value along with the other
                // control fields to the MSR.
                wrmsr(MSR_FIDVID_CTL, New FID Value, PLL Lock Value);
            }
            // Restore CPU affinity bit mask
        }
    }
Also, on a multi-processor system that is transitioning all cores in sync, the voltage for each frequency is adjusted to the highest. This prevents systems with processors of different steppings from failing.
A boot configuration option, powernow-k8.tscsync, was created. It needs to be enabled when TSC is used as the time source in order to enable simultaneous transitions. The kernel boot configuration options to force the use of TSC as the time source for gettimeofday() queries would look like this:
nohpet nopmtimer powernow-k8.tscsync=1
Red Hat Enterprise Linux 3
The RHEL 3 operating system experienced skew when running on AMD processor based systems. The RHEL 3 kernel code base does not support the cpufreq sub-system, which is necessary for AMD PowerNow!™ technology driver support. The TSC drift was introduced in this case by C-state changes, specifically C1 clock ramping. This issue is inherent to K8 AMD multi-processor platforms and single-processor dual-core platforms, as they do not provide a frequency-independent TSC.
The TSC drift is generally noticeable only when the operating system uses the TSC as either the only source of time or as a fast timer to interpolate between periodic timer interrupts. RHEL 3 only supports TSC as the time source for gettimeofday() queries, and there is no HPET or PMtimer support in the 32-bit architecture. So this issue affects pretty much all users.
The workaround for the TSC problem is to disable C1 clock ramping. This can be done by clearing the PMM7 bits on each core's Miscellaneous Control device, which is configurable via PCI space.
Accessing the PCI configuration space requires the north bridge PCI address (NB_PCI_ADDR), the north bridge power management device (NB_PM_DEV), the north bridge C1 register (NB_C1_REG), and the north bridge C1 mask (NB_C1_MASK).
    if (CPU == 0x0f) {
        if (Processor Addr >= Rev_E) {
            // Disable C1 clock ramping to avoid TSC drift.
            // Read PCI configuration register on the north bridge.
            reg = read_pci_config(0, NB_PCI_ADDR + physproc_id[cpu],
                                  NB_PM_DEV, NB_C1_REG);
            // Clear the PMM7 bits on each core.
            write_pci_config(0, NB_PCI_ADDR + physproc_id[cpu],
                             NB_PM_DEV, NB_C1_REG, reg & NB_C1_MASK);
        }
    }
RDTSCP is the new read time-stamp counter and processor ID instruction introduced in AMD NPT Family 0Fh processors, and is used to read the model-specific TSC register. The instruction returns the 64-bit TSC value in the edx:eax register pair. It waits for all earlier instructions to complete before reading the counter, which prevents speculative reads of the TSC, and it returns the TSC_AUX[31:0] MSR in ecx at the same time as the TSC. Because both values come from a single instruction, no context switch can occur between reading the TSC and reading TSC_AUX.
The RDTSCP feature in software adds initialization of the RDTSCP auxiliary values to CPU numbers in time.c. If RDTSCP is available, the MSRs are written with the respective values. It adds a fast time-stamp based cache, using a user-supplied argument, to speed things up even more. It also adds macros to msr.h for reading the TSC via the RDTSCP instruction, as well as for writing the auxiliary MSR read by RDTSCP.
Testing shows vast improvements, on RDTSCP-capable CPUs, in the time it takes to make gettimeofday() (GTOD) calls. Over a run of 1 million GTOD calls, each call takes 324 cycles without RDTSCP and 221 cycles with it. Scalability with PIT/TSC enabled remains unaltered. With the new vgetcpu vsyscall that is currently under development in the Linux community, there is the potential for the latency to improve by 15-20 cycles per GTOD call.
RDTSCP has been shown to provide significant latency improvements for gettimeofday() operations, while not affecting the scalability that is much desired. These improvements are huge for database server applications that do a large number of gettimeofday() transactions. RDTSCP, coupled with the TSC drift fixes in the RHEL operating systems, provides a fast means of doing gettimeofday() transactions while retaining scalability.
For the sake of completeness, here are a few sample tests that were developed during the course of the TSC drift development work. The important tests include scalability, TSC drift and AMD PowerNow! user space governor tests.
Scalability
To verify scalability, include a test that spawns an increasing number of POSIX threads, increasing the number by one on each pass. Each thread must return a nearly identical value for its gettimeofday() call. If the value returned starts to drop as threads are added, the system is not scalable.
    for (i = 1; i < 10; i++) {
        for (j = 0; j < i; j++) {
            result = pthread_create(&thread[j], NULL, threadf, (void *) 0);
        }
        for (j = 0; j < i; j++) {
            pthread_join(thread[j], NULL);
        }
    }
TSC drift
TSC drift can be determined by gathering the hardware clock (hwclock) and the software clock (date) time-stamps and verifying the skew between the hardware and software hours, minutes and seconds.
PowerNow!
Changing the P-state of the processor cores caused the system to experience TSC drift. Here is a test that sets the processor cores to a high value. Similar tests can be set up to set the processor cores to minimum (1 GHz), low (1.8 GHz), and maximum (2.4 GHz) values. The example shows the cores being set to a high value of 2.2 GHz. Note that the values chosen for the test depend on the frequency of the processor.
    for ((i = 0; i < NUM_PROCESSORS; i++))
    do
        echo 2200000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_setspeed
    done
Toggle amongst all the P-states and check for TSC drift to test the robustness of your system.
The Red Hat Enterprise Linux TSC work has shown that there is a definite need among enterprise Linux customers for a fast time source such as TSC. The fact of the matter is that TSC was never designed to be used as a time source on multi-processor systems. However, customers are interested in using TSC as the time source to perform time-sensitive operations, and at the same time desire scalable solutions.
Next generation AMD processors will provide a TSC that is P-state and C-state invariant, which will make the TSC immune to drift. Applications will then be able to use TSC as the time source without any need for the custom engineering work described in this article. At the same time, the Linux community is working on a per-CPU TSC solution that will keep track of each core's TSC frequency.
Bhavana Nagendra is a Member of AMD's technical staff supporting Red Hat Enterprise Linux on the AMD Opteron™ processor platform. She has worked extensively on TSC drift solutions in RHEL along with her colleague Mark Langsdorf, who authored the RHEL4 TSCnow! patch, in collaboration with Red Hat development and performance teams.