Machine check handling on Linux

0. Abstract

The number of transistors in common CPUs and memory chips grows each year, and hardware busses are getting faster. Both trends increase the chance of data corruption through random bit flips in hardware. Modern chips can detect and sometimes correct such events using ECC checksums and other techniques, but in some cases the hardware cannot hide the problem completely and software has to handle it. Such an event is called a machine check (MC).

As these events become more common, it’s becoming more and more important that Linux recovers as well as possible from them.

The paper discusses some generic issues in handling machine check exceptions (MCEs) in software and covers the recently rewritten x86-64 machine check handler.

1. What is a machine check

This paper is about machine checks. A machine check is the hardware’s way to tell you about some internal error. Traditionally when something goes wrong in hardware the machine crashes. With a machine check, software has a chance to do something better.

The main focus is the x86/x86-64 platform, in particular the AMD Opteron, because that is the platform on which the author has the most experience with machine checks. Most of the discussion applies in general terms to non-x86 architectures too, although some details can differ.

There are two main kinds of machine checks: machine check exceptions (MCEs) and silent machine checks. A machine check exception happens when there is an error that the hardware cannot correct. It causes the CPU to interrupt the current program and call a special exception handler.

With a silent machine check the hardware was able to correct the error, but logged the event to internal registers. There the event can be read by the operating system or the firmware later. Silent machine checks don’t need immediate software or administrator action, but it is useful to log and analyze them to get early cues about hardware problems.

2. Why are they important?

Modern hardware has internal self checking, like internal checksums and error detecting and correcting codes for caches and busses. But the number of transistors is growing and feature size is shrinking with each chip generation, and both trends increase error rates.

Combining Linux machines into clusters for high performance scientific computing is becoming more and more popular[beowulf]. In these clusters it is important to gather information about failing machines so that the administrator can take corrective action. With a lot of machines the mean time between failures of the cluster as a whole decreases significantly, which means error handling becomes more important.

When a hardware error occurs on a node, the task that ran on it should fail, so that silent errors are not introduced into the computation. One way to detect these problems would be self checks in the software (like checksums over memory buffers or algorithms with internal sanity checks for results). But this is not always possible and requires a lot of effort from the programmer. Another way is to rely on the hardware error detection. When the kernel logs an uncorrected hardware error, the cluster software can take corrective action, like rerunning the task on another node and reporting the failure to the administrator.

The same issues apply on servers and high availability clusters.

Logging hardware errors makes it possible to predict failures early.

Even on a desktop silent errors should be avoided. It is better to tell the user that something went wrong due to a hardware issue instead of silently giving wrong results or crashing randomly.

Sources of machine checks can be the CPU, PCI IO, memory, caches, and internal busses. The errors can be corrected errors (only logged to registers, no exception) or uncorrected errors (an exception happens and software must react).

When PCI IO errors are enabled, machine checks can also be caused by software bugs in drivers.

3. A quick overview of the x86 machine check architecture

The original IBM PCs had parity memory and caused Non Maskable Interrupts (NMIs) when a memory error occurred. Later PCs dropped parity memory, but still reported some hardware errors.

Then with the Intel Pentium, basic machine check handling was added to the CPU again. With the Pentium Pro, Intel defined a new generic x86 machine check architecture[intelsys]. This architecture is implemented by modern x86 CPUs from Intel and AMD. It consists of a standard exception (interrupt 18) for machine checks and some standardized Model Specific Registers (MSRs). The common registers allow software to check whether a machine check occurred, to enable and disable machine checks, to check whether the error was corrected or corrupted the CPU state, and some other things.

In addition there are some more registers for each bank. A bank is a group of errors generated by a specific subsystem (like CPU, bus unit, cache, north bridge). The number and meaning of banks is CPU dependent.
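
The per-bank register layout can be sketched in C as follows. This is only an illustration: the MSR numbers are the architectural ones documented in [intelsys], while the helper names mc_bank_msr and mcg_bank_count are hypothetical and do not come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MSR numbers as documented in [intelsys]. */
#define MSR_MCG_CAP    0x179u  /* capabilities, including the bank count */
#define MSR_MCG_STATUS 0x17au  /* global machine check status */
#define MSR_MC0_CTL    0x400u  /* first register of bank 0 */

/* Each bank owns four consecutive MSRs, in this order. */
enum mc_bank_reg { MC_CTL = 0, MC_STATUS = 1, MC_ADDR = 2, MC_MISC = 3 };

/* MSR number of register `reg` in bank `bank` (hypothetical helper). */
static inline uint32_t mc_bank_msr(unsigned bank, enum mc_bank_reg reg)
{
    assert(bank < 256);  /* MCG_CAP reports the bank count in 8 bits */
    return MSR_MC0_CTL + 4u * bank + (uint32_t)reg;
}

/* Number of banks, taken from the low 8 bits of MCG_CAP. */
static inline unsigned mcg_bank_count(uint64_t mcg_cap)
{
    return (unsigned)(mcg_cap & 0xffu);
}
```

A generic handler would loop from bank 0 to mcg_bank_count() - 1 and read the status register of each bank; which subsystem a given bank number stands for remains CPU dependent.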

Each bank has a number of sub-errors that can be enabled or disabled individually. Normally a generic machine check handler enables all errors in all banks. A machine check bank also has a register for the address associated with the error.

Some CPUs like the Intel Pentium 4 also have extensions over the standard registers[intelsys].

The advantage of this generic architecture is that a single machine check handler can work on many different CPUs. When a machine check is detected, the kernel reads all the generic machine check registers and the registers of any banks that signaled an error.

The actual decoding and interpretation of the different errors is CPU dependent and left to the user. Some generic handling can be done; for example, when a bank has a valid error address, the handler can assume that the memory at this address got corrupted. The handler can also take different actions depending on whether the error was corrected and whether it corrupted the CPU context.
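
The generic part of this decision can be sketched as a small classifier over a bank status register. The bit positions follow the generic architecture described in [intelsys]; the enum and function names are hypothetical, and a real handler would look at more than these bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in a bank status register, as documented in [intelsys]. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register holds a valid event */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier event was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* address register is valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

enum mc_severity { MC_NONE, MC_CORRECTED, MC_UNCORRECTED, MC_CONTEXT_CORRUPT };

/* Classify one bank using only the generic status bits. */
static inline enum mc_severity mc_classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MC_NONE;
    if (status & MCI_STATUS_PCC)
        return MC_CONTEXT_CORRUPT;
    if (status & MCI_STATUS_UC)
        return MC_UNCORRECTED;
    return MC_CORRECTED;
}

/* True if the bank's address register describes this event. */
static inline bool mc_addr_valid(uint64_t status)
{
    return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_ADDRV);
}

/* True if an older event in this bank was lost before it could be read. */
static inline bool mc_overflowed(uint64_t status)
{
    return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_OVER);
}

/* Address of the presumably corrupted memory; check mc_addr_valid() first. */
static inline uint64_t mc_error_addr(uint64_t status, uint64_t addr_reg)
{
    assert(mc_addr_valid(status));
    return addr_reg;
}
```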

Modern Intel CPUs have special thermal errors that happen when the CPU overheats and gets throttled. This normally only needs to be logged.

Some chipsets can be configured to trigger NMIs on various PCI or other bus errors.

4. Why is it hard to write a machine check handler

A machine check handler cannot use any normal kernel services. Normally kernel code runs either in process context or in interrupt context. Interrupt context can do less than process context; it can only call functions that properly protect their data structures against interrupts occurring in parallel. Such functions are called "interrupt-safe".

Machine check exceptions can trigger at any time, even in a critical section where all normal interrupts are disabled. This implies that the machine check handler cannot even use interrupt-safe functions, because it would risk deadlocking on kernel spin locks.

For silent machine checks, undefined interrupt state isn't a problem because they are normally checked from the timer interrupt, which honors the normal interrupt exclusion rules. However, to keep the code simple, the silent checking and the exception handling share the same code paths, which means that these problems apply to some extent to the silent event check too.

It is also important to handle the machine check quickly, because the machine may already be unstable after a hardware failure. When the handling is delayed to first bring the kernel into an easier to handle state, there is a risk that the event cannot be handled at all. Also, when another machine check occurs on the same bank in this time window, it would overwrite the old event, which then can no longer be handled.

For more complex events like a RAM error, however, there may be no other choice than to delay handling, because the handling must synchronize with kernel locks.

Unlike other exceptions, machine checks are asynchronous. This means the CPU core does not take care of reporting them at the exact instruction that caused the failure, but they may be reported only hundreds of cycles later. This makes handling less reliable as discussed below.

5. Logging machine checks

Traditionally machine checks were logged by the firmware. When the operating system does not have a machine check handler, the MC registers are never cleared. After the next warm boot the BIOS finds the information from the last machine check and logs it to an event log. This method has obvious shortcomings: the logging only happens when the machine is rebooted, it cannot log multiple errors in the same bank, and it is hard to collect this information over a network or save it to disk.

Moving the logging to the operating system can avoid all these problems. Even then it is still difficult. Most Linux users have the X server running, so the console is invisible. This means that the handler could log a fatal machine check, but the user wouldn't see it and would just see a frozen X. One way to avoid this right now is to log the error only after the next reboot. This also allows us to save it to disk, which makes it possible for support personnel to check it later.

It is important to clearly separate machine check logs from software errors (like oopses). Most users cannot distinguish them and will ask their software vendor about a machine check when they should really contact the hardware vendor. Experience has shown that the only good way to do this is to separate the log mechanisms completely.

6. x86-64 rewrite

The original x86-64 machine check handler in Linux 2.4 was derived from the i386 version, which was originally written by Alan Cox. One of the first enhancements was a text decoder for the AMD Opteron specific banks, in particular for memory errors. This decoder code unfortunately had a few bugs and turned out to be a design mistake. It is better to do such decoding in user space.

The handler also had a few problems inherited from the i386 version. In particular, it used printk directly from the handler, which could deadlock. These and a few other problems triggered a rewrite from scratch of the x86-64 handler during Linux 2.5.

The new handler closely follows the recommendations given by Intel[intelsys] and AMD[amdsys] for machine check handlers. One important change is that the new handler makes some attempts to distinguish between uncorrected errors and errors that corrupt the processor context. In the first case only the affected process is killed, provided that is safe. The old handler would always panic.

This killing can be slightly risky when the process was in kernel mode, because it could have been holding locks, which could cause a deadlock on process exit.

By default the kernel will always panic on a MC in the kernel to avoid this deadlock. The rationale is that a panic can be handled better than a deadlock, especially in a cluster.

The new handler no longer has any CPU specific code; it handles everything using the generic x86 machine check architecture.

It has a new lockless binary logging system. All machine check events (silent events and exceptions) are logged to a special buffer. This buffer isn't a ring buffer; if it fills up, new entries are discarded. It is completely separate from the normal printk log and can be accessed from user space using the /dev/mcelog character device. This device should be read from a regular cron job by the mcelog[mcelog] program. mcelog decodes the event and logs it into a special log file. It could also notify administrators about the event.

The log buffer also has a special signature in memory that could be used by an external debugger or special firmware to look for hardware errors after reboot.

On a panic the bank causing the fatal error is not cleared, so that the firmware or the kernel can log the error to permanent storage after a warm reboot.

It has a regular polling timer that reads silent machine checks and logs them.
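
One polling pass might look roughly like the sketch below. It is a simplified user-space model, not the kernel code: a plain array stands in for the per-bank status MSRs, and the names mc_event and mc_poll_banks are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)  /* bank holds a valid event */

/* One event found by the poller (hypothetical, reduced record). */
struct mc_event {
    unsigned bank;
    uint64_t status;
};

/*
 * Scan all banks once. Every bank with a valid event is copied to `out`
 * and then cleared, so that the next error in that bank is not lost.
 * `bank_status` is an array standing in for the status MSRs.
 * Returns the number of events stored in `out`.
 */
static inline size_t mc_poll_banks(uint64_t *bank_status, unsigned nbanks,
                                   struct mc_event *out, size_t max_out)
{
    size_t n = 0;

    assert(bank_status != NULL && out != NULL);
    for (unsigned bank = 0; bank < nbanks && n < max_out; bank++) {
        uint64_t status = bank_status[bank];  /* stands in for rdmsr */

        if (!(status & MCI_STATUS_VAL))
            continue;
        out[n].bank = bank;
        out[n].status = status;
        n++;
        bank_status[bank] = 0;                /* stands in for wrmsr */
    }
    return n;
}
```

In the kernel such a pass would be driven by a timer, and each event would be appended to the machine check log rather than to a caller supplied array.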

/* A machine check record */
struct mce {
    __u64 status; /* bank status register */
    __u64 misc; /* misc register (always 0 right now) */
    __u64 addr; /* address or 0 */
    __u64 mcgstatus; /* global MC status register */
    __u64 rip; /* Program counter or 0 for silent error */
    __u64 tsc; /* cpu time stamp counter */
    __u64 res1; /* for future extension */
    __u64 res2; /* ditto */
    __u8 cs; /* code segment */
    __u8 bank; /* machine check bank */
    __u8 cpu; /* cpu that raised the error */
    __u8 finished; /* entry is valid */
    __u32 pad;
};
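
How such records could be appended without locks can be sketched with C11 atomics. This is a user-space illustration under stated assumptions, not the kernel implementation: the record is reduced to a few fields, and MCE_LOG_LEN, struct mce_rec, struct mce_log and both functions are hypothetical names. A writer first claims a slot with a compare-and-exchange and sets the finished flag only after the entry is complete, which is the role of the finished field in the record above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCE_LOG_LEN 32  /* hypothetical capacity */

struct mce_rec {             /* reduced stand-in for struct mce */
    uint64_t status;
    uint64_t addr;
    uint8_t bank;
    atomic_uchar finished;   /* set last, once the entry is complete */
};

struct mce_log {
    atomic_uint next;        /* index of the next free slot */
    struct mce_rec entry[MCE_LOG_LEN];
};

/* Append without taking a lock. Returns false if the log is full. */
static inline bool mce_log_append(struct mce_log *mlog, uint64_t status,
                                  uint64_t addr, uint8_t bank)
{
    assert(mlog != NULL);
    unsigned slot = atomic_load(&mlog->next);

    do {
        if (slot >= MCE_LOG_LEN)
            return false;    /* not a ring buffer: new events are dropped */
    } while (!atomic_compare_exchange_weak(&mlog->next, &slot, slot + 1));

    struct mce_rec *r = &mlog->entry[slot];
    r->status = status;
    r->addr = addr;
    r->bank = bank;
    /* Publish the entry only after all fields are written. */
    atomic_store_explicit(&r->finished, 1, memory_order_release);
    return true;
}

/* Reader side: count the entries that are safe to read. */
static inline unsigned mce_log_finished(struct mce_log *mlog)
{
    unsigned n = 0;

    assert(mlog != NULL);
    for (size_t i = 0; i < MCE_LOG_LEN; i++)
        if (atomic_load_explicit(&mlog->entry[i].finished, memory_order_acquire))
            n++;
    return n;
}
```

Because a slot is never reused until the reader has drained the log, a writer that is interrupted by another machine check on the same CPU cannot deadlock; the cost of this design is that events are lost once the buffer is full.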

7. Configuring the new x86-64 handler

The new handler can be configured at system run time by reading or writing the control files in /sys/devices/system/machinecheck/machinecheck0/. Valid fields are:

  • tolerant: Tolerance level. The higher this level, the more risk the machine check handler takes to keep the machine running.

    Valid levels are:

    • 0 always panic on uncorrected errors.
    • 1 panic if deadlock possible.
    • 2 try to avoid panic at slight deadlock risk.
    • 3 never panic or exit (for testing only).

    Specifying oops=panic on the kernel command line implies zero tolerance.

    For a cluster, setting tolerant to zero may be best, together with panic=10 to force a reboot.

  • check_interval: Interval in seconds to check for silent machine check events. The default is 5 minutes. 0 disables background checking.

  • bank0ctl … bankNctl: Binary mask of errors enabled in bank N. The default is to enable all errors in each bank. A disabled error is ignored. For details on the banks and their sub-errors for AMD and Intel CPUs see [opteron] and [intelsys].
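
How the tolerant levels could translate into an action is sketched below. This is one possible reading of the level descriptions above, not the kernel's actual policy: the names are hypothetical, and the sketch assumes that a corrupted processor context always leads to a panic below level 3.

```c
#include <assert.h>
#include <stdbool.h>

enum mc_action { MC_LOG_ONLY, MC_KILL_PROCESS, MC_PANIC };

struct mc_event_info {
    bool uncorrected;      /* hardware could not correct the error */
    bool context_corrupt;  /* processor context is no longer trustworthy */
    bool in_kernel;        /* interrupted code ran in kernel mode */
};

/* Map an event and the tolerant level (0..3) to an action. */
static inline enum mc_action mc_decide(int tolerant, struct mc_event_info ev)
{
    assert(tolerant >= 0 && tolerant <= 3);

    if (!ev.uncorrected && !ev.context_corrupt)
        return MC_LOG_ONLY;      /* corrected error: only log it */
    if (tolerant == 3)
        return MC_LOG_ONLY;      /* never panic or exit (testing only) */
    if (tolerant == 0 || ev.context_corrupt)
        return MC_PANIC;         /* assumption: no recovery attempted */
    if (ev.in_kernel && tolerant == 1)
        return MC_PANIC;         /* process may hold kernel locks */
    return MC_KILL_PROCESS;      /* level 2 accepts the deadlock risk */
}
```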

8. Future work: New RAM/cache error handling

RAM errors are the most common source of machine check events. The memory controller runs asynchronously from the CPU core, which results in errors being reported imprecisely. The MCE handler assumes that the error occurred in the process that was active at exception time and checks whether it was in kernel or user mode. It then uses this information to decide which process to kill or whether it should panic. When the error happened shortly before a kernel call or a context switch, this information may be stale. A more reliable alternative would be to use the physical error address provided in the MCn_ADDR register and use VM data structures to look up which process the memory address belongs to. For shared memory this could be multiple processes.

The handler would first need to synchronize to process state, because VM locks are not interrupt safe. It could first go into interrupt context by forcing a self interrupt on the same CPU (this would delay execution until the next local interrupt enable and run it in a standard interrupt context). This interrupt handler could then set up a work queue item to run a callback in one of the event processes on the local CPU.

This callback could use the mem_map and rmap data structures offered in Linux 2.6 to look up the owner of the failed page. There are various cases to distinguish in the kernel page cache:

Type          Action
Free page     Ignore and clear the error
Clean page    Free the page and reread its contents from disk
Dirty page    Kill the owning process, or force an IO error for unmapped file cache data
Kernel page   Panic or kill the process, depending on the tolerance level
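
The table can be written down as a small decision function. The sketch only encodes the cases listed above; the enum and function names are hypothetical, and the threshold used for kernel pages (kill from tolerance level 2 upward, panic otherwise) is an assumption, since the text does not fix it.

```c
#include <assert.h>
#include <stdbool.h>

enum page_state { PAGE_FREE, PAGE_CLEAN, PAGE_DIRTY, PAGE_KERNEL };

enum page_action {
    ACT_CLEAR_ERROR,       /* ignore and clear the error */
    ACT_REREAD_FROM_DISK,  /* free the page, reread its contents */
    ACT_KILL_OWNER,        /* kill the process owning the page */
    ACT_FORCE_IO_ERROR,    /* report an IO error for file cache data */
    ACT_PANIC
};

/*
 * Pick an action for a page hit by an uncorrected memory error.
 * `mapped` says whether a dirty page is mapped into some process.
 */
static inline enum page_action mc_page_action(enum page_state state,
                                              bool mapped, int tolerant)
{
    assert(tolerant >= 0 && tolerant <= 3);

    switch (state) {
    case PAGE_FREE:
        return ACT_CLEAR_ERROR;
    case PAGE_CLEAN:
        return ACT_REREAD_FROM_DISK;
    case PAGE_DIRTY:
        return mapped ? ACT_KILL_OWNER : ACT_FORCE_IO_ERROR;
    case PAGE_KERNEL:
        /* Assumed threshold: kill only at the riskier tolerance levels. */
        return tolerant >= 2 ? ACT_KILL_OWNER : ACT_PANIC;
    }
    return ACT_PANIC;  /* unknown state: fail safe */
}
```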

This approach could also be used to handle uncorrected cache errors. In the future it may also be possible to give an application a chance to react to a machine check error by sending it a signal with the failed address as payload instead of unconditionally killing it. The program could then decide how to handle the corrupted memory. For example, a database server with a large data cache that is backed by the disk could just drop a corrupted cache page and reread it.

9. Future work: Handling IO errors on PCs

Some non-PC platforms like HP zX or IBM PPC64 chipsets raise machine checks on PCI IO bus aborts. On PCs these errors are normally silently ignored. Some chipsets can be configured to raise an NMI in this case. It would be possible to write chipset specific drivers that look up the PCI bridge error registers on NMI and try to figure out which device caused the error. The driver could then disable the PCI device to prevent further corruption and log an error for the user. This would be useful for driver debugging and could potentially protect the kernel against failing PCI cards. To do the latter properly it would also need a full IOMMU.

PCI Express[pcie] has an optional but standardized advanced error report capability in its bridge configuration space that may be useful here.

One problem is that there is no well-defined way to find the source of an NMI because it is used by other subsystems like oprofile.

Another problem is that changing the PCI bridge programming set up by the firmware (e.g. to enable additional error reporting using NMI) always carries some risk.

Still it might be worth it because handling PCI errors better could potentially increase Linux/x86 reliability longer term. Short term it would uncover some more driver bugs, although many of those should be already fixed from testing on PPC64 and IA64. I am not quite sure it will be possible to implement this generally on PCs, but it would be at least worth a try.

10. Work to do

NMI handling is still broken. Currently it reads some IO ports and handles them based on what they did on the IBM AT, which is not very useful on modern machines anymore. It also still uses printk and should use the lockless logging framework instead.

Add a proper thermal handler on x86-64.

The improved x86-64 machine check handler should be ported to x86.

Mcelog doesn't decode AMD Opteron specific errors so far. Currently it only dumps the registers as hexadecimal. It would be more user friendly to show the individual banks as text, with the various error bits decoded. Intel decoding support should also be added eventually.

11. Acknowledgments

The original i386 Linux MCE handler was written by Alan Cox. Dave Jones and Zwane Mwaikambo also worked on the i386 handler in Linux 2.5. Paul DeVriendt, Richard Brunner, and David Boles gave valuable feedback on the x86-64 rewrite and helped me to understand the problem better. Special thanks to Eric Morton for a lot of review and patches of the code and to Randy Dunlap for reviewing this paper.
