Memory ECC Errors

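Goal

dmesg reports memory ECC errors (logged by EDAC as CE, i.e. corrected errors).
Identify which physical memory modules are affected.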

dmesg output

[    4.745351] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[    4.745359] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[    5.746989] EDAC MC0: 27609 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x105649c offset:0x6c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0093 socket:0 channel_mask:2 rank:1)
[    5.747001] EDAC MC0: 23245 CE memory scrubbing error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x105649e offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c3 socket:0 channel_mask:1 rank:1)
[  300.644412] mce: [Hardware Error]: Machine check events logged
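
The EDAC lines above already carry the controller, error count, channel and slot. A minimal sketch for pulling those fields out of the kernel log is shown here; the function name edac_ce_summary is made up for illustration, and the pattern assumes the exact line format printed by this kernel, which may differ with other EDAC drivers.

```shell
# Summarize EDAC corrected-error (CE) lines read from stdin.
# Assumption: lines look like the "EDAC MC0: <count> CE ... (channel:N slot:N ..."
# entries shown in this article; other drivers may format them differently.
edac_ce_summary() {
    sed -n -E 's/.*EDAC MC([0-9]+): ([0-9]+) CE .*\(channel:([0-9]+) slot:([0-9]+) .*/mc=\1 count=\2 channel=\3 slot=\4/p'
}

# Typical use on a live system (reading the kernel log may require root):
#   dmesg | edac_ce_summary
```

Lines that do not match, such as the "HANDLING MCE MEMORY ERROR" entries, are simply skipped.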

Reading the memory error counters

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:23245   <- ECC errors: DIMM 0, channel 0, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:27609   <- ECC errors: DIMM 0, channel 1, branch 0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
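
On a machine with many DIMMs most of these counters are zero, so it can help to print only the non-zero ones. The following is a rough sketch rather than a standard tool: the function name nonzero_ce_counts is hypothetical, and it assumes the same mc*/csrow*/ch*_ce_count layout shown above.

```shell
# Print every per-channel CE counter that is greater than zero.
# $1: EDAC sysfs root (defaults to the path used in this article).
nonzero_ce_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # skip unmatched globs and unreadable files
        n=$(cat "$f")
        # Non-numeric content makes the test fail quietly and is skipped.
        [ "$n" -gt 0 ] 2>/dev/null && echo "$f:$n"
    done
    return 0
}

# Typical use:
#   nonzero_ce_counts
```

The output keeps the path:count form of the grep command above, so the mc, csrow and channel can still be read off the path.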

Reference documentation

# yum install -y kernel-doc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * updates: mirrors.sh.vclound.com
Resolving Dependencies
--> Running transaction check
---> Package kernel-doc.noarch 0:3.10.0-514.6.2.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================================
 Package                      Arch                     Version                                Repository                 Size
==============================================================================================================================
Installing:
 kernel-doc                   noarch                   3.10.0-514.6.2.el7                     updates                    15 M

Transaction Summary
==============================================================================================================================
Install  1 Package

Total download size: 15 M
Installed size: 48 M
Downloading packages:
kernel-doc-3.10.0-514.6.2.el7.noarch.rpm                                                               |  15 MB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : kernel-doc-3.10.0-514.6.2.el7.noarch                                                                       1/1
  Verifying  : kernel-doc-3.10.0-514.6.2.el7.noarch                                                                       1/1

Installed:
  kernel-doc.noarch 0:3.10.0-514.6.2.el7

Complete!

See the following section of the kernel EDAC documentation

 vim /usr/share/doc/kernel-doc-3.10.0/Documentation/edac.txt

                Channel 0       Channel 1
        ===================================
        csrow0  | DIMM_A0       | DIMM_B0 |
        csrow1  | DIMM_A0       | DIMM_B0 |
        ===================================

        ===================================
        csrow2  | DIMM_A1       | DIMM_B1 |
        csrow3  | DIMM_A1       | DIMM_B1 |
        ===================================

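From the output above, memory is divided into two groups, one per memory controller: mc0 and mc1.
In group mc0, the first and second modules report ECC errors,
that is, DIMM 0 on channel 0 and DIMM 0 on channel 1.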
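
To carry this result over to the hardware inventory, the controller, channel and slot numbers can be turned into the label format that the BIOS of this machine reports. This is only a sketch: the helper name bank_locator is made up, and it assumes that the mc index equals the BRANCH number and that the board uses the "BRANCH x CHANNEL y DIMM z" wording seen on this server. Other vendors label their slots differently.

```shell
# Build the Bank Locator string for a controller, channel and slot.
# Assumption: MC index == BRANCH number, as on the board in this article.
bank_locator() {
    printf 'BRANCH %s CHANNEL %s DIMM %s\n' "$1" "$2" "$3"
}

# The two failing positions found on mc0:
bank_locator 0 0 0    # prints: BRANCH 0 CHANNEL 0 DIMM 0
bank_locator 0 1 0    # prints: BRANCH 0 CHANNEL 1 DIMM 0
```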

Locating the physical DIMMs

 dmidecode -t memory |  grep -E 'Memory Device|Size:|Locator'
Memory Device
        Size: 16384 MB
        Locator: DIMM000
        Bank Locator: BRANCH 0 CHANNEL 0 DIMM 0       <- faulty
Memory Device
        Size: No Module Installed
        Locator: DIMM001
        Bank Locator: BRANCH 0 CHANNEL 0 DIMM 1       
Memory Device
        Size: 16384 MB
        Locator: DIMM010
        Bank Locator: BRANCH 0 CHANNEL 1 DIMM 0       <- faulty
Memory Device
        Size: No Module Installed
        Locator: DIMM011
        Bank Locator: BRANCH 0 CHANNEL 1 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM020
        Bank Locator: BRANCH 0 CHANNEL 2 DIMM 0
Memory Device
        Size: No Module Installed
        Locator: DIMM021
        Bank Locator: BRANCH 0 CHANNEL 2 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM030
        Bank Locator: BRANCH 0 CHANNEL 3 DIMM 0
Memory Device
        Size: No Module Installed
        Locator: DIMM031
        Bank Locator: BRANCH 0 CHANNEL 3 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM100
        Bank Locator: BRANCH 1 CHANNEL 0 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM101
        Bank Locator: BRANCH 1 CHANNEL 0 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM110
        Bank Locator: BRANCH 1 CHANNEL 1 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM111
        Bank Locator: BRANCH 1 CHANNEL 1 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM120
        Bank Locator: BRANCH 1 CHANNEL 2 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM121
        Bank Locator: BRANCH 1 CHANNEL 2 DIMM 1
Memory Device
        Size: 16384 MB
        Locator: DIMM130
        Bank Locator: BRANCH 1 CHANNEL 3 DIMM 0
Memory Device
        Size: 16384 MB
        Locator: DIMM131
        Bank Locator: BRANCH 1 CHANNEL 3 DIMM 1
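
Scanning this list by eye is error-prone on larger machines. As a sketch, the lookup from a Bank Locator string to the slot name in the Locator field can be scripted; the function name locator_for_bank is hypothetical, and it assumes that each Locator line precedes its Bank Locator line, as in the output above.

```shell
# Print the "Locator" of each memory device whose "Bank Locator" equals $1.
# Reads `dmidecode -t memory` output on stdin.
locator_for_bank() {
    awk -v want="$1" '
        # Remember the most recent slot name (not the "Bank Locator" line).
        /^[[:space:]]*Locator:/ { loc = $2 }
        /^[[:space:]]*Bank Locator:/ {
            sub(/^[[:space:]]*Bank Locator:[[:space:]]*/, "")
            sub(/[[:space:]]+$/, "")
            if ($0 == want) print loc
        }
    '
}

# Typical use (dmidecode needs root):
#   dmidecode -t memory | locator_for_bank 'BRANCH 0 CHANNEL 1 DIMM 0'
```

For the two failing positions in this article, the matching entries in the listing above are DIMM000 and DIMM010.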
