On 192.168.219.90, running dmesg|grep -i error showed that this machine has a memory problem, as shown below:
[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.

[Hardware Error]: MC4_ADDR: 0x00000018edfd9100
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.

[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.

[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
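Since the same address shows up more than once in the log above, it can help to tally how often each corrected-error address repeats. The following is only a minimal sketch: the function name tally_ce_addresses is made up for illustration, and it assumes the log lines contain the "CE ERROR_ADDRESS= 0x..." text exactly as shown above (other EDAC drivers word this differently).

```shell
# Count how often each EDAC corrected-error address appears.
# Reads kernel log text on stdin, prints "<count> <address>", most frequent first.
# Assumption: lines look like "EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900".
tally_ce_addresses() {
    grep -o 'CE ERROR_ADDRESS= *0x[0-9a-fA-F]*' \
        | awk '{ print $NF }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (reading the kernel log usually needs root):
#   dmesg | tally_ce_addresses
```

An address that keeps coming back points at one failing location rather than scattered random errors.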

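Further checking showed that the 5th memory module is the faulty one, so we need to contact the private cloud team to have it repaired.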
grep [0-9] /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0

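Any line whose count is not 0 indicates memory errors.
mc: the memory controller index; on this machine it corresponds to the CPU node (the log above reports node 2 as MC2).
csrow: the chip-select row behind that memory controller.
ch*: the memory channel within that csrow.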
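With many controllers, most counters are 0, so it is easier to print only the non-zero ones. A rough sketch, assuming the same sysfs layout as in the grep output above; the function name nonzero_ce_counts and its optional root-directory argument are hypothetical and exist only to make the snippet easy to try.

```shell
# Print only the EDAC counters whose corrected-error count is non-zero.
# $1: EDAC sysfs root (hypothetical parameter; defaults to the path used above).
nonzero_ce_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue      # skip when the glob matched nothing
        n=$(cat "$f")
        # Print "mcX/csrowY/chZ_ce_count:N" for counts above zero.
        [ "$n" -gt 0 ] 2>/dev/null && printf '%s:%s\n' "${f#"$root"/}" "$n"
    done
    return 0
}

# Usage:
#   nonzero_ce_counts
```

On the machine above this would leave just the mc2/csrow2/ch0 counter to look at.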

Then check with dmidecode:

[root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM01
Locator: DIMM02
Locator: DIMM03
Locator: DIMM04
Locator: DIMM05
Locator: DIMM06
Locator: DIMM07
Locator: DIMM08
Locator: DIMM09
Locator: DIMM10
Locator: DIMM11
Locator: DIMM12
Locator: DIMM13
Locator: DIMM14
Locator: DIMM15
Locator: DIMM16
Locator: DIMM17
Locator: DIMM18
Locator: DIMM19
Locator: DIMM20
Locator: DIMM21
Locator: DIMM22
Locator: DIMM23
Locator: DIMM24
Locator: DIMM25
Locator: DIMM26
Locator: DIMM27
Locator: DIMM28
Locator: DIMM29
Locator: DIMM30
Locator: DIMM31
Locator: DIMM32
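The locator names alone do not say which slots are actually populated. As a sketch, the Size field can be printed next to each locator. This assumes the usual dmidecode -t memory layout, where each Memory Device block lists a "Size:" line before its "Locator:" line; the function name dimm_sizes is made up for illustration.

```shell
# Pair every DIMM locator with the Size reported in the same dmidecode block.
# Reads `dmidecode -t memory` output on stdin, prints "<locator><TAB><size>".
# Assumption: "Size:" appears before "Locator:" within each Memory Device block.
dimm_sizes() {
    awk -F': *' '
        /^[ \t]*Size:/    { size = $2 }                    # remember the size
        /^[ \t]*Locator:/ { print $2 "\t" size; size = "" } # "Bank Locator:" is not matched
    '
}

# Usage (dmidecode needs root):
#   dmidecode -t memory | dimm_sizes
```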
Check the memory in the server console:
[Figure 1: memory view in the server console]

Layout of the memory slots on the motherboard:
[Figure 2: memory slot layout on the motherboard]

Combined with the error log: kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1
the problem should be in memory slot DIMM_F1.
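The CPU, channel and DIMM numbers in that log label are what get matched against the slot layout. A small sketch for pulling them out of such a line follows; parse_dimm_label is a hypothetical helper name, and it assumes the label has exactly the CPU#..Channel#.._DIMM#.. form shown in the log above. Translating those numbers into a silkscreen name such as DIMM_F1 still depends on the board's own slot diagram.

```shell
# Extract CPU, channel and DIMM numbers from an EDAC label such as
# "CPU#1Channel#2_DIMM#1". Reads log lines on stdin.
# Assumption: the label format matches the log line quoted above.
parse_dimm_label() {
    sed -n 's/.*CPU#\([0-9]*\)Channel#\([0-9]*\)_DIMM#\([0-9]*\).*/cpu=\1 channel=\2 dimm=\3/p'
}

# Usage:
#   dmesg | grep 'CE error on' | parse_dimm_label
```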

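Solution:
The last step is to pull the memory module out of the faulty F1 slot, or move it to a different memory slot. After that, the system boots without reporting the error again.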