一套Linux x86-64上的11.1.0.7 RAC系统,该RAC使用Netapps NFS作为共享存储。其中一个节点出现ORA-00600: internal error code, arguments: [kccchb_3]内部错误并导致实例意外终止,详细日志如下:
Mon Dec 27 00:03:13 2010
Error: Controlfile sequence number in file header is different from the one in memory
Please check that the correct mount optionsare used if controlfile is located on NFS
Errors in file /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/trace/TSMOTEBS1_ckpt_15907.trc (incident=11353):
ORA-00600: internal error code, arguments: [kccchb_3], [555541], [555540], [555540], [], [], [], [], [], [], [], []
Incident details in: /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/incident/incdir_11353/TSMOTEBS1_ckpt_15907_i11353.trc
Errors in file /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/trace/TSMOTEBS1_ckpt_15907.trc:
ORA-00600: internal error code, arguments: [kccchb_3], [555541], [555540], [555540], [], [], [], [], [], [], [], []
CKPT (ospid: 15907): terminating the instance due to error 469
根据Metalink文档 Note 303986.1 RAC instance using NFS via Netapps report Ora-600 [kccchb_3]:
Symptoms
RAC instance using NFS via Netapps occasionally reports:
ORA-00600 [kccchb_3], [689], [0], [],[],[],[]
ORA-00600 [kclchkblk_3], [7], [738515663], [14], [], [], [], []
Cause
This is most probably a Netapp issue wherein 'noac' doesn't work as expected.
NetApp recommended mount options for Oracle9i RAC on Solaris:
rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio,noac
NetApp recommended mount options for Linux SLES9, RHEL4, RHEL3 QU3 and later:
rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=ORACLE_BLOCK_SIZE,
wsize=ORACLE_BLOCK_SIZE,actimeo=0
NetApp recommended mount options for Linux RHAS2.1, RHEL3 pre QU3:
rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=ORACLE_BLOCK_SIZE,
wsize=ORACLE_BLOCK_SIZE,noac
And for AS 2.1 in /etc/modules.conf add:
options nfs nfs_uncached_io=1
Solution
Workaround
set filesystemio_options=DIRECTIO for all instances (to bypass any buffer cache)
以上介绍了该ORA-00600 [kccchb_3]内部错误可以通过设置初始化参数filesystemio_options为DIRECTIO来workaround;实际上更好的选择可以是设置filesystemio_options为SETALL,即同时使用DIRECTIO和ASYNC异步IO。 顺便提一下在Linux上启用异步ASYNC IO不仅需要设置该filesystemio_options参数,我们还需要同时保证以下几点: 1.filesystemio_options参数设置为ASYNC或者SETALL 2.合理设置Linux Kernel内核参数fs.aio-max-nr,例如设置为3145728;以及参数fs.file-max =6815744 3.disk_asynch_io初始化参数设置为TRUE 4.在10gR2下oracle binary可能没有正确以async_on选项编译,我们需要手动重新make(The reason behind this behavior is that the LIBAIO_0.1 Linux OS library is not attached for io_getevents.)
SQL>shutdown immediate
[maclean@rh8 ~]$ cd $ORACLE_HOME/rdbms/lib
[maclean@rh8 lib]$ ln -s /usr/lib/libaio.so skgaio.o
[maclean@rh8 lib]$ make PL_ORALIBS=-laio -f ins_rdbms.mk async_on
rm -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaioi.o
cp /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaio.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaioi.o
chmod 755 /u01/oracle/product/10.2.0/db_1/bin
- Linking Oracle
rm -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle
gcc -o /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle -L/u01/oracle/product/10.2.0/db_1/rdbms/lib/ -L/u01/oracle/product/10.2.0/db_1/lib/ -L/u01/oracle/product/10.2.0/db_1/lib/stubs/ -L/usr/lib -lirc -Wl,-E `test -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/sskgpsmti.o && echo /u01/oracle/product/10.2.0/db_1/rdbms/lib/sskgpsmti.o` /u01/oracle/product/10.2.0/db_1/rdbms/lib/opimai.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/ssoraed.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/ttcsoi.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/defopt.o -Wl,--whole-archive -lperfsrv10 -Wl,--no-whole-archive /u01/oracle/product/10.2.0/db_1/lib/nautab.o /u01/oracle/product/10.2.0/db_1/lib/naeet.o /u01/oracle/product/10.2.0/db_1/lib/naect.o /u01/oracle/product/10.2.0/db_1/lib/naedhs.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/config.o -lserver10 -lodm10 -lnnet10 -lskgxp10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lhasgen10 -lcore10 -lskgxn2 -locr10 -locrb10 -locrutl10 -lhasgen10 -lcore10 -lskgxn2 -lclient10 -lvsn10 -lcommon10 -lgeneric10 -lknlopt `if /usr/bin/ar tv /u01/oracle/product/10.2.0/db_1/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap10" ; fi` -lslax10 -lpls10 -lplp10 -lserver10 -lclient10 -lvsn10 -lcommon10 -lgeneric10 -lknlopt -lslax10 -lpls10 -lplp10 -ljox10 -lserver10 -lclsra10 -ldbcfg10 -locijdbcst10 -lwwg `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags` -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags` -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lmm -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags` -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags` -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `if /usr/bin/ar tv /u01/oracle/product/10.2.0/db_1/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo10"; fi` -lctxc10 -lctx10 -lzx10 -lgx10 -lctx10 -lzx10 -lgx10 -lordimt10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lsnls10 -lunls10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -laio `cat /u01/oracle/product/10.2.0/db_1/lib/sysliblist` -Wl,-rpath,/u01/oracle/product/10.2.0/db_1/lib -lm `cat /u01/oracle/product/10.2.0/db_1/lib/sysliblist` -ldl -lm -L/u01/oracle/product/10.2.0/db_1/lib
/u01/oracle/product/10.2.0/db_1/lib//libcore10.a(lcd.o): In function `lcdprm':
lcd.c:(.text+0x8e8): warning: the `gets' function is dangerous and should not be used.
mv -f /u01/oracle/product/10.2.0/db_1/bin/oracle /u01/oracle/product/10.2.0/db_1/bin/oracleO
mv /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle /u01/oracle/product/10.2.0/db_1/bin/oracle
chmod 6751 /u01/oracle/product/10.2.0/db_1/bin/oracle
SQL> startup;
/* 通过以下方式可以验证ASYNC IO的工作情况 */
cat /proc/slabinfo | grep kio
[maclean@rh8 ~]$ cat /proc/slabinfo | grep kio
kioctx 56 72 320 12 1 : tunables 54 27 8 : slabdata 6 6 0
kiocb 15 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0
/* 这里kioctx的值大于零,说明正在使用异步ASYNC IO */
如何是11g,那么无需那么麻烦要重新编译async_on,同时可以通过以下手段了解异步ASYNC IO的使用情况:
SQL> col name for a60
SQL> SELECT name, asynch_io FROM v$datafile f,v$iostat_file i
WHERE f.file# = i.file_no
AND filetype_name = 'Data File' ;
NAME ASYNCH_IO
------------------------------------------------------------ ---------
/standby/oradata/PROD/datafile/o1_mf_system_6q9dwgwh_.dbf ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_sysaux_6q9dwgyp_.dbf ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_undotbs1_6q9dwh0r_.dbf ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_users_6q9dwh2x_.dbf ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_example_6q9dzhh1_.dbf ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_enc_6q9fdzcx_.dbf ASYNC_ON
6 rows selected.
/* 可以看到ASYNCH_IO的状态为ON,说明正使用异步IO */