An attempt to restore the OCR device logically left the OCR in an inconsistent state.
The following operations were run on one of the RAC nodes:
bash-2.03# /data/oracle/product/10.2/crs/bin/ocrconfig -export /export/home/oracle/ocr.log
bash-2.03# /data/oracle/product/10.2/crs/bin/ocrconfig -import /export/home/oracle/ocr.log
PROT-19: Cannot proceed while clusterware is running. Shutdown clusterware first
bash-2.03# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
Nov 21 23:47:25.966 | INF | daemon shutting down
Stopping resources. This could take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.
bash-2.03# /data/oracle/product/10.2/crs/bin/ocrconfig -import /export/home/oracle/ocr.log
bash-2.03# /etc/init.d/init.crs start
Startup will be queued to init within 30 seconds.
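In the transcript above, ocrconfig refused the import only while the local clusterware was running, and the import went through once the local stack was stopped. A manual check on the remaining nodes may therefore be worthwhile before importing. The following is a minimal sketch, assuming the 10.2 clusterware daemons appear in the process list as ocssd.bin, crsd.bin and evmd.bin, and assuming a grep that supports -E (on Solaris, /usr/xpg4/bin/grep). The function name crs_daemons_in is hypothetical.

```shell
#!/bin/bash
# Sketch: detect clusterware daemons in a process listing read from stdin.
# Assumption: the daemons show up as ocssd.bin, crsd.bin or evmd.bin.
# Returns 0 if at least one daemon is listed, non-zero otherwise.
crs_daemons_in() {
    grep -E '(ocssd|crsd|evmd)\.bin' >/dev/null
}
```

One possible use is `ps -ef | crs_daemons_in && echo "clusterware still running"`, run on each node in turn before the import.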
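The clusterware on the other node had not been stopped during the earlier attempt to restore the OCR from a physical backup, so it was not stopped for the logical restore either. After the logical restore completed, the database did not start normally. The alert log shows that the instance hit error 29702 during startup: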
Sun Nov 21 23:47:05 2010
Shutting down instance (abort)
License high water mark = 6
Instance terminated by USER, pid = 18181
Sun Nov 21 23:53:31 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 ce0 172.25.0.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /data/oracle/product/10.2/database/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 150
__shared_pool_size = 318767104
__large_pool_size = 16777216
__java_pool_size = 50331648
__streams_pool_size = 33554432
spfile = /dev/rac/spfiletestrac.ora
sga_target = 1258291200
control_files = /dev/rac/control1.ctl, /dev/rac/control2.ctl, /dev/rac/control3.ctl
db_block_size = 16384
__db_cache_size = 822083584
compatible = 10.2.0.1.0
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
thread = 1
instance_number = 1
undo_management = AUTO
undo_tablespace = UNDOTBS1
remote_os_authent = TRUE
O7_DICTIONARY_ACCESSIBILITY= TRUE
remote_login_passwordfile= EXCLUSIVE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=testracXDB)
local_listener = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.25.198.224)(PORT = 1521))
remote_listener =
job_queue_processes = 10
background_dump_dest = /data/oracle/admin/testrac/bdump
user_dump_dest = /data/oracle/admin/testrac/udump
core_dump_dest = /data/oracle/admin/testrac/cdump
audit_file_dest = /data/oracle/admin/testrac/adump
db_name = testrac
open_cursors = 300
pga_aggregate_target = 418381824
Cluster communication is configured to use the following interface(s) for this instance
10.0.0.1
Sun Nov 21 23:53:35 2010
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=21528
DIAG started with pid=3, OS id=21530
PSP0 started with pid=4, OS id=21532
LMON started with pid=5, OS id=21534
Sun Nov 21 23:53:35 2010
WARNING: Failed to set buffer limit on IPC interconnect socket
Oracle requires that the SocketReceive buffer size be tunable upto 1MB
Please make sure the kernel parameter which limits SO_RCVBUF value set by
applications is atleast 1MB
LMD0 started with pid=6, OS id=21536
LMS0 started with pid=7, OS id=21546
LMS1 started with pid=8, OS id=21550
MMAN started with pid=9, OS id=21554
DBW0 started with pid=10, OS id=21556
LGWR started with pid=11, OS id=21558
CKPT started with pid=12, OS id=21560
SMON started with pid=13, OS id=21562
RECO started with pid=14, OS id=21564
CJQ0 started with pid=15, OS id=21566
MMON started with pid=16, OS id=21576
Sun Nov 21 23:53:36 2010
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=17, OS id=21578
Sun Nov 21 23:53:36 2010
starting up 1 shared server(s) ...
USER: terminating instance due to error 29702
Instance terminated by USER, pid = 21382
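The alert log is long, and the line that matters is the termination message near the end. As a small sketch, the error number can be pulled out of an alert log with sed. The function name termination_error is hypothetical, and the pattern is taken from the message shown above.

```shell
#!/bin/bash
# Sketch: read alert log text on stdin and print the error behind each
# "terminating instance due to error N" line in ORA-N form.
termination_error() {
    sed -n 's/.*terminating instance due to error \([0-9][0-9]*\).*/ORA-\1/p'
}
```

Feeding the alert log through it (for example `termination_error < alert_testrac1.log`, where the file name is an assumption) gives a code that can then be looked up with `oerr ora 29702`.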
Starting the database manually from sqlplus also fails with ORA-29702:
bash-2.03$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.4.0 - Production on Tue Nov 23 23:11:04 2010
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORA-29702: error occurred in Cluster Group Service operation
The Oracle error documentation describes this error as follows:
ORA-29702: error occurred in Cluster Group Service operation
Cause: An unexpected error occurred while performing a CGS operation.
Action: Verify that the LMON process is still active. Check the Oracle LMON trace files for errors. Also, check the related CSS trace file for errors.
Check the corresponding LMON trace file:
bash-2.03$ more testrac1_lmon_21534.trc
/data/oracle/admin/testrac/bdump/testrac1_lmon_21534.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /data/oracle/product/10.2/database
System name: SunOS
Node name: racnode1
Release: 5.8
Version: Generic_117350-46
Machine: sun4u
Instance name: testrac1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 5
Unix process pid: 21534, image: oracle@racnode1 (LMON)
*** SERVICE NAME:() 2010-11-21 23:53:35.450
*** SESSION ID:(167.1) 2010-11-21 23:53:35.450
GES resources 4260 pool 3
GES enqueues 6175
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:1050 b:1050) Reserve 301
GES IPC: Msg Size Regular 408 Batch 8192
Batching factor: enqueue replay 201, ack 224
Batching factor: cache replay 126 size per lock 64
kjxggin: receive buffer size = 32768
clsc_connect: (1069c0b60) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode1_crs))
2010-11-21 23:53:40.670: [ CSSCLNT]clsssInitNative: connect failed, rc 9
kgxgncin: CLSS init failed with status 3
kjxgmin: kgxgncin fails - (2)
kjxggin: generic group layer init fails
*** 2010-11-21 23:53:40.671
Global Enqueue Service Shutdown
*** 2010-11-21 23:53:40.673
global enqueue service detaching from CM:pid=21534
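The lines that matter in this trace are the failed connection to CSS and the CLSS initialization failure that follows it. The following sketch checks a trace file for those messages. The function name css_connect_failed is hypothetical, the patterns are copied from the trace above, and a grep that supports -E and -q is assumed (on Solaris, /usr/xpg4/bin/grep).

```shell
#!/bin/bash
# Sketch: succeed (exit 0) if the given trace file contains any of the
# CSS connection failure messages seen in the LMON trace above.
css_connect_failed() {
    local trace_file=$1
    # -E: extended regular expressions; -q: no output, exit status only
    grep -Eq 'clsc_connect: .*no listener at|clsssInitNative: connect failed|CLSS init failed' "$trace_file"
}
```

For example, `css_connect_failed /data/oracle/admin/testrac/bdump/testrac1_lmon_21534.trc && echo "LMON could not reach CSS"`.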
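The trace shows LMON failing to connect to the local CSS daemon (no listener at OCSSD_LL_racnode1_crs), after which the generic group layer could not initialize and the instance was terminated. Clearly, the instance error was caused by restoring the OCR without shutting down the clusterware on all nodes.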
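After the clusterware on node 2 was shut down and the OCR device was restored from the logical backup with ocrconfig -restore, the clusterware and the database were restarted on both nodes and the system returned to normal.
Whether the OCR is restored logically or physically, shut down the clusterware on all nodes first. Otherwise the cluster can easily end up in an inconsistent state.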
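The recommended order for a logical restore can be sketched as a small script that only prints the commands, so that they can be reviewed before anything is run. The node name racnode2, root ssh access between the nodes, and the helper name plan_ocr_import are assumptions for illustration. The paths and the init.crs and ocrconfig commands are the ones used earlier in this post.

```shell
#!/bin/bash
# Sketch only: print the commands for a logical OCR restore in a safe order.
# Nothing is executed; review the output and run the commands by hand.
# Assumptions: root ssh access to every node, hypothetical node names.
CRS_HOME=/data/oracle/product/10.2/crs

plan_ocr_import() {
    local export_file=$1
    shift    # the remaining arguments are the node names
    local node
    # 1. Stop the clusterware on every node before touching the OCR.
    for node in "$@"; do
        echo "ssh root@$node /etc/init.d/init.crs stop"
    done
    # 2. Import the logical backup once, from the first node only.
    echo "ssh root@$1 $CRS_HOME/bin/ocrconfig -import $export_file"
    # 3. Start the clusterware again on every node.
    for node in "$@"; do
        echo "ssh root@$node /etc/init.d/init.crs start"
    done
}
```

Calling `plan_ocr_import /export/home/oracle/ocr.log racnode1 racnode2` lists both stop commands, then the single import, then both start commands.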