今天在做一个change的时候,change的内容本身比较简单,就是将控制文件冗余到不同的mount point去,当alter system control_file scope=spfile之后,关闭RAC,mv控制文件,将之再次启动,使得spfile中指向的新路径的控制文件生效。问题出在了关闭RAC之后,数据库就启动不了了。
我们从关闭RAC开始,看看当时的场景。
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ srvctl stop database -d SIAP
同时我们tail -f节点1和节点2上的alertlog,发现节点2是很快很顺利的关闭了,但是节点1迟迟未关闭。
节点2的alertlog:
Sat Jun 02 14:04:45 GMT 2012
Shutting down instance (immediate)
Sat Jun 02 14:04:45 GMT 2012
Shutting down instance: further logons disabled
Sat Jun 02 14:04:49 GMT 2012
Stopping background process CJQ0
Sat Jun 02 14:04:49 GMT 2012
Stopping background process QMNC
Sat Jun 02 14:04:51 GMT 2012
Stopping background process MMNL
Sat Jun 02 14:04:52 GMT 2012
Stopping background process MMON
License high water mark = 1004
Sat Jun 02 14:04:53 GMT 2012
Job queue slave processes stopped
Sat Jun 02 14:05:59 GMT 2012
ALTER DATABASE CLOSE NORMAL
Sat Jun 02 14:05:59 GMT 2012
SMON: disabling tx recovery
SMON: disabling cache recovery
Sat Jun 02 14:06:00 GMT 2012
Shutting down archive processes
Archiving is disabled
Sat Jun 02 14:06:05 GMT 2012
ARCH shutting down
ARC9: Archival stopped
Sat Jun 02 14:06:10 GMT 2012
ARCH shutting down
ARC8: Archival stopped
Sat Jun 02 14:06:15 GMT 2012
ARCH shutting down
ARC7: Archival stopped
Sat Jun 02 14:06:21 GMT 2012
ARCH shutting down
ARC6: Archival stopped
Sat Jun 02 14:06:31 GMT 2012
ARCH shutting down
ARC4: Archival stopped
Sat Jun 02 14:06:36 GMT 2012
ARCH shutting down
ARC3: Archival stopped
Sat Jun 02 14:06:46 GMT 2012
ARCH shutting down
ARC1: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARCH shutting down
Sat Jun 02 14:06:51 GMT 2012
ARC5: Becoming the 'no FAL' ARCH
Sat Jun 02 14:06:51 GMT 2012
ARC0: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARC5: Becoming the 'no SRL' ARCH
ARC5: Archiving disabled
Sat Jun 02 14:06:51 GMT 2012
ARCH shutting down
ARC5: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARC2: Becoming the heartbeat ARCH
Sat Jun 02 14:06:58 GMT 2012
Thread 2 closed at log sequence 21703
Successful close of redo thread 2
Sat Jun 02 14:07:26 GMT 2012
Completed: ALTER DATABASE CLOSE NORMAL
Sat Jun 02 14:07:26 GMT 2012
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Sat Jun 02 14:07:33 GMT 2012
freeing rdom 0
在14:07左右节点2就关闭了。
节点1的alertlog:
Sat Jun 02 14:04:52 GMT 2012
Shutting down instance (immediate)
Sat Jun 02 14:04:52 GMT 2012
Shutting down instance: further logons disabled
Sat Jun 02 14:04:54 GMT 2012
Stopping background process CJQ0
Sat Jun 02 14:04:54 GMT 2012
Stopping background process QMNC
Sat Jun 02 14:04:56 GMT 2012
Stopping background process MMNL
Sat Jun 02 14:04:57 GMT 2012
Stopping background process MMON
License high water mark = 1417
Sat Jun 02 14:04:58 GMT 2012
Job queue slave processes stopped
Sat Jun 02 14:12:41 GMT 2012
IPC Send timeout detected.Sender: ospid 14002
Receiver: inst 2 binc 379475 ospid 17274
Sat Jun 02 14:12:42 GMT 2012
IPC Send timeout to 1.0 inc 4 for msg type 8 from opid 33
Sat Jun 02 14:12:42 GMT 2012
Communications reconfiguration: instance_number 2
Sat Jun 02 14:12:42 GMT 2012
Trace dumping is performing id=[cdmp_20120602141242]
Sat Jun 02 14:13:33 GMT 2012
Shutting down instance (abort)
License high water mark = 1417
Termination issued to instance processes. Waiting for the processes to exit
Sat Jun 02 14:13:43 GMT 2012
Instance termination failed to kill one or more processes
Sat Jun 02 14:14:13 GMT 2012
Instance terminated by USER, pid = 5214
我们发现节点1是在14:13分,被RAC以shutdown abort的方式关闭的。
至此,关闭RAC的命令以以下报错结束:
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ srvctl stop database -d SIAP
PRKP-1002 : Error stopping instance SIAP1 on node vus029pa
CRS-0216: Could not stop resource 'ora.SIAP.SIAP1.inst'.
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $
虽然oracle提示了节点1无法关闭,但是检查了oracle的进程都已经消失了。确认节点1已经关闭,因此开始重新启动数据库。没想到启动的时候,起不来了:
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $ srvctl start database -d SIAP
PRKP-1001 : Error starting instance SIAP1 on node vus029pa
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.SIAP.SIAP1.inst' has placement error.
PRKP-1001 : Error starting instance SIAP2 on node vus160po
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.inst'.
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $
发现此时节点是是启动到nomount状态,即有进程存在但是无法启动,节点2是没启动。
对应的alertlog为:
节点1:
Starting ORACLE instance (normal)
Sat Jun 02 14:40:57 GMT 2012
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
processes = 1500
timed_statistics = TRUE
__shared_pool_size = 6442450944
shared_pool_size = 6442450944
__large_pool_size = 218103808
large_pool_size = 218103808
__java_pool_size = 33554432
java_pool_size = 33554432
__streams_pool_size = 0
spfile = /ora/db001/data001/SIAP/spfileSIAP.ora
filesystemio_options = SETALL
disk_asynch_io = TRUE
cluster_interconnects = 192.168.0.1
sga_target = 21474836480
control_files = /ora/db001/data001/SIAP/control01SIAP.ctl, /ora/db001/redo002/SIAP/control02SIAP.ctl, /ora/db001/redo001/SIAP/control03SIAP.ctl
control_file_record_keep_time= 30
db_block_size = 8192
__db_cache_size = 14763950080
db_cache_size = 10737418240
db_writer_processes = 10
compatible = 10.2.0
log_archive_config = DG_CONFIG=(SIAP,SIAP_DG)
log_archive_dest_1 = location=/ora/db001/dbfra001/SIAP valid_for=(all_logfiles,all_roles) db_unique_name=SIAP
log_archive_dest_2 = service=SIAP_DG valid_for=(online_logfiles,primary_role) LGWR ASYNC DB_UNIQUE_NAME=SIAP_DG
log_archive_dest_state_1 = ENABLE
log_archive_dest_state_2 = ENABLE
log_archive_max_processes= 10
standby_archive_dest = /ora/db001/dbfra001/SIAP/standby
log_archive_format = arch_SIAP_%s_%t_%r.arc
fal_client = SIAP
fal_server = SIAP_DG
log_buffer = 14234112
db_files = 1000
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
db_recovery_file_dest = /ora/db001/dbfra001/SIAP
db_recovery_file_dest_size= 64424509440
standby_file_management = AUTO
thread = 1
fast_start_mttr_target = 300
log_checkpoints_to_alert = TRUE
instance_number = 1
undo_management = AUTO
undo_tablespace = undots
undo_retention = 3600
create_stored_outlines = FALSE
remote_login_passwordfile= EXCLUSIVE
db_domain = in.telstra.world
instance_name = SIAP1
local_listener = LISTENER_VUS029PA
remote_listener = LISTENER_SIAP
session_cached_cursors = 100
utl_file_dir = /tmp
job_queue_processes = 2
cursor_sharing = force
background_dump_dest = /opt/app/oracle/admin/SIAP/bdump
user_dump_dest = /opt/app/oracle/admin/SIAP/udump
max_dump_file_size = 5M
core_dump_dest = /opt/app/oracle/admin/SIAP/cdump
audit_file_dest = /opt/app/oracle/admin/SIAP/adump
session_max_open_files = 200
open_links = 4
audit_trail = DB
db_name = SIAP
open_cursors = 1000
star_transformation_enabled= true
query_rewrite_enabled = true
query_rewrite_integrity = trusted
pga_aggregate_target = 17179869184
workarea_size_policy = auto
Cluster communication is configured to use the following interface(s) for this instance
192.168.0.1
Sat Jun 02 14:41:01 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=10522
DIAG started with pid=3, OS id=10524
PSP0 started with pid=4, OS id=10526
LMON started with pid=5, OS id=10528
LMD0 started with pid=6, OS id=10530
LMS0 started with pid=7, OS id=10539
LMS1 started with pid=8, OS id=10544
LMS2 started with pid=9, OS id=10548
LMS3 started with pid=10, OS id=10552
LMS4 started with pid=11, OS id=10556
LMS5 started with pid=12, OS id=10560
LMS6 started with pid=13, OS id=10564
LMS7 started with pid=14, OS id=10576
LMS8 started with pid=15, OS id=10580
LMS9 started with pid=16, OS id=10584
LMSa started with pid=17, OS id=10594
LMSb started with pid=18, OS id=10598
LMSc started with pid=19, OS id=10602
LMSd started with pid=20, OS id=10606
MMAN started with pid=21, OS id=10623
DBW0 started with pid=22, OS id=10653
DBW1 started with pid=23, OS id=10673
DBW2 started with pid=24, OS id=10686
DBW3 started with pid=25, OS id=10690
DBW4 started with pid=26, OS id=10692
DBW5 started with pid=27, OS id=10694
DBW6 started with pid=28, OS id=10696
DBW7 started with pid=29, OS id=10706
DBW8 started with pid=30, OS id=10708
DBW9 started with pid=31, OS id=10710
LGWR started with pid=32, OS id=10712
CKPT started with pid=33, OS id=10714
SMON started with pid=34, OS id=10716
RECO started with pid=35, OS id=10718
CJQ0 started with pid=36, OS id=10720
MMON started with pid=37, OS id=10723
MMNL started with pid=38, OS id=10732
<
<<挂死在此处
节点2的alertlog为:
Sat Jun 02 14:22:15 GMT 2012
Starting ORACLE instance (normal)
Sat Jun 02 14:23:25 GMT 2012
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
processes = 1500
timed_statistics = TRUE
__shared_pool_size = 6442450944
shared_pool_size = 6442450944
__large_pool_size = 218103808
large_pool_size = 218103808
__java_pool_size = 33554432
java_pool_size = 33554432
__streams_pool_size = 0
spfile = /ora/db001/data001/SIAP/spfileSIAP.ora
filesystemio_options = SETALL
disk_asynch_io = TRUE
cluster_interconnects = 192.168.0.2
sga_target = 21474836480
control_files = /ora/db001/data001/SIAP/control01SIAP.ctl, /ora/db001/redo002/SIAP/control02SIAP.ctl, /ora/db001/redo001/SIAP/control03SIAP.ctl
control_file_record_keep_time= 30
db_block_size = 8192
__db_cache_size = 14763950080
db_cache_size = 10737418240
db_writer_processes = 10
compatible = 10.2.0
log_archive_config = DG_CONFIG=(SIAP,SIAP_DG)
log_archive_dest_1 = location=/ora/db001/dbfra001/SIAP valid_for=(all_logfiles,all_roles) db_unique_name=SIAP
log_archive_dest_2 = service=SIAP_DG valid_for=(online_logfiles,primary_role) LGWR ASYNC DB_UNIQUE_NAME=SIAP_DG
log_archive_dest_state_1 = ENABLE
log_archive_dest_state_2 = ENABLE
log_archive_max_processes= 10
standby_archive_dest = /ora/db001/dbfra001/SIAP/standby
log_archive_format = arch_SIAP_%s_%t_%r.arc
fal_client = SIAP
fal_server = SIAP_DG
log_buffer = 14234112
db_files = 1000
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
db_recovery_file_dest = /ora/db001/dbfra001/SIAP
db_recovery_file_dest_size= 64424509440
standby_file_management = AUTO
thread = 2
fast_start_mttr_target = 300
log_checkpoints_to_alert = TRUE
instance_number = 2
undo_management = AUTO
undo_tablespace = undots2
undo_retention = 3600
create_stored_outlines = FALSE
remote_login_passwordfile= EXCLUSIVE
db_domain = in.telstra.world
instance_name = SIAP2
local_listener = LISTENER_VUS160PO
remote_listener = LISTENER_SIAP
session_cached_cursors = 100
utl_file_dir = /tmp
job_queue_processes = 2
cursor_sharing = force
background_dump_dest = /opt/app/oracle/admin/SIAP/bdump
user_dump_dest = /opt/app/oracle/admin/SIAP/udump
max_dump_file_size = 5M
core_dump_dest = /opt/app/oracle/admin/SIAP/cdump
audit_file_dest = /opt/app/oracle/admin/SIAP/adump
session_max_open_files = 200
open_links = 4
audit_trail = DB
db_name = SIAP
open_cursors = 1000
star_transformation_enabled= true
query_rewrite_enabled = true
query_rewrite_integrity = trusted
pga_aggregate_target = 17179869184
workarea_size_policy = auto
Cluster communication is configured to use the following interface(s) for this instance
192.168.0.2
Sat Jun 02 14:23:28 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=24777
DIAG started with pid=3, OS id=24779
PSP0 started with pid=4, OS id=24781
LMON started with pid=5, OS id=24783
LMD0 started with pid=6, OS id=24785
LMS0 started with pid=7, OS id=24787
LMS1 started with pid=8, OS id=24794
LMS2 started with pid=9, OS id=24803
LMS3 started with pid=10, OS id=24807
LMS4 started with pid=11, OS id=24811
LMS5 started with pid=12, OS id=24816
LMS6 started with pid=13, OS id=24841
LMS7 started with pid=14, OS id=24852
LMS8 started with pid=15, OS id=24857
LMS9 started with pid=16, OS id=24861
LMSa started with pid=17, OS id=24865
LMSb started with pid=18, OS id=24901
LMSc started with pid=19, OS id=24907
LMSd started with pid=20, OS id=24920
MMAN started with pid=21, OS id=24924
DBW0 started with pid=22, OS id=24926
DBW1 started with pid=23, OS id=24928
DBW2 started with pid=24, OS id=24930
DBW3 started with pid=25, OS id=24932
DBW4 started with pid=26, OS id=24934
DBW5 started with pid=27, OS id=24938
DBW6 started with pid=28, OS id=24946
DBW7 started with pid=29, OS id=24948
DBW8 started with pid=30, OS id=24950
DBW9 started with pid=31, OS id=24952
LGWR started with pid=32, OS id=24954
CKPT started with pid=33, OS id=24956
SMON started with pid=34, OS id=24958
RECO started with pid=35, OS id=24968
CJQ0 started with pid=36, OS id=24970
MMON started with pid=37, OS id=24972
MMNL started with pid=38, OS id=24976
Sat Jun 02 14:23:46 GMT 2012
USER: terminating instance due to error 304
Instance terminated by USER, pid = 23354
我们看到节点2其实也尝试启动过,但是遭遇了error 304,所以被自动关闭了。
手工shutdown abort节点1之后,尝试再次重启,还是同样的问题,节点1能启动进程,但是挂死,节点2启动之后被关闭。
不过报错信息稍有不同,只有节点2无法启动的错误了。
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $ srvctl start database -d SIAP
PRKP-1001 : Error starting instance SIAP2 on node vus160po
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.inst'.
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $
多次尝试都是该错误,因此,RAC起不来的原因,应该是节点2无法启动,而节点2无法启动,应该是遭遇304的错误。
我们来查一下error 304是什么:
oracle@vus160po:SIAP2:/opt/app/oracle/product/10.2.0/db_1/dbs $ oerr ora 304
00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause: An instance tried to start by using a value of the
// initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
// a) specify another INSTANCE_NUMBER,
// b) shut down the running instance with this number
// c) wait for instance recovery to complete on the instance with
// this number.
根据alertlog,发现节点2的instance_number是2:
……
fast_start_mttr_target = 300
log_checkpoints_to_alert = TRUE
instance_number = 2
……
尝试修改pfile中的instance_number,改为:
SIAP1.instance_number=3
SIAP2.instance_number=4
尝试再次重启,还是报错:
oracle@vus029pa:SIAP1:/opt/app/oracle/product/10.2.0/db_1/dbs $ srvctl start service -d SIAP -s SIAP2
PRKP-1030 : Failed to start the service SIAP2.
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.cs'.
此时在节点1的alertlog情况和原来相同,也是到MMNL started with pid=xxx这一步就挂住了,但是在节点2的alertlog完全没有新的内容写出来,也就是节点2根本没尝试启动。
由于改instance_number没效果,只好把instance_number再改回去。
尝试手工启动节点2:
SQL> startup nomount
ORA-00304: requested INSTANCE_NUMBER is busy
SQL>
此时,同事也在一起看这个问题,他发现了故障原因所在:虽然数据库已经被shutdown了,但是在操作系统中,还仍然存在不少LOCAL=NO的进程。
ps -ef |grep SIAP2
oracle 5074 1 0 17:49:36 ? 0:02 oracleSIAP2 (LOCAL=NO)
oracle 18782 1 0 18:00:05 ? 1:01 oracleSIAP2 (LOCAL=NO)
oracle 4649 1 0 17:49:19 ? 0:46 oracleSIAP2 (LOCAL=NO)
oracle 21736 1 0 18:02:31 ? 0:29 oracleSIAP2 (LOCAL=NO)
oracle 3693 1 0 Mar 23 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26392 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 12509 8593 0 17:31:03 pts/4 0:00 tail -f /opt/app/oracle/admin/SIAP/bdump/alert_SIAP2.log
oracle 26382 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 29425 1 0 Apr 14 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26372 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 23702 1 0 May 19 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26368 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26377 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 7945 1 0 Apr 28 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26396 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26379 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 14564 1 0 May 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26367 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26373 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 23715 1 0 May 19 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 6311 1 0 Apr 18 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 29428 1 0 Apr 14 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 27062 1 0 Apr 02 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 1283 1 0 May 12 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26380 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26374 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26369 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26395 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26370 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 20775 1 0 May 28 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 26371 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
通过ps -ef看到,确实存在很多没有down下去的LOCAL=N的进程。也因为是这个原因,让oracle认为节点2还是在活动的,因此不能重新启动该实例。
而该系统的架构是2台服务器上跑2个RAC,我们目前只down下去了1个RAC,另外一个RAC还在跑,服务器上还有不少LOCAL=NO的进程,因此当时我也没太注意竟然还有down掉的那个RAC的LOCAL=NO的进程。
ok,问题已经找到,用
kill -9 `ps -ef|grep LOCAL=NO |grep SIAP2|grep -v grep|awk '{print $2}'`
杀掉进程之后,再次重启数据库,成功。
唉,关于杀LOCAL=NO的进程的问题,其实我在很早的时候就写过一篇文章讲停数据库的技巧,没想到在RAC环境中,还是一样适用,没杀掉LOCAL=N进程,竟然会引起数据库无法启动的问题。
原文地址:http://www.oracleblog.org/working-case/startup-rac-fail-due-to-error-304/