In Memory of My First Database Upgrade

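At the start of the year I was handed a requirement to upgrade a 6 TB database. I began preparing in June 2016, with the upgrade planned for September 2.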

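Honestly, I wanted to say no. Upgrading a database is a large piece of work, and it is not only about the database itself. Is the application code compatible with the new version? Does testing turn up problems? Have the related systems been tested? Does the business side agree to the upgrade? How does the database perform afterwards? In short, it touches everything.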

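A database upgrade is led by the DBA, with development, testing, and operations cooperating, so the database work came first. Starting in June I upgraded the development database using transportable tablespaces plus incremental recovery. The development and test databases took two weeks, which brought us to July. In July I asked development and testing to run regression tests and started building the production 12c database and a performance-test database. Only after the performance-test database was built did I realize that production was in Shenzhen and the performance-test database was in Shanghai. A 6 TB database would take far too long to copy from Shenzhen to Shanghai, so I gave that up and built a database on a development host in Shenzhen for performance testing instead.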

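I restored a database and copied it to the development host. The copy had only just finished when I found that the restored database had been removed automatically by a cleanup program, before I had produced the incremental. (Cue a great deal of swearing.) Copying 6 TB had taken days, it was already early August, and performance testing was still to come. I restored another copy as fast as I could, and by the time the environment was ready it was mid-August, with the planned upgrade date almost out of reach. I kept working overtime, using SPA (SQL Performance Analyzer) to compare performance. It captured more than 50,000 SQL statements, and after dropping those that did not use bind variables there were still more than 4,000. SPA ran for two full days. One run taking that long was painful. Looking closer, many statements were MIS report extracts doing full scans of large tables, so of course they were slow.

At that point I was stuck, so I asked a colleague for help. He quickly removed the MIS extract SQL, grouped the remaining statements, and reran them. Afterwards more than 2,000 statements still showed worse performance. That is too many for one person, and even the whole department would have needed two or three days to analyze them. I went back to the same colleague, whose advice was to classify the SQL: separate the statements whose execution plan had changed, then group those by how the plan had changed. The root cause turned out to be that histogram statistics had not been gathered. After gathering histograms again, only 158 statements were worse, and when actually executed only a few had real problems, which I fixed with SPM (SQL Plan Management). By the time this was done it was already well into September and the original date was hopeless. The upgrade was rescheduled to October 14, and I still had to push development and testing to finish quickly.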
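The classification idea described above (separate the regressed statements by whether the execution plan changed) can be sketched in a few lines. This is only an illustration, not the procedure the colleague actually used: the `SqlResult` record layout, the `triage` function, the 1.2 regression threshold, and the sample values are all made up for the example, and the comparison data would have to be exported from the SPA report by some other means.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SqlResult:
    """One statement from a before/after comparison (hypothetical record layout)."""
    sql_id: str
    plan_hash_before: int
    plan_hash_after: int
    elapsed_before_ms: float
    elapsed_after_ms: float


def triage(results, regress_ratio=1.2):
    """Group regressed statements by whether the execution plan changed."""
    buckets = defaultdict(list)
    for r in results:
        # Skip statements that did not get slower than the threshold allows.
        if r.elapsed_after_ms <= r.elapsed_before_ms * regress_ratio:
            continue
        changed = r.plan_hash_before != r.plan_hash_after
        buckets["plan_changed" if changed else "plan_same"].append(r.sql_id)
    return dict(buckets)


# Made-up sample rows, for illustration only.
sample = [
    SqlResult("a1", 111, 222, 10.0, 50.0),  # slower, plan changed
    SqlResult("b2", 333, 333, 10.0, 30.0),  # slower, same plan
    SqlResult("c3", 444, 555, 10.0, 9.0),   # plan changed but not slower
]
print(triage(sample))
```

Statements in the `plan_changed` bucket are the ones where missing statistics such as histograms are a natural first suspect; the `plan_same` bucket points at something else, for example I/O or data volume.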

After the National Day holiday, testing still was not finished, so, damn it, the date slipped again to October 21.

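As October 21 approached, everything that needed preparing was prepared. The one open issue was that each incremental backup took 3 hours, while the whole upgrade window was only 8 hours. I had noticed the problem and asked other people, and the answer was always that there was probably too much incremental data. I never analyzed it myself to check whether I had missed a step.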

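October 21 arrived. All the scripts and parameters were ready (except that I had not checked the size setting for the archive log area, a pit I will remember for the rest of my life). Everything went as before: the archive-log incremental still took 3 hours and produced 200 GB+, and transferring it to the target database and applying it took a full 4 hours. By then it was 4 a.m. The metadata import and the SPM and outline imports were still to do, and when they finished it was 6 a.m. Feeling short on time, I quickly got someone to set up GoldenGate (GG). That was the moment the archive log area filled up, and I, foolishly, did not notice. GG deployment was still under way, and only when the GG engineer reported a hang did I look and find the area full. By then it was too late: neither sqlplus nor rman could log in to the database. Killing the LOCAL=NO processes did not help. All I could do was kill the pmon process, change the parameter, and start the database again, but that did not help either. In the end there was no choice but to roll back the upgrade.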
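The timings in this account can be added up against the 8-hour window before the night starts. The sketch below does only that arithmetic. The 3-hour and 4-hour figures come from the text; the 2 hours for the imports is inferred from "4 a.m." to "6 a.m."; the step names and the `check_budget` helper are invented for the example.

```python
WINDOW_HOURS = 8.0  # upgrade window stated in the text

# (step, hours); the last duration is inferred from "4 a.m." to "6 a.m."
steps = [
    ("incremental archive backup", 3.0),
    ("transfer to target and apply", 4.0),
    ("metadata, SPM and outline import", 2.0),
]


def check_budget(steps, window_hours):
    """Return (total_hours, slack_hours); negative slack means an overrun."""
    total = sum(hours for _, hours in steps)
    return total, window_hours - total


total, slack = check_budget(steps, WINDOW_HOURS)
for name, hours in steps:
    print(f"{name:<36} {hours:4.1f} h")
print(f"total {total:.1f} h, window {WINDOW_HOURS:.1f} h, slack {slack:+.1f} h")
```

With these numbers the plan is already one hour over the window before GoldenGate deployment or any troubleshooting is counted.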


Handling and summary:

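1. Sat Oct 22 05:56:49 2016: the archive log area filled up at this point, but I did not notice and carried on with the following steps:

Restore the aq_tm_processes parameter to its default value

    alter system reset aq_tm_processes scope=spfile;    -- leave unset, keep the default value
    alter system set audit_trail = 'DB' scope=spfile;   -- takes effect after a restart

Put the 12c target database back into archivelog mode

    shutdown immediate;
    startup mount;
    alter database archivelog;
    alter database open;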

 

Sat Oct 22 05:56:49 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_arc3_8361.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 21474836480 bytes is 100.00% used, and has 0 remaining bytes available.
Sat Oct 22 05:56:49 2016
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
   then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
   BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
   reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
   system command was used to delete files, then use RMAN CROSSCHECK and
   DELETE EXPIRED commands.
************************************************************************
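The ORA-19815 message carries the configured limit in raw bytes, which is easy to misread at 6 a.m. A small sketch that pulls the numbers out of the line and converts the limit to GiB is given here. The regular expression and the `parse_fra_warning` name are my own, written to tolerate missing spaces in a pasted log; nothing in it talks to the database.

```python
import re

# Tolerates missing spaces after "WARNING:" and before "remaining".
ORA_19815 = re.compile(
    r"ORA-19815: WARNING:\s*db_recovery_file_dest_size of (\d+) bytes "
    r"is ([\d.]+)% used, and has (\d+)\s*remaining bytes available"
)


def parse_fra_warning(line):
    """Return (limit_bytes, percent_used, remaining_bytes), or None if no match."""
    m = ORA_19815.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))


line = (
    "ORA-19815: WARNING: db_recovery_file_dest_size of 21474836480 bytes "
    "is 100.00% used, and has 0 remaining bytes available."
)
limit, pct, remaining = parse_fra_warning(line)
limit_gib = limit / 2**30
print(f"limit = {limit_gib:.0f} GiB, used = {pct}%, remaining = {remaining} bytes")
```

21474836480 bytes is exactly 20 GiB, a fraction of the 1000G the parameter was later raised to.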

 

 

2. 2016-10-22 06:03:53: after the database restart completed there were no other errors, but the archive log area was still full.

Starting background process ARC0
Sat Oct 22 06:04:09 2016
ARC0 started with pid=38, OS id=17017
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
Sat Oct 22 06:04:09 2016
ARC0: STARTING ARCH PROCESSES
Starting background process ARC1
Starting background process ARC2
Sat Oct 22 06:04:09 2016
ARC1 started with pid=43, OS id=17035
Starting background process ARC3
Sat Oct 22 06:04:09 2016
ARC2 started with pid=44, OS id=17037
Sat Oct 22 06:04:09 2016
ARC3 started with pid=45, OS id=17039
ARC1: Archival started
ARC2: Archival started
Sat Oct 22 06:04:09 2016
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Sat Oct 22 06:04:09 2016
ARC2: Becoming the heartbeat ARCH
Sat Oct 22 06:04:09 2016
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Sat Oct 22 06:04:09 2016
Thread 1 opened at log sequence 2260
  Current log# 4 seq# 2260 mem# 0: +DATA_VIP_DG/VIP/ONLINELOG/group_4.261.923613797
  Current log# 4 seq# 2260 mem# 1: +FRA_VIP_DG/VIP/ONLINELOG/group_4.259.923613797
Successful open of redo thread 1
Sat Oct 22 06:04:09 2016
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sat Oct 22 06:04:09 2016
SMON: enabling cache recovery
Sat Oct 22 06:04:09 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_arc1_17035.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 21474836480 bytes is 100.00% used, and has 0 remaining bytes available.
Sat Oct 22 06:04:09 2016
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
   then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
   BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
   reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
   system command was used to delete files, then use RMAN CROSSCHECK and
   DELETE EXPIRED commands.
************************************************************************
Sat Oct 22 06:04:09 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_arc1_17035.trc:
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 348127232 bytes disk space from 21474836480 limit
ARC1: Error 19809 Creating archive log file to '+FRA_VIP_DG'
Sat Oct 22 06:04:09 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_arc3_17039.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 21474836480 bytes is 100.00% used, and has 0 remaining bytes available.
Sat Oct 22 06:04:09 2016
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
   then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
   BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
   reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
   system command was used to delete files, then use RMAN CROSSCHECK and
   DELETE EXPIRED commands.
************************************************************************
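The warning was sitting in the alert log from 05:56:49, well before anyone looked. A rough sketch of a scan that records when each ORA code first appears is shown here. It assumes the classic alert-log layout seen in these excerpts, where a timestamp line such as `Sat Oct 22 06:04:09 2016` precedes the messages it applies to, and it assumes an English locale for day and month names. The `first_seen` helper and the shortened `excerpt` are mine.

```python
import re
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sat Oct 22 06:04:09 2016"
ORA_CODE = re.compile(r"\bORA-(\d{5})\b")


def first_seen(alert_lines):
    """Map each ORA code to the timestamp of the block where it first appears."""
    current_ts = None
    seen = {}
    for raw in alert_lines:
        line = raw.strip()
        try:
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, treat it as a message
        for code in ORA_CODE.findall(line):
            seen.setdefault(f"ORA-{code}", current_ts)
    return seen


# Shortened excerpt built from the log lines quoted in this post.
excerpt = """\
Sat Oct 22 05:56:49 2016
ORA-19815: WARNING: db_recovery_file_dest_size of 21474836480 bytes is 100.00% used, and has 0 remaining bytes available.
Sat Oct 22 06:04:09 2016
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 348127232 bytes disk space from 21474836480 limit
""".splitlines()

for code, ts in sorted(first_seen(excerpt).items()):
    print(code, ts)
```

Run periodically during the upgrade window, a check like this would have surfaced ORA-19815 at 05:56:49, more than twenty minutes before the GG hang at 06:18:32.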

 

 

3. 2016-10-22 06:08:22: started deploying the GG link.

 

4. 2016-10-22 06:18:32: the GG connection hung and the GG deployment made no progress.

 

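5. 2016-10-22 06:18:32 to 2016-10-22 06:54:22: started checking the database and found the archive log area full. Logins through rman and sqlplus hung. I tried killing the LOCAL=NO processes, but rman and sqlplus logins still hung afterwards.

     Login commands: sqlplus '/ as sysdba'

                     rman target /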

 

6. Sat Oct 22 06:54:58 2016: killed the pmon process, started the database, and changed the db_recovery_file_dest_size parameter to 1000G.

 

7. 06:58:07: after the restart the alert log showed no errors, so GG deployment continued, but the GG command failed (Sat Oct 22 07:02 2016).

 

Errors in the alert log:

Sat Oct 22 06:58:07 2016
SERVER COMPONENT id=UTLRP_BGN: timestamp=2016-10-22 06:58:07
SERVER COMPONENT id=UTLRP_END: timestamp=2016-10-22 06:58:15
Sat Oct 22 07:01:13 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_smon_36312.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
Sat Oct 22 07:06:16 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_smon_36312.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368

 

Feedback from XX (Sat Oct 22 07:02 2016)

GGSCI (cnsz081500) 2> dblogin userid ggmgr, password oracle
ERROR: Unable to connect to database using user ggmgr. Please check privileges.
ORA-00604: error occurred at recursive SQL level 2
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB7F3BC368.

 

 

 

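8. Sat Oct 22 07:06:43 2016: restarted the database again. After two minutes of observation the database looked fine, so GG link deployment continued, but it still reported ORA-04024.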

Sat Oct 22 07:09:57 2016
SERVER COMPONENT id=UTLRP_BGN: timestamp=2016-10-22 07:09:57
SERVER COMPONENT id=UTLRP_END: timestamp=2016-10-22 07:10:05
Sat Oct 22 07:11:45 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_smon_21325.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB56CA8800
Sat Oct 22 07:16:47 2016
Errors in file /paic/app/oracle/rdbms/diag/rdbms/vip/vip/trace/vip_smon_21325.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04024: self-deadlock detected while trying to mutex pin cursor 0xB56CA8800
Sat Oct 22 07:21:35 2016
Beginning log switch checkpoint up to RBA [0x8dd.2.10], SCN: 10294877424726
Sat Oct 22 07:21:35 2016
Thread 1 advanced to log sequence 2269 (LGWR switch)
  Current log# 1 seq# 2269 mem# 0: +DATA_VIP_DG/VIP/ONLINELOG/group_1.256.923613795
  Current log# 1 seq# 2269 mem# 1: +FRA_VIP_DG/VIP/ONLINELOG/group_1.262.923613795

 

9. Sat Oct 22 07:13: rolled back the vip upgrade.


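A final summary:

1. I did not estimate the time needed accurately enough.

2. I was too careless. I trusted that whoever built the database had set the parameters according to the standard, and I did not compare every parameter myself.

3. Above all, I had not fully changed roles. I used to work on development and test databases and now I work on production. On a development or test database you can do more or less what you like, and a skipped step does little harm. A production database tolerates no oversight at all.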
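The second point, comparing every parameter instead of trusting the build, is mechanical enough to script. Below is a minimal sketch under stated assumptions: the `expected` baseline and the `actual` values are hand-written dictionaries (in practice `actual` might be exported from `v$parameter`, but that step is not shown), and `normalize` and `diff_parameters` are names invented for this example. Size-like values are converted to bytes so that `1000G` and a raw byte count compare correctly.

```python
import re

_SIZE = re.compile(r"^(\d+)([KMGT]?)$", re.IGNORECASE)
_UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def normalize(value):
    """Turn size-like values ('20G', '21474836480') into bytes; lowercase the rest."""
    text = str(value).strip()
    m = _SIZE.match(text)
    if m:
        return int(m.group(1)) * _UNITS[m.group(2).upper()]
    return text.lower()


def diff_parameters(expected, actual):
    """Return {name: (expected_value, actual_value)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        got = actual.get(name)  # None means the parameter was not found
        if got is None or normalize(got) != normalize(want):
            mismatches[name] = (want, got)
    return mismatches


# Hypothetical baseline; 1000G is the value set after the incident.
expected = {"db_recovery_file_dest_size": "1000G", "audit_trail": "DB"}
# Values as they were that night; 21474836480 is from the ORA-19815 message.
actual = {"db_recovery_file_dest_size": "21474836480", "audit_trail": "db"}

for name, (want, got) in diff_parameters(expected, actual).items():
    print(f"{name}: expected {want}, found {got}")
```

A check like this only catches what is in the baseline, so the baseline itself has to be reviewed against the site standard before the upgrade night.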

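I need to correct my slapdash way of thinking. DBA is a cautious and rigorous job: you have to understand the meaning and the result of every command, and with a database no carelessness is acceptable. It was a painful lesson, and it taught me that I cannot keep sheltering under a big tree. I have to carry the responsibility on my own.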
