Long-lasting table lock caused by gc current request

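Problem description: in a data warehouse system, table locks appear during long-running batch jobs. Occasionally the locking is severe enough to disrupt the whole batch flow, which in turn affects business operations the next day. The business side asked for an optimization. Goal: reduce lock time as much as possible (the minimum requirement from the business staff is that the overall batch flow is not affected).
Business requirement analysis: find the cause of the severe locking and fix it so that the overall batch flow is no longer affected.
DBA approach: symptom => AWR/ASH/ADDM reports and monitoring tools (Foglight/BMC) => analyze the results against experience => combine the symptom with the underlying technical principles => determine the root cause and the solution => observe over time to confirm the problem is solved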

1. System environment
SQL> select * from gv$version;


   INST_ID BANNER
---------- --------------------------------------------------------------------------------
         2 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
         2 PL/SQL Release 11.2.0.4.0 - Production
         2 CORE 11.2.0.4.0      Production
         2 TNS for IBM/AIX RISC System/6000: Version 11.2.0.4.0 - Production
         2 NLSRTL Version 11.2.0.4.0 - Production
         1 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
         1 PL/SQL Release 11.2.0.4.0 - Production
         1 CORE 11.2.0.4.0      Production
         1 TNS for IBM/AIX RISC System/6000: Version 11.2.0.4.0 - Production
         1 NLSRTL Version 11.2.0.4.0 - Production


10 rows selected.


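2. Confirming the time of the failure
Troubleshooting: let the data speak.
Here we use Foglight (DB monitoring software) to capture the data (CPU/IO/MEMORY peaks) and pin down the time of the failure.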


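2.1 Using wait events as the base data
7-day statistics:
(screenshot omitted)

Time window on September 13 in which the business staff reported the failure:
(screenshot omitted)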

==After DBA analysis, the windows 00:00:00--01:00:00 and 03:00:00--04:00:00 were selected for analysis==
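Before generating AWR reports for these windows, the matching snapshot IDs have to be found. The following is a minimal sketch, assuming the default hourly AWR snapshots; the date and time range come from the article, and the query itself is not part of the original analysis.

```sql
-- Sketch: list the AWR snapshots that overlap 00:00-04:00 on 2017-09-13.
-- Assumption: snapshots are taken hourly on both instances.
select snap_id,
       instance_number,
       to_char(begin_interval_time, 'yyyy-mm-dd hh24:mi:ss') as begin_time,
       to_char(end_interval_time,   'yyyy-mm-dd hh24:mi:ss') as end_time
from   dba_hist_snapshot
where  end_interval_time   > timestamp '2017-09-13 00:00:00'
and    begin_interval_time < timestamp '2017-09-13 04:00:00'
order  by snap_id, instance_number;
```

The begin and end snap_id pairs returned for each instance can then be supplied to the per-instance AWR report script ($ORACLE_HOME/rdbms/admin/awrrpti.sql).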


3. Analysis based on data from the AWR performance tool
AWR report for 00:00:00--01:00:00
3.1 Table locking is itself a kind of wait event, so go straight to the events section


==node A==
Top 10 Foreground Events by Total Wait Time


Event                           Waits    Total Wait Time (sec)  Wait Avg(ms)  % DB time  Wait Class
------------------------------  -------  ---------------------  ------------  ---------  -------------
enq: CF - contention            3,651    13.7K                  3746          57.9       Other
enq: TX - row lock contention   51       3610.8                 70800         15.3       Application
direct path read                10,652   1705.3                 160           7.2        User I/O
db file scattered read          35,561   1056.8                 30            4.5        User I/O
direct path write               22,929   989.5                  43            4.2        User I/O
DB CPU                                   530.4                                2.2
direct path read temp           18,124   270.6                  15            1.1        User I/O
db file sequential read         46,447   264.3                  6             1.1        User I/O
enq: HW - contention            3,241    199.1                  61            .8         Configuration
gc current block busy           3,794    164.5                  43            .7         Cluster


==node B==
Top 10 Foreground Events by Total Wait Time


Event                           Waits    Total Wait Time (sec)  Wait Avg(ms)  % DB time  Wait Class
------------------------------  -------  ---------------------  ------------  ---------  -------------
enq: CF - contention            7,114    6801.1                 956           30.9       Other
direct path write               102,606  4393.3                 43            19.9       User I/O
direct path read                28,670   4047.9                 141           18.4       User I/O
db file scattered read          91,564   2344.5                 26            10.6       User I/O
DB CPU                                   1296                                 5.9
gc current request              1        383.1                  383095        1.7        Cluster
db file parallel read           3,286    284.7                  87            1.3        User I/O
db file sequential read         86,524   247                    3             1.1        User I/O
CSS initialization              298      222.5                  747           1.0        Other
control file parallel write     31,876   148.7                  5             .7         System I/O


Performance Degradation as a Result of 'enq: CF - contention' (Doc ID 1072417.1)
Waits for 'Enq: TX - ...' Type Events - Transaction (TX) Lock Example Scenarios (Doc ID 62354.1)


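Main problems:
1. enq: CF - contention. Inference: redo is generated in large volumes and very frequently.
2. enq: TX - row lock contention. Inference: an application design problem (long transactions, no batched commits).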


4. Verifying the inferences
Check the wait events
00:00:00--01:00:00 EVENT TOP 10
INSTANCE_NUMBER EVENT                                           CNT
--------------- ---------------------------------------- ----------
              1 enq: CF - contention                           1338
              2 enq: CF - contention                            690
              2 direct path write                               382
              1 enq: TX - row lock contention                   360
              2 direct path read                                321
              2 db file scattered read                          251
              1 db file scattered read                           94
              1 direct path read                                 65
              2 PX Deq: Table Q Get Keys                         64
==In this window transactions were still running, but redo log related waits were severe.


00:00:00--08:00:00 EVENT TOP 10
(08:00:00 is the point at which the batch flow was running normally again after the session locking the table was killed)
INSTANCE_NUMBER EVENT                                           CNT
--------------- ---------------------------------------- ----------
              1 enq: TX - row lock contention                  2878
              2 gc current request                             2171
              2 direct path read                               2005
              1 direct path read                               1827
              1 enq: CF - contention                           1338
              2 Disk file operations I/O                        900
              1 Disk file operations I/O                        888
              2 enq: CF - contention                            690
              1 db file sequential read                         472


9 rows selected.
==In this window almost no transactions were making progress; the time was taken up almost entirely by TX and gc current request waits.
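The two EVENT TOP 10 listings above are shown without the statement that produced them. A query along the following lines would yield output of the same shape; it is a sketch under the assumption that the counts are ASH samples from dba_hist_active_sess_history, not the author's exact statement.

```sql
-- Sketch: most frequently sampled wait events per instance in a time window.
-- Each row in dba_hist_active_sess_history stands for roughly 10 seconds
-- of session activity, so CNT * 10 approximates seconds spent waiting.
select *
from  (select instance_number,
              event,
              count(*) as cnt
       from   dba_hist_active_sess_history
       where  sample_time >= timestamp '2017-09-13 00:00:00'
       and    sample_time <  timestamp '2017-09-13 01:00:00'
       and    event is not null          -- skip samples that were on CPU
       group  by instance_number, event
       order  by cnt desc)
where rownum < 10;
```

Changing the upper bound to timestamp '2017-09-13 08:00:00' covers the second listing. The range predicate on sample_time is used instead of to_char() on the column so that the filter stays simple to adjust.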


00:00:00--01:00:00
==node A==
Load Profile


                    Per Second    Per Transaction  Per Exec  Per Call
DB Time(s):         6.6           3.2              0.40      0.17
DB CPU(s):          0.2           0.1              0.01      0.00
Redo size (bytes):  1,669,685.0   814,982.5


==node B==
Load Profile


                    Per Second    Per Transaction  Per Exec  Per Call
DB Time(s):         6.1           3.1              0.24      0.17
DB CPU(s):          0.4           0.2              0.01      0.01
Redo size (bytes):  2,170,422.3   1,082,389.6


03:00:00--04:00:00
==node A==
Load Profile


                    Per Second    Per Transaction  Per Exec  Per Call
DB Time(s):         1.1           0.6              0.06      0.02
DB CPU(s):          0.0           0.0              0.00      0.00
Redo size (bytes):  329,441.6     183,969.1


==node B==
Load Profile


                    Per Second    Per Transaction  Per Exec  Per Call
DB Time(s):         1.0           0.6              0.10      0.03
DB CPU(s):          0.0           0.0              0.00      0.00
Redo size (bytes):  1,489.5       840.2


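Ratio of redo size generated per second, 00:00:00--01:00:00 versus 03:00:00--04:00:00 (taken from the Per Second column of the Load Profile):
node A: 1,669,685.0/329,441.6 == 5.1
node B: 2,170,422.3/1,489.5 == 1457.1
==Judging by the redo generated per second, transaction volume in the 3-4 window dropped sharply compared with the 0-1 window. Node B in particular was generating almost no redo, so essentially no transactions were completing there and the database was effectively in a HANG state.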
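The redo rate per snapshot interval can also be derived from AWR history directly, which makes the drop visible hour by hour instead of comparing two reports by hand. This is a sketch only: it assumes hourly snapshots and no instance restart inside the range (a restart resets the cumulative counter and would produce a negative delta).

```sql
-- Sketch: redo bytes per second for each AWR interval on each instance.
-- 'redo size' in dba_hist_sysstat is cumulative, so take the delta between
-- consecutive snapshots and divide by the interval length in seconds.
-- The first row per instance has no predecessor and shows NULL, which is
-- why the range starts one hour before midnight.
select s.instance_number,
       to_char(s.begin_interval_time, 'mm-dd hh24:mi') as begin_time,
       to_char(s.end_interval_time,   'mm-dd hh24:mi') as end_time,
       round(
         (st.value - lag(st.value) over (partition by st.dbid, st.instance_number
                                         order by st.snap_id))
         / ((cast(s.end_interval_time as date)
             - cast(s.begin_interval_time as date)) * 86400)
       ) as redo_bytes_per_sec
from   dba_hist_sysstat  st
join   dba_hist_snapshot s
on     s.dbid            = st.dbid
and    s.instance_number = st.instance_number
and    s.snap_id         = st.snap_id
where  st.stat_name = 'redo size'
and    s.end_interval_time   > timestamp '2017-09-12 23:00:00'
and    s.begin_interval_time < timestamp '2017-09-13 04:00:00'
order  by s.instance_number, s.snap_id;
```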


Next, find out who HUNG the whole database
00:00:00--01:00:00
enq: CF - contention =>
SQL_ID          COUNT(1)
------------- ----------
b9zh67r40t24q        461
9tw8kwwj97zub        353
c404bhf4rjaum        204
66pb0fp8jt1c4        197
5dj472vrff0k6        127
SQL_TEXT
--------------------------------------------------------------------------------
INSERT /*+append parallel 32*/ INTO TRANSACTION_DETAIL(AGREEMENT_SEQ,CHARGE_CDE,
INSERT /*+append parallel 32*/ INTO SAP_INTERMEDIATE(CONTRACT_TYP,F_ACCOUNT_CD,F
INSERT /*+append parallel 32*/ INTO DC_FILE_DETAIL_RECORD(ACCOUNT_NBR_TRACE,ACCO
INSERT /*+append parallel 32*/ INTO RECEIVABLE_PAID(ACCOUNT_NBR,ACTUAL_REALIZATI
INSERT /*+append parallel 32*/ INTO INCOME_RECOGNITION(AGREEMENT_SEQ,CONTRACT_ID
enq: TX - row lock contention =>
SQL_TEXT
--------------------------------------------------------------------------------
DELETE FROM use_resource_state WHERE resource_addr=:1 AND resource_type=:2 AND r
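The SQL_ID and SQL_TEXT listings above can be reproduced with a query of roughly this form. It is a sketch that assumes the counts are ASH samples; the 80-character cut of the statement text is chosen only to mimic the truncated output shown.

```sql
-- Sketch: top 5 statements sampled on a given wait event, with their text.
with top_sql as (
  select *
  from  (select dbid,
                sql_id,
                count(*) as cnt
         from   dba_hist_active_sess_history
         where  sample_time >= timestamp '2017-09-13 00:00:00'
         and    sample_time <  timestamp '2017-09-13 01:00:00'
         and    event = 'enq: CF - contention'
         and    sql_id is not null
         group  by dbid, sql_id
         order  by cnt desc)
  where rownum <= 5
)
select s.sql_id,
       s.cnt,
       dbms_lob.substr(t.sql_text, 80, 1) as sql_text   -- first 80 characters
from   top_sql s
left   join dba_hist_sqltext t
on     t.dbid   = s.dbid
and    t.sql_id = s.sql_id
order  by s.cnt desc;
```

Replacing the event name with 'enq: TX - row lock contention' gives the second listing.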


00:00:00--08:00:00
gc current request =>
SQL_TEXT
--------------------------------------------------------------------------------
insert  /*+append*/  into F_NS_PD_BMC_DETAIL(ADDRESS_ONE,ADDRESS_THREE,ADDRE
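==According to the business staff, that night they killed the session locking table F_NS_PD_BMC_DETAIL, and the database then ran normally.
==The DBA tools also point to this problem table, F_NS_PD_BMC_DETAIL, and the object is accompanied by the gc current request wait event.
So the gc current request wait event is worth analyzing.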
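To tie the wait event to a concrete object rather than relying on the SQL text alone, the ASH samples can be grouped by the object and block being requested. This is a hedged sketch: it assumes that CURRENT_OBJ# maps to DBA_OBJECTS.OBJECT_ID and that P1/P2 carry file# and block# for this event, as they do for other global cache events; for direct-path inserts the object number may be missing or point at a different segment than expected.

```sql
-- Sketch: which object and block the 'gc current request' samples refer to.
-- Assumptions: current_obj# = dba_objects.object_id; p1 = file#, p2 = block#.
select a.instance_number,
       o.owner,
       o.object_name,
       a.p1 as file_no,
       a.p2 as block_no,
       count(*) as samples
from   dba_hist_active_sess_history a
left   join dba_objects o
on     o.object_id = a.current_obj#
where  a.sample_time >= timestamp '2017-09-13 00:00:00'
and    a.sample_time <  timestamp '2017-09-13 08:00:00'
and    a.event = 'gc current request'
group  by a.instance_number, o.owner, o.object_name, a.p1, a.p2
order  by samples desc;
```

A single dominant file/block pair would suggest one stuck request rather than many short ones, which would fit the AWR figure of 1 wait with a very long average.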


Next, analyze what caused gc current request.
The line of analysis is:
A. When did the gc current request wait event start?
B. When did the gc current request wait event end?
C. Is this gc current request wait event related to a BUG?


A. When did the gc current request wait event start?
Start of gc current request
SQL> select * from (
  2  select INSTANCE_NUMBER,event,sql_id,to_char(SAMPLE_TIME,'yyyy-mm-dd hh24:mi:ss')
  3  from dba_hist_active_sess_history
  4  where to_char(SAMPLE_TIME,'yyyy-mm-dd')='2017-09-13'
  5  and event like 'gc current request'  order by SAMPLE_TIME) where rownum<20;


INSTANCE_NUMBER EVENT                                    SQL_ID        TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:54:26
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:54:36
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:54:46
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:54:56
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:06
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:16
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:26
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:36
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:46
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:55:56
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:06


INSTANCE_NUMBER EVENT                                    SQL_ID        TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:16
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:26
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:36
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:46
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:56:56
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:57:06
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:57:16
              2 gc current request                       fz486aazdcx0h 2017-09-13 00:57:26


B. When did the gc current request wait event end?
End of gc current request
SQL> select * from (
  2  select INSTANCE_NUMBER,event,sql_id,to_char(SAMPLE_TIME,'yyyy-mm-dd hh24:mi:ss')
  3  from dba_hist_active_sess_history
  4  where to_char(SAMPLE_TIME,'yyyy-mm-dd')='2017-09-13'
  5  and event like 'gc current request'  order by SAMPLE_TIME desc) where rownum<20;


INSTANCE_NUMBER EVENT                                    SQL_ID        TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:56:22
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:56:12
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:56:02
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:52
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:42
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:32
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:22
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:12
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:55:02
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:52
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:42


INSTANCE_NUMBER EVENT                                    SQL_ID        TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:32
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:22
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:12
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:54:02
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:53:52
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:53:42
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:53:32
              2 gc current request                       fz486aazdcx0h 2017-09-13 06:53:22


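==SQL_ID='fz486aazdcx0h' is waiting on gc current request in every 10-second ASH sample from 2017-09-13 00:54:26 until 2017-09-13 06:56:22, when the locking session was killed and the wait ended. That is about six hours of continuous waiting, consistent with the 2171 samples counted for this event on instance 2 above.
Clearly, insert  /*+append*/  into F_NS_PD_BMC_DETAIL (a main step of the batch) never completed, and this is tightly bound up with gc current request.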
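The start and end of the wait can also be read in one pass instead of paging through the first and last samples. A minimal sketch, using the same view and date as the two queries above:

```sql
-- Sketch: first and last ASH sample of 'gc current request' per statement.
-- samples * 10 approximates the seconds spent in the wait, because
-- dba_hist_active_sess_history keeps one sample every 10 seconds.
select instance_number,
       sql_id,
       to_char(min(sample_time), 'yyyy-mm-dd hh24:mi:ss') as first_seen,
       to_char(max(sample_time), 'yyyy-mm-dd hh24:mi:ss') as last_seen,
       count(*) as samples
from   dba_hist_active_sess_history
where  sample_time >= timestamp '2017-09-13 00:00:00'
and    sample_time <  timestamp '2017-09-14 00:00:00'
and    event = 'gc current request'
group  by instance_number, sql_id
order  by samples desc;
```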


Solving the gc current request problem should therefore solve the batch problem.


C. Is this gc current request wait event related to a BUG?
The root problem has been located: gc current request. The symptoms match the following note:
ETL process hangs when using an Oracle 11g RAC database (Doc ID 1996100.1)


Solution:
Option 1


Set the "_gc_read_mostly_locking" parameter to FALSE using the following steps:


    a. Stop STARETL Process.  (If running on linux, you can check the process by running "ps -ef | grep star".  If any process is found it can be terminated using the "kill" command)


    b. Connect Oracle Database Hosting The STARUSER schema as the "SYS" User.


    c. Backup the content of spfile to pfile and set _gc_read_mostly_locking to FALSE using the below commands:


SQL> create pfile='/oracle/product/base/oradb/admin/P6_test/adump/bakspfile.ora' from spfile;   
SQL> alter system set "_gc_read_mostly_locking"=FALSE scope=spfile;
    d. Restart Oracle Database hosting the STARUSER schema.


    e.  Rerun STARETL.
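Before and after the restart it is worth confirming what the parameter is actually set to on each instance. The sketch below is not part of the note quoted above: it reads the hidden parameter from the X$ fixed tables (which requires a SYS connection) and shows the RAC-wide form of the ALTER SYSTEM command with sid='*'. Hidden parameters should only be changed on the advice of Oracle Support.

```sql
-- Sketch: show the current value of the hidden parameter (run as SYS
-- on each instance, since X$ tables are local to the instance).
select i.ksppinm  as name,
       v.ksppstvl as value,
       i.ksppdesc as description
from   x$ksppi  i,
       x$ksppcv v
where  i.indx    = v.indx
and    i.ksppinm = '_gc_read_mostly_locking';

-- Sketch: write the change to the spfile for all RAC instances.
-- It takes effect only after the instances are restarted.
alter system set "_gc_read_mostly_locking" = false scope = spfile sid = '*';
```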






After 7 days of observation, the batch flow ran normally and never got stuck again. The basic requirement of the business staff has been met.






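A DBA should not treat "the symptom went away" as the finish line. There are in fact problems in the architecture and in the application as well, so the investigation continued.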

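Quite a few other problems turned up during the analysis.
1. Severe lock contention. RAC parallelism is used poorly and the application's partitioning across databases and tables is flawed: the same business module (insert into) inserts on both nodes at the same time.
2. Severe contention on log files: the bulk inserts are not split into batches, which generates excessive redo log and archive log.
3. The long-lasting enq: TX - row lock contention turned out to come from a single program module. It is hard-coded to run the statement below and never commits, which is a rather bizarre piece of code.
DELETE FROM use_resource_state WHERE resource_addr=:1 AND resource_type=:2 AND ...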
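While such a lock is being held, the uncommitted session can be identified across both RAC nodes from gv$session. This is a sketch for live diagnosis and was not part of the original investigation; an INACTIVE blocker with a large last_call_et is the typical signature of a program that issued DML and never committed.

```sql
-- Sketch: sessions waiting on a TX row lock and the session blocking them.
-- For an INACTIVE blocker, last_call_et is the number of seconds since
-- it last did any work.
select w.inst_id,
       w.sid,
       w.serial#,
       w.sql_id,
       w.seconds_in_wait,
       w.blocking_instance,
       w.blocking_session,
       b.status       as blocker_status,
       b.last_call_et as blocker_last_call_et,
       b.program      as blocker_program
from   gv$session w
left   join gv$session b
on     b.inst_id = w.blocking_instance
and    b.sid     = w.blocking_session
where  w.event = 'enq: TX - row lock contention'
order  by w.seconds_in_wait desc;
```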

########################################################################################
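All rights reserved. This article may be reposted, but the source address must be credited with a link; otherwise legal liability will be pursued. [QQ group: 53993419]
QQ:14040928 E-mail:[email protected]
Article link: http://blog.itpub.net/26442936/viewspace-2145624/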
########################################################################################
