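A key customer in the Guiyang area saw performance drop sharply after applying a ZLHIS service pack (SP) upgrade. Operations on medical orders (creating and modifying them) were especially slow. On-site staff also reported that even a single-row UPDATE against the medical order table was very slow. The situation had lasted for more than an hour and was seriously affecting the system; order-related work had essentially come to a halt.
From the phone conversation, it sounded like a "table-level lock" problem in which transactions could not acquire the TM lock. On-site staff extracted the AWR report for the period, and I did a quick analysis: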
| DB Name | DB Id | Instance | Inst num | Release | RAC | Host |
|---|---|---|---|---|---|---|
| ORCL | 1160490627 | orcl | 1 | 10.2.0.1.0 | NO | ZYHOSPIT-C55630 |

| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | 38900 | 17-2月 -12 08:01:04 | 277 | 59.6 |
| End Snap: | 38901 | 17-2月 -12 09:00:52 | 414 | 61.4 |
| Elapsed: | | 59.81 (mins) | | |
| DB Time: | | 1,768.12 (mins) | | |
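DB time is nearly 30 times the elapsed time (1,768.12 / 59.81 ≈ 29.6), so system performance was very poor. Next, look at the Top 5 Timed Events to see what DB time mainly consists of: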
| Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
|---|---|---|---|---|---|
| enq: TM - contention | 25,550 | 75,077 | 2,938 | 70.8 | Application |
| CPU time | | 16,319 | | 15.4 | |
| db file sequential read | 1,987,166 | 8,702 | 4 | 8.2 | User I/O |
| db file scattered read | 1,386,286 | 2,933 | 2 | 2.8 | User I/O |
| enq: TX - row lock contention | 411 | 1,094 | 2,662 | 1.0 | Application |
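The enq: TM - contention wait accounts for 70.8% of total DB time, with an average wait of 2,938 ms (about 2.94 s), so there is severe TM enqueue waiting. The next step is to query the v$ views with a script.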
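The figures quoted from the AWR report can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers shown in the report; the variable names are chosen for illustration and are not part of AWR.

```python
# Figures copied from the AWR report in this article.
elapsed_min = 59.81      # Elapsed (mins)
db_time_min = 1768.12    # DB Time (mins)
tm_waits = 25_550        # enq: TM - contention, Waits
tm_wait_s = 75_077       # enq: TM - contention, Time(s)

# DB time divided by elapsed time: the average number of sessions
# that were active (on CPU or waiting) at any moment in the interval.
avg_active_sessions = db_time_min / elapsed_min

# Average duration of a single TM enqueue wait, in milliseconds.
avg_tm_wait_ms = tm_wait_s / tm_waits * 1000

# Share of total DB time spent in TM enqueue waits, in percent.
tm_pct_of_db_time = tm_wait_s / (db_time_min * 60) * 100

print(f"DB time / elapsed : {avg_active_sessions:.1f}")
print(f"avg TM wait (ms)  : {avg_tm_wait_ms:.0f}")
print(f"TM share of DB time (%): {tm_pct_of_db_time:.1f}")
```

The three results agree with the "nearly 30 times", "Avg Wait(ms)" and "% Total Call Time" values in the report, which confirms that the TM enqueue alone explains most of the DB time.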
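A TM lock ensures that the structure of a table cannot change while its contents are being modified. For example, if you have updated a table, you acquire a TM lock on it. This prevents another user from executing DROP or ALTER on that table. If you hold a TM lock on a table and another user attempts DDL on it, that user gets the following error: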
drop table dept
*
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified
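If a transaction modifies several tables, it acquires a TM lock on each of them. The most common enqueue lock modes are 3 and 6, so which mode is held here? The following SQL answers that: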
Select Decode(Request, 0, 'Holder: ', 'Waiter: ') || Sid Sess, Id1, Id2, Lmode, Request, Type
From V$lock
Where (Id1, Id2, Type) In (Select Id1, Id2, Type From V$lock Where Request > 0)
Order By Id1, Request;
| # | SESS | ID1 | ID2 | LMODE | REQUEST | TYPE |
|---|---|---|---|---|---|---|
| 1 | Holder: 220 | 52074 | 0 | 3 | 0 | TM |
| 2 | Waiter: 224 | 52074 | 0 | 0 | 2 | TM |
| 3 | Waiter: 138 | 52074 | 0 | 0 | 2 | TM |
| 4 | Waiter: 125 | 52074 | 0 | 0 | 2 | TM |
| 5 | Waiter: 243 | 52074 | 0 | 0 | 2 | TM |
| 6 | Waiter: 401 | 52074 | 0 | 0 | 2 | TM |
| 7 | Waiter: 136 | 52074 | 0 | 0 | 2 | TM |
| 8 | Waiter: 506 | 52074 | 0 | 0 | 2 | TM |
| 9 | Waiter: 502 | 52074 | 0 | 0 | 2 | TM |
| 10 | Waiter: 61 | 52074 | 0 | 0 | 2 | TM |
| 11 | Waiter: 7 | 52074 | 0 | 0 | 2 | TM |
| 12 | Waiter: 99 | 52074 | 0 | 0 | 2 | TM |
| 13 | Waiter: 207 | 52074 | 0 | 0 | 2 | TM |
| 14 | Waiter: 491 | 52074 | 0 | 0 | 2 | TM |
| 15 | Waiter: 245 | 52074 | 0 | 0 | 2 | TM |
| 16 | Waiter: 140 | 52074 | 0 | 0 | 3 | TM |
| 17 | Waiter: 150 | 52074 | 0 | 0 | 3 | TM |
| 18 | Waiter: 66 | 52074 | 0 | 0 | 3 | TM |
| 19 | Waiter: 116 | 52074 | 0 | 0 | 3 | TM |
| 20 | Waiter: 132 | 52074 | 0 | 0 | 3 | TM |
| 21 | Waiter: 106 | 52074 | 0 | 0 | 3 | TM |
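The holder has a TM lock in mode 3, while the waiters are requesting modes 2 and 3. The ID1 column of the enqueue identifies the table's object_id; querying dba_objects shows that OBJECT_ID 52074 is exactly the 病人医嘱记录 (patient medical order record) table.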
Wait for TM Enqueue in Mode 3
Unindexed foreign key columns are the primary cause of TM lock contention in mode 3. However, this only applies to databases prior to Oracle9i Database. Depending on the operation, when foreign key columns are not indexed, Oracle either takes up a DML share lock (S – mode 4) or share row exclusive lock (SRX – mode 5) on the child table whenever the parent key or row is modified. (The share row exclusive lock is taken on the child table when the parent row is deleted and the foreign key constraint is created with the ON DELETE CASCADE option. Without this option, Oracle takes the share lock.) The share lock or share row exclusive lock on the child table prohibits other processes from getting a row exclusive lock (RX—mode 3) on the table. The waiting session will wait until the blocking session commits or rolls back its transaction.
Here is a philosophical question for you: Are you going to start building new indexes for all the foreign key columns in your databases? DBAs are divided on this. Our take is that you should hold your horses and don’t get carried away building new indexes just yet. If you do, you will introduce many new indexes to the database, some that are unnecessary. For example, you don’t need to create new indexes on foreign key columns when the parent tables they reference are static. You only need to create indexes on foreign key columns of the child table that is being identified by the enqueue wait event. The object ID for the child table is recorded in the P2 column, which corresponds to the ID1 column of the V$LOCK view. Query the DBA_OBJECTS view using the object ID and you will see the name of the child table. Yes, you will be operating in reactive mode, but it beats creating unnecessary indexes in the database, which not only wastes storage and increases maintenance, but may open up another can of worms for SQL tuning.
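The gist of this passage: unindexed foreign key columns are the main cause of TM lock contention in mode 3, although this applies only to databases before 9i. Depending on the operation, when a foreign key column is not indexed, Oracle takes a DML share lock or a share row exclusive lock on the child table whenever the parent key or parent row is modified. That lock on the child table prevents other sessions from obtaining a row exclusive lock on the table, and those sessions keep waiting until the blocking session commits or rolls back its transaction.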
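The blocking described in the passage follows from Oracle's table lock compatibility rules. As a minimal sketch, the rules can be written down as a lookup table; the mode numbers are the ones shown in the LMODE and REQUEST columns of V$LOCK, while `COMPATIBLE` and `can_grant` are names made up for this illustration, and the function models only mode compatibility, not Oracle's enqueue queueing.

```python
# Oracle table (TM) lock modes, numbered as in V$LOCK.LMODE / REQUEST.
RS, RX, S, SRX, X = 2, 3, 4, 5, 6  # row share, row exclusive, share,
                                   # share row exclusive, exclusive

# For each held mode: the modes another session may hold at the same time.
COMPATIBLE = {
    RS:  {RS, RX, S, SRX},
    RX:  {RS, RX},
    S:   {RS, S},
    SRX: {RS},
    X:   set(),
}

def can_grant(held_modes, requested):
    """Return True if `requested` is compatible with every mode in `held_modes`."""
    return all(requested in COMPATIBLE[held] for held in held_modes)

# Ordinary DML sessions (RX) do not block each other at the table level.
print(can_grant([RX], RX))
# A share lock or share row exclusive lock on the child table blocks DML (RX).
print(can_grant([S], RX))
print(can_grant([SRX], RX))
```

This is why a single session that takes mode 4 or mode 5 on a busy child table can stall every other session that needs mode 3 for its own inserts and updates.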
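Our database is Oracle 10g, so this explanation would not seem to apply to our case. Even so, we used the following SQL to find columns that have a foreign key but no index: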
SELECT TABLE_NAME,
       CONSTRAINT_NAME,
       CNAME1 || NVL2(CNAME2, ',' || CNAME2, NULL) ||
       NVL2(CNAME3, ',' || CNAME3, NULL) ||
       NVL2(CNAME4, ',' || CNAME4, NULL) ||
       NVL2(CNAME5, ',' || CNAME5, NULL) ||
       NVL2(CNAME6, ',' || CNAME6, NULL) ||
       NVL2(CNAME7, ',' || CNAME7, NULL) ||
       NVL2(CNAME8, ',' || CNAME8, NULL) COLUMNS
  FROM (SELECT B.TABLE_NAME,
               B.CONSTRAINT_NAME,
               MAX(DECODE(POSITION, 1, COLUMN_NAME, NULL)) CNAME1,
               MAX(DECODE(POSITION, 2, COLUMN_NAME, NULL)) CNAME2,
               MAX(DECODE(POSITION, 3, COLUMN_NAME, NULL)) CNAME3,
               MAX(DECODE(POSITION, 4, COLUMN_NAME, NULL)) CNAME4,
               MAX(DECODE(POSITION, 5, COLUMN_NAME, NULL)) CNAME5,
               MAX(DECODE(POSITION, 6, COLUMN_NAME, NULL)) CNAME6,
               MAX(DECODE(POSITION, 7, COLUMN_NAME, NULL)) CNAME7,
               MAX(DECODE(POSITION, 8, COLUMN_NAME, NULL)) CNAME8,
               COUNT(*) COL_CNT
          FROM (SELECT SUBSTR(TABLE_NAME, 1, 30) TABLE_NAME,
                       SUBSTR(CONSTRAINT_NAME, 1, 30) CONSTRAINT_NAME,
                       SUBSTR(COLUMN_NAME, 1, 30) COLUMN_NAME,
                       POSITION
                  FROM USER_CONS_COLUMNS) A,
               USER_CONSTRAINTS B
         WHERE A.CONSTRAINT_NAME = B.CONSTRAINT_NAME
           AND B.CONSTRAINT_TYPE = 'R'
         GROUP BY B.TABLE_NAME, B.CONSTRAINT_NAME) CONS
 WHERE COL_CNT > ALL
       (SELECT COUNT(*)
          FROM USER_IND_COLUMNS I
         WHERE I.TABLE_NAME = CONS.TABLE_NAME
           AND I.COLUMN_NAME IN (CNAME1, CNAME2, CNAME3, CNAME4, CNAME5,
                                 CNAME6, CNAME7, CNAME8)
           AND I.COLUMN_POSITION <= CONS.COL_CNT
         GROUP BY I.INDEX_NAME);
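This query uses the DECODE function to pivot rows into columns and so obtain the columns of each foreign key. In the result, the rows for the medical order table show that it does have unindexed foreign keys: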
| TABLE_NAME | CONSTRAINT_NAME | COLUMNS |
|---|---|---|
| 病人医嘱记录 | 病人医嘱记录_FK_前提ID | 前提ID |
| 病人医嘱记录 | 病人医嘱记录_FK_病人科室ID | 病人科室ID |
| 病人医嘱记录 | 病人医嘱记录_FK_开嘱科室ID | 开嘱科室ID |
| 病人医嘱记录 | 病人医嘱记录_FK_执行科室ID | 执行科室ID |
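The test that the dictionary query applies can be stated compactly: a foreign key counts as indexed only if some index on the same table has all of the foreign key columns among its leading columns. The sketch below restates that rule outside the database. The function name `is_fk_indexed` and the table, column, and index names in the sample data are hypothetical and chosen purely for illustration.

```python
def is_fk_indexed(fk_columns, indexes):
    """Check whether a foreign key is covered by the leading columns of an index.

    fk_columns: list of column names in the foreign key.
    indexes:    dict mapping index name -> ordered list of indexed columns.
    """
    n = len(fk_columns)
    wanted = set(fk_columns)
    # The first n columns of the index must be exactly the FK columns
    # (in any order), mirroring COLUMN_POSITION <= COL_CNT in the query.
    return any(set(cols[:n]) == wanted for cols in indexes.values())

# Hypothetical indexes on a hypothetical child table.
indexes = {
    "child_pk": ["id"],
    "child_ix_dept_parent": ["dept_id", "parent_id"],
}

print(is_fk_indexed(["dept_id"], indexes))     # leading column of an index
print(is_fk_indexed(["parent_id"], indexes))   # only a trailing column

# Adding a dedicated index on the foreign key column changes the answer.
indexes["child_ix_parent"] = ["parent_id"]
print(is_fk_indexed(["parent_id"], indexes))
```

A column that appears only in a trailing position of a composite index does not count, which is why the check looks at leading positions rather than at mere membership in some index.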
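The focus falls on 前提ID, because the other foreign key columns all reference the department table, which is a base table whose data rarely changes. 前提ID, by contrast, is a self-referencing foreign key rather than a simple parent-child foreign key. The constraint definition can be found in the upgrade script: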
ALTER TABLE 病人医嘱记录
ADD CONSTRAINT 病人医嘱记录_FK_前提ID
FOREIGN KEY (前提ID)
REFERENCES 病人医嘱记录(ID);
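前提ID references the table's own primary key column (ID). ID is almost never updated, but rows are inserted very frequently. Our testing showed that, even on 10g, a self-referencing foreign key constraint like this causes TM locks on the table when rows are updated or inserted; with no index on the foreign key column, even an INSERT into the table triggers the TM lock. The next step was therefore to create the index: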
CREATE INDEX 病人医嘱记录_IX_前提ID
ON 病人医嘱记录(前提ID)
PCTFREE 10
TABLESPACE zl9CisRec
online nologging;
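Because this is a production database, the index was created with the ONLINE option; NOLOGGING was added to minimize redo generation and speed up the build. If you use the PARALLEL option while building an index, remember to set the degree of parallelism back to 1 once the build finishes, so that later operations do not spawn large numbers of parallel processes.
Once the index was built, the affected operations returned to normal.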
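Summary: As the OWI explanation shows, not every foreign key needs an index. Whether to build one depends on how often the referenced primary key changes and on whether an index on the foreign key column would improve performance; the aim is to avoid creating a "zombie index" that is never or rarely used. In this case we did not index the foreign keys that reference the department table. As always, each problem has to be analyzed on its own terms rather than handled by rote. The case also shows that, even on 10g, a self-referencing foreign key whose referenced primary key changes frequently (inserts included) must be indexed.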