GoldenGate Director floods the primary database host with ggsci commands, keeping CPU usage high

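Shortly after 8 a.m. today I noticed that CPU usage on the primary database host was stuck at around 90%, at an hour when the database should be sitting idle with nothing to do.
topas showed that nearly all of the CPU was being consumed by ggsci processes.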

Topas Monitor for host:    bjsczjdbzsj01        EVENTS/QUEUES    FILE/TTY
Tue Mar 19 09:00:01 2013   Interval:  2         Cswitch   56687  Readch   197.5M
                                                Syscall  120.7K  Writech 4112.5K
CPU  User%  Kern%  Wait%  Idle%  Physc   Entc   Reads     32220  Rawin         0
ALL   82.2    5.6    1.2   11.0  11.63  116.3   Writes    13182  Ttyout      600
                                                Forks        23  Igets         0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs        23  Namei     13892
Total   741.4    786.6   444.0   563.1   178.4  Runqueue   33.5  Dirblk        0
                                                Waitqueue   1.0
Disk    Busy%     KBPS     TPS KB-Read KB-Writ                   MEMORY
Total   100.0     48.4K   10.1K   47.4K  938.6  PAGING           Real,MB   81920
                                                Faults    30321  % Comp     94
FileSystem        KBPS     TPS KB-Read KB-Writ  Steals    15501  % Noncomp   5
Total            187.4K   17.9K 187.3K 121.5    PgspIn        0  % Client    5
                                                PgspOut       0
Name            PID  CPU%  PgSp Owner           PageIn     9507  PAGING SPACE
ggsci      46006566   8.6  15.7 oracle          PageOut      47  Size,MB   16384
ocssd.bi    8388624   4.3 133.5 grid            Sios       9550  % Used     24
oracle     11403492   4.3  42.5 oracle                           % Free     76
ggsci      15925646   4.3  15.7 oracle          NFS (calls/sec)
ggsci      31392128   4.3  15.7 oracle          SerV2         0  WPAR Activ    0
ggsci       6095204   4.3  15.7 oracle          CliV2         0  WPAR Total    0
ggsci      23134254   4.3  15.7 oracle          SerV3         0  Press: "h"-help
ggsci      22151308   4.3  15.7 oracle          CliV3         0         "q"-quit
ggsci      21692764   4.3  15.7 oracle
ggsci      42467358   4.3  15.7 oracle
ggsci      21234052   4.3  15.7 oracle
ggsci      24052400   4.3  15.7 oracle
ggsci      35913764   4.3  15.7 oracle
ggsci        590450   4.3  15.7 oracle
ggsci      23199846   4.3  15.7 oracle
ggsci      12845086   4.3  15.7 oracle
ggsci      38469784   4.3  15.7 oracle
ggsci      24314114   4.3  15.7 oracle
ggsci      33620042   4.3  15.7 oracle
ggsci      33685546   4.3  15.7 oracle

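Two GoldenGate instances, ggs and ggsyy, are installed on the primary database host; one uses
port 7809 and the other port 7810. A while ago, after Oracle's own consultants deployed OEM 12c, an attempt to
install the GoldenGate plug-in failed and caused this same performance problem. At the time it was resolved by
stopping the OEM agent and disabling the plug-in's process parameters, so why has it come back?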

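ps -ef shows that the large number of ./ggsci commands were all launched from the ggsyy instance on port 7810;
only one came from port 7809 (that one is a session I opened myself for monitoring).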
bjsczjdbzsj01:/home/oracle/ggs$ps -ef | grep ggsci | grep PORT
  oracle 12845086        1  56 08:20:56      - 10:26 ./ggsci PORT 8000-8300 -m 7810 
  oracle 22151308        1  57 07:52:21      - 18:11 ./ggsci PORT 8000-8300 -m 7810 
  oracle 23134254        1  55 07:16:37      - 27:09 ./ggsci PORT 8000-8300 -m 7810 
  oracle 23199846        1  55 08:53:08      -  2:07 ./ggsci PORT 8000-8300 -m 7810 
  oracle 29098078        1  58 07:48:49      - 18:50 ./ggsci PORT 8000-8300 -m 7810 
  oracle 31457498        1  62 07:20:14      - 26:49 ./ggsci PORT 8000-8300 -m 7810 
  oracle 33620042        1  57 07:09:28      - 28:59 ./ggsci PORT 8000-8300 -m 7810 
  oracle 33685546        1  56 08:24:33      -  9:29 ./ggsci PORT 8000-8300 -m 7810 
  oracle 35913764        1  57 08:35:14      -  7:04 ./ggsci PORT 8000-8300 -m 7810 
  oracle 38469784        1  58 07:55:58      - 17:11 ./ggsci PORT 8000-8300 -m 7810 
  oracle 42467358        1  58 08:06:38      - 14:56 ./ggsci PORT 8000-8300 -m 7810 
  oracle 55967816        1  57 07:13:05      - 28:06 ./ggsci PORT 8000-8300 -m 7810 
  oracle  5767526        1  53 07:34:31      - 22:55 ./ggsci PORT 8000-8300 -m 7810 
  oracle  6095204        1  52 07:45:12      - 23:36 ./ggsci PORT 8000-8300 -m 7810 
  oracle  6357336        1  58 07:27:23      - 26:55 ./ggsci PORT 8000-8300 -m 7810 
  oracle 15925646        1  74 08:38:51      -  8:34 ./ggsci PORT 8000-8300 -m 7810 
  oracle 19399038        1  71 08:49:31      -  4:21 ./ggsci PORT 8000-8300 -m 7810 
  oracle 21234052        1  51 08:17:24      - 11:21 ./ggsci PORT 8000-8300 -m 7810 
  oracle 21692764        1  52 07:38:03      - 24:31 ./ggsci PORT 8000-8300 -m 7810 
  oracle 24314114        1  53 07:59:29      - 17:10 ./ggsci PORT 8000-8300 -m 7810 
  oracle 25952762        1  54 08:10:15      - 16:58 ./ggsci PORT 8000-8300 -m 7810 
  oracle 27984288        1  51 08:42:22      -  5:45 ./ggsci PORT 8000-8300 -m 7810 
  oracle 31392128        1  72 08:45:59      -  5:43 ./ggsci PORT 8000-8300 -m 7810 
  oracle 33161706        1  54 08:13:47      - 16:08 ./ggsci PORT 8000-8300 -m 7810 
  oracle 37683540        1  53 08:03:06      - 19:47 ./ggsci PORT 8000-8300 -m 7810 
  oracle 45678926        1  56 08:31:42      -  7:54 ./ggsci PORT 8000-8300 -m 7810 
  oracle 46006566        1  51 07:41:40      - 20:54 ./ggsci PORT 8000-8300 -m 7810 
  oracle   590450 28901990  57 08:56:40      -  1:10 ./ggsci PORT 8000-8300 -m 7810 
  oracle 10814004        1  58 07:30:54      - 23:59 ./ggsci PORT 8000-8300 -m 7810 
  oracle 11797206        1  67 07:23:46      - 26:36 ./ggsci PORT 8000-8300 -m 7810 
  oracle 16908944 28901990  58 09:00:17      -  0:11 ./ggsci PORT 8000-8300 -m 7810 
  oracle 24052400        1  60 08:28:05      -  8:36 ./ggsci PORT 8000-8300 -m 7810 
  oracle 36962936 30670938   0   Mar 16      -  1:59 ./ggsci PORT 7815-8000 -m 7809 
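
With this many processes it is easier to count them than to read the list. The following is a minimal sketch, assuming (as this post does) that the value after -m identifies the Manager port of the instance that spawned each ggsci; the function name count_ggsci_by_mgr_port is made up for illustration.

```shell
# count_ggsci_by_mgr_port (hypothetical helper): read `ps -ef` output on stdin
# and print "<manager port> <process count>" for every ggsci process that was
# started with "-m <port>". Lines without "-m" (e.g. the grep itself) are ignored.
count_ggsci_by_mgr_port() {
  awk '
    /ggsci/ {
      for (i = 1; i < NF; i++)
        if ($i == "-m") count[$(i + 1)]++
    }
    END { for (port in count) print port, count[port] }
  ' | sort -n
}

# Usage on the database host:
#   ps -ef | count_ggsci_by_mgr_port
```

Run against the listing above, this should report one process for port 7809 and all the others under port 7810.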
  
So the problem clearly lies with the ggsyy instance. Checking that instance's error log, I found the file had ballooned to 19 GB,
exactly as it did last time...

bjsczjdbzsj01:/home/oracle/ggsyy$ls -l ggs*log
-rw-r--r--    1 oracle   oinstall 19048559981 Mar 19 09:00 ggserr.log

tail -f shows that the GUI on host emserver1.em.com keeps sending ggsci commands to the ggsyy instance in an endless loop:
bjsczjdbzsj01:/home/oracle/ggsyy$tail -f ggserr.log
2013-03-19 08:55:04  INFO    OGG-01053  Oracle GoldenGate Capture for Oracle, pzjts_ts.prm:  Recovery completed for target file ./dirdat/yj000137, at RBA 1469.
2013-03-19 08:55:04  INFO    OGG-01057  Oracle GoldenGate Capture for Oracle, pzjts_ts.prm:  Recovery completed for all targets.
2013-03-19 08:56:40  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GUI on host emserver1.em.com:52978 (START GGSCI ).
2013-03-19 08:56:40  INFO    OGG-00976  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager started 'ggsci' process on port 0.
2013-03-19 08:56:41  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host loopback:39867 (REPORT 590450 8005).
2013-03-19 08:59:10  ERROR   OGG-01224  Oracle GoldenGate Command Interpreter for Oracle:  Bad file number.
2013-03-19 08:59:10  ERROR   OGG-01668  Oracle GoldenGate Command Interpreter for Oracle:  PROCESS ABENDING.
2013-03-19 09:00:17  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GUI on host emserver1.em.com:58712 (START GGSCI ).
2013-03-19 09:00:17  INFO    OGG-00976  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager started 'ggsci' process on port 0.
2013-03-19 09:00:18  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host loopback:10031 (REPORT 16908944 8018).
2013-03-19 09:03:49  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GUI on host emserver1.em.com:57373 (START GGSCI ).
2013-03-19 09:03:49  INFO    OGG-00976  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager started 'ggsci' process on port 0.
2013-03-19 09:03:50  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host loopback:14929 (REPORT 8192366 8020).
2013-03-19 09:07:26  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GUI on host emserver1.em.com:38624 (START GGSCI ).
2013-03-19 09:07:26  INFO    OGG-00976  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager started 'ggsci' process on port 0.
2013-03-19 09:07:27  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host loopback:25090 (REPORT 24838184 8039).
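
To see at a glance which remote host is driving these requests, the OGG-00963 lines can be grouped by originating host. This is only a sketch that relies on the message layout shown in the excerpt above ("Command received from GUI on host <host>:<port>"); the function name gui_requests_by_host is hypothetical.

```shell
# gui_requests_by_host (hypothetical helper): read ggserr.log lines on stdin
# and print "<count> <host>" for commands that Manager received from a GUI
# client, most active host first. The client port after ":" is stripped.
gui_requests_by_host() {
  awk '
    /Command received from GUI on host/ {
      for (i = 1; i < NF; i++)
        if ($i == "host") {
          h = $(i + 1)
          sub(/:[0-9]+$/, "", h)
          n[h]++
          break
        }
    }
    END { for (h in n) print n[h], h }
  ' | sort -rn
}

# Usage:
#   tail -100000 ggserr.log | gui_requests_by_host
```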

tail -200 shows that the log is full of output like the following; last time it was this same output that filled up the disk.
bjsczjdbzsj01:/home/oracle/ggsyy$tail -200 ggserr.log
2013-03-19 08:33:26  WARNING OGG-01930  Oracle GoldenGate Capture for Oracle, pzj_cx9.prm:  Datastore error in 'dirbdb': BDB0060 PANIC: fatal region error detected; run recovery.
2013-03-19 08:33:26  WARNING OGG-01930  Oracle GoldenGate Capture for Oracle, pzj_cx9.prm:  Datastore error in 'dirbdb': BDB0060 PANIC: fatal region error detected; run recovery.
2013-03-19 08:33:26  WARNING OGG-01930  Oracle GoldenGate Capture for Oracle, pzj_cx9.prm:  Datastore error in 'dirbdb': BDB0060 PANIC: fatal region error detected; run recovery.
2013-03-19 08:35:26  WARNING OGG-01930  Oracle GoldenGate Capture for Oracle, pcqstqz1.prm:  Datastore error in 'dirbdb': BDB0060 PANIC: fatal region error detected; run recovery.
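
When a log is this large, a per-message tally makes it clearer what is filling it. The sketch below assumes the ggserr.log layout seen in the excerpts in this post (date, time, severity, OGG code as the first four whitespace-separated fields); summarize_ogg_codes is a made-up name.

```shell
# summarize_ogg_codes (hypothetical helper): read ggserr.log lines on stdin
# and print "<count> <severity> <code>", most frequent message first.
# Lines whose fourth field is not an OGG-nnnnn code are skipped.
summarize_ogg_codes() {
  awk '
    $4 ~ /^OGG-[0-9]+$/ { n[$3 " " $4]++ }
    END { for (k in n) print n[k], k }
  ' | sort -rn
}

# Usage:
#   tail -200000 ggserr.log | summarize_ogg_codes | head
```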

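emserver1.em.com is the very machine on which the earlier OEM 12c installation failed. OEM 12c was later installed
on a new machine, and GoldenGate Director was then deployed on the WebLogic server already installed on that host
to monitor the GoldenGate processes. I therefore suspected that the ggsci commands above were being issued by
Director's monitoring. On the Director monitoring page, the ggsyy instance was marked with a red X and expanding it
showed no processes, meaning it had not been configured successfully, and it had only just been set up. Testing
connectivity to the ggsyy instance from the GoldenGate Director Admin Tool ended in a connection timeout. My
preliminary conclusion is that Director was issuing ggsci commands en masse and exhausting the CPU because the ggsyy
instance had not been configured successfully; the oversized log file may also have been hurting the database server's performance.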

Back up the log (only the last 50,000 lines are kept) and then empty it
bjsczjdbzsj01:/home/oracle/ggsyy$tail -50000 ggserr.log > ggserr.log_bak_20130319

bjsczjdbzsj01:/home/oracle/ggsyy$cat /dev/null > ggserr.log
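
The two commands above can be wrapped so that the log is emptied only if the backup was actually written. This is a sketch rather than the procedure used here: rotate_log_tail and its arguments are hypothetical, and it keeps the same "last N lines only" trade-off as the commands above.

```shell
# rotate_log_tail LOG BACKUP [LINES] (hypothetical helper):
# save the last LINES lines of LOG (default 50000) to BACKUP, then truncate
# LOG in place. LOG is left untouched if it is missing or the backup fails.
rotate_log_tail() {
  rlt_log=$1
  rlt_backup=$2
  rlt_lines=${3:-50000}

  [ -f "$rlt_log" ] || { echo "no such file: $rlt_log" >&2; return 1; }

  # Stop here if the backup could not be written (e.g. the disk is full).
  tail -n "$rlt_lines" "$rlt_log" > "$rlt_backup" || return 1

  # ": >" empties the file without deleting or recreating it.
  : > "$rlt_log"
}

# Usage:
#   rotate_log_tail ggserr.log ggserr.log_bak_20130319 50000
```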

After emptying it, nothing is written to the log file:
bjsczjdbzsj01:/home/oracle/ggsyy$tail -f ggserr.log
Try restarting the mgr process to see whether log output resumes.

GGSCI (prod.oracle.com) 1> stop mgr
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y

Sending STOP request to MANAGER ...
Request processed.
Manager stopped.

GGSCI (prod.oracle.com) 2> start mgr

Manager started.

2013-03-19 15:48:40  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): stop mgr.
2013-03-19 15:48:41  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host prod.oracle.com (STOP).
2013-03-19 15:48:41  WARNING OGG-00938  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager is stopping at user request.
2013-03-19 15:48:47  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): start mgr.
2013-03-19 15:48:47  INFO    OGG-00983  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager started (port 7809).

Logging works normally again, which shows that emptying the log was safe.

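After emptying the log, I reconfigured the connection to the ggsyy instance in GoldenGate Director, and the connection test succeeded.
On the Director monitoring page the ggsyy instance folder now shows green, and the status of its processes is displayed correctly.

Checking topas again, CPU usage on the host had dropped sharply.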


When reposting, please credit the author and source and link to the original article:


http://blog.csdn.net/xiangsir/article/details/8693767











