问题现象
P595服务器中有一个DB2数据库,在连接时,系统PAGING SPACE资源耗尽,从而宕机。
问题分析
问题初步分析
1.如果非计算内存太多,计算内存太少,有可能是这个问题导致计算内存不够用,使用到PAGING SPACE。可以使用vmo -L maxperm%命令来检查文件系统缓存参数的设置
2.如果maxperm%设置为20%左右(对于DB2环境来说,比较适合),那么是不是DB2参数/对象本身分配的内存过多造成计算内存不够用。需要仔细检查实例参数、数据库参数是否设置不当。
3.如果上述原因排除,那么分析db2mtrk得到的数据,分析DB2数据库在运行过程中是否分配了过多的计算内存(内存参数、缓冲池可能使用了AUTOMATIC设置;在这种情况下,内存是按需分配的,理论上可以无限大),如果是,那么分析是由于DB2实例、数据库占用的内存过多,还是代理进程(应用程序)占用的内存过多(比如使用动态数组,在死循环中不停的填充动态数组,导致动态数组无限增长,无论多少内存都能耗光),如果存在这个问题,那么继续分析.
可以使用如下命令跟踪DB2数据库使用内存的情况。
db2mtrk -i -d -p -r 300 500
>> db2mtrk.out
#每5分钟收集一次,收集500次
4.如果上述原因排除,那么结合ipcs得到的数据,分析AIX操作系统中除了DB2使用的内存段之外,还有没有其他的非DB2进程在使用大内存段占用过多的内存,如果有,那么继续分析是否是它导致的.
服务器内存使用
内存:8G
文件系统缓存设置:maxperm%=30
操作系统约消耗1G内存
所以留给DB2(实例、数据库、代理)使用的内存约为:8G-8G*30%-1G=4.5G
检查实例、数据库参数
Database Configuration for Database XXXX
Database configuration release level
= 0x0b00
Database release level
= 0x0b00
Database territory
= CN
Database code page
= 1386
Database code set
= GBK
Database country/region code
= 86
Database collating sequence
= UNIQUE
Alternate collating sequence
(ALT_COLLATE) =
Database page size
= 4096
Dynamic SQL Query management
(DYN_QUERY_MGMT) = DISABLE
Discovery support for this database
(DISCOVER_DB) = ENABLE
Restrict access
= NO
Default query optimization class
(DFT_QUERYOPT) = 9
Degree of parallelism
(DFT_DEGREE) = 1
Continue upon arithmetic exceptions
(DFT_SQLMATHWARN) = NO
Default refresh age
(DFT_REFRESH_AGE) = 0
Default maintained table types for opt (DFT_MTTB_TYPES) = SYSTEM
Number of frequent values retained
(NUM_FREQVALUES) = 10
Number of quantiles retained
(NUM_QUANTILES) = 20
Backup pending
= NO
Database is consistent
= YES
Rollforward pending
= NO
Restore pending
= NO
Multi-page file allocation enabled
= YES
Log retain for recovery status
= NO
User exit for logging status
= NO
Self tuning memory
(SELF_TUNING_MEM) = ON
Size of database shared memory (4KB)
(DATABASE_MEMORY) = AUTOMATIC
Database memory threshold
(DB_MEM_THRESH) = 10
Max storage for lock list (4KB)
(LOCKLIST) = AUTOMATIC
Percent. of lock lists per application
(MAXLOCKS) = AUTOMATIC
Package cache size (4KB)
(PCKCACHESZ) = AUTOMATIC
Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = AUTOMATIC
Sort list heap (4KB)
(SORTHEAP) = AUTOMATIC
Database heap (4KB)
(DBHEAP) = 4137
Catalog cache size (4KB)
(CATALOGCACHE_SZ) = 579
Log buffer size (4KB)
(LOGBUFSZ) = 108
Utilities heap size (4KB)
(UTIL_HEAP_SZ) = 524288
Buffer pool size (pages)
(BUFFPAGE) = 1000
Max size of appl. group mem set (4KB) (APPGROUP_MEM_SZ) = 40933
Percent of mem for appl. group heap
(GROUPHEAP_RATIO) = 70
Max appl. control heap size (4KB)
(APP_CTL_HEAP_SZ) = 128
SQL statement heap (4KB)
(STMTHEAP) = 4096
Default application heap (4KB)
(APPLHEAPSZ) = 1024
Statistics heap size (4KB)
(STAT_HEAP_SZ) = 4384
Interval for checking deadlock (ms)
(DLCHKTIME) = 10000
Lock timeout (sec)
(LOCKTIMEOUT) = 11
Changed pages threshold
(CHNGPGS_THRESH) = 80
Number of asynchronous page cleaners
(NUM_IOCLEANERS) = AUTOMATIC
Number of I/O servers
(NUM_IOSERVERS) = AUTOMATIC
Index sort flag
(INDEXSORT) = YES
Sequential detect flag
(SEQDETECT) = YES
Default prefetch size (pages)
(DFT_PREFETCH_SZ) = AUTOMATIC
Track modified pages
(TRACKMOD) = OFF
Default number of containers
= 1
Default tablespace extentsize (pages)
(DFT_EXTENT_SZ) = 32
Max number of active applications
(MAXAPPLS) = AUTOMATIC
Average number of active applications
(AVG_APPLS) = AUTOMATIC
Max DB files open per application
(MAXFILOP) = 64
Log file size (4KB)
(LOGFILSIZ) = 40960
Number of primary log files
(LOGPRIMARY) = 50
Number of secondary log files
(LOGSECOND) = 50
Changed path to log files
(NEWLOGPATH) =
Path to log files
=
Overflow log path
(OVERFLOWLOGPATH) =
Mirror log path
(MIRRORLOGPATH) =
First active log file
=
Block log on disk full
(BLK_LOG_DSK_FUL) = NO
Percent max primary log space by transaction
(MAX_LOG) = 0
Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0
Group commit count
(MINCOMMIT) = 1
Percent log file reclaimed before soft chckpt (SOFTMAX) = 2000
Log retain for recovery enabled
(LOGRETAIN) = OFF
User exit for logging enabled
(USEREXIT) = OFF
HADR database role
= STANDARD
HADR local host name
(HADR_LOCAL_HOST) =
HADR local service name
(HADR_LOCAL_SVC) =
HADR remote host name
(HADR_REMOTE_HOST) =
HADR remote service name
(HADR_REMOTE_SVC) =
HADR instance name of remote server
(HADR_REMOTE_INST) =
HADR timeout value
(HADR_TIMEOUT) = 120
HADR log write synchronization mode
(HADR_SYNCMODE) = NEARSYNC
First log archive method
(LOGARCHMETH1) = OFF
Options for logarchmeth1
(LOGARCHOPT1) =
Second log archive method
(LOGARCHMETH2) = OFF
Options for logarchmeth2
(LOGARCHOPT2) =
Failover log archive path
(FAILARCHPATH) =
Number of log archive retries on error
(NUMARCHRETRY) = 5
Log archive retry Delay (secs)
(ARCHRETRYDELAY) = 20
Vendor options
(VENDOROPT) =
Auto restart enabled
(AUTORESTART) = ON
Index re-creation time and redo index build
(INDEXREC) = SYSTEM (RESTART)
Log pages during index build
(LOGINDEXBUILD) = OFF
Default number of loadrec sessions
(DFT_LOADREC_SES) = 1
Number of database backups to retain
(NUM_DB_BACKUPS) = 12
Recovery history retention (days)
(REC_HIS_RETENTN) = 366
TSM management class
(TSM_MGMTCLASS) =
TSM node name
(TSM_NODENAME) =
TSM owner
(TSM_OWNER) =
TSM password
(TSM_PASSWORD) =
Automatic maintenance
(AUTO_MAINT) = ON
Automatic database backup
(AUTO_DB_BACKUP) = OFF
Automatic table maintenance
(AUTO_TBL_MAINT) = ON
Automatic runstats
(AUTO_RUNSTATS) = ON
Automatic statistics profiling
(AUTO_STATS_PROF) = OFF
Automatic profile updates
(AUTO_PROF_UPD) = OFF
Automatic reorganization
(AUTO_REORG) = OFF
Database Manager Configuration
Node type = Enterprise Server Edition with local and remote clients
Database manager configuration release level
= 0x0b00
CPU speed (millisec/instruction)
(CPUSPEED) = 2.676617e-07
Communications bandwidth (MB/sec)
(COMM_BANDWIDTH) = 1.000000e+02
Max number of concurrently active databases
(NUMDB) = 8
Federated Database System Support
(FEDERATED) = NO
Transaction processor monitor name
(TP_MON_NAME) =
Default charge-back account
(DFT_ACCOUNT_STR) =
Java Development Kit installation path
(JDK_PATH) = /home/sqllib/java/jdk64
Diagnostic error capture level
(DIAGLEVEL) = 3
Notify Level
(NOTIFYLEVEL) = 3
Diagnostic data directory path
(DIAGPATH) = /home/sqllib/db2dump
Default database monitor switches
Buffer pool
(DFT_MON_BUFPOOL) = ON
Lock
(DFT_MON_LOCK) = ON
Sort
(DFT_MON_SORT) = ON
Statement
(DFT_MON_STMT) = ON
Table
(DFT_MON_TABLE) = ON
Timestamp
(DFT_MON_TIMESTAMP) = ON
Unit of work
(DFT_MON_UOW) = OFF
Monitor health of instance and databases
(HEALTH_MON) = ON
SYSADM group name
(SYSADM_GROUP) = DB2GRP1
SYSCTRL group name
(SYSCTRL_GROUP) =
SYSMAINT group name
(SYSMAINT_GROUP) =
SYSMON group name
(SYSMON_GROUP) =
Client Userid-Password Plugin
(CLNT_PW_PLUGIN) =
Client Kerberos Plugin
(CLNT_KRB_PLUGIN) =
Group Plugin
(GROUP_PLUGIN) =
GSS Plugin for Local Authorization
(LOCAL_GSSPLUGIN) =
Server Plugin Mode
(SRV_PLUGIN_MODE) = UNFENCED
Server List of GSS Plugins
(SRVCON_GSSPLUGIN_LIST) =
Server Userid-Password Plugin
(SRVCON_PW_PLUGIN) =
Server Connection Authentication
(SRVCON_AUTH) = NOT_SPECIFIED
Database manager authentication
(AUTHENTICATION) = SERVER
Cataloging allowed without authority
(CATALOG_NOAUTH) = NO
Trust all clients
(TRUST_ALLCLNTS) = YES
Trusted client authentication
(TRUST_CLNTAUTH) = CLIENT
Bypass federated authentication
(FED_NOAUTH) = NO
Default database path
(DFTDBPATH) = /home
Database monitor heap size (4KB)
(MON_HEAP_SZ) = 90
Java Virtual Machine heap size (4KB)
(JAVA_HEAP_SZ) = 2048
Audit buffer size (4KB)
(AUDIT_BUF_SZ) = 0
Size of instance shared memory (4KB)
(INSTANCE_MEMORY) = AUTOMATIC
Backup buffer default size (4KB)
(BACKBUFSZ) = 1024
Restore buffer default size (4KB)
(RESTBUFSZ) = 1024
Sort heap threshold (4KB)
(SHEAPTHRES) = 0
Directory cache support
(DIR_CACHE) = YES
Application support layer heap size (4KB)
(ASLHEAPSZ) = 15
Max requester I/O block size (bytes)
(RQRIOBLK) = 32767
Query heap size (4KB)
(QUERY_HEAP_SZ) = 1000
Workload impact by throttled utilities(UTIL_IMPACT_LIM) = 10
Priority of agents
(AGENTPRI) = SYSTEM
Max number of existing agents
(MAXAGENTS) = 400
Agent pool size
(NUM_POOLAGENTS) = 400
Initial number of agents in pool
(NUM_INITAGENTS) = 0
Max number of coordinating agents
(MAX_COORDAGENTS) = (MAXAGENTS - NUM_INITAGENTS)
Max no. of concurrent coordinating agents
(MAXCAGENTS) = MAX_COORDAGENTS
Max number of client connections
(MAX_CONNECTIONS) = MAX_COORDAGENTS
Keep fenced process
(KEEPFENCED) = YES
Number of pooled fenced processes
(FENCED_POOL) = MAX_COORDAGENTS
Initial number of fenced processes
(NUM_INITFENCED) = 0
Index re-creation time and redo index build
(INDEXREC) = RESTART
Transaction manager database name
(TM_DATABASE) = 1ST_CONN
Transaction resync interval (sec)
(RESYNC_INTERVAL) = 180
SPM name
(SPM_NAME) =
SPM log size
(SPM_LOG_FILE_SZ) = 256
SPM resync agent limit
(SPM_MAX_RESYNC) = 20
SPM log path
(SPM_LOG_PATH) =
TCP/IP Service name
(SVCENAME) = 60009
Discovery mode
(DISCOVER) = SEARCH
Discover server instance
(DISCOVER_INST) = ENABLE
Maximum query degree of parallelism
(MAX_QUERYDEGREE) = 7
Enable intra-partition parallelism
(INTRA_PARALLEL) = YES
Maximum Asynchronous TQs per query
(FEDERATED_ASYNC) = 0
No. of int. communication buffers(4KB)(FCM_NUM_BUFFERS) = AUTOMATIC
No. of int. communication channels
(FCM_NUM_CHANNELS) = AUTOMATIC
Node connection elapse time (sec)
(CONN_ELAPSE) = 10
Max number of node connection retries (MAX_CONNRETRIES) = 5
Max time difference between nodes (min) (MAX_TIME_DIFF) = 60
db2start/db2stop timeout (min)
(START_STOP_TIME) = 10
DB2诊断日志分析
问题数据库版本为V9.1,DB2从版本9开始支持数据库内存自动调优。本数据库大部分内存参数也是使用AUTOMATIC方式由数据库引擎自动计算决定(由db2stmm进程进行实际调整),通过上面的从诊断日志中摘取的片段也可以看出来,编号①~⑤表示5个日志事件,日志①②表示db2stmm进程将IBMDEFAULTBP缓冲池根据需要调整了2次大小(第一次从1000调整到1512、第二次从1512调整到2280),日志③表示db2stmm进程将SORTHEAP从61调整到65,到这时为止,数据库的内存使用没有任何问题,问题出在日志④上,有人人为的将IBMDEFAULTBP缓冲池从5160一下子调整到了1300000——为什么说是人为的呢?因为这次调整不是由db2stmm进程完成,而是由db2agent代理程序(应用程序在数据库服务器中的替身)完成 ,而IBMDEFAULTBP缓冲池是在Database_memory总大小中分配的,而Database_memory被设置为AUTOMATIC,所以db2stmm进程必须先将Database_memory调整到足够大,以容纳IBMDEFAULTBP缓冲池(1300000*4K 大约为 5G),于是日志⑤便证明了我们的推断——db2stmm进程将Database_memory从789400 (789400*4K 约等于3G) 一下子调整到了2373371(2373371*4K 约等于9G),而操作系统留给DB2使用的内存只有4.5G左右,所以这时数据库进程将使用PAGING SPACE(也就是虚拟内存,相当于物理内存的扩展)来满足需要,一旦进程使用到了PAGING SPACE,那么操作系统将发生大量的PAGE IN/PAGE OUT操作,PAGE IN/OUT操作将导致系统震荡,响应缓慢,操作系统出于自我保护,甚至会杀死DB2服务进程,最严重的情况是:当PAGING SPACE也用完时,系统出于自我保护目的,会自动宕机。我们这个数据库由于最后PAGING SPACE也被用完,最终导致了系统宕机。
问题解决
通过以上的分析,我们知道IBMDEFAULTBP设置的太大了(1300000*4K),连接数据库时导致系统宕机。所以我们下面设法将IBMDEFAULTBP改小。
操作步骤:
db2set DB2_OVERRIDE_BPF=10000
#使IBMDEFAULTBP使用这个值
db2 terminate
db2stop
db2start
db2 connect to db
#连上db
db2 alter bufferpool ibmdefaultbp numblockpages 0
#原来的块SIZE太大,我们这里禁用它
db2 force applications all
db2 connect to db
db2 alter bufferpool ibmdefaultbp immediate size 50000
#将SIZE改小
db2set DB2_OVERRIDE_BPF= #设置为空,还原回去
db2 terminate
db2 force applications all
db2stop
db2start
OK!!!!