Part 1: Measuring system overhead
Script: top_level_waits.sql
rem **
rem
rem File: top_level_waits.sql
rem Description: Break down of top level WAITCLASS waits
rem
rem Summary: top-level wait events across the cluster
rem
rem
rem
rem
col time_cat format a20 heading "Time category"
col time_secs format 999,999.99 heading "Time (s)"
col pct format 99.99 heading "Time|pct"
set pagesize 10000
set lines 80
set echo on
SELECT wait_class time_cat, ROUND((time_secs), 2) time_secs,
ROUND((time_secs) * 100 / SUM(time_secs) OVER (), 2) pct
FROM (SELECT wait_class wait_class,
SUM(time_waited_micro) / 1000000 time_secs
FROM gv$system_event
WHERE wait_class <> 'Idle' AND time_waited > 0
GROUP BY wait_class
UNION
SELECT 'CPU', ROUND((SUM(VALUE) / 1000000), 2) time_secs
FROM gv$sys_time_model
WHERE stat_name IN ('background cpu time', 'DB CPU'))
ORDER BY time_secs DESC;
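Script notes:
When cluster wait time exceeds roughly 10% to 20% of total database time, the DBA should step in and investigate.
In the output, that means a pct value above 10% to 20% on the Cluster row.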
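As a rough sketch of how the 10% to 20% guideline could be checked directly, the query below reads the same views as top_level_waits.sql and returns only the Cluster row with a status flag. The WATCH and INVESTIGATE labels and the exact cut-offs of 10 and 20 are assumptions chosen for illustration. As in the script above, non-idle wait time plus CPU time stands in for total database time.

```sql
-- Sketch only: the status labels and the 10/20 cut-offs are illustrative assumptions.
WITH time_by_class AS (
  -- Non-idle wait time per wait class, cluster-wide (microseconds)
  SELECT wait_class time_cat, SUM(time_waited_micro) time_micro
    FROM gv$system_event
   WHERE wait_class <> 'Idle'
   GROUP BY wait_class
  UNION ALL
  -- CPU time, cluster-wide (microseconds)
  SELECT 'CPU', SUM(value)
    FROM gv$sys_time_model
   WHERE stat_name IN ('background cpu time', 'DB CPU')),
pct_by_class AS (
  -- Compute each category's share before filtering down to Cluster
  SELECT time_cat,
         ROUND(time_micro * 100 / SUM(time_micro) OVER (), 2) pct
    FROM time_by_class)
SELECT time_cat, pct,
       CASE WHEN pct >= 20 THEN 'INVESTIGATE'
            WHEN pct >= 10 THEN 'WATCH'
            ELSE 'OK'
       END status
  FROM pct_by_class
 WHERE time_cat = 'Cluster';
```

If no Cluster waits have been recorded, the query returns no rows.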
Script: cluster_waits.sql
rem **
rem
rem File: cluster_waits.sql
rem Description: Break out of cluster waits compared to other categories
rem
rem
rem
rem
column wait_type format a35 heading "Wait Type"
column lock_name format a12
column waits_1000 format 99,999,999 heading "Waits|/1000"
column time_waited_hours format 99,999.99 heading "Time|Hours"
column pct_time format 99.99 Heading "Pct of|Time"
column avg_wait_ms format 9,999.99 heading "Avg Wait|Ms"
set pagesize 10000
set lines 100
set echo on
WITH system_event AS
(SELECT CASE
WHEN wait_class = 'Cluster' THEN event
ELSE wait_class
END wait_type, e.*
FROM gv$system_event e)
SELECT wait_type, ROUND(total_waits/1000,2) waits_1000 ,
ROUND(time_waited_micro/1000000/3600,2) time_waited_hours,
ROUND(time_waited_micro/1000/total_waits,2) avg_wait_ms ,
ROUND(time_waited_micro*100
/SUM(time_waited_micro) OVER(),2) pct_time
FROM (SELECT wait_type, SUM(total_waits) total_waits,
SUM(time_waited_micro) time_waited_micro
FROM system_event e
GROUP BY wait_type
UNION
SELECT 'CPU', NULL, SUM(VALUE)
FROM gv$sys_time_model
WHERE stat_name IN ('background cpu time', 'DB CPU'))
WHERE wait_type <> 'Idle'
ORDER BY time_waited_micro DESC;
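Note: cluster wait time usually consists mostly of straightforward global cache request waits, but it is not unusual for more serious global cache waits to appear, such as lost blocks, congestion, and global cache buffer busy waits. The script above shows the typical breakdown.
If the query shows gc wait events such as gc cr/current block 2-way, the DBA should investigate. No gc wait events appeared in the output below.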
Part 2: Reducing global cache latency
Script: gc_waits.sql
col event format a30 heading "Wait event"
col total_waits format 999,999,999 heading "Total|Waits"
col time_waited_secs format 999,999,999 heading "Time|(secs)"
col avg_ms format 9,999.99 heading "Avg Wait|(ms)"
set pagesize 1000
set lines 80
set echo on
SELECT event, SUM(total_waits) total_waits,
ROUND(SUM(time_waited_micro) / 1000000, 2)
time_waited_secs,
ROUND(SUM(time_waited_micro)/1000 /
SUM(total_waits), 2) avg_ms
FROM gv$system_event
WHERE wait_class <> 'Idle'
AND (event LIKE 'gc%block%way'
OR event LIKE 'gc%multi%'
OR event LIKE 'gc%grant%'
OR event = 'db file sequential read')
GROUP BY event
HAVING SUM(total_waits) > 0
ORDER BY event;
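Note: global cache consistent-read requests (gc cr block 2-way and similar) that average more than 1 ms, and more than one tenth of the average db file sequential read time, point to global cache latency that is worth investigating.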
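The two-part test in the note can be sketched as a query. This is a minimal illustration, assuming both conditions must hold together as the note states. The INVESTIGATE and OK labels are made up for this example, and only the gc ... block ... way events are compared against db file sequential read.

```sql
-- Sketch only: flags gc block transfer events against the 1 ms and 1/10-of-disk-read guideline.
WITH ev AS (
  -- Average wait per event in milliseconds, cluster-wide
  SELECT event,
         SUM(time_waited_micro) / 1000 / NULLIF(SUM(total_waits), 0) avg_ms
    FROM gv$system_event
   WHERE event LIKE 'gc%block%way'
      OR event = 'db file sequential read'
   GROUP BY event),
io AS (
  -- Reference point: average single-block disk read time
  SELECT avg_ms io_avg_ms
    FROM ev
   WHERE event = 'db file sequential read')
SELECT ev.event,
       ROUND(ev.avg_ms, 2) avg_ms,
       ROUND(io.io_avg_ms, 2) io_avg_ms,
       CASE WHEN ev.avg_ms > 1 AND ev.avg_ms > io.io_avg_ms / 10
            THEN 'INVESTIGATE'
            ELSE 'OK'
       END status
  FROM ev CROSS JOIN io
 WHERE ev.event LIKE 'gc%'
 ORDER BY ev.event;
```

If no db file sequential read waits have been recorded, the cross join returns no rows and there is nothing to compare against.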
Part 3: Optimizing the interconnect
Script: ksxpia.sql
rem **
rem
rem File: ksxpia.sql
rem Description: Private interconnect IP address
rem
rem
rem
col instance_number format 999 heading "Inst|#"
col host_name format a25 heading "Host|Name"
col network_interface format a5 heading "Net|IFace"
col private_ip format a12 heading "Private|IP"
set pages 1000
set echo on
SELECT instance_number, host_name, instance_name,
name_ksxpia network_interface, ip_ksxpia private_ip
FROM x$ksxpia
CROSS JOIN
v$instance
WHERE pub_ksxpia = 'N';
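Query result (run as the sys user):
Use ping to check the interconnect between the RAC nodes. The figure below shows that a round trip from odsdb2 to odsdb1 took 0.303 ms.
Note: if latency is too high on a newly deployed database, first check whether the public network has been configured as the private interconnect, which is the main cause of this problem.
Signs of interconnect problems
The following script reports the number of lost blocks alongside the numbers of blocks sent and received, so the two can be compared.
Script: gc_miss_rate.sql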
col value format 999,999,999,999
col name format a30
set echo on
SELECT name, SUM(VALUE) value
FROM gv$sysstat
WHERE name LIKE 'gc%lost'
OR name LIKE 'gc%received'
OR name LIKE 'gc%served'
GROUP BY name
ORDER BY name;
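Notes:
Time spent waiting for lost blocks to be retransmitted is recorded under gc cr request retry, gc cr block lost, and gc current block lost. The time associated with these wait events should be low. Compared with the total block counts in the gc cr/current blocks received/served statistics, lost blocks should normally be below 1% of the total.
If block loss is high, or the time associated with lost blocks is significant relative to total database time, a hardware problem is the most likely cause: for example a badly seated network card, a damaged network cable, or substandard network equipment.
Moderate block loss may mean that the interconnect is overloaded.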
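The 1% guideline can be turned into a single-row check. The sketch below is one possible reading of it: it uses only the cluster-wide blocks received as the denominator, on the assumption that counting served blocks as well would count each transfer twice. The status labels are invented for illustration.

```sql
-- Sketch only: lost blocks as a percentage of blocks received, cluster-wide.
WITH gc AS (
  SELECT SUM(CASE WHEN name LIKE 'gc%lost'
                  THEN value END) blocks_lost,
         SUM(CASE WHEN name LIKE 'gc%received'
                  THEN value END) blocks_received
    FROM gv$sysstat
   WHERE name LIKE 'gc%lost'
      OR name LIKE 'gc%received')
SELECT blocks_lost,
       blocks_received,
       -- NULLIF avoids a divide-by-zero error when nothing has been received yet
       ROUND(blocks_lost * 100 / NULLIF(blocks_received, 0), 4) pct_lost,
       CASE WHEN blocks_lost * 100 / NULLIF(blocks_received, 0) >= 1
            THEN 'INVESTIGATE'
            ELSE 'OK'
       END status
  FROM gc;
```

The counters are cumulative since instance startup, so a recent burst of lost blocks can be hidden by a long uptime.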
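Part 4: LMS waits
Interconnect performance is at the heart of global cache latency, but high global cache latency is often the result of delays in the Oracle software layers. The LMS service on the remote instance, which is responsible for building and returning the requested blocks, contributes most of the non-network latency of a global cache request. The following query shows the LMS latency for current and consistent-read requests on each instance.
Script: lms_latency.sql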
rem **
rem
rem File: lms_latency.sql
rem Description: LMS latency breakdown
rem
rem
rem
rem
col instance_name format a12 heading "Instance"
col current_blocks_served format 999,999,999 heading "Current Blks|Served"
col avg_current_ms format 99.99 heading "Avg|CU ms"
col cr_blocks_served format 999,999,999 heading "CR Blks|Served"
col avg_cr_ms format 99.99 heading "Avg|Cr ms"
set pages 1000
set lines 80
set echo on
WITH sysstats AS (
SELECT instance_name,
SUM(CASE WHEN name LIKE 'gc cr%time'
THEN VALUE END) cr_time,
SUM(CASE WHEN name LIKE 'gc current%time'
THEN VALUE END) current_time,
SUM(CASE WHEN name LIKE 'gc current blocks served'
THEN VALUE END) current_blocks_served,
SUM(CASE WHEN name LIKE 'gc cr blocks served'
THEN VALUE END) cr_blocks_served
FROM gv$sysstat JOIN gv$instance
USING (inst_id)
WHERE name IN
('gc cr block build time',
'gc cr block flush time',
'gc cr block send time',
'gc current block pin time',
'gc current block flush time',
'gc current block send time',
'gc cr blocks served',
'gc current blocks served')
GROUP BY instance_name)
SELECT instance_name , current_blocks_served,
ROUND(current_time*10/current_blocks_served,2) avg_current_ms,
cr_blocks_served,
ROUND(cr_time*10/cr_blocks_served,2) avg_cr_ms
FROM sysstats;
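Note: if the network is responsive and fast but LMS latency is high, the likely causes are:
1. An overloaded instance cannot respond quickly to global cache requests. In particular, the LMS processes may be swamped with requests or short of CPU.
2. An I/O bottleneck, particularly in redo I/O, is slowing the response to global cache requests.
When one or more instances in the cluster are overloaded, high global cache latency can result, which may mean that load balancing within the cluster needs attention. (Load balancing may be configured here; we do not want load-balancing mode.)
Another common cause of high latency is that LMS must flush uncommitted changes to the redo log before sending a block to the requesting instance.
The following script calculates the proportion of block transfers that required a redo log flush and the proportion of LMS time spent performing the flush: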
set echo on
WITH sysstat AS (
SELECT SUM(CASE WHEN name LIKE '%time'
THEN VALUE END) total_time,
SUM(CASE WHEN name LIKE '%flush time'
THEN VALUE END) flush_time,
SUM(CASE WHEN name LIKE '%served'
THEN VALUE END) blocks_served
FROM gv$sysstat
WHERE name IN
('gc cr block build time',
'gc cr block flush time',
'gc cr block send time',
'gc current block pin time',
'gc current block flush time',
'gc current block send time',
'gc cr blocks served',
'gc current blocks served')),
cr_block_server as (
SELECT SUM(flushes) flushes, SUM(data_requests) data_requests
FROM gv$cr_block_server )
SELECT ROUND(flushes*100/blocks_served,2) pct_blocks_flushed,
ROUND(flush_time*100/total_time,2) pct_lms_flush_time
FROM sysstat CROSS JOIN cr_block_server;
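Note: a small proportion of flushed blocks (1%) accounting for a large proportion of total LMS time (36%) indicates that the redo log I/O layout needs tuning. The figure above shows 19 against 100, so no tuning is needed for now.
Part 5: Cluster load balancing
Statistics on CPU, DB time, and logical reads for each instance in the cluster since startup.
Script: balance.sql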
rem **
rem
rem File: balance.sql
rem Description: Cluster balance report
rem
rem
rem
rem
col instance_name format a8 heading "Instance|Name"
col db_time_pct format 99.99 heading "Pct of|DB Time"
col cpu_time_pct format 99.99 heading "Pct of|CPU Time"
col db_time_secs format 9,999,999.99 heading "DB Time|(secs)"
col cpu_time_secs format 9,999,999.99 heading "CPU Time|(secs)"
set lines 80
set pages 1000
set echo on
WITH sys_time AS (
SELECT inst_id, SUM(CASE stat_name WHEN 'DB time'
THEN VALUE END) db_time,
SUM(CASE WHEN stat_name IN ('DB CPU', 'background cpu time')
THEN VALUE END) cpu_time
FROM gv$sys_time_model
GROUP BY inst_id )
SELECT instance_name,
ROUND(db_time/1000000,2) db_time_secs,
ROUND(db_time*100/SUM(db_time) over(),2) db_time_pct,
ROUND(cpu_time/1000000,2) cpu_time_secs,
ROUND(cpu_time*100/SUM(cpu_time) over(),2) cpu_time_pct
FROM sys_time
JOIN gv$instance USING (inst_id);
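Note: the figure above shows that node 1 carries the higher load. To enable load balancing, tnsnames.ora would need to be modified.
Querying service workload exposes a variety of workload statistics.
The query below breaks down CPU consumption on each cluster node, showing each service's share of an instance's CPU and how each service's workload is distributed across the nodes.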
rem **
rem
rem File: service_stats.sql
rem Description: Report on service workload by instance
rem
rem
rem
col instance_name format a8 heading "Instance|Name"
col service_name format a15 heading "Service|Name"
col cpu_time format 99,999,999 heading "Cpu|secs"
col pct_instance format 999.99 heading "Pct Of|Instance"
col pct_service format 999.99 heading "Pct of|Service"
set lines 80
set pages 1000
set echo on
BREAK ON instance_name skip 1
COMPUTE SUM OF cpu_time ON instance_name
WITH service_cpu AS (SELECT instance_name, service_name,
round(SUM(VALUE)/1000000,2) cpu_time
FROM gv$service_stats
JOIN
gv$instance
USING (inst_id)
WHERE stat_name IN ('DB CPU', 'background cpu time')
GROUP BY instance_name, service_name )
SELECT instance_name, service_name, cpu_time,
ROUND(cpu_time * 100 / SUM(cpu_time)
OVER (PARTITION BY instance_name), 2) pct_instance,
ROUND(cpu_time * 100
/ SUM(cpu_time) OVER (PARTITION BY service_name), 2)
pct_service
FROM service_cpu
WHERE cpu_time > 0
ORDER BY instance_name, service_name;
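Note: this output shows clearly how much CPU each service consumes.
Part 6: Measuring the global cache request rate
The following script calculates the ratio of physical reads to logical reads, which is the basis of the buffer cache hit ratio, alongside the ratio of global cache blocks received to logical reads.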
rem **
rem
rem File: gc_miss_rate.sql
rem Description: "Global cache ""miss rate"" by instance "
rem
rem
rem
rem
col instance_name format a10 heading "Instance|name"
col logical_reads format 999,999,999 heading "Logical|Reads"
col gc_blocks_received format 999,999,999 heading "GC Blocks|Received"
col physical_reads format 999,999,999 heading "Physical|Reads"
col phys_to_logical_pct format 99.99 heading "Phys/Logical|Pct"
col gc_to_logical_pct format 99.99 heading "GC/Logical|Pct"
set pagesize 10000
set lines 80
set echo on
WITH sysstats AS (
SELECT inst_id,
SUM(CASE WHEN name LIKE 'gc%received'
THEN VALUE END) gc_blocks_received,
SUM(CASE WHEN name = 'session logical reads'
THEN VALUE END) logical_reads,
SUM(CASE WHEN name = 'physical reads'
THEN VALUE END) physical_reads
FROM gv$sysstat
GROUP BY inst_id)
SELECT instance_name, logical_reads, gc_blocks_received, physical_reads,
ROUND(physical_reads*100/logical_reads,2) phys_to_logical_pct,
ROUND(gc_blocks_received*100/logical_reads,2) gc_to_logical_pct
FROM sysstats JOIN gv$instance
USING (inst_id);
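Analysis: the instance with the highest ratio of global cache requests to logical reads is the least busy instance. The less busy an instance is, the more likely it is that the blocks it needs are held by another, busier instance.
To determine which segments account for the largest share of global cache activity, the following query lists the segments with the most global cache blocks received.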
rem **
rem
rem File: top_gc_segments.sql
rem Description: Segments with the highest Global Cache activity
rem
rem
rem
col segment_name format a40
col gc_blocks_received format 999,999,999
col pct format 99.99
set pages 1000
set lines 80
set echo on
WITH segment_misses AS
(SELECT owner || '.' || object_name segment_name,
SUM(VALUE) gc_blocks_received,
ROUND( SUM(VALUE)* 100
/ SUM(SUM(VALUE)) OVER (), 2) pct
FROM gv$segment_statistics
WHERE statistic_name LIKE 'gc%received' AND VALUE > 0
GROUP BY owner || '.' || object_name)
SELECT segment_name,gc_blocks_received,pct
FROM segment_misses
WHERE pct > 1
ORDER BY pct DESC;
The output above lists the segments that received the most global cache blocks.