ASH stands for "active session history" and it's official introduction was in Oracle 10g.Oracle 10g introduced ASH and Oracle 11 buildsupon that foundation. ASH isa radically new way to gather performance data. “Radical?” You might ask. Yes, “radical” becauseASH onlytakessamples of data once a second and what happensbetween samples is lost. The lost data bothersmany at first but this unease quickly passes when the power of ASH is understood. ASH provides information necessary to solve some of the toughest performance problems. Before ASH this information was often too difficult and too expensive to get. Previous performance gathering techniques such as STATSPACK (and continued with the even more expensive AWR whose report is almost the same as STATSPACK) lacked the information to solve many performance bottlenecks. Session tracing provides much of the same information as ASH but is much more costly and has to be set up before a problem arises, proving to be impractical in many situations. Starting in Oracle 10g, ASH is always running, sampling every second, and saved for a week on disk (configurable), thus providing the data need to identify and solve a problem that may have only lasted seconds and days ago.
ASH is way of sampling the state of sessions connected to an Oracle database in order to monitory database load as well as drill down into any performance issues that may arise. ASH is a technology that can be applied to any system with connected users such as another database like SQL Server or an operating system or even applications and application servers. Oracle is the first system that I know of using ASH to collect performance data. ASH by default on an Oracle database, samples once every second and logs some 30 to 60 (depending on version) pieces of information on any session that is active at the time of sampling. One hurtle to accepting ASH as a model had been the question “how fast do we have to sample to get worth while data”. There has been a tendency (I admit having felt this) to want to sample super fast like 10-100 times a second (easily possible with a C program) but experience has shown that most monitoring and problem resolution can be easily accomplished with 1 second sampling. Higher rate of sampling serve only for rare cases that are better left to tracing. For 99.9% of the time, sampling once a second is sufficient to both show clearly the load on the system as well as provide detailed information on system activity that would otherwise be too difficult or prohibitively expensive to collect (due to the load caused by the data collection itself).
ASH is the collection of active session stated sampled every second. An active session is any session that has made a call to the database.
The commitment that Oracle took to ASH required a break from the compulsive quest to gather all the statistics possible at a 100% accuracy which was a futile and limiting strategy. By letting go the drive to collect everything all the time accuratly, Oracle was able to collect more information, more powerfully and with less over head. How can less be more? ASH took an understanding that the most powerful performance data can be collected not by collecting every thing but by collecting the most important information in a particularmanor.
Instead of collecting the exact statistics, ASH is a statistical approximation to the statistic counters. ASH samples every second the session states of all active sessions as well as the SQL the session is executing. This sampling produces a statistical approximation that is cheaper to collect and the multidimensional data allowing new and previouslyimpossiblediagnostics.
Intractable Performance Data Collection Problem
Previous to ASH, Oracle performance collections tried to collect as much statistics as often as possible.
This produced a couple of problems
1. Heisenberg affect
2. Overwhelming amount of data
For example, database tools tried to collect statistics on all the Sessions, SQL and Objects in the database.
Session information came from v$session.
SQL information came fro v$sql
Object information came from v$segstat (as of Oracle 9i, before 9i it wasn't available)
The collection of this information amounted to massive amounts of work. For example on a system with 150 sessions the number of values to collect would be 150 sessions x (800 wait events + 200 statitics) = 150,000 values to collect every collection ! That's just session information. For SQL, there could be tens of thousands and for each statement we might want to collect a couple dozen statistics. Same for objects - there could be 1000s of objects and for each we'd want to collect a dozen of statistics. The problem quickly becomes intractable.
There are optimizations of the collection that allowed some people todoggedlytry and collect these kind of statistics.For example for Sessions, SQL and Objects we could filter out any statistics that were zero but for anystatisticsthat wasn't zero we'd have collect it because we wouldn't know if it had changed since we last collected it. For sessions, Oracle does have a counter that tells whether the session has been active since we last collected but SQL and Objects don't have any such counter
Sessions
# sessions x (# wait events + statistics)
Example (150 x (800+200) = 150,000 )
SQL
v$sql
Could be 10000s
Takes out latches that compete with other sql executions
Objects
V$segstat 9i+
Could be 1000s of objects
How about other things like v$filestat, etc etc?
.
The solution? Every second query v$session which is an inexpesive view onto the actual C structures in Oracle shared memory. Only collect information for sessions that are active at the time of collection. Collecting only active sessions is a natural filter for everything we want to filter. This filters out not only unwanted sessions, but also serves as a guide to the SQL,Object and Files we also want to collect information on. We gather not only session information and state, but also gather what SQL they were executing, files and objects they were accessing, honing in directly to the information of most interest to us.
ASH:
Intelligently Collects Data
Samples once a second
Collects active sessions only
Collects the SQL being executed by the session and related information such as objects accessed, files read from etc
More activity, more data collected
Less activity, less data collected
Old methods:
collected everything
Obfuscated the problem, too many statistics too late
Costly
Too Granular – once an hour ?! Give me a break
Sample Every Second
Knowing everything – impossible and overwhelming
Knowing enough – possible and manageable
Sampling is like taking a motion picuture. We miss what happens between pictures but we get enough to know exactly what happened.
In the above diagram, the vertical lines represent the instant samples are take. Samples are taken once a second, thus for long running queries, over a second, we definitely capture the information. For short queries that are run often, we might miss a lot but we will capture a lot. For short queries that are rarely run, we will miss most of them. This kind of capture filters out queries that have little consequence on database load and focuses in on those queries that are important and put load on the database either because they are resource intensive or because they are run so often that they have a cumulative impact on resources.
If it happens a lot or for long ... it will be captured by ASH.
Wait events can be classified into 4 major groups
On Oracle databases the classification is easy thanks to a table called V$EVENT_NAME which has a filed "WAIT_CLASS" that tells the kind of wait.
One problem with Oracle's wait interface when it was introduced in version 7 was that there was no documentation on the wait events and to make matters worse, many of the wait events were "idle", ie the meant that the processes had no work to do and were just waiting. For example if I run SQL*PLUS but don't actually run any queries, then my session in the database will report that it's waiting for "SQL*Net message to client", meaning the session is ready and waiting to execute queries but I'm not giving it any. Once I submit a query then my state would change to either running on the CPU or some actual non-idle wait event like waiting for IO. But once my query finishes executing and the results are given back to the user in SQL*Plus then the session state in the database goes back to"SQL*Net message to client".The wait event"SQL*Net message to client" is only one of many idle wait events that don't signify bottlenecks but if I don't know they are idle events, then it might look like my database has huge bottlenecks. Thisconfusionbetween real wait events and idle wait events and the lack of documentation slowed the adoption of the wait interface by DBAs and performance analysts. Luckily now it's easy to get the list of idle wait events and create a list of events to ignore with the query:
select name
from v$event_name
where wait_class='Idle';
58 Rows
If on Oracle 9i or below, you can install statspack which has a tableSTATS$IDLE_EVENT that lists the idle events. These are events we can ignore when looking at overall database load.Theycould be relevant when looking at one session. For example, if a session is suppose to be running and we see lots of idle waits, its a signal that the application isinefficientlyusing the database. For example if the application is inserting 1000s of rows, but inserting one at a time, then a lot of time will be spent on communication between the application and the database session. In this case we will see the session spending a lot of time waiting on idle events when we'd expect it to be on CPU. The solution in that case would be to use batch processing. For example
Batching vs Single row operations
Single row operations, lots of communication , slowest
-- slowest : single row inserts with commit
insert into foo values ...;
insert into foo values ...;
insert into foo values ...;
insert into foo values ...;
Send all the info over the wire to the database, onecommunication
-- For Loop
FOR i IN 1..i LOOP
INSERT INTO foo VALUES (id(i), data(i));
END LOOP;
COMMIT;
Send all the info over the wire to the database, onecommunication, bulk insert
-- FORALL
FORALL i IN 1..i
INSERT INTO foo VALUES (id(i), data(i));
COMMIT;
seehttp://www.di.unipi.it/~ghelli/didattica/bdl/B19306_01/appdev.102/b14261/tuning.htm#i48876
CPU is a special case of "wait events". There is no wait event called "CPU" and when we are on CPU we aren't considered to be waiting but actually working. Working is considered a good thing (assuming the work is worth getting done). We might or might not be doing efficient work but that's a different discussion. CPU is one state that a sessions can spend it's time in though, so the CPU state belong in the list of "wait events" or better put "session states". CPU ispivotalin analysis of performance because the relative importance of time waited is all relative to how much time we spent on the CPU. If we spent little time both on CPU and wait time then we weren't doing much. If we spent a lot of time on CPU and little on wait time, then the waits don't matter but if we spent a lot of time waiting and little on CPU then we have an opportunity to improve throughput and response time.
How do we know if a session is on CPU? Oracle doesn't report any such session state directly but it does give us enough information to deduce it. In V$SESSION (or V$SESSION_WAIT before 10g) these fields let us know
wait_time !=0 and status='ACTIVE'
Both of these fields are in V$SESSION starting in 10g. Before 10g V$SESSION_WAIT has to be joined to V$SESSION to get both of these fields.
IO is a wait but it is a wait that all databases have to so at some point in time. Data has to be read of disk and reading off of disk causes IO wait time. Whether IO wait time is acceptable or not really depends on the application but we can make some general observations such as an optimize disk subsystem should be able to render reads in 10ms. If not and the system spends a lot of time waiting for IO then there is the opportunity to improve throughput. IO waits break down into groups such as IO done by sorts that overflow memory buffers and a written to disk in the temporary tables and read out again. If we see this happening we can investigate increasing memory sort area sizes. Finally we can check the db cache advice and see if there would be any benefit to change the buffer cache size. Once these areas have been checked, it really comes down to investigating the SQL and seeing if they can be optimized either through change the SQL or adding structures such as indexes or if the application logic or architecture can be optimized.
select name from V$EVENT_NAME where WAIT_CLASS=‘User I/O'
Finally pure waits. These are the waits that we should be able to eliminate from the system theoretically. The one exception to that might be "log file sync" which happens at commit time but can be minimized by having fast log file devices even going to solid state and limiting the commit frequency.
In general if we see other waits and they are important relative to the amount of CPU time spent then we have a clear opportunity for performance tuning.
Waits are classified by WAIT_CLASS
Select wait_class, count(*) from V$EVENT_NAME where WAIT_CLASS not in ('Idle','User I/O') group by WAIT_CLASS order by WAIT_CLASS;
WAIT_CLASS COUNT(*)
------------------------ ----------
Administrative 46
Application 12
Cluster 47
Commit 1
Concurrency 25
Configuration 23
Network 27
Other 590
Scheduler 2
System I/O 24
Over 800 waits, though almost 75% of the are "Other", ie they shouldn't happen in normal circumstances.
ASH
Not only does Oracle collect the active sessions, their session state and they SQL they are executing but Oracle collects a lot more information and stores it in a table V$ACTIVE_SESSION_HISTORY
SQL> v$active_session_history
Name Null? Type
----------------------------------------- -------- ----------------------------
SAMPLE_ID NUMBER
SAMPLE_TIME TIMESTAMP(3)
SESSION_ID NUMBER
SESSION_SERIAL# NUMBER
USER_ID NUMBER
SQL_ID VARCHAR2(13)
SQL_CHILD_NUMBER NUMBER
SQL_PLAN_HASH_VALUE NUMBER
SQL_OPCODE NUMBER
SERVICE_HASH NUMBER
SESSION_TYPE VARCHAR2(10)
SESSION_STATE VARCHAR2(7)
QC_SESSION_ID NUMBER
QC_INSTANCE_ID NUMBER
EVENT VARCHAR2(64)
EVENT_ID NUMBER
EVENT# NUMBER
SEQ# NUMBER
P1 NUMBER
P2 NUMBER
P3 NUMBER
WAIT_TIME NUMBER
TIME_WAITED NUMBER
CURRENT_OBJ# NUMBER
CURRENT_FILE# NUMBER
CURRENT_BLOCK# NUMBER
PROGRAM VARCHAR2(48)
MODULE VARCHAR2(48)
ACTION VARCHAR2(32)
CLIENT_ID VARCHAR2(64)
Looks like a lot of fields! Let's break it down into bite sizes logicalpieces:
Of these fields the basic information is
But all of the fields of ASH allow large number of options foraggregationand analysis such as
and starting in 10.2.0.3 Oracle as Package and Procedure so we can find the top Package and Procedure by all the resources listed above such as CPU, WAIT, IO or TIME.
ASH is multidimensional data that can be group in many ways.
Who is creating the load:
-- session identifiers : |
SQL Identfiers
SQL_ID |
Wait Information
EVENT |
Object information (IO, buffer busy wait, lock enqueue TX waits only)
CURRENT_OBJ# |
Top CPU consuming Session in last 5 minutes?
Select
session_id,
count(*)
from
v$active_session_history
where
session_state= ‘ON CPU‘ and
SAMPLE_TIME > sysdate – (5/(24*60))
group by
session_id
order by
count(*) desc;
SESSION_ID COUNT(*)
---------- ----------
257 299
263 62
256 32
264 9
277 3
258 1
Top Waiting Session in last 5 minutes
Select
session_id,
count(*)
from
v$active_session_history
where
session_state=‘WAITING’ and
SAMPLE_TIME > SYSDATE - (5/(24*60))
group by
session_id
order by
count(*) desc;
SESSION_ID COUNT(*)
---------- ----------
272 224
254 8
249 5
276 5
277 4
270 1
Top SQL by CPU usage, wait time and IO time
select
ash.SQL_ID ,
sum(decode(ash.session_state,'ON CPU',1,0)) "CPU",
sum(decode(ash.session_state,'WAITING',1,0)) -
sum(decode(ash.session_state,'WAITING', decode(en.wait_class, 'User I/O',1,0),0)) "WAIT" ,
sum(decode(ash.session_state,'WAITING', decode(en.wait_class, 'User I/O',1,0),0)) "IO" ,
sum(decode(ash.session_state,'ON CPU',1,1)) "TOTAL"
from v$active_session_history ash,
v$event_name en
where SQL_ID is not NULL and en.event#=ash.event#
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1)) desc
SQL_ID CPU WAITING IO TOTAL
------------- ---------- ---------- ---------- ----------
4c1xvq9ufwcjc 23386 0 0 23386
6wjw6rz5uvbp3 99 0 23 122
968dm8hr9qd03 97 0 22 119
938jp5gasmrah 90 0 25 115
cv8xnv81kf582 42 0 9 51
6p9bzu19v965k 21 0 0 21
5zu8pxnun66bu 15 0 0 15
db2jr13nup72v 9 0 0 9
7ks5gnj38hghv 8 0 0 8
Top SESSION by CPU usage, wait time and IO time
select
ash.session_id,
ash.session_serial#,
ash.user_id,
ash.program,
sum(decode(ash.session_state,'ON CPU',1,0)) "CPU",
sum(decode(ash.session_state,'WAITING',1,0)) -
sum(decode(ash.session_state,'WAITING',
decode(en.wait_class,'User I/O',1, 0 ), 0)) "WAITING" ,
sum(decode(ash.session_state,'WAITING',
decode(en.wait_class,'User I/O',1, 0 ), 0)) "IO" ,
sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
from v$active_session_history ash,
v$event_name en
where en.event# = ash.event#
group by session_id,user_id,session_serial#,program
order by sum(decode(session_state,'ON CPU',1,1))
SESSION_ID SERIAL#USER_ID PROGRAMCPUWAITINGIO
---------- ------- ---------- ------------------------- ------- ---------- ----------
247619701sqlplus1169800
27710 oracle@labsfrh903 (LGWR)14210
27610 oracle@labsfrh903 (CKPT)19100
27810 oracle@labsfrh903 (DBW0)2900
28010 oracle@labsfrh903 (PMON)1900
254226175 Executor.exe1303
255128775 Executor.exe1105
257337295 Executor.exe1501
255134175 Executor.exe1402
Top SQL with Username and connection status
select
decode(nvl(to_char(s.sid),-1),-1,'DISCONNECTED','CONNECTED')
"STATUS",
topsession.session_id "SESSION_ID",
u.name "NAME",
topsession.program "PROGRAM",
max(topsession.CPU) "CPU",
max(topsession.WAITING) "WAITING",
max(topsession.IO) "IO",
max(topsession.TOTAL) "TOTAL"
from ( {previous query} ) topsession,
v$session s,
user$ u
where
u.user# =topsession.user_id and
/* outer join to v$session because the session might be disconnected */
topsession.session_id = s.sid (+) and
topsession.session_serial# = s.serial# (+)
group by topsession.session_id, topsession.session_serial#, topsession.user_id,
topsession.program, s.username,s.sid,s.paddr,u.name
order by max(topsession.TOTAL) desc
STATUS SESSION_ID NAME PROGRAM CPU WAITING IO
--------------- ---------- ---------- ------------------------- ----- ---------- ----
CONNECTED 247 CPU_Monger ChMgr304.exe 11704 0 0
CONNECTED 277 SYS oracle@labsfrh903 (LGWR) 14 19 0
CONNECTED 278 SYS oracle@labsfrh903 (DBW0) 29 0 0
CONNECTED 276 SYS oracle@labsfrh903 (CKPT) 18 9 0
CONNECTED 280 SYS oracle@labsfrh903 (PMON) 20 0 0
DISCONNECTED 255 SYSTEM Executor.exe 11 4 5
DISCONNECTED 257 SYSTEM Executor.exe 13 0 3
DISCONNECTED 255 SYSTEM Executor.exe 14 0 2
DISCONNECTED 257 SYSTEM Executor.exe 13 0 3
Note on Active or Waiting
SESSION_STATE
Waiting, on CPU
Based on WAIT_TIME
WAIT_TIME (v$session, v$session_wait, v$ash)
0 => waiting
>0 => CPU (value is time of last wait)
TIME_WAITED
Actual time waited for event
0 until wait finishes
Fix up values (no one else can do this)
ASH family of tables
a
a
Circular Buffer - 1M to 128M (~2% of SGA)
Flushed every hour to disk or when buffer 2/3 full (it protects itself so you can relax)
Avg row around 150bytes
3600 secs in an hour
~ ½ Meg per Active Session per hour
That’s generally over an hour of ASH
Dumping ASH to flat file
oradebug dump ashdump 5
Alter session set events ‘immediate trace name ashdump level 5’;
level 5 = # of minutes
loader file rdbms/demo/ashldr.ctl
statistics_level = Typical (default)
PARAMETER SESSION_VALUE INSTANCE_VAL
----------------------- ---------- ------------
_ash_sampling_interval 1000 1000
milliseconds
_ash_size 1048618 1048618
ASH buffer size
_ash_enable TRUE TRUE
Turn on/off ASH sampling, flushing and the V$ views on ASH
_ash_disk_write_enable TRUE TRUE
Flush to disk
_ash_disk_filter_ratio 10 10
write 1 in 10 points
_ash_sample_all FALSE FALSE
Sample including idle waits
10.2 added fields to ASH
Blocking Session !
BLOCKING_SESSION
BLOCKING_SESSION_STATUS
BLOCKING_SESSION_SERIAL#
Parameter Names
P1TEXT , P2TEXT, P3TEXT
Wait Grouping
WAIT_CLASS
WAIT_CLASS_ID
Transactions
XID
Cursor sharing
FORCE_MATCHING_SIGNATURE
10.2.0.2 added
PLSQL_ENTRY_OBJECT_ID
PLSQL_ENTRY_SUBPROGRAM_ID
PLSQL_OBJECT_ID
PLSQL_SUBPROGRAM_ID
connect to ALL_PROCEDURES with
where object_id = plsql_object_id
and subprogram_id = plsql_subprogram_id
11.1.0 added
Identifying SQL Execution -Is this same execution as last sample?
SQL_EXEC_ID
SQL_EXEC_START
SQL Execution Row Source -Identifies current row source within plan
SQL_PLAN_LINE_ID
SQL_PLAN_OPERATION
SQL_PLAN_OPTIONS
RAC block transfers -Remote instance id for Cache transfers, Which instance sourced requested block?
REMOTE_INSTANCE#
Recursive SQL
TOP_LEVEL_SQL_ID
TOP_LEVEL_SQL_OPCODE
CPU breakdown -Operation bit vector ,Capture non-timed operations
IN_CONNECTION_MGMT
IN_PARSE
IN_HARD_PARSE
IN_SQL_EXECUTION
IN_PLSQL_EXECUTION
IN_PLSQL_RPC
IN_PLSQL_COMPILATION
IN_JAVA_EXECUTION
IN_BIND
IN_CURSOR_CLOSE
ETC
CURRENT_ROW#
EVENT#
QC_SESSION_SERIAL#
CONSUMER_GROUP_ID
FLAGS
How much data does ASH Collect ?
1 CPU means max 1 Avg Active Session unless there is a bottleneck
Big site examples:
Oracle 4 way RAC internal apps
10,000 connected, 200 active
One Site
3000 connected, 30 Active
Site
12,000 connected, 100 active
a
Ash Across Versions
ASH is now at 93 fields in 11gR2, starting from an original 30 in 10gR1
Here is a spread sheet across 10.1.0, 10.2.0.1, 10.2.0.3,11.1, 11.2
http://spreadsheets.google.com/pub?key=t7c5YZ1hD_I25jLa2D6DdLg&output=html
for 11gR2 documentation see:
http://download.oracle.com/docs/cd/E11882_01/server.112/e10820/dynviews_1007.htm#I1030299
10.1 10.2 10.2.0.3 11.1 11.2
aveact.sql- day of data, 1 line per hour
"+" - represent CPU usage
"-" - represent wait time
"2" in the GRAPH is the number of CPUS on this machine
GRAPH is a graphical representation of AVEACT (AAS)
TM NPTS AVEACT GRAPH CPU WAITS
---------------- ------ ------- ---------------------- ---- -----
06-AUG 13:00:00 270 .33 +- 2 29 59
06-AUG 14:00:00 1040 2.24 ++--------2--- 341 1984
06-AUG 15:00:00 623 6.67 ++++------2---------- 438 3718
06-AUG 16:00:00 1088 2.59 ++--------2---- 335 2486
06-AUG 17:00:00 1104 1.26 ++----- 2 349 1043
06-AUG 18:00:00 1093 1.38 +++---- 2 663 842
06-AUG 19:00:00 1012 1.74 ++------- 2 373 1388
06-AUG 20:00:00 1131 .99 +---- 2 304 820
06-AUG 21:00:00 1111 1.22 ++----- 2 344 1012
06-AUG 22:00:00 1010 1.66 ++------ 2 414 1259
06-AUG 23:00:00 1120 1.08 +---- 2 298 913
07-AUG 00:00:00 1024 .83 +--- 2 273 576
07-AUG 01:00:00 1006 1.74 ++------- 2 319 1428
07-AUG 02:00:00 1090 2.47 ++--------2---- 347 2345
07-AUG 03:00:00 687 6.59 +++-------2---------- 382 4142
07-AUG 04:00:00 1004 1.95 ++++++--- 2 1299 659
07-AUG 05:00:00 1104 3.08 +++++-----2------ 1170 2226
07-AUG 06:00:00 1122 1.91 +++++++-- 2 1582 558
07-AUG 07:00:00 1115 1.06 +++--- 2 559 618
07-AUG 08:00:00 1140 .81 ++-- 2 403 519
07-AUG 09:00:00 1128 .88 ++--- 2 386 601
aveactn.sql-- day of data, 1 line per hour,show the top to waits as well as graph
TO_CHAR(STA SAMPLES AAS1 FIRST AAS2 SECOND GRAPH
----------- ---------- ----- --------------- ----- --------------- ------------------------------
23 10:00:00 647 .01 db file sequent .01 db file scatter 4
23 11:00:00 2773 .02 CPU .01 buffer busy wai 4
23 12:00:00 3029 .00 CPU .00 db file paralle 4
23 01:00:00 2761 .00 SQL*Net break/r .00 control file se 4
23 02:00:00 2258 .00 CPU .00 log file sync 4
23 03:00:00 1651 .03 buffer busy wai .01 CPU
aveactf.sql- same as above but finer frequency ( 1 hour of data, line for every minute)
aveactnf.sql- same as above but finer frequency( 1 hour of data, line for every minute)