- Posted by Gavin Soorma
- On July 17, 2009
- 0 Comments
Checks to be performed at the machine level (note the example is Red Hat Linux specific)
run queue should be ideally not more than the number of CPU’s on the machine
At the maximum it should never be more than twice the number of CPU’s.
This is denoted by the column ‘r’ in the vmstat output shown below
vmstat – 5 face:/opt/oracle>vmstat 5 procs memory swap io system cpu r b swpd free buff cache si so bi bo in cs us sy id wa 4 1 488700 245704 178276 12513572 0 1 10 17 48 1365 40 12 43 5 2 0 488700 302568 178312 12514904 0 0 157 686 3354 4342 55 22 22 1 3 1 488700 257500 178508 12517896 0 0 616 221 1352 2132 21 7 64 9 2 2 488700 232348 178820 12525392 0 0 1550 274 3632 6091 29 9 42 20 3 0 488700 225040 178880 12527336 0 0 346 452 2494 3300 45 13 38 4 5 0 488700 225680 178916 12527884 0 0 79 269 2005 3847 29 11 58 2
CPU idle% < 10 ( id column) could indicate a machine that is having CPU resource issues
Note: How to find number of CPU’s on a LINUX machine?
cat /proc/cpuinfo |grep -w “processor” |wc –l
Swap columns si and so should ideally be 0 to indicate no swapping activity
face:/opt/oracle>free -m total used free shared buffers cached Mem: 15635 15476 159 0 173 12196 -/+ buffers/cache: 3106 12528 Swap: 3999 477 3522 We should be looking at the free and used values in the row denoted by “-/+ buffers/cache”
The ‘top’ command will help us identify the load average on the machine as well as any process that is consuming excessive CPU
face:/opt/oracle>top 09:04:01 up 243 days, 13:01, 10 users, load average: 2.80, 3.14, 3.13 813 processes: 797 sleeping, 8 running, 8 zombie, 0 stopped CPU states: cpu user nice system irq softirq iowait idle total 40.7% 0.0% 12.7% 0.0% 0.3% 2.3% 43.7% cpu00 42.8% 0.0% 13.4% 0.3% 1.1% 1.9% 40.0% cpu01 37.3% 0.0% 12.6% 0.0% 0.0% 1.7% 48.2% cpu02 47.5% 0.0% 14.1% 0.0% 0.1% 1.1% 36.9% cpu03 35.1% 0.0% 10.5% 0.0% 0.0% 4.5% 49.7% Mem: 16010560k av, 15808688k used, 201872k free, 0k shrd, 181324k buff 11368660k actv, 2257548k in_d, 265628k in_c Swap: 4095984k av, 488700k used, 3607284k free 12551192k cache PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 24062 oracle 18 0 1342M 1.3G 1328M S 14.0 8.5 370:13 2 oracle 6129 oracle 18 0 958M 953M 923M R 12.3 6.0 172:14 0 oracle 10077 oracle 25 0 149M 149M 145M R 4.5 0.9 0:00 2 oracle 9924 oracle 16 0 156M 155M 150M S 2.6 0.9 0:00 2 oracle 10038 oracle 17 0 140M 140M 136M S 2.5 0.8 0:00 1 oracle 10055 oracle 24 0 136M 135M 132M S 2.4 0.8 0:00 0 oracle 25529 oracle 17 0 782M 782M 775M S 1.7 5.0 3:58 0 oracle 8245 oracle 16 0 274M 274M 268M S 1.6 1.7 0:03 1 oracle 9751 oracle 16 0 70484 67M 62952 S 1.3 0.4 0:01 2 oracle 31879 oracle 16 0 830M 828M 815M R 1.0 5.2 46:06 3 oracle 9210 oracle 15 0 229M 229M 225M S 0.7 1.4 0:01 3 oracle
A load average greater than 5 or 10 could indicate a heavily utilized machine
CPU information for each CPU is also provided via the top command as well as information on the physical as well as virtual Memory that is available on the machine.
Information on the top CPU as well as Memory consuming processes is also provided along with the Process ID (PID). In a later section we will use this PID as a parameter for a SQL query to identify the SQL being executed by the same CPU consuming process.
Identify if any single PID is constantly appearing in the top output
The iowait column can also help us identify if there s any resource contention from the IO perspective.
A value above 40-50% would indicate I/O resource issues and require further investigation into the process that is causing this high I/O – or it could indicate a case of inefficient disk sub system or file layout in the database.
We can also view the state of the machine at a particular time of the day by running the sar command which will provide the system utilisation report since 12:00 AM on that particular day.
We can use the sar command to identify the machine state even for a particular day of the month
For example sar -f /var/log/sa/sa03 will report for the 3rd of the month
Checks to be performed at the database level
Identify with the user if the problem is a slow response or a hanging situation.
Establish a connection via SQL*NET using a non SYSDBA account to confirm that the listener is accepting client connections and the hanging is not due to the archive area getting 100% full.
Examine the alert log file for ‘Checkpoint Not Completed’ errors recorded at the time the performance problem is reported – this could indicate an I/O contention issue or inadequately sized redo log files which can also cause an application hang while the checkpoint completes.
Ensure that the mount point on which the Oracle software is not 100% full or the disks holding the controlfiles are also not 100% full.
Check the listener.log file if it exists and ensure that it is not > 2GB – on some Operating Systems like LINUX, there is a file size upper limit for the listener.log file after which client connections will not be accepted by the listener.
Check for locked sessions (see script check_lock.sql).
If a PID has been identified as a top CPU consuming process, check the SQL being executed by that particular PID (see script check_pid_sql.sql)
If the user provides a particular SID where a possible performance issue exists, check the SQL being executed by that SID ( see script check_sid_sql.sql)
If the user provides a particular Oracle username where a possible performance issue exists, check the SQL being executed by that Oracle user ( see script check_username_sql.sql)
Very Important – check the major wait events in the database (see script wait_events.sql)
Check the SID along with the events that each SID is waiting on (see script wait_events_sid.sql) – based on the SID, the SQL being executed by the waiting sessions can be obtained as well ( see script check_sid_sql.sql)
Refer document Resolving common Oracle Wait Events using the Wait Interface for details on important wait events along with their possible causes and resolution.
Check for any sessions continuously waiting for on a particular latch (see script check_latch.sql) and also refer document “Resolving common Oracle Wait Events using the Wait Interface.doc”.
What has changed?
Is there a measurable baseline regarding the “problem” query – when did it last perform well?
Has the database been upgraded recently?
Has any modifications been done to the database in terms of init.ora parameters?
Have any new indexes been added to the table or has the table structure changed?
Has the platform or database version changed?
Is this a period of unusual business activity? – like a monthly data load or one-off batch job
check_pid_sql.sql SET PAGESIZE 500 set long 500000 set head off select s.username su, substr(sa.sql_text,1,540) txt from v$process p, v$session s, v$sqlarea sa where p.addr=s.paddr and s.username is not null and s.sql_address=sa.address(+) and s.sql_hash_value=sa.hash_value(+) and spid=; check_sid_sql.sql SET PAGESIZE 500 PROMPT============================================================= PROMPT Current SQL statement this session executes PROMPT============================================================= col sql_text for a70 hea "Current SQL" select q.sql_text from v$session s , v$sql q WHERE s.sql_address = q.address and s.sql_hash_value + DECODE (SIGN(s.sql_hash_value), -1, POWER( 2, 32), 0) = q.hash_value AND s.sid= ; check_username_sql.sql set long 500000 SET PAGESIZE 500 select s.username su, substr(sa.sql_text,1,540) txt from v$process p, v$session s, v$sqlarea sa where p.addr=s.paddr and s.username is not null and s.sql_address=sa.address(+) and s.sql_hash_value=sa.hash_value(+) and s.username=upper('&username'); check_lock.sql set linesize 500 SET PAGESIZE 500 col waiting_session format 99999 heading 'Waiting|Session' col holding_session format 99999 heading 'Holding|Session' col mode_held format a20 heading 'Mode|Held' col mode_requested format a20 heading 'Mode|Requested' col lock_type format a20 heading 'Lock|Type' prompt blocked objects from V$LOCK and SYS.OBJ$ set lines 132 col BLOCKED_OBJ format a35 trunc select /*+ ORDERED */ l.sid , l.lmode , TRUNC(l.ctime/60) min_blocked , u.name||'.'||o.NAME blocked_obj from (select * from v$lock where type='TM' and sid in (select sid from v$lock where block!=0)) l , sys.obj$ o , sys.user$ u where o.obj# = l.ID1 and o.OWNER# = u.user# ; prompt blocked sessions from V$LOCK select /*+ ORDERED */ blocker.sid blocker_sid , blocked.sid blocked_sid , TRUNC(blocked.ctime/60) min_blocked , blocked.request from (select * from v$lock where block != 0 and type = 'TX') blocker , v$lock blocked where blocked.type='TX' and blocked.block = 0 and blocked.id1 = blocker.id1 ; prompt blockers session details from V$SESSION set lines 132 col username format a10 trunc col osuser format a12 trunc col machine format a15 trunc col process format a15 trunc col action format a50 trunc SELECT sid , serial# , username , osuser , machine FROM v$session WHERE sid IN (select sid from v$lock where block != 0 and type = 'TX') ; wait_events.sql SELECT count(*), event FROM v$session_wait WHERE wait_time = 0 AND event NOT IN ('smon timer','pmon timer','rdbms ipc message', 'SQL*Net message from client') GROUP BY event ORDER BY 1 DESC ; wait_events_sid.sql col username format a12 col sid format 9999 col state format a15 col event format a45 col wait_time format 99999999 set pagesize 800 set linesize 800 select s.sid, s.username, se.event from v$session s, v$session_wait se where s.sid=se.sid and se.event not like 'SQL*Net%' and se.event not like '%rdbms%' and s.username is not null order by 3; check_latch.sql select count(*), name latchname from v$session_wait, v$latchname where event='latch free' and state='WAITING' and p2=latch# group by name order by 1 desc;