CHAPTER 22


Database Troubleshooting

Database troubleshooting is a vague and general term that is applied to a wide variety of topics. It can mean anything from investigating database connectivity issues to detailed performance tuning. In this chapter, I’ll cover the following troubleshooting activities:

  • Assessing database availability issues quickly
  • Identifying system performance issues with OS utilities
  • Querying data dictionary views to display resource-intensive SQL statements
  • Using Oracle performance tools to identify resource-consuming SQL statements
  • Identifying and resolving locking issues
  • Troubleshooting open-cursor issues
  • Investigating issues with the undo and temporary tablespaces

The prior list doesn’t encompass all the types of database troubleshooting and performance issues that you’ll encounter. Rather, it’s a sampling of common database problems, used to demonstrate useful techniques for resolving them.

Quickly Triaging

When I get a call reporting some vague performance issues with a database, I perform a few quick standard checks to establish whether or not there are really problems. Probably half the time, it turns out to be something other than the database. Regardless, when somebody (developer, user, boss, and so on) reports an issue, the DBA must respond and determine if there is an issue or if the problem is with a nondatabase component of the system.

Tip   Keep in mind that you should automate jobs that perform tasks such as verifying the database availability (see Chapter 21 for examples of automating DBA tasks). Automated jobs help you proactively handle issues so that they don’t turn into database downtime.

Checking Database Availability

The first few checks that I perform don’t require logging in to the database server. Rather, they can be performed remotely via SQL*Plus and OS commands. In fact, I perform all the initial checks remotely over the network; this establishes whether all the system components are working.

One quick check to determine whether the remote server is available, the database is up, the network is working, and the listener is accepting incoming connections is to connect via an SQL*Plus client to the remote database over the network. I usually have a standard database account and password that I create in all databases for use in such scenarios. Here is an example of connecting over the network to a remote database as the barts user with a password of l1sa; the network connect information is embedded directly into the connect string (where dwdb1 is the server, 1521 is the port, and dwrep1 is the database service name):

$ sqlplus barts/l1sa@'dwdb1:1521/dwrep1'

If a connection can be made, then the remote server is available, and the database and listener are up and working. At this point, I contact whoever reported the problem and determine whether the connectivity issue has to do with the application or with something other than the database.
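
If you run this kind of check frequently, it’s convenient to script it. Here is a minimal sketch of such a script, reusing the illustrative account and connect string from above (adjust for your environment); the -S and -L switches run SQL*Plus silently and stop it from reprompting after a failed logon:

#!/bin/bash
# dbcheck.sh: quick remote database availability check (illustrative)
conn='barts/l1sa@dwdb1:1521/dwrep1'
# attempt a logon, immediately exit, and suppress all output
if echo "exit;" | sqlplus -S -L "$conn" >/dev/null 2>&1; then
  echo "connection OK: server, listener, and database are reachable"
else
  echo "connection FAILED: check server, network, listener, and database"
fi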

If the prior SQL*Plus command doesn’t work, try to establish whether the remote server is available. In this example, the ping command is issued to the remote server, named dwdb1:

$ ping dwdb1

If ping works, you should see output such as this:

64 bytes from dwdb1 (192.168.254.215): icmp_seq=1 ttl=64 time=0.044 ms

If ping doesn’t work, there is probably an issue with either the network or the remote server. If the remote server isn’t available, I usually try to contact a system administrator or network administrator.

If ping does work, I then check to see if the remote server is reachable via the port that the listener is listening on. I use the telnet command to accomplish this:

$ telnet IP <port>

In this example a network connection is attempted to the server’s IP address on the 1521 port:

$ telnet 192.168.254.215 1521

If the IP address is reachable on the specified port, you should see “Connected to . . .” in the output, like so:

Trying 192.168.254.215...
Connected to dwdb1.
Escape character is '^]'.

If the telnet command doesn’t work, I contact the SA or the network administrator.

If the telnet command does work, and there is network connectivity to the server on the specified port, then use the tnsping command to test network connectivity to the remote server and database, using Oracle Net. This example attempts to reach the DWREP1 remote service:

$ tnsping DWREP1

If successful, the output should contain the OK string, like so:

Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = DWDB1)
(PORT = 1521)) (CONNECT_DATA = (SERVICE_NAME = DWREP1)))
OK (20 msec)

If tnsping works, it means that the remote listener is up and working. It doesn’t necessarily mean that the database is up, so you may need to log in to the database server to investigate further. If tnsping doesn’t work, then the listener or the database is down or hung.

To investigate further, I log in directly to the server to perform additional checks, such as looking for a full mount point.

ORACLE INSTANT CLIENT

I work with SAs and developers who sometimes need to test remote connectivity to a database but who don’t have access to an Oracle installation with the SQL*Plus executable. In these situations, I recommend that they download and use Oracle Instant Client. It has a very small footprint and takes just a few minutes to install. Here are the steps:

  1. Download the Instant Client from the Technology Network area of the Oracle web site (http://otn.oracle.com).
  2. Create a directory for storing the files.
  3. Unzip the files to that directory.
  4. Set the LD_LIBRARY_PATH and PATH variables to include the directory to which the files were unzipped.
  5. Connect to a remote database, using the easy connect syntax:

    $ sqlplus user/pass@'host:port/database_service_name'

This process allows you to access SQL*Plus without having to perform a large and cumbersome Oracle install. Instant Client is available for most hardware platforms (Windows, Mac, Linux, and Unix).
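
For example, on a Linux server the setup might look like the following; the file names and directory are illustrative and vary by version and platform:

$ mkdir /home/oracle/instantclient
$ cd /home/oracle/instantclient
$ unzip instantclient-basic-linux.x64-12.1.0.2.0.zip
$ unzip instantclient-sqlplus-linux.x64-12.1.0.2.0.zip
$ export LD_LIBRARY_PATH=/home/oracle/instantclient/instantclient_12_1
$ export PATH=$LD_LIBRARY_PATH:$PATH
$ sqlplus barts/l1sa@'dwdb1:1521/dwrep1'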

Investigating Disk Fullness

To further diagnose issues (such as running low on disk space), you need to log in directly to the remote server, typically as the owner of the Oracle software (usually the oracle OS account). One common issue that causes a database to hang or throw errors is a full mount point. The df command, with the human-readable -h switch, helps verify disk fullness:

$ df -h

Any mount point that is full needs to be investigated. If the mount point that contains ORACLE_HOME becomes full, then you’ll receive errors such as this when connecting to the database:

Linux Error: 28: No space left on device

To fix issues with a full mount point, first identify files that can be either moved or removed. Generally, I start by looking for old trace files, which can often be safely removed.
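
To see which directories are consuming the most space on a full mount point, a du-based sweep can help. This example (using an illustrative mount point) lists the ten largest directories under /u01; note that sort -h requires a reasonably recent version of GNU coreutils (on older systems, use du -sk with sort -nr instead):

$ du -sh /u01/* 2>/dev/null | sort -hr | head -10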

Locating the Alert Log and Trace Files

In Oracle Database 11g and higher the default alert log directory path has this structure:

ORACLE_BASE/diag/rdbms/LOWER(<db_unique_name>)/<instance_name>/trace

Note   You can override the default directory path for the alert log by setting the DIAGNOSTIC_DEST initialization parameter.

Usually, the db_unique_name is the same as the instance_name. In RAC and Data Guard environments, however, the db_unique_name is often different from the instance_name. You can verify the directory path with this query:

SQL> select value from v$diag_info where name = 'Diag Trace';

The name of the alert log follows this format:

alert_<ORACLE_SID>.log

You can also locate the alert log from the OS (whether or not the database is started) via these OS commands:

$ cd $ORACLE_BASE
$ find . -name alert_<ORACLE_SID>.log

In the prior find command, you’ll need to replace the <ORACLE_SID> value with the name of your database.

Tip   For Oracle Database 10g the standard location for the alert log is defined as the <ORACLE_BASE>/admin/<SID>/bdump directory. The easiest way to determine its location is via the following SQL command: show parameter background_dump_dest

As shown in Chapter 3, it’s advisable to set up an OS function that helps you navigate to the location of the alert log. Here is such a function (you’ll have to modify this to match your environment):

function bdump {
  if [ "$ORACLE_SID" = "O12C" ]; then
    cd /orahome/app/oracle/diag/rdbms/o12c/O12C/trace
  elif [ "$ORACLE_SID" = "O11R2" ]; then
    cd /orahome/app/oracle/diag/rdbms/o11r2/O11R2/trace
  fi
}

You can now type bdump, and you’ll be placed in the working directory that contains the database alert log. Once you have found the correct file, inspect the most recent entries for errors, and then look for trace files in the same directory:

$ ls -altr *.tr*

If any of these trace files are more than several days old, consider moving or removing them.

Removing Files

Needless to say, be very careful when removing files. When trying to resolve issues, the last thing you want to do is make things worse. Accidentally removing one critical file can be catastrophic. For any files you identify as candidates for deletion, consider moving (instead of deleting) them. If you have a mount point that has free space, move the files there, and leave them for a couple of days before removing them.
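
For example, assuming /u02/backups is a mount point with free space (an illustrative path), the move looks like this:

$ mv <file_name> /u02/backups

If nothing is amiss after a couple of days, the copies in /u02/backups can be removed.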

Tip   Consider using the Automatic Diagnostic Repository Command Interpreter (ADRCI) utility to purge old trace files. For more details, see the Oracle Database Utilities Guide, which is available for download from the Technology Network area of the Oracle website (http://otn.oracle.com).
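
As a minimal sketch of that approach, the following ADRCI commands purge trace files older than two days; the -age value is expressed in minutes, and the homepath is illustrative:

$ adrci
adrci> set homepath diag/rdbms/o12c/O12C
adrci> purge -age 2880 -type trace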

If you have identified files that can be removed, first create a list of the files that will be removed before you actually delete them. Minimally, I do this before removing any file:

$ ls -altr <file_name>

After viewing the results returned by the ls command, remove the file(s). This example uses the Linux/Unix rm command to permanently delete the file:

$ rm <file_name>

You can also remove files based on the age of the file. For example, say you determine that any trace files more than 2 days old can be safely deleted. Typically, the find command is used in conjunction with the rm command to accomplish this task. Before removing files, first display the results of the find command:

$ find . -type f -mtime +2 -name "*.tr*"

If you are satisfied with the list of files, then add the rm command to remove them:

$ find . -type f -mtime +2 -name "*.tr*" | xargs rm

In the prior line of code, the results of the find command are piped to the xargs command, which executes the rm command for every file found by the find command. This is an efficient method for deleting files based on age. However, be very sure you know which files will be deleted.

Another file that sometimes consumes large amounts of space is the listener.log file. Because this file is actively written to by the listener process, you can’t simply remove it. If you need to preserve the contents of this file, first copy it to a backup location (that contains free disk space), and then truncate the file. In this example the listener.log file is copied to /u01/backups, and then the file is truncated, as follows:

$ cp listener.log /u01/backups

Next, use the cat command to replace the contents of the listener.log with the /dev/null file (which contains zero bytes):

$ cat /dev/null > listener.log

Inspecting the Alert Log

When dealing with database issues, the alert.log file should be one of the first files you check for relevant error messages. You can use either OS tools or the ADRCI utility to view the alert.log file and corresponding trace files.

Viewing the Alert Log via OS Tools

After navigating to the directory that contains the alert.log, you can see the most current messages by viewing the end of the file (the most recent messages are written there). To view the last 50 lines, use the tail command:

$ tail -50 alert_<SID>.log

You can continuously view the most current entries by using the -f switch:

$ tail -f alert_<SID>.log

You can also directly open the alert.log with an OS editor (such as vi):

$ vi alert_<SID>.log

Sometimes, it’s handy to define a function that will allow you to open the alert.log, regardless of your current working directory. The next few lines of code define a function that locates and opens the alert.log with the view command:

#-----------------------------------------------------------#
# view alert log
  function valert {
  if [ "$ORACLE_SID" = "O12C" ]; then
    view /orahome/app/oracle/diag/rdbms/o12c/O12C/trace/alert_O12C.log
  elif [ "$ORACLE_SID" = "O11R2" ]; then
    view /orahome/app/oracle/diag/rdbms/o11r2/O11R2/trace/alert_O11R2.log
  fi
  } # valert
#-----------------------------------------------------------#

Usually, the prior lines of code are placed in a startup file so that the function is automatically defined when you log in to a server. Once the function is defined, you can view the alert.log by typing this command:

$ valert

When inspecting the end of the alert.log, look for errors that indicate these types of issues:

  • Archiver process hung, owing to inadequate disk space
  • File system out of space
  • Tablespace out of space
  • Running out of memory in the buffer cache or shared pool
  • Media error indicating that a data file is missing or damaged
  • Error indicating an issue with writing an archive redo log; for example,
    ORA-19502: write error on file "/ora01/fra/o12c/archivelog/...

For a serious error message listed in the alert.log file, there is almost always a corresponding trace file. For example, here is the accompanying message for the prior error message:

Errors in file /orahome/app/oracle/diag/rdbms/o12c/O12C/trace/O12C_ora_5665.trc

Inspecting the trace file will often (but not always) provide additional insight into the issue.

Viewing the Alert Log Using the ADRCI Utility

With Oracle Database 11g or higher, you can use the ADRCI utility to view the contents of the alert.log file. Run the following command from the OS to start the ADRCI utility:

$ adrci

You should be presented with a prompt:

adrci>

Use the SHOW ALERT command to view the alert.log file:

adrci> show alert

If there are multiple Oracle homes on the server, then you will be prompted to choose which alert.log you want to view. The SHOW ALERT command will open the alert.log with the utility that has been set as the default editor for your OS. On Linux/Unix systems the default editor is derived from the OS EDITOR variable (which is usually set to a utility such as vi).

Tip   When presented with the alert.log, if you are unfamiliar with vi, and want to exit, first press the Escape key, then press and hold down the Shift key while pressing the colon (:) key. Next, type q!. That should exit you out of the vi editor and back to the ADRCI prompt.

You can override the default editor within ADRCI, using the SET EDITOR command. This example sets the default editor to emacs:

adrci> set editor emacs

You can view the last N lines of the alert.log with the -tail option. The following command shows the last 50 lines of the alert.log:

adrci> show alert -tail 50

If you have multiple Oracle homes, you may see a message such as this:

DIA-48449: Tail alert can only apply to single ADR home

The ADRCI utility doesn’t assume that you want to work with one Oracle home over another on a server. To specifically set the Oracle home for the ADRCI utility, first use the SHOW HOMES command to display all available Oracle homes:

adrci> show homes

Here is some sample output for this server:

diag/rdbms/o12c/O12C
diag/rdbms/o12cp/O12CP

Now, use the SET HOMEPATH command. This sets the HOMEPATH to diag/rdbms/o12c/O12C:

adrci> set homepath diag/rdbms/o12c/O12C

To continuously display the end of the file, use this command:

adrci> show alert -tail -f

Press Ctrl+C to break off continuously viewing the alert.log file. To display lines from the alert.log that contain specific strings, use the MESSAGE_TEXT LIKE predicate. This example shows messages that contain the ORA-27037 string:

adrci> show alert -p "MESSAGE_TEXT LIKE '%ORA-27037%'"

You will be presented with a file that contains all the lines in the alert.log that match the specified string.

Tip   See the Oracle Database Utilities Guide for full details on how to use the ADRCI utility.

Identifying Bottlenecks via OS Utilities

In the Oracle world there is sometimes a tendency to assume that you have a dedicated machine for one Oracle database. Furthermore, this database is the latest version of Oracle, fully patched, and monitored by a sophisticated graphical tool. This database environment is completely automated and kept trouble free through the use of visual tools that quickly pinpoint problems and efficiently isolate and resolve issues. If you live in this ideal world, then you probably don’t need any of the material in this chapter.

Let me paint a slightly different picture. I have an environment in which one machine has a dozen databases running on it. There’s a MySQL database, a PostgreSQL database, and a mix of Oracle version 10g, 11g, and 12c databases. Furthermore, many of these old databases are on nonterminal releases of Oracle and are therefore not supported by Oracle Support. There are no plans to upgrade any of these unsupported databases because the business can’t take the risk of potentially breaking the applications that depend on these databases.

So, what does one do in this type of environment when somebody reports that a database application is performing poorly? In this scenario it’s often something in a different database that is causing other applications on the box to behave poorly. It may not be an Oracle process or an Oracle database that is causing problems.

In this situation it’s almost always more effective to start investigating issues by using an OS tool. The OS tools are database agnostic. OS performance utilities help pinpoint where the most resources are consumed, regardless of database vendor or version.

In Linux/Unix environments there are several tools available for monitoring resource usage. Table 22-1 summarizes the most commonly used OS utilities for diagnosing performance issues. Being familiar with how these OS commands work and how to interpret the output will allow you to better diagnose server performance issues, especially when it’s a non-Oracle or even a nondatabase process that is tanking performance for every other application on the box.

Table 22-1. Performance and Monitoring Utilities

Tool Purpose
vmstat Monitors processes, CPU, memory, and disk I/O bottlenecks
top Identifies sessions consuming the most resources
watch Periodically runs another command
ps Identifies highest CPU- and memory-consuming sessions; used to identify Oracle sessions consuming the most system resources
mpstat Reports CPU statistics
sar Displays CPU, memory, disk I/O, and network usage, both current and historical
free Displays free and used memory
df Reports on free disk space
du Displays disk usage
iostat Displays disk I/O statistics
netstat Reports on network statistics

When diagnosing performance issues, it’s useful to determine where the OS is constrained. For instance, try to identify whether the issue is related to CPU, memory, or I/O, or a combination of these.

Identifying System Bottlenecks

Whenever there are application performance issues or availability problems, seemingly (from the DBA’s perspective), the first question asked is, “What’s wrong with the database?” Regardless of the source of the problem, the onus is often on the DBA to establish whether the database is behaving well. I usually approach this issue by determining which system-wide resources are being consumed. There are two Linux/Unix OS tools that are particularly useful for displaying system-wide resource usage:

  • vmstat
  • top

The vmstat (virtual memory statistics) tool is intended to help you quickly identify bottlenecks on your server. The top utility provides a dynamic, real-time view of system resource usage. These two utilities are discussed in the next two sections.

Using vmstat

The vmstat utility displays real-time performance information about processes, memory, paging, disk I/O, and CPU usage. This example shows using vmstat to display the default output, with no options specified:

$ vmstat
procs  -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache    si   so    bi    bo   in    cs us sy id wa
14  0  52340  25272   3068 1662704    0    0    63    76    9    31 15  1 84  0

Here are some general heuristics you can use when interpreting the output of vmstat:

  • If the wa (time waiting for I/O) column is high, this is usually an indication that the storage subsystem is overloaded.
  • If b (processes sleeping) is consistently greater than 0, then you may not have enough CPU processing power.
  • If so (memory swapped out to disk) and si (memory swapped in from disk) are consistently greater than 0, you may have a memory bottleneck.

By default only one line of server statistics is displayed when running vmstat (without supplying any options). This one line of output gives average statistics calculated from the last time the system was rebooted. This is fine for a quick snapshot. However, if you want to gather metrics over a period of time, use vmstat with this syntax:

$ vmstat <interval in seconds> <number of intervals>

While in this mode, vmstat reports statistics, sampling from one interval to the next. For example, if you wanted to report system statistics every 2 seconds, for ten intervals, you’d issue this command:

$ vmstat 2 10

You can also send the vmstat output to a file. This is useful for analyzing historical performance over a period of time. This example samples statistics every 5 seconds, for a total of 60 reports, and records the output in a file:

$ vmstat 5 60 > vmout.perf

Additionally, the vmstat utility can be used with the watch tool. The watch command is used to execute another program on a periodic basis. In this example, watch runs the vmstat command every 5 seconds and highlights on the screen any differences between each snapshot:

$ watch -n 5 -d vmstat
Every 5.0s: vmstat                                    Thu Aug  9 13:27:57 2007
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache  si   so    bi    bo   in    cs us sy id wa
 0  0    144  15900  64620 1655100   0    0     1     7   16     4  0  0 99  0

When running vmstat in watch -d (differences) mode, you’ll visually see changes on your screen, from snapshot to snapshot. To exit watch, press Ctrl+C.

Note that the default unit of measure for the memory columns of vmstat is kilobytes. If you want to view the memory statistics in megabytes, then use the -S m (statistics in megabytes) option:

$ vmstat -S m

Tip   Use the man vmstat or vmstat --help command for further documentation on this utility.

Using top

Another tool for identifying resource-intensive processes is the top command. Use this utility to quickly identify which processes are the highest consumers of resources on the server. By default, top will repetitively refresh (every 3 seconds) information regarding the most CPU-intensive processes. The simplest way to run top is as follows:

$ top

Here is a fragment of the output:

top - 13:34:32 up 19 min,  2 users,  load average: 0.05, 0.16, 0.24
Tasks: 176 total,   1 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.7%sy,  0.0%ni, 91.6%id,  5.7%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    787028k total,   748744k used,    38284k free,     1836k buffers
Swap:  1605624k total,    31896k used,  1573728k free,   377668k cached
 
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
4683 root      20   0  279m  14m 7268 S  2.0  1.9   0:01.61 gnome-terminal
3826 root      20   0  218m 9944 4200 S  1.7  1.3   0:02.80 Xorg
3592 oracle    -2   0  601m  18m  15m S  0.7  2.4   0:09.36 ora_vktm_o12c

The process identifiers (PIDs) of the top-consuming sessions are listed in the first column (PID). You can use this information to see if a PID maps to a database process (see the next section, “Mapping an OS Process to an SQL Statement,” for details on mapping a PID to a database process).

While top is running, you can interactively change its output. For example, if you type the redirection character (>), this will move the column that top is sorting one position to the right. Table 22-2 lists some key features that you can use to alter the top display to the desired format.

Table 22-2. Commands to Interactively Change the top Output

Command Function
Spacebar Immediately refreshes the output
< or > Moves the sort column one position to the left or to the right. By default, top sorts on the CPU column.
D Changes the refresh time
R Reverses the sort order
Z Toggles the color output
H Displays the Help menu
F or O Chooses a sort column

Type q, or press Ctrl+C, to exit top. Table 22-3 describes several of the columns displayed in the default output of top.

Table 22-3. Column Descriptions of the top Output

Column Description
PID Unique process identifier
USER OS username running the process
PR Priority of the process
NI Nice value of the process. A negative value means high priority; a positive value means low priority.
VIRT Total virtual memory used by process
RES Nonswapped physical memory used by process
SHR Shared memory used by process
S Process status
CPU Percentage of CPU consumption by the process since the last screen refresh
MEM Percentage of physical memory the process is consuming
TIME Total CPU time used by process
TIME+ Total CPU time, showing hundredths of seconds, used by process
COMMAND Command line used to start a process

You can also run top using the -b (batch mode) option and send the output to a file for later analysis:

$ top -b > tophat.out

While running in batch mode, the top command will run until you kill it (with Ctrl+C) or until it reaches a specified number of iterations. You could run the previous top command in batch mode, with a combination of nohup and & to keep it running, regardless of whether you were logged in to the system. The danger there is that you might forget about it and eventually create a very large output file (and an angry SA).
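
One way to guard against that is to cap the number of iterations with the -n option. This sketch (values are illustrative) samples every 10 seconds, for 60 iterations, and keeps running after you log out:

$ nohup top -b -d 10 -n 60 > tophat.out &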

If you have a particular process that you’re interested in monitoring, use the -p option to monitor a PID or the -u option to monitor a username. You can also specify a delay and the number of iterations, using the -d and -n options. The following example monitors the oracle user with a delay of 5 seconds, for 25 iterations:

$ top -u oracle -d 5 -n 25

Tip   Use the man top or top --help command to list all the options available in your OS version.

Mapping an Operating System Process to an SQL Statement

When identifying OS processes, it’s useful to view which processes are consuming the greatest amount of CPU. If the resource hog is a database process, it’s also useful to map the OS process to a database job or query. To determine the ID of the processes consuming the most CPU resources, use a command such as ps, like so:

$ ps -e -o pcpu,pid,user,tty,args | sort -n -k 1 -r | head

Here is a snippet of the output:

14.6 24875 oracle   ?        oracleo12c (DESCRIPTION=(LOCAL=YES)(ADDRESS=...
 0.8 21613 oracle   ?        ora_vktm_o12c
 0.1 21679 oracle   ?        ora_mmon_o12c

From the output, you can see that the OS process 24875 is the top consumer of CPU resources. The output also indicates that the process is associated with the o12c database. With that information in hand, log in to the appropriate database, and use the following SQL statement to determine what type of program is associated with the OS process 24875:

select
  'USERNAME : ' || s.username|| chr(10) ||
  'OSUSER   : ' || s.osuser  || chr(10) ||
  'PROGRAM  : ' || s.program || chr(10) ||
  'SPID     : ' || p.spid    || chr(10) ||
  'SID      : ' || s.sid     || chr(10) ||
  'SERIAL#  : ' || s.serial# || chr(10) ||
  'MACHINE  : ' || s.machine || chr(10) ||
  'TERMINAL : ' || s.terminal
from v$session s,
     v$process p
where s.paddr = p.addr
and   p.spid  = &PID_FROM_OS;

When you run the command, SQL*Plus will prompt you for the value to use in place of &PID_FROM_OS. In this example, you’ll enter 24875. Here is the relevant output:

USERNAME : MV_MAINT
OSUSER   : oracle
PROGRAM  : sqlplus@speed2 (TNS V1-V3)
SPID     : 24875
SID      : 111
SERIAL#  : 899
MACHINE  : speed2
TERMINAL : pts/4

In this output the PROGRAM value indicates that an SQL*Plus session is the program consuming the inordinate amount of resources on the server. Next, run the following query to display the SQL statement associated with the OS PID (in this example, the server process identifier [SPID] is 24875):

select
  'USERNAME : ' || s.username || chr(10) ||
  'OSUSER   : ' || s.osuser   || chr(10) ||
  'PROGRAM  : ' || s.program  || chr(10) ||
  'SPID     : ' || p.spid     || chr(10) ||
  'SID      : ' || s.sid      || chr(10) ||
  'SERIAL#  : ' || s.serial#  || chr(10) ||
  'MACHINE  : ' || s.machine  || chr(10) ||
  'TERMINAL : ' || s.terminal || chr(10) ||
  'SQL TEXT : ' || sa.sql_text
from v$process p,
v$session s,
v$sqlarea sa
where p.addr = s.paddr
and s.username is not null
and s.sql_address = sa.address(+)
and s.sql_hash_value = sa.hash_value(+)
and p.spid= &PID_FROM_OS;

The results show the resource-consuming SQL as part of the output, in the SQL TEXT column. Here is a snippet of the output:

USERNAME : MV_MAINT
OSUSER   : oracle
PROGRAM  : sqlplus@speed2 (TNS V1-V3)
SPID     : 24875
SID      : 111
SERIAL#  : 899
MACHINE  : speed2
TERMINAL : pts/4
SQL TEXT : select a.table_name from dba_tables a,dba_indexes b,dba_objects c...

When you run multiple databases on one server and are experiencing server performance issues, it can sometimes be difficult to pinpoint which database and associated process are causing the problems. In these situations, you have to use an OS tool to identify the top resource-consuming sessions on the system.

In a Linux/Unix environment, you can use utilities such as ps, top, or vmstat to identify top-consuming OS processes. The ps utility is handy because it lets you identify processes consuming the most CPU or memory. The previous ps command identified the top-consuming CPU processes. Here, it’s used to identify the top Oracle memory–using processes:

$ ps -e -o pmem,pid,user,tty,args | grep -i oracle | sort -n -k 1 -r | head

Once you have identified a top-consuming process associated with a database, you can query the data dictionary views, based on the SPID, to identify what the database process is executing.

OS WATCHER

Oracle provides a collection of scripts that gather and store metrics for CPU, memory, disk, and network usage. On Linux/Unix systems, the OS Watcher tool suite automates the gathering of statistics, using tools such as top, vmstat, iostat, mpstat, netstat, and traceroute. This utility also has an optional graphical component for visually displaying performance metrics.

You can obtain OS Watcher from Oracle’s MOS web site. For the Linux/Unix version, see MOS note 301137.1 or the document titled “OS Watcher User Guide.” For details on the Windows version of OS Watcher, see MOS note 433472.1.

Finding Resource-Intensive SQL Statements

One of the best ways to isolate a poorly performing query is to have a user or developer complain about a specific SQL statement. In this situation there is no detective work involved. You can directly pinpoint the SQL query that is in need of tuning.

However, you often don’t have the luxury of someone telling you exactly where to look when investigating performance issues. There are a number of methods for determining which SQL statements are consuming the most resources in a database:

  • Real-time execution statistics (available starting with Oracle Database 11g)
  • Near real-time statistics
  • Oracle performance reports

These techniques are described in the next several sections.

Monitoring Real-Time SQL Execution Statistics

If you’re using Oracle Database 11g or higher, you can use the following query against the V$SQL_MONITOR view to monitor the near real-time resource consumption of SQL queries:

select * from (
select a.sid session_id, a.sql_id
,a.status
,a.cpu_time/1000000 cpu_sec
,a.buffer_gets, a.disk_reads
,b.sql_text sql_text
from v$sql_monitor a
    ,v$sql b
where a.sql_id = b.sql_id
order by a.cpu_time desc)
where rownum <=20;

The output for this query doesn’t fit easily onto a page. Here is a subset of the output:

SESSION_ID SQL_ID        STATUS      CPU_SEC BUFFER_GETS  DISK_READS SQL_TEXT
---------- ------------- --------- ---------- ----------- ---------- ---------------
       139 d07nngmx93rq7 DONE          331.88        5708       3490 select count(*)
       130 9dtu8zn9yy4uc EXECUTING      11.55        5710        248 select task_name

In the query an inline view is used to first retrieve all records and organize them by CPU_TIME, in descending order. The outer query then limits the result set to the top 20 rows, using the ROWNUM pseudocolumn. You can modify the previous query to order the results by the statistic of your choice or to display only the queries that are currently executing. For example, the next SQL statement monitors currently executing queries, ordered by the number of disk reads:

select * from (
select a.sid session_id, a.sql_id, a.status
,a.cpu_time/1000000 cpu_sec
,a.buffer_gets, a.disk_reads
,substr(b.sql_text,1,35) sql_text
from v$sql_monitor a
    ,v$sql b
where a.sql_id = b.sql_id
and   a.status='EXECUTING'
order by a.disk_reads desc)
where rownum <=20;

The statistics in V$SQL_MONITOR are updated every second, so you can view resource consumption as it changes. These statistics are gathered by default if an SQL statement runs in parallel or consumes more than 5 seconds of CPU or I/O time.

The V$SQL_MONITOR view includes a subset of statistics contained in the V$SQL, V$SQLAREA, and V$SQLSTATS views. The V$SQL_MONITOR view displays real-time statistics for each execution of a resource-intensive SQL statement, whereas V$SQL, V$SQLAREA, and V$SQLSTATS contain cumulative sets of statistics resulting from several executions of an SQL statement.

Once the SQL statement execution ends, the runtime statistics are not immediately flushed from V$SQL_MONITOR. Depending on activity in your database, the statistics can be available for some time. If you have a very active database, however, the statistics can potentially be flushed soon after the query finishes.

Tip   You can uniquely identify the execution of an SQL statement in V$SQL_MONITOR from a combination of the following columns: SQL_ID, SQL_EXEC_START, SQL_EXEC_ID.

You can also query views such as V$SQLSTATS to determine which SQL statements are consuming an inordinate amount of resources. For example, use the following query to identify the ten most resource-intensive queries, based on CPU time:

select * from(
select s.sid, s.username, s.sql_id
,sa.elapsed_time/1000000, sa.cpu_time/1000000
,sa.buffer_gets, sa.sql_text
from v$sqlarea sa
    ,v$session s
where s.sql_hash_value = sa.hash_value
and   s.sql_address    = sa.address
and   s.username is not null
order by sa.cpu_time desc)
where rownum <= 10;

In the prior query an inline view is used to first retrieve all records and sort the output by CPU_TIME, in descending order. The outer query then limits the result set to the top ten rows, using the ROWNUM pseudocolumn. The query can be easily modified to sort by a column other than CPU_TIME. For instance, if you want to report resource usage by BUFFER_GETS, simply substitute BUFFER_GETS for CPU_TIME in the ORDER BY clause. The CPU_TIME column is calculated in microseconds; to convert it to seconds, it is divided by 1,000,000.

Tip   Keep in mind that V$SQLAREA contains statistics that are cumulative for the duration of a given session. So, if a session runs an identical query several times, the statistics for that connection will be the total for all the runs of the query. In contrast, V$SQL_MONITOR shows statistics that have accumulated for the current run of a given SQL statement. Therefore, each time a query runs, new statistics are reported for that query in V$SQL_MONITOR.

Running Oracle Diagnostic Utilities

Oracle provides several utilities for diagnosing database performance issues:

  • Automatic workload repository (AWR)
  • Automatic database diagnostic monitor (ADDM)
  • Active session history (ASH)
  • Statspack

AWR, ADDM, and ASH were introduced in Oracle Database 10g. These tools provide advanced reporting capabilities that allow you to troubleshoot and resolve performance issues, but they require an extra license from Oracle. The older Statspack utility is free and requires no license.

All these tools rely heavily on the underlying V$ dynamic performance views. Oracle maintains a vast collection of these views, which track and accumulate metrics of database performance. For example, if you run the following query, you’ll notice that for Oracle Database 12c, there are approximately 700 V$ views:

SQL> select count(*) from dictionary where table_name like 'V$%';
 
  COUNT(*)
----------
       714

The Oracle performance utilities rely on periodic snapshots gathered from these internal performance views. Two of the most useful views, with regard to performance statistics, are V$SYSSTAT and V$SESSTAT. The V$SYSSTAT view offers more than 400 types of database statistics and contains information about the entire database, whereas the V$SESSTAT view contains statistics on individual sessions. A few of the values in the V$SYSSTAT and V$SESSTAT views represent the current usage of the resource. These values are

  • opened cursors current
  • logins current
  • session cursor cache current
  • work area memory allocated

The rest of the values are cumulative. The values in V$SYSSTAT are cumulative for the entire database, from the time the instance was started. The values in V$SESSTAT are cumulative per session, from the time the session was started. Some of the more important performance-related cumulative values are

  • CPU used
  • consistent gets
  • physical reads
  • physical writes

For the cumulative statistics the way to measure periodic usage is to note the value of a statistic at a starting point, then note the value again at a later point in time and capture the delta. This is the approach used by the Oracle performance utilities, such as AWR and Statspack. Periodically, Oracle will take snapshots of the dynamic performance views and store them in a repository.

The following sections detail how to access AWR, ADDM, ASH, and Statspack via the SQL command line.

Tip   You can also access AWR, ADDM, and ASH from Enterprise Manager. If you have access to Enterprise Manager, you will find the interface fairly intuitive and visually helpful.

Using AWR

An AWR report is good for viewing the entire system’s performance and identifying the top resource-consuming SQL queries. Run the following script to generate an AWR report:

SQL> @?/rdbms/admin/awrrpt

From the AWR output, you can identify top resource-consuming statements by examining the “SQL Ordered by Elapsed Time” or “SQL Ordered by CPU Time” section of the report. Here is some sample output:

SQL ordered by CPU Time                  DB/Inst: O12C/o12c  Snaps: 1668-1669
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> %Total - CPU Time      as a percentage of Total DB CPU
-> %CPU   - CPU Time      as a percentage of Elapsed Time
-> %IO    - User I/O Time as a percentage of Elapsed Time
-> Captured SQL account for   3.0E+03% of Total CPU Time (s):               0
-> Captured PL/SQL account for  550.9% of Total CPU Time (s):               0
 
    CPU                   CPU per           Elapsed
  Time (s)  Executions    Exec (s) %Total   Time (s)   %CPU    %IO    SQL Id
---------- ------------ ---------- ------ ---------- ------ ------ -------------
       3.2            1       3.24  930.5       10.5   30.9   74.9 93jktd5vtxb98

As of Oracle Database 10g, Oracle will automatically take a snapshot of your database once an hour and populate the underlying AWR tables that store the statistics. By default, 7 days of statistics are retained.
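
If the defaults don’t suit your environment, you can change the snapshot interval and retention with the DBMS_WORKLOAD_REPOSITORY package. For instance, this call (values are illustrative; both parameters are expressed in minutes) sets a 30-minute interval and 15 days of retention:

SQL> exec dbms_workload_repository.modify_snapshot_settings(interval=>30, retention=>15*24*60);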

You can also generate an AWR report for a specific SQL statement by running the awrsqrpt.sql report. When you run the following script, you will be prompted for the SQL_ID of the query of interest:

SQL> @?/rdbms/admin/awrsqrpt.sql

Using ADDM

The ADDM report provides useful information on which SQL statements are candidates for tuning. Use the following SQL script to generate an ADDM report:

SQL> @?/rdbms/admin/addmrpt

Look for the section of the report labeled “SQL Statements Consuming Significant Database Time.” Here is some sample output:

FINDING 2: 29% impact (65043 seconds)
-------------------------------------
SQL statements consuming significant database time were found.
 
   RECOMMENDATION 1: SQL Tuning, 6.7% benefit (14843 seconds)
      ACTION: Investigate the SQL statement with SQL_ID "46cc3t7ym5sx0" for

The ADDM report analyzes data in the AWR tables to identify potential bottlenecks and high resource-consuming SQL queries.

Using ASH

The ASH report allows you to focus on short-lived SQL statements that were run recently and may have executed only briefly. Run the following script to generate an ASH report:

SQL> @?/rdbms/admin/ashrpt

Search the output for the section labeled “Top SQL.” Here is some sample output:

Top SQL with Top Events           DB/Inst: O12C/o12c  (Jan 30 14:49 to 15:14)
 
                                                        Sampled #
                 SQL ID             Planhash        of Executions     % Activity
----------------------- -------------------- -------------------- --------------
Event                          % Event Top Row Source                    % RwSrc
------------------------------ ------- --------------------------------- -------
          dx5auh1xb98k5           1677482778                    1          46.57
CPU + Wait for CPU               46.57 HASH JOIN                           23.53

The previous output indicates that the query is waiting for CPU resources. In this scenario the problem may be that another query is consuming the CPU resources.

When is the ASH report more useful than the AWR or ADDM report? The AWR and ADDM output shows top-consuming SQL in terms of total database time. If the SQL performance problem is transient and short-lived, it may not appear on the AWR and ADDM reports. In these situations an ASH report is more useful.

Using Statspack

If you don’t have a license to use the AWR, ADDM, and ASH reports, the free Statspack utility can help you identify poorly performing SQL statements. Run the following script as SYS to install Statspack:

SQL> @?/rdbms/admin/spcreate.sql

The prior script creates a PERFSTAT user that owns the Statspack repository. Once it’s created, connect as the PERFSTAT user, and run this script to enable the automatic gathering of Statspack statistics:

SQL> @?/rdbms/admin/spauto.sql
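
You can also take a snapshot manually at any time while connected as the PERFSTAT user:

SQL> exec statspack.snap;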

After some snapshots have been gathered, you can run the following script as the PERFSTAT user to create a Statspack report:

SQL> @?/rdbms/admin/spreport.sql

Once the report is created, search for the section labeled “SQL Ordered by CPU.” Here is some sample output:

SQL ordered by CPU  DB/Inst: O12C/o12c  Snaps: 30-31
-> Total DB CPU (s):           1,432
-> Captured SQL accounts for  100.5% of Total DB CPU
-> SQL reported below exceeded  1.0% of Total DB CPU
 
    CPU                  CPU per             Elapsd                     Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   1430.41            1    1430.41   99.9    1432.49             482  690392559
Module: SQL*Plus
select a.table_name from my_tables

Tip   See the ORACLE_HOME/rdbms/admin/spdoc.txt file for Statspack documentation.

Detecting and Resolving Locking Issues

Sometimes, a developer or application user will report that a process that normally takes seconds to run is now taking several minutes and doesn’t appear to be doing anything. In these situations the problem is usually one of the following:

  • A space-related issue (e.g., the archive redo log destination is full and has suspended all transactions).
  • A process has a lock on a table row and is not committing or rolling back, thus preventing another session from modifying the same row.

In this scenario, I first check the alert.log to see if there are any obvious issues that have occurred recently (such as a tablespace that can’t allocate another extent). If there is nothing obvious in the alert.log file, I run an SQL query to look for locking issues. The query listed here is a more sophisticated version of the lock-detecting script introduced in Chapter 3. This query shows information such as the locking session’s SQL statement and the waiting SQL statement:

set lines 80
col blkg_user form a10
col blkg_machine form a10
col blkg_sid form 99999999
col wait_user form a10
col wait_machine form a10
col wait_sid form 9999999
col obj_own form a10
col obj_name form a10
col blkg_sql form a50
col wait_sql form a50
--
select
 s1.username    blkg_user, s1.machine     blkg_machine
,s1.sid         blkg_sid, s1.serial#     blkg_serialnum
,s1.process     blkg_OS_PID
,substr(b1.sql_text,1,50) blkg_sql
,chr(10)
,s2.username    wait_user, s2.machine     wait_machine
,s2.sid         wait_sid, s2.serial#     wait_serialnum
,s2.process     wait_OS_PID
,substr(w1.sql_text,1,50) wait_sql
,lo.object_id   blkd_obj_id
,do.owner       obj_own, do.object_name obj_name
from v$lock          l1
    ,v$session       s1
    ,v$lock          l2
    ,v$session       s2
    ,v$locked_object lo
    ,v$sqlarea       b1
    ,v$sqlarea       w1
    ,dba_objects     do
where s1.sid = l1.sid
and s2.sid = l2.sid
and l1.id1 = l2.id1
and s1.sid = lo.session_id
and lo.object_id = do.object_id
and l1.block = 1
and s1.prev_sql_addr = b1.address
and s2.sql_address = w1.address
and l2.request > 0;

The output from this query does not fit well on one page. When running this query, you will have to format it so that it fits on your screen. Here is some sample output, indicating that the SALES table is locked and that another process is waiting for the lock to be released:

BLKG_USER  BLKG_MACHI  BLKG_SID BLKG_SERIALNUM BLKG_OS_PID
---------- ---------- --------- -------------- ------------------------
BLKG_SQL                                           C WAIT_USER  WAIT_MACHI
-------------------------------------------------- - ---------- ----------
WAIT_SID WAIT_SERIALNUM WAIT_OS_PID
-------- -------------- ------------------------
WAIT_SQL                                           BLKD_OBJ_ID OBJ_OWN
-------------------------------------------------- ----------- ----------
OBJ_NAME
----------
MV_MAINT   speed2            32            487 26216
update sales set sales_amt=100 where sales_id=1      MV_MAINT   speed2
     116            319 25851

This situation is typical when applications don’t explicitly issue a COMMIT or ROLLBACK at appropriate times in the code. This leaves a lock on a row and prevents a transaction from continuing until the lock is released. In this scenario, you can try to locate the user that is blocking the transaction and see if the user needs to click a button that says something like “Commit your changes.” If that’s not possible, you can manually kill one of the sessions. Keep in mind that terminating a session may have unforeseen effects (such as rolling back data that a user thought was committed).

If you decide to kill one of the user sessions, you need to identify the SID and serial number of the session you want to terminate. Once identified, use the ALTER SYSTEM KILL SESSION statement to terminate the session:

SQL> alter system kill session '32,487';

Again, be careful when killing sessions. Ensure that you know the impact of killing a session and thereby rolling back any active transactions currently open in that session.

The other way to kill a session is to use an OS command such as kill. In the prior output, you can identify the OS processes from the BLKG_OS_PID and WAIT_OS_PID columns. Before you terminate a process from the OS, ensure that the process isn’t critical. For this example, to terminate the blocking OS process, first check the blocking PID:

$ ps -ef | grep 26216

Here is some sample output:

oracle   26222 26216  0 16:49 ?        00:00:00 oracleo12c

Next, use the kill command, as shown:

$ kill -9  26216

The kill command will unceremoniously terminate a process. Any open transactions associated with the process will be rolled back by the Oracle process monitor.

Resolving Open-Cursor Issues

The OPEN_CURSORS initialization parameter determines the maximum number of cursors a session can have open. This setting is per session. The default value of 50 is usually too low for any application. When an application exceeds the number of open cursors allowed, the following error is thrown:

ORA-01000: maximum open cursors exceeded

Usually, the prior error is encountered when

  • the OPEN_CURSORS initialization parameter is set too low
  • developers write code that doesn’t close cursors properly

To investigate this issue, first determine the current setting of the parameter:

SQL> show parameter open_cursors;

If the value is less than 300, consider setting it higher. I typically set this value to 1,000 for busy OLTP systems. You can dynamically modify the value while your database is open, as shown:

SQL> alter system set open_cursors=1000;

If you’re using an spfile, consider making the change both in memory and in the spfile, at the same time:

SQL> alter system set open_cursors=1000 scope=both;

After setting OPEN_CURSORS to a higher value, if the application continues to exceed the maximum value, you probably have an issue with code that is not properly closing cursors.

If you work in an environment that has thousands of connections to the database, you may want to view only the top cursor-consuming sessions. The following query uses an inline view and the pseudocolumn ROWNUM to display the top 20 values:

select * from (
select a.value, c.username, c.machine, c.sid, c.serial#
from v$sesstat  a
    ,v$statname b
    ,v$session  c
where a.statistic# = b.statistic#
and   c.sid        = a.sid
and   b.name       = 'opened cursors current'
and   a.value     != 0
and   c.username IS NOT NULL
order by 1 desc,2)
where rownum < 21;

If a single session has more than 1,000 open cursors, then the code is probably written such that the cursors aren’t closing. When the limit is reached, somebody should inspect the application code to determine if a cursor is not being closed.
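
If you want to inspect the SQL associated with a suspect session’s open cursors, you can query V$OPEN_CURSOR for that session’s SID (keeping in mind the caveat in the following tip regarding cursor counts). This sketch prompts for the SID identified by the prior query:

SQL> select sql_id, sql_text from v$open_cursor where sid = &SID;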

Tip   I recommend that you query V$SESSION instead of V$OPEN_CURSOR to determine the number of open cursors. V$SESSION provides a more accurate count of the cursors currently open.

Troubleshooting Undo Tablespace Issues

Problems with the undo tablespace are usually of the following nature:

  • ORA-01555: snapshot too old
  • ORA-30036: unable to extend segment by ... in undo tablespace 'UNDOTBS1'

The prior errors can be caused by many different issues, such as incorrect sizing of the undo tablespace or poorly written SQL or PL/SQL code.

Determining if Undo Is Correctly Sized

Suppose you have a long-running SQL statement that is throwing an ORA-01555: snapshot too old error, and you want to determine if adding space to the undo tablespace might help alleviate the issue. Run this next query to identify potential issues with your undo tablespace. The query checks for issues that have occurred within the last day:

select to_char(begin_time,'MM-DD-YYYY HH24:MI') begin_time
,ssolderrcnt    ORA_01555_cnt, nospaceerrcnt  no_space_cnt
,txncount       max_num_txns, maxquerylen    max_query_len
,expiredblks    blck_in_expired
from v$undostat
where begin_time > sysdate - 1
order by begin_time;

Here is some sample output. Part of the output has been omitted to fit this on the page:

BEGIN_TIME       ORA_01555_CNT NO_SPACE_CNT MAX_NUM_TXNS BLCK_IN_EXPIRED
---------------- ------------- ------------ ------------ ---------------
01-31-2013 08:21             0            0           51               0
01-31-2013 08:31             0            0            0               0
01-31-2013 12:11             0            0          629             256

The ORA_01555_CNT column indicates the number of times your database has encountered the ORA-01555: snapshot too old error. If this column reports a nonzero value, you need to do one or more of the following tasks:

  • Ensure that code does not contain COMMIT statements within cursor loops.
  • Tune the SQL statement throwing the error so that it runs faster.
  • Ensure that you have good statistics (so that your SQL runs efficiently).
  • Increase the UNDO_RETENTION initialization parameter (see the example following this list).
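
For example, if you decide to raise the undo retention to 1 hour (3,600 seconds), the parameter can be changed dynamically; the SCOPE=BOTH clause assumes you’re using an spfile:

SQL> alter system set undo_retention=3600 scope=both;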

The NO_SPACE_CNT column displays the number of times a request for space in the undo tablespace failed. In this example there were no such failures. If the NO_SPACE_CNT column reports a nonzero value, however, you may need to add more space to your undo tablespace.

A maximum of 4 days’ worth of information is stored in the V$UNDOSTAT view. The statistics are gathered every 10 minutes, for a maximum of 576 rows in the table. If you’ve stopped and started your database within the last 4 days, this view will only contain information from the time you last started your database.

Another way to get advice on the undo tablespace sizing is to use the Oracle Undo Advisor, which you can invoke by querying the PL/SQL DBMS_UNDO_ADV package from a SELECT statement. The following query displays the current undo size and the recommended size for an undo retention setting of 900 seconds:

select
  sum(bytes)/1024/1024                  cur_mb_size
 ,dbms_undo_adv.required_undo_size(900) req_mb_size
from dba_data_files
where tablespace_name =
  (select
    value
   from v$parameter
   where name = 'undo_tablespace');

Here is some sample output:

CUR_MB_SIZE REQ_MB_SIZE
----------- -----------
      36864       20897

The output shows that the undo tablespace currently has 36.8GB allocated to it. In the prior query, you used 900 seconds as the amount of time to retain information in the undo tablespace. To retain undo information for 900 seconds, the Oracle Undo Advisor estimates that the undo tablespace should be 20.8GB. In this example the undo tablespace is sized adequately. If it were not sized adequately, you would have to either add space to an existing data file or add a data file to the undo tablespace.
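
A minimal sketch of each fix follows; the data file names, tablespace name, and sizes are illustrative:

SQL> alter database datafile '/u01/dbfile/o12c/undotbs01.dbf' resize 25g;
SQL> alter tablespace undotbs1 add datafile '/u01/dbfile/o12c/undotbs02.dbf' size 5g;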

Here is a slightly more complex example of using the Oracle Undo Advisor to find the required size of the undo tablespace. This example uses PL/SQL to display information about potential issues and recommendations for fixing the problem:

SET SERVEROUT ON SIZE 1000000
DECLARE
 pro    VARCHAR2(200);
 rec    VARCHAR2(200);
 rtn    VARCHAR2(200);
 ret    NUMBER;
 utb    NUMBER;
 retval NUMBER;
BEGIN
  DBMS_OUTPUT.PUT_LINE(DBMS_UNDO_ADV.UNDO_ADVISOR(1));
  DBMS_OUTPUT.PUT_LINE('Required Undo Size (megabytes): ' || DBMS_UNDO_ADV.REQUIRED_UNDO_SIZE (900));
  retval := DBMS_UNDO_ADV.UNDO_HEALTH(pro, rec, rtn, ret, utb);
  DBMS_OUTPUT.PUT_LINE('Problem:   ' || pro);
  DBMS_OUTPUT.PUT_LINE('Advice:    ' || rec);
  DBMS_OUTPUT.PUT_LINE('Rationale: ' || rtn);
  DBMS_OUTPUT.PUT_LINE('Retention: ' || TO_CHAR(ret));
  DBMS_OUTPUT.PUT_LINE('UTBSize:   ' || TO_CHAR(utb));
END;
/

If no issues are found, a 0 will be returned for the retention size. Here is some sample output:

Finding 1:The undo tablespace is OK.
Required Undo Size (megabytes): 64
Problem:   No problem found
Advice:
Rationale:
Retention: 0
UTBSize:   0

Viewing SQL That Is Consuming Undo Space

Sometimes, a piece of code does not commit properly, which results in large amounts of space being allocated in the undo tablespace and never being released. Sooner or later, you’ll get the ORA-30036 error, indicating that the tablespace can’t extend. Usually, the first time a space-related error is thrown, I simply increase the size of one of the data files associated with the undo tablespace.

However, if an SQL statement continues to run and fills up the newly added space, then the issue is probably with a poorly written application. For example, a developer may not have appropriate commit statements in the code.

In these situations it’s helpful to identify which users are consuming space in the undo tablespace. Run this query to report basic information about the space allocated on a per-user basis:

select s.sid, s.serial#, s.osuser, s.logon_time
,s.status, s.machine
,t.used_ublk, t.used_ublk*16384/1024/1024 undo_usage_mb
from v$session     s
    ,v$transaction t
where t.addr = s.taddr;
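
Note that the multiplier of 16384 in the prior query assumes a 16KB undo tablespace block size. If you don’t want to hardcode that value, here’s a variation (a sketch) that derives the block size from DBA_TABLESPACES:

select s.sid, s.serial#, s.osuser, s.logon_time
,s.status, s.machine
,t.used_ublk, t.used_ublk*ts.block_size/1024/1024 undo_usage_mb
from v$session       s
    ,v$transaction   t
    ,dba_tablespaces ts
where t.addr = s.taddr
and ts.tablespace_name =
  (select value from v$parameter where name = 'undo_tablespace');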

If you want to view the SQL statement associated with a user consuming undo space, then join to V$SQL, as shown:

select s.sid, s.serial#, s.osuser, s.logon_time, s.status
,s.machine, t.used_ublk
,t.used_ublk*16384/1024/1024 undo_usage_mb
,q.sql_text
from v$session     s
    ,v$transaction t
    ,v$sql         q
where t.addr = s.taddr
and s.sql_id = q.sql_id;

If you need more information, such as the name and status of the rollback segment, run a query that joins to the V$ROLLNAME and V$ROLLSTAT views, like so:

select s.sid, s.serial#, s.username, s.program
,r.name undo_name, rs.status
,rs.rssize/1024/1024 undo_seg_size_mb
,rs.extents
from v$session     s
    ,v$transaction t
    ,v$rollname    r
    ,v$rollstat    rs
where s.taddr = t.addr
and t.xidusn  = r.usn
and r.usn     = rs.usn;

The prior queries allow you to pinpoint which users are responsible for space allocated within the undo tablespace. This can be especially useful when there is code that is not committing at appropriate times and that is excessively consuming undo space.
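
Once you’ve pinpointed an offending session, and if terminating it is appropriate in your environment, you can kill it by plugging the SID and SERIAL# values reported by the prior queries into an ALTER SYSTEM statement (the values shown here are hypothetical):

SQL> alter system kill session '1234,56789';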

Handling Temporary Tablespace Issues

Issues with temporary tablespaces are somewhat easy to spot. For example, when the temporary tablespace runs out of space, the following error will be thrown:

ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

When you see this error, you need to determine whether the temporary tablespace is undersized or whether a rare, runaway SQL query has temporarily consumed an inordinate amount of temporary space. Both of these issues are discussed in the following sections.

Determining if Temporary Tablespace Is Sized Correctly

The temporary tablespace is used as a sorting area on disk when a process has consumed its available sort memory and needs more space. Operations that require a sorting area include the following:

  • Index creation
  • SQL sorting operations (ORDER BY, GROUP BY, DISTINCT, and so on)
  • Temporary tables and indexes
  • Temporary LOBs
  • Temporary B-trees

There is no exact formula for determining whether your temporary tablespace is sized correctly. It depends on the number and types of queries, on index build and parallel operations, and on the size of your memory sort area (PGA). You’ll have to monitor your temporary tablespace while there is a load on your database to establish its usage patterns. If you are using Oracle Database 11g or higher, run the following query to show both the allocated and free space within the temporary tablespace:

select tablespace_name
,tablespace_size/1024/1024 mb_size
,allocated_space/1024/1024 mb_alloc
,free_space/1024/1024      mb_free
from dba_temp_free_space;

Here is some sample output:

TABLESPACE_NAME    MB_SIZE   MB_ALLOC    MB_FREE
--------------- ---------- ---------- ----------
TEMP                   200        200        170

The FREE_SPACE (MB_FREE) column shows the total free space available, including allocated space that is currently available for reuse. If this value drops to near 0, SQL operations in your database are consuming most of the available temporary space.

If you are using Oracle Database 10g, run this query to view the space being used in your temporary tablespace:

select tablespace_name
,sum(bytes_used)/1024/1024 mb_used
from v$temp_extent_pool
group by tablespace_name;

Here is some sample output:

TABLESPACE_NAME    MB_USED
--------------- ----------
TEMP                   120

If the amount used is getting near your current allocated amount, you may need to allocate more space to the temporary tablespace data files. Run the following query to view the temporary data file names and allocated sizes:

SQL> select name, bytes/1024/1024 mb_alloc from v$tempfile;

Here is some typical output:

NAME                                       MB_ALLOC
---------------------------------------- ----------
/u01/dbfile/o12c/temp01.dbf                     400
/u01/dbfile/o12c/temp02.dbf                     100
/u01/dbfile/o12c/temp03.dbf                     100

When first creating a database, if I have no idea as to the “correct” size of the temporary tablespace, I’ll usually size this tablespace at approximately 2GB. If I’m building a data warehouse–type database, I might size the temporary tablespace at approximately 20GB. You’ll have to monitor your temporary tablespace with the appropriate SQL and adjust the size as necessary.
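
For example, here’s how you might create a 2GB temporary tablespace when first creating a database (the file name and location here are hypothetical):

SQL> create temporary tablespace temp tempfile '/u01/dbfile/o12c/temp01.dbf' size 2g;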

Viewing SQL That Is Consuming Temporary Space

When Oracle throws the ORA-01652: unable to extend temp error, this may be an indication that your temporary tablespace is too small. However, Oracle may throw that error if it runs out of space because of a one-time event, such as a large index build. You’ll have to decide whether a one-time index build or a query that consumes large amounts of sort space in the temporary tablespace warrants adding space.

To view the space a session is using in the temporary tablespace, run this query:

SELECT s.sid, s.serial#, s.username
,p.spid, s.module, p.program
,SUM(su.blocks) * tbsp.block_size/1024/1024 mb_used
,su.tablespace
FROM v$sort_usage    su
    ,v$session       s
    ,dba_tablespaces tbsp
    ,v$process       p
WHERE su.session_addr = s.saddr
AND   su.tablespace   = tbsp.tablespace_name
AND   s.paddr         = p.addr
GROUP BY
 s.sid, s.serial#, s.username, p.spid, s.module,
 p.program, tbsp.block_size, su.tablespace
ORDER BY s.sid;
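
To see the SQL text behind the temporary space consumption, you can extend the prior query with a join to V$SQL. This is a sketch that assumes your release’s V$SORT_USAGE view exposes the SQL_ID column (available in recent versions of Oracle):

SELECT s.sid, s.serial#, s.username
,SUM(su.blocks) * tbsp.block_size/1024/1024 mb_used
,su.tablespace
,q.sql_text
FROM v$sort_usage    su
    ,v$session       s
    ,dba_tablespaces tbsp
    ,v$sql           q
WHERE su.session_addr = s.saddr
AND   su.tablespace   = tbsp.tablespace_name
AND   su.sql_id       = q.sql_id
GROUP BY
 s.sid, s.serial#, s.username, tbsp.block_size,
 su.tablespace, q.sql_text
ORDER BY mb_used DESC;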

If you determine that you need to add space, you can either resize an existing data file or add a new one. To resize a temporary tablespace data file, use the ALTER DATABASE TEMPFILE...RESIZE statement. The following command resizes a temporary data file to 12GB:

SQL> alter database tempfile '/u01/dbfile/o12c/temp02.dbf' resize 12g;

You can add a data file to a temporary tablespace, as follows:

SQL> alter tablespace temp add tempfile '/u02/dbfile/o12c/temp04.dbf' size 2g;

Summary

A senior DBA must be adept at efficiently determining the source of database unavailability and performance problems. The ability to identify and resolve problems defines a professional-level DBA. Anyone can google a topic (there’s nothing worse than being on a trouble call with a manager who is googling and recommending random solutions). Finding the appropriate solution and confidently applying it in a production database environment is how you add tremendous value.

Diagnosing issues sometimes requires some system and network administrator skills. Additionally, an effective DBA must know how to leverage the Oracle data dictionary to identify problems. As part of your strategy, you should also proactively monitor for the common sources of database unavailability. Ideally, you’ll be aware of the problem before anybody else and will proactively solve the issue.

No book can cover every troubleshooting activity. This chapter includes some of the most common techniques for identifying problems and dealing with them. Often, basic OS utilities will help you ascertain the source of a hung database. In almost every scenario the alert.log and corresponding trace files should be inspected. Finding the root cause of a problem is often the hardest task. Use a consistent and methodical approach, and you’ll be much more successful in diagnosing and resolving issues.

I have tried to convey techniques and methods that will help you survive even the most chaotic database environments. To summarize these thoughts, a DBA’s manifesto, of sorts:

  • Automate and monitor through scripts and schedulers. Be the first to know when something is broken.
  • Strive for repeatability and efficiency in processes and scripts. Be consistent.
  • Keep it simple. If a module is more than a page long, it’s too long. Don’t implement a script or feature that another DBA won’t be able to understand or maintain. Sometimes, the simple solution is the correct solution.
  • Remain calm regardless of the disaster. Be respectful.
  • Don’t be afraid to seek or take advice. Welcome feedback and criticism. Listen to others. Entertain the thought that you might be wrong.
  • Take advantage of graphical tools, but always know how to implement a feature manually.
  • Expect failure, predict failure, prepare for failure. You don’t know what will go wrong, but you do know something will go wrong. Be happy that you prepared for failure. The best lessons are painful.
  • Test and document your operating procedures. This will help you stay calm(er) and focused when in stressful database-down situations.
  • Don’t write code to implement a feature that the database vendor has already provided a solution for (replication, disaster recovery, backup and recovery, and so on).
  • Become proficient with SQL, procedural SQL, and OS commands. These skills separate the weak from the strong. The best DBAs possess both SA and developer expertise.
  • Continually investigate new features and technology. Learning is a never-ending process. Question everything, reevaluate, and look for a better way. Verify your solutions with repeatable, peer-reviewed tests. Document and freely share your knowledge.
  • Do what it takes to get the job done. You compete with the world now. Work harder and smarter.

The job of a DBA can be quite rewarding. It can also be very painful and stressful. Hopefully, the techniques documented in this book will get you from being mostly stressed to an occasionally happy state.
