In this chapter, we will look at using the tools available for monitoring the performance of an Oracle 11g RAC cluster at the Oracle RAC level. In the next chapter, we'll look at performance monitoring under Linux. At an Oracle Database level, we concentrate on the tools and features provided under the framework of the 11g Manageability Infrastructure. These tools and features range from Oracle Enterprise Manager Database Control to the 11g Automatic Diagnostic Repository (ADR).
Within an Oracle 11g database, the Manageability Infrastructure is a collection of database features that enable you to realize self-tuning and self-management functionality directly within the Oracle database itself. The primary aspect of the Manageability Infrastructure is the automated gathering and processing of base statistics related to database operations. A key enabling feature of the Manageability Infrastructure is the background process called the Manageability Monitor (MMON). This process performs most of the manageability-related tasks on the database, a central part of which is to capture and store database statistics. The MMON process captures base statistics every 60 seconds. It also accesses the SGA directly, so it works efficiently and with minimal overhead. These statistics are compiled and initially stored in memory to be accessed by dynamic performance views. Subsequently, they are flushed to disk in the form of snapshots. Both the in-memory and on-disk collections of statistics are referred to as the Automatic Workload Repository (AWR). A number of performance views with the prefix DBA_HIST_ can be used to access the data; DBA_HIST_SNAPSHOT, for example, shows information regarding all of the snapshots that have been taken.
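For example, a quick way to see which snapshots are available, and for which instances, is to query DBA_HIST_SNAPSHOT directly from SQL*Plus (only a few of the view's columns are shown here):

```sql
SQL> SELECT snap_id, instance_number, begin_interval_time
> FROM dba_hist_snapshot
> ORDER BY snap_id, instance_number;
```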
The MMON process is assisted by the Manageability Monitor Light (MMNL) process, which is used primarily for performing tasks related to the Active Session History (ASH). The ASH contains session statistics sampled at one-second intervals to record what the sessions are waiting for. The information is stored in memory, where it can be accessed in the view V$ACTIVE_SESSION_HISTORY. The in-memory view is implemented as a rolling buffer, and older information is overwritten as newer information is added. Therefore, new information is also written to disk by the MMNL process every 30 minutes. By default, seven days of ASH information is retained on disk, and it can be seen in the view DBA_HIST_ACTIVE_SESS_HISTORY.
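For example, a sketch of querying the in-memory ASH data from SQL*Plus follows; the columns and predicate shown are illustrative, as the view carries many more columns:

```sql
SQL> SELECT session_id, sample_time, event
> FROM v$active_session_history
> WHERE session_state = 'WAITING';
```

The same columns are available historically in DBA_HIST_ACTIVE_SESS_HISTORY for samples already flushed to disk.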
The MMON process also computes and stores metric values based on the statistics themselves, and these can be seen directly in the views V$SYSMETRIC and V$SERVICEMETRIC. For example, V$SYSMETRIC contains information such as the calculated buffer cache hit ratio and user transactions per second.
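For example, the current values of those two metrics can be retrieved as follows (the metric names are given as they appear in 11g):

```sql
SQL> SELECT metric_name, value, metric_unit
> FROM v$sysmetric
> WHERE metric_name IN
> ('Buffer Cache Hit Ratio', 'User Transactions Per Sec');
```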
Given the features enabled by the Manageability Infrastructure, our focus for performance monitoring in an Oracle 11g RAC Database environment is on the tools that take advantage of this infrastructure. Therefore, we will cover performance monitoring using the standard Oracle graphical interface environment called Oracle Enterprise Manager Database Control. We will also review captured performance data with AWR reports and ASH, and then look at using the Automatic Database Diagnostic Monitor (ADDM) and analyzing the performance of individual statements using AWR SQL reports. Next, we will cover performance monitoring using SQL*Plus. Finally, we will look at some of the performance monitoring features and fault diagnosis available with the Automatic Diagnostic Repository (ADR).
Oracle Enterprise Manager 11g Database Control is supplied and installed with the Oracle Database software, so it is available for all Oracle 11g RAC environments. Database Control is installed and configured wholly within the cluster to manage that cluster only, which distinguishes it from the separate Oracle Enterprise Manager 11g Grid Control, which is installed and configured on a standalone management server and can be used to manage multiple Oracle and non-Oracle software environments.
If you are creating a new Oracle 11g database, and you wish to use Database Control, then we recommend that you install and configure it using DBCA. Provided that no Grid Control Management Agent is previously installed, DBCA will provide you with the option to automatically install, configure, and start all of the components you require, including a Management Repository schema owned by the user SYSMAN within the SYSAUX tablespace of the target database, a local Management Service (including an HTTP server), and Management Agents on all nodes.
EM Database Control uses the EM Agents installed on the nodes and associated Management Services to communicate with the database, instances, and other processes. You can check whether these are currently running using the emctl command, assuming you have set your ORACLE_UNQNAME environment variable to your database name beforehand. The following example shows that EM is currently not running:
[oracle@london1 ~]$ emctl status dbconsole
Oracle Enterprise Manager 11g Database Control Release 11.2.0.1.0
Copyright (c) 1996, 2009 Oracle Corporation.  All rights reserved.
https://london1.example.com:1158/em/console/aboutApplication
Oracle Enterprise Manager 11g is not running.
------------------------------------------------------------------
Logs are generated in directory /u01/app/oracle/product/11.2.0/dbhome_1/london1_PROD/sysman/log
If the Management Agent is not running, then you can start it, as follows:
[oracle@london1 ~]$ emctl start dbconsole
Oracle Enterprise Manager 11g Database Control Release 11.2.0.1.0
Copyright (c) 1996, 2009 Oracle Corporation.  All rights reserved.
https://london1.example.com:1158/em/console/aboutApplication
Starting Oracle Enterprise Manager 11g Database Control .......... started.
------------------------------------------------------------------
Logs are generated in directory /u01/app/oracle/product/11.2.0/dbhome_1/london1_PROD/sysman/log
The Management Agent can be stopped again using the stop argument.
By default, DBCA configures Database Control on each node in the cluster. However, after installation, the Database Control Console is only started on the node where you ran DBCA. On the remaining nodes, only the Management Agent component is started, and these nodes upload their data to the Database Control Console node.
If you are upgrading an existing Oracle database, have created a database without the use of DBCA, or wish to reconfigure your Database Control, you can do so using the emca utility.
A wide range of configuration options can be performed with emca. The most straightforward method to install Database Control is to call the command emca, specifying -config dbcontrol db, while also ensuring that you use the -cluster argument.
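On a cluster database, such a call might look like the following; emca will then prompt for values such as the database unique name, the listener port, and the SYS, DBSNMP, and SYSMAN passwords (add -repos create if the SYSMAN repository does not yet exist):

```shell
[oracle@london1 ~]$ emca -config dbcontrol db -cluster
```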
In common with the way you configure Database Control with DBCA, you will be running the Database Control Console on the node where you ran emca only. Subsequently, calling emctl start dbconsole and emctl stop dbconsole will start and stop the Management Service on this node only. The same commands will start and stop the Management Agent only on the remaining nodes in the cluster.
If the Database Control Console could run on only one node, it would clearly not be a high-availability configuration. Therefore, you can reconfigure the Console to start on a different node (or on more than one if you desire) by using the command emca -reconfig dbcontrol -cluster. By default, running this command without additional arguments starts the Console on the node that you ran the command from, and it directs the Management Agents on all other nodes in the cluster to upload their data to this Console. You can customize the process by providing the arguments EM_NODE and EM_SID_LIST to start the Console on the list of nodes specified and instruct the Management Agents listed to upload data to these nodes. You can view the configuration at any time using emca -displayConfig dbcontrol -cluster, as in this example:
[oracle@london1 ~]$ emca -displayConfig dbcontrol -cluster

STARTED EMCA at Feb 15, 2010 5:56:27 PM
EM Configuration Assistant, Version 11.2.0.0.2 Production
Copyright (c) 2003, 2005, Oracle.  All rights reserved.

Enter the following information:
Database unique name: PROD
Service name: PROD1

Do you wish to continue? [yes(Y)/no(N)]: Y
Feb 15, 2010 5:56:42 PM oracle.sysman.emcp.EMConfig perform
INFO: This operation is being logged at /u01/app/oracle/cfgtoollogs/emca/PROD/emca_2010_02_15_17_56_26.log.
Feb 15, 2010 5:56:44 PM oracle.sysman.emcp.EMDBPostConfig showClusterDBCAgentMessage
INFO:
**************** Current Configuration ****************
 INSTANCE     NODE         DBCONTROL_UPLOAD_HOST
 ----------   ----------   ---------------------
 PROD         london1      london1.example.com
 PROD         london2      london1.example.com

Enterprise Manager configuration completed successfully
FINISHED EMCA at Feb 15, 2010 5:56:44 PM
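Building on the customization described above, a hypothetical call to move the Console to london2 and direct the PROD2 agent to upload there might look like this (the node and SID names follow this chapter's example cluster; substitute your own):

```shell
[oracle@london1 ~]$ emca -reconfig dbcontrol -cluster -EM_NODE london2 -EM_SID_LIST PROD2
```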
Finally, if you have also added a new node to the cluster, you may use the command emca -addInst db to add it to the database configuration. Similarly, you can use the command emca -deleteInst db to remove it, as long as you remember to run the command to remove the instance from a different node in the cluster.
Upon logging in to Database Control at the URL for the specified node, such as https://london1.example.com:1158/em, you are presented at the highest level with two tabs to view the database configuration or the cluster configuration, as shown in Figure 12-1. Your browser requires the Adobe Flash plugin to view the EM performance graphs.
The Cluster configuration provides details on the hosts in the cluster, and clicking the Performance link drills down to a view of CPU, Memory, and I/O Utilization on all of the nodes in the cluster. It also provides a summary of the status of those nodes. You can use this Performance page to observe whether the cluster workload is evenly distributed across the cluster, as well as whether particular nodes are more heavily utilized than others. Similarly, the Interconnects link provides information on the Private Interconnect Transfer Rate in MB/sec for the cluster as a whole, and it shows the Total Error rate for the individual nodes. You should use this information in conjunction with the Linux operating system network performance utilities to ensure that the interconnect performance is not having a disproportionate impact on the performance of the cluster.
The Database tab (see Figure 12-2) shows the performance of the cluster from a database perspective, as opposed to that of the host and operating system.
The first two graphs on the Performance tab show the Cluster Host Load Average and the Global Cache Block Access Latency. The third graph shows the number of Average Active Sessions throughout the reporting period. The final graph on the Performance tab shows performance from the perspective of Database Throughput, I/O, Parallel Execution, Services, and Instances.
At the bottom of the Performance tab, there are six Additional Monitoring Links. The page you should use to monitor your inter-instance communication is the Cluster Cache Coherency page, as shown in Figure 12-3.
The Cluster Cache Coherency page includes information on Global Cache Block Access Latency (this information also appears on the Performance tab), Global Cache Block Transfer Rate, and Global Cache Block Transfers and Physical Reads. Selecting the appropriate radio button makes the information available in both tabular numeric and graphical form. The Global Cache Block Access Latency is a useful initial statistic for observing the performance of your cluster. Across releases of RAC, Oracle has typically recommended that the latency should be less than ten milliseconds before investigating potential performance issues; however, ideally you should expect these values to be no more than two milliseconds. If the latency exceeds two milliseconds, we recommend investigating the performance in the context of the workload, as well as looking at the latency of all of the hardware components of your cluster (see Chapter 4 for information on processing your workload more efficiently to reduce inter-instance traffic).
In addition to the overall interconnect latency, the metrics you will refer to most when reviewing your RAC's performance are observable within the Global Cache Block Transfer Rate view. In this view, you can observe the operation of Cache Fusion across the cluster. The constituent metrics are GC CR Blocks Received and GC Current Blocks Received.
The first metric, GC CR Blocks Received, reflects the implementation of read consistency across the entire cluster. In other words, when a SELECT statement is in operation on one node in the cluster, the session will request the data block that contained the data when the query began if another session on any node in the cluster has changed those rows since the query started. If the consistent read request cannot be satisfied from the local buffer cache, but it can be satisfied by shipping a data block from another instance in the cluster, then it is recorded in this metric.
The second metric, GC Current Blocks Received, shows the inter-instance transfer of data blocks that are in a current state. A data block is in a current state when it has been created or modified by an INSERT, UPDATE, or DELETE statement, and the most recent data reflecting that action resides in the buffer cache of an instance. If this current block is requested by and successfully transferred directly to another instance in the cluster, it is recorded in this metric.
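Outside EM, the raw counters behind both metrics can be read per instance from the GV$SYSSTAT view, for example:

```sql
SQL> SELECT inst_id, name, value
> FROM gv$sysstat
> WHERE name IN ('gc cr blocks received', 'gc current blocks received');
```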
The Global Cache Block Transfers and Physical Reads graph compares the number of block transfers across the interconnect with the number of physical reads. Block transfers across the interconnect are usually more efficient than physical reads, so they are favored by RAC, where possible.
The Global Cache Block Transfers and Physical Reads graph is the Active Session History (ASH) from a Cluster Cache Coherency perspective. Figure 12-4 shows an example of this graph. This data can be viewed in real time, or you can drag the shaded box to observe historical data. In the graph directly below on the same page, the workload being processed is shown in relation to the events generated. Clicking a particular Cache Coherency event drills down to a chart and summary data focused on that particular event.
In addition to viewing the Active Session History dynamically with EM, you can also generate reports for the time periods in which you are interested. The Active Session History (ASH) is a component of the AWR repository. In addition to reports that can be viewed with EM, the following types of reports can be generated directly: AWR reports, ASH reports, and Automatic Database Diagnostic Monitor (ADDM) reports. We will detail how to do so in the following sections, which will enable you to generate these reports even if the EM graphical interface is not available in a particular environment.
In this section, we describe the AWR and show how to generate reports from its contents. AWR snapshots of the current values of various database statistics are generated by the MMON background process using non-SQL kernel calls. AWR reporting tools allow you to generate a report in text or HTML format using either EM or SQL*Plus to compare the delta between two snapshots. This can help you determine the database workload during the intervening period.
The AWR is created automatically if you have created the RAC database using DBCA, as detailed in Chapter 10. Alternatively, you can create the AWR manually using the script $ORACLE_HOME/rdbms/admin/catawr.sql.
A set of tables is created for the AWR under the SYS schema in the SYSAUX tablespace, and the AWR is implemented using three distinct sets of tables in the data dictionary. There are five control tables: WRM$_WR_CONTROL, WRM$_DATABASE_INSTANCE, WRM$_SNAPSHOT, WRM$_SNAP_ERROR, and WRM$_BASELINE. There is also a DBA_HIST_% view for each of these tables.
There is also a set of tables to hold the snapshots. These tables have the WRH$ prefix, such as WRH$_SEG_STAT and WRH$_SQL_PLAN. There are approximately 60 of these tables, and each table has a DBA_HIST_% view with a similar name, such as DBA_HIST_SEG_STAT and DBA_HIST_SQL_PLAN.
Baselines are held in a separate set of tables, each of which has the same name as the snapshot table, but with a _BL suffix. For example, the snapshot table is called WRH$_LATCH, while the baseline table is called WRH$_LATCH_BL. There are no DBA_HIST_% views for the baseline tables.
At regular intervals, the MMON background process triggers a snapshot of the current state of the database. The actual snapshot is performed by an MMON slave process named M000. By default, snapshots are performed every 60 minutes and retained in the workload repository for seven days, after which they are automatically purged.
AWR snapshots are performed if the STATISTICS_LEVEL parameter is set to the default value of TYPICAL or to ALL. To disable AWR snapshots, you must set the STATISTICS_LEVEL parameter to BASIC; however, we do not recommend doing so because it disables all other forms of statistics gathering.
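You can confirm the current setting from SQL*Plus before relying on AWR data; the output below is trimmed to the relevant columns:

```sql
SQL> SHOW PARAMETER statistics_level

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------
statistics_level                     string      TYPICAL
```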
You can check the current snapshot interval and retention time with this snippet:
SQL> SELECT snap_interval, retention FROM dba_hist_wr_control;
You can modify these values using the MODIFY_SNAPSHOT_SETTINGS procedure in DBMS_WORKLOAD_REPOSITORY. For example, you can use this line to change the snapshot interval to 30 minutes and set the retention period to two weeks:
SQL> EXECUTE dbms_workload_repository.modify_snapshot_settings -
> (interval => 30, retention => 20160);
Both the interval and retention parameters should be specified in minutes. If you specify an interval of 0, Oracle will set the snapshot interval to 1 year; if you specify a retention period of 0, Oracle will set the retention period to 100 years.
You can take a snapshot manually at any time for the particular instance you are connected to using the CREATE_SNAPSHOT procedure at the SQL*Plus prompt, as in this example:
SQL> EXECUTE dbms_workload_repository.create_snapshot;
This procedure can accept a single parameter specifying the flush level for the snapshot, which is either TYPICAL or ALL. You can also create snapshots from within EM; you can access this functionality by navigating from the Administration page to the Workload Repository page, and then to the Snapshots page.
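For example, to take a manual snapshot at the more detailed flush level:

```sql
SQL> EXECUTE dbms_workload_repository.create_snapshot('ALL');
```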
When you have a stable workload, you can take a baseline, which records the difference between two snapshots. First, identify a suitable pair of snapshot IDs using the DBA_HIST_SNAPSHOT view:
SQL> SELECT snap_id, instance_number, startup_time,
> begin_interval_time, end_interval_time
> FROM dba_hist_snapshot;
Both snapshots must belong to the same instance. They must also have the same instance start-up time. For example, assuming that the start and end snapshots for the period of interest are 234 and 236, you can generate a baseline:
SQL> EXECUTE dbms_workload_repository.create_baseline -
> ( -
> start_snap_id => 234, -
> end_snap_id => 236, -
> baseline_name => 'Morning Peak' -
> );
You can check the current baselines in the DBA_HIST_BASELINE view. Unlike automatic snapshots, which will be purged at the end of the retention period, baselines will not be removed automatically. Baselines can be removed manually using the DROP_BASELINE procedure:
SQL> EXECUTE dbms_workload_repository.drop_baseline -
> ( -
> baseline_name => 'Morning Peak' -
> );
You can extract reports from the AWR tables with the script $ORACLE_HOME/rdbms/admin/awrrpt.sql. As shown in the following code example, this script asks whether you want the report output to be created in HTML or text format. It then asks you to specify the number of days of snapshots that you wish to view. Based on your response, it displays a list of the snapshots taken during this period. The script prompts you for the start and end snapshots for the report, as well as for a file name to write the report to:
SQL> @?/rdbms/admin/awrrpt

Current Instance
~~~~~~~~~~~~~~~~
   DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
  121667269 PROD                1 PROD1

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
Would you like an HTML report, or a plain text report?
Enter 'html' for an HTML report, or 'text' for plain text
Defaults to 'html'
Enter value for report_type:
Type Specified:  html

Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
  121667269         2 PROD         PROD2        london2.example.com
* 121667269         1 PROD         PROD1        london1.example.com

Using 121667269 for database Id
Using          1 for instance number

Specify the number of days of snapshots to choose from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entering the number of days (n) will result in the most recent
(n) days of snapshots being listed.  Pressing <return> without
specifying a number lists all completed snapshots.

Enter value for num_days:

Listing all Completed Snapshots

                                                        Snap
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
PROD1        PROD               110 08 Feb 2010 12:46      1
                                111 08 Feb 2010 14:00      1
...
                                278 15 Feb 2010 10:53      1
                                279 15 Feb 2010 10:58      1

Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap: 278
Begin Snapshot Id specified: 278

Enter value for end_snap: 279
End   Snapshot Id specified: 279

Specify the Report Name
~~~~~~~~~~~~~~~~~~~~~~~
The default report file name is awrrpt_1_278_279.html.  To use this name,
press <return> to continue, otherwise enter an alternative.

Enter value for report_name:
Using the report name awrrpt_1_278_279.html
...
Report written to awrrpt_1_278_279.html
If no file name is specified, a default name based on the start and end snapshots will be used. Keep in mind that you cannot create valid reports for a period during which an instance shutdown has been performed.
The awrrpt.sql script will only generate an AWR report for the instance to which you are connected. In a RAC environment, if you wish to generate an AWR report for another node in the cluster, you should use the awrrpti.sql script instead. This script prompts you for the same information as the standard script. However, it also gives you the option of specifying both the Database ID and the instance. Thus it lets you generate a report for any instance in the cluster without needing to run the command from the individual nodes:
SQL> @?/rdbms/admin/awrrpti
...
Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
  121667269         2 PROD         PROD2        london2.example.com
* 121667269         1 PROD         PROD1        london1.example.com

Enter value for dbid: 121667269
Using 121667269 for database Id
Enter value for inst_num: 2
...
You have now generated an AWR report to review the statistics captured across a period of time for an instance. When reviewing the performance of the cluster, you can and should generate reports for all instances within it. The statistics contained within a report are for one instance only, and the instance and host environment are recorded directly under the section heading, "WORKLOAD REPOSITORY report for":
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
PROD           121667269 PROD1               1 12-Feb-10 13:36 11.2.0.1.0  YES

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
london1.example. Linux x86 64-bit                    8     8       2      15.66
Please take some time to examine the "Load Profile" section of this report. In particular, you should make sure that it covers the expected period and that meaningful activity has taken place on the database during this period.
In the following sections, we concentrate on the differences between an AWR report for a single-instance database and a RAC database. Much of the content of an AWR report is identical for both.
Within an AWR report, there are two main sections containing RAC statistics. The first section appears immediately after the "Top 5 Timed Foreground Events" and includes the Global Cache Load Profile, Global Cache Efficiency Percentages, Global Cache and Enqueue Services Workload Characteristics, Global Cache and Enqueue Services Messaging Statistics, and the Cluster Interconnect. The second section appears at the end of the report, and it includes Global Messaging Statistics, Global CR Served Stats, Global CURRENT Served Stats, Global Cache Transfer Statistics and Transfer Times, Interconnect Latency and Throughput, and Dynamic Remastering Stats.
The "Top 5 Timed Foreground Events" section will normally show the amount of CPU consumed in addition to the top four wait events. For a given workload, this list of events should be relatively stable; you should investigate any significant variation. You will generally see CPU at the top of the list. However, CPU usage is not necessarily an indication of a healthy system, as the application may be CPU-bound. The most common wait events are db file sequential read, db file scattered read, db file parallel read, and log file sync.
For RAC databases, watch for wait events related to the Global Cache Service (GCS), which are identified by the gc prefix and grouped under the Cluster wait class. In a RAC environment, you should expect to see GCS events, but they should not be consuming the majority of the time on the system. The following example shows the Top 5 Timed Foreground Events from a RAC system that is experiencing a degree of inter-instance activity:
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                              588          28.4
enq: TX - row lock contention        41,498         293      7   14.2 Applicatio
log file sync                       126,489         231      2   11.1 Commit
gc current block 2-way              409,784         230      1   11.1 Cluster
gc cr block 2-way                   407,726         220      1   10.6 Cluster
If your global cache events are prominent, and in particular if they appear above CPU time in the report, this is an indication that you should drill down further into the RAC statistics to identify whether the amount of global cache traffic can be reduced.
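One simple way to start that drill-down from SQL*Plus is to rank the global cache wait events by time waited since instance startup (time_waited in this view is reported in centiseconds):

```sql
SQL> SELECT event, total_waits, time_waited
> FROM v$system_event
> WHERE event LIKE 'gc%'
> ORDER BY time_waited DESC;
```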
Immediately after the "Top 5 Timed Foreground Events" section, the first RAC-specific section summarizes the number of instances for the start and end snapshots, as in this example:
                       Begin   End
                       -----   -----
Number of Instances:       2       2
Differing values indicate that an instance has been started or stopped during the snapshot period. Note that even if the number of instances is the same at the start and end of the period, this does not guarantee that no instance has joined or left the cluster during the snapshot period. Starting and stopping instances causes a higher than normal level of inter-instance messaging; therefore, a report covering such a period should be treated with caution.
This section of the AWR report presents a summary of the traffic across the interconnect in terms of blocks exchanged by GCS and messages exchanged by both GCS and Global Enqueue Service (GES). Consider this example:
Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~
                                    Per Second  Per Transaction
                                --------------- ---------------
  Global Cache blocks received:        3,510.45            5.81
    Global Cache blocks served:        3,526.89            5.84
     GCS/GES messages received:        7,447.04           12.32
         GCS/GES messages sent:        6,774.80           11.21
            DBWR Fusion writes:          578.61            0.96
 Estd Interconnect traffic (KB)       59,076.44
Pay particular attention to the estimated interconnect traffic per second value at the end of this section. You should reference this value against the bandwidth of your network interconnect to ensure that your configuration has sufficient capacity for your requirements. In this example, utilization is roughly 59MB/sec of a Gigabit Ethernet interconnect, which has a practical maximum of approximately 100MB/sec. Thus, this example shows a reasonably high level of utilization.
This section of the AWR report shows the percentage of blocks accessed from local cache, remote cache, and disk. In an optimum system, the percentage of local cache accesses should approach 100%, while the percentage of remote cache accesses and disk accesses should both approach 0%. The following example shows a level of efficiency in terms of the percentages that you should expect to see:
Global Cache Efficiency Percentages (Target local+remote 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer access -  local cache %:   94.38
Buffer access - remote cache %:    5.52
Buffer access -         disk %:    0.10
Bear in mind that it usually takes longer to read a block from disk than from a remote cache, although advances in disk technology, such as Solid State Disks (SSDs), are likely to narrow the interconnect's margin of advantage (see Chapter 4 for more information on SSDs). Therefore, you should concentrate on reducing both the amount of disk I/O and the amount of global cache communication.
This section of the AWR report describes the average times required to perform various GCS and GES tasks:
Global Cache and Enqueue Services - Workload Characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     Avg global enqueue get time (ms):      0.3
          Avg global cache cr block receive time (ms):      1.2
     Avg global cache current block receive time (ms):      1.0
            Avg global cache cr block build time (ms):      0.0
             Avg global cache cr block send time (ms):      0.0
      Global cache log flushes for cr blocks served %:      4.5
            Avg global cache cr block flush time (ms):      1.6
         Avg global cache current block pin time (ms):      0.1
        Avg global cache current block send time (ms):      0.0
 Global cache log flushes for current blocks served %:      3.2
       Avg global cache current block flush time (ms):      2.0
Of these statistics, the most significant are the enqueue get time, which should ideally be below 1 ms, and the global cache cr and current block receive times.
The "Global Cache and Enqueue Services" section of the AWR report describes the average time to exchange different categories of inter-instance messages, as in this example:
Global Cache and Enqueue Services - Messaging Statistics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Avg message sent queue time (ms):      0.1
  Avg message sent queue time on ksxp (ms):      0.6
      Avg message received queue time (ms):      0.0
         Avg GCS message process time (ms):      0.0
         Avg GES message process time (ms):      0.0
                % of direct sent messages:    65.85
              % of indirect sent messages:    33.18
            % of flow controlled messages:     0.96
The "Cluster Interconnect" section of the report should be used to confirm that the interconnect network is running across the configured private interconnect, as opposed to a public network:
Cluster Interconnect
~~~~~~~~~~~~~~~~~~~~
                         Begin                                        End
           ----------------------------------------------  -----------------
Interface  IP Address      Pub  Source                      IP   Pub  Src
---------- --------------- ---  --------------------------  ---  ---  ---
eth1       192.168.1.1     N    Oracle Cluster Repository
The "Foreground Wait Class" section enables you to observe wait events grouped by class, as opposed to individual wait events. Thus it provides a simpler diagnosis of where performance issues may lie. In the following example, nearly 40% of all wait time is related to cluster activity; these events are likely to be impacting performance:
Foreground Wait Class                  DB/Inst: PROD/PROD1  Snaps: 278-279
                                                                  Avg
                                       %Time       Total Wait    wait
Wait Class                 Waits       -outs         Time (s)    (ms)  %DB time
-------------------- --------------- ----- ---------------- -------- ---------
Cluster                    1,068,216     0              814        1      39.3
DB CPU                                                  588               28.4
Application                   41,503     0              293        7      14.2
Commit                       126,489     0              231        2      11.1
User I/O                      17,300     0               73        4       3.5
Configuration                    198     0               26      131       1.3
Other                      1,050,015     0               20        0       1.0
Concurrency                    8,493     2                4        0       0.2
Network                      141,279     0                0        0       0.0
System I/O                        87     0                0        0       0.0
The "Wait Event Histogram" section enables you to observe all RAC-related wait events. For each event, it shows the distribution of waits across time buckets, ranging from less than 1 millisecond to longer than a second. In the following example, you can see that the block busy and congested events consume the most time on a per-wait basis:
                                 Total
Event                            Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
...
gc cr block 2-way                407.3  91.3   5.9   2.4    .4    .0    .0    .0
gc cr block busy                 27.9K  25.4  50.1  16.8   3.3   2.3   1.8    .3
gc cr block congested              644  66.5   9.0  15.5   8.9    .2
gc cr block lost                     2                                      100.0
gc cr failure                      377  91.5   5.6   2.7    .3
gc cr grant 2-way                 7309  91.8   5.6   2.2    .3    .0
gc cr grant congested               13  53.8   7.7  15.4  23.1
gc cr multi block request        77.3K  97.4   1.8    .8    .1    .0    .0    .0
gc current block 2-way           410.2  91.1   6.1   2.4    .3    .0    .0    .0    .0
gc current block busy            18.6K  19.3  31.0  16.7  14.4  13.4   3.0   2.1    .0
gc current block congested         669  68.5   7.9  14.5   9.0    .1
gc current block lost                1                                      100.0
...
The AWR report contains a series of sections listing SQL statements in the library cache that exceed predefined thresholds. Each statement is identified by its SQL Id. Of particular interest to RAC users is the "SQL ordered by Cluster Wait Time" section, which details the amount of time a statement was involved in waits for cluster resources. The cluster wait time is reported as a percentage of total elapsed time for the statement, along with the elapsed time, CPU time, and number of executions.
For example, the statement being reported on in the following extract has experienced 384.3 seconds of cluster wait time over 61,113 executions, and the cluster wait time represents 47% of total elapsed time, which is significant enough to warrant further investigation. To do so, you should note the SQL Id, which is used as input to the awrsqrpt.sql script, as explained later in this chapter:
SQL ordered by Cluster Wait Time      DB/Inst: PROD/PROD1  Snaps: 278-279

       Cluster                        Elapsed
 Wait Time (s)  Executions   %Total   Time(s)   %Clu   %CPU    %IO SQL Id
-------------- ----------- -------- --------- ------ ------ ------ -------------
         384.3      61,113     47.8     804.5   47.2   37.8    1.7 16dhat4ta7xs9
Module: wish8.5@loadgen1 (TNS V1-V3)
begin neword(:no_w_id,:no_max_w_id,:no_d_id,:no_c_id,:no_o_ol_cnt,
:no_c_discount,:no_c_last,:no_c_credit,:no_d_tax,:no_w_tax,:no_d_next_o_id,
TO_DATE(:timestamp,'YYYYMMDDHH24MISS')); END;
The RAC-related segment statistics are best observed in combination with the SQL-related data. These statistics include global cache buffer busy, CR blocks received, and current blocks received. In the following example, you can observe that most of the Global Cache Buffer Busy activity is generated by SQL running against the WAREHOUSE table, while the CR and Current Blocks Received are concentrated on the STOCK table:
Segments by Global Cache Buffer Busy  DB/Inst: PROD/PROD1  Snaps: 278-279
                                                                   GC
         Tablespace                      Subobject  Obj.       Buffer    % of
Owner    Name       Object Name          Name       Type         Busy  Capture
-------- ---------- -------------------- ---------- ----- ------------ -------
TPCC     TPCCTAB    WAREHOUSE                       TABLE        6,199   55.66
TPCC     TPCCTAB    INORD                           INDEX        3,440   30.89
TPCC     TPCCTAB    ORDERS                          TABLE          630    5.66
TPCC     TPCCTAB    IORDL                           INDEX          441    3.96
TPCC     TPCCTAB    DISTRICT                        TABLE          251    2.25

Segments by CR Blocks Received        DB/Inst: PROD/PROD1  Snaps: 278-279
                                                                   CR
         Tablespace                      Subobject  Obj.       Blocks
Owner    Name       Object Name          Name       Type     Received   %Total
-------- ---------- -------------------- ---------- ----- ------------ -------
TPCC     TPCCTAB    STOCK                           TABLE      321,799   54.74
TPCC     TPCCTAB    CUSTOMER                        TABLE       97,031   16.51
TPCC     TPCCTAB    INORD                           INDEX       61,230   10.42
TPCC     TPCCTAB    IORDL                           INDEX       38,203    6.50
TPCC     TPCCTAB    ORDERS                          TABLE       17,544    2.98

Segments by Current Blocks Received   DB/Inst: PROD/PROD1  Snaps: 278-279
                                                              Current
         Tablespace                      Subobject  Obj.       Blocks
Owner    Name       Object Name          Name       Type     Received   %Total
-------- ---------- -------------------- ---------- ----- ------------ -------
TPCC     TPCCTAB    STOCK                           TABLE      220,543   45.90
TPCC     TPCCTAB    IORDL                           INDEX       57,394   11.94
TPCC     TPCCTAB    INORD                           INDEX       42,573    8.86
TPCC     TPCCTAB    DISTRICT                        TABLE       38,989    8.11
TPCC     TPCCTAB    CUSTOMER                        TABLE       37,746    7.86
For RAC instances, a subsection of the "Dictionary Cache Stats" section reports the number of GES requests, conflicts, and releases for each object type, as in this example:
Dictionary Cache Stats (RAC)          DB/Inst: PROD/PROD1  Snaps: 278-279
                                   GES          GES          GES
Cache                         Requests    Conflicts     Releases
------------------------- ------------ ------------ ------------
dc_awr_control                       2            2            0
dc_objects                          10            0            6
dc_segments                         10           12            0
dc_sequences                        26            3            0
In this report, you should look for an excessive number of GES conflicts, especially for sequences.
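GES conflicts on dc_sequences can often be reduced by caching a range of sequence values in each instance's SGA. As an illustration only (the sequence name here is hypothetical), the change might look like this:

```sql
-- Hypothetical sequence name; CACHE keeps a range of values per instance,
-- and NOORDER avoids cross-instance coordination of value ordering
ALTER SEQUENCE tpcc.order_seq CACHE 1000 NOORDER;
```

The trade-off is that cached values can be lost on instance shutdown, so this is only appropriate where gaps in the sequence are acceptable.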
For RAC instances, a subsection of the "Library Cache Activity" section reports the number of GES lock requests, pin requests, pin releases, invalidation requests, and invalidations. Here's an example of that subsection:
Library Cache Activity (RAC)          DB/Inst: PROD/PROD1  Snaps: 278-279
                    GES Lock      GES Pin      GES Pin   GES Inval GES Invali-
Namespace           Requests     Requests     Releases    Requests     dations
--------------- ------------ ------------ ------------ ----------- -----------
BODY                       0       69,471       69,471           0           0
CLUSTER                    3            3            3           0           0
DBLINK                    20            0            0           0           0
EDITION                   18           18           18           0           0
INDEX                     84           84           84           0           0
QUEUE                      0          223          223           0           0
RULESET                    0            4            4           0           0
SCHEMA                    16            0            0           0           0
SUBSCRIPTION               0            2            2           0           0
TABLE/PROCEDURE        1,089      211,136      211,136           0           0
TRIGGER                    0           89           89           0           0
In the event that any serious Global Messaging problems appear in the "Top 5 Timed Foreground Events" or in the "Wait Events" sections, the "Global Messaging Statistics" section can be used to investigate further, as in this example:
Global Messaging Statistics           DB/Inst: PROD/PROD1  Snaps: 278-279
Statistic                                    Total   per Second    per Trans
--------------------------------- ---------------- ------------ ------------
acks for commit broadcast(actual)           99,813        328.0          0.5
acks for commit broadcast(logical          191,511        629.2          1.0
broadcast msgs on commit(actual)           154,497        507.6          0.8
broadcast msgs on commit(logical)          159,559        524.3          0.9
broadcast msgs on commit(wasted)            36,032        118.4          0.2
broadcast on commit wait time(ms)            6,075         20.0          0.0
broadcast on commit waits                    9,799         32.2          0.1
The "Global CR Served Statistics" section contains the statistics for the number of blocks served by GCS. These statistics give an overview of GCS activity during the reporting period, and they are derived from the V$CR_BLOCK_SERVER dynamic performance view, as in this example:
Global CR Served Stats                DB/Inst: PROD/PROD1  Snaps: 278-279
Statistic                                   Total
------------------------------ ------------------
CR Block Requests                         566,707
CURRENT Block Requests                     29,833
Data Block Requests                       566,710
Undo Block Requests                         3,317
TX Block Requests                          26,341
The "Global Current Served Statistics" section of the AWR report contains histograms of the GCS operations required to support current read requests, including block pinning, flushing redo to disk, and write operations. The report is derived from the V$CURRENT_BLOCK_SERVER dynamic performance view. You should check this view if you believe that Cache Fusion is causing a bottleneck on systems with high levels of DML activity, as in this example:
Global CURRENT Served Stats           DB/Inst: PROD/PROD1  Snaps: 278-279
-> Pins    = CURRENT Block Pin Operations
-> Flushes = Redo Flush before CURRENT Block Served Operations
-> Writes  = CURRENT Block Fusion Write Operations

Statistic         Total   % <1ms  % <10ms % <100ms    % <1s   % <10s
---------- ------------ -------- -------- -------- -------- --------
Pins            477,147    99.65     0.32     0.02     0.01     0.00
Flushes          15,128    51.65    44.53     3.81     0.00     0.00
Writes          176,102     0.23    15.83    26.38    57.18     0.38
The "Global Cache Transfer Statistics" section of the report gives an overview of Global Cache transfer activity for each class of database buffer, as in the following example. This data is derived from the V$INSTANCE_CACHE_TRANSFER dynamic performance view; it is especially useful in diagnosing whether cache transfer activity is affected by contention or by a high level of system load. For clusters with more than two nodes, these statistics can also be used to determine the impact of messaging between the multiple nodes in the cluster:
Global Cache Transfer Stats           DB/Inst: PROD/PROD1  Snaps: 278-279
                              CR                            Current
             ----------------------------------- -----------------------------------
Inst Block       Blocks      %      %      %        Blocks      %      %      %
  No Class     Received  Immed   Busy Congst      Received  Immed   Busy Congst
---- ----------- -------- ------ ------ ------ ----------- ------ ------ ------
   2 data block   556,975   95.6    3.2    1.2     478,247   95.8    3.7     .5
   2 undo header   27,929   66.7   33.2     .1       1,717   52.6   47.4     .0
   2 undo block     2,894   68.6   31.3     .0           0    N/A    N/A    N/A
   2 Others           181   97.8    1.7     .6         653   98.9     .6     .5
...
Global Cache                          DB/Inst: PROD/PROD1  Snaps: 278-279
-> Blocks Lost, 2-hop and 3-hop Average times in (ms)
-> ordered by CR + Current Blocks Received desc

                            CR Avg Time (ms)           Current Avg Time (ms)
                      ---------------------------- ----------------------------
 Src Block    Lost
Inst Class    Time       Immed     2hop     3hop      Immed     2hop     3hop
---- -------- ----- ---------- -------- -------- ---------- -------- --------
   2 data blo              1.1      1.1      N/A        0.8      0.8      N/A
   2 undo hea              0.6      0.6      N/A        0.7      0.7      N/A
   2 undo blo              0.6      0.6      N/A        N/A      N/A      N/A
   2 others                0.6      0.6      N/A        0.6      0.6      N/A
The "Interconnect Statistics" section enables you to observe the latency and throughput of your interconnect against the potential of the hardware selected for your cluster (see Chapter 4 for more information on this). You can use these statistics to determine whether the physical interconnect infrastructure is a bottleneck in your environment.
The following example shows the Interconnect Statistics section of the report. It provides ping latency statistics, throughput by client statistics, and device statistics:
Interconnect Ping Latency Stats       DB/Inst: PROD/PROD1  Snaps: 278-279

   Target 500B Ping Avg Latency      Stddev  8K Ping Avg Latency      Stddev
 Instance     Count    500B msg    500B msg    Count      8K msg      8K msg
--------- -------- ----------- ----------- -------- ----------- -----------
        1       24         .14         .03       24         .12         .03
        2       24         .49         .56       24         .59         .54

Interconnect Throughput by Client     DB/Inst: PROD/PROD1  Snaps: 278-279
-> Throughput of interconnect usage by major consumers
-> All throughput numbers are megabytes per second

                       Send        Receive
Used By          Mbytes/sec     Mbytes/sec
---------------- ----------- ------------
Global Cache           24.93        24.78
Parallel Query           .00          .00
DB Locks                1.88         2.04
DB Streams               .00          .00
Other                    .00          .00
The "Dynamic Remastering Statistics" section gives you information on the work Oracle performs internally to transfer the mastering, or ownership, of database objects between instances. You also get information on this reconfiguration activity, which you can use to help reduce the amount of messaging required by your workload, as in this example:
Dynamic Remastering Stats             DB/Inst: PROD/PROD1  Snaps: 278-279
-> times are in seconds
-> Affinity objects - objects mastered due to affinity at begin/end snap

                                                    per    Begin      End
Name                              Total     Remaster Op     Snap     Snap
------------------------- ------------- ------------- -------- --------
remaster ops                          1          1.00
remastered objects                    2          2.00
replayed locks received          92,841     92,841.00
replayed locks sent             131,129    131,129.00
resources cleaned                     0          0.00
remaster time (s)                   3.3          3.30
quiesce time (s)                    1.6          1.55
freeze time (s)                     0.0          0.01
cleanup time (s)                    0.4          0.36
replay time (s)                     0.6          0.58
fixwrite time (s)                   0.2          0.19
sync time (s)                       0.6          0.60
affinity objects                    N/A                       7        5
The Active Session History (ASH) is a component of the AWR repository. ASH samples all sessions once per second and records information about those that are active, including what they are currently waiting for. This information is used by ADDM to classify any problems that have been identified.
For example, ADDM may be aware that a significant amount of time is being consumed waiting for I/O through waits for db file sequential read. ASH can identify the specific files and blocks that are causing the waits. This data is used by ADDM to produce more accurate recommendations.
ASH acquires information directly by sampling the session state objects. The default sampling interval is 1,000 milliseconds (1 second). ASH only records information about active sessions. It does not include information about recursive sessions or sessions waiting for idle events.
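Between EM sessions you can also query the in-memory samples directly. The following sketch summarizes the cluster-related waits sampled by ASH over roughly the last hour; the column names are those documented for V$ACTIVE_SESSION_HISTORY in 11g:

```sql
-- Cluster wait events sampled by ASH in approximately the last hour
SELECT event, COUNT(*) AS samples
FROM   v$active_session_history
WHERE  wait_class = 'Cluster'
AND    sample_time > SYSDATE - 1/24
GROUP  BY event
ORDER  BY samples DESC;
```

Because each sample represents roughly one second of active time, the sample counts give an approximate ranking of where cluster wait time is being spent.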
The information collected by ASH is flushed to disk periodically. By default, only one out of every ten active session samples is flushed to disk. Information flushed from the ASH buffer to disk is written to the workload repository history table, WRH$_ACTIVE_SESSION_HISTORY, which is owned by SYS and stored in the SYSAUX tablespace.
Previously, you saw how you can use the EM interface to dynamically view ASH-based information. Additionally, a script is provided to report on the contents of the ASH repository. You can find this script in $ORACLE_HOME/rdbms/admin/ashrpt.sql. The script should be run in SQL*Plus, as in this example:
SQL> @?/rdbms/admin/ashrpt

Current Instance
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
   DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
  121667269 PROD                1 PROD1

Specify the Report Type
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
Enter 'html' for an HTML report, or 'text' for plain text
Defaults to 'html'
Enter value for report_type:
Type Specified: html

Instances in this Workload Repository schema
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
   121667269        2 PROD         PROD2        london2.exam
                                                ple.com
 * 121667269        1 PROD         PROD1        london1.exam
                                                ple.com

Defaults to current database
Using database id: 121667269
Enter instance numbers. Enter 'ALL' for all instances in a RAC
cluster or explicitly specify list of instances (e.g., 1,2,3).
Defaults to current instance.
Using instance number(s): 1
...
Summary of All User Input
-------------------------
Format         : HTML
DB Id          : 121667269
Inst num       : 1
Begin time     : 15-Feb-10 11:03:01
End time       : 15-Feb-10 11:18:02
Slot width     : Default
Report targets : 0
Report name    : ashrpt_1_0215_1118.html
Much as when generating an AWR report, by default ashrpt.sql generates a report for the local instance only. To generate a report for another instance, you can use the ashrpti.sql script. The script generates output in HTML or text format, and you can specify the duration of the period covered by the report in minutes; the default is 15 minutes. The following example shows the heading section of the generated ASH report:
ASH Report For PROD/PROD1

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
PROD           121667269 PROD1               1 11.2.0.1.0  YES london1.exam

CPUs SGA Size         Buffer Cache       Shared Pool        ASH Buffer Size
---- ---------------- ------------------ ------------------ ------------------
   8 8,155M (100%)    6,656M (81.6%)     1,344M (16.5%)     15.5M (0.2%)
The first section of the ASH report describes the environment in which the report was created, including the database, instance, release, node, and the number of CPUs. It also describes the sizes of the SGA, buffer cache, shared pool, and ASH buffer. The remainder of the report details the information specific to sessions active during the period of time for which the report was generated, and you should look for cluster-related activity within this session information.
The Automatic Database Diagnostic Monitor (ADDM) uses data captured in the AWR to diagnose database performance, identify any problems, and suggest potential solutions. ADDM is built directly into the kernel, which minimizes any performance overhead.
ADDM analyzes database performance holistically. In other words, it considers all activity across the database before making recommendations about specific areas of the workload.
ADDM runs automatically after each AWR snapshot, and the results are saved in the database. If ADDM detects any issues, then alerts are generated that can be inspected in the EM tools.
You can also run an ADDM report manually using the following script:
$ORACLE_HOME/rdbms/admin/addmrpt.sql
This script is similar to those used to generate AWR reports. It prompts for a start and end snapshot, as well as for the name of an output file. If no output file name is specified, then the report will be written to a file called:
addmrpt_<instance>_<start_snapshot>_<stop_snapshot>.txt
For example, the file name with its bracketed information filled in might look like this:
addmrpt_1_278_279.txt
Although the procedure for running ADDM reports is similar to that used with AWR reports, the content of an ADDM report is significantly different.
The DBMS_ADVISOR and DBMS_ADDM packages allow you to control ADDM. The simplest way to access the DBMS_ADVISOR package is from EM, which shows a complete performance overview, including any recommendations, on a single page. The addmrpt.sql script also calls the DBMS_ADVISOR package, allowing you to run ADDM manually; you can also access the DBMS_ADVISOR API directly using PL/SQL calls. In addition to DBMS_ADVISOR, the DBMS_ADDM package enables the generation of reports for all instances of a RAC environment.
ADDM stores information in a set of tables owned by SYS in the SYSAUX tablespace. The base tables have a prefix of WRI$. These tables are accessible through a number of views with the prefix DBA_ADVISOR_.
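For example, a query of the following form against the documented DBA_ADVISOR_FINDINGS view lists the problem findings ADDM has recorded, ordered by impact; treat this as a sketch rather than a complete report:

```sql
-- ADDM problem findings by impact; DBA_ADVISOR_FINDINGS is the documented
-- advisor view, and the 'PROBLEM' filter value is one of its standard types
SELECT task_name, impact, message
FROM   dba_advisor_findings
WHERE  type = 'PROBLEM'
ORDER  BY impact DESC;
```

This is useful for scanning historic ADDM runs without regenerating each report.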
ADDM needs at least two snapshots in the AWR before it can perform any analysis. The ADDM report header describes the analysis period, database and instance names, hostname, and database version:
          ADDM Report for Task 'TASK_875'
          -------------------------------

Analysis Period
---------------
AWR snapshot range from 278 to 279.
Time period starts at 15-FEB-10 10.39.53 AM
Time period ends at 15-FEB-10 10.58.19 AM

Analysis Target
---------------
Database 'PROD' with DB ID 121667269.
Database version 11.2.0.1.0.
ADDM performed an analysis of instance PROD1, numbered 1 and hosted at
london1.example.com.

Activity During the Analysis Period
-----------------------------------
Total database time was 2068 seconds.
The average number of active sessions was 1.87.

Summary of Findings
-------------------
   Description                               Active Sessions      Recommendations
                                             Percent of Activity
   ----------------------------------------  -------------------  ---------------
1  Top SQL Statements                        1.1 | 59.02           8
2  Global Cache Messaging                    .72 | 38.46           1
3  Top Segments by "User I/O" and "Cluster"  .71 | 37.74           5
4  Buffer Busy - Hot Objects                 .29 | 15.73           0
5  Row Lock Waits                            .26 | 14.17           2
6  Commits and Rollbacks                     .22 | 11.72           2
The header specifies the measured time period on which the ADDM analysis is based. It also summarizes the average database load during the snapshot period and provides a summary of the findings.
The DBMS_ADDM package provides similar functionality across multiple instances in a RAC environment. For example, calling DBMS_ADDM.ANALYZE_DB with a task name and the start and end snapshot IDs enables the analysis of all of the instances in the cluster. DBMS_ADDM.ANALYZE_INST analyzes a single instance in the cluster, while DBMS_ADDM.ANALYZE_PARTIAL analyzes a selected subset of instances. For example, the following excerpt shows the generation of a partial report for instances 1, 2, and 3 of a 4-node cluster:
begin
  :name := 'Partial ADDM RAC';
  dbms_addm.analyze_partial(:name, '1,2,3', 1012, 1387);
end;
set long 1000000 pagesize 0;
select dbms_addm.get_report(:name) from dual;
An ADDM report is divided up into findings, which are listed in descending order of their perceived impact on database time. The following example shows a finding from an ADDM report. The finding highlights RAC-based wait events under the "Cluster wait" category. In this case, the recommendation was to investigate an UPDATE statement responsible for a significant proportion of the cluster wait time:
Finding 2: Global Cache Messaging
Impact is 2.62 active sessions, 38.46% of total activity.
---------------------------------------------------------
Inter-instance messaging was consuming significant database time on this
instance.

   Recommendation 1: Application Analysis
   Estimated benefit is .72 active sessions, 38.46% of total activity.
   -------------------------------------------------------------------
   Action
      Look at the "Top SQL Statements" finding for SQL statements consuming
      significant time on Cluster waits. For example, the UPDATE statement
      with SQL_ID "82tfppq8s0dc2" is responsible for 20% of Cluster wait
      during the analysis period.

   Symptoms That Led to the Finding:
   ---------------------------------
   Wait class "Cluster" was consuming significant database time.
   Impact is .74 active sessions, 39.33% of total activity.
If your investigations with AWR and ADDM reports reveal particular SQL statements of interest, you can view these in detail with this script:
$ORACLE_HOME/rdbms/admin/awrsqrpt.sql
You run the script just as you would to generate an AWR report, except that in this case you specify a particular SQL ID. For example, this ID might be identified in the AWR report's SQL sections or in an ADDM report:
SQL> @?/rdbms/admin/awrsqrpt
...
Specify the Begin and End Snapshot Ids
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
Enter value for begin_snap: 278
Begin Snapshot Id specified: 278

Enter value for end_snap: 279
End Snapshot Id specified: 279

Specify the SQL Id
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
Enter value for sql_id: 16dhat4ta7x
...
The SQL report provides a summary of the AWR report capture period, the SQL ID, and the elapsed time. In the following example, the SQL ID identified in the AWR report corresponds to a simple UPDATE statement:
WORKLOAD REPOSITORY SQL Report

Snapshot Period Summary

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
PROD           121667269 PROD1               1 12-Feb-10 13:36 11.2.0.1.0  YES

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:       278 15-Feb-10 10:53:14        55       2.9
  End Snap:       279 15-Feb-10 10:58:18        55       2.9
   Elapsed:                5.07 (mins)
   DB Time:               34.47 (mins)

SQL Summary                           DB/Inst: PROD/PROD1  Snaps: 278-279

                Elapsed
   SQL Id      Time (ms)
------------- ----------
g5u7xuchhfu62    293,546
Module: wish8.5@loadgen1 (TNS V1-V3)
UPDATE WAREHOUSE SET W_YTD = W_YTD + :B2 WHERE W_ID = :B1
...
The remainder of the report details the SQL Plan statistics, the Execution Plan, and the full SQL Statement, which is useful when longer statements are truncated within the AWR report.
In our examples, we have used the information from the AWR to pinpoint a workload with significant cluster wait class activity. This activity is due to several sessions on multiple nodes simultaneously processing the same UPDATE statement.
Since early versions of Oracle 7, the two main sources of information for performance tuning an Oracle database have been system statistics and wait events. Both have been extended with additional statistics and events in subsequent releases, and in each release up to and including 11g the kernel has been more thoroughly instrumented. This has resulted in a large increase in the number of latches and wait events that are individually reported. Execution time information is also available in the form of time-based metrics.
Before reviewing the SQL*Plus queries, we will briefly review the relationship between the X$ tables, GV$ views, and V$ views. All are compiled into the Oracle executable and exist for both single-instance and RAC databases.
X$ tables present instance or session memory structures as tables. RAC-specific X$ tables still exist in single-instance databases, but they are not populated there. The full list of X$ tables is given in V$FIXED_TABLE.
GV$ views are also built into the Oracle executable. Therefore, they exist for all databases (including single-instance ones), and they reference the X$ tables.
The standard V$ views are also built into the Oracle executable. In general, V$ views do not directly reference the X$ tables; instead, they use them indirectly through the GV$ views, using queries of the following form:
SELECT <column_list> FROM gv$<view_name>
WHERE inst_id = USERENV('INSTANCE');
Both the GV$ and V$ views are defined in V$FIXED_VIEW_DEFINITION.
In a RAC environment, a number of additional synonyms for GV$ and V$ views are created by the script $ORACLE_HOME/rdbms/admin/catclust.sql.
GV$ views include the instance number. When a GV$ view is queried, the query is executed separately on all active instances using parallel execution, and the results are merged on the instance initiating the query. In a few cases, GV$ views are not meaningful because the underlying views are derived from the control files. Because the control files are shared between all instances, these GV$ views return duplicate sets of rows, one set from each instance. For such views, we recommend using the local V$ dynamic performance views instead.
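As an illustration, redo log groups are recorded in the control file, so querying the GV$ version of a control-file-derived view returns one copy of each record per instance. The sketch below assumes the standard GV$LOG view; the same query against V$LOG returns each log group only once:

```sql
-- Each control-file record is repeated once per instance in GV$LOG,
-- so a two-node cluster returns every log group twice
SELECT inst_id, group#, thread#, status
FROM   gv$log
ORDER  BY inst_id, group#;
```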
System statistics are maintained in each instance. They are reported in a number of dynamic performance views, including V$SYSSTAT, V$SESSTAT, and V$MYSTAT. In Oracle 11g, a subset of the system statistics is also reported at the service level in V$SERVICE_STATS.
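For example, the following sketch reports a few of the documented V$SERVICE_STATS statistics for each service, which can help attribute global cache traffic to a particular workload:

```sql
-- Global cache activity broken down by service; the statistic names
-- are among those documented for V$SERVICE_STATS in 11g
SELECT service_name, stat_name, value
FROM   v$service_stats
WHERE  stat_name IN ('DB time',
                     'gc cr blocks received',
                     'gc current blocks received')
ORDER  BY service_name, stat_name;
```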
Segment statistics were introduced in Oracle 9.2. They provide a powerful mechanism you can use to identify which objects are subject to I/O or contention. Prior to the introduction of segment statistics, this information could only be obtained using an event 10046 level 8 trace or by polling the V$SESSION_WAIT dynamic performance view. The level of granularity is the object, so you may still need to use the 10046 trace if you need to identify a specific block or set of blocks.
Segment statistics are maintained in all databases when the value of the STATISTICS_LEVEL parameter is TYPICAL or ALL. Three statistics report global cache activity at the object level, and you can use them to quickly pinpoint the segments in your database that are experiencing the highest levels of inter-instance activity: gc cr blocks received, gc current blocks received, and gc buffer busy. You should be familiar with the first two statistics from the charts displayed in EM. The third, gc buffer busy, shows the segments that are experiencing contention between instances.
The following example shows the global cache activity for the table segments of a schema named TPCC:
SQL> li
  1  SELECT
  2      table_name AS "Table Name",
  3      gc_buffer_busy AS "Buffer Busy",
  4      gc_cr_blocks_received AS "CR Blocks Received",
  5      gc_current_blocks_received AS "Current Blocks Received"
  6  FROM
  7  (
  8      SELECT table_name FROM dba_tables
  9      WHERE owner = 'TPCC'
 10  ) t,
 11  (
 12      SELECT object_name,value AS gc_buffer_busy
 13      FROM v$segment_statistics
 14      WHERE owner = 'TPCC'
 15      AND object_type = 'TABLE'
 16      AND statistic_name = 'gc buffer busy'
 17  ) ss1,
 18  (
 19      SELECT object_name,value AS gc_cr_blocks_received
 20      FROM v$segment_statistics
 21      WHERE owner = 'TPCC'
 22      AND object_type = 'TABLE'
 23      AND statistic_name = 'gc cr blocks received'
 24  ) ss2,
 25  (
 26      SELECT object_name,value AS gc_current_blocks_received
 27      FROM v$segment_statistics
 28      WHERE owner = 'TPCC'
 29      AND object_type = 'TABLE'
 30      AND statistic_name = 'gc current blocks received'
 31  ) ss3
 32  WHERE t.table_name = ss1.object_name
 33  AND t.table_name = ss2.object_name
 34* AND t.table_name = ss3.object_name
SQL> /

Table Name Buffer Busy CR Blocks Received Current Blocks Received
---------- ----------- ------------------ -----------------------
CUSTOMER            16             546039                  180391
DISTRICT           537              10130                  200367
HISTORY            238              11840                    7000
ITEM                 0                  0                       0
WAREHOUSE        13172              88643                  113499
STOCK              159            1649497                 1215391
ORDERS            1775             102103                   94959

7 rows selected.
Pay particular attention to the buffer busy statistics, which can help you identify contention between instances. In this example, the STOCK table is responsible for the highest degree of inter-instance messaging. However, the WAREHOUSE table is responsible for the highest degree of contention, which confirms the findings previously noted in the sections that covered AWR, ASH, and ADDM.
High contention generally indicates that a block contains a large number of rows and that sessions are updating multiple rows in that block, as opposed to a single row. In that case, you may benefit from distributing the rows across a greater number of blocks by increasing the value of PCTFREE for the object in question.
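As an illustration only (the table name is taken from the earlier examples, and the PCTFREE value is arbitrary), the change might look like this. Note that PCTFREE affects only newly formatted blocks, so the segment must be rebuilt for existing rows to be redistributed:

```sql
-- Arbitrary PCTFREE value for illustration; test the space and
-- performance impact on a non-production system first
ALTER TABLE tpcc.warehouse PCTFREE 40;
-- Rebuild the segment so existing rows are redistributed; any indexes on
-- the table are marked UNUSABLE by the move and must be rebuilt afterwards
ALTER TABLE tpcc.warehouse MOVE;
```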
You can investigate GCS activity in terms of consistent reads and current reads. For example, you can use this formula to monitor the performance of consistent reads:
(gc cr block receive time × 10) / (gc cr blocks received)
The preceding formula returns the average time for consistent read block requests for the instance in milliseconds, and it is the GC CR metric of the Global Cache Block Transfer Rate that you observe within the EM RAC performance views. This is the most important value for determining interconnect performance because it usually reflects the largest constituent component of interconnect traffic.
Requests for global resources for data blocks originate in the buffer cache of the requesting instance. Before a request enters the GCS request queue, Oracle allocates data structures in the SGA to track the state of the request. It also collects statistics on these resource structures. The following query and output shows the consistent reads receive time, the blocks received, and the average latency:
SQL> li
  1  SELECT
  2      gc_cr_block_receive_time AS "Receive Time",
  3      gc_cr_blocks_received AS "Blocks Received",
  4      (gc_cr_block_receive_time * 10) /
  5      gc_cr_blocks_received AS "Average Latency (MS)"
  6  FROM
  7  (
  8      SELECT value AS gc_cr_block_receive_time FROM v$sysstat
  9      WHERE name = 'gc cr block receive time'
 10  ),
 11  (
 12      SELECT value AS gc_cr_blocks_received FROM v$sysstat
 13      WHERE name = 'gc cr blocks received'
 14* )
SQL> /

Receive Time Blocks Received Average Latency (MS)
------------ --------------- --------------------
      365718         3378788           1.08239404
The latency of a consistent block request is the time elapsed between the original request and the receipt of the consistent block image at the local instance. Using a Gigabit Ethernet interconnect, this value should normally be less than 2ms and should not exceed 10ms. However, this value can be affected by the system configuration and volume. In this example, you see that the average latency is 1ms; therefore, it lies within expectations. If latencies exceed 10ms, you should first use the Linux operating system utilities, such as netstat (see Chapter 13 for more information on this utility), to determine whether there are any network configuration issues, such as network packet send and receive errors. You should also use operating system utilities such as top and sar to measure the load on the nodes themselves.
If you have double-checked that the interconnect is configured correctly, and you are still experiencing high average latencies, you might consider reducing the value of DB_FILE_MULTIBLOCK_READ_COUNT. This parameter specifies the number of blocks a process will request in a single operation. The process will have to wait for all blocks to be returned; therefore, higher values may cause longer waits. Before you adjust this parameter, carefully assess its potential impact on the entire workload.
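If you do decide to test a lower value, the change can be made dynamically, as in this sketch (the value 16 is purely illustrative, not a recommendation):

```sql
-- Illustrative value only; assess the effect on full table scans
-- and the overall workload before and after changing it
ALTER SYSTEM SET db_file_multiblock_read_count = 16 SCOPE = BOTH SID = '*';
```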
High average latencies may also be caused by a high number of incoming requests or multiple nodes dispatching requests to the LMS process. You can calculate the average LMS service time as follows:
average LMS service time = average latency
- average time to build consistent read block
- average time to wait for log flush
- average time to send completed block
Average latency is calculated as shown previously. Average time to build a consistent read block is calculated as follows:
gc cr block build time / gc cr blocks served
Similarly, average time spent waiting for a redo log flush is calculated as follows:
gc cr block flush time / gc cr blocks served
Average time to send a completed block is calculated as follows:
gc cr block send time / gc cr blocks served
The following query can be used to calculate the average LMS service time for consistent block reads:
SQL> li
  1  SELECT
  2    average_latency AS "Average Latency",
  3    average_build_time AS "Average Build Time",
  4    average_flush_time AS "Average Flush Time",
  5    average_send_time AS "Average Send Time",
  6    average_latency - average_build_time - average_flush_time - average_send_time
  7      AS "Average LMS Service Time"
  8  FROM
  9  (
 10    SELECT
 11      (gc_cr_block_receive_time * 10) / gc_cr_blocks_received AS average_latency,
 12      (gc_cr_block_build_time * 10) / gc_cr_blocks_served AS average_build_time,
 13      (gc_cr_block_flush_time * 10) / gc_cr_blocks_served AS average_flush_time,
 14      (gc_cr_block_send_time * 10) / gc_cr_blocks_served AS average_send_time
 15    FROM
 16    (
 17      SELECT value AS gc_cr_block_receive_time FROM v$sysstat
 18      WHERE name = 'gc cr block receive time'
 19    ),
 20    (
 21      SELECT value AS gc_cr_blocks_received FROM v$sysstat
 22      WHERE name = 'gc cr blocks received'
 23    ),
 24    (
 25      SELECT value AS gc_cr_block_build_time FROM v$sysstat
 26      WHERE name = 'gc cr block build time'
 27    ),
 28    (
 29      SELECT value AS gc_cr_block_flush_time FROM v$sysstat
 30      WHERE name = 'gc cr block flush time'
 31    ),
 32    (
 33      SELECT value AS gc_cr_block_send_time FROM v$sysstat
 34      WHERE name = 'gc cr block send time'
 35    ),
 36    (
 37      SELECT value AS gc_cr_blocks_served FROM v$sysstat
 38      WHERE name = 'gc cr blocks served'
 39    )
 40* )
SQL> /

Average Latency Average Build Time Average Flush Time Average Send Time
--------------- ------------------ ------------------ -----------------
Average LMS Service Time
------------------------
     1.11967715         .012249408         .113152249        .030138924
              .964136572
The difference between the average latency time and the sum of the average build, flush, and send times represents the time spent in the LMS service and the time spent transmitting the messages across the interconnect. In the example just shown, the LMS Service Time is within expected boundaries.
You can also use SQL*Plus to determine the current block activity, which is the GC Current metric of the Global Cache Block Transfer Rate from the EM RAC performance views. You can calculate the overall average latency involved in processing requests for current blocks using the following formula:
(gc current block receive time X 10) / (gc current blocks received)
The latency is returned using the following query:
SQL> li
  1  SELECT
  2    gc_current_block_receive_time AS "Receive Time",
  3    gc_current_blocks_received AS "Blocks Received",
  4    (gc_current_block_receive_time * 10) / gc_current_blocks_received
  5      AS "Average (MS)"
  6  FROM
  7  (
  8    SELECT value AS gc_current_block_receive_time
  9    FROM v$sysstat
 10    WHERE name = 'gc current block receive time'
 11  ),
 12  (
 13    SELECT value AS gc_current_blocks_received
 14    FROM v$sysstat
 15    WHERE name = 'gc current blocks received'
 16* )
SQL> /

Receive Time Blocks Received Average (MS)
------------ --------------- ------------
      462110         4514548   1.02360192
You can calculate the amount of overall latency that can be attributed to the LMS process using the following snippet:
average LMS service time = average latency - average time to pin current blocks - average time to wait for log flush - average time to send completed block
The average latency is calculated as follows:
gc current block receive time / gc current blocks received
Similarly, the average time to pin current blocks is calculated as follows:
gc current block pin time / gc current blocks served
The average time spent waiting for a redo log flush is calculated as follows:
gc current block flush time / gc current blocks served
And the average time to send a completed block is calculated as follows:
gc current block send time / gc current blocks served
You can use the following query to calculate the average LMS service time for current block reads:
SQL> li
  1  SELECT
  2    average_latency AS "Average Latency",
  3    average_pin_time AS "Average Pin Time",
  4    average_flush_time AS "Average Flush Time",
  5    average_send_time AS "Average Send Time",
  6    average_latency - average_pin_time - average_flush_time - average_send_time
  7      AS "Average LMS Service Time"
  8  FROM
  9  (
 10    SELECT
 11      (gc_current_block_receive_time * 10) / gc_current_blocks_received
 12        AS average_latency,
 13      (gc_current_block_pin_time * 10) / gc_current_blocks_served
 14        AS average_pin_time,
 15      (gc_current_block_flush_time * 10) / gc_current_blocks_served
 16        AS average_flush_time,
 17      (gc_current_block_send_time * 10) / gc_current_blocks_served
 18        AS average_send_time
 19    FROM
 20    (
 21      SELECT value AS gc_current_block_receive_time FROM v$sysstat
 22      WHERE name = 'gc current block receive time'
 23    ),
 24    (
 25      SELECT value AS gc_current_blocks_received FROM v$sysstat
 26      WHERE name = 'gc current blocks received'
 27    ),
 28    (
 29      SELECT value AS gc_current_block_pin_time FROM v$sysstat
 30      WHERE name = 'gc current block pin time'
 31    ),
 32    (
 33      SELECT value AS gc_current_block_flush_time FROM v$sysstat
 34      WHERE name = 'gc current block flush time'
 35    ),
 36    (
 37      SELECT value AS gc_current_block_send_time FROM v$sysstat
 38      WHERE name = 'gc current block send time'
 39    ),
 40    (
 41      SELECT value AS gc_current_blocks_served FROM v$sysstat
 42      WHERE name = 'gc current blocks served'
 43    )
 44* )
SQL> /

Average Latency Average Pin Time Average Flush Time Average Send Time
--------------- ---------------- ------------------ -----------------
Average LMS Service Time
------------------------
     1.06117092       .148803422         .104628541        .028298863
              .779440094
High latency values may indicate server or interconnect performance issues; however, you should also review your current block statistics for evidence of contention for data between instances.
GES manages all the non-Cache Fusion intra-instance and inter-instance resource operations. High GES workload request rates can adversely affect performance. To calculate the average global enqueue time in milliseconds, use the following statement:
SQL> li
  1  SELECT
  2    global_enqueue_get_time AS "Get Time",
  3    global_enqueue_gets_sync AS "Synchronous Gets",
  4    global_enqueue_gets_async AS "Asynchronous Gets",
  5    (global_enqueue_get_time * 10) /
  6    (global_enqueue_gets_sync + global_enqueue_gets_async)
  7      AS "Average (MS)"
  8  FROM
  9  (
 10    SELECT value AS global_enqueue_get_time
 11    FROM v$sysstat
 12    WHERE name = 'global enqueue get time'
 13  ),
 14  (
 15    SELECT value AS global_enqueue_gets_sync
 16    FROM v$sysstat
 17    WHERE name = 'global enqueue gets sync'
 18  ),
 19  (
 20    SELECT value AS global_enqueue_gets_async
 21    FROM v$sysstat
 22    WHERE name = 'global enqueue gets async'
 23* )
SQL> /

  Get Time Synchronous Gets Asynchronous Gets Average (MS)
---------- ---------------- ----------------- ------------
    358597         27946887            115393   .127786124
Synchronous gets are usually locking events, whereas asynchronous gets are usually caused by nonblocking inter-instance process activity.
You can obtain further information about global enqueue activity caused by statement parsing and execution from the V$LIBRARYCACHE
dynamic performance view. This view reports GES activity for locks, pins, and invalidations, as in this example:
SQL> li
  1  SELECT
  2    namespace AS "Namespace",
  3    dlm_lock_requests AS "Lock Requests",
  4    dlm_pin_requests AS "Pin Requests",
  5    dlm_pin_releases AS "Pin Releases",
  6    dlm_invalidation_requests AS "Invalidation Requests",
  7    dlm_invalidations AS "Invalidations"
  8* FROM v$librarycache
SQL> /

Namespace       Lock Requests Pin Requests Pin Releases Invalidation Requests
--------------- ------------- ------------ ------------ ---------------------
Invalidations
---------------
SQL AREA                    0            0            0                     0
              0
TABLE/PROCEDURE        640720      5551128      5551128                     0
              0
BODY                        2      1965888      1965888                     0
              0
TRIGGER                     0        70915        70915                     0
              0
INDEX                   14849        14849        14849                     0
              0
CLUSTER                  2479         2503         2503                     0
              0
If you see excessively high values in these columns, solutions include pinning packages in the shared pool using the KEEP
procedure in the DBMS_SHARED_POOL
package. Also, you might investigate methods of keeping cursors open either within the application or by using the SESSION_CACHED_CURSORS
parameter.
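As a sketch of these remedies, you might pin a heavily used package and enable session cursor caching; the package name TPCC.PAYMENT is hypothetical, and the value of 100 is only a starting point:

```sql
-- Pin a (hypothetical) package in the shared pool so it is not aged out
EXECUTE dbms_shared_pool.keep('TPCC.PAYMENT', 'P');

-- Cache up to 100 closed cursors per session; takes effect after restart
ALTER SYSTEM SET session_cached_cursors = 100 SCOPE=SPFILE;
```

If DBMS_SHARED_POOL is not present in your database, it can be created by running the dbmspool.sql script under $ORACLE_HOME/rdbms/admin as SYS.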
You can obtain additional information about global enqueue activity caused by statements pinning dictionary cache objects from the V$ROWCACHE
dynamic performance view. This view reports GES lock requests, conflicts, and releases, as in this example:
SQL> li
  1  SELECT
  2    parameter AS "Cache Name",
  3    dlm_requests AS "Requests",
  4    dlm_conflicts AS "Conflicts",
  5    dlm_releases AS "Releases"
  6* FROM v$rowcache
SQL> /

Cache Name                         Requests  Conflicts   Releases
-------------------------------- ---------- ---------- ----------
dc_rollback_segments                     99          0          0
dc_free_extents                           0          0          0
dc_used_extents                           0          0          0
dc_segments                            7335        469       4158
dc_tablespaces                           20          0          0
dc_tablespace_quotas                      0          0          0
dc_files                                  6          0          0
dc_users                                352          0        152
dc_objects                             8712        106       1177
dc_global_oids                          171          0         27
dc_constraints                         2135          0          0
dc_sequences                            733         43          5
dc_histogram_defs                     56671        103       1828
kqlsubheap_object                         0          0          0
dc_table_scns                           121          0          0
...
You should look for high values in the Conflicts
column. In the preceding example, there have been 43 conflicts for sequences. Uncached sequences are one of the most common causes of performance issues in RAC.
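A common remedy is to recreate or alter hot sequences with a larger cache; the sequence name and cache size below are purely illustrative:

```sql
-- Cache 1000 values per instance; NOORDER avoids cross-instance
-- coordination of the order in which values are returned
ALTER SEQUENCE tpcc.orders_seq CACHE 1000 NOORDER;
```

Be aware that cached sequence values not yet used are lost at instance shutdown, so this approach assumes your application tolerates gaps in the sequence.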
In Oracle 11g, most of the statistics related to lock conversions can be obtained from V$GES_CONVERT_LOCAL
and V$GES_CONVERT_REMOTE
. These views show the number of lock conversions on the local system and on remote systems, respectively:
SQL> li
  1  SELECT convert_type, average_convert_time, convert_count
  2* FROM v$ges_convert_local
SQL> /

CONVERT_TYPE     AVERAGE_CONVERT_TIME CONVERT_COUNT
---------------- -------------------- -------------
NULL -> SS                          0             0
NULL -> SX                          2             3
NULL -> S                           1         24064
NULL -> SSX                         5           627
NULL -> X                           3        166358
SS -> SX                            0             0
SS -> S                             0             0
SS -> SSX                           0             0
SS -> X                             0             0
SX -> S                             0             0
SX -> SSX                           0             0
SX -> X                             0             0
S -> SX                             0             0
S -> SSX                            0             0
S -> X                              1          1693
SSX -> X                            1            12
Lock conversions are essential to the efficient operation of a RAC database, and they are not necessarily harmful. However, it is important to check that lock conversions, like lock requests, are not being blocked by instances holding incompatible locks.
To check which instances are currently blocking other instances, use the following query:
SQL> li
  1  SELECT
  2    dl.inst_id,
  3    s.sid,
  4    p.spid,
  5    dl.resource_name1,
  6    decode (substr (dl.grant_level,1,8),
  7      'KJUSERNL','Null',
  8      'KJUSERCR','Row-S (SS)',
  9      'KJUSERCW','Row-X (SX)',
 10      'KJUSERPR','Share',
 11      'KJUSERPW','S/Row-X (SSX)',
 12      'KJUSEREX','Exclusive',
 13      request_level) as grant_level,
 14    decode(substr(dl.request_level,1,8),
 15      'KJUSERNL','Null',
 16      'KJUSERCR','Row-S (SS)',
 17      'KJUSERCW','Row-X (SX)',
 18      'KJUSERPR','Share',
 19      'KJUSERPW','S/Row-X (SSX)',
 20      'KJUSEREX','Exclusive',
 21      request_level) as request_level,
 22    decode(substr(dl.state,1,8),
 23      'KJUSERGR','Granted','KJUSEROP','Opening',
 24      'KJUSERCA','Cancelling',
 25      'KJUSERCV','Converting'
 26    ) as state,
 27    sw.event,
 28    sw.seconds_in_wait sec
 29  FROM
 30    gv$ges_enqueue dl,
 31    gv$process p,
 32    gv$session s,
 33    gv$session_wait sw
 34  WHERE blocker = 1
 35  AND (dl.inst_id = p.inst_id AND dl.pid = p.spid)
 36  AND (p.inst_id = s.inst_id AND p.addr = s.paddr)
 37  AND (s.inst_id = sw.inst_id AND s.sid = sw.sid)
 38* ORDER BY sw.seconds_in_wait DESC
SQL> /

   INST_ID        SID SPID                     RESOURCE_NAME1
---------- ---------- ------------------------ ------------------------------
GRANT_LEVEL   REQUEST_LEVEL STATE
------------- ------------- ----------
EVENT                                                                   SEC
---------------------------------------------------------------- ----------
         2        127 5520                     [0x19][0x2],[RS]
Exclusive     Exclusive
rdbms ipc message                                                         2

         1        131 28215                    [0x1c000a][0x18a12],[TX]
Exclusive     Exclusive
gc cr request                                                             0

         2        126 5496                     [0x110012][0x5659b],[TX]
Exclusive     Exclusive
ges remote message
To discover which sessions are currently being blocked, first find this line in the preceding query:
> WHERE blocker = 1
Once you locate the line, change it to this instead:
> WHERE blocked = 1
In addition to proactively monitoring your Oracle Database performance, you should also be aware of any faults or errors in the software environment that may assist you in diagnosing the underlying reason for performance issues. In an Oracle 11g RAC environment, the Oracle Database Fault Diagnosability Infrastructure provides a central point for such fault diagnosis. Within this infrastructure, we will focus on the Automatic Diagnostic Repository (ADR), Trace Files, and the Oracle Database Health Monitor.
The ADR provides a framework for diagnosing faults in the Oracle Database software environment, even when the Database itself is unavailable. For this reason, the ADR is maintained as a directory structure external to the Database. The root of this directory structure is known as the ADR base, and it is defined by the DIAGNOSTIC_DEST
init.ora parameter. If this parameter isn't set, then DIAGNOSTIC_DEST
defaults to the ORACLE_BASE
directory, as shown here:
SQL> show parameter diag

NAME                                 TYPE        VALUE
------------------------------------ ----------- --------------------------
diagnostic_dest                      string      /u01/app/oracle
All of the nodes in the cluster may share a single ADR base on a shared cluster file system; however, this cluster file system should be separate from the Oracle Database environment. Therefore, OCFS2 would be preferable to ACFS because the ADR must remain available even if the ASM instance is not. Alternatively, a separate ADR base can be maintained on each of the nodes in the cluster. Underneath the ADR base, each instance on a node has an ADR home. Typically, in a RAC environment based on non-shared storage, this means that there is an ADR home for the ASM instance and an ADR home for the Database instance. All ADR homes are located under the diag
directory, with rdbms
for the Database instance and asm
for the ASM instance. The actual ADR home is defined by the database and instance names. You can set the ADR home at the operating system level to a particular instance by setting the ADR_HOME environment variable. For example, this snippet sets the ADR home to a RAC database instance:
[oracle@london1 ~]$ export ADR_HOME=/u01/app/oracle/diag/rdbms/prod/PROD1
ADR homes share a common directory structure, and you can view the directory structure for a particular instance in the view V$DIAG_INFO
or for all instances in GV$DIAG_INFO
:
SQL> select name, value from v$diag_info;

NAME
----------------------------------------------------------------
VALUE
----------------------------------------------------------------
Diag Enabled
TRUE
ADR Base
/u01/app/oracle
ADR Home
/u01/app/oracle/diag/rdbms/prod/PROD1
Diag Trace
/u01/app/oracle/diag/rdbms/prod/PROD1/trace
Diag Alert
/u01/app/oracle/diag/rdbms/prod/PROD1/alert
Diag Incident
/u01/app/oracle/diag/rdbms/prod/PROD1/incident
Diag Cdump
/u01/app/oracle/diag/rdbms/prod/PROD1/cdump
Health Monitor
/u01/app/oracle/diag/rdbms/prod/PROD1/hm
Default Trace File
/u01/app/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_2404.trc
Active Problem Count
1
Active Incident Count
1
The ADR is managed and the collected information viewed with the adrci
command-line utility:
[oracle@london1 PROD1]$ adrci

ADRCI: Release 11.2.0.1.0 - Production on Wed Feb 17 09:40:32 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"
adrci>
If you have not set the ADR_HOME
environment variable, then the adrci
commands will report on all ADR homes underneath the ADR base directory. However, some commands do not support multiple ADR homes, as in this example:
adrci> show control;
DIA-48448: This command does not support multiple ADR homes
For commands that don't support multiple ADR homes, it is necessary to either set the ADR_HOME
environment variable externally or within adrci
using the set homepath
command. The show control
command details configuration information for the ADR environment. For example, the columns SHORTP_POLICY
and LONGP_POLICY
show the retention policies for incident files and incident metadata, which default to 720 hours (30 days) and 8,760 hours (365 days), respectively:
adrci> show control

ADR Home = /u01/app/oracle/diag/rdbms/prod/PROD1:
*************************************************************************
ADRID                SHORTP_POLICY        LONGP_POLICY
LAST_MOD_TIME                          LAST_AUTOPRG_TIME
LAST_MANUPRG_TIME                      ADRDIR_VERSION       ADRSCHM_VERSION
ADRSCHMV_SUMMARY     ADRALERT_VERSION     CREATE_TIME
-------------------- -------------------- --------------------
-------------------------------------- ----------------------------------------
-------------------------------------- -------------------- --------------------
------------------ -------------------- --------------------------------------
2978996791           720                  8760
2010-02-01 14:40:45.803438 +00:00      2010-02-15 02:08:41.049771 +00:00
                                       1                    2
76                   1                    2010-02-01 14:40:45.803438 +00:00
1 rows fetched
These retention policies can be user-modified for a particular environment.
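For example, the following adrci command halves the short-term retention policy to 360 hours for the current ADR home (the value is illustrative; MMON purges the affected files on its next maintenance cycle):

```
adrci> set control (SHORTP_POLICY = 360)
```

The LONGP_POLICY attribute can be adjusted with the same syntax.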
One interesting feature of adrci
is that it provides a central tool to view the database alert log. If the ADR Home is not set, you have the option of selecting which alert log you wish to view, as in this example:
adrci> show alert

Choose the alert log from the following homes to view:

1: diag/asm/+asm/+ASM1
2: diag/tnslsnr/london1/listener_scan2
3: diag/tnslsnr/london1/listener
4: diag/tnslsnr/london1/listener_scan1
5: diag/tnslsnr/london1/listener_scan3
6: diag/rdbms/prod/PROD1
Q: to quit
Alternatively, if the ADR Home is set, you will view the alert log for that instance. The show alert command also supports standard Linux-style options to tail the alert log for messages in real time. The following example shows the alert log revealing an issue with "Checkpoint not complete" messages that will impact performance (you can learn more about the reasons for this in Chapter 4):
adrci> show alert -tail -f
2010-02-16 22:00:39.917000 +00:00
Checkpoint not complete
  Current log# 2 seq# 1668 mem# 0: +DATA/prod/onlinelog/group_2.262.709828935
2010-02-16 22:00:42.928000 +00:00
Thread 1 advanced to log sequence 1669 (LGWR switch)
  Current log# 1 seq# 1669 mem# 0: +DATA/prod/onlinelog/group_1.261.709828935
2010-02-16 22:01:09.935000 +00:00
Thread 1 cannot allocate new log, sequence 1670
Checkpoint not complete
  Current log# 1 seq# 1669 mem# 0: +DATA/prod/onlinelog/group_1.261.709828935
2010-02-16 22:01:12.945000 +00:00
Thread 1 advanced to log sequence 1670 (LGWR switch)
  Current log# 2 seq# 1670 mem# 0: +DATA/prod/onlinelog/group_2.262.709828935
The alert log read by adrci
is in XML format. For this reason, there is also a text version of the alert log in the trace
directory that can be read with the standard Linux commands, as in this example:
/u01/app/oracle/diag/rdbms/prod/PROD1/trace/alert_PROD1.log
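Assuming this path, you can follow new messages or search for errors with the standard commands, as in this sketch:

```shell
# Follow new alert log messages as they are written
tail -f /u01/app/oracle/diag/rdbms/prod/PROD1/trace/alert_PROD1.log

# Search the text alert log for Oracle error messages
grep ORA- /u01/app/oracle/diag/rdbms/prod/PROD1/trace/alert_PROD1.log
```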
You may have also observed that the example information from V$DIAG_INFO
revealed an active incident. You can view this incident and the time that it was raised with the show incident
command, as in this example:
adrci> show incident

ADR Home = /u01/app/oracle/diag/rdbms/prod/PROD1:
***********************************************************************
INCIDENT_ID          PROBLEM_KEY          CREATE_TIME
-------------------- -------------------- ----------------------------------
27729                ORA 29770            2010-02-09 00:45:35.477000 +00:00
1 rows fetched
You can also use adrci
to create a package for this incident and create a zip file that includes all of the related information for it; this can be done either for your own benefit or to upload data to Oracle support:
adrci> ips create package incident 27729
Created package 1 based on incident id 27729, correlation level typical
adrci> ips generate package 1 in /tmp
cp: omitting directory `/u01/app/11.2.0/grid/log/london1/racg/racgmain'
cp: omitting directory `/u01/app/11.2.0/grid/log/london1/racg/racgeut'
cp: omitting directory `/u01/app/11.2.0/grid/log/london1/racg/racgevtf'
Generated package 1 in file /tmp/ORA29770_20100217094458_COM_1.zip, mode complete
In this next example, the incident trace file reveals that the incident is indeed cluster related. The file also provides a focus for investigating why performance was impacted at the point in time that the incident was raised:
ORA-29770: global enqueue process LMS0 (OSID 7419) is hung for more than 70 seconds

========= Dump for incident 27729 (ORA 29770) ========
----- Beginning of Customized Incident Dump(s) -----
==================================================
=== LMS0 (ospid: 7419) Heartbeat Report
==================================================
LMS0 (ospid: 7419) has no heartbeats for 100 sec. (threshold 70 sec)
  : Not in wait; last wait ended 64 secs ago.
  : last wait_id 1398345 at 'latch: gc element'.
==============================
Dumping PROCESS LMS0 (ospid: 7419) States
...
In addition to reactive incident response and monitoring, the ADR also provides a framework for proactively running verification checks against the database with a tool called Health Monitor. Health checks run a diagnosis to detect and report on faults such as datafile corruptions, and the list of checks can be viewed in V$HM_CHECK
:
  1* select name from v$hm_check
SQL> /

NAME
----------------------------------------------------------------
HM Test Check
DB Structure Integrity Check
CF Block Integrity Check
Data Block Integrity Check
Redo Integrity Check
Logical Block Check
Transaction Integrity Check
Undo Segment Integrity Check
No Mount CF Check
Mount CF Check
CF Member Check
All Datafiles Check
Single Datafile Check
Tablespace Check Check
Log Group Check
Log Group Member Check
Archived Log Check
Redo Revalidation Check
IO Revalidation Check
Block IO Revalidation Check
Txn Revalidation Check
Failure Simulation Check
Dictionary Integrity Check
ASM Mount Check
ASM Allocation Check
ASM Disk Visibility Check
ASM File Busy Check
Health checks can be run automatically in response to incidents. However, some health checks can also be run manually with the DBMS_HM package, as in this example:
SQL> exec dbms_hm.run_check('DB Structure Integrity Check');

PL/SQL procedure successfully completed.
Details of the health check can be viewed with adrci
:
adrci> show hm_run;
...
**********************************************************
HM RUN RECORD 3
**********************************************************
   RUN_ID                        61
   RUN_NAME                      HM_RUN_61
   CHECK_NAME                    DB Structure Integrity Check
   NAME_ID                       2
   MODE                          0
   START_TIME                    2010-02-17 10:54:29.982586 +00:00
   RESUME_TIME                   <NULL>
   END_TIME                      2010-02-17 10:54:29.985681 +00:00
   MODIFIED_TIME                 2010-02-17 10:54:29.985681 +00:00
   TIMEOUT                       0
   FLAGS                         0
   STATUS                        5
   SRC_INCIDENT_ID               0
   NUM_INCIDENTS                 0
   ERR_NUMBER                    0
   REPORT_FILE                   <NULL>
...
You can use adrci
to create a report and view the findings:
adrci> create report hm_run HM_RUN_61;
adrci> show report hm_run HM_RUN_61;
<?xml version="1.0" encoding="US-ASCII"?>
<HM-REPORT REPORT_ID="HM_RUN_61">
  <TITLE>HM Report: HM_RUN_61</TITLE>
  <RUN_INFO>
    <CHECK_NAME>DB Structure Integrity Check</CHECK_NAME>
    <RUN_ID>61</RUN_ID>
    <RUN_NAME>HM_RUN_61</RUN_NAME>
    <RUN_MODE>MANUAL</RUN_MODE>
    <RUN_STATUS>COMPLETED</RUN_STATUS>
    <RUN_ERROR_NUM>0</RUN_ERROR_NUM>
    <SOURCE_INCIDENT_ID>0</SOURCE_INCIDENT_ID>
    <NUM_INCIDENTS_CREATED>0</NUM_INCIDENTS_CREATED>
    <RUN_START_TIME>2010-02-17 10:54:29.982586 +00:00</RUN_START_TIME>
    <RUN_END_TIME>2010-02-17 10:54:29.985681 +00:00</RUN_END_TIME>
  </RUN_INFO>
  <RUN_PARAMETERS/>
  <RUN-FINDINGS/>
</HM-REPORT>
You can also view the generated report under the ADR hm directory, which in this example is /u01/app/oracle/diag/rdbms/prod/PROD1/hm
, where the file HMREPORT_HM_RUN_61.hm
was generated. This example didn't produce any findings, but if you do encounter findings in a run, their impact can also be seen in the V$HM_FINDING
view:
SQL> select description, damage_description from v$hm_finding where run_id = 62;
...
Datafile 4: '+DATA/prod/datafile/users.259.709828865' needs media recovery
Some objects in tablespace USERS might be unavailable
...
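You can also retrieve a run's report text directly in SQL*Plus with the DBMS_HM.GET_RUN_REPORT function, as in this sketch using the run name from the earlier example:

```sql
-- Allow the full CLOB report to be displayed
SET LONG 100000 PAGESIZE 0

-- Retrieve the Health Monitor report for a named run
SELECT dbms_hm.get_run_report('HM_RUN_61') FROM dual;
```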
An additional feature provided by the ADR is the ability to view and search the database trace files gathered in the trace directory under the ADR home:
/u01/app/oracle/diag/rdbms/prod/PROD1/trace
The command show tracefile
lists all of the trace files in this directory. adrci
also supports searching for strings within trace file names and showing trace files related to a particular incident:
adrci> show tracefile -I 27729
     diag/rdbms/prod/PROD1/incident/incdir_27729/PROD1_lmhb_7430_i27729.trc
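You can also filter by file name pattern; for example, this command lists any trace files whose names contain the string lms:

```
adrci> show tracefile %lms%
```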
Once a trace file has been found, its contents can also be viewed under adrci
, as in this example:
adrci> show trace PROD1_lmhb_7430_i27729.trc ...
In addition to viewing system-generated trace files, you can also generate trace files of your own applications. The DBMS_MONITOR
package provides a single API interface from which you can enable and disable trace for a specific session, module, action, or client identifier. In addition, DBMS_MONITOR
provides other functionality to enable more granular collection of statistics for specified modules, actions, and client identifiers.
To enable tracing in a specific session, you first need to identify the SID and, optionally, the serial number of the session. This information can be obtained from V$SESSION
:
SQL> SELECT sid, serial# FROM v$session WHERE username = 'TPCC';
Note that you may need to join V$SESSION
to other dynamic performance views, such as V$SQL
, to identify the session of interest.
To enable tracing for a specific session, use the SESSION_TRACE_ENABLE
procedure. For example, you can use the following line to enable tracing for a session with a SID of 164
:
SQL> EXECUTE dbms_monitor.session_trace_enable (session_id=>164);
This command will immediately enable tracing for the specified session; the result will be appended to the current trace file.
You can also include bind variable information in the same trace file:
SQL> EXECUTE dbms_monitor.session_trace_enable (session_id=>164,binds=>true);
By default, wait information is included in the trace file. You can specify that wait events should be traced using this snippet:
SQL> EXECUTE dbms_monitor.session_trace_enable (session_id=>164,waits=>true);
You can also include both bind and wait information in the same trace file. Tracing can be disabled for the same session with this snippet:
SQL> EXECUTE dbms_monitor.session_trace_disable (session_id=>164);
You can also enable tracing for all sessions connected to the database or instance with database_trace_enable
. Similarly, you can enable tracing for a particular service or module with serv_mod_act_trace_enable.
Note that enabling tracing for all sessions will generate a large amount of trace output very quickly in the ADR trace directory. Enabling tracing globally will also slow down the database, so it is advisable not to enable it at the instance level on a production system.
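As a sketch, the following enables and then disables tracing for sessions connected through a particular service and module; the names PROD and ORDERS are hypothetical:

```sql
-- Enable tracing, including wait events, for one service/module combination
EXECUTE dbms_monitor.serv_mod_act_trace_enable(service_name => 'PROD', module_name => 'ORDERS', waits => TRUE, binds => FALSE);

-- Disable it again for the same combination
EXECUTE dbms_monitor.serv_mod_act_trace_disable(service_name => 'PROD', module_name => 'ORDERS');
```

Service-level tracing is often preferable to instance-wide tracing in RAC because it captures all sessions of one workload across all instances without tracing everything else.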
Once you have generated your trace file, you can either view the contents directly or use the standard Oracle utility tkprof
to provide a summary of the statements processed by the session when tracing was enabled and the events on which the statements waited. The following excerpt generated by tkprof
illustrates how tracing permits you to view the activities of a session with finer granularity:
...
UPDATE DISTRICT SET D_YTD = D_YTD + :B3
WHERE
  D_W_ID = :B2 AND D_ID = :B1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute    391      0.17       0.77          0        868        464         391
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      391      0.17       0.77          0        868        464         391

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 85     (recursive depth: 1)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  ges message buffer allocation                 291        0.00          0.00
  gc current block 2-way                        196        0.00          0.07
  gc cr block busy                                7        0.00          0.00
  enq: TX - row lock contention                  71        0.02          0.53
  gc cr block 2-way                              25        0.00          0.00
  gc current block congested                      1        0.00          0.00
  gc current block busy                           8        0.00          0.00
  gc buffer busy acquire                          3        0.00          0.00
  KJC: Wait for msg sends to complete             2        0.00          0.00
  buffer busy waits                               1        0.00          0.00
******************************************************************************
...
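A report like the excerpt above can be produced with an invocation along these lines; the trace and output file names are illustrative, sort=exeela orders statements by elapsed execution time, and sys=no excludes recursive SYS statements:

```shell
tkprof PROD1_ora_2404.trc /tmp/PROD1_report.txt sort=exeela sys=no
```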
In this chapter, we described some of the tools and techniques available to you for monitoring the performance of your RAC cluster at the Oracle level. In particular, we looked at the tools integrated with the Oracle 11g Manageability Infrastructure.
In the next chapter, we'll look at what to do at the Linux level. For example, you'll learn about the most important of the built-in Linux commands and tools to help you diagnose server and interconnect performance problems.