Chapter 14. Parallel Execution

Parallel Execution is a feature of the Oracle Database that enables the division of certain database operations into multiple subtasks that execute simultaneously. The goal of this approach is to complete these tasks more quickly than if they were executed serially as a single task. Parallel Execution has the potential to improve performance and response times by harnessing more of a system's hardware resources (see Chapter 4 for more information on hardware resources) to complete the task at hand. In single-instance environments, Parallel Execution enables a single operation to utilize more of a system's CPU cores, memory, and I/O. However, Parallel Execution is particularly relevant to RAC because it enables operations to be processed beyond the resource capabilities of a single node, letting you leverage the CPU, memory, and I/O capability of the entire cluster (if desired) to complete a single operation.

The nature of Parallel Execution, where you utilize multiple resources to accomplish a single task, makes it applicable to resource-intensive operations. It is not applicable to simple DML operations, such as SELECT, INSERT, UPDATE, and DELETE statements that manipulate a limited number of rows, so Parallel Execution is not associated with regular DML operations in OLTP environments. In an OLTP RAC environment with a high number of concurrent users, full utilization of server resources is still the desired outcome; in this case, however, it is achieved by multiple users running statements concurrently across the cluster, all without requiring parallel operations.

That said, Parallel Execution may still be used in OLTP environments for more resource-intensive maintenance and data load operations. For example, Parallel Execution makes sense for operations such as CREATE TABLE AS SELECT, CREATE TABLE for an index organized table, and CREATE INDEX. It also makes sense for Parallel Data Pump operations, RMAN backup and recovery, and gathering statistics. You should be aware that not all of these operations may be able to fully utilize parallel operations across the entire cluster. For example, Parallel Data Pump jobs can only run on a single node in a cluster, regardless of the number of nodes in the cluster and even though the worker processes may run on multiple instances.

Parallel Execution is most recognized for bringing benefits to RAC environments in cases involving data warehousing or decision support systems (DSS). In these cases, Parallel Execution lets you efficiently process SELECT statements that retrieve a large number of rows using full table scans, index fast full scans, and partitioned index range scans. Used judiciously and configured correctly, Parallel Execution can result in orders-of-magnitude improvements in the time to complete these types of data warehouse query operations. Consequently, this chapter will focus on Parallel Query in RAC environments in particular. We will also cover the concepts behind Parallel Execution, configuring Parallel Execution, and the monitoring and tuning of such environments. Before doing so, it is important to note that Parallel Execution requires Oracle Enterprise Edition; it is not available in Oracle Standard Edition. Therefore, if operating in a Standard Edition environment, features such as Parallel Query, Parallel DML, Parallel index builds, and Parallel Backup and Recovery will not be available to you.

Parallel Execution Concepts

At the most basic design level, executing an operation in parallel requires a controlling process to initially determine whether the operation will complete more quickly in parallel, as opposed to serially. Once the controlling process forecasts that it will complete an operation more quickly in parallel, it will then determine how the tasks required for the operation are to be allocated between a number of processes. These processes then simultaneously execute the tasks allocated to them before returning their subset of results back to the controlling process. The controlling process then serially assembles the results gathered from all of the parallel processes allocated to the subtasks, before returning the result set back to the user.

From this basic analysis, we can make a number of initial observations regarding Parallel Execution. First, it is not necessarily the case that all operations will complete more quickly in parallel than they will serially. Even when operating in parallel, there are serial components that will limit the performance gains available from parallelizing an operation. This premise is long established in computing as Amdahl's Law; and although the level of impact varies between workloads, it is important to bear in mind when configuring Parallel Execution that allocating an increasing number of parallel subprocesses is unlikely to result in linear scalability. At some point, the allocation of subprocesses will reach an optimal level, beyond which more resources will be consumed but the operation as a whole will take longer to complete. A related point is that the optimal level for one parallel operation may not be the optimal level for a system on which multiple operations are all attempting to run in parallel. In other words, the resources consumed by adding parallel processes up to the point where an individual operation completes the quickest may be proportionally greater than the additional performance improvement warrants. Therefore, reducing the number of subprocesses to a level below where the operation would complete quickest may actually provide the most benefit to the system as a whole. This raises an important point: the correct sizing, configuration, and testing of the software, based on the administrator's knowledge of the available CPU, memory, and I/O resources, are vital to extracting the benefits of parallelism for the workload as a whole across the system, and Oracle Parallel Execution is no exception to this general rule.

Serial Execution

To understand Oracle Parallel Execution, we'll begin with an example involving serial execution. The following listing shows a formatted execution plan for a simple query that counts the number of rows in the LINEITEM table, which is executed serially:

SQL> explain plan for
  2  select count(*) from lineitem;

Explained.

SQL> select * from table(DBMS_XPLAN.DISPLAY);
Plan hash value: 2139482517

---------------------------------------------------------------------------
| Id  | Operation             | Name         | Rows  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |     1 |  3701   (1)| 00:00:45 |
|   1 |  SORT AGGREGATE       |              |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| I_L_ORDERKEY |  6002K|  3701   (1)| 00:00:45 |
---------------------------------------------------------------------------

In this example, the user's foreground process executes an INDEX FAST FULL SCAN, performs the COUNT aggregate function, and returns the result to the user.

Parallel Execution

For the same statement using Oracle Parallel Execution, the user's foreground process becomes the parallel controlling process, which is called the Parallel Execution Coordinator (PEC). For Parallel Queries, this controlling process is also known as the Query Coordinator (QC). The PEC initially determines whether to parallelize the operation.

Note

In some circumstances, the PEC will have been instructed to parallelize execution. For example, a DBA can configure an object or session so that Parallel Execution is used, or a developer can force Parallel Execution manually with a hint.

When operating in parallel, the PEC allocates a number of Parallel Execution Servers (PES). For queries, the Parallel Execution Servers are also sometimes known as Parallel Query Slaves (PQ Slaves). The Parallel Execution Servers are granted to the PEC from an available system-wide pool according to the Degree of Parallelism (DOP).

The DOP is calculated automatically by the Oracle Software, or it can be determined by the DBA or developer's configuration. Depending on the configuration, the Parallel Execution Servers may be allocated on the local node, or on one or more nodes across the entire cluster, where they utilize the available resources and distribute information across the private interconnect. The Parallel Execution Servers are implemented on the operating system by processes such as the following:

oracle   13955     1  0 16:49 ?        00:00:03 ora_p000_PROD1
oracle   13957     1  0 16:49 ?        00:00:02 ora_p001_PROD1
oracle   14232     1  4 16:55 ?        00:00:02 ora_p002_PROD1
oracle   14234     1  4 16:55 ?        00:00:02 ora_p003_PROD1
oracle   14236     1  4 16:55 ?        00:00:03 ora_p004_PROD1
oracle   14238     1  3 16:55 ?        00:00:02 ora_p005_PROD1
oracle   14240     1  4 16:55 ?        00:00:03 ora_p006_PROD1
oracle   14242     1  2 16:55 ?        00:00:01 ora_p007_PROD1
oracle   14244     1  2 16:55 ?        00:00:01 ora_p008_PROD1
oracle   14246     1  2 16:55 ?        00:00:01 ora_p009_PROD1
oracle   14248     1  2 16:55 ?        00:00:01 ora_p010_PROD1
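
You can also observe this pool from inside the database. As a simple sketch (the column list is illustrative), querying the V$PX_PROCESS view maps server names such as P000 to the operating system process IDs shown above:

SQL> select server_name, status, pid, spid from v$px_process;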

Prior to Oracle Database 10g Release 2, the Parallel Execution Coordinator would generate a SQL statement for each Parallel Execution Server—hence the term PQ Slave. In those earlier releases, each slave would parse and execute a cursor to retrieve its allocated subset of data from the database. In Oracle Database 10g Release 2 and later, the Parallel Execution Coordinator parses a single cursor that is shared by all of the participating Parallel Execution Servers. On the additional nodes in the cluster, the SQL is compiled by one Parallel Execution Server, and the cursor is shared with the other Parallel Execution Servers. The Parallel Execution Coordinator divides the workload into units termed granules that are either based on ranges of data blocks or on allocation of partitions for partitioned objects. The granules are calculated dynamically and assigned to the Parallel Execution Servers. The Parallel Execution Servers then act as Dataflow operators for the portion of the global Parallel Execution Plan assigned to them. When a task is complete, the rows are returned to the Parallel Execution Coordinator, and another granule is assigned until all granules have been consumed.

In the next example, the INDEX FAST FULL SCAN is divided into granules that are processed by the Parallel Execution Servers, which each apply the COUNT aggregate function to their own results; the partial results are then returned to the Parallel Execution Coordinator, which combines them as if the INDEX FAST FULL SCAN had been completed serially. This method of allocation ensures that if a particular Parallel Execution Server has not completed its workload as rapidly as another, it will be assigned a lower number of granules, which prevents one server from unduly slowing the response time for the entire operation. It also introduces the concept of Table Queues as the method by which data is distributed between processes. When a Parallel Execution Server has completed all of its assigned workload, it returns to the pool to be reallocated. Figure 14-1 shows an example of this, depicting a simple Parallel Query operation with a DOP of 3.

Figure 14-1. A parallel query operation with three degrees of parallelism

The following formatted execution plan shows exactly the same query you saw executed serially earlier in this chapter, except this time it is executed in parallel:

--------------------------------------------------------------------------------
| Id  | Operation                 | Name         |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |        |      |            |
|   1 |  SORT AGGREGATE           |              |        |      |            |
|   2 |   PX COORDINATOR          |              |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000     |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |              |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |              |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| I_L_ORDERKEY |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------
Note
-----
   - automatic DOP: Computed Degree of Parallelism is 2

In this parallel example, there is a single Dataflow operation that satisfies the query. The Dataflow operation is itself satisfied by two Dataflow operators, which in this case are shown as the PX BLOCK ITERATOR. The PX BLOCK ITERATOR, or granule iterator, operates upon granules allocated as block ranges, performing an INDEX FAST FULL SCAN on its allocated range of the I_L_ORDERKEY index. The column values PCWP (Parallel Combined with Parent) and PCWC (Parallel Combined with Child) illustrate that the output from these stages is communicated within the same Parallel Execution Server and not between multiple processes. For these steps, the PX BLOCK ITERATOR is the parent, and the INDEX FAST FULL SCAN is the child. The aggregate function is performed by the Parallel Execution Servers, and the results are sent randomly, shown as QC (RAND), to the Parallel Execution Coordinator process as a Parallel to Serial (P->S) operation. There is a single Table Queue, TQ1,00, that provides the communication between the Parallel Execution Servers and the Parallel Execution Coordinator. The Parallel Execution Coordinator performs the final COUNT aggregation on the results, and then returns the result of the SELECT statement to the user.

Producers and Consumers

So far, we have covered a simple example of Parallel Execution. More complex examples introduce the concept of data redistribution. For example, if the query contains a JOIN between two tables, a Parallel Execution Server operating on a block range of one table does not have access to either the rest of that table or, more importantly, the data retrieved from the other table in the JOIN.

To complete the JOIN, the information generated by the Parallel Execution Servers needs to be brought together in a single process. From a design perspective, the simplest approach is to send all of the information to the Parallel Execution Coordinator, where this single process gathers all of the data retrieved in parallel and performs the JOIN operation on it. Remember, however, that this two-table JOIN is a simple example; a real query may require multiple such operations to be processed serially. This problem illustrates the factors previously identified in Amdahl's Law: the time required to redistribute all of the required data back to a single process, and then to process the operation serially in the Parallel Execution Coordinator, would often negate the gains achieved in the portions of the operation that can run in parallel. Consequently, for queries containing operations such as JOINs or SORTs, Oracle Parallel Execution deploys Producer and Consumer Parallel Execution Servers. The producer continues to serve the role of the granule iterator, processing its allocated range; however, the retrieved rows are then distributed to another set of Parallel Execution Servers, each of which receives all of the rows required to complete its operation. The number of Parallel Execution Servers deployed in such a scenario is double the Degree of Parallelism, as illustrated in Figure 14-2.

Figure 14-2. How Parallel Execution Servers communicate

Figure 14-2 shows that all of the Parallel Execution Servers establish a channel to all of the other Parallel Execution Servers. Within a RAC environment, these servers create four message buffers per channel and communicate asynchronously by passing the message buffers between them.

Let's look at a sample query that can be satisfied using Producer and Consumer Parallel Execution Servers. The following parts/supplier relationship query (Q16) from the TPC-H specification returns the number of suppliers that have not received complaints and that can supply parts with certain attributes:

select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#24'
    and p_type not like 'MEDIUM BURNISHED%'
    and p_size in (8, 27, 42, 9, 4, 13, 25, 19)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size;

The following formatted execution plan shows how this query is executed in parallel:

------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                 |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                      |        |      |            |
|   1 |  PX COORDINATOR                     |                      |        |      |            |
|   2 |   PX SEND QC (ORDER)                | :TQ10005             |  Q1,05 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY                    |                      |  Q1,05 | PCWP |            |
|   4 |     PX RECEIVE                      |                      |  Q1,05 | PCWP |            |
|   5 |      PX SEND RANGE                  | :TQ10004             |  Q1,04 | P->P | RANGE      |
|   6 |       HASH GROUP BY                 |                      |  Q1,04 | PCWP |            |
|   7 |        PX RECEIVE                   |                      |  Q1,04 | PCWP |            |
|   8 |         PX SEND HASH                | :TQ10003             |  Q1,03 | P->P | HASH       |
|   9 |          HASH GROUP BY              |                      |  Q1,03 | PCWP |            |
|  10 |           VIEW                      | VM_NWVW_1            |  Q1,03 | PCWP |            |
|  11 |            HASH GROUP BY            |                      |  Q1,03 | PCWP |            |
|  12 |             PX RECEIVE              |                      |  Q1,03 | PCWP |            |
|  13 |              PX SEND HASH           | :TQ10002             |  Q1,02 | P->P | HASH       |
|* 14 |               HASH JOIN RIGHT ANTI  |                      |  Q1,02 | PCWP |            |
|  15 |                PX RECEIVE           |                      |  Q1,02 | PCWP |            |
|  16 |                 PX SEND BROADCAST   | :TQ10000             |  Q1,00 | P->P | BROADCAST  |
|  17 |                  PX BLOCK ITERATOR  |                      |  Q1,00 | PCWC |            |
|* 18 |                   TABLE ACCESS FULL | SUPPLIER             |  Q1,00 | PCWP |            |
|* 19 |                HASH JOIN            |                      |  Q1,02 | PCWP |            |
|  20 |                 PX RECEIVE          |                      |  Q1,02 | PCWP |            |
|  21 |                  PX SEND BROADCAST  | :TQ10001             |  Q1,01 | P->P | BROADCAST  |
|  22 |                   PX BLOCK ITERATOR |                      |  Q1,01 | PCWC |            |
|* 23 |                    TABLE ACCESS FULL| PART                 |  Q1,01 | PCWP |            |
|  24 |                 PX BLOCK ITERATOR   |                      |  Q1,02 | PCWC |            |
|  25 |                  INDEX FAST FULL SCAN| I_PS_PARTKEY_SUPPKEY |  Q1,02 | PCWP |           |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  14 - access("PS_SUPPKEY"="S_SUPPKEY")
  18 - filter("S_COMMENT" LIKE '%Customer%Complaints%')
  19 - access("P_PARTKEY"="PS_PARTKEY")
  23 - filter(("P_SIZE"=4 OR "P_SIZE"=8 OR "P_SIZE"=9 OR "P_SIZE"=13 OR "P_SIZE"=19 OR "P_SIZE"=25 OR "P_SIZE"=27 OR "P_SIZE"=42) AND
                      "P_BRAND"<>'Brand#24' AND "P_TYPE" NOT LIKE 'MEDIUM BURNISHED%')
Note
-----
   - automatic DOP: Computed Degree of Parallelism is 2

Running the query behind the preceding execution plan produces output such as the following:

SQL> /
P_BRAND    P_TYPE                        P_SIZE SUPPLIER_CNT
---------- ------------------------- ---------- ------------
Brand#54   STANDARD ANODIZED TIN              9           28
Brand#11   STANDARD PLATED TIN               25           24
Brand#12   PROMO ANODIZED COPPER             27           24
Brand#12   STANDARD BURNISHED COPPER         25           24
Brand#14   LARGE ANODIZED NICKEL             19           24
Brand#14   PROMO ANODIZED COPPER             19           24
Brand#21   LARGE ANODIZED COPPER             13           24
Brand#21   PROMO BURNISHED TIN                4           24
...

In this example, there are now six Table Queues, ranging from TQ1,00 to TQ1,05, that redistribute information between the Parallel Execution Servers and the Parallel Execution Coordinator. An individual operation or row source is identified under a particular Table Queue number. PX SEND and PX RECEIVE identify the distribution of rows between processes, while P->P under the IN-OUT column shows that the output is redistributed to another set of Parallel Execution Servers. The mapping of these operations to the operating system processes can be viewed in the V$PQ_TQSTAT view, which must be queried in the session where the Parallel Query was run and returns information only for the last query executed. The following output shows the previous execution plan in action:

SQL> select DFO_NUMBER, TQ_ID, SERVER_TYPE, PROCESS, INSTANCE
  2  from V$PQ_TQSTAT order by TQ_ID;

DFO_NUMBER      TQ_ID SERVER_TYP PROCESS      INSTANCE
---------- ---------- ---------- ---------- ----------
         1          0 Producer   P000                1
         1          0 Producer   P001                1
         1          0 Consumer   P002                1
         1          0 Consumer   P003                1
         1          1 Producer   P000                1
         1          1 Producer   P001                1
         1          1 Consumer   P002                1
         1          1 Consumer   P003                1
         1          2 Consumer   P001                1
         1          2 Producer   P003                1
         1          2 Consumer   P000                1
         1          2 Producer   P002                1
         1          3 Producer   P001                1
         1          3 Producer   P000                1
         1          3 Consumer   P002                1
         1          3 Consumer   P003                1
         1          4 Ranger     QC                  1
         1          4 Consumer   P000                1
         1          4 Producer   P003                1
         1          4 Producer   P002                1
         1          4 Consumer   P001                1
         1          5 Producer   P001                1
         1          5 Producer   P000                1
         1          5 Consumer   QC                  1

24 rows selected.

The preceding output illustrates, from a conceptual standpoint, the distinct roles of producer and consumer processes. In this case, there are four processes, ranging from P000 to P003, in line with the Degree of Parallelism of 2. In terms of implementation, it is important to note that a single Parallel Execution Server process can be both a producer and a consumer for multiple operations in the Parallel Execution Plan. In this example, the Producer Parallel Execution Servers continue to serve the role of granule iterators against their allocated block ranges. Full table scans (TABLE ACCESS FULL) are performed against the PART and SUPPLIER tables, and an INDEX FAST FULL SCAN is performed against the I_PS_PARTKEY_SUPPKEY index on the PARTSUPP table. The results of both full table scans are BROADCAST to all Parallel Execution Servers, which means that the results of the index scan against the comparatively larger PARTSUPP table do not need to be redistributed.

The operations that consume these rows are identified by Table Queue TQ1,02, which also carries the rows produced by the index scan. From the output of V$PQ_TQSTAT, it can be observed that the operating system processes P002 and P003 are the producers for this operation, as well as the consumers for the TQ1,00 and TQ1,01 Table Queue operations. These Parallel Execution Servers then perform the JOIN of the PART and PARTSUPP rows and the ANTI JOIN for the NOT IN operation on the SUPPLIER rows, before redistributing the results to another set of Parallel Execution Servers; in this example, these are the operating system processes P000 and P001. This form of redistribution continues for Table Queues TQ1,03 and TQ1,04, in which the process pairs P000/P001 and P002/P003 alternate between producer and consumer roles, processing the GROUP BY operations in parallel. For Table Queue TQ1,04, the Parallel Execution Coordinator acts as the Ranger to determine which ranges of data the producers will send to the consumers at TQ1,05. This enables the Parallel Execution Servers to sort the rows in parallel for the ORDER BY operation. The results are returned to the Parallel Execution Coordinator serially, already in the sort order identified by QC (ORDER). Finally, the Parallel Execution Coordinator returns the combined results shown previously to the user.

The execution plan shows the estimated number of rows and bytes for each operation, and the V$PQ_TQSTAT view can also be used to show the actual number of rows and bytes redistributed between processes after the statement has executed. The following RAC example shows the same statement executed with a Degree of Parallelism of 14, utilizing 14 Parallel Execution Servers per instance and distributing the producers and consumers across the cluster. Consequently, some of this data is transferred across the private interconnect, as opposed to using interprocess communication (IPC) within the same instance:

SQL> select DFO_NUMBER, TQ_ID, SERVER_TYPE, NUM_ROWS, BYTES,
PROCESS, INSTANCE from v$PQ_TQSTAT where TQ_ID = 2;

DFO_NUMBER TQ_ID SERVER_TYP   NUM_ROWS      BYTES PROCESS      INSTANCE
---------- ----- ---------- ---------- ---------- ---------- ----------
    1          2 Consumer      5716868   73913206 P011                1
    1          2 Producer      6364469   82530761 P005                2
    1          2 Consumer      5717580   73921160 P012                1
    1          2 Consumer      5709740   73819091 P008                1
    1          2 Consumer      5705592   73766961 P009                1
    1          2 Consumer      5710244   73823988 P011                2
    1          2 Consumer      5709316   73815808 P007                2
    1          2 Consumer      5711992   73847887 P012                2
    1          2 Consumer      5718760   73936753 P008                2
    1          2 Consumer      5715084   73888517 P010                2
    1          2 Consumer      5721996   73978570 P013                1
    1          2 Producer      5117198   65998449 P003                1
    1          2 Producer      5032410   65301516 P006                1
    1          2 Producer      5473409   70611392 P001                1
    1          2 Producer      5065132   65752161 P005                1
    1          2 Producer      5210951   67171073 P004                1
    1          2 Producer      5194921   67380186 P000                1
    1          2 Producer      6342539   81829197 P001                2
    1          2 Consumer      5713532   73868349 P013                2
    1          2 Consumer      5719812   73953487 P009                2
    1          2 Producer      4656192   60328063 P002                1
    1          2 Producer      6440369   83128877 P006                2
    1          2 Producer      6375671   82229038 P003                2
    1          2 Producer      5944988   76978368 P002                2
    1          2 Consumer      5721952   73977523 P010                1
    1          2 Producer      6341648   81927190 P004                2
    1          2 Producer      6440103   83137013 P000                2
    1          2 Consumer      5707532   73791984 P007                1

Bloom Filters

Additionally—and of particular importance to RAC environments—you may also observe the use of Bloom Filters. You can identify such a filter in an execution plan easily; it takes the form of :BF0000. The step that creates and uses the first Bloom Filter in the plan looks like this:

------------------------------------------------------------------------------------
| Id  | Operation                  | Name                 |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------
...
|* 19 | HASH JOIN                  |                      |  Q1,03 | PCWP |            |
|  20 |  JOIN FILTER CREATE        | :BF0000              |  Q1,03 | PCWP |            |
|  21 |   PX RECEIVE               |                      |  Q1,03 | PCWP |            |
|  22 |    PX SEND HASH            | :TQ10001             |  Q1,01 | P->P | HASH       |
|  23 |     PX BLOCK ITERATOR      |                      |  Q1,01 | PCWC |            |
|* 24 |      TABLE ACCESS FULL     | PART                 |  Q1,01 | PCWP |            |
|  25 |  PX RECEIVE                |                      |  Q1,03 | PCWP |            |
|  26 |   PX SEND HASH             | :TQ10002             |  Q1,02 | P->P | HASH       |
|  27 |    JOIN FILTER USE         | :BF0000              |  Q1,02 | PCWP |            |
|  28 |     PX BLOCK ITERATOR      |                      |  Q1,02 | PCWC |            |
|  29 |      INDEX FAST FULL SCAN  | I_PS_PARTKEY_SUPPKEY |  Q1,02 | PCWP |            |
------------------------------------------------------------------------------------

A Bloom Filter implements an algorithm to test for membership in a set that requires considerably less space than a one-to-one mapping detailing all of the members of the set. The trade-off is that the Bloom Filter is probabilistic: when testing for set membership, an element is either probably a member of the set or definitely not a member of the set. A probable result does not confirm membership; if a test reports that an element is probably a member of the set when it is not, that result is termed a false positive. The implementation consists of two aspects. The first aspect involves constructing the Bloom Filter by adding members to a set. The second aspect involves querying whether elements are members of that set. In the execution plan, these two aspects are represented by JOIN FILTER CREATE and JOIN FILTER USE, respectively.

To add an element to the set, a number of hash values are calculated for that element, and the bits in an array are set so they correspond to the computed positions. To test whether an element is a member of a set, the same hash values are calculated for the element. If all the corresponding bits are set, then it is probable that the element is a member of the set. The word probable is used because you cannot rule out the possibility that all of the bits were set for other elements. However, if one of the corresponding bits is not set, then it is guaranteed that the element is not a set member; otherwise, the bits would have been set when the element was added at the point the filter was created.
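
As a point of reference from general Bloom Filter theory (this is not an Oracle-specific formula), the commonly quoted approximation for the false positive probability of a filter with m bits, k hash functions, and n inserted elements is:

p ≈ (1 - e^(-kn/m))^k

This is why the structure can remain so compact: the false positive rate is tuned by sizing m and k relative to n, without storing the elements themselves.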

The benefit of the Bloom Filter is that it makes it possible to determine a probable mapping of values without requiring all of the values to be gathered in a single location. This is exactly what occurred in the earlier execution plan where the Bloom Filter was not used: all of the rows from the full table scan of the PART table were BROADCAST in order to JOIN with the rows of the PARTSUPP table, which contains a comparatively greater number of rows. In the Bloom Filter plan, at Table Queue TQ1,03, the results of the full table scan on the PART table are received, and the elements in the Bloom Filter are set based on the values of P_PARTKEY. This corresponds to the Predicate Information, as follows:

19 - access("P_PARTKEY"="PS_PARTKEY")

The operation at Table Queue Q1,02 uses this Bloom Filter to test the values of PS_PARTKEY from the rows retrieved from the PARTSUPP table. The Bloom Filter makes it possible to determine the rows where it is probable that "P_PARTKEY"="PS_PARTKEY"; it also makes it possible to reject rows where that is not probable without requiring that all of the rows retrieved be BROADCAST. Only the rows that are marked as probable are sent through the PX SEND HASH operation, and the HASH JOIN takes place on these rows. This enables the rejection of any false positives at this stage. Using the Bloom Filter enables you to significantly reduce the number of rows redistributed between Parallel Execution Servers. In a RAC environment, this translates to a reduction in the interconnect traffic required for Parallel Execution.

By default, Bloom Filters are enabled in Oracle Database 11g Release 2, and the hidden (underscore) parameter _bloom_filter_enabled is set to TRUE. The optimizer determines when to use Bloom Filters; such filters are more likely to be used for larger data sets where redistributing a large number of rows would be prohibitive.

The view V$SQL_JOIN_FILTER displays information about active filters. Contention is also reported within AWR reports, as in this example:

Enqueue Type (Request Reason)
------------------------------------------------------------------------------
    Requests    Succ Gets Failed Gets       Waits  Wt Time (s) Av Wt Time(ms)
------------ ------------ ----------- ----------- ------------ --------------
BF-BLOOM FILTER (allocation contention)
       1,098          757         341         274            2           8.32
...
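
While a parallel join is active, you can also query the filter activity directly; a sketch such as the following (the column selection is illustrative) shows how many rows each active filter has probed and how many it has filtered out:

SQL> select qc_session_id, active, probed, filtered
  2  from v$sql_join_filter;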

Bloom Filters can be disabled by setting _bloom_filter_enabled to FALSE. Take care, however, because that is a hidden parameter that you are not supposed to set without guidance from Oracle support.
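
If Oracle Support does direct you to test with the filter disabled, the setting can be made at the session level; note that underscore parameters must be enclosed in double quotes:

SQL> alter session set "_bloom_filter_enabled" = false;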

Partitioning

So far we have examined data retrieved based on allocated ranges of block-based granules, data redistributed between producer and consumer processes, and a technique to reduce the amount of data redistributed with Bloom Filters. An additional technique that can reduce the amount of data to be redistributed is Partitioning. For some operations, this technique can be used to eliminate data redistribution completely, which makes it particularly suited for scaling Parallel Execution in RAC environments.

In its simplest form, Partitioning can be used for partition pruning, where the optimizer is able to determine, according to the values in the predicate list, that a number of partitions need not be accessed to satisfy the query. Although simple to use, partition pruning is a highly effective technique that can assist the optimizer in reducing the number of rows retrieved.
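
For example, against the hash-partitioned PARTSUPP table created later in this section, a query that supplies the partitioning key in its predicate, such as the following sketch (the key value is illustrative), allows the optimizer to access a single partition, which is visible in the Pstart and Pstop columns of the execution plan:

SQL> select count(*) from partsupp where ps_partkey = 1000;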

You can also use partitioning to perform Partition-Wise joins, where a single parallel join operation can be completed by a number of smaller join operations that operate against individual partitions. The premise is similar to the one behind partition pruning: when the data is partitioned, the optimizer is able to identify the partitions that correspond to particular values. Consequently, for a Partition-Wise join, both objects in the join must be equipartitioned on the join key. To be equipartitioned, both objects must use either range or hash partitioning, and both must use the same method; they must also have the same number of partitions and be partitioned on the same columns. While the objects must share the same partitioning scheme, they do not have to be the same type of object: two tables can obviously be equipartitioned, but so can an index and a table. When the two objects are equipartitioned, the join of a pair of equivalent partitions can be processed by a single Parallel Execution Server. This means that, even though the join operation takes place in parallel, it is not necessary to redistribute any data between processes to complete it. Without partitioning, you have seen that the allocation units for parallel processing are block-based granules. With partitioning, the units are partition-based granules, where the minimum allocation unit is one pair of partitions per Parallel Execution Server. Thus, the Degree of Parallelism that can be used is limited to the number of partitions; by definition, the number of partitions is always the same for both objects when using equipartitioning.

The previous example includes a JOIN operation between the PARTSUPP and PART tables, where P_PARTKEY is equal to PS_PARTKEY. The PARTSUPP rows are also accessed through the I_PS_PARTKEY_SUPPKEY index. This means we can use the following approach to equipartition the objects in question; for our example DOP of 2, the PART table can be created as follows:

CREATE TABLE PART(
P_PARTKEY NUMBER NOT NULL,
P_TYPE VARCHAR(25),
P_SIZE NUMBER,
P_BRAND CHAR(10),
P_NAME VARCHAR(55),
P_CONTAINER CHAR(10),
P_MFGR CHAR(25),
P_RETAILPRICE NUMBER,
P_COMMENT VARCHAR(23)
)
PARALLEL
PARTITION BY HASH (P_PARTKEY)
PARTITIONS 2;

Similarly, we can use the following approach to equipartition the PARTSUPP table:

CREATE TABLE PARTSUPP(
PS_PARTKEY NUMBER NOT NULL,
PS_SUPPKEY NUMBER NOT NULL,
PS_SUPPLYCOST NUMBER NOT NULL,
PS_AVAILQTY NUMBER,
PS_COMMENT VARCHAR(199)
)
PARALLEL
PARTITION BY HASH(PS_PARTKEY)
PARTITIONS 2;

We can also partition the I_PS_PARTKEY_SUPPKEY index:

CREATE UNIQUE INDEX I_PS_PARTKEY_SUPPKEY
ON PARTSUPP (PS_PARTKEY,PS_SUPPKEY) GLOBAL PARTITION BY HASH (PS_PARTKEY)
PARTITIONS 2;
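
To confirm that the two tables are equipartitioned, you can compare their partitioning metadata; the following sketch against the standard dictionary view should report HASH partitioning with the same partition count for both:

SQL> select table_name, partitioning_type, partition_count
  2  from user_part_tables
  3  where table_name in ('PART', 'PARTSUPP');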

With the preceding partitioning schema, the optimizer may now choose an execution plan, such as the following for the PART and PARTSUPP JOIN operation:

------------------------------------------------------------------------------------------
| Id  | Operation              | Name                 | Pstart| Pstop |    TQ  |IN-OUT|
------------------------------------------------------------------------------------------
...
|  20 | PX PARTITION HASH ALL  |                      |     1 |     2 |  Q1,01 | PCWC |
|* 21 |  HASH JOIN             |                      |       |       |  Q1,01 | PCWP |
|* 22 |   TABLE ACCESS FULL    | PART                 |     1 |     2 |  Q1,01 | PCWP |
|  23 |   INDEX FAST FULL SCAN | I_PS_PARTKEY_SUPPKEY |     1 |     2 |  Q1,01 | PCWP |
------------------------------------------------------------------------------------------

You can now observe that a partition iterator is used instead of a block range iterator; you can identify it in the plan by the additional Pstart and Pstop headings. The fact that no partition pruning is performed can be identified by all of the partitions being iterated over; in this example, that is the two partitions for both objects. It is also possible to see that the JOIN operation takes place in parallel without any data redistribution having taken place, confirming the importance of partitioning for limiting interinstance traffic in Parallel Execution environments.

It is not possible to force the optimizer to use partition-based granules. Even against a partitioned schema, the optimizer may choose to use data redistribution if it determines that using a higher Degree of Parallelism than the partitioning permits would resolve the query faster. Additionally, the optimizer may select an execution plan that utilizes a combination of the two approaches. Where a partition-wise join is available but a higher Degree of Parallelism may be faster, the optimizer may use a Parallel Execution Server mapper against the objects. This mapper uses the partitioning scheme to constrain the redistribution of data to local-only redistribution, minimizing the traffic across the interconnect.

Parallel Execution Configuration

In addition to schema-level features such as Partitioning, you have a number of configuration options at the instance and cluster level that you can use to implement and modify the behavior of Parallel Execution.

Next, you will learn about the configuration options with the relevant Parallel Execution parameters, and then look at how to configure your environment to match both your requirements and available hardware resources.

cluster_interconnects

We discussed setting the cluster_interconnects parameter in Chapter 4 in the context of hardware. Specifically, we discussed how this parameter enables the use of multiple private network interconnects to increase bandwidth availability for interconnect traffic.

This parameter can also enable you to increase the network capacity for interinstance messaging in a Parallel Execution environment. However, it is important to reiterate that using multiple interconnects in this way will result in the failure of the entire interconnect if any one network fails, which means it is not a high availability option. Consequently, we do not recommend setting cluster_interconnects. Instead, we recommend meeting demands for increased interconnect bandwidth with an appropriate hardware solution, such as 10 Gigabit Ethernet.

db_block_size, db_cache_size, and db_file_multiblock_read_count

As discussed in Chapter 12, using a smaller block size may be a solution for reducing interinstance traffic in transactional environments where multiple nodes are accessing the same rows in a single block. In a Parallel Execution environment, however, the emphasis is on high volumes of read activity. For example, there is no contention for the same rows for inserts, updates, or deletes. Neither is there contention for reading the same blocks because separate block ranges are allocated to separate Parallel Execution servers.

The emphasis on read activity means that data warehouse environments can benefit from using a larger block size for both table and index data. The block size can be set database-wide for a dedicated data warehouse database, or it can be set on a tablespace basis by following a couple of steps. First, you create a buffer cache for a specific block size with a parameter such as DB_16K_CACHE_SIZE or DB_32K_CACHE_SIZE. Second, you set the corresponding block size by specifying BLOCKSIZE in the CREATE TABLESPACE statement. With an emphasis on reads and no additional insert activity, you should also set the value of PCTFREE for tables to a low value. This ensures that as many rows as possible are stored in each block, minimizing read activity by storing a given number of rows in fewer blocks. The amount of data read in a single read operation is based upon the block size and the db_file_multiblock_read_count parameter. db_file_multiblock_read_count is calculated automatically, and it should not be set manually. However, viewing the parameter can reveal the calculated value, as in this example:

SQL> show parameter db_file_multiblock_read_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----------------------------
db_file_multiblock_read_count        integer     128

The parameter db_file_multiblock_read_count may be calculated to different values on the nodes in the cluster if different workload characteristics are running on those nodes.
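
Returning to the tablespace-level block size configuration described above, the following sketch illustrates the two steps; the cache size, tablespace name, and datafile specification are illustrative values only:

SQL> alter system set db_16k_cache_size = 256M;
SQL> create tablespace dwh_data datafile '+DATA' size 10G blocksize 16K;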

Now take a look at the db_cache_size parameter for sizing the buffer cache. For this purpose, the buffer cache should be taken to include any additional block size caches, as well as the buffer cache component managed by Automatic Shared Memory Management, in order to distinguish the memory allocated to the SGA from that allocated to the PGA. Automatic Memory Management must be considered separately, because it can reallocate memory between the SGA and PGA automatically.

In a Parallel Execution environment, rows are typically read with a direct path read event, as opposed to the db file scattered read event usually associated with a full table scan operation. We discuss the importance of distinguishing between the two events in the context of Parallel Execution performance later in this chapter. In the context of sizing the buffer cache, however, it is important to note that, when using a direct path read, the retrieved data blocks are not buffered in the buffer cache component of the SGA. Instead, they are read directly into the user session's memory, which is allocated as PGA memory. Consequently, memory allocation for a Parallel Execution environment is focused more on the PGA than on the buffer cache component of the SGA. The notable exception to this is In-Memory Parallel Execution, which was introduced with Oracle Database 11g Release 2. When it is enabled, Oracle can determine that some of the data to be scanned is, or should be, cached in the buffer cache and read it from there rather than from disk. In-Memory Parallel Execution is enabled with the parallel_degree_policy parameter, which you'll learn more about later in this section. However, if you are choosing to use In-Memory Parallel Execution, it is likely that the decision will be made at the hardware sizing phase. This will enable you to specify a clustered environment with the appropriate levels of RAM and memory bandwidth for a dedicated buffered-memory data warehouse environment.

instance_groups and parallel_instance_group

The instance_groups and parallel_instance_group parameters were of particular relevance prior to Oracle 11g Release 1 because, before this release, Parallel Execution was not services-aware. In other words, the Parallel Execution Coordinator would be allocated to a node according to a defined service, but the Parallel Execution Servers could not be distributed across the other available nodes in the cluster according to that service. The solution to this problem was to use the INSTANCE_GROUPS parameter to assign instances to individual groups. The PARALLEL_INSTANCE_GROUP parameter could then be specified, for example with an ALTER SESSION command, to determine the instances within the cluster on which the Parallel Execution Servers for an individual session could run.

Parallel Execution has been services-aware since Oracle 11g, and we recommend using service configuration to distribute Parallel Execution Servers across the cluster (see Chapter 11 for more information on this topic). Although instance_groups and parallel_instance_group are available and continue to provide the same functionality as in earlier releases, these parameters were deprecated in Oracle Database 11g, so a services configuration should be used instead.
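
For example, a service for Parallel Execution spanning both nodes of a two-node cluster might be created as follows (the database, service, and instance names are illustrative):

[oracle@london1 ~]$ srvctl add service -d PROD -s DSS -r PROD1,PROD2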

When tracing Parallel Execution, it is necessary to use the trcsess utility to combine all of the trace files from the Parallel Execution Servers before processing them with a utility such as tkprof. Once you have defined a service under which you want to run Parallel Execution, you can then specify that service to extract and combine the relevant trace data, as in this example:

[oracle@london1 ~]$ trcsess output="query16.trc" service=DSS *.trc
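
The combined trace file can then be processed as usual; for example, a typical tkprof invocation (the output file name is illustrative) looks like this:

[oracle@london1 ~]$ tkprof query16.trc query16.prf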

large_pool_size, parallel_execution_message_size, and shared_pool_size

By default, Parallel Execution-related memory is allocated in the shared pool. An exception to this rule is the largest memory component for Parallel Execution, the Parallel Execution Message Pool. This pool's memory will be allocated in the large pool when the SGA_TARGET parameter is set. You can view the memory allocations by executing the following query:

SQL> select * from v$sgastat where name like '%PX%';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  PX subheap desc                   140
shared pool  PX msg pool struct                692
shared pool  PX QC deq stats                  1692
shared pool  PX server deq stats              1692
shared pool  PX subheap                      33244
shared pool  PX QC msg stats                  2816
shared pool  PX server msg stats              2816
large pool   PX msg pool                    491520

Note

When using Automatic Memory Management or Automatic Shared Memory Management, a large pool will be configured, and the Parallel Execution Message Buffers will be stored in that pool, regardless of whether you have explicitly set the large_pool_size parameter. The Parallel Execution Message Buffers will be stored in the shared pool only if you meet this pair of conditions: first, you must have disabled automatic memory configuration; and second, you must not have set the large_pool_size parameter.

In any case, it is good practice to set the large_pool_size parameter, even if you're using automatic memory features, to specify a minimum large pool size that Oracle can adjust upwards according to its current requirements.

The parallel_execution_message_size parameter determines the size of the buffers that constitute the Parallel Execution Message Pool, and this is the memory used for communication between the Parallel Execution Servers. In releases prior to Oracle Database 11g Release 2, the default value was 2148 bytes (approximately 2KB). Consequently, it was typically reported that increasing the value of parallel_execution_message_size was beneficial. Beginning with Oracle Database 11g Release 2, however, the default value of 16384 is usually sufficient, and increasing it to a value such as 32768 will result in more memory being consumed from the large pool. Therefore, you should conduct tests to determine whether increasing the size is beneficial in a particular environment.
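
For example, such a test might change the value in the spfile for all instances (the value shown is illustrative; the parameter is static, so the instances must be restarted for it to take effect):

SQL> alter system set parallel_execution_message_size = 32768
  2  scope = spfile sid = '*';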

parallel_adaptive_multi_user

By default, the parallel_adaptive_multi_user parameter is set to TRUE. This parameter enables the rationing of available Parallel Execution Servers between multiple users executing Parallel Queries on the system. When a Parallel Query is issued, the Parallel Execution Servers will be assigned according to the requested Degree of Parallelism and the number of available Parallel Execution Servers not already in use. However, if multiple users are executing Parallel Queries and insufficient Parallel Execution Servers are available, then the parallel operation will be downgraded and run with a reduced Degree of Parallelism; this will occur whether parallel_adaptive_multi_user is set or not. Once a Degree of Parallelism has been set for a particular query, it will not be downgraded while the query is running. If no Parallel Execution Servers are available, then a query will run serially or throw an error, according to the setting of parallel_min_percent, which will be discussed later in this section. This approach has the disadvantage that one or two queries may utilize all of the available Parallel Execution resources on the system, while the rest fail or run serially. Setting parallel_adaptive_multi_user to TRUE therefore enables the system to downgrade a requested Degree of Parallelism according to an algorithm based upon the cpu_count parameter, which is multiplied by parallel_threads_per_cpu to obtain a default Degree of Parallelism. Next, a reduction factor is applied according to the predicted system load, based on the number of active users on the system at the time the query was started. This algorithm enables the Degree of Parallelism to be throttled below the requested amount, so that Parallel Execution Servers remain available for additional users; over time, the algorithm is designed to provide an even distribution of resources. The potential disadvantage of setting parallel_adaptive_multi_user to TRUE is that it is difficult to predict the level of resources that will need to be allocated for anticipated future loads, especially in a clustered environment. Therefore, setting this parameter may have the unintended consequence of artificially limiting Parallel Execution performance, while also making it difficult to configure optimal performance for a particular individual query. Additionally, the number of CPUs on the system is the key input to the algorithm, but Parallel Execution is highly dependent on memory and I/O resources as well.

For these reasons, we believe that parallel_adaptive_multi_user should typically be set to FALSE until its benefits have been tested and assessed in a particular environment. As an alternative, the DBA should investigate the features enabled by the parallel_degree_policy parameter for automating the Degree of Parallelism and enabling statement queuing. To limit resource usage on a per-user basis, we recommend using parallel_degree_policy in conjunction with the DBMS_RESOURCE_MANAGER package to create a consumer group, map that consumer group to a data warehouse-focused service, and then create a plan to limit resource usage by users under that group, as shown in the sketch that follows. This approach offers a much finer degree of control than can be achieved using parallel_adaptive_multi_user.
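
The following PL/SQL sketch outlines this approach; the group, plan, and service names, as well as the DOP limit of 8, are illustrative assumptions rather than recommended values:

SQL> begin
  2    -- all Resource Manager changes are staged in a pending area
  3    dbms_resource_manager.create_pending_area();
  4    dbms_resource_manager.create_consumer_group('DSS_GROUP', 'DSS users');
  5    -- map sessions connecting via the DSS service to the group
  6    dbms_resource_manager.set_consumer_group_mapping(
  7      dbms_resource_manager.service_name, 'DSS', 'DSS_GROUP');
  8    dbms_resource_manager.create_plan('DSS_PLAN', 'Limit DSS parallelism');
  9    dbms_resource_manager.create_plan_directive(
 10      plan => 'DSS_PLAN', group_or_subplan => 'DSS_GROUP',
 11      comment => 'cap DOP for DSS users', parallel_degree_limit_p1 => 8);
 12    -- every plan must include a directive for OTHER_GROUPS
 13    dbms_resource_manager.create_plan_directive(
 14      plan => 'DSS_PLAN', group_or_subplan => 'OTHER_GROUPS',
 15      comment => 'all other sessions');
 16    dbms_resource_manager.validate_pending_area();
 17    dbms_resource_manager.submit_pending_area();
 18  end;
 19  /

In practice, users must also be granted the right to switch to the consumer group with DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP, and the plan takes effect once the resource_manager_plan parameter is set.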

parallel_automatic_tuning

The parallel_automatic_tuning parameter is deprecated. By default, it is set to FALSE; if it were set to TRUE and the PARALLEL clause was used when creating a table, then Oracle would automatically set the additional Parallel Execution parameters. However, this parameter's functionality was superseded, first by parallel_adaptive_multi_user and subsequently by parallel_degree_policy.

parallel_degree_limit

The parallel_degree_limit parameter defaults to the value CPU, which is equal to the value of cpu_count, multiplied by parallel_threads_per_cpu, and multiplied again by the number of available instances. This parameter controls the maximum Degree of Parallelism that can be selected for any one statement. Alternatively, parallel_degree_limit can be set to the value IO to limit the Degree of Parallelism according to the I/O capacity of the system, as measured by running dbms_resource_manager.calibrate_io (see Chapter 4 for more information on this). Finally, the parallel_degree_limit parameter can be set to an integer value to set a user-defined limit for the maximum Degree of Parallelism. This parameter takes effect only when Automatic Degree of Parallelism is enabled by setting parallel_degree_policy to AUTO or LIMITED.
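
For example, to impose a user-defined cap when Automatic Degree of Parallelism is in use, you might set the following (the value of 16 is illustrative):

SQL> alter system set parallel_degree_limit = 16;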

parallel_degree_policy, parallel_min_time_threshold, and parallel_servers_target

The parallel_degree_policy parameter, introduced with Oracle Database 11g Release 2, represents the most significant change to the configuration of Parallel Execution. Setting parallel_degree_policy enables three features: Automatic Degree of Parallelism, Statement Queuing, and In-Memory Parallel Execution. The default setting for parallel_degree_policy is MANUAL, which disables all of these features. Setting the parameter to AUTO enables all three features, while setting it to LIMITED enables Automatic Degree of Parallelism only for objects with a PARALLEL clause, retaining manual control elsewhere.

Automatic Degree of Parallelism

Without Automatic Degree of Parallelism, when the parallel_degree_policy parameter is set to MANUAL, the Degree of Parallelism can be set, in order of precedence, as follows:

  1. It can be set with a PARALLEL hint for a particular statement such as the following:

    SQL> select /*+ parallel (lineitem,8) */ count(*) from lineitem;
  2. It can be set with the ALTER SESSION FORCE PARALLEL command for an individual session, as in the following example:

    SQL> alter session force parallel query parallel (degree 8);
  3. It can be set by specifying a PARALLEL value within a table or index definition, using an optional DEGREE clause, as shown in the example following this list.
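
For instance, here is a minimal sketch of the object-level setting, assuming the lineitem table used in this chapter's examples:

SQL> alter table lineitem parallel (degree 8);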

The additional INSTANCES clause specified for RAC environments in previous versions of the Oracle Database is deprecated in release 11g, and it is no longer required. If the PARALLEL clause is used but a DEGREE clause is not specified, then the Degree of Parallelism defaults to the total number of CPUs on all instances in the cluster, multiplied by the value of parallel_threads_per_cpu. For example, if there are two nodes in the cluster, each with two sockets and eight cores in total, and a default parallel_threads_per_cpu value of 2, then the default Degree of Parallelism will be 32, or 16 Parallel Execution Servers per instance.

Setting parallel_degree_policy to AUTO enables the optimizer to select the Degree of Parallelism that a statement will run with, without the DBA or developer needing to set the Degree of Parallelism explicitly at any level. It is important to reiterate that Automatic Degree of Parallelism, as the name implies, entirely overrides any manual parallelism settings, whether implemented with a statement-level hint, a session-level setting, or a PARALLEL clause at the object level. The result is that parallelism may be used, and a DOP selected, entirely at the discretion of the optimizer. For example, an operation may be parallelized against an object that has no PARALLEL clause or hint and no session-level parallel setting. The explain plan for such a statement specifies the calculated Degree of Parallelism, as shown in this example:

SQL> set autotrace on explain;
SQL> select count(*) from lineitem;

  COUNT(*)
----------
 599982790
...
Note
-----
 - automatic DOP: Computed Degree of Parallelism is 32 because of degree limit

The preceding example shows that the Degree of Parallelism has been set to the limit defined by the parallel_degree_limit parameter, which defaults to the value of CPU. However, if the parallel_max_servers parameter is set to a value below this level (as discussed later in this section), then the computed DOP will remain at 32, but the statement will run with Parallel Execution Servers equal to parallel_max_servers on each instance. With Automatic Degree of Parallelism in effect at the instance level, it is also possible to set a MANUAL policy at the session level, as in this example:

SQL> alter session set parallel_degree_policy=MANUAL;

To apply Automatic Degree of Parallelism only to object-level definitions across the database, it is necessary to set the parallel_degree_policy parameter to LIMITED. At this setting, the optimizer computes a Degree of Parallelism only for objects for which the PARALLEL clause has been used, as opposed to deciding whether to parallelize statements regardless of whether the target object has a PARALLEL clause defined. When Automatic Degree of Parallelism is enabled, the parallel_min_time_threshold parameter specifies the minimum estimated execution time a statement must exceed before Automatic DOP is applied; its default value is AUTO.

We recommend testing Automatic Degree of Parallelism to determine whether it is beneficial in your environment. You can do this by examining the execution plan, recording the Degree of Parallelism against the performance of your queries, and comparing that Degree of Parallelism to a manually calculated DOP. As noted previously, the DOP calculation is based on CPU or I/O, while the DBA will have superior knowledge of the resources available across the entire cluster. Notwithstanding the additional Statement Queuing and In-memory Parallel Execution features, the authors have observed that, while a manually calculated DOP can result in superior Parallel Query performance, Automatic DOP is generally accurate in calculating a near-optimal level of performance. The parallel_degree_limit parameter is an important factor in this calculation.

Statement Queuing

When parallel_degree_policy is set to AUTO, a feature called Statement Queuing is enabled; this works in conjunction with Automatic DOP. As you saw previously with the parallel_adaptive_multi_user parameter, in releases prior to Oracle 11g Release 2, if insufficient Parallel Execution Servers were available to meet a requested DOP, then a Parallel Query could be downgraded, run serially, or throw an error. The disadvantage of this behavior is that it can prove suboptimal for the system as a whole, resulting in slower query times for all users and high resource utilization; it also makes workloads difficult to tune, administer, and predict. Statement Queuing provides a simple yet effective alternative: if the requested Parallel Execution Servers are not available, then the new query is blocked from running and placed in a queue until the queries holding the Parallel Execution Servers release them. The next statement is then able to run at the requested DOP or the system-enforced limit. This ensures that no statements are downgraded, run serially, or raise an error. While queued, the session waits on the JX enqueue event, as shown in the following extract from an AWR report:

                                                         Avg
                                                        wait   % DB
Event                               Waits     Time(s)   (ms)   time Wait Class
------------------------------ ---------- ----------- ------ ------ ----------
direct path read                    5,180       1,942    375   71.1 User I/O
enq: JX - SQL statement queue           1         273 3.E+05   10.0 Scheduler
DB CPU                                            256           9.4
log file sync                           6           7   1166     .3 Commit
Parameter File I/O                      8           5    630     .2 User I/O

The parallel_servers_target parameter defines the number of Parallel Execution Servers that can be running across the cluster before statements are queued. parallel_servers_target cannot exceed the value of parallel_max_servers, and by default it is set to a lower value to ensure that statements are queued before all Parallel Execution Servers are utilized. For example, the default setting for a two-node cluster with eight CPUs per node is 32.

Statement Queuing therefore enables more predictable behavior when analyzing and resolving issues in environments where the number of Parallel Queries consistently exceeds the Parallel Execution Servers and system resources available.
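
While statements are queued, the waiting sessions can be observed directly; here is a minimal sketch using the wait event shown in the preceding extract:

SQL> select inst_id, sid, sql_id, event from gv$session
     where event = 'enq: JX - SQL statement queue';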

In-memory Parallel Execution

In releases prior to Oracle 11g Release 2, a Parallel Query would almost always use a direct path read, as opposed to a db file scattered read, for full table scans. As will be discussed later in this chapter in the context of performance, a direct path read bypasses the buffer cache. It also places a performance emphasis on the I/O capabilities of the system, because the majority of objects subject to Parallel Queries are larger than the available buffer cache. In-memory Parallel Execution has been made possible by advances in hardware capabilities, 64-bit computing, and memory performance (see Chapter 4 for more details). You can expect to see a number of competing in-memory data warehouse products coming to market that will harness these new hardware capabilities. In the Oracle Database, large memory capacities give the aggregated buffer caches in a RAC environment the potential to hold large objects in memory, improving performance for buffered I/O, such as the db file scattered reads associated with full table scans.

When the parallel_degree_policy parameter is set to AUTO, the database may, depending upon an object's size, cache the data for subsequent reuse after initially reading it from disk. However, this behavior diverges from the managing of data in a standard RAC environment. As discussed in Chapter 2, Cache Fusion enables the data blocks requested on one node in the cluster to be passed directly from the buffer cache on another node across the interconnect. If a block is requested on one node, that block will always be cached on that node before access. This is the common shared-all clustering approach.

In-memory Parallel Execution differs from the preceding approach in this respect: if the sum total of the data blocks required to satisfy the query will fit in the total buffer cache available across all nodes in the cluster, then the blocks are cached in a distributed fashion, and only a Parallel Execution Server on a particular node is used to access the data blocks cached on that node, as opposed to passing block images between nodes with Cache Fusion. This concept can be thought of as simulating the shared-nothing cluster approach typical of in-memory data warehouse products: the data is divided between the nodes in the cluster without being shared between them, and the resources on one node are dedicated to accessing the data on that node only. With this shared-nothing approach in mind, it should also be clear why In-memory Parallel Execution, Automatic Degree of Parallelism, and Statement Queuing are enabled with a single parameter, as opposed to a separate parameter for each. With data cached on a per-node basis and node affinity used to access it, features such as per-object Parallel Query definitions, hints, query downgrading, and serialization would face significant design and implementation challenges. If In-memory Parallel Execution is to deliver benefits beyond direct path reads issued from any or all nodes in the cluster, then automating the entire process is more likely to deliver optimal results when the data is segmented on a per-node basis.

Once In-memory Parallel Execution is enabled, the Oracle Database software determines the caching of data for Parallel Query based on factors such as the combined size of the buffer caches across the nodes, the size of the objects subject to the Parallel Query, and the frequency with which they are accessed. Data is affinitized to the nodes based either on block ranges or on partitions, as described in the Parallel Execution concepts discussed previously in this chapter. From these concepts, it is also possible to surmise that partitioning is particularly beneficial to In-memory Parallel Execution: it enables the Parallel Execution Servers on an individual node to operate on the rows contained within that node's partitions, while simultaneously minimizing data redistribution and moving the implementation closer to a clustered shared-nothing approach.

It should be clear that the parallel_degree_policy parameter brings great potential for implementing a high-performance Parallel Execution environment, where the full benefits of segmenting and affinitizing the data across the nodes are realized at the hardware selection and schema design phases. This approach may not benefit all environments, particularly those where the data being queried changes on a regular basis. Therefore, we recommend the following approach by default: leave parallel_degree_policy at the MANUAL setting while you review the architectural design of the data warehouse and test to determine whether In-memory Parallel Execution and its associated features will benefit your particular environment.

parallel_force_local

The parallel_force_local parameter defaults to the value of FALSE; when set to TRUE, it restricts Parallel Execution to the Parallel Execution Servers on the node where the query was executed. parallel_force_local is particularly useful when enabled at the session level because it restricts the queries from a particular session to the local node. In previous releases, this result could only be achieved by setting the instance_groups and parallel_instance_group parameters.
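
For example, to restrict an individual session's Parallel Queries to the local node:

SQL> alter session set parallel_force_local=TRUE;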

parallel_io_cap_enabled

The parallel_io_cap_enabled parameter achieves a goal similar to setting parallel_degree_limit to the value of IO: it limits the Degree of Parallelism according to the I/O capacity of the system. However, parallel_io_cap_enabled is available even when Automatic Degree of Parallelism is not enabled. Before the I/O limit can take effect, you must first calibrate the I/O capacity by running dbms_resource_manager.calibrate_io.
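
A minimal calibration sketch follows; the num_physical_disks and max_latency values are assumptions that should be adjusted to your storage, and because the procedure generates a real I/O load, it should be run during a quiet period:

SET SERVEROUTPUT ON
DECLARE
  l_max_iops PLS_INTEGER;
  l_max_mbps PLS_INTEGER;
  l_latency  PLS_INTEGER;
BEGIN
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO(
    num_physical_disks => 8,    -- assumption: disks backing the database
    max_latency        => 20,   -- assumption: maximum tolerated latency (ms)
    max_iops           => l_max_iops,
    max_mbps           => l_max_mbps,
    actual_latency     => l_latency);
  DBMS_OUTPUT.PUT_LINE('max_iops: ' || l_max_iops || ' max_mbps: ' || l_max_mbps);
END;
/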

parallel_max_servers, parallel_min_servers, parallel_threads_per_cpu, and processes

The parallel_max_servers parameter sets a limit on the number of Parallel Execution Servers available on an instance. By default, this parameter is equal to the value of cpu_count, multiplied by parallel_threads_per_cpu, and multiplied again by 5. parallel_min_servers sets the minimum value, and it must always be less than or equal to parallel_max_servers.

parallel_threads_per_cpu specifies the number of Parallel Execution Server processes that each CPU is expected to handle simultaneously. On an eight-CPU server, the default values are as follows:

parallel_max_servers                 integer     80
parallel_min_servers                 integer     0
parallel_threads_per_cpu             integer     2

You should also ensure that the standard processes parameter is set to a value that permits the creation of the desired number of Parallel Execution Servers. parallel_max_servers sets the upper limit on the Parallel Execution Servers available to all users on an instance, which prevents the starvation of CPU resources. A query may request a higher Degree of Parallelism and use the CPU resources on multiple nodes in the cluster, up to the level of parallel_max_servers multiplied by the number of nodes in the cluster; however, the shared I/O resources available remain the same. It is therefore important to select an appropriate Degree of Parallelism, or to have it calculated automatically, rather than relying on parallel_max_servers to limit the DOP. parallel_min_servers should be set to the number of Parallel Execution Servers that you wish to keep running on an instance permanently. These processes consume memory, but they remain immediately available for Parallel Execution. Beyond parallel_min_servers, operating system processes are started, while the Oracle Database waits on the event os thread startup, up to the limit of parallel_max_servers. This approach saves resources when they are not required, at the expense of the time taken to create Parallel Execution Servers when they are needed.

As you have seen, parallel_threads_per_cpu serves as an input value and multiplier for a number of additional Parallel Execution parameters. Therefore, even minor modifications can have a disproportionate effect on the configuration. Given the advances in CPU architecture and the performance available from even a single core on a multicore processor, we do not recommend increasing this value beyond the default value of 2, especially if the system is I/O bound. The only potential benefit from changing this parameter can be seen when reducing it to 1 for environments with a highly tuned I/O subsystem or when using In-memory Parallel Execution. However, we recommend tuning the parameters that depend upon it before modifying parallel_threads_per_cpu itself.

parallel_min_percent

The parallel_min_percent parameter defaults to 0 and specifies the minimum percentage of the requested Parallel Execution Servers that must be available on an instance for a statement to execute in parallel; if this percentage cannot be met, the query fails with an error. The default value of 0 means that a query will never fail for lack of Parallel Execution Servers; instead, it will execute serially if necessary. Setting this parameter is not required if Statement Queuing has been enabled.
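
As a sketch, to require that at least half of the requested Parallel Execution Servers be available before a statement is allowed to run (statements that cannot obtain this percentage raise an ORA-12827 error):

SQL> alter session set parallel_min_percent = 50;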

pga_aggregate_target

In a default, non-In-memory Parallel Execution environment, data is read directly into each Parallel Execution Server process's private memory area. Consequently, the pga_aggregate_target parameter should be sized sufficiently to provide a work area large enough for all of the Parallel Execution Servers in use. The amount of PGA memory required for a particular query on an instance is proportional to its Degree of Parallelism. You also need to provide enough memory for operations such as sorts, so as to minimize additional I/O against the temporary tablespace. When using non-In-memory Parallel Execution, your memory-sizing focus should be on the PGA before the SGA.

We recommend using the PGA advisory section of an AWR report, or viewing V$PGA_TARGET_ADVICE, to ensure that sufficient PGA memory is available when sizing memory manually. Alternatively, you should consider using Automatic Memory Management to enable the reallocation of memory between the SGA and PGA. When the buffer cache is not being heavily utilized, there is less benefit in using operating system-level Huge Pages for the SGA (which cannot be combined with Automatic Memory Management); this is one of the reasons Automatic Memory Management is a reasonable choice in Parallel Execution environments.
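
A minimal sketch of querying the advisory directly:

SQL> select pga_target_for_estimate/1024/1024 target_mb, pga_target_factor,
            estd_pga_cache_hit_percentage, estd_overalloc_count
     from v$pga_target_advice;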

Parallel Execution Performance

Monitoring of Parallel Execution performance can be performed with all of the tools and utilities discussed in Chapters 12 and 13. The AWR report in particular is a useful tool for assessing the overall impact of Parallel Execution on each instance.

AWR Reports

The load profile provides a clear indication that the performance emphasis is on physical reads. Note also that the Transactions figure is typically low; Parallel Query workloads differ significantly from OLTP environments, as illustrated by this example:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
˜˜˜˜˜˜˜˜˜˜˜˜         ---------------    --------------- ---------- ----------
      DB Time(s):               38.4            8,358.2       9.28       0.41
       DB CPU(s):                1.5              317.9       0.35       0.02
       Redo size:            1,370.7          298,506.0
   Logical reads:           26,810.8        5,838,968.8
   Block changes:                3.7              794.0
  Physical reads:           24,061.6        5,240,243.8
 Physical writes:              380.7           82,907.4
      User calls:               94.1           20,483.5
          Parses:                2.1              466.3
     Hard parses:                0.2               45.3
W/A MB processed:               22.8            4,973.6
          Logons:                0.9              185.2
        Executes:                4.1              901.1
       Rollbacks:                0.0                0.0
    Transactions:                0.0

This focus upon read performance is confirmed in the Top 5 Timed Foreground Events section. In the following example, the User I/O wait class dominates the top events:

Top 5 Timed Foreground Events
˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path read                  1,012,948      29,481     29   35.3 User I/O
db file scattered read               40,059       4,731    118    5.7 User I/O
DB CPU                                            3,179           3.8
direct path read temp                13,605       1,828    134    2.2 User I/O
db file sequential read               6,081         148     24     .2 User I/O

As we have noted previously in this chapter, the most significant wait event is typically the direct path read. This wait event signifies that the Parallel Execution Server is reading data into its private memory area, bypassing the SGA. When doing direct path reads, you may observe fast object checkpoints, where a checkpoint is performed at the segment level to ensure that any data blocks modified in the SGA are consistent with the data on disk before the blocks are read directly, as in this example:

Enqueue Type (Request Reason)
------------------------------------------------------------------------------
    Requests    Succ Gets Failed Gets       Waits  Wt Time (s) Av Wt Time(ms)
------------ ------------ ----------- ----------- ------------ --------------
KO-Multiple Object Checkpoint (fast object checkpoint)
          63           63           0           7            0           4.29

If In-memory Parallel Execution is enabled, then db file scattered read waits indicate where data is being read into the buffer cache for subsequent reuse. As the previous example illustrates, you may see a combination in which larger tables are not cached, but smaller ones are. db file scattered reads may also, of course, continue to be recorded for serial full table scan operations.

These statistics and concepts should make it clear that a significant contributing factor to Parallel Query performance is the availability of hardware resources, notably the CPU, memory, and I/O, which determine the rate at which data can be read. You should review Chapter 4 for more information on how to configure hardware in a RAC environment.

For Parallel Execution, the most significant of these factors is the CPU. Processors with many high-performance cores mean that considerable levels of I/O can be sustained by the CPUs across all nodes in an up-to-date Linux RAC environment. Consequently, the performance emphasis is invariably on providing sufficient I/O read capacity to match the level of performance available at the CPU level. Note that some documentation on sizing Parallel Execution was written against older, non-Linux systems whose available CPU resources were significantly below those of Linux systems available today, and some of that documentation advises increasing CPU usage to compensate for an I/O-bound system. Contrary to this advice, where I/O is the limiting factor, the most viable way to make a significant contribution to performance is to add the hardware resources necessary to increase I/O capacity; other techniques provide only limited results while the available CPU is not being heavily utilized. You can use the AWR report to observe the relative usage of CPU and I/O for an individual statement, as shown under the SQL ordered by User I/O Wait Time section in this example:

User I/O                UIO per           Elapsed
Time (s)  Executions    Exec (s) %Total   Time (s)   %CPU    %IO    SQL Id
-------- ------------ ---------- ------ ---------- ------ ------ -------------
    49.5            1      49.53    1.6      332.8   14.9   85.7 dqh4ks2qwcgs0
select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt
from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#35'
and p_type not like 'STANDARD BRUSHED%' and p_size in (29, 25, 16, 34, 6,
12, 36, 47) and ps_suppkey not in ( select s_suppkey from supplier where
s_comment like '%Custom

This emphasis on I/O means that you should also pay attention to the operating system utilities discussed in Chapter 12. These utilities let you observe read performance in real time while running the desired configuration of Parallel Execution, complementing the AWR report data. In the following example, iostat is used to monitor the read performance of a single two-socket node executing a parallel query. The heading rMB/s shows a read rate of approximately 1 Gigabyte per second:

[root@london1 ˜]# iostat -x -m sdb1 3 10
...
Device: rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz  await
svctm %util
sdb1    3.32     0.00 4150.50  0.66  1032.32     0.01   509.31   135.24  32.74
0.24  99.70

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.78    0.00    0.67   71.04    0.00   22.51

Device: rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz  await
svctm  %util
sdb1    5.67     0.00 4689.67  0.67  1166.69     0.01   509.43   136.11   29.10
0.21   100.03

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.37    0.00    0.62   72.44    0.00   21.56

Device: rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz  await
svctm  %util
sdb1    3.33     0.00 4352.33  0.67  1082.91     0.01   509.49   135.69 31.10
0.23   100.03

In a RAC environment, you should aim for the read rate to consistently be in the range of multiple gigabytes per second. If these rates are not achievable, then it is unlikely that you will be able to benefit from the full CPU potential of all of the nodes in your cluster. In our experience, where sufficient capacity is not available, you should investigate some of the hardware technologies discussed in Chapter 4. For example, Solid State Disks (SSDs) can make a significant contribution to improving Parallel Query performance. Where I/O performance is maximized but CPU is underutilized, you may also consider some of the available techniques for data compression. These can reduce I/O usage at the expense of increasing the CPU usage required to decompress the data.
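
As a sketch, assuming the lineitem table used in earlier examples, basic table compression can be applied with a segment rebuild:

SQL> alter table lineitem move compress;

Note that basic compression of this kind compresses the existing data during the MOVE and data loaded via subsequent direct-path operations; compressing conventional row-by-row DML requires the separate OLTP compression feature of the Advanced Compression option.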

In an AWR report captured during a Parallel Execution workload, you may also observe wait events with the prefix PX Deq, such as PX Deq: Table Q Normal. From the earlier discussion of the concepts behind Parallel Execution, you might recall that producers and consumers communicate through Table Queues; this type of wait event is recorded by a consumer process waiting for data from a producer. Events in this class are considered idle events, which means that they cannot be explicitly tuned. Additional inspection is warranted only if these events consume considerably more time than the I/O-related events.

When previously discussing the pga_aggregate_target parameter, we noted that correctly sizing the PGA is the single most significant factor in memory allocation for Parallel Execution environments. The following example shows a PGA aggregate target of 12 Gigabytes, with a size factor of 1; the advisory estimates that the optimal size lies between 12 and 15 Gigabytes:

                                      Estd Extra    Estd P Estd PGA
PGA Target    Size           W/A MB   W/A MB Read/    Cache Overallo    Estd
  Est (MB)   Factr        Processed Written to Disk   Hit %    Count    Time
---------- ------- ---------------- ---------------- ------ -------- -------
     1,536     0.1         49,789.8         74,501.5   40.0        2 1.1E+08
     3,072     0.3         49,789.8         48,597.1   51.0        1 8.8E+07
     6,144     0.5         49,789.8         10,680.6   82.0        0 5.4E+07
     9,216     0.8         49,789.8          7,668.8   87.0        0 5.1E+07
    12,288     1.0         49,789.8          7,668.8   87.0        0 5.1E+07
    14,746     1.2         49,789.8              0.0  100.0        0 4.5E+07
    17,203     1.4         49,789.8              0.0  100.0        0 4.5E+07
    19,661     1.6         49,789.8              0.0  100.0        0 4.5E+07
    22,118     1.8         49,789.8              0.0  100.0        0 4.5E+07
    24,576     2.0         49,789.8              0.0  100.0        0 4.5E+07
    36,864     3.0         49,789.8              0.0  100.0        0 4.5E+07
    49,152     4.0         49,789.8              0.0  100.0        0 4.5E+07
    73,728     6.0         49,789.8              0.0  100.0        0 4.5E+07
    98,304     8.0         49,789.8              0.0  100.0        0 4.5E+07

Given the focus on I/O driven by CPU performance, it is important to also ensure that memory capacity is sufficient to match the capabilities of the I/O and CPU. For In-memory Parallel Execution, this also means correctly sizing the buffer cache. The following truncated buffer cache advisory shows advice for a dedicated 32k block cache that would benefit from being increased from its current 4 Gigabyte allocation:

                                   Est
                                   Phys      Estimated                  Est
    Size for   Size      Buffers   Read     Phys Reads     Est Phys %DBtime
P    Est (M) Factor  (thousands) Factor    (thousands)    Read Time for Rds
--- -------- ------ ------------ ------ -------------- ------------ -------
...
32k    3,456     .8          109    1.0          1,651            2    34.0
32k    3,840     .9          122    1.0          1,617            2    33.0
32k    4,096    1.0          130    1.0          1,593            2    33.0
32k    4,224    1.0          134    1.0          1,578            2    32.0
32k    4,608    1.1          146    0.9          1,492            2    30.0
32k    4,992    1.2          158    0.9          1,362            2    28.0
32k    5,376    1.3          170    0.7          1,150            2    23.0
32k    5,760    1.4          182    0.6          1,015            2    21.0
32k    6,144    1.5          194    0.6            983            2    20.0
...

We noted previously that the interconnect makes an important contribution in Parallel Execution environments, especially when it comes to transferring data between producers and consumers on different nodes in the cluster. The AWR report includes a section on Interconnect Throughput by Client, illustrating the level of bandwidth utilized by Parallel Query. The following example shows the interconnect throughput for one node in a two-node cluster:

Interconnect Throughput by Client          DB/Inst: PROD/PROD1  Snaps: 621-623
-> Throughput of interconnect usage by major consumers
-> All throughput numbers are megabytes per second

                        Send     Receive
Used By           Mbytes/sec  Mbytes/sec
---------------- ----------- -----------
Global Cache             .00         .00
Parallel Query           .70        2.93
DB Locks                 .00         .00
DB Streams               .00         .00
Other                    .00         .00

The following example shows the throughput for the second node in the cluster:

Interconnect Throughput by Client          DB/Inst: PROD/PROD2  Snaps: 621-623
-> Throughput of interconnect usage by major consumers
-> All throughput numbers are megabytes per second

                        Send     Receive
Used By           Mbytes/sec  Mbytes/sec
---------------- ----------- -----------
Global Cache             .00         .00
Parallel Query          2.85         .70
DB Locks                 .00         .00
DB Streams               .00         .00
Other                    .00         .00

The preceding example shows that Parallel Query consumed less than 3 MB/s on this two-node cluster; this is well within the capabilities of a Gigabit-based interconnect. However, monitoring this value will indicate whether the bandwidth capacity remains sufficient for Parallel Execution, particularly as the number of nodes in the cluster increases.

SQL*Plus

In addition to the information contained in an AWR report, you can also use SQL*Plus to observe Parallel Execution-related performance data. Previously, you saw an example at the session level with V$PQ_TQSTAT.

Statistics related to parallel operations can be monitored through the GV$PQ_SYSSTAT and GV$PQ_SESSTAT views, at the system and session levels, respectively. More general information is also recorded in GV$SYSSTAT, under statistics with names that include the terms parallelized, Parallel Operations, and PX. GV$PX_PROCESS shows the status of the Parallel Execution Server processes available on the system, as in this example:

SQL> SELECT * FROM GV$PX_PROCESS;

   INST_ID SERV STATUS           PID SPID                 SID SERIAL#
---------- ---- --------- ---------- ------------- ---------- -------
         2 P013 IN USE            56 29719                  4   14432
         2 P005 IN USE            48 29703                  7   56722
         2 P021 IN USE            64 29735                  9    3136
         2 P029 IN USE            72 29751                 10      33
...
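
As a sketch, the server-related statistics can also be summarized across all instances with a query such as the following:

SQL> select inst_id, statistic, value from gv$pq_sysstat
     where statistic like 'Servers%' order by inst_id, statistic;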

Advice on sizing the large pool adequately for the required message buffers can also be derived from the V$PX_BUFFER_ADVICE view:

SQL> select * from v$px_buffer_advice;

STATISTIC                           VALUE
------------------------------ ----------
Servers Highwater                      36
Buffers HWM                          2575
Estimated Buffers HWM                1080
Servers Max                            64
Estimated Buffers Max                3264
Buffers Current Free                  156
Buffers Current Total                2688

Trace Files

When discussing the applicability of configuring services, we also noted the use of the trcsess utility, which you can leverage to combine Parallel Execution trace files. As shown in the following example, running tkprof against the generated trace file will also show detailed output of the events waited on during the execution of the particular queries traced:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Execution Msg                         659        0.88         13.41
  latch: cache buffers chains                    18        0.00          0.01
  direct path read                            11860        0.60        184.48
  latch free                                      6        0.00          0.00
  latch: library cache pin                        1        0.00          0.00
  PX Deq: Table Q Normal                      11910        0.41        192.58
  enq: BF - allocation contention                18        0.00          0.03
  latch: enqueue hash chains                      1        0.00          0.00
  PX Deq Credit: need buffer                   1154        0.33         29.42
  PX Deq Credit: send blkd                     2370        0.64         54.86
  PX qref latch                                1182        0.05          3.78
  PX Deq: Table Q Get Keys                       10        0.00          0.02
  PX Deq: Table Q Sample                         10        0.00          0.04
  SQL*Net message to client                    2786        0.00          0.00
  SQL*Net message from client                  2786        0.00          0.46
  os thread startup                            24        0.97         22.52
  PX Deq: Join ACK                               16        0.00          0.00
  PX Deq: Parse Reply                            15        0.00          0.01
  PX Deq: Execute Reply                         569        1.96         43.90
  PX Deq: Table Q qref                            6        0.01          0.02
  PX Deq: Signal ACK                             13        0.00          0.00
  latch: session allocation                       5        0.00          0.00

Summary

In this chapter, we have provided detailed information on the concepts, configuration, and performance of Parallel Execution in an Oracle Database 11g RAC environment. Along the way, we paid particular attention to some of the new features, such as in-memory Parallel Execution. If used appropriately, these features are particularly suited to helping clustered environments scale out to manage the largest of parallel operations.
