Unlearning Some Things We Thought We Knew
Oracle can do some things very differently when running on Exadata than when running on non-Exadata platforms. The optimizations provided by Exadata are designed to take a different approach than Oracle has traditionally followed. This change means that you need to attack some problems with a different mindset. That is not to say that everything is different—quite the contrary. In fact, most of the fundamental principles remain unchanged. After all, the same database software runs on Exadata that runs on other platforms. But there are some things that are just different. As you read in Chapter 13, you can just take your database and deploy it 1:1 on Exadata. Depending on the type of deployment, this might be OK. However, if you are interested in getting the most out of your investment, you should probably take a step back and review what you can do to further optimize the database on Exadata. Since there are a few things that are different on Exadata compared to a standard, non-Exadata deployment, this chapter is worth reading for phase 2 of your migration. In this chapter, we will focus on how we should change our approach when running databases on Exadata.
A Tale of Two Systems
The way we think about systems running on Exadata largely depends on the workload being executed. Online Transaction Processing (OLTP)-oriented workloads tend to focus us on using Smart Flash Cache to speed up small physical reads. But frankly, this type of workload cannot take advantage of most of the performance advantages provided by Exadata. Data Warehouse (DW)-oriented workloads tend to focus on making use of Smart Scans at every opportunity and trying to use all the available resources (CPU resources on the storage and database tiers, for example). This is where Exadata’s built-in advantages can have the most impact. Unfortunately, most systems exhibit characteristics of both DW and OLTP workloads. These “mixed workloads” are the most difficult and, ironically, the most common. They are the most common because it is rare to see an OLTP-oriented system that does not have some reporting component that produces long-running, throughput-sensitive queries. It is also quite common to see DW-oriented workloads that have OLTP-like trickle feeds or similar Extract, Load, Transform (ELT) processing needs. These combination systems demand the most careful thought because, depending on the issue at hand, you will have to constantly reset your approach to managing performance. They are also difficult because DW workloads are generally constrained by the data-flow dynamics that control throughput, while OLTP systems are usually constrained by latency issues. Hence, for mixed workloads, you basically need to train yourself to evaluate each scenario and categorize it as either latency-sensitive or throughput-sensitive. This evaluation of workload characteristics should be done before beginning any analysis.
If you thought that was an interesting challenge, a third variable can be thrown into the equation. Very few Exadata systems deployed today are home to just one single application. Whereas it used to be quite common to place only the critical application on Exadata when the first two generations were available, the increased capacity of both Smart Flash Cache and disk, together with the ever-increasing CPU power offered by Intel, makes Exadata an interesting consolidation platform. Many users have recognized this and, instead of one database with mixed workload requirements, you quite often have to arrange the needs of multiple databases, each potentially with mixed workload requirements, on your hardware. Thankfully, Oracle offers tools to deal with this situation.
OLTP-Oriented Workloads
Even though there is little to say about running OLTP workloads on Exadata, there are a handful of points to keep in mind with this type of system. Since Exadata runs standard Oracle database software, you should not have to adjust your basic approach significantly.
Exadata Smart Flash Cache (ESFC)
The key component of Exadata when it comes to OLTP workloads is Exadata Smart Flash Cache (ESFC), which can significantly reduce disk-access times for small reads. For that reason, it is important to verify that ESFC is working correctly. For this type of workload, you should also expect a large percentage of physical I/O operations to be satisfied by ESFC. This can be inferred fairly easily from the average single-block read times. A single-block read should take approximately 0.5 ms if it is satisfied by Flash Cache; by contrast, single-block reads take approximately 5 ms on average if they are satisfied by actual disk reads. Standard AWR reports provide both average values and a histogram of wait events. If the average single-block read times are well above the 1 ms range, you should be looking for a systemic problem, such as flash cards that are not working, or a critical table that has been defined to never be cached via the CELL_FLASH_CACHE NONE storage clause. I/O Resource Management should also be checked. The histograms should be used as well to verify that the average is not covering up a significant number of outliers. Here is the cellcli syntax to check the status of the flash cards:
CellCLI> list flashcache detail
name: dm01cel03_FLASHCACHE
cellDisk: FD_00_dm01cel03,FD_01_dm01cel03,FD_02_dm01cel03,
FD_03_dm01cel03,FD_04_dm01cel03,FD_05_dm01cel03,
FD_06_dm01cel03,FD_07_dm01cel03,FD_08_dm01cel03,
FD_09_dm01cel03,FD_10_dm01cel03,FD_11_dm01cel03,
FD_12_dm01cel03,FD_13_dm01cel03,FD_14_dm01cel03,FD_15_dm01cel03
creationTime: 2010-03-22T17:39:46-05:00
id: 850be784-714c-4445-91a8-d3c961ad924b
size: 365.25G
status: critical
Note that the status attribute on this cell is critical. As you might expect, this is not a good thing. On this particular system, the Flash Cache had basically disabled itself. We noticed it because the single-block read times had slowed down. This example is from an early version of cellsrv. The later versions include a little more information. Here is an example from cellsrv 12.1.2.1.0:
CellCLI> list flashcache detail
name: enkx4cel01_FLASHCACHE
cellDisk: FD_04_enkx4cel01,FD_06_enkx4cel01,FD_11_enkx4cel01,
FD_02_enkx4cel01,FD_13_enkx4cel01,FD_12_enkx4cel01,
FD_00_enkx4cel01,FD_14_enkx4cel01,FD_03_enkx4cel01,
FD_09_enkx4cel01,FD_10_enkx4cel01,FD_15_enkx4cel01,
FD_08_enkx4cel01,FD_07_enkx4cel01,FD_01_enkx4cel01,
FD_05_enkx4cel01
creationTime: 2015-01-19T21:33:37-06:00
degradedCelldisks:
effectiveCacheSize: 5.8193359375T
id: 3d415a32-f404-4a27-b9f2-f6a0ace2cee2
size: 5.8193359375T
status: normal
Notice the new attribute degradedCelldisks. Also notice that the Flash Cache on this cell shows a status of normal. Monitoring storage software behavior is covered in more detail in Chapter 12.
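If you prefer to verify flash effectiveness from the database side, the wait event histogram view offers a quick check. The following is a minimal sketch; the exact event name is version-dependent (on older releases, look for db file sequential read rather than cell single block physical read):

SQL> select wait_time_milli, wait_count
  2    from v$event_histogram
  3   where event = 'cell single block physical read';

If the bulk of the wait counts fall in the 1 ms-and-below buckets, most single-block reads are being served from flash. A heavy tail in the 8 ms-and-above buckets points back at spinning disk, or at one of the systemic problems mentioned earlier.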
Another thing to keep in mind when dealing with OLTP workloads is that the Exadata platform provides exceptional scalability. Upgrading from a half rack to full rack doubles the number of CPUs at both the database layer and the storage layer. The amount of ESFC is also doubled, as is the available memory. This allows Exadata to scale in a nearly linear fashion for many systems.
To add to this point, we can share a little anecdote. When presenting Exadata Smart Scan capabilities during workshops, we used to run a query against a table with all Exadata optimizations disabled, using pure hard-disk access, initially on an X2-2 quarter rack. When the X4-2 was introduced, we were lucky enough to have access to an X4-2 half rack. CPU differences aside, the difference between using three cells and seven cells was staggering. Add to the mix that most of the table was served from ESFC, and you can imagine response times dropping drastically. To get back to the timings we were used to, we had to increase the size of the table significantly. This is just one example of how Exadata scales very nicely, without any change to the application itself.
Write-Intensive OLTP Workloads
Write-intensive workloads are a subset of OLTP-oriented systems. Some systems constantly perform single-row inserts, each followed by a commit; in other words, they employ the slow-by-slow (row-by-row) approach. These systems are often limited by the speed at which commits can be done, which in turn often depends on the speed with which writes to the log files can be accomplished. This is one area where Exadata competes with other platforms on a fairly even playing field when operating ESFC in the default write-through mode. There are no major enhancements that make Exadata run orders of magnitude faster for systems that are bottlenecked on write operations. Smart Flash Logging can help here, though. Beginning with cellsrv 11.2.2.4, Exadata performs redo writes to disk and Flash devices simultaneously. The one that completes first “wins,” allowing the log writer to continue processing while the other write completes. Additionally, the cache on the disk controllers doubled to 1GB with the X5 hardware generation.
Smart Flash Logging is no magic bullet, and established methods of tuning, such as minimizing commits, are much more appropriate for these types of systems. You would be amazed what changing from row-by-row processing to set-based processing can do in terms of performance, without even having to think about hardware! Many web sites, perhaps most prominently Tom Kyte’s Ask Tom, have many references showing how row-by-row processing with a commit after every piece of work is suboptimal for performance.
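To illustrate the point, here is a hedged sketch contrasting the two approaches; the STAGING and TARGET tables are hypothetical stand-ins for your own schema:

SQL> -- Slow-by-slow: one insert and one commit per row
SQL> begin
  2    for r in (select id, amount from staging) loop
  3      insert into target (id, amount) values (r.id, r.amount);
  4      commit;  -- every commit waits on a redo write
  5    end loop;
  6  end;
  7  /

SQL> -- Set-based equivalent: one statement, one commit
SQL> insert into target (id, amount)
  2  select id, amount from staging;
SQL> commit;

The set-based version incurs a single commit and lets the database process the rows in bulk, which typically outperforms the loop by a wide margin regardless of platform.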
The log writer is not the only component that longs for low-latency I/O performance. Until Exadata version 11.2.3.2.1, using the Smart Flash Cache in write-through mode was the only available option. Beginning with 11.2.3.2.1, it is possible to use Flash Cache in write-back mode. Switching to write-back mode does not remove the need for the Flash Log, by the way; after the switch, you will still find it defined. Enabling write-back mode can improve performance for write-intensive workloads, as you saw in Chapter 5. However, you should assess carefully whether using the Write Back Cache is worth it, using the information provided in Chapter 5 on Flash Cache.
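You can check which mode the Flash Cache on a cell is currently operating in with cellcli. The output below is what you would expect on a cell still running in the default mode; the procedure for changing the mode is covered in Chapter 5:

CellCLI> list cell attributes flashCacheMode
         WriteThrough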
DW-Oriented Workloads
Exadata was initially designed to speed up long-running queries against large volumes of data. Therefore, it should come as no surprise that data warehousing is where it is important to change some of our basic thought processes. Of course, one of the major techniques is to be constantly looking for opportunities to allow Exadata optimizations to kick in. This means making sure that the application can take advantage of Smart Scans.
The most important concept to keep in mind when dealing with DW-oriented workloads is that long-running statements should usually be offloaded. Here are the steps to follow:
1. Determine whether your long-running statements are being offloaded.
2. If they are not, find out what is preventing the Smart Scan and fix the root cause.
These points seem so obvious that it should not really be necessary to repeat them. A large portion of the optimizations built into Exadata work only when Smart Scans are used. One of the first changes you need to make in the way you think is to train yourself to be constantly considering whether Smart Scans are being used appropriately or not. This means you need to have a good understanding of which statements (or parts of statements) can be offloaded and be able to determine whether statements are being offloaded or not. The requirements for Smart Scans and some of the techniques that can be used to verify whether they are being performed or not have been covered extensively throughout the book. But at the risk of being repetitive, here you go again.
Essentially, there are two main prerequisites that must be met in order for Smart Scans to occur. The first is that the optimizer must choose to do a full scan of a table or a materialized view, or the optimizer must choose to do a fast full scan of an index. Note that Smart Scans are not limited to queries or even to sub-queries. The optimizer can also choose to use full scans for DELETEs and UPDATEs when a large percentage of the rows will be affected. However, if your application is doing this, you might want to consider modifying it to do something like a truncate and rebuild. As the saying goes, “it depends”.
The second requirement for Smart Scans is that the scans must be performed using the direct path read mechanism. Note that the optimizer was not mentioned intentionally in the description of the second requirement. This is because the optimizer does not make the decision about whether to use direct path reads or not. It is a heuristic decision that is made after the plan has been determined. As such, it is not directly exposed by any of the tools like explain plan or other performance-related utilities. What this means in practice is that it is easy to verify that the first requirement has been met, but more challenging to verify the second requirement.
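One practical, if indirect, way to check the second requirement is to snapshot your session statistics around the query. This sketch relies on two statistics that should be present on Exadata systems, although the exact names can vary slightly between versions:

SQL> select name, value from v$statname natural join v$mystat
  2   where name in ('table scans (direct read)', 'cell scans');

Run this before and after your statement. If cell scans (the count of Smart Scans) increases, both requirements were met. If only table scans (direct read) increases, you got a direct path read without offloading. If neither moves, the scan went through the buffer cache and no Smart Scan was possible.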
On most Exadata implementations, a fairly high percentage of long-running queries are offloaded. You can check to see what percentage of your long-running SQL statements have been offloaded by selecting all the statements from v$sql (or from the AWR table DBA_HIST_SQLSTAT provided you have the license for it) that have an average run time of over some number of seconds, or that have an average logical I/O value that is greater than some reasonable value. Actually, the logical I/O is a better metric to use, as some of the offloaded statements will run very quickly and may not meet your minimum time criteria, which will give you a distorted perspective on the percentage of statements that are being offloaded. Here is an example (note that the scripts are included in the online code repository):
SQL> @offload_percent
Enter value for sql_text:
Enter value for min_etime:
Enter value for min_avg_lio: 500000
     TOTAL  OFFLOADED OFFLOADED_%
---------- ---------- -----------
        13         11      84.62%
SQL> /
Enter value for sql_text: SELECT%
Enter value for min_etime:
Enter value for min_avg_lio: 500000
     TOTAL  OFFLOADED OFFLOADED_%
---------- ---------- -----------
        11         11        100%
The listing makes use of the offload_percent.sql script, which calculates a percentage of statements currently in the shared pool that have been offloaded. It was initially used to evaluate all statements that had over 500,000 logical I/Os. It was run a second time where the investigation was limited to statements that begin with the word SELECT. In the next listing, you can see the output of a different script (fsxo.sql) that allows you to see the actual statements that contribute to the OFFLOAD_% calculated in the previous listing:
SQL> @fsxo
Enter value for sql_text:
Enter value for sql_id:
Enter value for min_etime:
Enter value for min_avg_lio: 500000
Enter value for offloaded:
SQL_ID EXECS AVG_ETIME OFFLOAD IO_SAVED_% SQL_TEXT
----------------------------------------------------------------------------
0bvt5z48t18by 1 .10 Yes 100.00 select count(*) from skew3 whe
0jytfr1y0jdr1 1 .09 Yes 100.00 select count(*) from skew3 whe
12pkwt8zjdhbx 1 .09 Yes 100.00 select count(*) from skew3 whe
2zbt555tg123s 2 4.37 Yes 71.85 select /*+ parallel (a 8) */ a
412n404yughsy 1 .09 Yes 100.00 select count(*) from skew3 whe
5zruc4v6y32f9 5 51.13 No .00 DECLARE job BINARY_INTEGER :=
6dx247rvykr72 1 .10 Yes 100.00 select count(*) from skew3 whe
6uutdmqr72smc 2 32.83 Yes 71.85 select /* avgskew3.sql */ avg(
7y09dtyuc4dbh 1 2.87 Yes 71.83 select avg(pk_col) from kso.sk
b6usrg82hwsa3 5 83.81 No .00 call dbms_stats.gather_databas
fvx3v0wpvxvwt 1 11.05 Yes 99.99 select count(*) from skew3 whe
gcq9a53z7szjt 1 .09 Yes 100.00 select count(*) from skew3 whe
gs35v5t21d9yf 1 8.02 Yes 99.99 select count(*) from skew3 whe
13 rows selected.
The fsxo.sql script provides the same limiting factors as the offload_percent.sql script, namely a minimum average elapsed time and a minimum average logical I/O. It also optionally allows you to limit the statements to only those that are offloaded or those that are not offloaded. Please refer to the scripts for further details and keep in mind that these techniques can also be applied to the data recorded by AWR for a historical perspective.
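If you do not have the scripts at hand, the core of the calculation can be sketched directly against V$SQL. The IO_CELL_OFFLOAD_ELIGIBLE_BYTES column is greater than zero only for statements that were offloaded; the 500,000 logical-I/O threshold mirrors the example above. Note that this is a simplified approximation of what the script does, not the script itself:

SQL> select count(*) total,
  2         sum(case when io_cell_offload_eligible_bytes > 0 then 1 else 0 end) offloaded,
  3         round(100 * sum(case when io_cell_offload_eligible_bytes > 0 then 1 else 0 end)
  4               / count(*), 2) "OFFLOADED_%"
  5    from v$sql
  6   where buffer_gets / greatest(executions, 1) > 500000;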
In the next section, we will discuss some issues that can complicate your efforts to enable Smart Scans.
Things That Can Cripple Smart Scans
There are several common coding “techniques” that either disable Smart Scans completely or cause them to be much less effective than they could be. Some of these techniques are just bad practices, regardless of whether you are on the Exadata platform or not. Others do not carry as significant a penalty on non-Exadata platforms but, when run on Exadata, can prevent the storage software from doing all that it could do. This is a common problem when development takes place on a non-Exadata platform: the techniques used to make an application perform well on a non-Exadata system do not necessarily lend themselves to the best performance on Exadata, and additional work is needed to optimize the code once it is on Exadata.
Many of these issues have been discussed throughout this book. Here are a few that you should keep in mind because of the fundamentally different behavior on the Exadata platform.
Functions
Oracle provides a large set of functions that can be applied directly in SQL statements. As discussed in Chapter 2, not all of those functions are offloadable. Knowing which functions are not offloadable is important because using them in WHERE clauses disables predicate filtering that might otherwise provide a massive reduction in the amount of data to be transferred back to the database layer. Obviously, custom-written PL/SQL functions fall into the category of “non-offloadable” functions as well.
This issue is somewhat counterintuitive since you are often doing full table scans anyway with data warehouse systems. On non-Exadata platforms, applying a function in a WHERE clause of a statement that is executed via a full table scan does not impose much of a penalty with regard to the amount of data that must be returned because the database must already return all blocks from the table to the database server. With Exadata, though, applying a function that can disable predicate filtering can impose a huge performance penalty. By the way, using custom PL/SQL functions in a WHERE clause is generally also a bad idea on non-Exadata platforms because additional CPU will be required to process PL/SQL for each row, as opposed to the optimized functions, based on C code, provided by Oracle.
Note You can query V$SQLFN_METADATA to see which functions are offloadable.
Furthermore, “offloadable” functions can also impose large performance penalties. Here is a very simple example showing the negative effect of an offloadable function in a WHERE clause:
SQL> select /* example001 */ count(*) from SALES where prod_id < 1;
COUNT(*)
----------
0
Elapsed: 00:00:00.75
SQL> select /* example001 */ count(*) from SALES where abs(prod_id) < 1;
COUNT(*)
----------
0
Elapsed: 00:00:34.26
SQL> @fsx4
Enter value for sql_text: %example001%
Enter value for sql_id:
SQL_ID CHILD OFFLOAD IO_SAVED_% AVG_ETIME SQL_TEXT
------------- ------ ------- ---------- ---------- ----------------------------------------
7ttw461bngzn0 0 Yes 100.00 .75 select /* example001 */ count(*) from SA
fktc9145xy6qg 0 Yes 99.98 33.81 select /* example001 */ count(*) from SA
Elapsed: 00:00:00.08
SQL> select name, offloadable from v$sqlfn_metadata
2 where name = 'ABS';
NAME OFF
------------------------------ ---
ABS YES
ABS YES
ABS YES
3 rows selected.
ABS() is an offloadable function, yet when used in the WHERE clause of this particular statement, the result was a large degradation in performance. If you have been following this example on your own environment, you may already have a pretty good idea why. Here is the solution:
SQL> select name,value from v$statname natural join v$mystat
2 where name like '%storage index%';
NAME VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index 0
SQL> set timing on
SQL> select /* example001 */ count(*) from SALES where abs(prod_id) < 1;
COUNT(*)
----------
0
Elapsed: 00:00:33.90
SQL> select name,value from v$statname natural join v$mystat where name like '%storage index%';
NAME VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index 0
Elapsed: 00:00:00.00
SQL> select /* example001 */ count(*) from SALES where prod_id < 1;
COUNT(*)
----------
0
Elapsed: 00:00:00.76
SQL> select name,value from v$statname natural join v$mystat where name like '%storage index%';
NAME VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index 1.4517E+11
Elapsed: 00:00:00.00
Storage indexes are disabled by functions, just like regular indexes. This is not too surprising, but, again, it can easily be missed. When we see a full scan, we have trained ourselves to not worry about functions in the WHERE clause that could disable indexes. Exadata is different.
Chained Rows
This is a very broad generalization, but basically any Oracle processing that requires reading an extra block to complete a row causes the Exadata storage software to revert to block shipping, or passthrough, mode. You read about this in several places in the previous chapters; Chapter 11 specifically provides most of the details. A simple example is a chained row, but other situations can also cause Oracle to revert to passthrough mode. What this means in practice is that some operations that cause only slight delays on non-Exadata platforms can impact performance much more severely on Exadata. The primary diagnostic symptom of this issue is the presence of many single-block-read wait events in combination with cell Smart Scan wait events. In such situations, you may find that you are better off not using offloading for the statements in question as an immediate remedy before addressing the problem more thoroughly. Here is an example showing where Oracle spends its time when selecting from a table with chained rows during the first execution. The example is specifically engineered to exaggerate the problem and make it reproducible, as the dictionary information about the table shows.
SQL> select num_rows,chain_cnt,avg_row_len from tabs where table_name = 'CHAINS';
NUM_ROWS CHAIN_CNT AVG_ROW_LEN
---------- ---------- -----------
600000 600000 20005
1 row selected.
SQL> select segment_name,partition_name,round(bytes/power(1024,3),2) gb
2 from user_segments
3 where segment_name = 'CHAINS';
SEGMENT_NAME PARTITION_NAME GB
------------------------------ ------------------------------ ----------
CHAINS 12.75
SQL> alter system flush buffer_cache;
System altered.
SQL> select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains;
AVG(LENGTH(COL2))
-----------------
             3990
1 row selected.
An investigation into where the execution time is spent could see the following commands being used:
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------
SQL_ID 6xpsmzknmkutw, child number 0
-------------------------------------
select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains
Plan hash value: 1270987893
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 450K(100)| |
| 1 | SORT AGGREGATE | | 1 | 3990 | | |
| 2 | TABLE ACCESS STORAGE FULL| CHAINS | 600K| 2283M| 450K (1)| 00:00:18 |
-------------------------------------------------------------------------------------
Please note that the above output shows the optimizer’s estimates for time and number of rows returned by the query. If you want the actual statistics, either use SQL Monitor (provided you have the license for it) or pass ALLSTATS LAST as the format parameter to DBMS_XPLAN.DISPLAY_CURSOR. Other useful tools include session snapper and the fsx family of scripts mentioned in Chapter 2. Querying Active Session History (which also requires a license) can provide interesting insights into what is currently happening, but make sure you filter appropriately. Ultimately, a trace will reveal every single wait that occurred. After processing the raw trace file, the following information was gathered. The actual processing time is prolonged by the large number of entries in the trace file:
SQL ID: 6xpsmzknmkutw Plan Hash: 1270987893
select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 95.91 497.32 2321346 2321394 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 95.91 497.32 2321346 2321394 0 1
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 65
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
1 1 1 SORT AGGREGATE (cr=2321394 pr=2321346 pw=0 time=497324130 us)
600000 600000 600000 TABLE ACCESS STORAGE FULL CHAINS (cr=2321394 pr=2321346 pw=0
time=484769611 us cost=450678 size=2394000000 card=600000)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
row cache lock 5 0.00 0.00
Disk file operations I/O 2 0.00 0.00
SQL*Net message to client 2 0.00 0.00
cell single block physical read 660743 0.42 448.04
enq: KO - fast object checkpoint 3 0.00 0.00
reliable message 1 0.00 0.00
cell smart table scan 1681 0.00 0.45
latch: redo allocation 1 0.00 0.00
SQL*Net message from client 2 16.25 16.25
********************************************************************************
In this example, a query was executed against a table in which every row was chained (again a bit of an extreme, but a 100%-reproducible, example). Before the main query was started, a 10046/SQL trace was enabled, which ran throughout the execution of the statement. You saw the output of tkprof on the trace file, which produced the list of wait events just shown. Notice that the statement had an elapsed time of about 8 minutes (497 seconds), of which the vast majority was spent doing single-block physical reads. The statement used a full table scan and was clearly offloaded in parts, as indicated by the cell smart table scan wait events, but it spent most of its time doing single-block reads. Of course, the single-block reads were a result of the chained rows. The next listing shows the difference between running the query with offloading and without offloading. First, the example with all Exadata features disabled:
SQL> alter session set "_serial_direct_read"=always;
Session altered.
Elapsed: 00:00:00.01
SQL> alter session set cell_offload_processing=false;
Session altered.
Elapsed: 00:00:00.00
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:01.99
SQL> select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains;
AVG(LENGTH(COL2))
-----------------
3990
Elapsed: 00:00:50.10
As you can see, with all offloading disabled but direct path reads still in place, the execution time is 50 seconds. The rows are still chained; nothing has happened to the table. In the next example, direct path reads are disabled as well:
SQL> alter session set "_serial_direct_read"=never;
Session altered.
Elapsed: 00:00:00.01
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.14
SQL> select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains;
AVG(LENGTH(COL2))
-----------------
3990
Elapsed: 00:00:28.33
And, finally, everything back to default—cell offloading enabled and Smart Scans:
SQL> alter session set "_serial_direct_read"=always;
Session altered.
Elapsed: 00:00:00.00
SQL> alter session set cell_offload_processing=true;
Session altered.
Elapsed: 00:00:00.00
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.08
SQL> select /*+ gather_plan_statistics monitor */ avg(length(col2)) from chains;
AVG(LENGTH(COL2))
-----------------
3990
Elapsed: 00:07:02.29
Ouch—the last example clearly showed severe performance degradation. However, it is very important at this stage to point out that this is the most extreme example, as every row was chained. In fact, many of the records spanned three blocks. In this extreme case, it is clear that avoiding offloading actually improves the performance significantly. Of course, eliminating the chained rows where possible would be the best solution.
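To find out whether chained or migrated rows are the culprit in your own tables, the classic approach still applies. The sketch below assumes the CHAINED_ROWS table has been created with the utlchain.sql script shipped in $ORACLE_HOME/rdbms/admin:

SQL> analyze table chains list chained rows into chained_rows;

Table analyzed.

SQL> select count(*) from chained_rows where table_name = 'CHAINS';

Migrated rows (rows relocated because an update no longer fit in the original block) can usually be repaired with ALTER TABLE ... MOVE followed by a rebuild of the affected indexes. Truly chained rows, like the 20,000-byte rows in this example, are longer than the block itself, so the only real cures are a larger block size or a change to the table design.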
Very Busy Cells
When a storage cell becomes very busy with CPU operations, it is possible for the storage software to begin refusing to perform offload activities on some portion of the requests. That is, if a particular cell becomes extremely busy and is getting offloadable requests from a database server that is not particularly busy, it may decide to send some data back to the database server that has not been processed or has only partially been processed. In some cases, column projection may be done but not filtering; in other cases, cellsrv may revert to shipping entire blocks back to the database layer. While this issue does not result in a complete shutdown of offloading, it can reduce the amount of work that is done on the storage tier.
This is an extremely complex topic, and it is difficult to observe the behavior directly. The goal of the feature is to utilize available CPU resources, regardless of whether they are on the database tier or the storage tier. This behavior was introduced in cellsrv version 11.2.2.3.1 with databases running 11.2.0.2 with bundle patch 6 or later. There is a statistic called cell physical IO bytes pushed back due to excessive CPU on cell in 11.2.0.2 with BP 6 and later that shows this is happening. Note that the statistic name later changed to cell physical IO bytes sent directly to DB node to balance CPU. The feature is designed to improve throughput on very busy systems, but it may also cause some degree of instability in the performance of certain statements. It is possible to disable this feature if your cellsrv is erroneously deciding it is too busy to take on additional work; however, in general, if you observe this behavior, you are probably getting close to the limits of the system. Adding additional resources at the storage layer (more storage cells) may be a viable option if reducing the load created is not possible.
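You can check whether this pushback is occurring on your system by querying the statistic mentioned above; depending on your version, one of the two names will be present:

SQL> select name, value
  2    from v$sysstat
  3   where name in ('cell physical IO bytes pushed back due to excessive CPU on cell',
  4                  'cell physical IO bytes sent directly to DB node to balance CPU');

A steadily growing value indicates the cells are regularly deciding they are too busy to complete offload processing, which is usually a capacity signal rather than a tuning problem.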
Hinted Code
Hints are very useful for coercing the optimizer to do what you want it to do. Unfortunately, hints are not well documented and even less well understood. In many cases, hints are used to resolve a problem that is caused by some misconfiguration of the database (and are then forgotten during an upgrade, potentially causing problems with the new major release). Their intended purpose is to allow humans to help the optimizer make the right choices in situations where it just cannot do the job effectively (or consistently) on its own. This happens in situations where the optimizer is just not smart enough (yet) to arrive at the best execution plan on its own. However, even when hints are used appropriately and generate the expected behavior, they can prevent Exadata from taking advantage of some of its built-in abilities. When migrating to Exadata, the best approach is to allow ample time for testing. If your application makes use of hints, one of the important steps in the test plan should be to test its behavior without the hints. This can easily be accomplished by setting the hidden parameter _optimizer_ignore_hints to true in a session for the upgrade/regression test. Only after the unnecessary hints have been weeded out should the system go into production on Exadata. Ultimately, it is far easier to remove hints while migration testing is performed than after the system is live. You simply have more resources and time dedicated to testing!
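A test run without hints can be as simple as the following. Because _optimizer_ignore_hints is a hidden parameter, it must be quoted, and you should set it only in a test environment:

SQL> alter session set "_optimizer_ignore_hints" = true;

Session altered.

Then re-run the representative workload and compare execution plans and elapsed times against the hinted baseline to identify which hints are still earning their keep.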
This may seem like a strange topic, but indexes can work against Smart Scans as well. The optimizer will try to use indexes if they are available. In a pure data warehouse environment, analytic indexes may not be necessary at all. You are about to read more about indexes in the next section on mixed workloads, but it is important to understand that index usage often means that offloading is less likely to occur.
It is amazing how often we see row-at-a-time processing in very large data sets. This type of coding is rarely a good idea on non-Exadata platforms, and it is definitely not a good idea on Exadata. In fact, the differences in performance can be even more dramatic on Exadata because of the optimizations built into the platform.
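To illustrate the difference, here is a hedged sketch contrasting the two styles; the table and column names (staging_tab, target_tab, id, amount) are made up for illustration:

SQL> -- row-at-a-time: one statement execution per row
SQL> begin
  2    for r in (select id, amount from staging_tab) loop
  3      update target_tab t set t.amount = r.amount where t.id = r.id;
  4    end loop;
  5  end;
  6  /

SQL> -- set-based: a single merge the database can offload and parallelize
SQL> merge into target_tab t
  2  using staging_tab s
  3  on (t.id = s.id)
  4  when matched then update set t.amount = s.amount;

The set-based version gives Exadata a chance to use Smart Scans and Bloom filters on the join, which the row-by-row loop never can.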
Other Things to Keep in Mind
There are a few other things that you should keep in mind when working with DW systems: the use of Exadata Smart Flash Cache, compression, and partitioning.
Exadata Smart Flash Cache: To Keep or Not to Keep
Exadata Smart Flash Cache (ESFC) is thought of primarily as benefiting latency-sensitive SQL statements by satisfying single-block reads from the faster Flash devices. But what is good for single-block I/O need not be bad for multi-block I/O: Scanning from Flash can dramatically improve scan performance as well. Up until Exadata version 11.2.3.3.0, Smart Scans ignored ESFC and by default scanned from hard disk only. In version 11.2.3.3.0 and later, Smart Scans can and will transparently cache data in the Flash Cache, too. That feature had a radical effect on the way scans are performed on the platform; from that release onward, scan performance has improved by orders of magnitude. Transparently storing entire segments, or parts thereof, in ESFC benefits Smart Scans as well as non-offloaded single- and multi-block reads. The new behavior largely eliminates the need to fine-tune which objects may use Flash Cache via the segment's storage clause.
If you want to demonstrate the usefulness of ESFC for table scans in current Exadata versions, you have to reverse the situation from before 11.2.3.3.0: Instead of pinning segments to Flash Cache so that Smart Scans can benefit from it, you specifically have to forbid segments from using it. The following two tables are 100% identical, with the exception that table T1_NOCOMPRESS_NOESFC is not allowed to benefit from ESFC.
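The restriction was put in place via the segment's storage clause; a sketch of the command used (assuming the table already exists):

SQL> alter table T1_NOCOMPRESS_NOESFC storage (cell_flash_cache none);

Table altered.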
SQL> select table_name, num_rows, compression, partitioned, cell_flash_cache
2 from tabs where table_name like 'T1_NOCOMPRESS%';
TABLE_NAME NUM_ROWS COMPRESS PAR CELL_FL
------------------------------ ---------- -------- --- -------
T1_NOCOMPRESS_NOESFC 10000000 DISABLED NO NONE
T1_NOCOMPRESS 10000000 DISABLED NO DEFAULT
2 rows selected.
SQL> select segment_name, bytes/power(1024,2) m, blocks
2 from user_segments where segment_name like 'T1_NOCOMPRESS%';
SEGMENT_NAME M BLOCKS
------------------------------ ---------- ----------
T1_NOCOMPRESS 13056 1671168
T1_NOCOMPRESS_NOESFC 13056 1671168
2 rows selected.
In preparation for the demonstration, a number of full scans were executed against T1_NOCOMPRESS to ensure that the table contents are largely found in Flash Cache. Remember from Chapter 5 that you can use cellcli's list flashcachecontent command to see what is actually cached on a cell. A Smart Scan against T1_NOCOMPRESS sets the baseline for the comparison:
SQL> select count(*) from T1_NOCOMPRESS;
COUNT(*)
----------
10000000
Elapsed: 00:00:01.24
The query has deliberately been chosen to not feature a WHERE clause to rule out any optimization by storage indexes. There are no indexes involved either, as you can see from the execution plan:
SQL> select * from table(dbms_xplan.display_cursor)
2 /
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
SQL_ID 38ttfy95yg3sd, child number 0
-------------------------------------
select count(*) from T1_NOCOMPRESS
Plan hash value: 3825536868
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | 452K(100)| |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS STORAGE FULL| T1_NOCOMPRESS | 10M| 452K (1)| 00:00:18 |
------------------------------------------------------------------------------------
14 rows selected.
Digging into the execution statistics using session snapper or mystats.sql, you can see that lots of the read requests are optimized. The output below is taken from mystats; statistics not necessary for this discussion have been removed:
------------------------------------------------------------------------------------------
2. Statistics Report
------------------------------------------------------------------------------------------
Type Statistic Name Value
------ ---------------------------------------------------------------- ----------------
STAT cell IO uncompressed bytes 13,653,352,448
STAT cell blocks helped by minscn optimization 1,666,679
STAT cell flash cache read hits 11,576
STAT cell physical IO bytes eligible for predicate offload 13,653,336,064
STAT cell physical IO interconnect bytes returned by smart scan 269,067,552
STAT cell scans 1
STAT physical read IO requests 13,048
STAT physical read bytes 13,653,336,064
STAT physical read requests optimized 11,576
STAT physical read total IO requests 13,048
STAT physical read total bytes 13,653,336,064
STAT physical read total bytes optimized 12,111,839,232
STAT physical read total multi block requests 13,037
STAT physical reads 1,666,667
STAT physical reads direct 1,666,667
------------------------------------------------------------------------------------------
3. About
------------------------------------------------------------------------------------------
- MyStats v2.01 by Adrian Billington (http://www.oracle-developer.net)
- Based on the SNAP_MY_STATS utility by Jonathan Lewis
You can see that of the roughly 13GB read in 13,048 I/O requests, a large portion, namely 11,576 requests, was satisfied via Flash Cache. You can put this into perspective by comparing physical read total bytes optimized to physical read total bytes. And, by the way, once the data is in Flash Cache, traditional single-block and multi-block reads benefit from the fact that the segment is on faster storage, too, at no extra penalty.
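If you have access to the cells, you can cross-check the caching with CellCLI, as mentioned earlier. A sketch, where the object number would first have to be looked up in DBA_OBJECTS, and the value shown here is a placeholder:

CellCLI> list flashcachecontent where objectNumber = 12345 -
       > attributes objectNumber, cachedSize, hitCount, missCount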
On the other hand, if a table cannot benefit from ESFC such as T1_NOCOMPRESS_NOESFC, things look a little different:
SQL> select count(*) from T1_NOCOMPRESS_NOESFC;
COUNT(*)
----------
10000000
Elapsed: 00:00:11.27
Repeated executions of the statement do not have any effect: There will not be any caching in Flash Cache, as this operation is administratively prohibited. The execution plan is identical to the first one shown:
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
SQL_ID 9gdnw7yk14mpw, child number 0
-------------------------------------
select count(*) from T1_NOCOMPRESS_NOESFC
Plan hash value: 4286875364
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | 452K(100)| |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS STORAGE FULL| T1_NOCOMPRESS_NOESFC | 10M| 452K (1)| 00:00:18 |
-------------------------------------------------------------------------------------------
14 rows selected.
Interestingly, the optimizer assumes the same elapsed time for the scan, namely 18 seconds. The difference in the execution time can be found in the execution statistics:
------------------------------------------------------------------------------------------
2. Statistics Report
------------------------------------------------------------------------------------------
Type Statistic Name Value
------ ---------------------------------------------------------------- ----------------
STAT cell IO uncompressed bytes 13,653,336,064
STAT cell physical IO bytes eligible for predicate offload 13,653,336,064
STAT cell physical IO interconnect bytes returned by smart scan 269,067,184
STAT cell scans 1
STAT physical read IO requests 13,046
STAT physical read bytes 13,653,336,064
STAT physical read total IO requests 13,046
STAT physical read total bytes 13,653,336,064
STAT physical read total multi block requests 13,037
STAT physical reads 1,666,667
STAT physical reads direct 1,666,667
------------------------------------------------------------------------------------------
3. About
------------------------------------------------------------------------------------------
- MyStats v2.01 by Adrian Billington (http://www.oracle-developer.net)
- Based on the SNAP_MY_STATS utility by Jonathan Lewis
The effect of Flash Cache is also visible in the I/O-related statistics in V$SQL, where you can query physical_read_requests and optimized_phy_read_requests. The output is rearranged for better readability:
SQL> @fsx4.sql
Enter value for sql_text: %esfc_example%
Enter value for sql_id:
SQL_ID CHILD OFFLOAD IO_SAVED_% AVG_ETIME SQL_TEXT
------------- ------ ------- ---------- ---------- --------------------------------------- ...
aqmusjaqj5yy6 0 Yes 98.03 9.23 select /* esfc_example */ count(*) from ...
g85ux15kbh9hr 0 Yes 98.03 1.23 select /* esfc_example */ count(*) from ...
... SQL_ID PHYSICAL_READ_REQUESTS OPTIMIZED_PHY_READ_REQUESTS
... ------------- ---------------------- ---------------------------
... aqmusjaqj5yy6 13046 0
... g85ux15kbh9hr 13046 11576
2 rows selected.
Thankfully, a lot of the complexity around pinning segments to Flash Cache has been resolved with the introduction of Exadata 11.2.3.3. And you do not even have to spend too much time thinking about it. In the output of the query against user_tables shown earlier, you will notice that the values of cell_flash_cache are NONE and DEFAULT; neither of them is set to KEEP. For the transparent caching of data for Smart Scans alone, it was worth upgrading to 11.2.3.3.
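If you are on an older Exadata version, or want to override the default behavior for a truly critical segment, the storage clause still accepts the KEEP attribute; a sketch:

SQL> alter table t1_nocompress storage (cell_flash_cache keep);

Table altered.

In current versions, this should be the exception rather than the rule.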
Compression
Exadata’s Hybrid Columnar Compression (HCC) is a big step forward in the ability to reduce the size of data stored inside Oracle databases. The compression ratios that are achievable with HCC turn the traditional concept of information life-cycle management on its head. HCC makes it practical to consider using compression instead of tiered storage or archiving and purging strategies. Because partitions of a table can be defined to use different compression methods, the combination of partitioning and compression can provide a much more robust solution for “archiving” data than actually purging it from the database.
You should remember, though, that HCC is not appropriate for data that is subject to DML. A better approach is to partition the data such that HCC can be applied only to data that is no longer being changed. This leads us to the next topic: partitioning.
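As a sketch of that approach, older partitions of a hypothetical, range-partitioned SALES table could be compressed more aggressively than recent ones; the table and partition names are made up:

SQL> alter table sales modify partition sales_q1_2013 compress for archive high;

Table altered.

SQL> alter table sales move partition sales_q1_2013;

Table altered.

Recent partitions that still receive DML would be left uncompressed or use a DML-friendly compression method instead.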
Partitioning
Partitioning has been and still is a key component of data warehousing systems. The optimizations provided by Exadata do not alleviate the need for a well thought-out partitioning strategy. Date-based strategies are very useful from a management standpoint, and being able to use more aggressive compression on older data is often a good approach. But partition elimination is still a technique that you will want to use. And, of course, storage indexes work well with partitioning, providing behavior comparable to partition elimination on additional columns.
You should keep in mind that the sizes of partitions can affect Oracle’s decision to use Smart Scans. When performing a serial scan on a partitioned object, the decision to do direct path reads is based on the individual segment (table, partition, subpartition) size, not the overall size of the object. This can result in situations where scans of some partitions are offloaded, while scans of others are not. This is particularly relevant for colder partitions that have been compressed. Consider a table created with a number of random dates, range partitioned by date using monthly intervals:
SQL> select segment_name, partition_name, bytes/power(1024,2) m
2 from user_segments where segment_name= 'SMARTSCANHCC';
SEGMENT_NAME PARTITION_NAME M
------------------------------ ------------------------------ ----------
SMARTSCANHCC SYS_P9676 944
SMARTSCANHCC SYS_P9677 968
SMARTSCANHCC SYS_P9678 384
3 rows selected.
SQL> select partition_name, high_value from user_tab_partitions
2 where table_name = 'SMARTSCANHCC'
PARTITION_NAME HIGH_VALUE
------------------------------ --------------------------------------------------
P_START TO_DATE(' 1995-01-01 00:00:00', 'SYYYY-MM-DD HH24:
SYS_P9676 TO_DATE(' 2014-05-01 00:00:00', 'SYYYY-MM-DD HH24:
SYS_P9677 TO_DATE(' 2014-09-01 00:00:00', 'SYYYY-MM-DD HH24:
SYS_P9678 TO_DATE(' 2014-10-01 00:00:00', 'SYYYY-MM-DD HH24:
SQL> select partition_name, last_analyzed, num_rows from user_tab_partitions
2 where table_name = 'SMARTSCANHCC';
PARTITION_NAME LAST_ANALYZED NUM_ROWS
------------------------------ ------------------- ----------
P_START 2015-03-15:16:31:52 0
SYS_P9676 2015-03-15:16:31:55 837426
SYS_P9677 2015-03-15:16:31:58 863109
SYS_P9678 2015-03-15:16:31:59 335539
4 rows selected.
The table is interval-partitioned; partition P_START is empty and, thanks to deferred segment creation, has not even been created. The size of the partitions enables Smart Scans. The SQL Monitor report has been cut down to fit the page, and only relevant information is shown:
SQL Monitoring Report
SQL Text
------------------------------
select /*+ monitor gather_plan_statistics */ count(*) from smartscanhcc partition (SYS_P9676)
Global Information
------------------------------
Status : DONE (ALL ROWS)
Instance ID : 1
Session : MARTIN (591:27854)
SQL ID : 76yr0u2rhkqq8
SQL Execution ID : 16777217
Execution Started : 03/15/2015 16:32:52
First Refresh Time : 03/15/2015 16:32:52
Last Refresh Time : 03/15/2015 16:32:53
Duration : 1s
Module/Action : SQL*Plus/-
Service : SYS$USERS
Program : [email protected] (TNS V1-V3)
Fetch Calls : 1
Global Stats
========================================================================================
| Elapsed | Cpu | IO | Application | Fetch | Buffer | Read | Read | Cell |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | Offload |
========================================================================================
| 0.33 | 0.13 | 0.21 | 0.00 | 1 | 120K | 940 | 935MB | 97.64% |
========================================================================================
Imagine next that the partition undergoes maintenance and is HCC compressed:
SQL> alter table smartscanhcc modify partition SYS_P9676 column store compress for query high;
Table altered.
SQL> alter table smartscanhcc move partition SYS_P9676;
Table altered.
SQL> select segment_name, partition_name, bytes/power(1024,2) m
2 from user_segments where segment_name= 'SMARTSCANHCC';
SEGMENT_NAME PARTITION_NAME M
------------------------------ ------------------------------ ----------
SMARTSCANHCC SYS_P9677 968
SMARTSCANHCC SYS_P9678 384
SMARTSCANHCC SYS_P9676 16
3 rows selected.
As expected, the compressed size of the partition is smaller than it was before. If a user now executes a query against the segment, it is most likely not Smart Scanned. And, indeed, this can be confirmed, for example, by using a SQL Monitor report (alternatively, if you do not have the license to use it, you can query V$SQL):
SQL Monitoring Report
SQL Text
------------------------------
select /*+ monitor gather_plan_statistics */ count(*) from smartscanhcc partition (SYS_P9676)
Global Information
------------------------------
Status : DONE (ALL ROWS)
Instance ID : 1
Session : MARTIN (1043:51611)
SQL ID : 76yr0u2rhkqq8
SQL Execution ID : 16777216
Execution Started : 06/09/2015 11:02:38
First Refresh Time : 06/09/2015 11:02:38
Last Refresh Time : 06/09/2015 11:02:38
Duration : .123527s
Module/Action : SQL*Plus/-
Service : SYS$USERS
Program : [email protected] (TNS V1-V3)
Fetch Calls : 1
Global Stats
======================================================================================
| Elapsed | Cpu | IO | Cluster | Other | Fetch | Buffer | Read | Read |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes |
======================================================================================
| 0.12 | 0.02 | 0.09 | 0.00 | 0.01 | 1 | 1245 | 11 | 10MB |
======================================================================================
The missing column about cell offload efficiency is an indicator of a traditional read and the absence of a Smart Scan. When partitions are small, the absence of a Smart Scan often does not matter: In the above example, the data set was reduced from 944MB to 16MB, and, as the elapsed time shows, such small segments can be read very quickly.
Mixed Workloads
There is a third type of system that is a combination of the other two. In fact, one could argue that the pure form of the other two (OLTP and DW) rarely exists in the real world. Many systems do not fall neatly into the two main categories already described; most display characteristics of both. Consider, for example, an “OLTP” system performing short, discrete transactions during the day with a lot of reporting during the evening hours. Or take the data warehouse where you run lots of reports but have a scheduled ELT (Extraction, Load, Transformation) process that makes generous use of the merge clause, which of course requires a lookup.
Combining long-running, throughput-sensitive queries with fast, latency-sensitive statements definitely introduces some additional issues that must be dealt with. One of the main issues in systems of this type is how to deal with indexes.
One of the biggest debates we have had during Exadata implementations is whether or not to drop indexes. The problem has been exacerbated somewhat by claims that you do not need any indexes on Exadata. Access paths that use indexes are generally not able to take advantage of Exadata-specific optimizations. Yes, you read that right: generally, because offloading can occur in cases where the optimizer chooses to execute a fast full scan on an index, but this is not the most common usage pattern for indexes. The more common pattern is to use them for retrieving relatively few records from a table via an index range scan, and this operation is not currently offloadable. Generally speaking, you will want to use index range scans on selective predicates. However, since Exadata is so effective at scanning disks, in many cases the index-based access paths are no longer faster than the scan-based operations. How frequently a query is executed also starts to play an important role. If you are scanning a multi-million-row table in a couple of seconds, then this is undoubtedly fast. However, if you need to do this 10,000 times during a merge, an index on the lookup might speed things up. It is really a case of getting our bearings all over again with respect to when we want to use indexes and when we would expect a full scan to perform better.
One of the things we commonly heard when Exadata was first starting to appear at customer sites was that indexes were no longer necessary and that they should be dropped. For pure data warehouse workloads, this may actually be reasonably good advice for analytical indexes. However, we rarely see anything you could call a “pure data warehouse” workload. Most systems have a mix of access patterns, with one set of statements hoping for low latency and another set hoping for high throughput. In these cases, dropping all indexes just will not work. This is why this discussion was saved for this section. The problem with mixed workloads, where it is necessary to keep indexes for specific sets of statements, is that the optimizer is not as well equipped to choose between using and ignoring them as one might hope. However, there may be a way around this situation by making creative use of invisible indexes. This 11g feature allows you to hide indexes from the optimizer when it develops an execution plan. The indexes are still there and are still maintained, so you should review index usage: Not every index needs to be dropped, but, likewise, not every analytical index needs to be kept.
The following example is a relatively simple implementation of how to have indexes in the database but make only selective use of them. The dbm01 database has been modified, and two services have been created and started. DSSSRV, as the name implies, is a service users should use when performing decision support queries or those with high demand on throughput and less need for low latency. As you would imagine, the opposite is true for OLTPSRV. Users connecting through that service care a lot about low latency but less about bandwidth. A small PL/SQL procedure can be written to check which service a session used to connect and change the parameter optimizer_use_invisible_indexes. The index on table T1_WITH_INDEX is invisible:
SQL> select index_name, visibility
2 from user_indexes
3 where index_name = 'I_T1_WITH_INDEXES_1';
INDEX_NAME VISIBILIT
------------------------------ ---------
I_T1_WITH_INDEXES_1 INVISIBLE
The little procedure being used here just checks for the service name and changes the optimizer’s visibility of the index:
SQL> create procedure check_service is
2 begin
3 if lower(sys_context('userenv','service_name')) = 'dsssrv' then
4 execute immediate 'alter session set optimizer_use_invisible_indexes = false';
5 elsif lower(sys_context('userenv','service_name')) = 'oltpsrv' then
6 execute immediate 'alter session set optimizer_use_invisible_indexes = true';
7 end if;
8 end;
9 /
Procedure created.
The following execution plans show that the index is used based on the service the session connects through. The first example uses the OLTPSRV connection:
SQL> select sys_context('userenv','service_name') from dual;
SYS_CONTEXT('USERENV','SERVICE_NAME')
-----------------------------------------------------------------------------------------
oltpsrv
SQL> exec check_service
SQL> select /* oltpsrv */ count(*) from t1_with_index where id between 200 and 400;
COUNT(*)
----------
201
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------
SQL_ID 46auh11c8ddts, child number 0
-------------------------------------
select /* oltpsrv */ count(*) from t1_with_index where id between 200
and 400
Plan hash value: 2861271559
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3 (100)| |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | INDEX RANGE SCAN| I_T1_WITH_INDEXES_1 | 202 | 1212 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID">=200 AND "ID"<=400)
20 rows selected.
Notice the index-driven execution plan. When connecting via DSSSRV, the situation changes:
SQL> select sys_context('userenv','service_name') from dual;
SYS_CONTEXT('USERENV','SERVICE_NAME')
--------------------------------------------------------------------------------------------
dsssrv
Elapsed: 00:00:00.00
SQL> exec check_service
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01
SQL> select /* dsssrv */ count(*) from t1_with_index where id between 200 and 400;
COUNT(*)
----------
201
Elapsed: 00:00:00.56
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
SQL_ID 0ym2m0whwsn1y, child number 0
-------------------------------------
select /* dsssrv */ count(*) from t1_with_index where id between 200
and 400
Plan hash value: 1131101492
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 452K(100)| |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS STORAGE FULL| T1_WITH_INDEX | 202 | 1212 | 452K (1)| 00:00:18 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage(("ID"<=400 AND "ID">=200))
filter(("ID"<=400 AND "ID">=200))
21 rows selected.
Unlike in the previous example, you do not see any index in the execution plan. You can easily embed the little PL/SQL procedure in a logon trigger and control index usage that way. Further optimizer limitations in the context of Exadata are discussed in the next section.
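A hedged sketch of such a logon trigger, reusing the check_service procedure from above; the trigger name is made up, and in a real deployment you would want to restrict it to the relevant users and handle exceptions:

SQL> create or replace trigger set_index_visibility
  2  after logon on database
  3  begin
  4    check_service;
  5  end;
  6  /

Trigger created.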
The Optimizer Doesn’t Know
You have read several times in this book that the optimizer is not aware that it is running on Exadata. In general, the principles that guide the optimizer's decisions are sound, regardless of the storage platform. The fact that the code on the database tier is identical, regardless of whether it is running on Exadata or not, means that an application will behave similarly on Exadata in terms of plan selection. So you should not expect an application to experience a large number of plan changes caused simply by moving to Exadata, provided you remain on the same version and the same memory settings. Having the same software generate the execution plans helps a lot toward stability! The situation might be different if you are migrating to Exadata from a lower Oracle version, such as during an 11.2 to 12.1 migration, or from single instance to RAC, but you would expect similar changes outside the Exadata platform, too.
The downside is that the optimizer is not aware that Exadata has optimizations that can cause full scans to perform much better than on other platforms, apart from the EXADATA system statistics you will read about in the next section. So mixed-workload systems that have many indexes make the optimizer’s job more challenging. In fact, as you might expect, the optimizer will tend to pick index-oriented plans in preference to full scan-based plans in situations where indexes are available, despite the fact that the full scan-based plans are often much faster.
There are several ways to deal with the optimizer’s tendency to prefer index access over full table scans. System statistics, optimizer parameters, and hints all come to mind as potential solutions. You can read more about these in the following sections.
System Statistics
System statistics provide the optimizer with additional information about the “system,” including how long it takes to do a single-block read (typical of index lookups) and how long it takes to do a multi-block read (typical of full table scans). This may appear to be an ideal mechanism to manipulate the optimizer by giving it the additional information it needs to make the right decisions. Unfortunately, Smart Scans are not based on traditional multi-block reads and, in fact, Smart Scans can be orders of magnitude faster than multi-block reads. Hence, modifying System Statistics is probably not the best option in this case.
In fact, the question of whether or not to gather system statistics in WORKLOAD mode comes up quite often during discussions about Exadata deployments. For the reasons outlined above, it is probably not wise to gather them, as doing so potentially introduces plan regressions. Introducing WORKLOAD statistics can also have far-reaching effects.
Another alternative exists, however, for database versions 11.2.0.2 BP18 and 11.2.0.3 BP8 and newer, according to DOC ID 1274318.1: gathering statistics the Exadata way. The document on My Oracle Support specifically states that this is not a generic recommendation but should be assessed carefully. To enable Exadata system statistics, you can use the following command:
SQL> exec DBMS_STATS.GATHER_SYSTEM_STATS('EXADATA')
As a result of this call, the database engine is told that it can read more data in a single request on Exadata, thus lowering the cost of full scans. It does not prevent the optimizer from selecting an index, though. The change in the costing model is the reason why you should only introduce it after careful testing! The aforementioned note also points out that if the application is developed from the ground up on Exadata, the effect of gathering Exadata system statistics can be controlled more easily, and any adverse side effects can be caught in testing before going live. Whichever way you choose, careful testing is needed.
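You can inspect the currently stored system statistics before and after the change, and revert if regressions show up; a sketch:

SQL> select pname, pval1 from sys.aux_stats$ where sname = 'SYSSTATS_MAIN';

SQL> exec DBMS_STATS.DELETE_SYSTEM_STATS

Deleting the system statistics returns the optimizer to its default costing model.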
Optimizer Parameters
There are a couple of initialization parameters that can push the optimizer toward or away from index usage. The parameters OPTIMIZER_INDEX_CACHING and OPTIMIZER_INDEX_COST_ADJ can both be used for this purpose. While these are big knobs that can affect the core functionality of the optimizer, they were designed for the very purpose of making indexes more or less attractive to the optimizer. Using the parameters in a limited way, such as with an alter session command before running large batch processes, is a viable approach in some cases. These parameters can also be set at the statement level using the OPT_PARAM hint. Here is a very simple example:
SQL> show parameter optimizer_index
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
optimizer_index_caching integer 0
optimizer_index_cost_adj integer 100
SQL> select /*+ parallel(2) gather_plan_statistics monitor chap17-f */
2 count(*), a.state
3 from bigt a, t1_sml b
4 where a.id = b.id
5 and b.state = 'RARE'
6 group by a.state
7 /
no rows selected
Elapsed: 00:00:25.46
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
SQL_ID 4q6vqy2r1yn5w, child number 0
-------------------------------------
select /*+ parallel(2) gather_plan_statistics monitor chap17-f */
count(*), a.state from bigt a, t1_sml b where a.id = b.id and b.state =
'RARE' group by a.state
Plan hash value: 1484706486
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | PX COORDINATOR | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10003 | 1 | 32 |
| 3 | HASH GROUP BY | | 1 | 32 |
| 4 | PX RECEIVE | | 1 | 32 |
| 5 | PX SEND HASH | :TQ10002 | 1 | 32 |
| 6 | HASH GROUP BY | | 1 | 32 |
|* 7 | HASH JOIN | | 1 | 32 |
| 8 | JOIN FILTER CREATE | :BF0000 | 8 | 128 |
| 9 | PX RECEIVE | | 8 | 128 |
| 10 | PX SEND BROADCAST | :TQ10001 | 8 | 128 |
| 11 | TABLE ACCESS BY INDEX ROWID BATCHED| T1_SML | 8 | 128 |
| 12 | SORT CLUSTER BY ROWID | | 8 | |
| 13 | BUFFER SORT | | | |
| 14 | PX RECEIVE | | 8 | |
| 15 | PX SEND HASH (BLOCK ADDRESS) | :TQ10000 | 8 | |
| 16 | PX SELECTOR | | | |
|* 17 | INDEX RANGE SCAN | T1_SML_STATE | 8 | |
| 18 | JOIN FILTER USE | :BF0000 | 100M| 1525M|
| 19 | PX BLOCK ITERATOR | | 100M| 1525M|
|* 20 | TABLE ACCESS STORAGE FULL | BIGT | 100M| 1525M|
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("A"."ID"="B"."ID")
17 - access("B"."STATE"='RARE')
20 - storage(:Z>=:Z AND :Z<=:Z AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."ID"))
filter(SYS_OP_BLOOM_FILTER(:BF0000,"A"."ID"))
Note
-----
- dynamic statistics used: dynamic sampling (level=AUTO)
- Degree of Parallelism is 2 because of hint
SQL> alter session set optimizer_index_cost_adj=10000;
Session altered.
SQL> select /*+ parallel(2) gather_plan_statistics monitor chap17-f */
2 count(*), a.state
3 from bigt a, t1_sml b
4 where a.id = b.id
5 and b.state = 'RARE'
6 group by a.state
7 /
no rows selected
Elapsed: 00:00:15.40
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------
SQL_ID 4q6vqy2r1yn5w, child number 1
-------------------------------------
select /*+ parallel(2) gather_plan_statistics monitor chap17-f */
count(*), a.state from bigt a, t1_sml b where a.id = b.id and b.state =
'RARE' group by a.state
Plan hash value: 3199786897
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 2510K(100)|
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 32 | 2510K (1)|
| 3 | HASH GROUP BY | | 1 | 32 | 2510K (1)|
| 4 | PX RECEIVE | | 1 | 32 | 2510K (1)|
| 5 | PX SEND HASH | :TQ10000 | 1 | 32 | 2510K (1)|
| 6 | HASH GROUP BY | | 1 | 32 | 2510K (1)|
|* 7 | HASH JOIN | | 1 | 32 | 2510K (1)|
| 8 | JOIN FILTER CREATE | :BF0000 | 8 | 128 | 137 (4)|
|* 9 | TABLE ACCESS STORAGE FULL | T1_SML | 8 | 128 | 137 (4)|
| 10 | JOIN FILTER USE | :BF0000 | 100M| 1525M| 2510K (1)|
| 11 | PX BLOCK ITERATOR | | 100M| 1525M| 2510K (1)|
|* 12 | TABLE ACCESS STORAGE FULL| BIGT | 100M| 1525M| 2510K (1)|
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("A"."ID"="B"."ID")
9 - storage("B"."STATE"='RARE')
filter("B"."STATE"='RARE')
12 - storage(:Z>=:Z AND :Z<=:Z AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."ID"))
filter(SYS_OP_BLOOM_FILTER(:BF0000,"A"."ID"))
Note
-----
- dynamic statistics used: dynamic sampling (level=AUTO)
- Degree of Parallelism is 2 because of hint
In this simple example, pushing the optimizer away from indexes with the ALTER SESSION command caused it to pick a plan that was considerably faster. The plans show that the improvement in elapsed time came from doing a full table scan instead of using the index.
Hints
Of course, hints can also be used to help the optimizer make the right choices, but that is somewhat of a slippery slope, especially in the aforementioned mixed-workload scenarios. Nevertheless, telling Oracle that you would prefer a hash join or that it should ignore a specific index is an option. As mentioned in the previous section, the OPT_PARAM hint can also prove useful for setting initialization parameters that influence the optimizer's decisions. SQL patches can help by injecting hints into code outside of your control until a fix is available. Oracle 12c should reduce the need to use hints to influence join methods with the introduction of adaptive optimization.
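As a rough sketch, the techniques above might look like the following, reusing the BIGT and T1_SML tables from the earlier listing. The specific hints, the query block names, and the patch name are illustrative only, and the DBMS_SQLDIAG.CREATE_SQL_PATCH call shown is the Oracle 12.2-and-later form (earlier 12c releases exposed similar functionality through DBMS_SQLDIAG_INTERNAL), so check the packages reference for your release:

```sql
-- Prefer a hash join and a full scan of T1_SML, and discourage index
-- access via OPT_PARAM, all scoped to this one statement.
select /*+ full(b) use_hash(a b)
           opt_param('optimizer_index_cost_adj' 10000) */
       count(*), a.state
  from bigt a, t1_sml b
 where a.id = b.id
   and b.state = 'RARE'
 group by a.state;

-- Inject the same full-scan hint into a statement whose text we
-- cannot change. The patch name is made up for this example.
declare
  l_patch varchar2(128);
begin
  l_patch := dbms_sqldiag.create_sql_patch(
    sql_id    => '4q6vqy2r1yn5w',
    hint_text => 'FULL(@"SEL$1" "B"@"SEL$1")',
    name      => 'chap17_force_full_scan');
end;
/
```

The advantage of a SQL patch over editing the statement is that it survives in the data dictionary and applies whenever the matching SQL text is parsed, which is why it suits third-party code.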
Unfortunately, it is still a commonly held belief that Oracle databases cannot be configured to adequately handle both DW and OLTP workloads at the same time. And, in truth, keeping them on separate systems does make them easier to manage. The downside of this approach is that it is expensive: many companies dedicate the majority of their computing resources to moving data between platforms. The power of Exadata makes it tempting to combine these environments. Keep in mind that Exadata has additional capabilities for dividing resources among multiple databases that are not available on other platforms. The I/O Resource Manager (IORM) can prevent long-running DW queries from crippling latency-sensitive statements running on the same system. A good understanding of Oracle's resource management capabilities should change the way you think about what is possible in a mixed-workload or consolidated environment. Resource management is covered in depth in Chapter 7.
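For illustration, an interdatabase IORM plan is set on each storage cell through CellCLI. The database names OLTPPROD and DWPROD below are hypothetical, and the trailing hyphen is CellCLI's line-continuation character; the built-in OTHER directive catches any database not explicitly listed:

```
CellCLI> ALTER IORMPLAN                                        -
         dbplan=((name=OLTPPROD, level=1, allocation=80),      -
                 (name=DWPROD,   level=2, allocation=100),     -
                 (name=other,    level=3, allocation=100))
CellCLI> ALTER IORMPLAN active
CellCLI> LIST IORMPLAN DETAIL
```

With a level-based plan like this, the DW database only competes for I/O that the OLTP database is not using, so reporting queries cannot starve latency-sensitive transactions.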
Summary
Exadata is different from traditionally deployed Oracle databases. To make the best use of it, you will need to think differently. This does not mean you have to rewrite your application when moving it to Exadata, but the move is a good opportunity for a general review. Quite often in today's world, DBAs look after tens or hundreds of databases, and "know your data" becomes wishful thinking in such situations. A DBA might be assigned a problem ticket to close, and the sheer number of tickets often leaves no room for deeper root-cause analysis for as long as the system "ticks along" nicely.
When it is decided that a database will be migrated to Exadata, that decision frequently implies an upgrade to a newer Oracle version. Platform changes and database release changes are the most opportune moments to review a database environment for further performance gains. If it is possible, and you are not facing massive time constraints in moving to Exadata, we would encourage you not to stop working on the system after the migration has completed successfully, but to continue pushing the boundaries of what is possible. Exadata is very powerful when using Smart Scans, and you should harness that performance where you can and where it makes sense.