Chapter 12: Tuning for Performance and Efficiency
Understanding Performance Tuning
User-Specified Sorting (SORTPGM= System Options)
Reviewing CONTENTS Output and System Messages
Optimizing WHERE Clause Processing with Indexes
Constructing Efficient Logic Conditions
A book on PROC SQL would not be complete without some discussion of query optimization and performance. Enabling a query to run efficiently involves writing code that can take advantage of the PROC SQL query optimizer. Because PROC SQL is designed to handle a variety of processes while accommodating small to large database environments, this chapter presents a number of query tuning strategies, techniques, and options to help you to write more efficient PROC SQL code. In this chapter, you will find tips and suggestions to help identify areas where a query’s inefficiencies might exist and to conduct the tuning process to achieve the best performance possible.
Performance tuning is the process of improving the way a program operates. It involves taking a program and seeing what can be done to improve performance in an intelligent, controlled manner. As you might imagine, a tuned program is one that gets the most from the existing hardware and software environment.
Performance tuning involves measuring, evaluating, and modifying a program until it uses the minimum amount of computer resources to complete its execution. The biggest problem with the tuning process is that it is sometimes difficult to determine the amount of computer resources that a program uses. Complicating matters further, adequate and complete information about resource utilization is often unavailable. In fact, no simple formula exists to determine how efficiently a program runs. Often the only way to assess whether a program is running efficiently is to evaluate its performance under varying conditions, such as during interactive use or during shortages of specific resources including memory and storage.
Performance issues might be difficult to identify. It is possible to have a program that operates without any apparent problem, but does not perform as efficiently as it could. In fact, a program might perform well in one environment and poorly in another. Take, for example, an organization that has a shortage of Direct Access Storage Device (DASD). A program that uses excessive amounts of this resource might be deemed a poor performer under these circumstances. But if the same program were run in an environment that had adequate levels of DASD, it might not be suspected or tagged as a poor performer. This distinction demonstrates the subjectivity that is frequently used to determine how a program performs and how it is linked to the specific needs (related to resource issues) that an organization has at any point in time.
Sorting data in the SQL procedure, as in other parts of SAS, involves CPU and memory-intensive operations. When the table size is large, and CPU and/or memory resources are in short supply or simply not available, the number of sorts in programs must be minimized. You can minimize problems by understanding your query’s requirements and remembering a few simple guidelines.
Performance bottlenecks can occur if sorts are performed on disk as opposed to in memory because processing on disk is typically slower than processing in memory. The most logical and efficient place to perform a sort is in memory. If the sort requires more space than can fit in available memory, the sort must be performed on disk. The objective is to determine how much space a sort will require as well as where the sort will be performed before the sort is executed.
The performance of a sort is most influenced by the number of rows selected in a query. To reduce the row count, select specific parts of a table for processing rather than selecting the entire table. Another performance consideration is to reduce the number of columns that are specified in an ORDER BY clause. The technique of concatenating two or more columns as opposed to specifying each column individually can be used. Finally, sort performance can be influenced by the size of the columns that are specified in an ORDER BY clause. To reduce a column’s length, built-in functions such as SUBSTR and TRIM can be used.
You can control what sort utility SAS uses when performing sorts. By specifying the SORTPGM= system option, you can direct SAS to use the best possible sort utility for the environment in question. The SORTPGM= system options are displayed in Table 12.1.
Table 12.1: SORTPGM= System Options
Sort Option |
Purpose |
BEST |
The BEST option uses the sort utility that is best suited to the data. |
HOST |
The HOST option tells SAS to use the host sort utility that is available on your host computer. This option might be the most efficient for large tables that contain many rows of data. |
SAS |
The SAS option tells SAS to use the sort utility that is supplied with SAS. |
The next example illustrates how to use the SORTPGM= option to select the sort utility that is most suited to the data. Both options use the name that is specified in the SORTNAME= option.
OPTIONS SORTPGM=BEST;
<or>
OPTIONS SORTPGM=HOST;
Using the SELECT DISTINCT clause invokes an internal sort to remove duplicate rows. The single exception is when an index exists. If an index exists, then the index is used to eliminate the duplicate rows.
The results of a grouped query are automatically sorted using the grouping columns. When the SELECT clause contains only the columns that are listed in the GROUP BY clause along with any summary functions, then the duplicates in each group based on the grouping columns are removed as soon as any defined summary functions are performed. If additional columns then appear in the SELECT clause, the rows are not collapsed and therefore duplicates are not removed.
As with the ORDER BY clause, processes such as the GROUP BY clause can also impact the speed of a query. Because the GROUP BY clause can trigger sort processing, much of what was discussed with the ORDER BY clause also applies here. In addition, it is recommended that you avoid grouping redundant columns by keeping the number of grouping columns to a minimum.
Splitting tables involves moving some of the rows from one table to another table. Data is split for the purpose of separating some predetermined range of data, such as historical data from current data, so that query performance is improved. This reduces the burden imposed on queries that only access current data. The next example shows the current year’s data being copied and then removed from a table that contains five years of data.
PROC SQL;
CREATE TABLE INVENTORY_CURRENT AS
SELECT *
FROM INVENTORY
WHERE YEAR(ORDDATE) = YEAR(TODAY());
DELETE FROM INVENTORY
WHERE YEAR(ORDDATE) = YEAR(TODAY());
QUIT;
Indexes can be used to allow rapid access to table rows. Rather than physically sorting a table (as performed by the ORDER BY clause or PROC SORT), an index is designed to set up a logical arrangement for the data without the need to physically sort it. This has the advantage of reducing CPU, I/O, and memory requirements. It also reduces data access time when using WHERE clause processing (which is discussed in the “Optimizing WHERE Clause Processing with Indexes” section).
Indexes are useful, but they do have drawbacks. As data in a table is inserted, modified, or deleted, an index must be updated to address the changes. This automatic feature requires additional CPU resources to process any changes to a table. Also, as a separate structure in its own right, an index can consume considerable storage space. As a consequence, care should be exercised not to create too many indexes but to assign indexes to the most discriminating variables in a table. Here are a few suggestions for creating indexes:
● Sort data in ascending order on the key column prior to creating the index.
● Sort data by the key variable first to achieve the greatest performance improvement.
● Sort data in ascending order by the key variable before it is appended to the table.
● Create simple indexes, when possible, to be used by most queries.
● Avoid creating one single index for all queries.
● Assign indexes to the most discriminating of variables (see Table 6.1, Rules of Cardinality, in Chapter 6).
● Select columns that are frequently the subject of summary functions (COUNT, SUM, AVG, MIN, MAX, etc.).
● Only create indexes that are actually needed.
● Avoid taxing CPU resources that are associated with index maintenance (maintaining an index during inserts, modifications, and deletions) by selecting columns that do not change frequently.
● On some operating systems, indexes are stored as a separate file on disk, which uses additional memory and disk space to store the structure.
● To avoid excessive and unnecessary I/O operations, prior to creating an index, sort data in ascending order by the most discriminating key column.
● Attempt to define composite indexes by using the most discriminating of the variables as your first variable in the index.
● Select columns that do not have numerous null values because this results in a large percentage of rows with the same value.
Note: Indexes should only be created on tables where query search time needs to be optimized. Any unnecessary indexes might force SAS to expend resources needlessly—updating and reorganizing after insert, update, and delete operations are performed. And even worse, the SQL optimizer might accidentally use an index when it should not.
Note: When creating an index is impractical or not feasible, consider storing the data in ascending (or descending) order depending on whether the query needs to access data from the first or second half of the table. This often results in a more efficient query because once a WHERE-clause expression is satisfied, the query stops processing.
While no two organizations are alike, it is not surprising to find numerous causes for a program to run at less than peak efficiency. Performance is frequently affected by the specific needs of an organization or its lack of resources. SAS users need to learn as many techniques as possible to correct problems that are associated with poorly performing queries. Attention should be given to individual program functions, because poor query performance often points to one or more inefficient techniques being used.
Two methods can be used to better understand potential performance issues. The first approach uses PROC CONTENTS output to examine engine/host information and library data sets (tables). The output provides information to determine whether a table is large enough. (The page count in the following output shows the performance improvements offered by an index). The general rule that the SQL processor adheres to is that when a table is relatively small (usually fewer than three pages), there is no real advantage to using an index. In fact, using an index with a small table can actually degrade performance levels because in these situations sequential processing would be just as fast as using an index.
The second approach uses PROC SQL to access the dictionary tables, TABLES and COLUMNS, to determine whether a table is large enough to take advantage of the performance improvements that are offered by an index. See the output below.
PROC SQL;
SELECT MEMNAME, NPAGE
FROM DICTIONARY.TABLES
WHERE LIBNAME=’WORK’ AND
MEMNAME=’INVENTORY’;
SELECT VARNUM, NAME, TYPE, LENGTH, FORMAT,
INFORMAT, LABEL
FROM DICTIONARY.COLUMNS
WHERE LIBNAME=’WORK’ AND
MEMNAME=’INVENTORY’;
QUIT;
Table 12.2 compares sequential table access with indexed table access. Although performance gains are data dependent, the greatest gains are realized when an index is applied to a small subset of data in a WHERE clause.
Table 12.2: Sequential versus Indexed Table Access
Condition |
Sequential |
Index |
Page count < 3 pages (from CONTENTS output) |
Yes |
No |
Small table |
Yes |
No |
Frequent updates to table |
Yes |
No |
Large subset of data based on WHERE processing |
Yes |
No |
Infrequent access of table |
Yes |
No |
Limited memory and disk space |
Yes |
No |
Small subset of data (1%–25% of population) |
No |
Yes |
System messages are displayed to provide information that can help tune the indexes that are associated with any data sets. Setting the MSGLEVEL= system option to “I” allows SAS to display vital information (if available) that is related to the presence of one or more indexes for optimization of WHERE clause processing. With the MSGLEVEL= option turned on, the SAS log shows that the simple index INVENQTY was selected in the optimization of WHERE clause processing.
PROC SQL;
CREATE INDEX INVENQTY ON INVENTORY;
NOTE: Simple index invenqty has been defined.
NOTE: PROCEDURE SQL used:
real time 0.04 seconds
SELECT *
FROM INVENTORY
WHERE invenqty < 3;
INFO: Index invenqty selected for WHERE clause
optimization.
QUIT;
NOTE: PROCEDURE SQL used:
real time 0.65 seconds
To get the best possible performance from programs that contain SQL procedure code, an index and WHERE clause can be used together (see the list below). Using a WHERE clause restricts processing in a table to a subset of selected rows (for more information, see Chapter 2, “Working with Data in PROC SQL”). When an index exists, the SQL processor determines whether to take advantage of it during WHERE clause processing. Although the SQL optimizer determines whether using an index will ultimately benefit performance, when it does use an index the result can be an improvement in processing speeds.
● Comparison operators such as EQ (=), LT (<), GT (>), LE (<=), GE (>=), and NOT
● Comparison operators with the IN operator
● Comparison operators with the colon modifier (for example, NOT = :“Ab”)
● CONTAINS operator
● IS NULL or IS MISSING operator
● Pattern-matching operators such as LIKE and NOT LIKE
Constructing efficient logic conditions has a direct effect on processing costs. Because the SQL optimizer evaluates a series of AND expressions from left to right, a chain of AND conditions in a WHERE clause should be specified with the most restrictive expression first. Specified this way, fewer resources are expended by bypassing rows that do not satisfy the first conditional value in the WHERE clause. For example, the SQL query below might expend more resources and run slower because the first condition, “SOFTWARE”, occurs even when products cost more than $99.00.
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’ AND
PRODCOST < 100.00;
QUIT;
For this data, a more efficient way to produce the same results as the previous example, while reducing CPU resources, is to code the least likely condition first so that it appears as shown here.
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE PRODCOST < 100.00 AND
UPCASE(PRODTYPE) = ‘SOFTWARE’;
QUIT;
Another popular construct uses a series of OR condition equality tests or the IN predicate to select rows that match the multiple conditions. Programmers often specify these kinds of lists in order of magnitude or alphabetically to make the lists easier to read and/or maintain. A better and more efficient way is to construct a list of constants in the order of the most frequently occurring value to the least frequently occurring value.
Note: One way to determine the frequency of product type (PRODTYPE) values is to submit the following code:
PROC SQL;
SELECT PRODTYPE, COUNT(PRODTYPE) AS Product_Frequency
FROM PRODUCTS
GROUP BY PRODTYPE;
QUIT;
With the frequencies determined, the list of conditions can be specified in that order. In this way, fewer resources are spent locating frequently occurring values because it has to perform fewer steps to return a value of TRUE. It is also important to remove duplicate values in the constant list because the first identified duplicate value will automatically return a value of TRUE, which results in the second occurrence of the duplicate value being ignored.
The next query illustrates logic conditions that could require more processing resources because the first condition “LAPTOP” occurs less frequently than the value “SOFTWARE”. Consequently, the SQL processor needs to expend more resources to process the second condition in order to find a match of TRUE being returned.
PROC SQL STIMER;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN (‘LAPTOP’, ‘SOFTWARE’);
QUIT;
PROC SQL STIMER;
NOTE: SQL Statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN ('LAPTOP', 'SOFTWARE');
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
cpu time 0.04 seconds
A more efficient way to process the same data while generating the same results would be to specify the condition, “SOFTWARE”, first as follows:
PROC SQL STIMER;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN (‘SOFTWARE’, ‘LAPTOP’);
QUIT;
PROC SQL STIMER;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN ('SOFTWARE', 'LAPTOP');
NOTE: SQL Statement used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
QUIT;
UNIONs are executed by creating two internal sets, then merge-sorting the results together. Duplicate rows are automatically eliminated from the final results. The Venn diagram for a UNION of two sets represents all distinct elements in the collection, as illustrated in Figure 12.1.
Figure 12.1: Venn Diagram
For example, the SQL procedure code that represents a UNION set first constructs the two result sets from each query, merges and sorts the two sets together, and then eliminates duplicate rows from the final results.
OPTIONS FULLSTIMER;
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘LAPTOP’
UNION
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’;
QUIT;
OPTIONS FULLSTIMER;
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = 'LAPTOP'
UNION
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = 'SOFTWARE';
QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.16 seconds
user cpu time 0.00 seconds
system cpu time 0.04 seconds
memory 634.09k
OS Memory 6768.00k
Timestamp 06/03/2012 04:59:57 PM
To improve UNION performance, SQL procedure code can be converted to a single query using OR conditions in a WHERE clause. The next example illustrates the previous SQL procedure code being made more efficient by converting the UNION to a single query using an OR operator in a WHERE clause.
OPTIONS FULLSTIMER;
PROC SQL;
SELECT DISTINCT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’ OR
UPCASE(PRODTYPE) = ‘LAPTOP’;
QUIT;
OPTIONS FULLSTIMER;
PROC SQL;
SELECT DISTINCT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = 'SOFTWARE' OR
UPCASE(PRODTYPE) = 'LAPTOP';
QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.10 seconds
user cpu time 0.04 seconds
system cpu time 0.01 seconds
memory 565.60k
OS Memory 6768.00k
Timestamp 06/03/2012 05:06:55 PM
The next example illustrates the previous SQL procedure code being made more efficient by converting the UNION to a single query using an IN predicate in a WHERE clause.
OPTIONS FULLSTIMER;
PROC SQL;
SELECT DISTINCT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN (‘SOFTWARE’, ‘LAPTOP’);
QUIT;
OPTIONS FULLSTIMER;
PROC SQL;
SELECT DISTINCT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) IN ('SOFTWARE', 'LAPTOP');
QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.07 seconds
user cpu time 0.01 seconds
system cpu time 0.04 seconds
memory 561.41k
OS Memory 6768.00k
Timestamp 06/03/2012 05:12:56 PM
Another approach that can improve the way a query with a SET operator performs is to specify the ALL keyword, as long as duplicates are not an issue or the rows in the table are all unique. Because the ALL keyword prevents SQL from processing the data twice and does not remove duplicate rows, CPU resources might be improved. The next example shows the UNION ALL coding construct being used to perform what amounts to an append operation, thereby bypassing the sort altogether because the duplicate rows are not removed.
OPTIONS FULLSTIMER;
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘LAPTOP’
UNION ALL
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = ‘SOFTWARE’;
QUIT;
OPTIONS FULLSTIMER;
PROC SQL;
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = 'LAPTOP'
UNION ALL
SELECT *
FROM PRODUCTS
WHERE UPCASE(PRODTYPE) = 'SOFTWARE';
QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.06 seconds
user cpu time 0.00 seconds
system cpu time 0.04 seconds
memory 226.42k
OS Memory 6768.00k
Timestamp 06/03/2012 05:22:14 PM
1. Performance tuning involves measuring, evaluating, and modifying a query’s execution to achieve an optimal balance between competing computer resources (see the “Understanding Performance Tuning” section).
2. Avoid specifying an ORDER BY clause when creating a table or view (see the “Sorting and Performance” section).
3. When sorting is necessary, specify the SORTPGM= system option to instruct the SAS System to use the best possible sort utility relative to the size of the database environment (see the “User-Specified Sorting” section).
4. It is recommended to keep the number of grouping columns with the GROUP BY clause as small as possible (see the “Splitting Tables” section).
5. Care should be exercised to assign indexes to only those discriminating variables in a table and to avoid creating too many indexes (see the “Indexes and Performance” section).
6. There is no advantage in creating or using an index when a table is relatively small (usually fewer than three pages) (see the “Reviewing CONTENTS Output and System Messages” section).
7. Setting the MSGLEVEL= system option to “I” allows SAS to display vital information (if available) relative to the presence of one or more indexes for optimization of WHERE clause processing (see the “Reviewing CONTENTS Output and System Messages” section).
8. Apply WHERE clause processing to restrict the number of rows of the result table (see the “Optimizing WHERE Clause Processing with Indexes” section).
9. When constructing a chain of AND conditions in a WHERE clause, specify the most restrictive conditional values first (see the “Constructing Efficient Logic Conditions” section).