PROC SQL consists of a standard set of statements and options to create, retrieve, alter, transform, and transfer data regardless of the operating system or where the data is located. These features provide tremendous power as well as control when integrating information from a variety of sources in a number of ways. Because emphasis is placed on PROC SQL’s data manipulation capabilities and not on its format and output capabilities, many programmers are unfamiliar with the SQL procedure’s output-producing features. Consequently, programmers resort to using report writers or special outputting tools to create the best looking output. To illustrate the virtues of PROC SQL in the SAS System, this chapter presents numerous examples on how output can be formatted and produced.
Being able to display a blank line between each row of output is available as a procedure option in PROC SQL. As with PROC PRINT, specifying DOUBLE in the SQL procedure inserts a blank line between each row of output (NODOUBLE is the default). Setting this option is especially useful when one or more flowed lines spans or wraps in the output because it provides visual separation between each row of data. This example illustrates using the DOUBLE option to double-space output.
PROC SQL DOUBLE;
SELECT *
FROM INVOICE;
QUIT;
Results
The SAS System Invoice Quantity Invoice Manufacturer Customer - Units Invoice Unit Product Number Number Number Sold Price Number ________________________________________________ ________________ 1001 500 201 5 $1 ,495.00 5001 1002 600 1301 2 $1 ,598.00 6001 1003 210 101 7 $245.00 2101 1004 111 501 3 $9 ,600.00 1110 1005 500 801 2 $798.00 5002 1006 500 901 4 $396.00 6000 1007 500 401 7 $23 ,100.00 1200 |
To revert back to single-spaced output, you can specify the RESET statement as long as the QUIT statement has not been issued to turn off the SQL procedure. When PROC SQL is active, you can specify the RESET statement with or without options to reestablish each option’s original settings. When the RESET statement is specified with one or more options, only those options are reset. This example illustrates the NODOUBLE option specified to turn off double-spaced output and reset printing back to the default single-spaced output.
PROC SQL;
RESET NODOUBLE;
QUIT;
You can specify an SQL procedure option called NUMBER to display row numbers on output under the column heading Row. The NUMBER option displays row numbers on output. The next example shows the NUMBER option being specified to turn on the display of row numbers.
SQL Code
PROC SQL NUMBER;
SELECT ITEM,
UNITS,
UNITCOST
FROM PURCHASES;
QUIT;
Results
The SAS System Row Item Units Unitcost _________________________________________ 1 Chair 1 $179.00 2 Pens 12 $0.89 3 Paper 4 $6.95 4 Stapler 1 $8.95 5 Mouse Pad 1 $11.79 6 Pens 24 $1.59 7 Markers . $0.99 |
As was presented in Chapter 2, “Working with Data in PROC SQL,” two or more strings can be concatenated together to produce a combined and longer string of characters. The concatenation character string operator, represented by two vertical bars “‖”, “!!”, or “¦¦” (depending on the operating system and keyboard being used), combines two or more strings or columns together forming a new string value. The next example shows the concatenation of the manufacturer city and state columns from the MANUFACTURERS table so that the second column immediately follows the first. Although the two character strings are concatenated, the output illustrates potential problems as a result of using the concatenation operator.
First, column headers have been suppressed for both columns. Without header information, a true understanding of the contents of the output may be in jeopardy. Next, blanks are automatically padded to the entire length of the first concatenated column for each row of data resulting in a “fixed” length for the first column.
SQL Code
PROC SQL; SELECT manucity || manustat FROM MANUFACTURERS; QUIT;
Results
The SAS System ______________________ Houston TX San Diego CA Miami FL San Mateo CA San Diego CA San Diego CA |
To make the preceding output appear a bit more readable and complete, you should consider a few modifications. First, column headers could be assigned as aliases using the AS operator. The maximum size of a user-defined column header is 32 bytes (following valid SAS naming conventions). Finally, the TRIM function (described in Chapter 2, “Working with Data in PROC SQL”) could be used to remove trailing blanks in the city column. This makes the second column act as a floating field.
SQL Code
PROC SQL; SELECT TRIM(manucity) || manustat AS Headquarters FROM MANUFACTURERS; QUIT;
Results
The SAS System Headquarters ______________________ HoustonTX San DiegoCA MiamiFL San MateoCA San DiegoCA San DiegoCA |
Although the preceding output illustrates that some changes were made, it still is difficult to read. A few more cosmetic changes should be made to make it more aesthetically appealing and readable. In the next section, the output will be customized to give the data further separation.
At times, it is useful to be able to insert text and/or constants in query output. This enables special characters including symbols and comments to be inserted in the output. We can improve the output in the previous example by inserting a comma “,” and a single blank space between the manufacturer city and state information. The final output illustrates an acceptable way to display columnar data using a “free-floating” presentation as opposed to fixed columns.
PROC SQL; SELECT trim(manucity) || ', ' || manustat As Headquarters FROM MANUFACTURERS; QUIT;
Results
The SAS System Headquarters _____________________ Houston, TX San Diego, CA Miami, FL San Mateo, CA San Diego, CA San Diego, CA |
Another method of automatically concatenating character strings, removing leading and trailing blanks, and inserting text and constants is with the CATX function. The next example shows how the CATX function is specified with a “,” used as a separator between character strings MANUCITY and MANUSTAT.
SQL Code
PROC SQL; SELECT CATX(',', manucity, manustat) As Headquarters FROM MANUFACTURERS; QUIT;
The SAS System Headquarters _____________________ Houston, TX San Diego, CA Miami, FL San Mateo, CA San Diego, CA San Diego, CA |
In computing terms, a scalar refers to a quantity represented by a single number or value. The value is not represented as an array or list of values, but as a single value. For example, the value 7 is a scalar value, but (0,0,7) is not. PROC SQL allows the use of scalar expressions and constants with selected columns. Typically, these expressions replace or augment one or more columns in the SELECT statement. To illustrate how a scalar expression is used, assume that a value of 7.5% representing the sales tax percentage is computed for each product in the PRODUCTS table. The results consist of the product name, product cost, and derived computed sales tax column.
Note: | Although the computed column contains the results of the sales tax computation for each product, it does not contain a column heading. |
SQL Code
PROC SQL;
SELECT prodname, prodcost,
.075 * prodcost
FROM PRODUCTS;
QUIT;
The SAS System Product Product Name Cost ______________________________________________ Dream Machine $3,200.00 240 Business Machine $3,300.00 247.5 Travel Laptop $3,400.00 255 Analog Cell Phone $35.00 2.625 Digital Cell Phone $175.00 13.125 Office Phone $130.00 9.75 Spreadsheet Software $299.00 22.425 Database Software $399.00 29.925 Wordprocessor Software $299.00 22.425 Graphics Software $299.00 22.425 |
In the next two examples, a column header or alias is assigned to the derived sales tax column computed in the previous example. Two methods exist for achieving this. The first method uses the AS keyword to not only name the derived column, but also to permit referencing the column later in the query. This is useful in situations where a reference to the ordinal position is needed. The next example uses the ordinal position to reference a column in a query with the ORDER BY clause.
SQL Code
PROC SQL; SELECT prodname, prodcost, .075 * prodcost AS Sales_Tax FROM PRODUCTS ORDER BY 3; QUIT;
The SAS System Product Product Name Cost Sales_Tax _______________________________________________ Analog Cell Phone $35.00 2.625 Office Phone $130.00 9.75 Digital Cell Phone $175.00 13.125 Spreadsheet Software $299.00 22.425 Graphics Software $299.00 22.425 Wordprocessor Software $299.00 22.425 Database Software $399.00 29.925 Dream Machine $3,200.00 240 Business Machine $3,300.00 247.5 Travel Laptop $3,400.00 255 |
The next example illustrates the second method of assigning a column heading for the computed sales tax column using the LABEL= option. To further enhance the output’s readability, a numeric dollar format is specified.
Note: | Because the next example is only a query and the table is not being updated, the assigned attributes are only available for the duration of the step and are not permanently saved in the table’s record descriptor. |
SQL Code
PROC SQL; SELECT prodname, prodcost, .075 * prodcost FORMAT=DOLLAR7.2 LABEL='Sales Tax' FROM PRODUCTS; QUIT;
The SAS System Product Sales Product Name Cost Tax _______________________________________________ Dream Machine $3,200.00 $240.00 Business Machine $3,300.00 $247.50 Travel Laptop $3,400.00 $255.00 Analog Cell Phone $35.00 $2.63 Digital Cell Phone $175.00 $13.13 Office Phone $130.00 $9.75 Spreadsheet Software $299.00 $22.43 Database Software $399.00 $29.93 Wordprocessor Software $299.00 $22.43 Graphics Software $299.00 $22.43 |
By definition, tables are unordered sets of data. The data that comes from a table does not automatically appear in any particular order. To offset this behavior, the SQL procedure provides the ability to impose order in a table by using an ORDER BY clause. When used, this clause orders the query results according to the values in one or more selected columns, it must be specified after the FROM clause.
Rows of data can be ordered in ascending or descending (DESC) order for each column specified (ascending is the default order). To illustrate how selected columns of data can be ordered, let’s first view the PRODUCTS table and all its columns arranged in ascending order by product number (PRODNUM).
SQL Code
PROC SQL;
SELECT *
FROM PRODUCTS
ORDER BY prodnum;
QUIT;
Results
The SAS System Manufacturer Product Product Product Name Number Product Type Cost Number _______________________________________________ ________________ 1110 Dream Machine 111 Workstation $3,200.00 1200 Business Machine 120 Workstation $3,300.00 1700 Travel Laptop 170 Laptop $3,400.00 2101 Analog Cell Phone 210 Phone $35.00 2102 Digital Cell Phone 210 Phone $175.00 2200 Office Phone 220 Phone $130.00 5001 Spreadsheet Software 500 Software $299.00 5002 Database Software 500 Software $399.00 5003 Wordprocessor Software 500 Software $299.00 5004 Graphics Software 500 Software $299.00 |
The next example illustrates a query that selects and orders multiple columns of data from the PRODUCTS table. Output is arranged first in ascending order by product type (PRODTYPE) and within product type in descending order by product cost. The code and output is shown.
SQL Code
PROC SQL;
SELECT prodname, prodtype, prodcost, prodnum
FROM PRODUCTS
ORDER BY prodtype, prodcost DESC;
QUIT;
The SAS System Product Product Product Name Product Type Cost Number _____________________________________________ _______________ Travel Laptop Laptop $3,400 .00 1700 Digital Cell Phone Phone $175 .00 2102 Office Phone Phone $130 .00 2200 Analog Cell Phone Phone $35 .00 2101 Database Software Software $399 .00 5002 Spreadsheet Software Software $299 .00 5001 Graphics Software Software $299 .00 5004 Wordprocessor Software Software $299 .00 5003 Business Machine Workstation $3,300 .00 1200 Dream Machine Workstation $3,200 .00 1110 |
Occasionally it may be useful to display data in designated groups. A GROUP BY clause is used in these situations to aggregate and order groups of data using a designated column with the same value. When a GROUP BY clause is used without a summary function, SAS issues a warning on the SAS log with the message, “A GROUP BY clause has been transformed into an ORDER BY clause because neither the SELECT clause nor the optional HAVING clause of the associated table-expression referenced a summary function.” The GROUP BY is transformed into an ORDER BY clause and then processed.
When a GROUP BY clause is used without specifying a summary function in the SELECT statement, the entire table is treated as a single group and ordered in ascending order. The next example illustrates a GROUP BY clause without any summary function specifications. Due to the absence of any summary functions, the GROUP BY clause is automatically transformed into an ORDER BY clause, with the rows being ordered in ascending order by product type (PRODTYPE).
PROC SQL;
SELECT prodtype,
prodcost
FROM PRODUCTS
GROUP BY prodtype;
QUIT;
Results
The SAS System Product Product Type Cost __________________________ Laptop $3,400.00 Phone $130.00 Phone $175.00 Phone $35.00 Software $299.00 Software $299.00 Software $299.00 Software $399.00 Workstation $3,200.00 Workstation $3,300.00 |
When a GROUP BY clause is used with a summary function, the rows are aggregated in a series of groups. This means that an aggregate function is evaluated on a group of rows and not on a single row at a time. Suppose the least expensive product in each product category needs to be identified. A separate query for each product category could be specified using the MIN function to determine the cheapest product. But this would require separate runs to be executed — not a very good approach. A better way to do this would be to specify a GROUP BY clause in a single statement as follows.
PROC SQL; SELECT prodtype, MIN(prodcost) AS Cheapest Format=dollar9.2 Label='Least Expensive' FROM PRODUCTS GROUP BY prodtype; QUIT;
Results
The SAS System Least Product Type Expensive ___________________________ Laptop $3,400.00 Phone $35.00 Software $299.00 Workstation $3,200.00 |
In the absence of an ORDER BY clause, the SQL procedure automatically sorts the results from a grouped query in the same order as specified in the GROUP BY clause. When both an ORDER BY and GROUP BY clause are specified for the same column(s) or column order, no additional processing occurs to satisfy the request. Because the ORDER BY and GROUP BY clauses are not mutually exclusive, they can be used together. Internally, the GROUP BY clause first sorts the results on the grouping column(s) and then aggregates the rows of the query by the same grouping column.
But what happens when the column(s) specified in the ORDER BY and GROUP BY clauses are not the same? In these situations additional processing requirements are generally needed. The additional processing, in the worst case scenario, may require remerging summary statistics back with the original data. In other cases, additional sorting requirements may be necessary. Suppose information about the least expensive product in each product category is desired. But instead of automatically sorting the results in ascending order by product type, as before, the results will be displayed in ascending order by the least expensive product.
PROC SQL; SELECT prodtype, MIN(prodcost) AS Cheapest Format=dollar9.2 Label='Least Expensive' FROM PRODUCTS GROUP BY prodtype ORDER BY cheapest; QUIT;
Results
The SAS System Least Product Type Expensive ____________________________ Phone $35.00 Software $299.00 Workstation $3,200.00 Laptop $3,400.00 |
When processing groups of data, it is frequently useful to subset aggregated rows (or groups) of data. This way, aggregated data can be filtered one group at a time in contrast to the WHERE clause where individual rows of data are filtered one row at a time, not aggregated rows. SQL provides a convenient way to subset (or filter) groups of data by using the GROUP BY and HAVING clauses. The HAVING clause is applied after the summary of all observations.
Suppose you want to identify only those product groupings that have an average cost less than $200.00 from the PRODUCTS table. Your first inclination might be to use a summary function in a WHERE clause. But this would not be valid because a WHERE clause is designed specifically to evaluate a single row at a time. This is in direct contrast with the way a summary function processes data because summary functions evaluate groups of rows at a time, not a single row of data at a time as with a WHERE clause. For those already familiar with subqueries as discussed in Chapter 7, “Coding Complex Queries,” you could also approach the problem as a complex query. But there is an easier and more straightforward way of identifying and selecting the desired product groups using the GROUP BY and HAVING clauses, as follows.
SQL Code
PROC SQL;
SELECT prodtype,
AVG(prodcost)
FORMAT=DOLLAR9.2 LABEL='Average Product Cost'
FROM PRODUCTS
GROUP BY prodtype
HAVING AVG(prodcost) <= 200.00;
QUIT;
Results
The SAS System Average Product Product Type Cost ___________________________ Phone $113.33 |