Chapter 4. Statistics, analytic, OLAP functions in business scenarios 179
The data and histogram shows that a significant proportion of the transactions
are less than fifteen thousand, with a peak in the six to nine thousand range. The
answer to the number of transactions worth less than 6,000 dollars is 1058. The
graph also shows that 449 transactions are less than 3000 dollars, and 609
transactions are between 3000 and 6000 dollars.
Query 2
The SQL shown in Example 4-19 can be used to generate the result set that can
then be displayed via an equi-height histogram. The bucket boundaries are
chosen so that each bucket contains approximately the same number of data
points.
Example 4-19 Equi-height histogram query
WITH dt AS
(
SELECT t.transid, SUM(amount) AS trans_amt,
ROWNUMBER() OVER (ORDER BY SUM(amount))*10/
(
SELECT COUNT(DISTINCT transid) + 1
FROM transitem
) AS bucket
FROM trans t, transitem ti
WHERE t.transid=ti.transid
GROUP BY t.transid
)
SELECT (bucket+1)*10 AS percentile, COUNT(bucket) AS b_count,
MAX(trans_amt) AS max_value
FROM dt
GROUP BY bucket
In this example we have used 10 buckets, so that 10% of the data points fall in
each bucket. The internal bucket boundaries are often referred to as the 0.1, 0.2,
..., 0.9 quantiles of the data distribution, or as the 10th, 20th, ...., 90th percentiles.
For example, Figure 4-26 shows that the 10th percentile for our data is $2840.05,
that is, 10% of transactions have a dollar value less than this amount.
In effect, the query computes the total sales amount for each of ‘n’ transactions
as trans-amt, sorts the amounts in increasing order, and assigns the number one
to the smallest transaction, two to the next largest transaction, etc. using the
ROWNUMBER() function. Multiplying these numbers by 10 (the number of
buckets), and then dividing these numbers by ‘n’ (which is computed as
COUNT(DISTINCT transid)), and rounding to the nearest integer produces the
desired bucket number for each transaction.
Figure 4-26 shows the results of this query.