110 High-Function Business Intelligence in e-business
The usual diagnostic statistics that accompany a linear regression analysis can
be computed in terms of the above functions as follows, and are offered with
minimal explanation. All the following expressions apply to a simple linear
regression, that is a model which includes only one independent variable.
? Adjusted R
2
1 - ((1-REGR_R2)*((REGR_COUNT - 1) / (REGR_COUNT - 2)))
? Standard error (standard deviation of the residuals):
SQRT((REGR_SYY-(POWER(REGR_SXY,2)/REGR_SXX)) / (REGR_COUNT-2))
Note that:
? Total sum of squares:
REGR_SYY
? Regression sum of squares:
POWER(REGR_SXY,2) / REGR_SXX
? Residual sum of squares:
(Total sum of squares) - (regression sum of squares)
? t statistic:
For each coefficient (slope and intercept in the simple linear regression
model), there is a concern as to whether the coefficients value is meaningful,
or if the coefficient is really zero. That is, the independent variable (x) does not
contribute to the value of the dependent variable (y). The t statistic can help
make this determination
t statistic for slope
REGR_SLOPE * SQRT(REGR_SXX) / (Standard error)
t statistic for intercept
REGR_INTERCEPT/((Standard error) *
SQRT((1/REGR_COUNT)+(POWER(REGR_AVGX,2)/REGR_SXX))
3.2.13 COVAR, CORR, VAR, STDDEV, and regression examples
The following examples give a flavor of the use of these functions in a number of
scenarios.
i
th
residual y
i
ax
i
b
+()=
Chapter 3. DB2 UDBs statistics, analytic, and OLAP functions 111
COVARIANCE example
We wish to explore the relationship between employee salary and the bonus that
they receive, using the data shown in Figure 3-1.
Figure 3-1 D11 Employee salary & bonus
The DB2 SQL for covariance could be as shown in Example 3-2:
Example 3-2 COVARIANCE example
SELECT COVARIANCE (salary,bonus)
FROM employee
WHERE workdept = ‘D11’
The result of this query is 23650.86.
This positive result indicates there is a positive relationship between salary and
bonus, that is, employees with high (low) salaries tend to get high (low) bonuses.
While this conclusion appears intuitive with only a few data points, it is less
obvious when there are a large number of data points involved say 1000 or
10,000 employees. The covariance function thus enables relationships between
variables.
Correlation helps quantify the strength of the relationship.
Note: Covariance by itself does not indicate how strong the relationship is. It
merely indicates one exists and whether it is a positive or negative relation. To
determine the strength of a relationship the correlation must be calculated.
112 High-Function Business Intelligence in e-business
CORRELATION examples
Using the same salary bonus example in Figure 3-1, we can quantify the strength
of the relationship with the SQL shown in Example 3-3:
Example 3-3 CORRELATION example 1
SELECT CORRELATION (salary,bonus) AS cor
FROM employee
WHERE workdept = ‘D11’
The result of the query is 0.739.
This quantitatively confirms the reasonably strong linear relationship between
salary and bonus from the employees in department D11.
Another example of correlation involving the retail industry is shown in
Example 3-4. Assume we have the transactions of purchases from all the
customers of a a retail organization selling a variety of products, and we would
like to identify customers with similar buying habits. For example, when Customer
A bought a particular product, Customer B also tended to buy the same product.
Such information can be put to effective use in targeted marketing.
A view called
transhist is created that contains the customer id, product id, and
the dollar amount purchased over all transactions.
Example 3-4 CORRELATION example 2
SELECT a.custid as custid1, b.custid as custid2,
CORR(a.amount, b.amount) AS cor
FROM transhist a, transhist b
WHERE a.prodid = b.prodid AND a.custid < b.custid
GROUP BY a.custid, b.custid
HAVING CORR(a.amount, b.amount) >= 0.5 AND COUNT(*) > 100
ORDER BY a.custid, cor DESC
This query joins the view with itself, and uses the HAVING clause to restrict the
output to cases of high correlation (>= 0.5), and to cases where there are at least
a 100 products involved, that is, there are at least 100 data points used to
compute the correlation.
The result of this query is as follows:
CUSTID1 CUSTID2 CORR
----------- ----------- -----
1026 8271 0.51
1071 2014 0.74 <=
1071 7219 0.63
2014 7219 0.58
8271 9604 0.56
Chapter 3. DB2 UDBs statistics, analytic, and OLAP functions 113
The result shows a high correlation between the buying habits of Customer 1071
and Customer 2014, that is, whenever customer 1071 bought a large amount of
a given product, then customer 2014 also tended to buy a large amount of the
same product.
VARIANCE example
DB2 has a built-in function to calculate variance. Using the same salary and
bonus data shown in Figure 3-1, our SQL is shown in Example 3-5:
Example 3-5 VARIANCE example
SELECT AVG(salary), VARIANCE(salary) AS Variance
FROM employee
WHERE workdept = 'D11’
The average salary is $24677.78, while the variance in our case is
1.885506172839506E7.
However this is not very intuitive, and standard deviation provides a more intuitive
answer.
STDDEV examples
Using the same data as shown in Figure 3-1, the standard deviation of salary of
employees in department D11 an be computed as shown in Example 3-6:
Example 3-6 STDDEV example 1
SELECT AVG(salary), STDDEV(salary) AS StandDev
FROM employee
WHERE workdept = 'D11'
The result of this query is an average of $2477.78 and a standard deviation of
$4342.24.
It indicates the variation of individual salaries from the average salary for the set,
and is more intuitive than the variance function discussed earlier.
Another example of standard deviation involves computing the various statistics
of an organizations sales worldwide over multiple years.
The data is contained in three tables: trans, transitem, and loc, as shown in
Example 3-7.
114 High-Function Business Intelligence in e-business
Example 3-7 STDDEV example 2
SELECT loc.country AS country, YEAR(t.pdate) AS year,
COUNT(*) AS count, SUM(ti.amount) AS sum,
AVG(ti.amount) AS avg, MAX(ti.amount) AS max,
STDDEV(ti.amount) AS std
FROM trans t, transitem ti, loc loc
WHERE t.transid = ti.transid AND loc.locid = t.locid
GROUP BY loc.country, year(t.pdate)
The result of this query is as follows:
country year count sum avg max stddev
------- ------ ------ ------ ------ ------ ------
USA 1998 235 127505 542.57 899.99 80.32
USA 1999 349 236744 678.35 768.61 170.45
GERMANY 1998 180 86278 479.32 771.65 77.41
GERMANY 1999 239 126737 530.28 781.99 72.22
...
The result shows commonly gathered statistics related to sales such as COUNT,
SUM, AVG and MAX. The STDDEV function shows that USA sales in 1999 are
much more variable (STDDEV of $170.45) than sales in other years and other
locations, i.e., the amounts in the individual sales transactions vary more widely
from their average value of $678.35.
Linear Regression examples
Using the same data shown in Figure 3-1, we will derive a regression model
where salary is the independent variable and bonus is the dependent variable
using the DB2 SQL shown in Example 3-8.
Example 3-8 Linear regression example 1
SELECT REGR_SLOPE (bonus , salary) AS slope,
REGR_ICPT (bonus , salary) AS intercept
FROM employee
WHERE workdept = ‘D11’
The result of this query is a slope of 0.0125 and an intercept is $179.313, that is:
Bonus
0.0125
Salary
× 179.313+=
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset