COVAR, CORR, VAR, STDDEV, and regression examples (1/2)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

110 High-Function Business Intelligence in e-business

The usual diagnostic statistics that accompany a linear regression analysis can

be computed in terms of the above functions as follows, and are offered with

minimal explanation. All the following expressions apply to a simple linear

regression, that is a model which includes only one independent variable.

? Adjusted R

1 - ((1-REGR_R2)*((REGR_COUNT - 1) / (REGR_COUNT - 2)))

? Standard error (standard deviation of the residuals):

SQRT((REGR_SYY-(POWER(REGR_SXY,2)/REGR_SXX)) / (REGR_COUNT-2))

Note that:

? Total sum of squares:

REGR_SYY

? Regression sum of squares:

POWER(REGR_SXY,2) / REGR_SXX

? Residual sum of squares:

(Total sum of squares) - (regression sum of squares)

? t statistic:

For each coefficient (slope and intercept in the simple linear regression

model), there is a concern as to whether the coefficient’s value is meaningful,

or if the coefficient is really zero. That is, the independent variable (x) does not

contribute to the value of the dependent variable (y). The ‘t statistic’ can help

make this determination

t statistic for slope

REGR_SLOPE * SQRT(REGR_SXX) / (Standard error)

t statistic for intercept

REGR_INTERCEPT/((Standard error) *

SQRT((1/REGR_COUNT)+(POWER(REGR_AVGX,2)/REGR_SXX))

3.2.13 COVAR, CORR, VAR, STDDEV, and regression examples

The following examples give a flavor of the use of these functions in a number of

scenarios.

residual y

+()–=

Chapter 3. DB2 UDB’s statistics, analytic, and OLAP functions 111

COVARIANCE example

We wish to explore the relationship between employee salary and the bonus that

they receive, using the data shown in Figure 3-1.

Figure 3-1 D11 Employee salary & bonus

The DB2 SQL for covariance could be as shown in Example 3-2:

Example 3-2 COVARIANCE example

SELECT COVARIANCE (salary,bonus)

FROM employee

WHERE workdept = ‘D11’

The result of this query is 23650.86.

This positive result indicates there is a positive relationship between salary and

bonus, that is, employees with high (low) salaries tend to get high (low) bonuses.

While this conclusion appears intuitive with only a few data points, it is less

obvious when there are a large number of data points involved — say 1000 or

10,000 employees. The covariance function thus enables relationships between

variables.

Correlation helps quantify the strength of the relationship.

Note: Covariance by itself does not indicate how strong the relationship is. It

merely indicates one exists and whether it is a positive or negative relation. To

determine the strength of a relationship the correlation must be calculated.

112 High-Function Business Intelligence in e-business

CORRELATION examples

Using the same salary bonus example in Figure 3-1, we can quantify the strength

of the relationship with the SQL shown in Example 3-3:

Example 3-3 CORRELATION example 1

SELECT CORRELATION (salary,bonus) AS cor

FROM employee

WHERE workdept = ‘D11’

The result of the query is 0.739.

This quantitatively confirms the reasonably strong linear relationship between

salary and bonus from the employees in department ‘D11’.

Another example of correlation involving the retail industry is shown in

Example 3-4. Assume we have the transactions of purchases from all the

customers of a a retail organization selling a variety of products, and we would

like to identify customers with similar buying habits. For example, when Customer

A bought a particular product, Customer B also tended to buy the same product.

Such information can be put to effective use in targeted marketing.

A view called

transhist is created that contains the customer id, product id, and

the dollar amount purchased over all transactions.

Example 3-4 CORRELATION example 2

SELECT a.custid as custid1, b.custid as custid2,

CORR(a.amount, b.amount) AS cor

FROM transhist a, transhist b

WHERE a.prodid = b.prodid AND a.custid < b.custid

GROUP BY a.custid, b.custid

HAVING CORR(a.amount, b.amount) >= 0.5 AND COUNT(*) > 100

ORDER BY a.custid, cor DESC

This query joins the view with itself, and uses the HAVING clause to restrict the

output to cases of high correlation (>= 0.5), and to cases where there are at least

a 100 products involved, that is, there are at least 100 data points used to

compute the correlation.

The result of this query is as follows:

CUSTID1 CUSTID2 CORR

----------- ----------- -----

1026 8271 0.51

1071 2014 0.74 <=

1071 7219 0.63

2014 7219 0.58

8271 9604 0.56

Chapter 3. DB2 UDB’s statistics, analytic, and OLAP functions 113

The result shows a high correlation between the buying habits of Customer 1071

and Customer 2014, that is, whenever customer 1071 bought a large amount of

a given product, then customer 2014 also tended to buy a large amount of the

same product.

VARIANCE example

DB2 has a built-in function to calculate variance. Using the same salary and

bonus data shown in Figure 3-1, our SQL is shown in Example 3-5:

Example 3-5 VARIANCE example

SELECT AVG(salary), VARIANCE(salary) AS Variance

FROM employee

WHERE workdept = 'D11’

The average salary is $24677.78, while the variance in our case is

1.885506172839506E7.

However this is not very intuitive, and standard deviation provides a more intuitive

answer.

STDDEV examples

Using the same data as shown in Figure 3-1, the standard deviation of salary of

employees in department ‘D11’ an be computed as shown in Example 3-6:

Example 3-6 STDDEV example 1

SELECT AVG(salary), STDDEV(salary) AS StandDev

FROM employee

WHERE workdept = 'D11'

The result of this query is an average of $2477.78 and a standard deviation of

$4342.24.

It indicates the variation of individual salaries from the average salary for the set,

and is more intuitive than the variance function discussed earlier.

Another example of standard deviation involves computing the various statistics

of an organization’s sales worldwide over multiple years.

The data is contained in three tables: trans, transitem, and loc, as shown in

Example 3-7.

114 High-Function Business Intelligence in e-business

Example 3-7 STDDEV example 2

SELECT loc.country AS country, YEAR(t.pdate) AS year,

COUNT(*) AS count, SUM(ti.amount) AS sum,

AVG(ti.amount) AS avg, MAX(ti.amount) AS max,

STDDEV(ti.amount) AS std

FROM trans t, transitem ti, loc loc

WHERE t.transid = ti.transid AND loc.locid = t.locid

GROUP BY loc.country, year(t.pdate)

The result of this query is as follows:

country year count sum avg max stddev

------- ------ ------ ------ ------ ------ ------

USA 1998 235 127505 542.57 899.99 80.32

USA 1999 349 236744 678.35 768.61 170.45

GERMANY 1998 180 86278 479.32 771.65 77.41

GERMANY 1999 239 126737 530.28 781.99 72.22

...

The result shows commonly gathered statistics related to sales such as COUNT,

SUM, AVG and MAX. The STDDEV function shows that USA sales in 1999 are

much more variable (STDDEV of $170.45) than sales in other years and other

locations, i.e., the amounts in the individual sales transactions vary more widely

from their average value of $678.35.

Linear Regression examples

Using the same data shown in Figure 3-1, we will derive a regression model

where salary is the independent variable and bonus is the dependent variable

using the DB2 SQL shown in Example 3-8.

Example 3-8 Linear regression example 1

SELECT REGR_SLOPE (bonus , salary) AS slope,

REGR_ICPT (bonus , salary) AS intercept

FROM employee

WHERE workdept = ‘D11’

The result of this query is a slope of 0.0125 and an intercept is $179.313, that is:

Bonus

0.0125

Salary

× 179.313+=

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for COVAR, CORR, VAR, STDDEV, and regression examples (1/2)

Create new playlist

Sign In

Sign Up

Table of Contents for
COVAR, CORR, VAR, STDDEV, and regression examples (1/2)