Chapter 3. DB2 UDBs statistics, analytic, and OLAP functions 115
DB2 has a
R
2
function, REGR_R2. The properties of R
2
are:
?
R
2
bound is between 0 and 1.
? If
R
2
equals 1 then all the points fit on the regression line exactly.
? If
R
2
equals zero then the two attributes are independent.
The closer
R
2
is to 1, the better the computed linear regression model. In
general, an
R
2
greater than 0.75 or so, is considered a good fit for most
applications. However, it varies by application and it is ultimately up to the user to
decide what value constitutes a good model.
The DB2 SQL could look as shown in Example 3-9:
Example 3-9 Linear regression example 2
SELECT REGR_R2 (bonus , salary) AS r2
FROM employee
WHERE workdept = 'D11'
The result of this query is 0.54624.
Since
R
2
is not very close to 1, we conclude that the computed linear regression
model does not appear to be a very good fit.
Another example of using regression involves the assumption of a linear
relationship between the advertising budget and sales figures of a particular
organization that conforms to the equation:
Where:
y is the sales dependent variable.
x is the advertising budget independent variable.
a is the slope.
b is the y-axis intercept corresponding to budget cost even with zero sales.
Note: The columns referenced in the regression functions are reversed from
those in the variance and covariance examples. Since we wish to determine
BONUS as a function of SALARY, it is listed first before SALARY.
yaxb
+=
116 High-Function Business Intelligence in e-business
The queries shown in Example 3-10 determine the values for a, and b given a
set of non-null values of budget and sales data points in a table t:
Example 3-10 Linear regression example 3
SELECT
REGR_COUNT(t.sales, t.ad_budget) AS num_cities,
REGR_SLOPE(t.sales, t.ad_budget) AS a,
REGR_ICPT(t.sales, t.ad_budget) AS b
FROM t
The result of the query is as follows, with REGR_COUNT returning the number of
(x,y)
non-null pairs used to fit the regression line.
num_cities a b
---------- ------ ------
126 1.9533 13.381
The input data and the derived linear model are shown in Figure 3-2.
Figure 3-2 Linear regression
While the foregoing SQL models the equation, it does not tell you the quality of
the fit, that is, the accuracy of the regression line. As described earlier, the
R
2
statistic must be computed to determine the quality of the regression line. R
2
is
the square of the correlation coefficient (CORR).
R
2
can also be interpreted as
the proportion of variation in the y values that is explained by the variation in the
x values, as opposed to variation due to randomness or to other variables not
included in the model. Consider the coding shown in Example 3-11.
0
50
100
150
200
250
0 20406080100
sales
ad_budget
y = 1.9533x + 13.381
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset