In order to measure whether the sample data is extreme enough to contradict the
null hypothesis, a test statistic is used. This is a random variable with a known
probability distribution that can be used to measure how likely we are to get such
sample data given that the null hypothesis is true. If, given the null hypothesis,
the probability of getting such a value is extremely low, then we would be inclined
to reject the null hypothesis in favor of the alternative hypothesis. On the other
hand, if the probability is not too small, then the null hypothesis might well be true
and we would have to accept it.
Some well-known test statistics include the chi-squared statistic and the Wilcoxon Rank Sum Test W statistic.
Before computing the test statistic, a significance level needs to be set. This is our cutoff in terms of what we consider to be a probability that could happen by chance, and it is typically either 5% or 1%. The probability that, given the null hypothesis, you would get sample data this extreme or worse is called the p-value of the test. Once the p-value is found, it is compared to the significance level. If the p-value is larger than the significance level, then you accept the null hypothesis. If the p-value is less than the significance level, then you reject the null hypothesis.
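As an illustration of this workflow, the following sketch (not taken from the book) applies SciPy's Wilcoxon rank-sum test to two invented samples and compares the resulting p-value against a 5% significance level; the sample values and variable names are assumptions made for the example.

```python
# Minimal sketch of the test workflow described above, using SciPy's
# Wilcoxon rank-sum test on two made-up samples (hypothetical data).
from scipy.stats import ranksums

# Hypothetical samples -- for example, response times under two store layouts.
sample_a = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1]
sample_b = [14.9, 15.2, 13.8, 16.1, 15.5, 14.4]

alpha = 0.05                          # significance level chosen before the test
stat, p_value = ranksums(sample_a, sample_b)   # normal-approximation statistic and two-sided p-value

print(f"test statistic = {stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis (the samples appear to differ).")
else:
    print("Accept the null hypothesis (no evidence of a difference).")
```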
In the first case (P is equal to a), the null hypothesis can be proved wrong in two ways as follows:
1. P is bigger than a.
2. P is smaller than a.
This kind of test is called a two-tailed hypothesis test, because in the graph of the probability distribution of the test statistic, the tails to both the left and right correspond to the rejection of the null hypothesis.
The second case, where rejection occurs because we think that P is smaller than the constant a, is called a left-tailed hypothesis test.
The third case, where rejection occurs because we think P is larger than the constant a, is called a right-tailed hypothesis test.
Collectively, the left-tailed and right-tailed hypothesis tests are called one-tailed tests.
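To make the three rejection regions concrete, the hedged sketch below (the observed statistic value and the standard normal null distribution are assumptions, not from the book) computes the two-tailed, left-tailed, and right-tailed p-values for the same test statistic.

```python
# Sketch of how the three rejection regions translate into p-value
# calculations for a test statistic z that is standard normal under the null.
from scipy.stats import norm

z = 2.1          # hypothetical observed test statistic
alpha = 0.05     # significance level

p_two_tailed   = 2 * norm.sf(abs(z))   # alternative: P differs from a in either direction
p_left_tailed  = norm.cdf(z)           # alternative: P is smaller than a
p_right_tailed = norm.sf(z)            # alternative: P is bigger than a

for name, p in [("two-tailed", p_two_tailed),
                ("left-tailed", p_left_tailed),
                ("right-tailed", p_right_tailed)]:
    decision = "reject" if p < alpha else "accept"
    print(f"{name:12s} p-value = {p:.4f} -> {decision} the null hypothesis")
```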
A.1.7 HAT diagonal
The HAT diagonal is used in conjunction with linear regression.
As discussed, linear regression involves a best fit of a collection of x,y pairs to a
mathematical equation of the form:

y = Ax + B
However, depending upon the data points, it is possible for the slope A and
intercept B to be unduly influenced by data points far from the mean of x.
The HAT diagonal test measures the leverage of each observation on the
predicted value for that observation.
This concept is demonstrated in Figure A-2, HAT diagonal influence of individual data points. The data point at x=9 is far from the data points with x-values between 0 and 2, and its paired y value has a large influence on the slope in this example. Therefore, depending upon the computed slope, the value of y at x=9 is either 5 or 9, which is a significant difference.
The HAT diagonal test identifies data pairs that exert such undue influence on the computed slope. The user may then choose to include or exclude these data points from the computed linear regression model, based on their understanding of the domain of these data points, in order to obtain a more accurate linear regression model.
Figure A-2 HAT diagonal influence of individual data points
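As a rough illustration of this influence, the sketch below fits y = Ax + B with and without a far-out point; the six (x, y) pairs are invented stand-ins, not the actual data behind Figure A-2.

```python
# Sketch showing how a single far-out x value can swing the fitted
# slope and intercept of y = Ax + B (data values are hypothetical).
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 9.0])
y = np.array([1.0, 1.6, 2.1, 2.4, 3.0, 5.0])   # hypothetical paired y values

A_all, B_all = np.polyfit(x, y, 1)               # fit including the point at x=9
A_trim, B_trim = np.polyfit(x[:-1], y[:-1], 1)   # fit excluding the point at x=9

print(f"with x=9:    y = {A_all:.2f}x + {B_all:.2f},  predicted y(9) = {A_all*9 + B_all:.1f}")
print(f"without x=9: y = {A_trim:.2f}x + {B_trim:.2f}, predicted y(9) = {A_trim*9 + B_trim:.1f}")
```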
The formula to calculate the HAT diagonal is:

H_i = \frac{m_{x^2} - 2 m_x x_i + x_i^2}{S_{xx}}

Where:

m_{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2

m_x = \frac{1}{n} \sum_{i=1}^{n} x_i

S_{xx} = \sum_{i=1}^{n} (x_i - m_x)^2

Values that are above the generally accepted cutoff of 2p/n for the HAT diagonal values should be investigated further to determine their validity. This is typically done by including more values in the regression to represent this outlying range, or by validating this data pair. Here:
- n is the number of observations used to fit the model.
- p is the number of parameters in the model.
For the charts in Figure A-2:
- p is 1 variable.
- n is 6 observations.
Therefore, any observation whose HAT is greater than 2 × 1/6 = 0.33 should be suspect.
The data point at x=9 has a HAT of 0.979, which is nearly three times larger than the cutoff, and is therefore suspect.