In order to measure whether the sample data is extreme enough to contradict the
null hypothesis, a test statistic is used. This is a random variable with a known
probability distribution that can be used to measure how likely we are to get such
sample data given that the null hypothesis is true. If, given the null hypothesis,
the probability of getting such a value is extremely low, then we would be inclined
to reject the null hypothesis in favor of the alternative hypothesis. On the other
hand, if the probability is not too small, then the null hypothesis might well be true
and we would have to accept it.
Some well-known test statistics include the chi-squared statistic and the Wilcoxon Rank Sum Test W statistic.
Before computing the test statistic, a significance level needs to be set. This is our cutoff in terms of what we consider to be a probability that could happen by chance, and it is typically either 5% or 1%. The probability that, given the null hypothesis, you would get sample data this extreme or worse is called the p-value of the test. Once the p-value is found, it is compared to the significance level. If the p-value is larger than the significance level, then you accept the null hypothesis. If the p-value is less than the significance level, then you reject the null hypothesis.
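As an illustration of this workflow, the following sketch (not taken from the book) applies SciPy's Wilcoxon rank-sum test to two invented samples and compares the resulting p-value against a 5% significance level; the sample values and variable names are assumptions made for the example.

```python
# Minimal sketch of the test workflow described above, using SciPy's
# Wilcoxon rank-sum test on two made-up samples (hypothetical data).
from scipy.stats import ranksums

# Hypothetical samples -- for example, response times under two store layouts.
sample_a = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1]
sample_b = [14.9, 15.2, 13.8, 16.1, 15.5, 14.4]

alpha = 0.05                          # significance level chosen before the test
stat, p_value = ranksums(sample_a, sample_b)   # normal-approximation statistic and two-sided p-value

print(f"test statistic = {stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis (the samples appear to differ).")
else:
    print("Accept the null hypothesis (no evidence of a difference).")
```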
In the first case (P is equal to a), the null hypothesis can be proved wrong in two ways as follows:
1. P is bigger than a.
2. P is smaller than a.
This kind of test is called a two-tailed hypothesis test, because in the graph of the probability distribution of the test statistic, the tails to both the left and right correspond to the rejection of the null hypothesis.
The second case, where rejection occurs because we think that P is smaller than the constant a, is called a left-tailed hypothesis test.
The third case, where rejection occurs because we think P is larger than the constant a, is called a right-tailed hypothesis test.
Collectively, the left-tailed and right-tailed hypothesis tests are called one-tailed tests.
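To make the three rejection regions concrete, the hedged sketch below (the observed statistic value and the standard normal null distribution are assumptions, not from the book) computes the two-tailed, left-tailed, and right-tailed p-values for the same test statistic.

```python
# Sketch of how the three rejection regions translate into p-value
# calculations for a test statistic z that is standard normal under the null.
from scipy.stats import norm

z = 2.1          # hypothetical observed test statistic
alpha = 0.05     # significance level

p_two_tailed   = 2 * norm.sf(abs(z))   # alternative: P differs from a in either direction
p_left_tailed  = norm.cdf(z)           # alternative: P is smaller than a
p_right_tailed = norm.sf(z)            # alternative: P is bigger than a

for name, p in [("two-tailed", p_two_tailed),
                ("left-tailed", p_left_tailed),
                ("right-tailed", p_right_tailed)]:
    decision = "reject" if p < alpha else "accept"
    print(f"{name:12s} p-value = {p:.4f} -> {decision} the null hypothesis")
```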
A.1.7 HAT diagonal
The HAT diagonal is used in conjunction with linear regression.
As discussed, linear regression involves a best fit of a collection of x,y pairs to a
mathematical equation of the form:

y = Ax + B
However, depending upon the data points, it is possible for the slope A and
intercept B to be unduly influenced by data points far from the mean of x.
The HAT diagonal test measures the leverage of each observation on the
predicted value for that observation.
This concept is demonstrated in Figure A-2, HAT diagonal influence of individual data points. The data point at x=9 is far from the data points with x-values between 0 and 2, and its paired y value has a large influence on the slope in this example. Therefore, depending upon the computed slope, the value of y at x=9 is either 5 or 9, which is a significant difference.
The HAT diagonal test identifies data pairs that exert such undue influence on the computed slope. The user may then choose to include or exclude these data points from the computed linear regression model, based on their understanding of the domain of these data points, in order to obtain a more accurate linear regression model.
Figure A-2 HAT diagonal influence of individual data points
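As a rough illustration of this influence, the sketch below fits y = Ax + B with and without a far-out point; the six (x, y) pairs are invented stand-ins, not the actual data behind Figure A-2.

```python
# Sketch showing how a single far-out x value can swing the fitted
# slope and intercept of y = Ax + B (data values are hypothetical).
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 9.0])
y = np.array([1.0, 1.6, 2.1, 2.4, 3.0, 5.0])   # hypothetical paired y values

A_all, B_all = np.polyfit(x, y, 1)               # fit including the point at x=9
A_trim, B_trim = np.polyfit(x[:-1], y[:-1], 1)   # fit excluding the point at x=9

print(f"with x=9:    y = {A_all:.2f}x + {B_all:.2f},  predicted y(9) = {A_all*9 + B_all:.1f}")
print(f"without x=9: y = {A_trim:.2f}x + {B_trim:.2f}, predicted y(9) = {A_trim*9 + B_trim:.1f}")
```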
The formula to calculate the HAT diagonal is:

H_i = \frac{m_{x^2} - 2 m_x x_i + x_i^2}{S_{xx}}

Where:

m_{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2

m_x = \frac{1}{n} \sum_{i=1}^{n} x_i

S_{xx} = \sum_{i=1}^{n} (x_i - m_x)^2

Values that are above the generally accepted cutoff of 2p/n for the HAT diagonal values should be investigated further to determine their validity. This is typically done by including more values in the regression to represent this outlying range, or by validating this data pair. Here:
- n is the number of observations used to fit the model.
- p is the number of parameters in the model.
For the charts in Figure A-2:
- p is 1 variable.
- n is 6 observations.
Therefore, any observation whose HAT is greater than 2 × 1/6 = 0.33 should be suspect.
The data point at x=9 has a HAT of 0.979, which is nearly three times larger than the cutoff, and is therefore suspect.