224 High-Function Business Intelligence in e-business
A linear relationship can be expressed in the form:
$$y = Ax + B$$
Where:
x is an independent variable.
A is the slope of the line.
B is the intercept with the y-axis.
y is the dependent variable.
A simple regression tries to find the best fit for A and B given a number of data
points.
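As an illustration, the fit can be sketched in a few lines of Python. This sketch assumes that "best fit" means ordinary least squares (minimizing the sum of squared vertical distances), which the text does not state explicitly. The function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (A, B) for y = A*x + B using ordinary least squares.

    Assumption: "best fit" is taken to mean least squares.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # A
    intercept = mean_y - slope * mean_x    # B
    return slope, intercept


# Hypothetical data points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
A, B = fit_line(xs, ys)
print(f"A (slope) = {A}, B (intercept) = {B}")
```

With real data the points will not lie exactly on a line, which is why a measure of fit quality is needed.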
Once the model coefficients (A and B) are computed, we need to know the
accuracy of the model. This is critical in order to understand the reliability of any
extrapolations.
The coefficient of determination, or r squared ($r^2$), provides us with
information about the accuracy of the linear regression model that was
computed.
A.1.5.1 $R^2$ (R Squared)
The efficacy of the regression model is often measured by how much of the
variability of the data it explains.
$R^2$ (also known as the coefficient of determination) can be interpreted as the
proportion of variability in the data values that is explained by the model.
The value of $R^2$ ranges between 0 and 1. The closer it is to 1, the better the
model does in explaining the data.
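The interpretation above can be made concrete with a short Python sketch. It assumes the common definition of the coefficient of determination as 1 minus the ratio of residual variability to total variability; the text gives only the verbal interpretation. The function name `r_squared`, the data, and the coefficients are hypothetical.

```python
def r_squared(ys, predicted):
    """Return 1 - SS_res / SS_tot for observed values and model predictions."""
    if len(ys) != len(predicted) or not ys:
        raise ValueError("observed and predicted values must match in length")
    mean_y = sum(ys) / len(ys)
    # Total variability of the data around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("observed values are constant; the ratio is undefined")
    # Variability left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1 - ss_res / ss_tot


# Hypothetical data, with the least squares line y = 0.6x + 2.2 for it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = 0.6, 2.2
predicted = [slope * x + intercept for x in xs]
r2 = r_squared(ys, predicted)
print(f"R squared = {r2:.2f}")
```

For this data the model explains about 60% of the variability, so extrapolations from it should be treated with some caution.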
A.1.6 Hypothesis testing
Hypothesis testing is a major part of inferential statistics. It is a formal procedure
for collecting sample data and then using this data to decide whether a given
hypothesis should be rejected.
Attention: A hypothesis is a claim or statement about the state of the world.
The null hypothesis is the logical negation of the hypothesis.
Appendix A. Introduction to statistics and analytic concepts 225
The hypothesis often takes the form of a statement about an unknown
population¹ parameter, or the relation between unknown population parameters.
A hypothesis test begins with two mutually exclusive statements about a
population, for example:
1. The average weight of mountain lions is 150 pounds.
2. The average weight of mountain lions is not 150 pounds.
Often the statements will refer to a population parameter such as a population
mean. Sometimes they refer to more than one population, such as a claim that
the means of four different populations are all equal.
Since the population parameter is a number (call it PP), these statements will
have one of the following three different forms, where a is a constant.
1. PP is equal to a versus PP is not equal to a.
2. PP is greater than or equal to a versus PP is less than a.
3. PP is less than or equal to a versus PP is greater than a.
The hypothesis that includes equality is called the null hypothesis, while the one
that does not include equality is called the alternative hypothesis.
In each of the three forms listed above, the first statement is the null
hypothesis.
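The three forms can be written compactly as follows. The symbols $H_0$ (null hypothesis) and $H_1$ (alternative hypothesis) are conventional notation assumed here; the text itself uses only PP and a.

```latex
\begin{align*}
\text{Form 1:}\quad & H_0 : PP = a    & \text{versus}\quad & H_1 : PP \neq a \\
\text{Form 2:}\quad & H_0 : PP \geq a & \text{versus}\quad & H_1 : PP < a \\
\text{Form 3:}\quad & H_0 : PP \leq a & \text{versus}\quad & H_1 : PP > a
\end{align*}
```

In the mountain lion example, PP is the average weight and a is 150 pounds, which corresponds to Form 1.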
The sample data provides the way to distinguish between the null and alternative
hypothesis. For example, if the null hypothesis claims that the population mean is
10, while the sample mean turns out to be 5 and the sample data is very
representative of the population, then the odds are good that the null hypothesis
is wrong. Likewise, if the claim is that the population mean is greater than or
equal to 10, and the sample mean is 5, then the odds are good that the null
hypothesis is wrong, and similarly for the third form listed above.
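One way to quantify "the odds are good" is a test statistic. The sketch below assumes a one-sample t-test, which is one common choice and is not named in the text. The function name `one_sample_t`, the sample values, and the 5% significance level are hypothetical choices made for illustration.

```python
import math
import statistics


def one_sample_t(sample, claimed_mean):
    """Return the t statistic for the null hypothesis: mean == claimed_mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    if sample_sd == 0:
        raise ValueError("sample has no variability; t is undefined")
    standard_error = sample_sd / math.sqrt(n)
    # Distance between sample mean and claimed mean, in standard errors.
    return (sample_mean - claimed_mean) / standard_error


# Hypothetical sample with mean 5; the null hypothesis claims the mean is 10.
sample = [4.0, 5.0, 6.0, 5.0, 5.0]
t_stat = one_sample_t(sample, claimed_mean=10.0)

# Two-sided 5% critical value of the t distribution with 4 degrees of freedom.
CRITICAL_T = 2.776
reject_null = abs(t_stat) > CRITICAL_T
print(f"t = {t_stat:.2f}, reject null hypothesis: {reject_null}")
```

A statistic this far from zero indicates that a sample mean of 5 would be very unlikely if the population mean were really 10, so the null hypothesis is rejected.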
If we do not reject the null hypothesis, we accept it provisionally; failing to
reject it does not affirm that it is true.
¹ A population is a collection of all data points of interest.
Important: In the case of an equality hypothesis, the sample data will rarely
prove the null hypothesis to be true. It will be very difficult to convince
someone that the population mean is exactly 10, no matter how large a
sample we gather.