Chapter 4. Statistics, analytic, OLAP functions in business scenarios 183
Example 4-20 Chi-square computation
WITH c_table (prod_name, city, n, g1, g2) AS
(
SELECT prod_name, city, count(*), 2e0*(0.5e0-GROUPING(prod_name)),
2e0*(0.5e0-GROUPING(city))
FROM survey
GROUP BY CUBE (prod_name, city)
)
SELECT SUM(g1*g2*2e0*n*log(n)) as chi2
FROM c_table
The result of this query is shown in Figure 4-28.
Figure 4-28 Chi-Squared value of city and product preference relationship
The Cumulative Distribution of the Chi-square table (not shown here) shows that
under the null hypothesis, there is no statistical relationship between where the
customer lives and the products (s)he will buy. The probability of seeing a
Chi-square values of 34.1139 or higher is less than 0.001%.
Therefore, we conclude that there is a relationship between where a person lives
and the products (s)he will buy.
Step 3
Here we use the Wilcoxon Rank Sum Test to prove that Palo Alto residents will
likely buy more mortgages than the residents of San Jose.
Here too, it involves
disproving the null hypothesis, that is we disprove that
customers in Palo Alto will
not buy more mortgages than the residents of San
Jose. This involves the following steps:
a. Collect the percentage of people who responded “yes” to likely buying
mortgages for each branch in Palo Alto and San Jose. Load the statistics
to a table called SURVEY_MORTG. The columns in this table are:
Cityid - city
Branchid - branch within the city
PercentYes - Percentage of branch customers who responded YES
Important: Consider using floating point data type instead of INTEGER or
DECIMAL to avoid arithmetic exceptions such as overflow, when multiple
divide and multiplication operators involved.