232 High-Function Business Intelligence in e-business
A.1.13 Sampling
Critical to statistics is the taking of a sample from a very large population on
which to perform analyses. Two important factors that apply to a sample are the
following:
1. Size of the sample, that is, the number of units selected from the entire
population.
2. Quality of the sample (how good, or representative, the sample is) vis-a-vis
the population from which it was extracted.
The selection process is critical to the sample. For example, excluding minorities
in a voter survey would significantly taint the results of any analysis of that
sample.
A sampling procedure that ensures that all possible samples of n objects are
equally likely is called a Simple Random Sample. A simple random sample has
two properties that make it the standard against which all other methods are
measured:
1. Unbiased: each object has the same chance of being chosen.
2. Independent: the selection of one object has no influence on the selection of
the other objects.
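As a minimal sketch of a simple random sample, assuming Python and its standard
random module: the function name, the population of unit IDs, and the seed below
are hypothetical and serve only as illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)              # a seed makes the draw repeatable
    return rng.sample(list(population), n)  # sampling without replacement

# Hypothetical population: 100 unit IDs numbered 1..100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

The draw is made without replacement, so no unit can appear twice in the sample.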
Deriving totally unbiased, independent samples may not be cost effective, and
other methods are used to come up with efficient and cost-effective samples. For
example, knowing something about the population allows for different techniques
such as Stratified sampling, Cluster sampling and Systematic sampling.
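The three techniques can be sketched roughly as follows, assuming Python. The
function names, the proportional allocation used for the strata, and all of the
data are hypothetical; real survey designs often use more elaborate rules.

```python
import random
from collections import defaultdict

def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th unit (assumes n <= len)."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, key, fraction, rng):
    """Draw the same fraction at random from each stratum (group)."""
    strata = defaultdict(list)
    for unit in population:
        strata[key(unit)].append(unit)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(7)  # hypothetical seed, for repeatability only

# Hypothetical population: 100 (id, region) pairs, 25 "north" and 75 "south".
people = [(i, "north" if i % 4 == 0 else "south") for i in range(100)]
every_tenth = systematic_sample(people, 10, rng)
by_region = stratified_sample(people, key=lambda p: p[1], fraction=0.2, rng=rng)

# Hypothetical clusters: units grouped by the store they belong to.
stores = {"store_a": [1, 2, 3], "store_b": [4, 5], "store_c": [6, 7, 8, 9]}
two_stores = cluster_sample(stores, 2, rng)
```

In this sketch, stratified sampling keeps each region represented in proportion
to its size, while cluster sampling trades some representativeness for the
convenience of surveying whole groups.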
A.1.14 Transposition
Transposition is simply moving the rows to the columns and vice versa. This is
sometimes called Pivoting.
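As a small illustration, assuming Python and a table held as a list of rows
(the sales figures are hypothetical), transposition can be sketched as:

```python
# Hypothetical table: one row per region, one column per quarter.
rows = [
    ["region", "Q1", "Q2"],
    ["East", 10, 12],
    ["West", 7, 9],
]

# zip(*rows) pairs up the i-th entry of every row, so each column becomes a row.
transposed = [list(column) for column in zip(*rows)]

for row in transposed:
    print(row)
# ['region', 'East', 'West']
# ['Q1', 10, 7]
# ['Q2', 12, 9]
```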
A.1.15 Histograms
A histogram is a graphical representation of the distribution of a set of data.
A histogram lets us see the shape of a set of data: where its center is, and how
far it spreads out on either side. It can also give a visual impression of other
statistics, such as spread and skewness.
Skewness describes whether the longer tail of the distribution lies to the left
or to the right. Right-hand skewness is referred to as positive, and left-hand
skewness as negative.
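The sign convention can be checked numerically. The sketch below, in Python,
uses the moment-based coefficient (third central moment divided by the 1.5
power of the second), which is one common definition of skewness and is an
assumption here, since the text does not give a formula. The data values are
hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumes values are not all equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]    # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)   # True: positive (right-hand) skew
print(skewness(left_tailed) < 0)    # True: negative (left-hand) skew
```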
Appendix A. Introduction to statistics and analytic concepts 233
Individual data points are grouped together into ranges in order to visualize how
frequently data in each range occurs within the data set. High bars indicate more
data in a given range, and low bars indicate less data. In the histogram shown in
Figure A-3, the peak is in the 20-39 range, where there are five points.
Figure A-3 Histogram
The popularity of a histogram comes from its intuitive, easy-to-read picture of
the location and variation in a data set. There are, however, two weaknesses of
histograms that you should bear in mind:
- Histograms can be manipulated to show different pictures. If too few or too
many bars are used, the histogram can be misleading. This is an area which
requires some judgment, and perhaps some experimentation, based on the
analyst's experience.
- Histograms can also obscure the time differences among data sets. For
example, if we looked at a histogram of the number of births per day in the
United States in 1996, we would miss any seasonal variations, e.g. peaks
around the times of full moon.
Likewise, in quality control, a histogram of a process run tells only one part of
a long story. There is a need to keep reviewing the histograms and control
charts for consecutive process runs over an extended time to gain useful
knowledge about a process.
A.1.15.1 Equi-width histograms
An equi-width histogram is a special case of the above histogram, with the
requirement that the x-axis ranges all have the same width, while the y-axis
represents the frequency (the number of data points) within each of those
x-axis ranges.
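A minimal sketch of how the counts for an equi-width histogram might be
computed, assuming Python: the function name, the number of ranges, and the
data values are hypothetical, and the ranges are assumed to span the minimum
to the maximum of the data.

```python
def equi_width_histogram(values, n_bins):
    """Split [min, max] into n_bins equal-width ranges and count values in each.

    Assumes the values are not all identical (otherwise the width would be 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value would land one past the end, so clamp it to the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical data set of 12 points.
data = [3, 8, 15, 22, 25, 31, 34, 38, 45, 52, 67, 80]
edges, counts = equi_width_histogram(data, 4)

# Crude text rendering: one '*' per data point in each range.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f} | {'*' * count}")
```

Changing the number of ranges passed to the function is a quick way to see how
the choice of bars alters the picture.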
A.1.15.2 Equi-height histograms
An equi-height histogram is similar to the typical histogram discussed previously
in that it graphically represents the distribution of data, but it uses a slightly
different scale. In an equi-height histogram the height of each bar is equal, that
is, each range holds the same number of data points, and the width of each
x-axis range varies.
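A minimal sketch of the idea, assuming Python: the data are sorted and cut into
groups of (nearly) equal size, so the range boundaries follow the data. The
function name and data values are hypothetical, and this simple version does
not treat repeated values specially.

```python
def equi_height_histogram(values, n_buckets):
    """Return (low, high, count) per bucket, with counts as equal as possible.

    Assumes len(values) >= n_buckets so that no bucket is empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(n_buckets):
        start = i * n // n_buckets        # first position in this bucket
        end = (i + 1) * n // n_buckets    # one past the last position
        chunk = ordered[start:end]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

# Hypothetical data set of 12 points.
data = [3, 8, 15, 22, 25, 31, 34, 38, 45, 52, 67, 80]
for low, high, count in equi_height_histogram(data, 4):
    print(f"{low:3d} to {high:3d}: {count} points")
```

Every bucket holds three points, but the ranges grow wider where the data are
sparse.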
Figure A-4 Equi-height or frequency histogram