376 Solving Operational Business Intelligence with InfoSphere Warehouse Advanced Edition
10.2.2 Data preparation
It has long been true that one of the most time-consuming stages in a data
mining project is data preparation. This step involves several activities to make
the source data suitable for data mining processing, including but not limited to
the following tasks:
???? Integrate or consolidate data from multiple sources into a single data set
suitable for data mining
???? Transfer data values or calculate new data values for inclusion in the data
mining solution
???? Align granularity (for example, transaction level versus daily summary) of data
from different sources
???? Eliminate or correct “bad” data values in the source data, such as null values
or other errors
The result of data preparation is a data set containing all of the records required
to implement a data mining model using one of the mining methods that we
discuss “The data mining process” on page 358.
Within InfoSphere Warehouse 10.1 Design Studio there are two primary steps
involved in data preparation, namely creating the input model and defining the
data preparation profile, as explained here:
1. Input model creation
The input model defines relationships within the data used for the data mining
model in terms of hierarchies and levels. This is similar to dimensional
structures in an OLAP model, and in fact OLAP models can be used to guide
the development of the input model.
2. Define the data preparation profile
The data preparation profile defines the focus of analysis (what aspect of the
data is being analyzed, such as clients in a clustering model). It then defines
the relevant properties or variables that are related to the focus of analysis.
These properties might be drawn directly from table columns or calculated or
transformed from one or more columns.
In summary, the data preparation stage of the data mining process is an ETL
process designed to prepare source data into a single data set ready for use as
input to the data mining method. As such, traditional ETL means might be used
to perform these steps. The SQL Warehousing Tool (SQW) data flow features
found in Design Studio can be used for this purpose. Alternatively, Version 9.7
introduced new wizards in Design Studio to aid the development of input models
and data preparation profiles.