Data preparation

376 Solving Operational Business Intelligence with InfoSphere Warehouse Advanced Edition

10.2.2 Data preparation

It has long been true that one of the most time-consuming stages in a data

mining project is data preparation. This step involves several activities to make

the source data suitable for data mining processing, including but not limited to

the following tasks:

• Integrate or consolidate data from multiple sources into a single data set

suitable for data mining

• Transform data values or calculate new data values for inclusion in the data

mining solution

• Align granularity (for example, transaction level versus daily summary) of data

from different sources

• Eliminate or correct “bad” data values in the source data, such as null values

or other errors

The result of data preparation is a data set containing all of the records required

to implement a data mining model using one of the mining methods that we

discuss in “The data mining process” on page 358.
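The four tasks listed above can be sketched in a few lines of plain Python. This is only an illustration and not how InfoSphere Warehouse performs the work: the two source lists, their field names, and the `prepare` function are hypothetical, and the choice to drop (rather than correct) null amounts is an assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical source 1: transaction-level sales (one row per purchase).
transactions = [
    {"customer_id": 1, "day": date(2012, 3, 1), "amount": 20.0},
    {"customer_id": 1, "day": date(2012, 3, 1), "amount": 5.0},
    {"customer_id": 2, "day": date(2012, 3, 1), "amount": None},  # bad value
    {"customer_id": 2, "day": date(2012, 3, 2), "amount": 12.5},
]

# Hypothetical source 2: daily summary of visits per customer.
daily_visits = [
    {"customer_id": 1, "day": date(2012, 3, 1), "visits": 3},
    {"customer_id": 2, "day": date(2012, 3, 2), "visits": 1},
]


def prepare(transactions, daily_visits):
    """Return one row per (customer, day), usable as a single mining input set."""
    # Eliminate bad values: skip transactions with a null amount (assumed policy).
    clean = [t for t in transactions if t["amount"] is not None]

    # Align granularity: roll transaction-level rows up to the daily level.
    daily_sales = defaultdict(float)
    for t in clean:
        daily_sales[(t["customer_id"], t["day"])] += t["amount"]

    # Integrate: combine both sources on the (customer_id, day) key.
    visits = {(v["customer_id"], v["day"]): v["visits"] for v in daily_visits}
    rows = []
    for (customer_id, day), total in sorted(daily_sales.items()):
        n_visits = visits.get((customer_id, day), 0)
        rows.append({
            "customer_id": customer_id,
            "day": day,
            "sales": total,
            "visits": n_visits,
            # Calculate a new value from existing ones.
            "sales_per_visit": total / n_visits if n_visits else None,
        })
    return rows


prepared = prepare(transactions, daily_visits)
```

In a warehouse setting the same steps would typically be expressed as SQL or as a data flow; the point here is only the order of operations: clean, aggregate to a common grain, join, then derive.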

Within InfoSphere Warehouse 10.1 Design Studio there are two primary steps

involved in data preparation, namely creating the input model and defining the

data preparation profile, as explained here:

1. Input model creation

The input model defines relationships within the data used for the data mining

model in terms of hierarchies and levels. This is similar to dimensional

structures in an OLAP model, and in fact OLAP models can be used to guide

the development of the input model.

2. Define the data preparation profile

The data preparation profile defines the focus of analysis (what aspect of the

data is being analyzed, such as clients in a clustering model). It then defines

the relevant properties or variables that are related to the focus of analysis.

These properties might be drawn directly from table columns or calculated or

transformed from one or more columns.
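To make the idea of a focus of analysis and its properties concrete, the following minimal sketch builds one output record per focus value and computes each property from the rows that belong to it. The table rows, the `client_id` focus column, and the two feature definitions are hypothetical; Design Studio captures the equivalent information through its wizards rather than through code like this.

```python
from collections import defaultdict

# Hypothetical source rows, for example account records joined to clients.
rows = [
    {"client_id": "C1", "balance": 100.0, "product": "savings"},
    {"client_id": "C1", "balance": 250.0, "product": "checking"},
    {"client_id": "C2", "balance": 40.0, "product": "savings"},
]

# Focus of analysis: each distinct value becomes one record in the output.
FOCUS = "client_id"

# Properties: a name plus a definition evaluated over one focus value's rows.
FEATURES = {
    "total_balance": lambda group: sum(r["balance"] for r in group),
    "product_count": lambda group: len({r["product"] for r in group}),
}


def build_dataset(rows, focus, features):
    """Group rows by the focus column and evaluate every feature per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[focus]].append(r)

    dataset = []
    for key in sorted(groups):
        record = {focus: key}
        for name, definition in features.items():
            record[name] = definition(groups[key])
        dataset.append(record)
    return dataset


dataset = build_dataset(rows, FOCUS, FEATURES)
```

Here `total_balance` is a property calculated from one column, while `product_count` is derived by transforming another; both end up as columns of a data set with exactly one row per client.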

In summary, the data preparation stage of the data mining process is an ETL

process designed to prepare source data into a single data set ready for use as

input to the data mining method. As such, traditional ETL means might be used

to perform these steps. The SQL Warehousing Tool (SQW) data flow features

found in Design Studio can be used for this purpose. Alternatively, Version 9.7

introduced new wizards in Design Studio to aid the development of input models

and data preparation profiles.

Chapter 10. Techniques for data mining in an operational warehouse 377

The input models and data preparation profiles can be seen in the InfoSphere

Warehouse 10.1 Design Studio data project explorer folders as shown in

Figure 10-10.

Figure 10-10 Data mining data preparation folders in Design Studio

To create the input model, right-click the Input Models folder and select New →

Data Preparation Input Model. The wizard asks for a model name and then

gives a choice between the following two options:

• Selecting an input model based on an OLAP Cubing Services cube model

This option allows the data mining model developer to use a predefined

dimensional model of levels and hierarchies as defined in a cube model. This

can be a significant time saver and provide synergy between the OLAP and

data mining analysis. The dimensions, hierarchies, and levels are all derived

from the OLAP metadata. In addition, the tables from which the dimensions

are defined are selected, along with the join relationships between the tables.

This greatly simplifies the development of the input model.

• Selecting a database on which the input model will be based

This option allows you to develop the input model from scratch. Tables can be

drawn from the warehouse model and the join relationships can be defined

manually. Dimensions, hierarchies, and levels can also be defined manually.

The result using this path is the same as when the OLAP model is leveraged.

The result of creating the input model is to define a set of source tables and

hierarchies for the input data. See the example in Figure 10-11.

Figure 10-11 Data mining input model from OLAP cube with hierarchies

This model was created using an input OLAP cube model. We can see the

different tables for each dimension and the hierarchy levels. Note the Calendar

Date dimension and these levels:

• Calendar Year

• Calendar Quarter

• Calendar Month

• Calendar Date
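As a rough illustration of how the four levels of such a hierarchy relate to each other, the sketch below derives each level from a single date, from coarsest to finest. The `calendar_levels` function and the label formats (for example `2012-Q3`) are assumptions made for this example; the actual level values come from the columns of the dimension table in the cube model.

```python
from datetime import date


def calendar_levels(d):
    """Return the hierarchy levels for one date, coarsest level first."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, and so on
    return {
        "Calendar Year": d.year,
        "Calendar Quarter": f"{d.year}-Q{quarter}",
        "Calendar Month": f"{d.year}-{d.month:02d}",
        "Calendar Date": d.isoformat(),
    }


levels = calendar_levels(date(2012, 8, 15))
```

Because every date maps to exactly one month, quarter, and year, data recorded at the date level can be aggregated to any higher level of the hierarchy.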

After the input model is defined, the data preparation profile can be created and

specified. Recall that the purpose of the data preparation profile is to define the

transformations and calculations necessary to prepare the data set for data

mining. To create the profile, select the Data Preparation Profiles folder,

right-click, and select New → Data Preparation Profile. Give the profile a name

and link it to an input model (like the one created here). Figure 10-12 displays a

sampling of what is shown in the data preparation profile panel.

Figure 10-12 Data preparation profile example in Design Studio

The profile consists of essentially two primary elements:

• Focus of Analysis is the column or columns that are the focus of the data

mining analysis; for example, the customers for whom we are doing

segmentation.

• Features or Focus Attributes are the variables, defined in terms of input data

columns, that influence the data mining method output. The features are

defined in terms of a Name → Definition → Output Column triplet.

– The name is usually the name of the feature that has meaning to the

developer, such as a business name.

– The definition is in terms of a database column or columns and various

functions (transformations) that can be applied to those columns in terms

of calculations, aggregations, discretization, and so on.

Discretization is the process of distilling a continuous numeric value (for example, income) into a

series of discrete buckets, such as high, medium, and low.
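A minimal sketch of discretization, using the income example: the `discretize` function and the two threshold values are hypothetical, chosen only to show the mapping from a continuous value to a bucket label.

```python
def discretize(value, low_max, high_min):
    """Map a continuous value to 'low', 'medium', or 'high'.

    Values below low_max are 'low', values from low_max up to (but not
    including) high_min are 'medium', and everything else is 'high'.
    """
    if value < low_max:
        return "low"
    if value < high_min:
        return "medium"
    return "high"


# Assumed thresholds for illustration only.
incomes = [18_000, 45_000, 120_000]
buckets = [discretize(x, low_max=30_000, high_min=80_000) for x in incomes]
```

In practice the bucket boundaries would be chosen from the distribution of the data or from business rules rather than fixed by hand.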
