► Processing window

  The ingest processing window is the period during which the data is available for processing by the ingest process. For example, if data is presented to the data warehouse at 4 p.m. and the processing must be complete by 5 p.m., the processing window is one hour.
► Data volume

  Data volume refers to the quantity of data presented for ingest during each processing window. Consider both average and peak volumes and plan for peak volumes.
7.1.2 Calculate the data ingest rate
Use the values from the SLOs to calculate the rate of ingest required for each of
the data sources. Express the target data ingest rate per data node as
megabytes per second (MBps).
Calculate the ingest rate as follows (a worked example follows this list):

► Estimate the volume of data in megabytes (MB) to be ingested per ingest cycle.

► State the time in seconds allowed for the ingest process to complete.

► Divide the volume by the time. In a partitioned database, also divide by the number of data nodes receiving data.
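As an illustration, consider a hypothetical example (the figures are illustrative only, not taken from any particular workload): suppose 36,000 MB (36 GB) of raw data must be ingested per cycle, the processing window is one hour (3,600 seconds), and the database has 10 data nodes receiving data. The required rate is 36,000 MB / 3,600 s = 10 MBps in total, or 1 MBps per data node.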
The data volume and ingest rate each refer to the raw data to be ingested. Data transformations and the implementation of indexes, materialized query tables (MQTs), and multidimensional clustering (MDC) tables can significantly increase the actual data volume and the number of transactions needed in the database. After they are identified, the execution time for these additional transactions must be accommodated.
Through this process, you determine the rate at which you must be able to ingest data to meet your service level objectives. Your infrastructure must have the capacity to support this ingest rate while also supporting your service level objectives for the query and maintenance workloads. This must be the focus of your initial infrastructure and data throughput tests.
7.1.3 Analyze your ETL scenarios
This section provides a checklist to help you identify the key characteristics of
your ETL application in a systematic fashion. The key distinctions and options
are presented for each item.
Choose the one that best matches your situation. A checklist item can have
multiple answers because there might be different answers for different data
sources and tables in your project.
The checklist has the following sections:
► Determining your ETL pattern
► Transformations involving the target database
► Data volume and latency
► Populating summary (or aggregate) tables
Determine your ETL pattern
In an operational data warehouse environment, the luxury of having an offline
window at the end of each day to process data in large batches is not always
available. It is expected that data is presented for processing at frequent intervals
during the day and that data must be ingested online without affecting the
availability of data to the business. The different patterns can be described as
follows:
► Continuous feed

  Data arrives continually in the form of individual records from a data source or data feed using messaging middleware (or through an operating system pipe, or through SQL operations). The ETL processes run continuously and ingest each insert and update as it arrives. Thus, new data is constantly becoming available to business users rather than at fixed intervals. (A sketch of this pattern follows this list.)
► Concurrent batch (“Intra-day batch”)

  Several times a day, data is extracted from the source systems and prepared for ingesting into the target database. The ETL processes data in batches (files) as they arrive or on a schedule. The target table is updated at scheduled intervals, ranging from twice a day to every 15 minutes.
► Dedicated batch window (“daily batch”)

  After the close of the business day (for example, 5 p.m.), data is extracted from a source system and prepared for ingesting into the target database. The ETL application populates the target production table during a dedicated, scheduled batch window (for example, 5 p.m. to midnight).
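As a sketch of the continuous feed pattern, the DB2 INGEST utility can read from a named pipe that messaging middleware or an ETL process writes into. The pipe name, table, and field definitions shown here are hypothetical and only illustrate the shape of such a feed:

   -- Hypothetical continuous feed: an upstream process writes delimited records
   -- into the named pipe; INGEST applies them as logged inserts and keeps
   -- running until the writing end closes the pipe.
   INGEST FROM PIPE /tmp/sales_feed
      FORMAT DELIMITED
      (
         $store_id     INTEGER EXTERNAL,
         $product_code CHAR(10),
         $sale_date    DATE 'yyyy-mm-dd',
         $amount       DECIMAL(12,2) EXTERNAL
      )
      INSERT INTO mart.sales_fact (store_id, product_code, sale_date, amount)
         VALUES ($store_id, $product_code, $sale_date, $amount);

Because each row is applied as a logged insert, the target table remains available to queries while the feed is running.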
A given database, star-schema (or even a given dimension or fact table) might be
populated using more than one pattern.
Although the pattern labels emphasize that each pattern differs in terms of
latency, that is not the only or primary difference. Each pattern requires a
somewhat different approach in articulating service level objectives and deciding
which ingest methods might be suitable.
Transformations involving the target database
Data cleansing, surrogate key identification, or the application of business rules
are some of the transformation tasks that might have to occur before data is
ingested into the production database. Table 7-1 identifies the patterns that might
exist in your environment.
Table 7-1   Data transformation patterns

   None - Ingest/Insert ready:  When the data is first presented to the warehouse, the records are ready to populate directly into the production table.

   Store only:  When the data is first presented to the warehouse, the records still need one or more transformation steps. The database is used to store the data during these steps; data is extracted from a staging table in the database, transformed, and then put back into the next staging table or into the production tables.

   Process transformations:  Same as the “Store only” option, but the transformation logic is executed within the database, usually by using stored procedures. (This is the so-called “ELT” or “ETLT” approach.)
The options presented in Table 7-1 represent valid ways to allocate data
processing between the database server and an external server. Each approach
has situations where it is most appropriate, and the following examples illustrate
when each option is suitable:
► Ingest ready

  When ETL processing (for example, DataStage) exists to prepare the data, and when there is a need to minimize the load on data server resources.
► Store only

  Same as for “Ingest ready”, but data must be queried by users during intermediate stages of transformation, or the data server has to manage (that is, store) data during processing.
► Process transformations

  When there is a preference for using data server resources and capabilities, perhaps because suitable ETL processing does not exist. (A sketch of this approach follows this list.)
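To make the “Process transformations” (ELT) option concrete, the following SQL is a minimal sketch of an in-database transformation step. The staging and production table names, the surrogate key lookup, and the cleansing rule are hypothetical; in practice, this logic is usually packaged in a stored procedure:

   -- Hypothetical ELT step: transform rows from a staging table and insert them
   -- into the production fact table entirely within the database.
   INSERT INTO mart.sales_fact (store_id, product_key, sale_date, amount)
      SELECT s.store_id,
             COALESCE(d.product_key, -1),      -- surrogate key lookup, with a default
             DATE(s.sale_ts),                  -- simple transformation of the timestamp
             s.amount
      FROM stage.sales_raw s
      LEFT OUTER JOIN mart.product_dim d
             ON d.product_code = s.product_code
      WHERE s.amount IS NOT NULL;              -- simple data-cleansing rule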
These options have different effects on the database:
► Staging tables

  For each transformation step performed within the database, you need to design and manage a staging table. Using this approach also means that the data touches disk multiple times, which can increase the overall ingest time.
► Transformation logic

  You have to design and manage processing logic in the database (usually by using stored procedures).
► Recovery

  Your ability to recover data is based on your backup strategy, but also on the number of transaction logs required to complete the recovery and on any exposure to data ingested using non-logged transactions (that is, LOAD). Your ETL schedule and backup schedule must be aligned to mitigate longer recovery times.
Chapter 8, “Building a corporate backup and recovery strategy” on page 289,
discusses the use of DB2 Merge Backup in your backup strategy where full
database backups are replaced by more frequent incremental backups, thereby
helping to reduce the number of transaction logs required in a recovery scenario.
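As a minimal sketch of that approach (the database name and backup path are hypothetical, and incremental backups assume that the TRACKMOD database configuration parameter is enabled), a weekly full online backup can be combined with daily incremental online backups:

   -- Weekly full online backup (database name and path are illustrative)
   BACKUP DATABASE dwhdb ONLINE TO /backup/dwhdb INCLUDE LOGS;

   -- Daily incremental online backup: captures only the data changed since the
   -- most recent full backup, which DB2 Merge Backup can later combine with
   -- that full image
   BACKUP DATABASE dwhdb ONLINE INCREMENTAL TO /backup/dwhdb INCLUDE LOGS;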
DB2 utilities for loading data
There are two general methods available within DB2 10.1 for loading data into
your data warehouse:
► Load utility

  The load utility is the fastest way to get data into the database layer. It achieves this by loading data as non-logged transactions. The load utility is ideal for loading data into staging tables, where transformations can be applied to the data before it is inserted into the production database as a logged transaction or attached to the production table as a data partition. The key considerations here are data availability and recoverability.

  Because the load utility has full access to the table, additional features are possible. For example, GENERATED ALWAYS and SYSTEM_TIME columns can be specified in the input file. Use the load utility when you have to load data at faster speeds than the ingest utility can achieve and where other applications do not need to access the table. (Example invocations of both utilities follow this list.)
► Ingest utility

  The ingest utility, introduced in DB2 10.1, is the fastest method of getting data into the database as logged transactions. Because it has a client interface, the ingest utility is compatible with previous versions of DB2 software, and it is ideal when data is to be loaded directly into the production database.

  Use the ingest utility when you need other applications to access the table while data is being ingested, or when you need the ability to pause and continue processing or to recover from errors.
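The following commands are a minimal sketch of how each utility might be invoked; the file, table, and column names and the options shown are hypothetical, not taken from the source:

   -- Hypothetical LOAD into a staging table: fastest path, but the data is not
   -- logged (NONRECOVERABLE), so the table is exposed until the next backup and
   -- other applications cannot access it while the load runs.
   LOAD FROM /data/orders_20130601.del OF DEL
      MESSAGES /tmp/load_orders.msg
      INSERT INTO stage.orders_raw
      NONRECOVERABLE;

   -- Hypothetical INGEST directly into the production table: fully logged, can
   -- be paused and continued, and the table remains available to other
   -- applications during the ingest.
   INGEST FROM FILE /data/orders_20130601.del
      FORMAT DELIMITED
      (
         $cust_id    INTEGER EXTERNAL,
         $order_date DATE 'yyyy-mm-dd',
         $amount     DECIMAL(12,2) EXTERNAL
      )
      INSERT INTO prod.orders (cust_id, order_date, amount)
         VALUES ($cust_id, $order_date, $amount);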
Data volume and latency
The volume and velocity of data to be ingested into a data warehouse can present challenges when determining the approach to use. Identifying which of the patterns in Table 7-2 represents your environment is important before you begin to implement an architecture to support your data ingest needs.
Table 7-2 identifies the deployment patterns for the approach to ETL development.

Table 7-2   Deployment patterns for the approach to ETL development

   Regular:  Regular ingest design practices should be sufficient to meet the data volume and latency requirements.

   High performance:  Special high-performance design approaches are needed to meet the data volume and latency requirements.
High-performance design approaches include building the ability to increase parallelism and volume into all aspects of ETL component design, and taking advantage of the features available in DB2 to keep to a minimum the maintenance operations that also compete for resources with the query workload.
Here are a few situations where the “high performance” case is chosen:
► Late in the development process, during performance testing, the team discovers that ingest service levels are not being met, so additional design work to achieve higher performance is required. By this point, the hardware has already been purchased and it is unacceptable to request the purchase of additional hardware.

► The project team is “performance oriented”, wanting to obtain the best performance possible from the available server resources.
Populating summary (or aggregate) tables
The ingest process has to incorporate the population of all the tables that are
affected by the new data, not simply the initial detail or atomic data table.
Table 7-3 on page 274 identifies the summary tables.
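As a minimal sketch (the table names and grouping columns are hypothetical), a summary table can be implemented as a REFRESH DEFERRED materialized query table over the detail table and brought up to date as part of each ingest cycle:

   -- Hypothetical summary (aggregate) table defined as a deferred-refresh MQT
   CREATE TABLE mart.sales_by_store_mqt AS
      (SELECT store_id, sale_date, SUM(amount) AS total_amount
       FROM mart.sales_fact
       GROUP BY store_id, sale_date)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;

   -- After each ingest cycle completes, refresh the summary table so that
   -- queries against it reflect the newly ingested detail data
   REFRESH TABLE mart.sales_by_store_mqt;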