► Processing window

  The ingest processing window is the period during which the data is available for processing by the ingest process. For example, if data is presented to the data warehouse at 4 p.m. and the processing must be complete by 5 p.m., the processing window is one hour.
► Data volume

  Data volume refers to the quantity of data presented for ingest during each processing window. Consider both average and peak volumes and plan for peak volumes.
7.1.2 Calculate the data ingest rate
Use the values from the SLOs to calculate the rate of ingest required for each of
the data sources. Express the target data ingest rate per data node as
megabytes per second (MBps).
Calculate the ingest rate as follows (a worked example follows this list):

► Estimate the volume of data in megabytes (MB) to be ingested per ingest cycle.

► State the time in seconds allowed for the ingest process to complete.

► Divide the volume by the time. In a partitioned database, also divide by the number of data nodes receiving data.
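As an illustration, consider a hypothetical example (the figures are illustrative only, not taken from any particular workload): suppose 36,000 MB (36 GB) of raw data must be ingested per cycle, the processing window is one hour (3,600 seconds), and the database has 10 data nodes receiving data. The required rate is 36,000 MB / 3,600 s = 10 MBps in total, or 1 MBps per data node.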
The data volume and ingest rate each refer to the raw data to be ingested. Data transformations and the implementation of indexes, materialized query tables (MQTs), and multidimensional clustering (MDC) tables can significantly increase the actual data volume and the number of transactions needed in the database. After they are identified, the execution time for these additional transactions must be accommodated.
Through this process, you determine the rate at which you must be able to ingest data to meet your service level objectives. Your infrastructure must have the capacity to support this ingest rate while also supporting your service level objectives for the query and maintenance workloads. This must be the focus of your initial infrastructure and data throughput tests.
7.1.3 Analyze your ETL scenarios
This section provides a checklist to help you identify the key characteristics of
your ETL application in a systematic fashion. The key distinctions and options
are presented for each item.
Choose the one that best matches your situation. A checklist item can have
multiple answers because there might be different answers for different data
sources and tables in your project.
The checklist has the following sections:
► Determining your ETL pattern
► Transformations involving the target database
► Data volume and latency
► Populating summary (or aggregate) tables
Determine your ETL pattern
In an operational data warehouse environment, the luxury of having an offline
window at the end of each day to process data in large batches is not always
available. It is expected that data is presented for processing at frequent intervals
during the day and that data must be ingested online without affecting the
availability of data to the business. The different patterns can be described as
follows:
► Continuous feed

  Data arrives continually in the form of individual records from a data source or data feed using messaging middleware (or through an operating system pipe, or through SQL operations). The ETL processes run continuously and ingest each insert and update as it arrives. Thus, new data is constantly becoming available to business users rather than at fixed intervals. (A sketch of this pattern follows this list.)
► Concurrent batch (“Intra-day batch”)

  Several times a day, data is extracted from the source systems and prepared for ingesting into the target database. The ETL processes data in batches (files) as they arrive or on a schedule. The target table is updated at scheduled intervals, ranging from twice a day to every 15 minutes.
► Dedicated batch window (“daily batch”)

  After the close of the business day (for example, 5 p.m.), data is extracted from a source system and prepared for ingesting into the target database. The ETL application populates the target production table during a dedicated, scheduled batch window (for example, 5 p.m. to midnight).
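As a sketch of the continuous feed pattern, the DB2 INGEST utility can read from a named pipe that messaging middleware or an ETL process writes into. The pipe name, table, and field definitions shown here are hypothetical and only illustrate the shape of such a feed:

   -- Hypothetical continuous feed: an upstream process writes delimited records
   -- into the named pipe; INGEST applies them as logged inserts and keeps
   -- running until the writing end closes the pipe.
   INGEST FROM PIPE /tmp/sales_feed
      FORMAT DELIMITED
      (
         $store_id     INTEGER EXTERNAL,
         $product_code CHAR(10),
         $sale_date    DATE 'yyyy-mm-dd',
         $amount       DECIMAL(12,2) EXTERNAL
      )
      INSERT INTO mart.sales_fact (store_id, product_code, sale_date, amount)
         VALUES ($store_id, $product_code, $sale_date, $amount);

Because each row is applied as a logged insert, the target table remains available to queries while the feed is running.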
A given database, star-schema (or even a given dimension or fact table) might be
populated using more than one pattern.
Although the pattern labels emphasize that each pattern differs in terms of
latency, that is not the only or primary difference. Each pattern requires a
somewhat different approach in articulating service level objectives and deciding
which ingest methods might be suitable.
Transformations involving the target database
Data cleansing, surrogate key identification, or the application of business rules
are some of the transformation tasks that might have to occur before data is
ingested into the production database. Table 7-1 identifies the patterns that might
exist in your environment.
Table 7-1   Data transformation patterns

   None - Ingest/Insert ready:  When the data is first presented to the warehouse, the records are ready to populate directly into the production table.

   Store only:  When the data is first presented to the warehouse, the records still need one or more transformation steps. The database is used to store the data during these steps; data is extracted from a staging table in the database, transformed, and then put back into the next staging table or into the production tables.

   Process transformations:  Same as the “Store only” option, but the transformation logic is executed within the database, usually by using stored procedures. (This is the so-called “ELT” or “ETLT” approach.)
The options presented in Table 7-1 represent valid ways to allocate data
processing between the database server and an external server. Each approach
has situations where it is most appropriate, and the following examples illustrate
when each option is suitable:
► Ingest ready

  When ETL processing (for example, DataStage) exists to prepare the data, and when there is a need to minimize the load on data server resources.
► Store only

  Same as for “Ingest ready”, but data must be queried by users during intermediate stages of transformation, or the data server has to manage (that is, store) data during processing.
► Process transformations

  When there is a preference for using data server resources and capabilities, perhaps because suitable ETL processing does not exist. (A sketch of this approach follows this list.)
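To make the “Process transformations” (ELT) option concrete, the following SQL is a minimal sketch of an in-database transformation step. The staging and production table names, the surrogate key lookup, and the cleansing rule are hypothetical; in practice, this logic is usually packaged in a stored procedure:

   -- Hypothetical ELT step: transform rows from a staging table and insert them
   -- into the production fact table entirely within the database.
   INSERT INTO mart.sales_fact (store_id, product_key, sale_date, amount)
      SELECT s.store_id,
             COALESCE(d.product_key, -1),      -- surrogate key lookup, with a default
             DATE(s.sale_ts),                  -- simple transformation of the timestamp
             s.amount
      FROM stage.sales_raw s
      LEFT OUTER JOIN mart.product_dim d
             ON d.product_code = s.product_code
      WHERE s.amount IS NOT NULL;              -- simple data-cleansing rule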
These options have different effects on the database:
► Staging tables

  For each transformation step performed within the database, you need to design and manage a staging table. Using this approach also means that the data touches disk multiple times, which can increase the overall ingest time.
► Transformation logic

  You have to design and manage processing logic in the database (usually by using stored procedures).
► Recovery

  Your ability to recover data is based on your backup strategy, but also on the number of transaction logs required to complete the recovery and on any exposure to data ingested using non-logged transactions (that is, LOAD). Your ETL schedule and backup schedule must be aligned to mitigate longer recovery times.
Chapter 8, “Building a corporate backup and recovery strategy” on page 289,
discusses the use of DB2 Merge Backup in your backup strategy where full
database backups are replaced by more frequent incremental backups, thereby
helping to reduce the number of transaction logs required in a recovery scenario.
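As a minimal sketch of that approach (the database name and backup path are hypothetical, and incremental backups assume that the TRACKMOD database configuration parameter is enabled), a weekly full online backup can be combined with daily incremental online backups:

   -- Weekly full online backup (database name and path are illustrative)
   BACKUP DATABASE dwhdb ONLINE TO /backup/dwhdb INCLUDE LOGS;

   -- Daily incremental online backup: captures only the data changed since the
   -- most recent full backup, which DB2 Merge Backup can later combine with
   -- that full image
   BACKUP DATABASE dwhdb ONLINE INCREMENTAL TO /backup/dwhdb INCLUDE LOGS;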
DB2 utilities for loading data
There are two general methods available within DB2 10.1 for loading data into
your data warehouse:
► Load utility

  The load utility is the fastest way to get data into the database layer. It achieves this by loading data as non-logged transactions. The load utility is ideal for loading data into staging tables, where transformations can be applied to the data before it is inserted into the production database as a logged transaction or attached to the production table as a data partition. The key considerations here are data availability and recoverability.

  Because the load utility has full access to the table, additional features are possible. For example, GENERATED ALWAYS and SYSTEM_TIME columns can be specified in the input file. Use the load utility when you have to load data at faster speeds than the ingest utility can achieve and where other applications do not need to access the table. (Example invocations of both utilities follow this list.)
► Ingest utility

  The ingest utility, introduced in DB2 10.1, is the fastest method of getting data into the database as logged transactions. Because it has a client interface, the ingest utility is compatible with previous versions of DB2 software, and it is ideal when data is to be loaded directly into the production database.

  Use the ingest utility when you need other applications to access the table while data is being ingested, or when you need the ability to pause and continue processing or to recover from errors.
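The following commands are a minimal sketch of how each utility might be invoked; the file, table, and column names and the options shown are hypothetical, not taken from the source:

   -- Hypothetical LOAD into a staging table: fastest path, but the data is not
   -- logged (NONRECOVERABLE), so the table is exposed until the next backup and
   -- other applications cannot access it while the load runs.
   LOAD FROM /data/orders_20130601.del OF DEL
      MESSAGES /tmp/load_orders.msg
      INSERT INTO stage.orders_raw
      NONRECOVERABLE;

   -- Hypothetical INGEST directly into the production table: fully logged, can
   -- be paused and continued, and the table remains available to other
   -- applications during the ingest.
   INGEST FROM FILE /data/orders_20130601.del
      FORMAT DELIMITED
      (
         $cust_id    INTEGER EXTERNAL,
         $order_date DATE 'yyyy-mm-dd',
         $amount     DECIMAL(12,2) EXTERNAL
      )
      INSERT INTO prod.orders (cust_id, order_date, amount)
         VALUES ($cust_id, $order_date, $amount);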
Data volume and latency
The volume and velocity of data to be ingested into a data warehouse can present challenges when determining the approach to use. Identifying which of the patterns in Table 7-2 represents your environment is important before you begin to implement an architecture to support your data ingest needs.
Table 7-2 identifies the deployment patterns for the approach to ETL development.

Table 7-2   Deployment patterns for the approach to ETL development

   Regular:  Regular ingest design practices should be sufficient to meet the data volume and latency requirements.

   High performance:  Special high-performance design approaches are needed to meet the data volume and latency requirements.
High-performance design approaches include building the ability to increase parallelism and volume into all aspects of ETL component design, and taking advantage of the features available in DB2 to keep to a minimum the maintenance operations that also compete for resources with the query workload.
Here are a few situations where the “high performance” case is chosen:
► Late in the development process, during performance testing, the team discovers that ingest service levels are not being met, so additional design work to achieve higher performance is required. By this point, the hardware has already been purchased and it is unacceptable to request the purchase of additional hardware.

► The project team is “performance oriented”, wanting to obtain the best performance possible from the available server resources.
Populating summary (or aggregate) tables
The ingest process has to incorporate the population of all the tables that are
affected by the new data, not simply the initial detail or atomic data table.
Table 7-3 on page 274 identifies the summary tables.
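As a minimal sketch (the table names and grouping columns are hypothetical), a summary table can be implemented as a REFRESH DEFERRED materialized query table over the detail table and brought up to date as part of each ingest cycle:

   -- Hypothetical summary (aggregate) table defined as a deferred-refresh MQT
   CREATE TABLE mart.sales_by_store_mqt AS
      (SELECT store_id, sale_date, SUM(amount) AS total_amount
       FROM mart.sales_fact
       GROUP BY store_id, sale_date)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;

   -- After each ingest cycle completes, refresh the summary table so that
   -- queries against it reflect the newly ingested detail data
   REFRESH TABLE mart.sales_by_store_mqt;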