Patterns for data integration

Data integration is implemented using three fundamental patterns:

  • Federation
  • Population
  • Synchronization

Federation

The federation pattern is a simple data integration pattern that provides access to different data sources, and gives the calling application the impression that these sources are a single, logical data source. This is achieved as follows:

  1. Expose a single consistent interface to the application.
  2. Translate the interface to whatever interface is needed for the underlying data.
  3. Compensate for any differences in function between the different data sources.
  4. Allow data from different sources to be combined into a single result set that is returned to the user.

This is illustrated in the following diagram:

Federation

The federation pattern as shown in this diagram can be broken down into the following logical building blocks:

  • The calling applications have the need for information, but they don't possess the information.
  • The federation building block uses metadata to determine where the data required is stored, and in what format. The metadata repository allows the decomposition of a single query executed against the federation building block, into individual requests to different data sources. To the user (the calling application), the information model appears to be a single virtual repository. The data is accessed via suitable adapters for each target repository. The federation component sends an individual result to the calling application, and integrates several different formats into a shared federated schema.
  • The source applications have the information that is important for the calling applications.

The federation pattern supports structured and unstructured data, together with read-only and read/write accesses to the underlying data sources. Read/write accesses should be limited, wherever possible, to a single data source, as otherwise a two-phase commit is needed, which can be difficult in distributed databases.

Uses

Federation is used for the following purposes:

  • The data needed by an application is distributed across different databases (for historic, technical, or organizational reasons)
  • Federation is more effective than other data integration technologies, when:
    • Near real-time access is needed for rapidly changing data
    • Making a consolidated copy of the data is not possible for technical, legal, or other reasons
    • Read/write access must be possible
    • Reducing or limiting the number of copies of the data is a goal
  • It is possible to continue to make use of existing investments

Population

The population pattern has a very simple model. It gathers data from one or more data sources, processes the data in an appropriate way, and applies it to a target database. In its simplest form, the population pattern is based on the read dataset-process data-write dataset model. This corresponds to the classic ETL (Extract, Transform, and Load process.

This is illustrated in the following diagram:

Population

The population pattern can be broken down into the following logical components:

  • The target applications have a need for information, which they do not possess. Therefore, a copy from another data source in a source application is required.
  • The population component reads one or more data sources in the source application, and writes the data to a data source in the target application. The rules for extracting data from the source application can be as complex as necessary. They range from simple rules, such as read all data, to more complex rules where only specific fields in specific records can be read under certain conditions. The loading rules for the target database can vary from a simple overwrite of the data, to a more complex process of inserting new records and updating existing ones. The metadata is used to describe these rules.
  • The source applications have the important information needed by the target applications.

Uses

Population is used for the following purposes:

  • A specialized copy of existing data (derived data) is needed:
    • Subsets of existing data sources
    • A modified version of an existing data source
    • Combinations of existing data sources
  • Only read access to the derived data in the target application is possible (or only a few write accesses).
  • In the case of a significant number of write accesses, the two-way synchronization pattern should be used.
  • The user must be provided with quick access to the information required, instead of being bombarded with too much, irrelevant, incorrect, or otherwise useless misinformation.
  • However, IT drivers often dictate the use of the population pattern. In other words, the copies of data are made for technical reasons. These drivers include:
    • Improved performance of user access
    • Load distribution across systems

Synchronization

The synchronization pattern (also known as the replication pattern) enables bidirectional update flows of data in a multi-copy database environment. The "two-way" synchronization aspect of this pattern is what distinguishes it from the "one-way" capabilities provided by the population pattern.

This is illustrated in the following diagram:

Synchronization

The synchronization pattern shown in this diagram can be broken down into the following logical components:

  • The target applications have a need for information, which they do not possess. Therefore, a copy from another data source in a source application is required.
  • At a simplistic level, the synchronization component can be compared to the population pattern, with the only difference being that the data flows in both directions. If the data elements flowing in both directions are fully independent, then two-way synchronization is no more than two separate instances of the population pattern. However, it is more common to find some overlap between the datasets flowing in either direction. In this case, conflict detection and resolution are needed.
  • The source applications have information which is relevant to the target applications.

Uses

Synchronization is used for the following purpose:

  • A specialized copy of existing data (derived data) is needed. This copy can take different forms:
    • Subsets of existing data sources
    • A modified version of an existing data source
    • Combinations of existing data sources

Multi-step synchronization

There is one variant of the synchronization pattern: the multi-step variant. The multi-step variant of the two-way synchronization pattern makes use of one instance of the population pattern, with its gather, process, and apply functions, for each of the two synchronization directions. An additional "reconcile" function is placed between the two data flows, and guarantees that there are no conflicts in the updates. If the opportunities for conflicts are minimal, this pattern can be constructed from existing population components. However, a specialized solution should be used for more complex situations.

The following diagram illustrates the "reuse" of the population pattern, once for each direction with the additional "reconcile" component in the middle.

Multi-step synchronization
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset