Technology Choices
Due to the wide variety of technology available from many vendors, the data reservoir reference architecture is primarily a logical architecture.
This chapter covers some of the technologies available from IBM to implement the data reservoir. It supplements the implementation notes in Chapter 6, “Roadmaps for the data reservoir”.
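This chapter includes the following sections:
Technology for the data repositories
Technology for the integration and governance fabric
Technology for the raw data interaction
Technology for the catalog
Technology for the view-based interaction subsystem
Technology for the continuous analytics subsystem
Summary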
7.1 Technology for the data repositories
One of the hardest design decisions for the data reservoir is determining how and where its data will be stored.
The data reservoir repositories in the data reservoir reference architecture are logical repositories. The intent is to characterize the different dispositions that data in a data reservoir is likely to have. The following are some examples of dispositions:
Shared operational data repositories are designed to hold consolidated data for use in real time. Therefore, these repositories have data structures that are optimized for online transaction processing (OLTP) access.
Deposited data is stored in whatever format the owner of the data chooses.
Operational history repositories are formatted in the same way as the original source system that produced the data. The only change is the addition of date and time stamps showing when the values were copied to the operational history. People familiar with the data in the source systems can then use these repositories easily because they understand the context of the data values.
Audit data is in the format produced by the information protection tools that generate it. The data must be organized for the convenience of the analysts who are looking for suspicious activity.
Deep data typically has raw data from unstructured and semi-structured sources plus reference copies of enterprise data for correlation and validation during analytics processing.
Information Warehouses are structured stores that focus on creating a consolidated historical view of the organization.
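As an illustration of the operational history disposition described above, the following minimal Python sketch copies rows from a source system and adds only a date and time stamp. The function name, the `copied_at` field, and the sample rows are hypothetical and are not taken from any IBM product.

```python
from datetime import datetime, timezone


def copy_to_operational_history(source_rows, copied_at=None):
    """Return copies of the source rows with a copy timestamp added.

    The source layout is otherwise left unchanged, so people who know the
    source system can still read the data. 'copied_at' is a hypothetical
    field name chosen for this sketch.
    """
    if copied_at is None:
        copied_at = datetime.now(timezone.utc)
    stamp = copied_at.isoformat()
    # Build new dictionaries so the source rows are not modified.
    return [{**row, "copied_at": stamp} for row in source_rows]


# Hypothetical rows as they might appear in a source system.
source_rows = [
    {"account_id": "A-100", "balance": 250.0},
    {"account_id": "A-101", "balance": 75.5},
]

history_rows = copy_to_operational_history(
    source_rows, copied_at=datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc)
)
```

Each row in `history_rows` keeps the original fields and gains one extra field that records when the values were copied.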
It is possible that core data, such as data about customers, products, and key activities in the organization, is present in multiple data reservoir repositories that are formatted for different workloads.
Multiple data reservoir repositories of different types can be on the same infrastructure. Data repository infrastructure is undergoing a boom at the moment with new types of technology appearing every few months. Figure 7-1 shows a typical mapping of the data repositories to the IBM technology covered in this publication.
Figure 7-1 Data reservoir repository mapping to technology
The Apache Hadoop-based IBM InfoSphere BigInsights is a versatile data management platform. It can host many types of data reservoir repositories. However, its execution environment is designed for batch workloads and is fairly slow.
Where repositories, such as the information warehouse, appear on multiple technologies, it means that there is an implementation choice. The choice depends on the tools required and the non-functional requirements. For example, the Information Warehouse can be implemented on IBM PureData™ Systems for Analytics if there are analytics deployed to it that need the specialist hardware acceleration that PureData Systems for Analytics offers.
Not all of the repositories in the data reservoir need to be collocated. For example, IBM Cloudant® is recommended for the object cache. Cloudant typically runs as a public cloud offering, making it a good solution for providing data to systems of engagement. Cloudant databases can be part of a data reservoir where the rest of the data repositories are on-premises.
Similarly, operational history stores for IBM CICS and IBM IMS™ systems can be in a z System DB2® database that uses the IBM DB2 Analytics Accelerator. The Accelerator provides fast access to this data for analytical queries. It enables the operational history repositories to be collocated with the original sources while still having them cataloged and available as part of the data reservoir.
The roadmaps included examples of existing data repositories being incorporated into the data reservoir. This is a natural approach. The key requirement is that these repositories are cataloged and conform to the governance program associated with the data reservoir. The repositories within a data reservoir can encompass multiple technologies from multiple vendors.
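The idea that repositories can sit in different locations as long as each one is cataloged can be sketched in a few lines of Python. This is only an illustration: the `RepositoryEntry` type, the `uncataloged` check, and the location labels are assumptions made for this sketch, and the on-premises placement of the information warehouse is an example rather than a statement from the reference architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryEntry:
    """One catalog entry for a logical repository (hypothetical structure)."""
    logical_repository: str
    technology: str
    location: str


# Pairings named in this section; the locations are illustrative labels.
CATALOG = [
    RepositoryEntry("object cache", "IBM Cloudant", "public cloud"),
    RepositoryEntry(
        "operational history (CICS, IMS)",
        "DB2 with IBM DB2 Analytics Accelerator",
        "z Systems",
    ),
    RepositoryEntry(
        "information warehouse",
        "IBM PureData Systems for Analytics",
        "on-premises",  # assumed placement for the example
    ),
]


def uncataloged(deployed, catalog):
    """Return the deployed repository names that have no catalog entry."""
    cataloged = {entry.logical_repository for entry in catalog}
    return sorted(name for name in deployed if name not in cataloged)


def locations(catalog):
    """Return the distinct locations used by cataloged repositories."""
    return sorted({entry.location for entry in catalog})


missing = uncataloged(["object cache", "deep data"], CATALOG)
```

Here `missing` lists the repositories that would still need to be cataloged before they can be treated as part of the data reservoir.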
7.2 Technology for the integration and governance fabric
The data reservoir reference architecture assumes that the technology implementing the integration and governance fabric is IBM InfoSphere Information Server. The reason is that it encompasses both the governance philosophy for the data reservoir and the recognition that a big data environment is going to involve heterogeneous data and data stores.
The components of the fabric are provided as follows:
IBM InfoSphere Information Governance Catalog provides the catalog function.
IBM InfoSphere DataStage provides an information broker.
IBM Business Process Manager (BPM) provides the workflow engine.
IBM InfoSphere Information Server also includes various operational governance hubs for operational monitoring, stewardship, and compliance monitoring.
Information Server is complemented with the IBM InfoSphere Reference Data Manager product that implements the code hub.
Other types of information brokers in use in the data reservoir could be IBM InfoSphere Data Replication, IBM InfoSphere Federation Server, and IBM Integration Bus.
The IBM InfoSphere Optim and IBM InfoSphere Guardium portfolios provide various guards and monitoring capabilities for protecting data in the data reservoir. For example, IBM InfoSphere Optim Data Privacy provides masking libraries and IBM InfoSphere Guardium Data Encryption provides encryption of data both at rest and in motion.
InfoSphere Guardium also provides monitoring services for the data reservoir repositories to alert the data reservoir operations team if data is being accessed under suspicious circumstances.
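To make the idea of a masking guard more concrete, the following Python sketch replaces sensitive values with a repeatable token. It is not the InfoSphere Optim Data Privacy API; the function names, the key, and the field names are hypothetical, and a keyed hash from the Python standard library stands in for whatever masking algorithm a real masking library applies.

```python
import hashlib
import hmac


def mask_value(value: str, secret_key: bytes, length: int = 12) -> str:
    """Return a repeatable token for a value using a keyed SHA-256 hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def mask_record(record, sensitive_fields, secret_key):
    """Return a copy of the record with the sensitive fields masked."""
    return {
        field: mask_value(str(value), secret_key)
        if field in sensitive_fields
        else value
        for field, value in record.items()
    }


# Hypothetical key and record, for illustration only.
key = b"example-key-not-for-production"
customer = {"customer_id": 42, "name": "Alice Example", "segment": "retail"}
masked = mask_record(customer, {"name"}, key)
```

Because the same input and key always produce the same token, masked values can still be joined or counted in a sandbox without exposing the original values.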
7.3 Technology for the raw data interaction
IBM InfoSphere Information Server can also implement the raw data interaction subsystem:
IBM InfoSphere Information Governance Catalog provides the ability to locate the raw data that the data scientist or analyst requires.
IBM InfoSphere Data Click populates a sandbox with the data (or a sample of that data) and catalogs the sandbox, through a simple wizard.
7.4 Technology for the catalog
IBM InfoSphere Information Governance Catalog also provides the catalog repository and catalog interfaces (Figure 7-2).
Figure 7-2 The roles of the Information Governance Catalog in the data reservoir
The Information Governance Catalog has four roles:
Setting up the governance program
Locating data and provisioning sandboxes
Curating repositories and sources of information
Viewing lineage to understand the origin of data
7.5 Technology for the view-based interaction subsystem
The view-based interaction subsystem has perhaps the largest number of options in terms of how it is implemented. This subsystem reaches to the business communities. The goal in its implementation is to enable the tools that the business users want to use. Many of these tools work with simple files such as CSV files or relational databases. These formats are typically used in the Published data stores found in view-based interaction.
For the assess and feedback part of view-based interaction, often a search engine, such as Apache Solr (supported by IBM InfoSphere BigInsights), is used to enable business users to search text-based data. This capability can be augmented with information virtualization technology such as IBM InfoSphere Federation Server to provide simplified views of the data.
7.6 Technology for the continuous analytics subsystem
Within the continuous analytics subsystem are two types of engine. The streaming analytics engine is designed for processing a constant stream of information and detecting the occurrence of patterns within that data. IBM InfoSphere Streams is an ideal product for implementing this engine.
The event correlation engine processes discrete messages or events. For simple cases where the events are discrete and can be processed by a stateless engine, IBM Integration Bus is a good choice. Where events need to be correlated with each other, IBM Operational Decision Manager (ODM) is a better choice.
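The difference between stateless event processing and event correlation can be sketched in Python. This is a conceptual illustration only and does not use the IBM Integration Bus or ODM programming interfaces; the event fields, the routing rule, and the `FailedLoginCorrelator` class are hypothetical.

```python
from collections import defaultdict


def route_event(event):
    """Stateless processing: the result depends only on the event itself."""
    return "alerts" if event["type"] == "error" else "archive"


class FailedLoginCorrelator:
    """Stateful processing: earlier events are remembered and correlated."""

    def __init__(self, threshold=3, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = defaultdict(list)  # user -> failure times (seconds)

    def process(self, event):
        """Return True when a user reaches the failure threshold in the window."""
        if event["type"] != "login_failed":
            return False
        times = self._failures[event["user"]]
        times.append(event["time"])
        # Keep only the failures inside the window that ends at this event.
        cutoff = event["time"] - self.window_seconds
        times[:] = [t for t in times if t > cutoff]
        return len(times) >= self.threshold


correlator = FailedLoginCorrelator()
results = [
    correlator.process({"type": "login_failed", "user": "u1", "time": t})
    for t in (0, 10, 20)
]
```

`route_event` can run on any engine instance because it keeps no memory between events, whereas the correlator must retain state per user, which is the kind of requirement that points toward a correlation engine.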
7.7 Summary
This chapter provided a high-level mapping of the data reservoir components to IBM technology. During a data reservoir deployment, this information can be used as a starting point. However, the data reservoir requires that a proper operational model be developed to ensure that its technology meets the non-functional requirements of the organization.
 