CHAPTER 6

Common Challenges

The challenge is to integrate data in a way that provides complete and valuable information to be used by the business. Data integrations are not simple because of the different systems and types of data being pulled together. There might not have been a plan to use data from a particular application, or there might not be a complete source for the data needed. Finding the source of the data that is needed, and then validating how that source is maintained, are critical steps. Depending on the number of sources and copies of data, the complexity might increase even further.

Planning for data integrations from the beginning makes the process easier, but designing applications to anticipate data needs is the more complicated task because the world of Big Data and information is in constant change. This is not always going to be a technical issue or solution, and other groups will need to be involved to help solve the problem from the business side. Certain tools and technology can help along the way, and there will need to be changes in applications or in how data is viewed or sourced. The tools, data sources, and types of data add to the common issues that make data integrations challenging.

Examining the Issues

It would be nice if we could just issue a few simple queries to get the needed data and then be able to join that information. Unfortunately, the sources of data are often different databases, applications, and other reports. Issues arise due to the volume of data and the ever-increasing number of applications companies are using. Planning for the common issues creates more effective data integrations. So, what are the common issues, and where are the areas that can cause trouble in working on the data integrations?

A couple of areas have been mentioned already, and other issues, such as data cleansing, we will dive into in the next chapter. Let’s look at some of the typical issues with data integrations and then discuss some options to handle or plan for them to make the design and processes more manageable.

Here is the quick summary of the common issues that will be covered in the rest of the chapter:

- Design: Not planning for data integration.

- Change: Things change; that is a given. This includes new development, new requirements, and companies merging or spinning off.

- Business purpose: The purpose of the data might be specific to a business need or process, which makes it harder to integrate.

- Standardization: Multiple database platforms and application systems.

- Data cleansing: Data quality and incomplete data.

- Volumes of data: The amount of data can be incredibly high, with multiple sources.

- Multiple systems to integrate: Not just one process, but several systems needing the same data.

- Latency of data: Systems that need real-time data limit how data integrations can be used.

- Tools: Tools have limitations, and having different tools in an environment might cause conflicts.

- Metadata information: Details about what the data is, needed in order to use it in other systems.

- Knowledge of data integrations: Information about the data, what integrations are available, and the sheer size of it all can be overwhelming.

- Testing: The availability of data in different systems might make testing of the integrations difficult.

- Performance: Data volume and velocity, growing analytic workloads, and real-time data all pose possible performance issues.

Design for Integrations

Common issues start with the design. It is possible that applications have been created without considering that the data would be needed for another purpose, or some applications might be legacy systems. This type of application contains all the data that it needs and does not look to outside sources for other data. In most companies, this type of design is no longer a reality because of the variety of applications in a company and the opportunity to use information that comes through in other areas. This could be something as simple as leveraging reference data across multiple systems in the company. Designing applications with the belief that the data will stay contained is really no longer an option. Data will need to be shared with others as well as pulled in from other sources. In new development planning, web services for data or data APIs (application programming interfaces) are beneficial for exposing the data sources used in the integrations. Understanding that information being gathered in an application might be beneficial to other applications will make the design of the application ready for integrations.

Design requirements should come from the business users because they have the details about what data is needed and understand the data assets and the value of the data to the company. Technology can help with solutions in the design and work with the business on options to include in the data integrations, but partnering here will help plan for long-term data integrations. It is not an easy task to consider all the different uses of the data when designing a system, but at least conscious decisions should be made about integrating data or providing data services. Enough planning already goes into the different applications that are either collecting or reporting on the data, and taking on considerations of which data might be needed by others makes the development process more difficult. As a result, integrations tend to become an afterthought.

Design is also not just about the data and what sources are used or provided; it also includes processes and analysis of the source systems. This is why design is a common issue for data integrations, because it involves the business, technology, data sources, and processes around data loading and integrations. With all of this coordination, there are going to be areas that need addressing. Business data owners should establish what data they want to share for data integrations and should have acceptance criteria for the development of the data integrations. Technology should help them plan the systems that hold the data and perform the integrations—and all of these are part of the design requirements. Not thinking about how data moves from one system to another at application design time might make the application difficult to implement further down the line.

Incomplete designs for data applications can cause issues with data integrations. Planning and design need to include the following to reduce these issues with data:

- Involving the business and data owners in making decisions around what data is needed and what data is available in the application.

- Planning for reference data and other data that might be needed. The design should have a way to pull data from outside sources, whether it just be reference data or data from other applications.

- Anticipating that a data system has information that might have other uses. This involves having responsible data owners that understand the data that could be available and then developing access to that data through web services or some other API.

Data integrations are planned to be able to pull many diverse sources of data, but the diversity of the sources can actually cause issues and make the data difficult to integrate. Designing integrations with this in mind is important, and we will talk more about this in the tool discussion later in this chapter.

Change

Even with the best design decisions, with businesses always changing, new systems will need to be incorporated or spun off. Change is a common issue and a constant. Companies merging and needing to combine data is a typical situation that needs to be dealt with. Depending on the data quality work that was completed in the different systems, they might be at different stages of integration and data quality. The systems coming together might also have different metadata meanings. For example, an account might mean something different in each system, and when the data is merged there will be multiple meanings just for the account, so mapping the data together is close to impossible. This might sound more like a data quality and data cleansing issue, but it is a common issue for data integrations concerning what the data actually is. Understanding the source of the data and the meaning of that source is key to the data integrations; the issue is with the data and with having to pull legacy, merged, or multiple systems together.

New data coming into the systems might not just be from merging companies and systems; it might also come from new applications being developed. As products are being developed, applications and systems are being designed. As discussed earlier concerning the issues with designing the system and planning for data to be used by other applications, the business needs to understand that data is constantly changing and being made available. New sources of data are being developed and changed by these new applications. There might be newly available information that could make the data integrations easier. If this is the case, is change really an issue? Not necessarily, but how can the change be incorporated into the existing integrations? Is it going to be easy to deal with, or is it going to take significant time to include the change?

As data changes and new sources become available, as part of the data integration or loading processes, there should be ways to verify and include new data. This is an issue we will discuss more when we talk about data cleansing. ODI will also capture the business rules and allow for changes to these rules and then capture the changes. Using tools helps mitigate the issues with changes, because they can capture details about the changes and implement new business rules to validate and use the new data and integration. Figure 6-1 shows an example of capturing the knowledge module and modifications.

FIGURE 6-1. ODI knowledge module

Figure 6-2 shows an example of the tasks under the knowledge module for defining rules and keeping comments about information on the processes. Additional tool requirements are discussed later in this chapter in the “Tool Issues” section.

FIGURE 6-2. ODI knowledge module tasks

Data integrations are going to be part of the data flow, and as you have already learned, this is a process that needs validation and changes. So as the data is changing, the process should also be changing and updated to correct data issues, add new sources, and validate the changes.

Another common type of change is migrations, which involve changing applications and moving data from one application to a new application or an existing application. The merging of application data to be represented either in a new application altogether or in one of the existing applications involves quite a bit of data movement and figuring out what data belongs where. The data integration challenge with this type of migration or merging of applications is best addressed with data mapping and cleansing. It is possible that there was detail on the data in the previous application, but if not, then reviewing the data for what is important is the place to start. That goes for both systems. When mapping the columns in the different systems to be able to perform the integrations, you need to know which tools or master data management plan can address the merging or migrating of the data. So why is this an issue? Well, deciding what data is important causes the issues, and figuring out how to keep the data that is needed is how to resolve them. The data integrations are actually merging the data and doing the data cleanup; otherwise, there is a risk of duplicates or incomplete data. You can find an example in the next chapter of how to make this work with data cleansing.

The issue is that changes are going to happen; there are ways to document, plan, and include these changes for data integrations. Tools are going to be most useful in helping incorporate and validate the changes. When the changes in data and processes are not included, the data integrations are not going to be able to use the new data, or something might actually break because of the changes.

Business Purpose

Data might have certain security restrictions around it, or it might have a particular meaning for a specific department or team that, when viewed differently, might not be useful or might be misleading. For whatever reason, teams typically tend not to share data and information, and depending on the data they have stored, a source of the data might be available in some format but not necessarily in the database. The business purpose and data needed for one team might not be the same for another. Many times, a value might have different calculations, or other factors might be involved. Concern arises about sharing this information with other teams because the data might be taken a different way.

Communication of the business purpose and details about the value can help clear this up and make the data usable by others. The problem is agreeing on the same values and the environment around the values. The reference data and the basic underlying data might be shareable, but as the application is developed, the data values and the way the data is looked at can be very specific to a certain business need. The timing of the data might not match up with how others view it, or different layers of the data might not mean the same thing across the different business groups. This is where master data management comes into play. However, if the department or reason for having the data is inflexible or too specific, the data might not be able to be integrated.

Many debates have taken place concerning what the between dates are and what a column should be named, because of the different meanings for the groups involved. Agreeing on some baselines for the data is beneficial for all groups involved, and presenting the data in its simplest form allows others to consume the information. The simplest form of the data should ensure that others cannot misrepresent it, but it should not include all the final calculations. The source of the data can then be used in the data integrations with other systems if what the values mean is documented, as well as how the values should be calculated or used. If the groups need to have different column names for the data, then verifying the integrations and knowing where the data came from can get confusing. However, this can be handled using mapping tables, tools, and master data.
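
As a hedged illustration of that last point, a simple mapping table can record how each group’s column name and calculation relate to an agreed baseline. The table, column names, and sample row below are assumptions for the sketch, not a prescribed design.

```sql
-- Hypothetical mapping table relating each group's column to an agreed
-- baseline definition; all names here are illustrative.
CREATE TABLE column_mapping (
  source_system    VARCHAR2(50)  NOT NULL,  -- application that owns the column
  source_column    VARCHAR2(128) NOT NULL,  -- column name as that group knows it
  baseline_name    VARCHAR2(128) NOT NULL,  -- agreed, shared definition
  calculation_note VARCHAR2(500),           -- how the group derives its value
  data_owner       VARCHAR2(100),           -- who to ask about meaning or changes
  CONSTRAINT pk_column_mapping PRIMARY KEY (source_system, source_column)
);

INSERT INTO column_mapping
VALUES ('BILLING', 'ACCT_NO', 'ACCOUNT_ID',
        'Stored without the check digit', 'Billing data owner');
```

An integration process can then join through a table like this to trace where a value came from and how it was derived, even when the groups keep their own column names.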

The difficulty with the data integrations arises from the meanings that the business gave the data just to serve a single purpose or solution. Additional work is needed on the meaning of the data, or on another generic view of the data, for it to be consumed by other teams. With an issue like this, it seems there would be multiple sets of similar or even identical data supporting single applications without the opportunity to leverage other data. Also, the cost of applications might not be shared equally across the different teams. This can cause additional walls to be put up, thus making the data integrations more difficult.

Reusing the data might seem to make sense; however, it requires an agreement that the data is going to be available and a definition of what it is. Data agreements concerning how data can be used and what it is being used for are a necessity from the data owner’s perspective. If the owner is planning on using the information only for a specific purpose, they might not want to maintain it for others to use. Therefore, data quality might become an issue without the support of a data owner.

Adding to the challenge of the business purpose for the data are a couple of external controls that might limit the data availability for integrations. Security and regulatory issues need to be taken into consideration when working with the data. The proper security controls need to be in place at the stages where the data is being consumed. Some data will need to remain in the existing data store, and access might be limited to very specific systems and users. The controls around the data come from data governance and should be part of the master data management plan. Data integrations are definitely easier if there are no worries about the controls and all of the data is available for public access; however, that is not the case. Most of the data will need to be protected along the way, and having data security as part of the business rules for the data integrations will be key for this. Regulations and compliance will also need to report on these data flows and understand where the data is and how it is being accessed. Security and regulations bring challenges with data integrations. Therefore, the data is not going to be guaranteed to be open to all integrations. Also, security must be considered throughout the data flows. Controlling the access of the data in the integration process and using systems that provide for data masking, redaction, and encryption will help at the different stages. The integration plan might just have to change if data is not able to move due to compliance or restrictions on it. Again, this is a challenge, but it’s something that can be included in the data integration plan, as part of the business rules, and in considering security and data protection throughout the integration process.

Needing the data and having a business purpose for it is just as important a reason for performing the data integration as the data itself. The business owner of the data needs to agree to have the data used in other ways and to support the different transitions; otherwise, data quality issues will arise. Communication between teams about the information and the level of data needed, as well as planning to use data for other purposes, will help with the data integrations.

The challenges with data integrations cannot always be resolved with technology or tools, as you can see with these first three issues. There is normally a way to load, merge, insert, and extract data between systems, but agreeing on what the data is and what it should be used for is another story. Without the data owners and application teams being involved in making decisions around the data definitions and being able to share the information across the systems, data integrations would probably not happen.

A technologist can get people in a room to discuss issues and communicate, and as an owner of the technology it would be easy to describe the steps that need to happen within the system. This will help the business and data owners look at the data again to make decisions about it. The manual steps should be considered as part of the data cleanup. The jobs that will need to be completed should be discussed with the business in order to know how the data is used. Providing ways to capture the data, along with information about how the data can be used as well as the meaning of the values, is a great way for the technology to help solve this issue. Leveraging the business purposes and getting the people to come together to agree, as well as getting the details to support the various data integrations, are what is needed to take care of this first set of common challenges.

Standardization

Many systems have a column named ACCOUNT, so these fields should match everywhere, right? If only it were that easy. Having consistent names is a goal in any database system, but this is not always possible across systems and departments, or when data is being pulled in from multiple systems. Not everything called ACCOUNT is the same type of account or has the same meaning for the data. Additional fields should be used to standardize or to provide additional information, such as the type of account, the account data owner, and the system the data is used in. This, of course, is now starting to sound like master data management, and it almost is.

With standardization, the names of columns can be the same or different, but in gathering information for data integrations, additional details can start to be collected that could become master data. As a first step, however, one should provide enough information to be able to start using the data for integrations with other systems, because the meaning of the values is understood and there is additional information to make sure the data is being used properly. Again, this information can be stored in mapping tables that are used to stage the data integrations, or tools such as ODI can store the mapping information and the rules around the data.
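
One way to picture this is a small reference table that carries the extra details alongside the value itself. This is only a sketch; the structure, names, and example types are assumptions rather than a recommended model.

```sql
-- Illustrative standardization table: the ACCOUNT value keeps its name,
-- but extra attributes make its meaning, ownership, and origin explicit.
CREATE TABLE account_standard (
  account_id    VARCHAR2(30)  NOT NULL,
  account_type  VARCHAR2(30)  NOT NULL,   -- e.g. customer, ledger, vendor
  source_system VARCHAR2(50)  NOT NULL,   -- system the value is used in
  data_owner    VARCHAR2(100),            -- responsible business owner
  definition    VARCHAR2(500),            -- agreed meaning of this account type
  CONSTRAINT pk_account_standard PRIMARY KEY (account_id, source_system)
);
```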

Standardization in this sense doesn’t mean having the same name for the columns but rather standardizing on a definition of what the data is and the additional details that should be provided. Standards for data also mean getting the form in which the data should be consumed, without the business-specific calculations and adjustments. Again, the data owners need to be in agreement to provide a set of standards for the data concerning the detail, format, and baseline of information.

First and foremost, standardization is a process-related challenge. In working through this issue, analysis, discussion, and even compromise between the different parties are going to be key. It points back to the fact that not all of the challenges can be solved with technology or tools—process analysis and cooperation among the business partners are needed as well. However, a technical implementation with the decided configuration data using tables or other representations is the second step in this challenge.

The creation of standards makes sense, and using standards makes sense. However, maintaining them and keeping them for others to know can be a challenge. Storing the standard information with the data integration system and having the business and data owners maintain that information along with the other rules they have for the data will keep the details current and available for reference.

Data Problems

Data quality is probably the biggest common issue with data integrations. It might not appear that way since other issues were listed first here in this chapter, but data cleansing has its own chapter following this one. It needs to be put into a completely separate chapter so it can be discussed in more detail. Data quality and cleansing take care of data validations and incomplete data. However, there are other data issues as well.

Heterogeneous data comes from all kinds of platforms: text files, relational and non-relational databases, legacy formats, Big Data systems, and new formats. Integrations need to handle a variety of data formats, and working on turning all of the data into the same format doesn’t necessarily add value to the process or help with the data quality. If the data formats are different, this adds to the challenge of using data from different sources. A possible solution has been to convert these formats into a standard format that can be used. There is a balance between querying the data directly and loading the data into a data warehouse: direct queries can provide additional attributes of the data, while loading the data into a data warehouse converts all of these different sources and data formats into the same format and a single source.

The Oracle database can take in these different formats as they are and still query them using the same syntax as other SQL statements. Different data types such as JSON and XML can be stored as is in the tables and then queried. Add on top of that the use of external tables to read directly from files, and now at least there is not a lot of time spent on parsing and translating the files or data formats. They can just be queried in their native form to extract the data and know what the information is. Chapter 8 provides some examples on JSON and XML for more detail on how this is accomplished. Being able to query these data formats is not only beneficial for having the data pulled together in data marts or data warehouses for consumption and integration, but it also allows other applications to use this data more efficiently. If another application requires the data, it doesn’t have to extract it in relational database form but can pull it out in JSON or XML because of the translations that are provided by the database.
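
To make that concrete, here is a hedged sketch of the two techniques mentioned: storing JSON in a table column and reading a flat file through an external table. The table names, directory object, and file are assumptions; the exact options depend on the database release.

```sql
-- JSON stored in a relational column (Oracle 12c and later) and queried
-- with JSON_TABLE; names are illustrative.
CREATE TABLE orders_json (
  order_id  NUMBER PRIMARY KEY,
  order_doc CLOB CHECK (order_doc IS JSON)
);

SELECT o.order_id, jt.customer_name, jt.order_total
FROM   orders_json o,
       JSON_TABLE(o.order_doc, '$'
         COLUMNS (customer_name VARCHAR2(100) PATH '$.customer.name',
                  order_total   NUMBER        PATH '$.total')) jt;

-- External table reading a CSV file in place, so no separate load step is
-- needed; data_dir is an assumed directory object on the database server.
CREATE TABLE orders_ext (
  order_id      NUMBER,
  customer_name VARCHAR2(100),
  order_total   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('orders.csv')
);
```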

It’s not always necessary for data integrations to bring all the data (no matter what the format) together to be able to present it in a consistent format that everyone can use. The decision doesn’t need to be made for how the data should be presented in a single source to applications. The applications can receive the data in the format needed as long as they understand the data and the meaning behind the values so that the right information is provided. As long as the databases continue to develop and can process the new data formats coming out of the Big Data stores, the effort can be placed more on the quality and understanding of the data and then on the translation of the data into consistent formats (which may eventually need to change for future uses anyway). Some data might be too high in volume, or there might be other reasons for not pursuing the effort of transforming and integrating the data. However, there are many analytical purposes for which bringing the data together would make more sense. For schema-on-read technologies such as NoSQL databases and document-based technologies such as Hadoop and JSON, the performance is much lower if the data is not consolidated. Our discussion on data integrations with Big Data will continue in Chapter 8.

Another data issue is just the sheer volume of data. Fortunately and unfortunately for the data professional, the amount of data and information is not decreasing, but ever increasing. The velocity and volume of data that is moving through the enterprise is really amazing if we consider that 300GB was considered a “large” database around 15 years ago. Today it is normal to have over 100TB being stored for one system—and even more, approaching the petabyte range. If data is being integrated and then stored in data warehouses or common areas for access, the storage for these systems is just going to continue to grow. More storage capacity and planning will be needed for these systems to house all this data and copies of data. Depending on the performance and different timing of the transformations, consolidating the data might be the way to go for the integration efforts. Data warehouses still offer a viable option for this, allowing for the complex data integrations of multiple platforms. Moreover, the integration and transformation logic has to be built over and over again into many analyses, if not done beforehand for a data warehouse. Because it is possible to pull the data together without consolidating it into a single store, these issues should be taken into consideration.

When planning data integrations, you will need to include this additional storage capacity. Besides the copies, the loading and transformations that are needed for data cleansing and integrations will take up space. It is better to overestimate the storage that will be used than to run out of space during the process. Of course, additional copies and the possible number of staging tables for integrations should be considered if they are a viable part of the process, not only for the storage concerns but also for the issues with synchronizing data across all of the staging tables.
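
As a hedged example of the kind of check that helps with this capacity planning, the query below sums the space currently used by a set of staging tables. The schema name and the STG_ naming pattern are assumptions for the sketch.

```sql
-- Illustrative capacity check: space used by staging tables in an assumed
-- INTEG_STG schema (requires privileges to query DBA_SEGMENTS).
SELECT segment_name,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024, 2) AS size_gb
FROM   dba_segments
WHERE  owner = 'INTEG_STG'
AND    segment_name LIKE 'STG_%'
GROUP  BY segment_name
ORDER  BY size_gb DESC;
```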

Synchronizing Data and Copies: Multiple Systems

Here’s some data, there’s some data. Everywhere there are copies and more copies—and after data integrations, there’s even more of the same data in multiple systems. Having different copies of data requires that it be synchronized and updated as the data sets change. When there is data in multiple systems in different platforms, as we discussed with the data formats, one option is to integrate the data to one environment. After the data is brought into one environment, other applications can repurpose it. Along the way, the data needs to remain synchronized with the original source, if the original source is known.

The data originated from some place, and this data source is the cleansed, golden source of information. The golden source of data is the one that should be updated and modified. If there are discrepancies along the way, the golden source of data should be the single source of truth. In real life, the golden source of data is often not very shiny. The data this source provides requires a lot of cleansing before it can be used for other purposes, but it is still a consistent source and state of the data. As one would assume, if this source is copied over and over again, there are more steps that can cause issues by getting further away from the original source. Ideally, the applications needing to integrate this data would go to the golden source. Depending on how access to the source is provided or the format, this might make most of the data integration efforts easier and thus avoid having multiple copies of the data.

How does one avoid having copies of the data, and how can the golden source of data be used if it is not in a system with easy access? As discussed with the data format issue, there are ways in the Oracle database to query a wide variety of data sources, formats, and types. Instead of copying the data, look for ways to query and use the data. Federated queries or querying structures will need to be optimized for the given analytic purposes, and the query option makes sense for certain systems (although it might cause additional performance challenges in others). With this option of querying the data, the integration doesn’t require that all of the data move to one place or one source. The data can be used while avoiding the synchronization process to a single source.
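
A minimal sketch of this querying approach, assuming a database link to the system that holds the golden source; the link name, credentials, and table names are illustrative only.

```sql
-- Query the remote golden source in place rather than copying it locally.
-- GOLDENDB, app_reader, and the table names are assumptions.
CREATE DATABASE LINK golden_src
  CONNECT TO app_reader IDENTIFIED BY "change_me"
  USING 'GOLDENDB';

SELECT c.customer_id, c.customer_name, o.order_total
FROM   customers@golden_src c
JOIN   local_orders o
  ON   o.customer_id = c.customer_id;
```

Whether this is appropriate depends on the performance considerations noted above: a remote query saves the copy and the synchronization, but it moves the work to query time.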

However, what if there is a debate about which source is the golden source? This is possible if there are multiple systems collecting the same information. Data owners might dispute the source of truth, which is another reason for having a master data management system to catalog, document, and bring the debates to a conclusion concerning which source to use. If there is no way to get around the issue of multiple systems collecting the same initial data, a decision must be made as to which data is the golden source. Once one set is chosen as the golden source, verification and synchronization between the various sets of initial data no longer matter, because the other copies are not used and so do not need to be synchronized.

Even though we just finished discussing reasons to avoid making copies of data due to space issues and having to ensure that the data is consistent and updated, there are reasons to have separate copies. Replication can be used when you’re building a data warehouse or are providing sources of data that are more accessible. Oracle GoldenGate is discussed in Chapter 4; it provides ways to synchronize data between systems and make sure values are updated. Using Oracle GoldenGate provides an ideal solution for taking the source and reproducing the data for the other applications and the data warehouse, and it manages the concerns around synchronization. It might also address the next issue: data latency.

Latency

Data latency is any lapse in time from when the data is updated to when it is available. If the application or report is using the data from an integration process and the data takes a set amount of time to be available, that time needs to be understood.

When the business owners are asked when they need the data, a typical response is “in real time.” One might immediately think that is going to be an issue with data integrations, because if the data is changed in one area, how is it going to be in real time in another application unless that application is accessing the same data at that time? And on top of that, one data integration is not normally the only thing going on; rather, multiple sources of information are coming together to provide the details and data. Several sources of data might not even be updated on a regular basis, so how does that affect the latency? How does the process know when the data is changing? Is the updated data needed in the process at the same time that it changes?

Remember our discussion on the size of the data and what happens to latency when very large datasets are being imported and synchronized? Chances are that it is not going to be available as quickly, especially if you add data cleansing to the mix.

The first step in dealing with data latency is not running out and buying more solid state drives and processing power, but rather getting down to an understanding of what “real time” means to the business owner. The important thing here is really understanding the requirement concerning latency: when is the data actually needed, and how quickly does the process have to provide it? Low latency translates into higher data integration costs. Meeting a strict real-time requirement can be done, but the cost might be too high, so data integrations should try to solve this in the most cost-efficient way. Having data there “fast enough” instead of requiring it in real time will reduce the costs and will still provide the information needed for the data integration process.

Start with the business requirements to get to the point of what data is actually needed and what the real-time requirements are for the different sets of data. If the data is not needed in real time, then how old can it be? What makes sense based on normal activity and realistic expectations of the system in providing the data integrations?

After a complete examination of the data sources, how available they are, and when data is normally available, decisions can be made about the data integrations. By mapping the data latency needs to the data integration options, staging and firing the integrations can begin in the order that is needed to meet expectations. Without a true understanding of what “real time” means for the application and which data it refers to, it will be difficult to design and architect the options for how data gets integrated.

Tool Issues

As discussed in previous chapters, tools can be used to help automate, synchronize, and take care of complex data integrations. GoldenGate is a tool that replicates the data and synchronizes the data sources, and ODI provides a tool that not only handles the complex integrations but also stores the business rules and details around the processes. Furthermore, ODI allows you to configure and orchestrate the use of GoldenGate. In the next chapter, we discuss how important tools are for data quality.

Performing the data integrations manually, given the large amounts of information in the enterprise, would just be an overwhelming task. Even the issues discussed thus far in this chapter would make it extremely difficult to successfully have a repeatable process with manual steps.

Tools are needed for the processes, but tools need requirements and they need to be managed. Also, different tools might be needed at various steps along the way—from using SQL and external tables, to employing ODI, GoldenGate, Oracle Enterprise Manager Cloud Control (OEM), and data quality tools. These all provide specific tasks and solutions for the processes along the way. OEM Cloud Control is the one-stop shop for managing and monitoring all the other resources for databases, GoldenGate, and ODI. Figure 6-3 shows ODI being used for table mappings and business rules.

FIGURE 6-3. A sample ODI table mapping

The tools used for data quality (which are discussed further in the next chapter) need to capture information about how to perform the data cleansing, allow for business rule changes, and constantly verify and apply these rules. Requirements gathering for the tools would be key in selecting the right one for the job.

There seems to be a pattern here: diving into the requirements and getting the full story before getting into solving the problem, designing the process, and performing the data integrations. Tools are the same, because even though they help automate and provide a repository of information for the data integrations, they still require maintenance and updates. Reporting tools help with alerts and reminders for checking rules. An example is lineage analysis for the entire data integration data flow using Oracle BI EE.

The tool requirements should actually match up with some of the issues discussed in this chapter. The functionality of the tools should include performing data integrations, replication, and reporting. Adding a self-service option to the tool, so that the business owners can get more involved in the process, is another requirement; this allows the data owners to manage their requirements and data details.

Here is a list of requirements that address some of these issues:

- Support business rules
- Track changes
- Validate rules and set up processes
- Store standards
- Support multiple data formats
- Support multiple platforms
- Manage metadata
- Provide data cleansing
- Verify integrations
- Constraints and filtering
- Reporting
- Access to data owners
- ETL tool
- Real-time data capture
- Monitoring
- Process development

In reviewing these options for the tools and realizing that they provide more automation, monitoring, and business requirements gathering and change handling, we can see that tools are still an issue. Because of tool management, even tools that meet all the possible requirements add maintenance to the environment. Upgrades, provisioning, and availability will need to be supported. If there are integrations that require the data in real time or near real time, then the system will need to be configured to be highly available and supported as such. Also, if data integrations are provided, the reports will need to show what failed along the way, and ideally there will be contact information and service agreements to know whether a process is running longer than expected.

The information in the tools is only as good as the information being managed. The tools should be able to alert or report on processes and on when information was last updated. These notifications and alerts support the business and data owners, and let them know when to go in and review the integration processes.

Data quality tools are discussed in the next chapter, but the tools reviewed in previous chapters support better data integrations. Decisions about tools need to be made just like decisions concerning data sources and processes. Adopting practices to review these issues with the business and data owners, technologists, and those supporting the integrations will build and design more effective integrations that align with all the areas.

Managing Mapping Tables: Metadata

As discussed in the “Business Purpose” section earlier in the chapter, the information around the columns and descriptions is part of the metadata. Metadata can be found in several different formats, such as database dictionaries, text files, processes, code, spreadsheets, and database tables. The business and data owners also hold metadata as well as information that might not be part of the system, such as the knowledge and experience they carry in their heads. The previous discussion about metadata and the purpose of the data focused on the responsibility of the data owner, the values, and communicating what is available. Mapping tables and storing the metadata are another part of including the details in the database and/or tools. Other areas of metadata include information about aggregation hierarchies in dimensions, configuration information for a data integration process, constraint information, data quality checks, and more. This section focuses on one aspect of metadata.

Metadata can be captured as part of the data integration process just as the mapping of the data integrations can be stored. Figure 6-4 shows the mappings with filters in ODI.

FIGURE 6-4. ODI table mapping for integrations

Notes and memos can document the processes and create data models for use by the designer in ODI. A place to store the metadata and its changes can be provided; this information is at the center of the data integration. Keeping the information in a tool allows it to be enhanced and extended to capture all the business details. ODI is designed as a data flow tool. Structural modeling and commenting on these models are as important as the data flows. These comments and details can be handled and modeled using ODI or SQL Developer Data Modeler.

The metadata provides a way for columns to be different in multiple applications but have a consistent definition to reconcile those differences. Associations can be drawn between entities for the data environment, and the metadata environment can supply that capability. Some of the technology keeps the metadata updated through tools and the ability to pull information from the database dictionary. The extended information, however, still needs to be maintained.

The technical metadata is part of the ETL process and data integrations. It includes the constraints, column details, data types, tables, and record attributes. The data integrations should record the load time, update, backup, and last run information, which can be used for reporting on the process.
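
A hedged sketch of how that operational metadata might be captured for reporting; the table and column names are assumptions, and tools such as ODI keep equivalent information in their own repositories.

```sql
-- Illustrative run-log table for load time, last run, and status
-- (identity columns require Oracle 12c or later).
CREATE TABLE integration_run_log (
  run_id        NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  process_name  VARCHAR2(100) NOT NULL,
  source_system VARCHAR2(50),
  target_table  VARCHAR2(128),
  rows_loaded   NUMBER,
  started_at    TIMESTAMP,
  finished_at   TIMESTAMP,
  status        VARCHAR2(20)            -- e.g. SUCCESS, FAILED, RUNNING
);

-- Report the last successful run per process.
SELECT process_name, MAX(finished_at) AS last_successful_run
FROM   integration_run_log
WHERE  status = 'SUCCESS'
GROUP  BY process_name;
```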

Master data management has procedures that include updating the metadata. Where possible, master data management automatically updates the information that is added about the tables and columns and provides a tool to include the data owner updates. Consistency in the metadata is important, because having metadata that is old or unusable is no different from not having metadata. It might even be that the metadata itself needs data cleansing and quality checking. Incomplete and out-of-date metadata is a significant issue when the metadata is used for the data integrations. Master data management processes work on improving the processes around updating the metadata. Then, by monitoring and reporting on the metadata and attributes of the data, the process makes complete master data information available to technology operators and business owners.

Additional software solutions for managing the metadata are available, including Oracle Enterprise Metadata Management (OEMM). Using tools such as OEMM will allow for automated gathering of catalog details and other structures stored in relational and non-relational databases such as Hadoop. OEMM can be used to manage the metadata and provide the details for working through the challenges of the data integrations.

Testing

Loading and integrating data from all over the enterprise pose problems for verification and testing. Testing is an important part of any application. It ensures the data and processes are providing the correct information. Quite a few layers are part of the data integrations, all of which require testing and validating.

Application data is normally persisted, updated, and inserted, and it can have a normal testing path. Applications normally have different sets of data to work through for testing. Data integrations might not persist data, making a full set of testing a little more difficult. Sources of data might only be available in certain systems, and even though these sources can be copied or moved to another system for testing, a full test might not be possible because the sources are handled differently than they would be in production.

In testing data integrations, the data can be pulled into a development environment and the code can be tested against a development source. Data integration testing might even be more like testing for data quality, such as checking for data completeness, transformations, duplicates, and integrity. The data needs to be verified to have moved to the target as expected; if not, the failure should be reported or the process should account for the output source of data. New data sources should be tested before becoming part of the production integration to flush out any data issues and areas that might produce duplicates, incomplete data, or data integrity problems.
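
As a hedged illustration, checks like the following compare source and target after a test load; the table and key names are assumptions for the sketch.

```sql
-- Row-count reconciliation between source and target.
SELECT (SELECT COUNT(*) FROM src_customers) AS source_rows,
       (SELECT COUNT(*) FROM tgt_customers) AS target_rows
FROM   dual;

-- Duplicate-key check on the target.
SELECT customer_id, COUNT(*) AS copies
FROM   tgt_customers
GROUP  BY customer_id
HAVING COUNT(*) > 1;

-- Source rows that never arrived in the target.
SELECT s.customer_id
FROM   src_customers s
LEFT   JOIN tgt_customers t ON t.customer_id = s.customer_id
WHERE  t.customer_id IS NULL;
```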

Tools should be able to provide more automated testing, but extra work might need to be done to stage data for the development areas. This is at least where the data can be validated to see how it can be integrated, but some of the data flows might have to be adjusted for the development sources. ODI can have a test environment set up with the same processes, pointed at the development data sources. Figure 6-5 shows the ODI processes that can run. ODI can run through the same processes as production to be able to provide the same output and validations of the processes.

FIGURE 6-5. ODI process step example

Using the message logs and reports would also help confirm the results that would be reported in production. It might take a little more setup for the data sources, but getting the testing done with the same process should work. Unfortunately, having more systems and some additional setup costs tends to stretch the budget for testing. This might be a piece that gets left to the side. However, the additional cost of implementing a test environment for the integrations will prove valuable. Testing will help keep issues to a minimum in production environments and save costs in the long run. Testing can also improve data quality and reduce issues. The advantages of testing outweigh the extra setup and the additional data sources.

Thorough testing, especially integration testing, is a vital part of the data integration exercise. It is time consuming and pulls in several resources, so it might devour a significant part of the budget. Testing validates the business rules, proper mappings, and the success of the data integration.

Performance

As mentioned before, performance is a challenge due to the large amount of data and the business requirements that need to be met concerning when the data should be available.

Processes, queries, and transformation tasks have to run efficiently to be able to perform the data integrations. With the ever-increasing size of data and greater business requirements, issues around performance are going to continue. Planning for sizing and resources should be a large part of the data integration strategy. Running data cleansing and quality steps can also create a strain when not enough resources are available to run on larger data sets. The tuning of queries is not going to be enough, because it is more than just the queries: analytics, mapping, and other data pulls from other sources all add to the work that needs to happen. The tuning includes everything from the data in the databases to the servers working through the transformations and the housing of the tools for data integrations.
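
As one hedged example of a database-side lever for large transformation loads (an illustration only, not a technique the chapter prescribes), direct-path and parallel inserts can shorten load times; the table names and degree of parallelism are assumptions and would need to be tested against the target environment and its resources.

```sql
-- Hypothetical large load using direct-path and parallel execution;
-- table names and the degree of parallelism are illustrative.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(tgt_orders, 4) */ INTO tgt_orders
SELECT /*+ PARALLEL(src_orders, 4) */ *
FROM   src_orders;

COMMIT;
```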

The challenge is meeting the business requirements for the data and having large sets of data to process. The testing becomes important to verify that the data integrations can run in the given timeframe and deliver the data as needed.

Summary

Data integration brings with it a fair share of common challenges and issues. These common challenges bring opportunities to tighten up the design and processes. Enterprise data is useful to more than one area, which is the reason for the data integration process in the first place, but without planning and tools, it is an extremely difficult task to take on.

Designing and setting the business rules can be more challenging because they are not easily solved by technical solutions. They are going to be resolved by business decisions and very clearly defined requirements. Having the business simply state that it needs real-time data, without defining why or specifying what data they are looking for in “real time,” will cause difficulties for the different systems that have to put those types of solutions in place. Therefore, the business and data owners need to agree on the data they want and specifically scope out the requirements.

Defining the requirements for the tools that are needed to support the data integrations will allow for selection of tools that can automate and manage the processes. It is important to get to fewer manual steps in the integrations; not only to reduce the risk of the implementation, but also to provide more consistent data.

The data source doesn’t need to be absolutely perfect to be the source of truth or the golden set of data. It just needs to be handled like it is the primary source of data and as the source that is being fed. The golden source of data shouldn’t be copied just to put it with another source. Rather, a decision needs to be made as to how the data should be stored. If the data should be collected into a single source, then rules and processes have to be put into place to synchronize the data. Otherwise, the data can be used from the different sources through queries and database types.

Knowledge about the data that is available and the new data coming out of the applications as well as the data stores for the overall company are all key pieces for effective data integrations. The data owners and technologists will have to work through the issues together to produce and support effective data integrations.
