Information has become a major asset for any organization. Corporate users at all levels, including operational management, middle management and senior management, request information to make rational decisions and add value to the business [1, p334f]. Each level has different requirements for the requested information, but common quality dimensions include that the information be accurate, complete and consistent, to name only a few [2, p133]. A rational manager will use available and trusted information as a basis for informed decisions that may affect the bottom line of the business.
When we use the terms data and information, we often use them interchangeably. However, both of these terms, as well as the terms knowledge and wisdom, have significant and discrete meanings; they are also interrelated in the information hierarchy (Figure 1.1).
Data, at the bottom of the hierarchy, are specific, objective facts or observations. Examples could be expressed as statements such as “Flight DL404 arrives at 08:30 a.m.” or “LAX is in California, USA.” Such facts have no intrinsic meaning if standing alone but can be easily captured, transmitted, and stored electronically
[4, p12].
Information is a condensed form of the underlying data. Business people turn data into information by organizing it into units of analysis (e.g., customers, products, dates) and endowing it with relevance and purpose
[5, p45-53]. For this relevance and purpose, it is important that the information is considered within the context in which it is received and used. Managers from one functional department have different information needs than managers from other departments and view the information from their own perspective. In the same way, these information needs vary across the organizational hierarchy. As a rule of thumb, the higher an information user sits in the organizational hierarchy, the more summarized (or condensed) information is required
[4, p12f].
Knowledge, towards the top of the information hierarchy, is information that has been synthesized and contextualized to provide value. Managers use information and add their own experience, judgment and wisdom to create knowledge, which is richer and deeper than information and therefore more valuable. It is a mix of the underlying information with values, rules, and additional information from other contexts.
The highest level of the pyramid is represented by wisdom, which places knowledge from the underlying layer into a framework that allows it to be applied to unknown and not necessarily intuitive situations
[6]. Because knowledge and wisdom are hard to structure and often tacit, they are difficult to capture on machines and hard to transfer
[4, p13]. For that reason, it is not the goal of data warehousing to create knowledge or wisdom. Instead, data warehousing (or business intelligence) focuses on the aggregation, consolidation and summarization of data into information by transferring the data into the right context.
Due to the value that information provides to users within the organization, the information assets must be readily available when the user requests them and have the expected quality. In the past, this
analysis has been conducted directly on operational systems, such as an e-commerce store or a customer relationship management (CRM) system. However, because of the massive volumes of data in today’s organizations, the extraction of useful and important information from such raw data becomes a problem for the analytical business user
[7, p1]. Another problem is that there are often isolated databases, called “data islands,” in a typical organization. The only connections between these data islands and other data sources are business keys, which are used to identify business objects in both systems. Therefore, the integration of the disparate data sources has to be done on these business keys at some point but often exceeds the capabilities of the ordinary business analyst.
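As a minimal sketch of such an integration on business keys, the following Python snippet (assuming the pandas library is available; all system, table, and column names are invented) joins a CRM extract and an e-commerce extract on a shared customer number:

```python
import pandas as pd

# Hypothetical extract from the CRM system (one "data island").
crm = pd.DataFrame({
    "customer_no": ["C100", "C200", "C300"],        # business key
    "customer_name": ["Acme Corp", "Globex", "Initech"],
})

# Hypothetical extract from the e-commerce store (another "data island").
shop = pd.DataFrame({
    "customer_no": ["C100", "C200", "C200"],        # same business key
    "order_total": [120.00, 89.50, 230.10],
})

# The only link between the two sources is the business key,
# so the integration has to join on it.
integrated = crm.merge(shop, on="customer_no", how="left")
print(integrated)
```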
Users in operations often query or update data of a specific business object in their daily work. These operations are performed using transactional queries. Examples include the issue of a support ticket, the booking of an airplane ticket or the transmission of an email. In these cases, the operational user works on business objects that are part of their business processes. Users within middle or senior management often have other tasks to complete. They want to get information about the business or business unit that they are responsible for. They use this information to make their managerial decisions. For that purpose, they often issue analytical queries against the database to summarize data over time. By doing so, they transform the raw data, for example sales transactions, into more useful information, e.g., a sales report by month and customer. Such analytical queries are different from transactional queries because the former often aggregate or summarize large amounts of raw data. If a business user issues an analytical query against an operational database, the relational database management system (RDBMS) has to retrieve all underlying records from disk storage in order to execute the aggregation.
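To make this difference concrete, the following Python sketch uses an in-memory SQLite database (table and column names are invented for the example) to contrast a transactional lookup of a single business object with an analytical query that aggregates many raw rows into a sales report by month and customer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id   INTEGER PRIMARY KEY,
        customer  TEXT,
        sale_date TEXT,
        amount    REAL
    )""")
conn.executemany(
    "INSERT INTO sales (customer, sale_date, amount) VALUES (?, ?, ?)",
    [("C100", "2014-01-15", 120.0),
     ("C100", "2014-01-20",  80.0),
     ("C200", "2014-02-03", 230.1)])

# Transactional query: touches a single business object (one row).
print(conn.execute("SELECT * FROM sales WHERE sale_id = 1").fetchone())

# Analytical query: aggregates many raw rows into a sales report
# by month and customer.
print(conn.execute("""
    SELECT strftime('%Y-%m', sale_date) AS month,
           customer,
           SUM(amount) AS sales_total
    FROM sales
    GROUP BY month, customer
""").fetchall())

conn.close()
```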
1.2. The Enterprise Data Warehouse Environment
Enterprise data warehouses (EDW) have emerged from ordinary data warehouses, which have been described in the last section. Instead of focusing on a single subject area for analysis, an enterprise data warehouse tries to represent all of an organization’s business data and its business rules. The data in the warehouse is then presented in a way that all required subject areas are available to business users
[11].
The next sections present common business requirements for enterprise data warehouses.
1.2.1. Access
Access to the EDW requires that the end-users be able to connect to the data warehouse with the proposed client workstations. The connection must be immediate, on demand and with high performance
[12, pxxiii]. However, access means much more to users, especially business users, than mere availability: it should be easy to understand the meaning of the information presented by the system. That includes the correct labelling of the data warehouse contents. It also includes the availability of appropriate applications to analyze, present and use the information provided by the data warehouse
[12, p3].
1.2.2. Multiple Subject Areas
Because every function or department of an enterprise has different requirements regarding the data to be analyzed, the enterprise data warehouse must provide multiple subject areas to meet the needs of its individual users. Each subject area contains the data that is relevant to its users. When the data is requested, the data warehouse provides the expected version of the truth, which means that it follows the required definition of the information
[11].
In order to achieve this goal, all raw data that is required for the subject areas is integrated, cleansed, and loaded into the enterprise data warehouse. It is then used to build data marts that are developed for specific subject areas. Such data marts are also called dependent data marts because they depend on the data warehouse as the source of data. In contrast, independent data marts source their data directly from the operational systems. Because this approach requires the same cleansing and integration efforts as building the data warehouse, it is often simpler to load the data from a central data warehouse
[13].
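The following Python sketch (assuming pandas; all table and column names are invented) illustrates the dependent approach: the subject-area mart is derived from data that has already been integrated and cleansed in the central warehouse, rather than from the operational systems:

```python
import pandas as pd

# Already integrated and cleansed data in the central data warehouse.
warehouse_sales = pd.DataFrame({
    "customer_no": ["C100", "C100", "C200"],
    "country":     ["US", "US", "DE"],
    "month":       ["2014-01", "2014-02", "2014-01"],
    "revenue":     [120.0, 80.0, 230.1],
})

# Dependent data mart for one subject area (e.g., finance): it only
# re-shapes warehouse data; no further cleansing or integration needed.
finance_mart = (warehouse_sales
                .groupby(["month", "country"], as_index=False)["revenue"]
                .sum())
print(finance_mart)
```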
1.2.3. Single Version of Truth
The integration of all business data available in an organization serves the goal of having a single version of truth of its data
[11]. There are many operational systems, or even ordinary data warehouses, available in a typical organization. While some of these systems are integrated, there is often a disparity in the data stored within the operational databases. This might be due to synchronization delays or errors, manual inputs or different raw data sources for the operational data. The effect is that there are different versions of the truth within the organization, for example about the shipment address of a customer. It is up to the business to decide how to cleanse the data when loading it into the enterprise data warehouse, and this often requires the selection of leading systems or data source priorities. In some cases, an automatic selection and validation based on business rules is sufficient; in other cases, a manual selection is required to achieve a validated, single version of truth.
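As a simplified illustration of such an automatic selection, the following Python sketch resolves conflicting shipment addresses by source priority; the system names, priorities, and records are invented for the example:

```python
# 1 = leading system; lower number wins when sources disagree.
SOURCE_PRIORITY = {"CRM": 1, "SHOP": 2, "LEGACY": 3}

addresses = [
    {"customer_no": "C100", "source": "SHOP",   "ship_to": "12 Ocean Ave"},
    {"customer_no": "C100", "source": "CRM",    "ship_to": "12 Ocean Avenue"},
    {"customer_no": "C100", "source": "LEGACY", "ship_to": "PO Box 99"},
]

def single_version_of_truth(records):
    """Pick the address from the highest-priority (leading) source."""
    return min(records, key=lambda r: SOURCE_PRIORITY[r["source"]])

print(single_version_of_truth(addresses))   # the CRM address wins
```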
The consistent, single version of truth of an enterprise data warehouse is an important goal for the consumers of data warehouses
[12, pxxiv; 14, p23]. However, different departments often require their own version of the truth because each has a different definition of “what is the truth”
[15]. That is why an enterprise data warehouse provides multiple subject areas, as covered in the previous section. Each subject area provides the required information for its individual users in the required context.
1.2.4. Single Version of Facts
While the goal of the “single version of truth” is to provide an integrated, cleansed version of the organizational information, that is, the aggregated and condensed data in a given context, the goal of the “single version of facts” is to provide all the data, all the time. In this case, the EDW should store and potentially provide all raw data that is critical for the mission of the organization (see next section). The lead author of this book was one of the first people in the data warehousing industry to promote this idea, especially to address compliance issues. Eventually, it led to the invention of Data Vault; it remains a key principle of Data Vault 2.0 modeling and is implemented in the Raw Data Vault.
The single version of facts is also important under auditing and compliance requirements, which is covered in
section 1.2.10. We will learn later in this book that Data Vault-based EDWs provide both versions: the single version of truth and the single version of facts.
1.2.5. Mission Criticality
Due to the importance of the data warehouse as the basis for strategic business decisions, the central data warehouse has become a mission-critical corporate asset. Furthermore, data warehouses not only provide aggregated data for business decisions; they also feed enriched information back to operational systems to support the processing of transactions, to create personalized offers and to present upsell promotions
[16, p9].
Mission criticality also requires a specific level of quality of the data warehouse data
[12, pxxv]. If source systems don’t provide the raw data in the required quality, it is the job of the data warehouse to fix any data quality issues and improve the data quality by means of data cleansing, data integration or any other useful methods.
1.2.6. Scalability
Scalability is the ability of the data warehouse architecture to adapt to higher data volumes and an increasing number of user requests
[17, p7]. The architecture should be built in a way that supports adding more data, not only more data volume, but also more complex data. If the data volume grows beyond the capabilities of the hardware, it should be possible to distribute the data warehouse across multiple machines and fully use the capabilities of the added hardware. This concept is called massively parallel processing (MPP). If the architecture is not scalable, adding more hardware has little or no effect once a certain level of build-out is reached.
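A very simplified Python sketch of this idea: rows are assigned to nodes by hashing a business key, so that data and work can be spread evenly across the added hardware. Real MPP databases use their own distribution functions; the node count and keys below are invented:

```python
import hashlib

NODE_COUNT = 4  # hypothetical number of machines in the MPP cluster

def node_for(business_key: str) -> int:
    """Map a business key to one of the nodes via a hash value."""
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT

for key in ["C100", "C200", "DL404", "LAX"]:
    print(key, "-> node", node_for(key))
```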
Another problem in data warehousing is that changing the data warehouse is often complex because of existing dependencies. While the first version of a data warehouse may be built easily, the second version takes more time, because the architecture of the data warehouse was not built with such changes in mind.
Section 1.4 discusses several data warehouse architectures. We will propose an alternate architecture in
Chapter 2, Scalable Data Warehousing Architecture. The advantage of this architecture lies in its scalability regarding the absorption of changes to the data model, among other advantages.
1.2.7. Big Data
Big data is not only “a lot of data” or “more data than I can handle.” We define big data as data having three characteristics: volume, velocity and variety.
The first characteristic is volume. What someone calls “big data” often means that the data is much more than this person is used to handling. However, this statement is highly subjective. Big data for one person or one company might be one gigabyte of raw data but this is rather small data for a person who loads terabytes or even petabytes of data. Loading data in real-time has different requirements and therefore a different definition of big data than loading data in nightly batches or near real-time (near real-time means that the data from operational systems is available in the data mart within a time frame of typically 15 minutes). The definition of big data also depends on the hardware available to the data warehouse.
The second characteristic of big data is velocity. It is not only that there is a lot of static data available in the source systems, which already makes loading a complex task. In addition, data stored in an operational system changes frequently. The more data that is available in a source system, the more changes are applied to it. Therefore, the typical big data project has to deal with lots of updates, data changes or new data that is added to the source system.
The third characteristic is variety. Big data often doesn’t have the same structure. Instead, the data structures of big data might change over time or, as in “unstructured” datasets (e.g., texts, multimedia), have no ordinary structure at all: instead of using columns and rows as relational tables do, unstructured datasets use other types of structures, such as linguistic structures. From a computing standpoint, these structures are considered unstructured because the structure is not as obvious as in relational tables. In other cases, there is data from so many different small data sources that the sum of this data is “big data” with high variety in data structures.
Because there is more and more data available nowadays, a data warehouse structure must not only be able to scale (which refers to the volume), it should also be able to deal with the velocity and variety of the incoming data. In other cases, data is always in motion: by that we mean that it is currently being processed or transferred in packets that are smaller than the actual data asset. Consider for example the transfer of data over a TCP/IP network: the data that needs to be transmitted is usually divided into smaller chunks and stored in IP packets, which are then transmitted over the network. This adds other problems to big data because data is constantly flowing into and out of the network device. In order to analyze the data, it has to be collected, combined, and aggregated – in some cases in real time. This raises the bar on how big data is architected and planned for and leads us to performance issues, which are covered in the next section.
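A toy Python sketch of this collect-combine-aggregate pattern for data in motion; the incoming packets are invented and merely stand in for chunks arriving over a network:

```python
from collections import defaultdict

# Records arrive in small packets and are combined into running totals
# per customer before they can be analyzed.
incoming_packets = [
    [("C100", 120.0), ("C200", 230.1)],
    [("C100", 80.0)],
    [("C200", 15.5), ("C100", 9.9)],
]

running_totals = defaultdict(float)
for packet in incoming_packets:
    for customer, amount in packet:          # collect ...
        running_totals[customer] += amount   # ... combine and aggregate
    print(dict(running_totals))              # state after each packet
```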
1.2.8. Performance Issues
Another issue in data warehousing is the performance of the system. Performance is important when loading a new batch of source data into the data warehouse because the load process includes cleansing the data and integrating it with existing data within the available timeframe. Often, this timeframe is limited to the time when no users are working with the system, usually during the night. Performance also matters for the usability of the data warehouse, which depends on the response time of the system to analytical user queries.
The performance of data warehouse systems is influenced by the way a database system stores its data on disk: data is stored in pages with a fixed size. For example, Microsoft SQL Server allocates 8 KB of disk space for each page
[18]. Each page holds some records of a particular table. The wider a table is and the more columns it has, the fewer rows fit into one page. In order to access the contents of a given column for a given row, the whole page where this piece of data resides must be read. Because analytical queries, which are often used in data warehousing, typically aggregate information, many pages must be read to access the contents of only one column. A typical example of an aggregation is to sum up the sales of a given region; this could be the sum of an invoice_total column. If there are many columns in the table, a lot of data must be read that is not required to perform the aggregation. Therefore, a goal in data warehousing is to reduce the width of tables in order to improve performance. Similar concepts apply to the loading of new data into the data warehouse.
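A back-of-the-envelope calculation in Python illustrates the effect of table width on such aggregations; the row widths and row count are invented, and real storage engines reserve part of each page for headers, so the figures are only approximate:

```python
PAGE_SIZE_BYTES = 8 * 1024      # roughly the 8 KB page size cited above
ROW_COUNT = 1_000_000           # invented table size

def pages_to_read(row_width_bytes: int) -> int:
    """Approximate number of pages the table occupies; a full-table
    aggregation has to read all of them."""
    rows_per_page = PAGE_SIZE_BYTES // row_width_bytes
    return -(-ROW_COUNT // rows_per_page)   # ceiling division

# Wide table with many columns vs. a narrow table holding little more
# than the column needed for the aggregation (e.g., invoice_total).
print("wide table  :", pages_to_read(800), "pages")
print("narrow table:", pages_to_read(40), "pages")
```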
Other ways to improve the performance of data warehouse systems include (1) the parallelization of loading patterns and (2) the distribution of data over multiple nodes in MPP environments, such as NoSQL databases. Instead of loading one table after the other, the goal of the first option is to load multiple tables at once. The second option increases performance by distributing data to multiple nodes. Both approaches are critical to the success of Data Vault 2.0 in such environments and have influenced the changes in Data Vault modeling compared to the initial release (Data Vault 1.0).
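A minimal Python sketch of the first option, loading several hypothetical tables concurrently instead of sequentially (the actual load logic is only simulated):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_table(table_name: str) -> str:
    """Stand-in for the real load logic of one table."""
    time.sleep(0.1)
    return f"{table_name} loaded"

tables = ["customer", "product", "sales"]   # invented table names

# Load all tables at once instead of one after the other.
with ThreadPoolExecutor(max_workers=len(tables)) as pool:
    for result in pool.map(load_table, tables):
        print(result)
```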
1.2.9. Complexity
Data warehouse systems often have complexity issues due to many business requirements. Technical complexity issues arise from three areas: sourcing issues, transformation issues, and target issues.
Sourcing issues are problems that arise from the system from which the data is extracted. The following are typical examples of problems
[19, p16f]:
• Limited availability of the source systems.
• Cross-system joins, filters or aggregates.
• Indexing issues in the source data.
• Missing source keys or even missing whole source data sets.
• Bad or out-of-range source data.
• Complexities of the source system’s data structures.
• CPU, RAM, and disk load of the source system.
• Transactional record locks.
Transformation issues arise during the transformation of the data to meet the expectations of the target. Often, the following operations are performed directly within the transformation:
• Cleansing.
• Data quality management and data alignment.
• Joins, consolidation, aggregation, and filtering.
• Sequence assignments that often lead to lack of parallelism.
• Data type corrections and error handling.
• Sorting issues, including the need for large caches, frequent disk overflows, and huge keys.
• Application of business rules directly within the transformation of the source data.
• Multiple targets or sources within one data flow.
• Single transformation bottlenecks.
The last area of issues is located at the target. These issues arise when loading the data into the target and include:
• Lack of database tuning.
• Index updates which lead to deadlocks.
• Mixing insert, update, and delete statements in one data flow. This forces the execution of these statements in specific orders, which hinders parallelization.
• Loading multiple targets at once.
• Lack of control over target partitioning.
A common reason for these issues is that many data warehouse systems are trying to achieve too much in one loading cycle instead of splitting up the work. The result is that many loading processes become too complicated, which reduces overall performance and increases maintenance costs. In the end, it also affects the agility and performance of the whole team because they have to fix these issues instead of implementing new features.
1.2.10. Auditing and Compliance
A typical requirement for a data warehouse is the ability to provide information about the source and extraction time of the data stored within the system. There are various reasons for this requirement:
• Data warehouse developers need to trace down potential errors and understand the flow of the data into the system.
• The value of data depends on the source or age of data. This information might be used in business rules.
• Compliance requires the traceability of data flows and processes for information that is used as the basis of business decisions. It must be clear where the data comes from and when it has been loaded into the data warehouse
[20, p4f].
Inmon, however, presents reasons for not adding auditability in the data warehouse
[9, p61]:
• Auditing requires data to be loaded into the data warehouse that would not be loaded without such requirement.
• It might change the timing of data to be loaded into the data warehouse. For example, if the data warehouse were the only place that provides auditing, it could require loading all changes to the operational data instead of the daily batch loads typical in many data warehousing projects.
• Backup and restore requirements change drastically when auditing capabilities are required.
• Auditing the source of data forces the data warehouse to load source data with the very lowest granularity.
It is our opinion that auditability should be limited to answering questions such as:
• From where is this particular data asset extracted?
• When was the data extracted?
• What was the process that extracted the data?
• Where was this data used?
The data warehouse should not answer the question of how the data was acquired by the operational system. This answer can often only be provided by the source system itself. Only in some cases will the data warehouse receive information about the user who created or modified a record and the time of that change. If this data is available to the data warehouse, we tend to store it for informational purposes only.
To support the auditability of the data warehouse, we add meta-information to the data to track the data source and load date and time. However, it is more complicated to answer the question of where the data was used because data marts often aggregate data to create information to be used by the business users. In order to enable data warehouse maintainers to answer such questions, the data warehouse processes should be simple and easy to understand.
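A minimal Python sketch of adding such meta-information during staging; the attribute names follow common Data Vault conventions (record source, load date) but are illustrative rather than prescriptive:

```python
from datetime import datetime, timezone

def stage_record(record: dict, record_source: str) -> dict:
    """Tag an incoming record with its data source and load timestamp."""
    staged = dict(record)
    staged["record_source"] = record_source
    staged["load_date"] = datetime.now(timezone.utc).isoformat()
    return staged

raw = {"customer_no": "C100", "customer_name": "Acme Corp"}
print(stage_record(raw, record_source="CRM.CUSTOMER"))
```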
1.2.11. Costs
Another challenge in data warehousing is to keep costs as low as possible because IT in general is considered a cost factor by management. The cost of a data warehouse is influenced by many factors, ranging from the cost of storage to the cost of low quality and bad planning. Another cost factor is that business requirements change over time, requiring the data warehouse to adapt to these changed requirements.
The cost of storage is an often unaccounted-for cost factor in data warehousing. At the beginning of a data warehouse project, the costs are typically low. If the data warehouse started as a shadow IT project, i.e., a project driven by the business, implemented by external IT consultants and bypassing internal IT, the costs might even have been hidden in the budget of another project or activity. However, when some time has passed and the amount of data that is processed by the data warehouse has increased, the storage cost increases as well. In some cases, this happens exponentially and does not only include the costs of adding new disks. If more data is added to the data warehouse, faster network access is required to access the data; more computing power is required to process the data; and better (and more expensive) hard disk controllers are required to access the disks
[9, p335ff].
However, ever-increasing storage costs are not the biggest cost factor in data warehousing. Other cost factors include:
• Cost of low quality
• Cost of bad planning
• Cost of changing business requirements (see next section as well)
The cost of low quality and bad planning is an even bigger factor: even if the project team has carefully planned the data warehouse and ensured its quality, there is nothing it can do about changing business requirements, except anticipatory planning
[21, p335].
This is particularly evident when the business requirements are located upstream of the data warehouse. As introduced earlier, this not only negatively affects performance, but it also drives up the cost of maintenance. Business requirements should not be embedded in the data warehouse loading cycle but rather should be moved downstream to the data mart loading – closer to the business users. This allows the team to be agile and control maintenance and development costs (through auto-generation) and provides better, more rapid response to changing business requirements. In other words, it controls costs of data mart production as well.
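A minimal Python sketch of this separation: the warehouse layer keeps the raw data unchanged, and a (hypothetical) business rule is applied only when the data mart is built, so a rule change never touches the warehouse loading cycle:

```python
# Raw data stored unchanged in the warehouse layer.
raw_warehouse_rows = [
    {"customer_no": "C100", "revenue": 120.0},
    {"customer_no": "C200", "revenue": 230.1},
]

def business_rule_segment(revenue: float) -> str:
    """Changeable business rule, applied at data mart load time only."""
    return "key account" if revenue >= 200 else "standard"

# Downstream data mart load: apply the rule while shaping the mart.
data_mart = [
    {**row, "segment": business_rule_segment(row["revenue"])}
    for row in raw_warehouse_rows
]
print(data_mart)
```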
The agility of the team is directly proportional to the amount of complexity built into the data handling processes. By separating the complex business requirements into respective components, multiple loading sections of the architecture become streamlined; to a point where a majority of the implementation can actually be generated. The mechanics of this separation provide for extreme agility when responding to business requirement changes.
1.2.12. Other Business Requirements
Today’s business environment is characterized by rapidly changing conditions and uncertainty. Therefore, it is common that business requirements change quite frequently. Data warehouse developers try to prevent changes to the data warehouse by careful planning and anticipatory design. This approach often follows traditional waterfall software development methods. In such approaches, there are often four phases
[22, p162f]:
1. Setting up the requirements for the data warehouse.
2. Architectural planning and design of the data warehouse.
3. Development of the data warehouse.
4. Testing of the data warehouse.
In contrast, agile software development methods have been designed to improve software by using customer feedback to converge on solutions
[23]. To support this requirement, the data warehouse must be adaptive and resilient to change
[12, p3]. A change to the existing data warehouse structures should not invalidate existing data or applications. One of the major advantages of agile methods is the ability to react quickly to business changes, as we will learn in
Chapter 3, Data Vault 2.0 Methodology.
To support both the data warehouse engineers as well as the data warehouse business users, a set of tools to query, analyze and present information is required
[12, pxxiv]. Examples include reporting tools, query analyzers, OLAP (on-line analytical processing) browsers, data mining tools, etc. Microsoft SQL Server 2014 includes these tools out-of-the-box.
Another business requirement is the ability of the project team to cope with the natural turnover of team members. An important success factor in data warehousing is to keep the knowledge and skills of the data warehouse team members within the team, regardless of the retirement or withdrawal of key members. Solutions for this include a well-documented data warehouse system and an easily understandable design. Another solution is to use business intelligence (BI) solutions from a major vendor, such as Microsoft, that is well known in the industry and supported by other vendors and consultancy firms
[20, p9].
These requirements are addressed by the major components of Data Vault 2.0 and the innovations contained within it. DV2.0 addresses Big Data, NoSQL, performance, team agility, complexity, and a host of other issues by defining standards and best practices around modeling, implementation, methodology, and architecture.
1.3. Introduction to Data Vault 2.0
Data Vault really represents a system of business intelligence. The true name of the Data Vault System is: Common Foundational Warehouse Architecture. The system includes a number of aspects that relate to the business of designing, implementing, and managing a data warehouse. A bit of historical research into Data Vault 1.0 shows that it is highly focused on Data Vault Modeling, that is to say, a dedication to the physical and logical data models that construct the raw enterprise data warehouse. Data Vault 2.0, on the other hand, has expanded and includes many of the necessary components for success in the endeavor of data warehousing and business intelligence. These components are:
• Data Vault 2.0 Modeling – Changes to the model for performance and scalability
• Data Vault 2.0 Methodology – Following Scrum and Agile best practices
• Data Vault 2.0 Architecture – Including NoSQL systems and big-data systems
• Data Vault 2.0 Implementation – Pattern-based automation and generation, CMMI Level 5
Each of these components plays a key role in the overall success of an enterprise data warehousing project. These components are combined with industry-known and time-tested best practices ranging from CMMI (Capability Maturity Model Integration), to Six Sigma, TQM (total quality management) and PMP (Project Management Professional). Data Vault 2.0 modeling now includes changes that allow the models to interact seamlessly with (or live on) NoSQL and Big Data systems. Data Vault 2.0 Methodology focuses on 2 to 3 week sprint cycles with adaptations and optimizations for repeatable data warehousing tasks. Data Vault 2.0 Architecture includes NoSQL, real-time feeds, and big data systems for unstructured data handling and big data integration. Data Vault 2.0 Implementation focuses on automation and generation patterns for time savings, error reduction, and rapid productivity of the data warehousing team.