Information has become a major asset for any organization. Corporate users at all levels, including operational management, middle management and senior management, request information to make rational decisions and add value to the business [1, p334f]. Each level has different requirements for the requested information, but common quality dimensions include that the information be accurate, complete and consistent, to name only a few [2, p133]. A rational manager will use available and trusted information as a basis for informed decisions that may affect the bottom line of the business.
When we use the terms data and information, we often use them interchangeably. However, both of these terms, as well as the terms knowledge and wisdom, have significant and discrete meanings; they are also interrelated in the information hierarchy (Figure 1.1).
Data, at the bottom of the hierarchy, are specific, objective facts or observations. Examples could be expressed as statements such as “Flight DL404 arrives at 08:30 a.m.” or “LAX is in California, USA.” Such facts have no intrinsic meaning if standing alone but can be easily captured, transmitted, and stored electronically
[4, p12].
Information is a condensed form of the underlying data. Business people turn data into information by organizing it into units of analysis (e.g., customers, products, dates) and endowing it with relevance and purpose
[5, p45-53]. For this relevance and purpose, it is important that the information is considered within the context in which it is received and used. Managers from one functional department have different information needs than managers from other departments and view the information from their own perspective. In the same way, these information needs vary across the organizational hierarchy. As a rule of thumb, the higher an information user sits in the organizational hierarchy, the more summarized (or condensed) information is required
[4, p12f].
Knowledge, towards the top of the information hierarchy, is information that has been synthesized and contextualized to provide value. Managers use information and add their own experience, judgment and wisdom to create knowledge, which is richer and deeper than information and therefore more valuable. It is a mix of the underlying information with values, rules, and additional information from other contexts.
The highest level of the pyramid is represented by wisdom, which places knowledge from the underlying layer into a framework that allows it to be applied to unknown and not necessarily intuitive situations
[6]. Because knowledge and wisdom are hard to structure and often tacit, they are difficult to capture on machines and hard to transfer
[4, p13]. For that reason, it is not the goal of data warehousing to create knowledge or wisdom. Instead, data warehousing (or business intelligence) focuses on the aggregation, consolidation and summarization of data into information by transferring the data into the right context.
Due to the value that information provides to users within the organization, the information assets must be readily available when the user requests them and have the expected quality. In the past, this
analysis has been conducted directly on operational systems, such as an e-commerce store or a customer relationship management (CRM) system. However, because of the massive volumes of data in today’s organizations, the extraction of useful and important information from such raw data becomes a problem for the analytical business user
[7, p1]. Another problem is that there are often isolated databases, called “data islands,” in a typical organization. The only connections between these data islands and other data sources are business keys, which are used to identify business objects in both systems. Therefore, the integration of the disparate data sources has to be done on these business keys at some point but often exceeds the capabilities of the ordinary business analyst.
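As a minimal sketch of such an integration on business keys, the following Python snippet (assuming the pandas library is available; all system, table, and column names are invented) joins a CRM extract and an e-commerce extract on a shared customer number:

```python
import pandas as pd

# Hypothetical extract from the CRM system (one "data island").
crm = pd.DataFrame({
    "customer_no": ["C100", "C200", "C300"],        # business key
    "customer_name": ["Acme Corp", "Globex", "Initech"],
})

# Hypothetical extract from the e-commerce store (another "data island").
shop = pd.DataFrame({
    "customer_no": ["C100", "C200", "C200"],        # same business key
    "order_total": [120.00, 89.50, 230.10],
})

# The only link between the two sources is the business key,
# so the integration has to join on it.
integrated = crm.merge(shop, on="customer_no", how="left")
print(integrated)
```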
Users in operations often query or update data of a specific business object in their daily work. These operations are performed using transactional queries. Examples include the issue of a support ticket, the booking of an airplane ticket or the transmission of an email. In these cases, the operational user works on business objects that are part of their business processes. Users within middle or senior management often have other tasks to complete. They want to get information about the business or business unit that they are responsible for. They use this information to make their managerial decisions. For that purpose, they often issue analytical queries against the database to summarize data over time. By doing so, they transform the raw data, for example sales transactions, into more useful information, e.g., a sales report by month and customer. Such analytical queries are different from transactional queries because the former often aggregate or summarize large amounts of raw data. If a business user issues an analytical query against an operational database, the relational database management system (RDBMS) has to retrieve all underlying records from disk storage in order to execute the aggregation.
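To make this difference concrete, the following Python sketch uses an in-memory SQLite database (table and column names are invented for the example) to contrast a transactional lookup of a single business object with an analytical query that aggregates many raw rows into a sales report by month and customer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id   INTEGER PRIMARY KEY,
        customer  TEXT,
        sale_date TEXT,
        amount    REAL
    )""")
conn.executemany(
    "INSERT INTO sales (customer, sale_date, amount) VALUES (?, ?, ?)",
    [("C100", "2014-01-15", 120.0),
     ("C100", "2014-01-20",  80.0),
     ("C200", "2014-02-03", 230.1)])

# Transactional query: touches a single business object (one row).
print(conn.execute("SELECT * FROM sales WHERE sale_id = 1").fetchone())

# Analytical query: aggregates many raw rows into a sales report
# by month and customer.
print(conn.execute("""
    SELECT strftime('%Y-%m', sale_date) AS month,
           customer,
           SUM(amount) AS sales_total
    FROM sales
    GROUP BY month, customer
""").fetchall())

conn.close()
```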
1.2. The Enterprise Data Warehouse Environment
Enterprise data warehouses (EDW) have emerged from ordinary data warehouses, which have been described in the last section. Instead of focusing on a single subject area for analysis, an enterprise data warehouse tries to represent all of an organization’s business data and its business rules. The data in the warehouse is then presented in a way that all required subject areas are available to business users
[11].
The next sections present common business requirements for enterprise data warehouses.
1.2.1. Access
Access to the EDW requires that the end-users be able to connect to the data warehouse with the proposed client workstations. The connection must be immediate, on demand and with high performance
[12, pxxiii]. However, access means much more to users, especially business users, than mere availability: it should be easy to understand the meaning of the information presented by the system. That includes the correct labelling of the data warehouse contents. It also includes the availability of appropriate applications to analyze, present and use the information provided by the data warehouse
[12, p3].
1.2.2. Multiple Subject Areas
Because every function or department of an enterprise has different requirements regarding the data to be analyzed, the enterprise data warehouse must provide multiple subject areas to meet the needs of its individual users. Each subject area contains the data that is relevant to its users. When the data is requested, the data warehouse provides the expected version of the truth, which means that it follows the required definition of the information
[11].
In order to achieve this goal, all raw data that is required for the subject areas is integrated, cleansed, and loaded into the enterprise data warehouse. It is then used to build data marts that are developed for specific subject areas. Such data marts are also called dependent data marts because they depend on the data warehouse as the source of data. In contrast, independent data marts source their data directly from the operational systems. Because this approach requires the same cleansing and integration efforts as building the data warehouse, it is often simpler to load the data from a central data warehouse
[13].
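The following Python sketch (assuming pandas; all table and column names are invented) illustrates the dependent approach: the subject-area mart is derived from data that has already been integrated and cleansed in the central warehouse, rather than from the operational systems:

```python
import pandas as pd

# Already integrated and cleansed data in the central data warehouse.
warehouse_sales = pd.DataFrame({
    "customer_no": ["C100", "C100", "C200"],
    "country":     ["US", "US", "DE"],
    "month":       ["2014-01", "2014-02", "2014-01"],
    "revenue":     [120.0, 80.0, 230.1],
})

# Dependent data mart for one subject area (e.g., finance): it only
# re-shapes warehouse data; no further cleansing or integration needed.
finance_mart = (warehouse_sales
                .groupby(["month", "country"], as_index=False)["revenue"]
                .sum())
print(finance_mart)
```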
1.2.3. Single Version of Truth
The integration of all business data available in an organization serves the goal of having a single version of truth of its data
[11]. There are many operational systems, or even ordinary data warehouses, available in a typical organization. While some of these systems are integrated, there is often a disparity in the data stored within the operational databases. This might be due to synchronization delays or errors, manual inputs or different raw data sources for the operational data. The effect is that there are different versions of the truth within the organization, for example about the shipment address of a customer. It is up to the business to decide how to cleanse the data when loading it into the enterprise data warehouse, and this often requires the selection of leading systems or data source priorities. In some cases, an automatic selection and validation based on business rules is sufficient; in other cases, a manual selection is required to achieve a validated, single version of truth.
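As a simplified illustration of such an automatic selection, the following Python sketch resolves conflicting shipment addresses by source priority; the system names, priorities, and records are invented for the example:

```python
# 1 = leading system; lower number wins when sources disagree.
SOURCE_PRIORITY = {"CRM": 1, "SHOP": 2, "LEGACY": 3}

addresses = [
    {"customer_no": "C100", "source": "SHOP",   "ship_to": "12 Ocean Ave"},
    {"customer_no": "C100", "source": "CRM",    "ship_to": "12 Ocean Avenue"},
    {"customer_no": "C100", "source": "LEGACY", "ship_to": "PO Box 99"},
]

def single_version_of_truth(records):
    """Pick the address from the highest-priority (leading) source."""
    return min(records, key=lambda r: SOURCE_PRIORITY[r["source"]])

print(single_version_of_truth(addresses))   # the CRM address wins
```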
The consistent, single version of truth of an enterprise data warehouse is an important goal for the consumers of data warehouses
[12, pxxiv; 14, p23]. However, different departments often require their own version of the truth because each has a different definition of “what is the truth”
[15]. That is why an enterprise data warehouse provides multiple subject areas, as covered in the previous section. Each subject area provides the required information for its individual users in the required context.
1.2.4. Single Version of Facts
While the goal of the “single version of truth” is to provide an integrated, cleansed version of the organizational information, that is, the aggregated and condensed data in a given context, the goal of the “single version of facts” is to provide all the data, all the time. In this case, the EDW should store and potentially provide all raw data that is critical for the mission of the organization (see next section). The lead author of this book was one of the first people in the data warehousing industry to promote this idea, especially to address compliance issues. Eventually, it led to the invention of Data Vault; it remains a key principle of Data Vault 2.0 modeling and is implemented in the Raw Data Vault.
The single version of facts is also important under auditing and compliance requirements, which is covered in
section 1.2.10. We will learn later in this book that Data Vault-based EDWs provide both versions: the single version of truth and the single version of facts.
1.2.5. Mission Criticality
Due to the importance of the data warehouse as the basis for strategic business decisions, the central data warehouse has become a mission-critical corporate asset. Furthermore, data warehouses not only provide aggregated data for business decisions; they also feed enriched information back to operational systems to support the processing of transactions, to create personalized offers and to present upsell promotions
[16, p9].
Mission criticality also requires a specific level of quality of the data warehouse data
[12, pxxv]. If source systems don’t provide the raw data in the required quality, it is the job of the data warehouse to fix any data quality issues and improve the data quality by means of data cleansing, data integration or any other useful methods.
1.2.6. Scalability
Scalability is the ability of the data warehouse architecture to adapt to higher data volumes and an increasing number of user requests
[17, p7]. The architecture should be built in a way that supports adding more data, not only more data volume, but also more complex data. If the data volume grows beyond the capabilities of the hardware, it should be possible to distribute the data warehouse across multiple machines and fully use the capabilities of the added hardware. This concept is called massively parallel processing (MPP). If the architecture is not scalable, adding more hardware has little or no effect once a certain level of build-out is reached.
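A very simplified Python sketch of this idea: rows are assigned to nodes by hashing a business key, so that data and work can be spread evenly across the added hardware. Real MPP databases use their own distribution functions; the node count and keys below are invented:

```python
import hashlib

NODE_COUNT = 4  # hypothetical number of machines in the MPP cluster

def node_for(business_key: str) -> int:
    """Map a business key to one of the nodes via a hash value."""
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT

for key in ["C100", "C200", "DL404", "LAX"]:
    print(key, "-> node", node_for(key))
```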
Another problem in data warehousing is that changing the data warehouse is often complex because of existing dependencies. While the first version of a data warehouse may be built easily, the second version takes more time, because the architecture of the data warehouse was not built with such changes in mind.
Section 1.4 discusses several data warehouse architectures. We will propose an alternate architecture in
Chapter 2, Scalable Data Warehousing Architecture. The advantage of this architecture lies in its scalability regarding the absorption of changes to the data model, among other advantages.
1.2.7. Big Data
Big data is not only “a lot of data” or “more data than I can handle.” We define big data as data having three characteristics: volume, velocity and variety.
The first characteristic is volume. What someone calls “big data” often means that the data is much more than this person is used to handling. However, this statement is highly subjective. Big data for one person or one company might be one gigabyte of raw data but this is rather small data for a person who loads terabytes or even petabytes of data. Loading data in real-time has different requirements and therefore a different definition of big data than loading data in nightly batches or near real-time (near real-time means that the data from operational systems is available in the data mart within a time frame of typically 15 minutes). The definition of big data also depends on the hardware available to the data warehouse.
The second characteristic of big data is velocity. It is not only that there is a lot of static data available in the source systems, which already makes loading a complex task. In addition, data stored in an operational system changes frequently. The more data that is available in a source system, the more changes are applied to it. Therefore, the typical big data project has to deal with lots of updates, data changes or new data that is added to the source system.
The third characteristic is variety. Big data often doesn’t have the same structure. Instead, the data structures of big data might change over time or, as in “unstructured” datasets (e.g., texts, multimedia), have no ordinary structure at all: instead of using columns and rows as relational tables do, unstructured datasets use other types of structures, such as linguistic structures. From a computing standpoint, these structures are considered unstructured because the structure is not as obvious as in relational tables. In other cases, there is data from so many different small data sources that the sum of this data is “big data” with high variety in data structures.
Because there is more and more data available nowadays, a data warehouse structure must not only be able to scale (which refers to the volume), it should also be able to deal with the velocity and variety of the incoming data. In other cases, data is always in motion: by that we mean that it is currently being processed or transferred in packets that are smaller than the actual data asset. Consider for example the transfer of data over a TCP/IP network: the data that needs to be transmitted is usually divided into smaller chunks and stored in IP packets, which are then transmitted over the network. This adds other problems to big data because data is constantly flowing into and out of the network device. In order to analyze the data, it has to be collected, combined, and aggregated – in some cases in real time. This raises the bar on how big data is architected and planned for and leads us to performance issues, which are covered in the next section.
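A toy Python sketch of this collect-combine-aggregate pattern for data in motion; the incoming packets are invented and merely stand in for chunks arriving over a network:

```python
from collections import defaultdict

# Records arrive in small packets and are combined into running totals
# per customer before they can be analyzed.
incoming_packets = [
    [("C100", 120.0), ("C200", 230.1)],
    [("C100", 80.0)],
    [("C200", 15.5), ("C100", 9.9)],
]

running_totals = defaultdict(float)
for packet in incoming_packets:
    for customer, amount in packet:          # collect ...
        running_totals[customer] += amount   # ... combine and aggregate
    print(dict(running_totals))              # state after each packet
```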
1.2.8. Performance Issues
Another issue in data warehousing is the performance of the system. Performance is important when loading a new batch of source data into the data warehouse because the load process includes cleansing the data and integrating it with existing data within the available timeframe. Often, this timeframe is limited to the time when no users are working with the system, usually during the night. Performance also matters for the usability of the data warehouse, which depends on the response time of the system to analytical user queries.
The performance of data warehouse systems is influenced by the way a database system stores its data on disk: data is stored in pages with a fixed size. For example, Microsoft SQL Server allocates 8 KB of disk space for each page
[18]. Each page holds some records of a particular table. The wider a table is and the more columns it has, the fewer rows fit into one page. In order to access the contents of a given column for a given row, the whole page where this piece of data resides must be read. Because analytical queries, which are often used in data warehousing, typically aggregate information, many pages must be read to access the contents of only one column. A typical example of an aggregation is to sum up the sales of a given region; this could be the sum of an invoice_total column. If there are many columns in the table, a lot of data must be read that is not required to perform the aggregation. Therefore, a goal in data warehousing is to reduce the width of tables in order to improve performance. Similar concepts apply to the loading of new data into the data warehouse.
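A back-of-the-envelope calculation in Python illustrates the effect of table width on such aggregations; the row widths and row count are invented, and real storage engines reserve part of each page for headers, so the figures are only approximate:

```python
PAGE_SIZE_BYTES = 8 * 1024      # roughly the 8 KB page size cited above
ROW_COUNT = 1_000_000           # invented table size

def pages_to_read(row_width_bytes: int) -> int:
    """Approximate number of pages the table occupies; a full-table
    aggregation has to read all of them."""
    rows_per_page = PAGE_SIZE_BYTES // row_width_bytes
    return -(-ROW_COUNT // rows_per_page)   # ceiling division

# Wide table with many columns vs. a narrow table holding little more
# than the column needed for the aggregation (e.g., invoice_total).
print("wide table  :", pages_to_read(800), "pages")
print("narrow table:", pages_to_read(40), "pages")
```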
Other ways to improve the performance of data warehouse systems include (1) the parallelization of loading patterns and (2) the distribution of data over multiple nodes in MPP environments, such as NoSQL databases. Instead of loading one table after the other, the goal of the first option is to load multiple tables at once. The second option increases performance by distributing data to multiple nodes. Both approaches are critical to the success of Data Vault 2.0 in such environments and have influenced the changes in Data Vault modeling compared to the initial release (Data Vault 1.0).
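A minimal Python sketch of the first option, loading several hypothetical tables concurrently instead of sequentially (the actual load logic is only simulated):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_table(table_name: str) -> str:
    """Stand-in for the real load logic of one table."""
    time.sleep(0.1)
    return f"{table_name} loaded"

tables = ["customer", "product", "sales"]   # invented table names

# Load all tables at once instead of one after the other.
with ThreadPoolExecutor(max_workers=len(tables)) as pool:
    for result in pool.map(load_table, tables):
        print(result)
```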
1.2.9. Complexity
Data warehouse systems often have complexity issues due to many business requirements. Technical complexity issues arise from three areas: sourcing issues, transformation issues, and target issues.
Sourcing issues are problems that arise from the system from which the data is extracted. The following are typical examples of problems
[19, p16f]:
• Limited availability of the source systems.
• Cross-system joins, filters or aggregates.
• Indexing issues in the source data.
• Missing source keys or even missing whole source data sets.
• Bad or out-of-range source data.
• Complexities of the source system’s data structures.
• CPU, RAM, and disk load of the source system.
• Transactional record locks.
Transformation issues arise during the transformation of the data to meet the expectations of the target. Often, the following operations are performed directly within the transformation:
• Cleansing.
• Data quality management and data alignment.
• Joins, consolidation, aggregation, and filtering.
• Sequence assignments that often lead to lack of parallelism.
• Data type corrections and error handling.
• Sorting issues, including the need for large caches, frequent disk overflows, and huge keys.
• Application of business rules directly within the transformation of the source data.
• Multiple targets or sources within one data flow.
• Single transformation bottlenecks.
The last area of issues is located at the target. These issues arise when loading the data into the target and include:
• Lack of database tuning.
• Index updates which lead to deadlocks.
• Mixing insert, update, and delete statements in one data flow. This forces the execution of these statements in specific orders, which hinders parallelization.
• Loading multiple targets at once.
• Lack of control over target partitioning.
A common reason for these issues is that many data warehouse systems are trying to achieve too much in one loading cycle instead of splitting up the work. The result is that many loading processes become too complicated, which reduces overall performance and increases maintenance costs. In the end, it also affects the agility and performance of the whole team because they have to fix these issues instead of implementing new features.
1.2.10. Auditing and Compliance
A typical requirement for a data warehouse is the ability to provide information about the source and extraction time of the data stored within the system. There are various reasons for this requirement:
• Data warehouse developers need to trace down potential errors and understand the flow of the data into the system.
• The value of data depends on the source or age of data. This information might be used in business rules.
• Compliance requires the traceability of data flows and processes for information that is used as the basis of business decisions. It must be clear where the data comes from and when it has been loaded into the data warehouse
[20, p4f].
Inmon, however, presents reasons for not adding auditability in the data warehouse
[9, p61]:
• Auditing requires data to be loaded into the data warehouse that would not be loaded without such requirement.
• It might change the timing of data to be loaded into the data warehouse. For example, if the data warehouse were the only place that provides auditing, it could require loading all changes to the operational data instead of the daily batch loads typical in many data warehousing projects.
• Backup and restore requirements change drastically when auditing capabilities are required.
• Auditing the source of data forces the data warehouse to load source data with the very lowest granularity.
It is our opinion that auditability should be limited to answering questions such as:
• From where is this particular data asset extracted?
• When was the data extracted?
• What was the process that extracted the data?
• Where was this data used?
The data warehouse should not answer the question of how the data was acquired by the operational system. This answer can often only be provided by the source system itself. Only in some cases will the data warehouse receive information about the user who created or modified a record and the time of that change. If this data is available to the data warehouse, we tend to store it for informational purposes only.
To support the auditability of the data warehouse, we add meta-information to the data to track the data source and load date and time. However, it is more complicated to answer the question of where the data was used because data marts often aggregate data to create information to be used by the business users. In order to enable data warehouse maintainers to answer such questions, the data warehouse processes should be simple and easy to understand.
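A minimal Python sketch of adding such meta-information during staging; the attribute names follow common Data Vault conventions (record source, load date) but are illustrative rather than prescriptive:

```python
from datetime import datetime, timezone

def stage_record(record: dict, record_source: str) -> dict:
    """Tag an incoming record with its data source and load timestamp."""
    staged = dict(record)
    staged["record_source"] = record_source
    staged["load_date"] = datetime.now(timezone.utc).isoformat()
    return staged

raw = {"customer_no": "C100", "customer_name": "Acme Corp"}
print(stage_record(raw, record_source="CRM.CUSTOMER"))
```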
1.2.11. Costs
Another challenge in data warehousing is to keep costs as low as possible because IT in general is considered a cost factor by management. The cost of a data warehouse is influenced by many factors, ranging from the cost of storage to the cost of low quality and bad planning. Another cost factor is that business requirements change over time, requiring the data warehouse to adapt to these changed requirements.
The cost of storage is an often unaccounted-for cost factor in data warehousing. At the beginning of a data warehouse project, the costs are typically low. If the data warehouse started as a shadow IT project, i.e., a project driven by the business, implemented by external IT consultants and bypassing internal IT, the costs might even have been hidden in the budget of another project or activity. However, when some time has passed and the amount of data that is processed by the data warehouse has increased, the storage cost increases as well. In some cases, this happens exponentially and does not only include the costs of adding new disks. If more data is added to the data warehouse, faster network access is required to access the data; more computing power is required to process the data; and better (and more expensive) hard disk controllers are required to access the disks
[9, p335ff].
However, ever-increasing storage costs are not the biggest cost factor in data warehousing. Other cost factors include:
• Cost of low quality
• Cost of bad planning
• Cost of changing business requirements (see next section as well)
The cost of low quality and bad planning is an even bigger factor: even if the project team has carefully planned the data warehouse and ensured its quality, there is nothing it can do about changing business requirements, except anticipatory planning
[21, p335].
This is particularly evident when the business requirements are located upstream of the data warehouse. As introduced earlier, this not only negatively affects performance, but it also drives up the cost of maintenance. Business requirements should not be embedded in the data warehouse loading cycle but rather should be moved downstream to the data mart loading – closer to the business users. This allows the team to be agile and control maintenance and development costs (through auto-generation) and provides better, more rapid response to changing business requirements. In other words, it controls costs of data mart production as well.
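A minimal Python sketch of this separation: the warehouse layer keeps the raw data unchanged, and a (hypothetical) business rule is applied only when the data mart is built, so a rule change never touches the warehouse loading cycle:

```python
# Raw data stored unchanged in the warehouse layer.
raw_warehouse_rows = [
    {"customer_no": "C100", "revenue": 120.0},
    {"customer_no": "C200", "revenue": 230.1},
]

def business_rule_segment(revenue: float) -> str:
    """Changeable business rule, applied at data mart load time only."""
    return "key account" if revenue >= 200 else "standard"

# Downstream data mart load: apply the rule while shaping the mart.
data_mart = [
    {**row, "segment": business_rule_segment(row["revenue"])}
    for row in raw_warehouse_rows
]
print(data_mart)
```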
The agility of the team is directly proportional to the amount of complexity built into the data handling processes. By separating the complex business requirements into respective components, multiple loading sections of the architecture become streamlined; to a point where a majority of the implementation can actually be generated. The mechanics of this separation provide for extreme agility when responding to business requirement changes.
1.2.12. Other Business Requirements
Today’s business environment is characterized by rapidly changing conditions and uncertainty. Therefore, it is common that business requirements change quite frequently. Data warehouse developers try to prevent changes to the data warehouse by careful planning and anticipatory design. This approach often follows traditional waterfall software development methods. In such approaches, there are often four phases
[22, p162f]:
1. Setting up the requirements for the data warehouse.
2. Architectural planning and design of the data warehouse.
3. Development of the data warehouse.
4. Testing of the data warehouse.
In contrast, agile software development methods have been designed to improve software by using customer feedback to converge on solutions
[23]. To support this requirement, the data warehouse must be adaptive and resilient to change
[12, p3]. A change to the existing data warehouse structures should not invalidate existing data or applications. One of the major advantages of agile methods is the ability to react quickly to business changes, as we will learn in
Chapter 3, Data Vault 2.0 Methodology.
To support both the data warehouse engineers as well as the data warehouse business users, a set of tools to query, analyze and present information is required
[12, pxxiv]. Examples include reporting tools, query analyzers, OLAP (on-line analytical processing) browsers, data mining tools, etc. Microsoft SQL Server 2014 includes these tools out-of-the-box.
Another business requirement is the ability of the project team to cope with the natural turnover of team members. An important success factor in data warehousing is to keep the knowledge and skills of the data warehouse team members within the team, regardless of the retirement or withdrawal of key members. Solutions for this include a well-documented data warehouse system and an easily understandable design. Another solution is to use business intelligence (BI) solutions from a major vendor, such as Microsoft, that is well known in the industry and supported by other vendors and consultancy firms
[20, p9].
These requirements are addressed by the major components of Data Vault 2.0 and the innovations contained within it. DV2.0 addresses Big Data, NoSQL, performance, team agility, complexity, and a host of other issues by defining standards and best practices around modeling, implementation, methodology, and architecture.
1.3. Introduction to Data Vault 2.0
Data Vault really represents a system of business intelligence. The true name of the Data Vault System is: Common Foundational Warehouse Architecture. The system includes a number of aspects that relate to the business of designing, implementing, and managing a data warehouse. A bit of historical research into Data Vault 1.0 shows that it is highly focused on Data Vault Modeling, that is to say, a dedication to the physical and logical data models that construct the raw enterprise data warehouse. Data Vault 2.0, on the other hand, has expanded and includes many of the necessary components for success in the endeavor of data warehousing and business intelligence. These components are:
• Data Vault 2.0 Modeling – Changes to the model for performance and scalability
• Data Vault 2.0 Methodology – Following Scrum and Agile best practices
• Data Vault 2.0 Architecture – Including NoSQL systems and big-data systems
• Data Vault 2.0 Implementation – Pattern-based automation and generation, CMMI Level 5
Each of these components plays a key role in the overall success of an enterprise data warehousing project. These components are combined with industry-known and time-tested best practices ranging from CMMI (Capability Maturity Model Integration), to Six Sigma, TQM (total quality management) and PMP (Project Management Professional). Data Vault 2.0 modeling now includes changes that allow the models to interact seamlessly with (or live on) NoSQL and Big Data systems. Data Vault 2.0 Methodology focuses on 2 to 3 week sprint cycles with adaptations and optimizations for repeatable data warehousing tasks. Data Vault 2.0 Architecture includes NoSQL, real-time feeds, and big data systems for unstructured data handling and big data integration. Data Vault 2.0 Implementation focuses on automation and generation patterns for time savings, error reduction, and rapid productivity of the data warehousing team.