Chapter 5
Business Analytics at the Data Warehouse Level

During the last couple of years, a lot of changes have happened at the data warehouse level, and we can expect many more changes in the future. One of the major changes is captured by the phrase Big Data. The report that popularized this term came from the McKinsey Global Institute in June 2011. The report also addressed the concern about a future shortage of skilled analysts, which we will discuss in the next chapter. In this chapter we will focus only on the data warehousing aspects of the Big Data term.

The Big Data phrase was coined to put focus on the fact that there is more data available for organizations to store and commercially benefit from than ever before. Just think of the huge amounts of data provided by Facebook, Twitter, and Google. Often, this oversupply of data is summed up in 3 Vs, standing for high volume of data, high variety of data types, and high velocity of data generation. More cynical minds may add that this has always been the case; it has just become clearer to us now that the digitalization of the process landscape has shown us what the data can be used for.

The huge amount of data may lead to problems. One concrete example of a data problem most companies are facing is multiple data systems, which leads to data-driven optimization being carried out per process and never across the full value chain. This means that large companies, which in relative terms invest the most in data, cannot realize their scale advantages based on data. Additionally, many companies still suffer from low data quality, which makes the business reluctant to trust the data provided by its data warehouse section. In addition, the business typically does not realize that the data warehouse section only stores the data on behalf of the business, and that data quality is hence a problem the business must solve itself. The trend is, however, positive, and we see more and more cases where the ownership of each individual column in a data warehouse is assigned to a named responsible business unit, based on who will suffer the most if the data quality is low.

Another trend we see is symbolized by the arrival of a little yellow toy elephant called Hadoop. This open-source distributed file system is free and allows organizations to store and process huge amounts of raw data at a relatively low cost. Accessing the data stored via these distributed file systems is, however, not easy, which means that there are still additional costs associated with using the data for traditional BI reporting and operational systems. But at least organizations can now join the era of Big Data and store social media information, Web logs, reports, external databases dumped locally, and the like, and analyze this data before investing more in it.

Another newer area is the increased use of cloud computing. This term means that many systems are moved away from on-premises installations (in the building) to external Web servers. However, data privacy, legislation, and other operational constraints often still make it necessary for data to be stored on premises in the individual organization.

In Chapter 4, we looked at the processes that transform raw warehouse data into information and knowledge. Later, in Chapter 6, we will look at the typical data creating source systems that constitute the real input to a data warehouse.

In this chapter, we discuss how to store data to best support business processes and thereby the requirement for value creation. We'll look at the advantages of having a data warehouse and explain the architecture and processes in a data warehouse. We look briefly at the concept of master data management, too, and touch upon service-oriented architecture (SOA). Finally, we discuss the approaches to be adopted by analysts and business users toward different parts of a data warehouse, based on which information domain they wish to use.

WHY A DATA WAREHOUSE?

The point of having a data warehouse is to give the organization a common information platform, which ensures consistent, integrated, and valid data across source systems and business areas. This is essential if a company wants to obtain the most complete picture possible of its customers.

To generate a 360-degree profile of our customers based on the information we already hold about them, we have to join information from a large number of independent systems, such as:

  • Billing systems (systems printing bills)
  • Reminder systems (systems sending out reminders if customers do not pay on time, and credit scores)
  • Debt collection systems (status on cases that were outsourced for external collection)
  • Customer relationship management (CRM) systems (systems for storing history about customer meetings and calls)
  • Product and purchasing information (which products and services a customer has purchased over time)
  • Customer information (names, addresses, opening of accounts, cancellations, special contracts, segmentations, etc.)
  • Corporate information (industry codes, number of employees, accounting figures)
  • Campaign history (who received which campaigns and when)
  • Web logs (information about customer behavior on our portals)
  • Social network information (e.g., Facebook and Twitter)
  • Various questionnaire surveys carried out over time
  • Human resources (HR) information (information about employees, time sheets, their competencies, and history)
  • Production information (production processes, inventory management, procurement)
  • Generation of key performance indicators (KPIs; used for monitoring current processes, but can be used to optimize processes at a later stage)
  • Data mining results (segmentations, added-sales models, up-sale models, and loyalty segmentations, all of which have their history added when they are placed in a data warehouse)

As shown, the business analytics (BA) function receives input from different primary source systems and combines and uses these in a different context than initially intended. A billing system, for instance, was built to send out bills, and when they have been sent, it's up to the reminder system to monitor whether reminders should be sent out. Consequently, we might as well delete the information about the bills that were sent to customers if we don't want to use it in other contexts. Other contexts might be profit and loss, preparing accounts, monitoring sales, value-based segmentation, or activity-based costing—contexts that require the combination of information about customers across our primary systems over time and that make this data available to the organization's analytical competencies. BA is not possible without access to a combined data foundation from the organization's data-creating source systems. That combined data foundation is exactly what a data warehouse provides.

A data warehouse consists of a technical part and a business part. The technical part must ensure that the organization's data is collected from its source systems and that it is stored, combined, structured, and cleansed regardless of the source system platform. The business content of a data warehouse must ensure that the desired key figures and reports can be created.

There are many good arguments for integrating data into an overall data warehouse, including:

  • To avoid information islands and manual processes in connection with the organization's primary systems
  • To avoid overloading of source systems with daily reporting and analysis
  • To integrate data from many different source systems
  • To create a historical data foundation for data that may be changed or removed in source systems (e.g., saving orders historically, even if the enterprise resource planning [ERP] system “deletes” open orders on invoicing)
  • To aggregate performance and data for business needs
  • To add new business terms, rules, and logic to data (e.g., rules that do not exist in source systems)
  • To establish central reporting and analysis environments
  • To hold metadata documentation centrally as data is collected
  • To secure scalability to ensure future handling of increased data volumes
  • To ensure consistency and valid data definitions across business areas and countries (this principle is called one version of the truth)

Overall, a well‐planned data warehouse enables the organization to create a qualitative, well‐documented, true set of figures with history across source systems and business areas—and as a scalable solution.

ARCHITECTURE AND PROCESSES IN A DATA WAREHOUSE

The architecture and processes in an enterprise data warehouse (EDW) will typically look as illustrated in Exhibit 5.1. The exhibit is the pivot for the rest of this chapter.


Exhibit 5.1 Architecture and Processes in a Data Warehouse

As opposed to the approach we've used so far in this book, we will now discuss the data warehouse based on the direction in which data and information actually move (from the bottom up). Our point of departure in previous chapters has been the direction that is dictated by the requirements for information (from the top‐down). The bottom‐up approach here is chosen for pedagogical reasons and reflects the processes that take place in a data warehouse. This does not, however, change the fact that the purpose of a data warehouse is to collect information required by the organization's business side.

As is shown by the arrows in Exhibit 5.1, the extract, transform, and load (ETL) processes create dynamics and transformation in a data warehouse. We must be able to extract source data into the data warehouse, transform it, merge it, and load it to different locations. These ETL processes are created by an ETL developer.

ETL is a data warehouse process that always includes these actions:

  • Extract data from a source table.
  • Transform data for business use.
  • Load to target table in the data warehouse or different locations outside the data warehouse.

The first part of the ETL process is an extraction from a source table, staging table, or from a table within the actual data warehouse. A series of business rules or functions are used on the extracted data in the transformation phase. In other words, it may be necessary to use one or more of the transformation types in the following section.

Selection of Certain Columns To Be Loaded

It's necessary to choose which columns from the source should be loaded and which should not. In addition, one or more of the following transformation types are typically applied (a minimal SQL sketch follows the list):

  • Translating coded values. For example, the source system is storing “M” for man and “W” for woman, but the data warehouse wants to store the value 1 for man and 2 for woman.
  • Mapping of values. For example, mapping of the values “Man,” “M” and “Mr.” into the new value 1.
  • Calculating a new value. For example, sales = number × unit price.
  • Joining data from different sources. For example, via lookup or merge.
  • Summing up of several rows of data. For example, total sales for all regions.
  • Generating a surrogate key. This is a unique value attributed to a row or an object in the database. The surrogate key is not in the source system; it is attributed by the ETL tool.
  • Transposing. Changing multiple columns to multiple rows or vice versa.
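
As a minimal sketch of how several of these transformation types look in practice, the hypothetical SQL below translates coded values, maps several values into one, calculates a new value, and generates a surrogate key. The table and column names (src_sales, gender_code, and so on) are assumptions for illustration, and the exact syntax varies slightly between databases.

-- Hedged sketch: names are hypothetical; syntax is ANSI-style SQL.
CREATE TABLE stg_sales AS
SELECT
    ROW_NUMBER() OVER (ORDER BY s.transaction_id) AS sales_key,   -- surrogate key, not in the source
    CASE s.gender_code WHEN 'M' THEN 1
                       WHEN 'W' THEN 2 END        AS gender_id,   -- translating coded values
    CASE WHEN s.title IN ('Man', 'M', 'Mr.') THEN 1
         ELSE 2 END                               AS title_group, -- mapping several values to one
    s.number * s.unit_price                       AS sales_amount -- calculated value
FROM src_sales s;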

In the load phase of the ETL process, data is entered in the data warehouse or moved from one area of the data warehouse to another. There is always a target table filled with the results of the transformation in the load procedure. Depending on the organization's requirements, this process can vary greatly. For example, in some data warehouses, old data is overwritten by new data. Systems of a certain complexity are able to create data history simply by making “notes” in the data warehouse if a change occurs in the source data (e.g., if a customer has moved to a new address).

Exhibit 5.2 shows a simple ETL job, where data is extracted from the source table. The selected fields are then transferred to a temporary table (Temp), which, through the load object, is sent on to the target table (Staging) in the staging area. The transformation in this job is trivial, since it is simply a case of selecting a subset of the columns or fields of the source table. The load procedure of the ETL job may overwrite the old rows in the target table or insert new rows.

A process diagram of a simple ETL job with text in ovals and boxes connected from left to right by rightward arrows.

Exhibit 5.2 Example of a Simple ETL Job
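
A minimal SQL sketch of the job in Exhibit 5.2 could look as follows; the table names stg_customer and src_customer are hypothetical, and the commented-out TRUNCATE line shows the overwrite variant of the load.

-- Copy a subset of columns from a hypothetical source table to a staging table.
-- TRUNCATE TABLE stg_customer;   -- run first if old rows should be overwritten
INSERT INTO stg_customer (customer_id, customer_name, postal_code, load_date)
SELECT customer_id,
       customer_name,
       postal_code,
       CURRENT_DATE               -- record when the row was loaded
FROM   src_customer;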

A more complex part of an ETL job is shown in Exhibit 5.3. Here data is extracted from three staging tables. Note that only selected columns and rows are extracted with a filter function; an example of this could be rows that are valid only for a certain period. The three temporary tables in the center of Exhibit 5.3 are joined using Structured Query Language (SQL). SQL is a programming language used for manipulating data in a database or a data warehouse. The SQL join may link information about position (unemployed, employee, self-employed, etc.) to information about property evaluations and lending information. There may also be conditions (business rules) that filter out all noncorporate customers. The procedure is a transformation and joining of data, which ends up in a temporary table (Temp Table 4). The table with the joined information about loan applicants (again, Temp Table 4) then flows on in the ETL job with further transformations based on business rules, until it is finally loaded to a target table in the staging area, the actual data warehouse, or a data mart for reporting and analytics.

A process diagram of a part of ETL job with SQL join with three rows of text in ovals and boxes connected from left to right by rightward arrows.

Exhibit 5.3 Part of ETL Job with SQL Join
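
The SQL join at the center of Exhibit 5.3 could, as a hedged sketch with hypothetical table and column names, look like this:

CREATE TABLE temp_table_4 AS
SELECT p.customer_id,
       p.position_code,                    -- unemployed, employee, self-employed, etc.
       v.property_valuation,
       l.loan_amount
FROM   temp_position  p
JOIN   temp_valuation v ON v.customer_id = p.customer_id
JOIN   temp_lending   l ON l.customer_id = p.customer_id
WHERE  l.valid_from <= CURRENT_DATE        -- filter: only rows valid for the period
AND    l.valid_to   >= CURRENT_DATE
AND    p.customer_type = 'CORPORATE';      -- business rule: corporate customers only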

When initiating ETL processes and choosing tools, there are certain things to bear in mind. ETL processes can be very complex, and significant operational problems may arise if the ETL tools are not in order. Further complexity may be a consequence of many source systems with many different updating cycles. Some are updated every minute, and others on a weekly basis. A good ETL tool must be able to withhold certain data until all sources are synchronized.

The degree of scalability in the performance of the ETL tool in its lifetime and use should also be taken into consideration in the analysis phase. This includes an understanding of the volume of data to be processed. The ETL tool may need to be scalable in order to process terabytes of data, if such data volumes are included.

Even though ETL processes can be performed in any programming language, it's fairly complicated to do so from scratch. To an increasing extent, organizations buy ETL tools to create ETL processes. A good tool must be able to communicate with many different relational databases and read the different file formats that are used in the organization. Many vendors' ETL tools also offer data profiling, data quality, and metadata handling (we'll describe these processes in the following section). That is, a good tool must now cover a broader spectrum than extracting, transforming, and loading data.

The range of data values or the data quality in a data source may turn out to be poorer than designers expected when the transformation rules were specified. Data profiling of a source system is therefore recommended to verify that the transformations can handle all imaginable future data values.

Staging Area and Operational Data Stores

ETL processes transfer business source data from the operational systems (e.g., the accounting system) to a staging area, usually either raw and unprocessed or transformed by means of simple business rules. The staging area is a temporary storage area placed before the data warehouse (see Exhibit 5.1). Source systems use different database platforms and formats (e.g., relational databases such as Oracle, DB2, SQL Server, or MySQL, as well as SAS data sets or flat text files). After extraction, data is converted to a format that the ETL tools can subsequently use to transform this data. In the staging area, data is typically arranged as flat files in a simple text format or in the preferred format of the data warehouse, which could be Oracle. Normally, new data extracts or rows will be added to tables in the staging area. The purpose is to accumulate the history of the source systems.

In the staging area, many subsequent complex ETL processes may be performed; once developed, they are scheduled for execution with an operations management tool. The tables may be transformed hundreds of times on several levels before data is ready to leave for the actual data warehouse.

If the business needs to access data with only a few minutes' delay—for example, because the contents are risk figures calculated on the bank's portfolio values—it may make sense to implement an operational data store (ODS). This will enable business users to access this data instantly. Typically, it will not be a requirement that data in a data warehouse be accessible for business analyses until the following day, even though the trend of the future is real-time information. Pervasive BA, as we've mentioned earlier, requires real-time data from the data warehouse. The ETL jobs that update rows in a data warehouse and in data marts will usually run overnight and be ready with fresh data the next morning, when business users arrive for work. In some situations, however, instant access is required, in which case an ODS is needed.

In regard to digital processes like multichannel marketing systems and apps pulling operational data, the data will typically not be provided directly by the data warehouse, but by operational data platforms that manage the real-time interaction with customers. These interactions will, with some delay, be written to the data warehouse, just as the operational platforms will, with some delay, be populated from the data warehouse.

Causes and Effects of Poor Data Quality

Data quality is a result of how complete the data is, whether there are duplicates, and the level of accuracy and consistency across the overall organization. Most data quality projects have been linked to individual BA or CRM projects. Organizations know that correct data (e.g., complete and accurate customer contact data for CRM) is essential to achieve a positive return on these investments. Therefore, they are beginning to understand the significant advantage that is associated with focusing on data quality at a strategic level.

Data quality is central in all data integration initiatives, too. Data from a data warehouse can't be used in an efficient way until it has been analyzed and cleansed. In terms of data warehouses, it's becoming more and more common to install an actual quality checkpoint, or firewall, which ensures quality when data is loaded from the staging area to the actual data warehouse. To ensure that poor data quality from external sources does not destroy or reduce the quality of internal processes and applications, organizations should establish this data quality firewall in their data warehouse. Analogous to a network firewall, whose objective is to keep hackers, viruses, and other undesirables out of the organization's network, the data quality firewall must keep data of poor quality out of internal processes and applications. The firewall can analyze incoming data as well as cleanse data by means of known patterns of problems, so that data is of a certain quality before it arrives in the data warehouse. Poor data that cannot be cleansed will be rejected by the firewall. The proactive way to improve data quality is to subsequently identify poor data and either add new patterns to the cleansing procedures of the firewall or track the poor data back to its origin and communicate the quality problems to the data source owners.

Poor data quality is very costly and can cause breakdowns in the organization's value chains (e.g., no items in stock) and lead to impaired decision making at management and operational levels. Equally, it may lead to substandard customer service, which will cause dissatisfaction and cancellation of business. Lack of trust in reporting is another problem, one that will delay budgeting processes. In other words, poor data quality affects the organization's competitiveness negatively.

The first step toward improved data quality in the data warehouse will typically be the deployment of tools for data profiling. By means of advanced software, basic statistical analyses are performed to search for frequencies and column widths on the data in the tables. Based on the statistics, we can see, for example, frequencies on nonexistent or missing postal codes as well as the number of rows without a customer name. Incorrect values of sales figures in transaction tables can be identified by means of analyses of the numeric widths of the columns. Algorithms searching for different ways of spelling the same content are carried out with the purpose of finding customers who appear under several names. For example, “Mr. Thomas D. Marchand” could be the same customer as “Thomas D. Marchand.” Is it the same customer twice? Software packages can disclose whether data fits valid patterns and formats. Phone numbers, for instance, must have the format 311‐555‐1212 and not 3115551212 or 31 15 121 2. Data profiling can also identify superfluous data and whether business rules are observed (e.g., whether two fields contain the same data and whether sales and distributions are calculated correctly in the source system). Some programs offer functionality for calculating indicators or KPIs for data quality, which enable the business to follow the development in data quality over time.
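
In SQL terms, the simplest profiling checks can be expressed as frequency and range queries. The following is a sketch against hypothetical tables (dw_customer and dw_sales_transaction); dedicated profiling tools do the same kind of thing systematically across all columns.

-- Frequency of missing or empty postal codes
SELECT COUNT(*) AS missing_postal_codes
FROM   dw_customer
WHERE  postal_code IS NULL OR postal_code = '';

-- Number of rows without a customer name
SELECT COUNT(*) AS rows_without_name
FROM   dw_customer
WHERE  customer_name IS NULL;

-- Range check on a numeric column to spot suspicious sales values
SELECT MIN(sales_amount) AS min_sales,
       MAX(sales_amount) AS max_sales,
       AVG(sales_amount) AS avg_sales
FROM   dw_sales_transaction;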

Poor data quality may also be a result of the BA function introducing new requirements. If a source system is registering only the date of a business transaction (e.g., 12 April 2010), the BA initiative cannot analyze the sales distribution over the hours of the working day. That initiative will not be possible unless the source system is reprogrammed to register business transactions with a timestamp such as “12APR2010:12:40:31.” Data will now show that the transaction took place 40 minutes and 31 seconds past 12, on 12 April 2010. The data quality is now secured, and the BA initiative can be carried out.

Data profiling is thus an analysis of the problems we are facing. The next phase, the improvement of data quality, starts with the development of better data. In other words, this means correcting errors, securing accuracy, and validating and standardizing data with a view to increasing its reliability. Based on the data profiling, tools apply intelligent algorithms to cleanse and improve data. Fuzzy merge technology is frequently used here. Using this technology means that duplicate rows can often be removed, so that customers appear only once in the system. Rows without customer names can be removed. Data with incorrect postal codes can be corrected or removed. Phone numbers are adjusted to the desired format, such as XXX-XXX-XXXX.

Data cleansing is a process that identifies and corrects (or removes) corrupted or incorrect rows in a table. After the cleansing, the data set will be consistent with other data sets elsewhere in the system. Corrupted data can be a result of user entries or transmission errors. The actual data cleansing process may involve a comparison between entered values and a known list of possible values. The validation may be hard, so that all rows without valid postal codes are rejected or deleted, or it can be soft, which means that values are adjusted if they partly resemble the listed values. As mentioned previously, data quality tools are usually applied when data is moved from the staging area to the data warehouse. Simply put, data moves through a kind of firewall of cleansing tools. Not all errors, however, can be corrected by the data quality tools. Entry errors by users can be difficult to identify, and some of them will show up in the data profiling only as unusually high or low values. Missing data caused by fields that have not been filled in should be corrected by means of validation procedures in the source system (for details, see Chapter 6). It should not be optional, for instance, whether the business user in sales selects one individual customer or not.
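
A minimal cleansing sketch in SQL might combine a hard validation against a reference list with a simple standardization of phone numbers. Everything here is hypothetical: the table names, the assumption that phone_raw holds ten digits without separators, and the exact string functions, which differ between databases.

INSERT INTO dw_customer (customer_id, customer_name, postal_code, phone)
SELECT s.customer_id,
       s.customer_name,
       s.postal_code,
       SUBSTRING(s.phone_raw, 1, 3) || '-' ||
       SUBSTRING(s.phone_raw, 4, 3) || '-' ||
       SUBSTRING(s.phone_raw, 7, 4)       AS phone   -- force the XXX-XXX-XXXX format
FROM   stg_customer s
JOIN   ref_postal_code r
  ON   r.postal_code = s.postal_code                 -- hard validation: unknown postal codes are rejected
WHERE  s.customer_name IS NOT NULL;                  -- rows without a customer name are dropped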

The Data Warehouse: Functions, Components, and Examples

In the actual data warehouse, the processed and merged figures from the source systems are presented (e.g., transactions, inventory, and master data). A modern data warehouse typically works as a storage area for the organization's dimensions as well as a metadata repository. First, we'll look at the dimensions of the business, and then we'll explain the concept of the metadata repository.

From the staging area, the data sources are collected, joined, and transformed in the actual data warehouse. One of the most important processes is that the business's transactions (facts) are then enriched with dimensions such as organizational relationship and placed in the product hierarchy before data is sent on to the data mart area. This will then enable analysts and business users to prepare interactive reports via “slice and dice” techniques (i.e., breaking down figures into their components). As a starting point, a business transaction has no dimensions when it arrives in the data warehouse from the staging area. That means that we cannot answer questions about when, where, who, what, or why. A business transaction is merely a fact or an event, which in itself is completely useless for reporting and analysis purposes.

An example of a meaningless statement for an analyst is “Our sales were $25.5 million.” The business will typically want answers to questions about when, for what, where, by whom, for whom, in which currency? And dimensions are exactly what enable business users or the analyst to answer the following questions:

  • When did it happen? Which year, quarter, month, week, day, time?
  • Where and to whom did it happen? Which salesperson, which department, which business area, which country?
  • What happened? What did we make on which product and on which product group?

All these questions are relevant to the analyst.

Dimensional modeling is a popular way of organizing data in a data warehouse for analysis and reporting—and not without reason. The starting point is the previously listed transactions or facts. It may also be helpful to look at the organization's facts as events. These fact rows are enriched with dimensions in a data warehouse to provide perspective.

The dimensions in Exhibit 5.4 surrounding the facts or transactions put the sales figures, revenue figures, and cost figures into a perspective. This type of illustration is also called a star schema. Among other things, it gives business users and analysts the opportunity to get answers from the data warehouse such as these:

  • Our sales in product group 1 in December in the United States, measured in the currency U.S. dollars, were 2 million.
  • Sales in department 2 of business area 1 in the first quarter in Europe, measured in the currency euros, were 800,000.

Note that the dimensions answer questions about when, for what, where, for whom, and by whom. Business reality is viewed multidimensionally to create optimum insight. Generally speaking, the multidimensional perspective enables the business to answer the question: “Why did things turn out as they did?”

A diagram with five ovals with text titles Organization, Time, Market, Currency, Product around a box with the text title Facts. A hierarchy of items is listed under each title.

Exhibit 5.4 Fact‐Based Transactions Surrounded by Multidimensional Perspectives
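
As a hedged illustration of how the star schema in Exhibit 5.4 is queried, the sketch below joins a hypothetical fact table to its dimension tables to answer the first of the two example questions above (sales for product group 1 in December in the United States, in U.S. dollars):

SELECT SUM(f.sales_amount) AS sales
FROM   fact_sales   f
JOIN   dim_time     t ON t.time_key     = f.time_key
JOIN   dim_product  p ON p.product_key  = f.product_key
JOIN   dim_market   m ON m.market_key   = f.market_key
JOIN   dim_currency c ON c.currency_key = f.currency_key
WHERE  t.month_name    = 'December'
AND    p.product_group = 'Product group 1'
AND    m.country       = 'United States'
AND    c.currency_code = 'USD';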

Note the hierarchies in the dimensions in Exhibit 5.4. The organization consists, for instance, of a number of business areas. Under each of these areas, we've got a number of departments, and in each department, we've got a number of employees. These hierarchies provide us with the opportunity to slice and dice the information. A sales figure for the overall organization can be broken down into business areas. Each business area can then be broken down into departments, and the department figures can be broken down into individual employees. Note that these features are especially helpful when the business—on a daily basis—is analyzing information by itself and is therefore not drawing on quantitative analyst resources.
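
Slicing a figure down through such a hierarchy can be sketched with a ROLLUP grouping (part of the SQL standard and supported by most major databases); the organization dimension and its columns are again hypothetical:

SELECT o.business_area,
       o.department,
       o.employee_name,
       SUM(f.sales_amount) AS sales
FROM   fact_sales       f
JOIN   dim_organization o ON o.org_key = f.org_key
GROUP BY ROLLUP (o.business_area, o.department, o.employee_name);
-- Returns totals per employee, per department, per business area,
-- and a grand total for the overall organization.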

A modern data warehouse will normally contain a metadata repository. Here information is stored about business data. The simplest definition of metadata is data about data. For example: for a camera, data is a digital photo; metadata will typically contain information about the date the photo was taken, the settings of the camera, name of manufacturer, size, and resolution. Metadata facilitates the understanding of data with a view to using and managing data. Metadata has been given a central role as businesses are becoming increasingly demanding of documentation. Libraries have registered metadata about books to facilitate searches. This metadata includes title, genre, publishing year, author of the book, and so forth. Without metadata, it would be difficult or almost impossible to find the relevant data (books).

The documentation of data and tables is of equal importance, and the demands for metadata registration in the data warehouse have grown considerably in recent years. Previously, it was sufficient if tables and fields had meaningful names. The simplest way to create metadata about tables and fields is to give these meaningful names. For instance, consider a revenue table containing the two fields, Revenue and Time. That should make it obvious what the table contains! The problem is, however, that users other than the ones who made the table might interpret the contents of the revenue field differently. Is it revenue with or without value‐added tax (VAT)? Are discounts included in the figures? Is the revenue figure in U.S. dollars or euros? And those are just a few different interpretation possibilities.

Understandably, the previous metadata registration is no longer sufficient. A better metadata registration can be performed using labels on the fields of the tables. The revenue field could have a label explaining the contents precisely: revenue excluding VAT including discounts in U.S. dollars. That would increase the quality of the metadata, and the data could be used by other users. But we still have the problem that users must be able to search through tables for fields with, for instance, revenue contents (just as when we search for books at the library).

Many BA vendors have responded to customers' need for advanced metadata registration and search options for the average user. They have created a single metadata layer in text format (XML format, in fact) that points to physical tables, fields, users, servers, programs, and reports. This layer can be found in the metadata repository of the data warehouse (see Exhibit 5.5).

A diagram with two text boxes above and below and downward arrows in between for a metadata repository process.

Exhibit 5.5 The Metadata Repository

The metadata repository has become one of the BA vendors' most important upgrading and sales arguments—and the arguments are compelling. The metadata layer produces documentation about everything that goes on in the data warehouse and the front‐end portal. Some software developers are working along the lines that reports cannot be produced if they are not registered in the metadata repository of the data warehouse. Similarly, a physical table is not available to the reporting environment without metadata registration. The same situation occurs with users, servers, and so on. A metadata repository has become key, since all Web inquiries must go through the metadata layer via the metadata server. This results in visibility and documentation of everything that goes on, and this is considered increasingly important to the business. These days, it's almost unthinkable to build a data warehouse structure without a central metadata repository. In the top layer of the BA platform, where users access reports and data, this metadata repository also enables users to search on data definitions and reports from the Web interface as if they were books at the library.
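
Real metadata repositories are vendor-specific and often XML-based, but the search idea can be sketched with a hypothetical relational metadata table that catalogs tables, fields, and their labels:

SELECT table_name,
       column_name,
       column_label
FROM   meta_columns                              -- hypothetical metadata catalog table
WHERE  LOWER(column_label) LIKE '%revenue%';     -- "search the library" for revenue fields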

The users of Apple's iTunes software know about metadata registration in XML format. An XML file in an iTunes library on a personal computer contains all the information about tracks, albums, artists, and so forth. iTunes uses this vital file to navigate. If an iTunes user copies an MP3 music file with Windows Explorer to a physical music library outside iTunes, it won't appear in the user's iTunes music collection, and he or she won't be able to search for it or play it, because the existence of the file and its other data have not been registered as metadata via the iTunes software.

Data marts for the support of business processes are the end products delivered by the data warehouse and thus contain information for business users. A data mart is a specialized version of a data warehouse. Like a data warehouse, a data mart is a snapshot of operational data that helps business users make decisions or carry out strategic analyses (e.g., based on historical trends). The difference between a data mart and a data warehouse is that data marts are created based on the particular reporting needs of specific, well-defined user groups, and data marts provide easy business access to relevant information. A data mart is thus designed to answer the users' specific questions. Relevant dimensions for the business area have been linked to data, and specific business rules apply to help users move about in the desired dimensions and hierarchies. An organization may have several data marts for different functions, such as marketing, sales, finance, human resources, and others. A data mart is normally structured as a dimensional model such as a star schema, made up of fact tables and dimension tables and using specific business rules. An online analytical processing (OLAP) cube or a pivot table is a way of arranging data in arrays to facilitate quick data analyses, and it is often used for data marts. These arrays are called cubes.

An organization may have many data marts, each of which might be relevant to one or more business units for which they have been designed. Many business units have assumed “ownership” of their data marts, and this includes hardware, software, and data. This ownership enables each business unit or department or business area to use, manipulate, and develop their data to suit their needs, without changing any information in other data marts or centrally in the data warehouse.

Another reason for gathering data in small marts is that time is saved in connection with queries, simply because there is less data to process. This means, too, that two different data marts may present exactly the same information, except that one presents it in much more detail, which can then be used if the user decides he or she needs detailed information.

When data has been joined and enriched with dimensions in the data warehouse, data can be extracted for business use to data marts. These ETL processes will use many different business rules according to individual user needs. A data mart may cover the accounting function's need for a consolidated set of figures with the specific business rules required. Another data mart may cover the need for performance monitoring of the organization's sales processes.

As stated previously, the database for data marts may be relational or OLAP cubes. The functional difference between these two types of data is of great significance to analysts and business users, among other reasons because the difference affects response times and analytical scope. A relational data model is a model where data is organized using common characteristics. The order of the rows doesn't matter; only the number of rows is important because the number affects how quickly extracts can be performed. The order of the columns is of no importance, either. Transaction‐based tables are always relational, as explained in the following section.

The sales table in Exhibit 5.6 has seven columns and three rows, and is a simple example of what a relational transaction table looks like. In this kind of table, we can quickly add a new transaction from the shop, when another item is sold.

ID ISBN Title Type Date Number Price
1 1234 Peter Pan Paperback 23‐10‐2010 1  59.00
2 5678 The Hobbit Hardback 24‐10‐2010 1 159.00
3 9101 Moby Dick Paperback 25‐10‐2010 2  79.00

Exhibit 5.6 A Book Seller's Sales Table

For adding, processing, and extracting data from relational tables, we use the programming language SQL, which is a formalized way of talking to databases. If we want to know the book shop's revenue distributed on the types “paperback” and “hardback,” we can send the following SQL syntax off to the database:

Create Table REVENUE as
Select TYPE, SUM (NUMBER * PRICE) as REVENUE
From   Sales_Table 
Group by TYPE 
Order by TYPE

We will then receive a revenue dataset that looks like Exhibit 5.7.

Type Revenue
Hardback 159.00
Paperback 217.00

Exhibit 5.7 A Book Shop's Revenue

Large enterprises, such as Walmart, have several hundred million transactions a year, and it doesn't take a lot of imagination to see that response times in reporting can be very long if a preceding summary of the relational tables is not carried out, or if a reporting mart with a database in the shape of an OLAP cube is not created.

Once the OLAP cube is created, we can't just add another row, as with relational tables. By using this method of arranging data in cubes, we avoid the limitation of relational databases, as these are not suitable for instant (real‐time) analysis of large data volumes. The relational database is more suited for creating rows in a table of a series of transactions. Even though many reporting tools are developed for relational data, these are slow when it comes to summing up large databases. In an OLAP cube, all summing up and calculating are carried out beforehand; we just pull out a value from the cube, so to speak, and we therefore do not need to sum up something like a million rows in the extract.
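
The precalculation that an OLAP cube relies on can be sketched in SQL with a CUBE grouping, which materializes a summed figure for every combination of the chosen dimensions. The table and column names are hypothetical, and real OLAP engines use their own storage formats rather than a plain table:

CREATE TABLE cube_sales AS
SELECT t.month_name,
       p.product_group,
       m.country,
       SUM(f.sales_amount) AS sales
FROM   fact_sales  f
JOIN   dim_time    t ON t.time_key    = f.time_key
JOIN   dim_product p ON p.product_key = f.product_key
JOIN   dim_market  m ON m.market_key  = f.market_key
GROUP BY CUBE (t.month_name, p.product_group, m.country);
-- A request such as "coats in Denmark in July" then becomes a lookup of a
-- single precalculated row instead of a summation of millions of transactions.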

In the OLAP cube in Exhibit 5.8, each little subcube contains a sales figure that is calculated beforehand for the different sets of dimensional values. The sale of a certain product (coats) in a certain country (Denmark) within a certain period (the month of July) could be the little dark cube in Exhibit 5.8. The clever thing about the OLAP cube is that when the business user requests information about the sales of coats in Denmark in July, all the involved transactions do not need to be summed up. Instead the extract application runs straight into the cube via some index values, and extracts one single, precalculated, summed‐up figure, which is then returned to the user's client software.

A schematic diagram of a cube divided into smaller cubes with time, product, and country along the axes. Sale of coats in Denmark in July is marked with an arrow pointing at a little dark cube.

Exhibit 5.8 OLAP Cube with Sales Figures and Three Dimensions or Perspectives

The OLAP cube can be seen as an expansion of a two-dimensional spreadsheet. A controller (financial analyst) will need to analyze financial data by product, by period of time, by town, and by type of revenue or cost, and then compare actuals with budget figures. Each of these dimensions may have built-in hierarchies. The controller will begin at a summarized level (such as the total difference between actual revenue and budgeted revenue), and then perform a drill-down or slice-and-dice in the cube to discover the entities, products, salespeople, or periods of time that account for the difference in the total figures.

Note that the size of an OLAP cube increases exponentially when more dimensions are added to the cube or when the number of categories in individual dimensions grows; this naturally affects performance and response times.

Alternative Ways of Storing Data

As mentioned earlier, one of the newer trends in data warehousing is the emergence of open-source systems like Hadoop. These systems can store large amounts of data on multiple servers (one large file might be spread over 50 servers). Also, these systems replicate the data, so should one server fail, the data can be retrieved from another server; similarly, should one server be busy, the calculations can be done elsewhere, or in parallel, within the cluster of hardware that is managed by Hadoop's distributed file system. Hadoop systems do not move large datasets to the calculating software (the traditional way); they move the calculating software to the large data sets—and this saves time.

This new technology means that the cost of software and hardware, which historically has been high when building a data warehouse, might be reduced in the future.

Another benefit of Hadoop's distributed file system is that data can be stored quickly, as it will be in relatively raw format. The downside, however, is that data in this format will continue to be raw. Once we have stored the data, we will not be able to change it. In other words, it is a “write once, read many times” system.

With the relatively low cost of storing data using distributed file systems, organizations have a good way to store data that might be relevant in the future. Should the data, based on closer examination by data analysts and new business requirements, be considered valuable, it is likely to be moved into a more traditional data warehouse environment. One reason for moving this data into an ordinary data warehouse is that the data quality might need to be improved before the data can be used. Also, reporting from distributed file systems can be a complex affair, not suited for nonspecialists. In addition, a simple lookup query for a single case name (e.g., the last produced Peter Pan movie) might take hours in the Hadoop system, as the whole file needs to be read before an answer can be given. In the data warehouse environment, the file or table can be indexed, which significantly reduces the time it takes to find specific data points such as the case name in this example.
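
As a hedged sketch of how such raw data is typically queried, Hive exposes files in Hadoop's distributed file system through SQL-like statements; the path and fields below are hypothetical:

-- HiveQL: describe raw, tab-separated Web log files where they already sit
CREATE EXTERNAL TABLE raw_weblog (
    log_time    STRING,
    customer_id STRING,
    url         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Even a simple lookup scans the whole data set, which is why it can be slow
SELECT COUNT(*) FROM raw_weblog WHERE url LIKE '%peter-pan%';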

Business Analytics Portal: Functions and Examples

BA tools and portals aim to deliver information to operational decision makers. In most cases, this information comes from relational databases and OLAP cubes in the organization's data warehouse, and the information is presented to business users as scorecards, dashboards, conventional retrospective sales reports, business performance management (BPM) dashboards, and analytical applications. End users access the BA portal on top of the organization's data warehouse (see Exhibit 5.1). The BA portal constitutes a small part of the overall process to deliver BA decision support for the business. A rule of thumb is that the portal part constitutes only 15 percent of the work; 85 percent of the work lies in the data collection and processing in the data warehouse.

Over time, BA tools have developed from individual applications into critical plug-ins in the organization's global portals. The requirements for vendors of BA portals are therefore now focused on their ability to completely integrate all kinds of BA tools into one global portal, which then contains all relevant information for decision makers, whether they be employees, partners, customers, or vendors.

Developments are moving quickly toward global BA portals, and the key to a successful implementation of such a portal that completely integrates BA tools is to acknowledge that the portal is not just the launching pad for applications. The portal must deliver true business value and allow users access to business critical information in a dynamic and secure way. Further, the information must be available via different online and mobile units, all of which must be easy for end users to operate.

In the pages that follow, we'll provide examples of user‐friendly BA front ends and dashboards, which can be accessed on a modern BA portal like the SAS Information Delivery Portal. In Exhibit 5.9, we see a dashboard with graphics for the monitoring of the performance of product groups.


Exhibit 5.9 Business Performance Dashboard for KPI Monitoring

This BPM dashboard generated by the SAS Enterprise BI Server enables business users to follow product performance (actual profits vs. budget profits) on a monthly or quarterly basis, in total or distributed across different product groups, channels, and the geographical markets where the products are sold. The information in the dashboard is lag information by nature, but users can, with simple projections based on trends and colors, form an opinion about the need for future performance improvements. In this way, the information can move business processes forward, too (lead information). We then get an answer to the question: “What do we need to do tomorrow?”

Exhibit 5.10 shows a chart produced with SAS/ETS software. The software provides forecasts of demand for services so that organizations can maximize staff resources. It can automatically account for seasonal fluctuations and trends, and can select the best method for generating the demand forecasts. Efficient staff allocations mean customers' needs will be met with no wasted resources.

A plot with Date on the horizontal axis, Forecasts on the vertical axis, and a curve, dashed line, open circles, and shaded area in the plotted area. There is a legend at the bottom.

Exhibit 5.10 An Example of Forecasting with SAS/ETS Software

The software package includes both prebuilt and customized reports, allowing us to gain the most from our data. Built‐in analysis allows us to understand how visitors navigate our Web sites and how they flow through a user‐defined set of pages. In addition to showing the drop‐off at each step, the reports track users as they flow into, out of, and through the funnel. The report captures where visitors enter the funnel and where they go once they exit. Expandable and collapsible lists of pages show the most popular customer paths.

With text‐mining analytics, it is possible to classify documents into predefined or data‐driven categories, find explicit relationships or associations between documents, and incorporate textual data with structured inputs (see Exhibit 5.11). The dynamic exploration component helps us discover patterns in large document collections and allows us to combine those insights with our predictive analytics to gain maximum value from all of our information. By using text‐mining analytics, world leaders such as Netanyahu and Obama can find out what's being said about them in forums and lists and magazine articles and newspapers, as well as on Twitter and other social media sites—in real time—and can have it all boiled down and analyzed.


Exhibit 5.11 Screenshot of SAS Text Miner

To gain an edge in today's competitive market, powerful advanced analytic solutions are required to extract knowledge from vast stores of data and act on it. More and more organizations are turning to predictive analytics and data mining software to uncover patterns in data and discover hidden relationships. In Exhibit 5.12, the screenshot from SAS Enterprise Miner shows the identification of market basket profiles. We can also interactively subset the rules based on lift, confidence, support, and chain length.


Exhibit 5.12 Data Mining Software Identifying Market Basket Profiles

TIPS AND TECHNIQUES IN DATA WAREHOUSING

Master Data Management

Problems concerning data quality often lead to discussions about master data management (MDM). When the complexity of the business increases, and the data volumes explode, the business turns toward MDM as an intelligent way of consolidating and managing data. MDM provides a unified view of data, when data is integrated from different data sources. In organizations that have been growing for a long time, the different business areas will typically have developed different master data concerning customers, transactions, and products. The same applies to merging companies. The need for identical definitions will arise across business areas, national borders, and/or merging companies. For instance, the definition can include what precisely a customer is and which customer data is registered. Using MDM, the business can consolidate these sources to a master reference file, which then feeds information back to the applications. Accuracy and consistency are thus secured across the entire organization.
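
A minimal sketch of the consolidation step could be a full outer join of customer records from two hypothetical source systems into one master reference table; real MDM tools add matching, survivorship rules, and feedback to the source applications.

CREATE TABLE mdm_customer AS
SELECT COALESCE(c.customer_no, e.customer_no)     AS master_customer_no,
       COALESCE(c.customer_name, e.customer_name) AS customer_name,  -- prefer the CRM name if present
       c.segment,                                                     -- attribute owned by the CRM system
       e.credit_limit                                                 -- attribute owned by the ERP system
FROM   crm_customer c
FULL OUTER JOIN erp_customer e
  ON   e.customer_no = c.customer_no;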

Service‐Oriented Architecture

SOA is a way of thinking about how to use the organization's IT resources as services, with the objective of achieving overall business targets more efficiently. It is therefore not a product that can be purchased, but rather a design philosophy about how to structure a solution. SOA entails integration across systems. Each IT resource, whether it's an application, a system, or a database, can be reached via a service, and this service is made available via interfaces. Web services are an implementation form that uses specific standards and protocols when executed as SOA solutions.

SOA makes systems more flexible in terms of business needs, simpler to develop further, and easier to maintain and manage. Implementing solutions with an SOA facilitates the organization's planning for the future—including when changes occur—and helps it to respond proactively rather than reactively.

What specifically constitutes a service? A service is a program with which the user can interact through well‐defined standards for the exchange of messages. Services must be designed for stability and accessibility. There is no point in “making SOA for the sake of SOA.” SOA is undertaken to support the needs of the business. It is especially suitable for organizations that make data and applications available to a large number of customers. An enterprise like Weather.com makes its services available to many customers via SOA. When customers draw on these enterprises' services, they do so via Web interfaces and stable programs that are always available. Communication between systems takes place via well‐defined standards/protocols for the exchange of data. Instead of each customer developing various extracts from the Weather.com databases, Weather.com places an application at the customer's disposal that serves the client with data when a well‐defined set of parameters is received in the right protocol format.

In relation to our data warehouse model in Exhibit 5.1, SOA will be represented as interfaces from source systems inward, pointing to the staging area, or as interfaces from data marts toward users' applications via the BA platform. In addition to being easily accessible, the data must obviously be easily understandable, so users are given outlines of, say, temperatures and maps, and not just meteorological terminology and map coordinates from the easily accessible applications.

How Should Data Be Accessed?

In this chapter, we have looked at the typical ingredients in a data warehouse solution, from source data to the front‐end solution. If we now look at a data warehouse solution in relation to the different information domains that were introduced in Chapter 4, there are some correlations worth mentioning. There are two types of direct users of a data warehouse: the business user and the BA analyst. Since the BA analyst will always make his or her analyses based on business needs, the business user will always be the end user.

When the business user approaches a BA analyst for assistance in accessing the data warehouse, he or she may have two reasons for this. The first reason is that the business user may be looking for more than mere information; he or she may also be looking for an analyst to enrich and interpret this information—that is, to deliver information to the business. This naturally means that the analyst must have business insight, as explained in Chapter 4.

The other reason for a business user to draw on analytical resources is that he or she does not have access to the desired data and is looking for information. As illustrated in Exhibit 5.13, the analyst often has many points of access to the data warehouse. The reason is not that the analyst needs to know more than the rest of the business, but rather that a data warehouse is a dynamic entity that continually adapts to the needs of the business. And since it may take some time for the business to acknowledge and formulate a new need for information, and for the required information to be delivered as standard reports, there will be an implementation period where the analyst will be delivering the required information as manual reports.


Exhibit 5.13 The BA Analyst's Various Accesses to the Data Warehouse

Access to Business Analytics Portals

Analysts don't actually need access to BA portals because, if they have direct access to the data they are retrieving, they can access the data via their analytical programs. When someone from the business side then requests a report, the analyst can refer them to the data managers in the section of the data warehouse organization that is responsible for the development of reports. In smaller organizations, however, the roles of data manager and analyst will typically be performed by the same person, which already shows that it doesn't always make sense to separate the two roles. There are many other good reasons, however, why analysts, on an equal footing with business users, should have access to and be keen users of front-end solutions.

Analysts will often make considerable use of BA portals because they provide fast and easy access to data. Obviously, analysts are not using this data in connection with the continual development of manual reports, because if there is front-end access to the required information, end users can retrieve the data there themselves. An analyst will typically use a BA portal in connection with the development of new reports, scorecards, or dashboards, in cases where any of these are deemed too complex for the end user to develop. After development is completed, they are handed over to end users, who can then continually update the contents of the reports when new data is entered into the data mart, or on request.

Analysts use front‐end solutions, too, in connection with the validation of data drawn directly from the data warehouse. If an analysis is based on a linkage of five different extracts collected by means of SQL directly from the data warehouse, there will typically be hundreds of lines of programming to validate. This will, of course, not be a case of a one‐to‐one validation, because then the analyst would have used the front‐end solutions only. But it may be in connection with segmentations based on information that is available only from the data warehouse and not from the data marts, where we, for instance, want to validate that we have included all customers in our analysis. We therefore want to compare the number of segmented customers with the official number of customers. If these two figures are not the same, we have an error in our segmentation. Since analysts also frequently have a high degree of contact with the people who make decisions based on data warehouse information, analysts must train their end users in the functionality offered by the BA portal.
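
The count comparison itself is simple; as a hedged sketch with hypothetical table names, the analyst could run something like the following and investigate if the two figures differ:

-- Compare the number of customers in the segmentation extract with the
-- official customer count in the data mart; the figures should match.
SELECT 'Segmentation'  AS source, COUNT(DISTINCT customer_id) AS customers FROM analyst_segmentation
UNION ALL
SELECT 'Official mart' AS source, COUNT(DISTINCT customer_id) AS customers FROM mart_customer;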

Access to Data Mart Areas

It is tempting to argue that analysts do not need direct access to retrieve data from data marts if they already have access to data via BA portals. The thinking here is that it is possible to transfer data to the front end and save it as a flat file, which can then be transferred to a statistics or data mining program. However, several arguments speak against this. First, it's important to be able to automate processes, which is the opposite of manual processes where the analyst retrieves, saves, and imports data via physical routines. Second, in connection with data mining solutions, where the same data must be presented in the same way month after month, there are great savings to be had in automating such processes. It therefore becomes a question of what the analyst should be spending his or her time on: moving data around or analyzing it.

Some front ends also pose practical problems: they can return only a certain number of rows at a time, or they time the user out after a short period. Analytical programs do not have these restrictions; their only limits are hardware and bandwidth. Direct access to data marts therefore ensures that all rows are included. Another argument for giving analysts access to both front-end solutions and data marts is that a front end can function as an SQL generator. If SQL development is not exactly our analysts' strength, SQL statements can be generated in the front-end solution and copied into a statistics program, which can then reuse the front end's connection to the data mart. In this way we can automate data delivery via open database connectivity (ODBC) connections and keep our analysts focused on analytics, which is demanding enough. For data mining projects, where data access can consume considerable time, this specifically means that data access is faster if we rely solely on the easily accessible information in our data marts. Once our data mining processes are up and running, we can start collecting more raw information directly from the data warehouse. This approach also supports what is generally recommended when establishing BA projects: think big, start small, and deliver fast.
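The ODBC pattern described above might look like the following sketch, in which an SQL statement generated by the front end is pasted into a script and executed over the same data source name (DSN) the front end uses. The DSN, the query, and the table name are assumptions.

# Reusing a front-end-generated SQL statement over ODBC, so that all rows
# are retrieved without the row limits or timeouts of the portal itself.
# DSN, credentials, and query text are illustrative.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=DataMart;UID=analyst;PWD=secret")  # assumed ODBC data source

# SQL copied verbatim from the front-end tool's query viewer:
generated_sql = """
    SELECT customer_id, product_category, SUM(revenue) AS revenue
    FROM mart_sales
    GROUP BY customer_id, product_category
"""

df = pd.read_sql(generated_sql, conn)   # no row cap, no portal timeout
conn.close()
print(len(df), "rows retrieved for analysis")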

Access to Data Warehouse Areas

Access to these areas enables analysts to provide the business with answers it cannot otherwise get via the front-end solutions. The reason only analysts can reach some of the data is not a matter of withholding information. It is a matter of designing front-end solutions that present information in the most accessible way possible, based on users' needs and the general skill level of the business users.

A supermarket, for example, may report on revenue from different products, summarizing products into categories such as milk and butter. But because it would be too difficult for the user to maintain an overview, or because it would take up too much space in the data mart, we may choose to omit information about which particular milk or butter brand was sold. If we want that information, we have to drill right down into the data warehouse.
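In query terms, the difference between the two levels might look like the following sketch: one aggregated query against a mart table and one detailed query against the underlying warehouse tables. The table and column names (mart_category_sales, fact_sales, dim_product) are purely illustrative.

# Category-level figures as exposed in the data mart (what the portal shows):
mart_query = """
    SELECT product_category, SUM(revenue) AS revenue
    FROM mart_category_sales
    GROUP BY product_category
"""

# Brand-level detail only available by drilling into the data warehouse:
warehouse_query = """
    SELECT p.product_category, p.brand, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.product_category IN ('milk', 'butter')
    GROUP BY p.product_category, p.brand
"""
# Either string can be run the same way as the earlier sketches, e.g. via pd.read_sql.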

There are several reasons why a business wants answers to questions that cannot be resolved via the BA portal. Needs may change over time, questions may arise from ad hoc tasks that do not warrant regular reporting, or the business may want to implement complex analytical solutions. Data mining is such a case, where the person driving the solution is a data miner rather than a typical business user. Data mining is generally associated with a business that wants to use information as a strategic resource, not because the analytical method as such justifies it, but because a strategic decision underlies the investment in this competence area. As a result, data mining projects of any real complexity depend on the data miner or analyst working with direct access to the data warehouse. Furthermore, the results of the data mining process (segmentations and recommended actions in relation to customers) will usually be distributed via the data warehouse to the CRM systems whose users act on the data mining information.
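The last step, pushing results back so the CRM systems can act on them, could be sketched as follows. The scores themselves, the target table (dwh_customer_scores), and the assumption that the CRM integration reads from that table are all hypothetical.

# Writing data mining output (segments and recommended actions) back to a
# warehouse table from which CRM systems are assumed to pick them up.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://analyst@dwh-server/dwh")  # assumed connection

scores = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "segment": ["high_value", "churn_risk", "dormant"],
    "recommended_action": ["upsell", "loyalty_call", "reactivation_mail"],
})

# Replace last month's scores; CRM integration reads this table downstream.
scores.to_sql("dwh_customer_scores", engine, if_exists="replace", index=False)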

Access to Source Systems

Analysts do not always have access to source systems once a data warehouse has been established. Such access is usually not automated, which is why retrieving information directly from source systems is generally associated with considerable time consumption. In addition, data quality can vary greatly, depending on what the information is used for in the source system itself. Furthermore, a source system may suffer in performance if it is queried as if it were a data warehouse. The typical development is that when information of significant value is identified in a source system, a process is started to make that information accessible in the data warehouse. If the business has no data warehouse, analysts always work directly with the source systems, despite the weaknesses this entails. Over time, an analyst will usually want to carry out regular monthly runs for reporting; the data set generated by these runs can be seen as a data warehouse at its simplest level, and as something worth optimizing.
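A monthly run of this simplest kind might be sketched as below: a query against a source system whose result is appended to a local store that gradually takes on the role of a minimal data warehouse. The source connection, the orders table, and the local SQLite file are assumptions.

# A minimal "data warehouse" built from regular monthly runs against a
# source system. Connection details and table names are illustrative.
import sqlite3
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://reporting@erp-server/erp")  # assumed source system
local_dwh = sqlite3.connect("simple_dwh.sqlite")                 # local monthly store

def monthly_run(month: str) -> None:
    """Extract one month of orders from the source and append it locally."""
    orders = pd.read_sql(
        text("SELECT order_id, customer_id, order_date, amount "
             "FROM orders WHERE TO_CHAR(order_date, 'YYYY-MM') = :month"),
        source,
        params={"month": month},
    )
    orders.to_sql("orders_history", local_dwh, if_exists="append", index=False)

monthly_run("2023-01")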

In recent years, companies have come to realize the strengths and weaknesses of the Internet. There is a growing understanding that, for instance, customer behavior on a company's Web site is relevant CRM information, too. It is far from irrelevant for a telecom company to know whether a given customer has visited the company's Web site to find out how to cancel his or her subscription. If this information is combined with information about the customer's value, it can be decided whether a “loyalty call” should be made to the customer in question. In the long term, the company could structure its Internet portals so that the way users move around them affects the way those customers are treated. The Web site thereby becomes a questionnaire completed by the customer through his or her clicks, instead of merely providing information and automated services.
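A simple version of the combination described above, flagging valuable customers who have visited a cancellation page, might look like the following sketch. The web log extract, the customer value figures, the page path, and the value threshold are all assumptions.

# Flag "loyalty call" candidates: customers who visited the cancellation page
# and whose value exceeds a threshold. All names and figures are illustrative.
import pandas as pd

# Assumed extracts: one from the web log, one from the customer value model.
web_log = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "page": ["/account/cancel-subscription", "/support/faq", "/account/cancel-subscription"],
})
customer_value = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "annual_value": [5400, 300, 80],
})

cancel_visits = web_log[web_log["page"] == "/account/cancel-subscription"]
candidates = cancel_visits.merge(customer_value, on="customer_id")
loyalty_calls = candidates[candidates["annual_value"] > 1000]   # assumed threshold

print(loyalty_calls[["customer_id", "annual_value"]])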

Generally speaking, though, Web log files do not yet provide companies with much information about how customers and users actually use their Web portals. There are exceptions, however, particularly in advertising-driven businesses such as Amazon.com and a few search engines, which specifically and successfully use Web information as a strategic resource that gives them a competitive advantage. In these extreme cases, it clearly is possible to collect and use this information to drive the company's strategy.

SUMMARY

In this chapter, we have discussed how to store data to best support business processes and thereby the request for value creation. We also looked into the architecture and processes in a data warehouse.

BA is not possible without access to a combined data foundation from the organization's data‐creating source systems. In fact, that is exactly what a data warehouse does: It increases the usability and availability of source data, as will be explored further in Chapter 6.

One central enterprise data warehouse ensures consistent, integrated, and valid data definitions across business areas and countries. This principle of one version of the truth is fundamental if companies are to avoid spending time reconciling contradictory reports and diverging business plans (budgets).
