CHAPTER 10
Data Development: Making It Organized

Data is the lifeblood of any analytical exercise and usually one of the bigger challenges. Sourcing, organizing, and stitching together data is typically where a large amount of time is spent in building an analytical solution. During these steps of pulling together the datasets, the quality of the data will be key. If the data is missing, incorrect, or inconsistent, the results of the analysis will be partial, or worse, incorrect. Once the data is compiled, determining the right analytical structure is important for performance, integrity, and scalability. Application of business rules and transformation of fields are also concepts that need to be addressed in order to make various datasets suitable for analysis. In this chapter we will cover several of the basic data concepts important to analytics—quality, type, organization, and transformation.

Data Quality

The quality of the data may be the most important factor in determining the ability to produce usable insights from a dataset. The source of the data is often a vital indicator of the quality of the data. Is the data from an enterprise resource planning (ERP) system or a legacy system prone to human error? Does the data entry system allow free-form text or is it mostly lists the user selects from? Is the data structured or unstructured? All of these questions play a key role in determining the quality of the dataset and its reliability.

The core tenets of data quality include: completeness, consistency, duplication, conformity, accuracy, integrity, timeliness, availability, and history. Let's cover each of these concepts using the datasets in Tables 10.1 and 10.2 as examples.

Table 10.1 Customer Record

Customer ID | First Name | Last Name | Email | Mobile Phone | Home Address | City | State
598234 | Jerry | Twain | [email protected] | 404–927–5555 | 131 Buckhead Street | Atlanta | GA
598235 | Tricia | Hemingway | | | 455 Finley | Havana | La Havana
598236 | George | King | | 212–499–5555 | 949 Broadway, Apt 6 | New York | NY
598237 | Jenny | Dickens | [email protected] | 415–809–5555 | | |
598238 | Jane | Orwell | [email protected] | | 333 Main Street | Boston | Massachusetts
598239 | Olivia | Faulkner | [email protected] | | 5001 Denny Way | Seattle | Washington
598240 | Grant | Woolf | [email protected] | 310–290–5555 | 322 Hoover St | Los Angeles | CA
598241 | Gale | Rowling | [email protected] | | 2950 Landis St | San Diego | CA
598242 | Tommy | Austen | [email protected] | 310–922–5555 | 323 Hoover St | Los Angeles | California
598240 | Jerry | Twain | [email protected] | 404–927–5555 | 3434 Marietta Blvd | Atlanta | GA

Table 10.2 Transaction History

Transaction ID | Customer ID | Product Description | Quantity | Price | Total Amount
45077–01 | 598241 | Nine West Blazer | 1 | $213.00 | $213.00
45077–02 | 598239 | Calvin Klein Dress | 1 | $158.00 | $158.00
45077–03 | 598239 | Kate Spade Shoes | 1 | $312.00 | $312.00
45077–04 | 598235 | Ralph Lauren Polo Shirt | 1 | $56.00 | $56.00
 | | Tommy Hilfiger Tank Top | 1 | $34.00 | $34.00
45077–06 | 598235 | INC Socks | 3 | $5.50 | $16.50
45077–07 | 598235 | Ray Ban Sunglasses | 1 | $110.00 | $110.00
45077–08 | 598242 | Nike Shorts | 1 | $42.00 | $42.00
45077–09 | 598242 | Nike Dryfit Shirt—White | 1 | $32.00 | $32.00
45077–10 | 598242 | Nike Dryfit Shirt—Blue | 1 | $35.00 | $35.00
45077–11 | 598241 | INC Blouse | 2 | $46.00 | $46.00
45077–12 | 598241 | Charter Club Blouse | 1 | $35.00 | $35.00
45077–13 | | Bass Loafers | 1 | $225.00 | $225.00
45077–14 | 598234 | Nautica Sweater | 1 | $77.00 | $77.00
45077–15 | 598234 | Perry Ellis Collared Shirt | 1 | $69.00 | $69.00

Completeness

The completeness of the data refers to any missing or partial data in the dataset. This can be due to a number of factors, including source systems that allow free-form text entry. Another common source of incomplete data is fields that should be mandatory in the source system but are left optional, allowing users to skip them.

In Table 10.1, we see that email addresses are inconsistently captured and phone information is sparse. We can also see that some of the address information in Table 10.1 and some of the transactional keys in Table 10.2 are incomplete. The missing Customer IDs in particular present a challenge we return to later in this chapter. These gaps are what we mean when we talk about the completeness of the data.
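
A quick completeness profile can be scripted before any analysis begins. The sketch below is illustrative only: it assumes pandas, and the column names and sample values are hypothetical stand-ins for Table 10.1.

```python
import pandas as pd

# Hypothetical stand-in for a few fields of Table 10.1.
customers = pd.DataFrame({
    "customer_id":  [598234, 598235, 598236, 598237],
    "email":        ["jerry@example.com", None, None, "jenny@example.com"],
    "mobile_phone": ["404-927-5555", None, "212-499-5555", "415-809-5555"],
    "city":         ["Atlanta", "Havana", "New York", None],
})

# Count and percentage of missing values for each field.
missing_counts = customers.isna().sum()
missing_pct = (customers.isna().mean() * 100).round(1)
print(pd.DataFrame({"missing": missing_counts, "pct_missing": missing_pct}))
```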

Consistency

Data consistency requires that values be represented the same way throughout the dataset and be valid according to defined rules. For example, a Social Security number is defined as a nine-digit number, and it should appear in that form in every database that stores it.

Consistency is not always as cut-and-dried as a Social Security number. For example, a metric definition for Average Transaction may be different in two parts of the organization. One department may view Average Transaction as the quantity of items purchased in an average transaction or its basket size; for instance, the average basket size is 2.7 items. Another department may view Average Transaction as the average dollar amount in a purchase, for instance, $83.69.

While technically both definitions may be correct, it is easy to see how the vagueness of the metric definition may produce two different interpretations.

Duplication

Data duplication refers to multiple records in a database that are exact or partial duplicates of each other. For example, two records of the same customer might exist in the same dataset. The problem with duplicate data is that the data consumer is unsure which one is correct. Duplication of data can occur often in a database for a number of reasons, such as the architecture of the data storage system, a source system that allows for multiple entries, and bad joins when connecting data.

For example, in the Customer Record dataset in Table 10.1, we observe two records for the same customer, Jerry Twain. We can infer that these are likely the same person because the mobile number matches; otherwise we might treat them as two different individuals. Since we now have two records for the same person, which one is the correct version? It appears that Jerry moved at some point, but we do not know which address is current, 131 Buckhead Street or 3434 Marietta Blvd.
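
A first pass at finding such duplicates is to match on descriptive fields rather than on the Customer ID. The sketch below assumes pandas and uses hypothetical stand-in values for Table 10.1.

```python
import pandas as pd

# Hypothetical stand-in for Table 10.1 with just the fields needed for matching.
customers = pd.DataFrame({
    "customer_id":  [598234, 598240, 598241],
    "first_name":   ["Jerry", "Jerry", "Gale"],
    "last_name":    ["Twain", "Twain", "Rowling"],
    "mobile_phone": ["404-927-5555", "404-927-5555", None],
})

# Flag records that share a name and mobile phone even though their IDs differ.
dup_mask = customers.duplicated(subset=["first_name", "last_name", "mobile_phone"], keep=False)
print(customers[dup_mask])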

Conformity

Conformity refers to how well the data adheres to standards. For example, if we were to refer to a country code for China as CHN, but in other parts of the data the name of the country is spelled out as China, this makes sorting, grouping, and comparing the data difficult.

In Table 10.1, we see that the standards for state are not consistent in the dataset. A few of the records refer to California as CA while other records spell out the full name. If we attempt to do any type of aggregation or sorting based on state, we will end up with inaccurate and nonconforming results.
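
One common remedy is a standardization map applied before any grouping or sorting. The sketch below assumes pandas; the mapping shown covers only the states in our example and is not a complete lookup.

```python
import pandas as pd

# Hypothetical mapping from spelled-out state names to two-letter codes.
STATE_MAP = {"California": "CA", "Massachusetts": "MA", "Washington": "WA"}

states = pd.Series(["CA", "California", "Massachusetts", "Washington", "GA"])

# Replace full names with their abbreviations; values already in code form pass through.
standardized = states.replace(STATE_MAP)
print(standardized.value_counts())
```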

Accuracy

When data is inaccurate, it is difficult for people to trust it. For data to be accurate, it must be consistent, free of duplicates, and complete. If there is a formula error or an incorrect descriptive element such as an address, Zip code, or phone number, the user of the information will not know which records are correct and which are inaccurate.

In Table 10.2, we see a lot of missing data along with some mathematical issues where it appears that the point-of-sale system did not record a transaction correctly. In transaction 45077–11 we see that Gale purchased two INC Blouses, but was only charged for one. This is a lost revenue opportunity as well as a calculation issue that makes one doubt the accuracy of the Total Amount field.
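
A simple arithmetic audit can surface this kind of accuracy issue. The sketch below assumes pandas and uses a hypothetical slice of Table 10.2 to flag rows where Quantity times Price disagrees with the recorded Total Amount.

```python
import pandas as pd

# Hypothetical slice of Table 10.2 to illustrate a simple arithmetic audit.
txns = pd.DataFrame({
    "transaction_id": ["45077-06", "45077-11"],
    "quantity":       [3, 2],
    "price":          [5.50, 46.00],
    "total_amount":   [16.50, 46.00],
})

# Recompute the expected total and flag rows where the recorded total disagrees.
txns["expected_total"] = txns["quantity"] * txns["price"]
mismatches = txns[~txns["expected_total"].eq(txns["total_amount"])]
print(mismatches)  # 45077-11 surfaces: 2 x $46.00 recorded as $46.00
```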

Integrity

The integrity of data refers to the accuracy and consistency of the data over its lifecycle, ensuring that when the data moves from system to system it maintains a level of quality and standardization. This is typically performed by keys that are unique to that record that enable it to connect to other tables or datasets. There are two main types of data integrity: entity integrity and referential integrity. Let's explore both.

Entity integrity is concerned with the use of a primary key, a value that uniquely identifies a record and cannot be assigned to any other record. Referential integrity requires that a key value used in one table to reference another table actually exist in that other table, so the two tables can be reliably linked. The primary key is the principal tool for linking datasets or tables within a database; every record must be populated with a unique primary key, and null or duplicate values are not allowed.

For example, in Table 10.2, we see that not all records have a Transaction ID, which is the primary key for this table. This should never be allowed in a dataset as we do not have a reference value for the record and will not be able to guarantee that it is unique.

When combining datasets or tables with one another, a foreign key is often employed. A foreign key is a field in one table that uniquely identifies a record in another table or dataset. It helps connect one table with another while keeping the record values unique.

For example, when looking at the dataset in Table 10.2, we see that the Customer ID field has several missing Customer IDs. This is unfortunate as we no longer have a reference for these transactions to know who purchased what products.
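
Both integrity checks can be automated. The sketch below, assuming pandas and hypothetical stand-ins for Tables 10.1 and 10.2, tests entity integrity (no null or duplicate primary keys) and referential integrity (every non-null Customer ID in the transactions resolves to a customer record).

```python
import pandas as pd

# Hypothetical stand-ins for Tables 10.1 and 10.2.
customers = pd.DataFrame({"customer_id": [598234, 598235, 598241, 598242]})
txns = pd.DataFrame({
    "transaction_id": ["45077-01", None, "45077-13"],
    "customer_id":    [598241, None, None],
})

# Entity integrity: the primary key must be present and unique.
print("null primary keys:", txns["transaction_id"].isna().sum())
print("duplicate primary keys:", txns["transaction_id"].dropna().duplicated().sum())

# Referential integrity: every non-null Customer ID must exist in the customer table.
orphans = txns["customer_id"].notna() & ~txns["customer_id"].isin(customers["customer_id"])
print("missing customer IDs:", txns["customer_id"].isna().sum())
print("orphan customer IDs:", orphans.sum())
```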

Timeliness, Availability, and History

Depending on the use of the data, the timeliness, availability, and history can be key factors. The timeliness of the data refers to how often the data is updated. If the dataset is updated only once a year, the data may become stale toward the end of the year, but may still be suitable for long-term trend analysis.

The availability of the data refers to how often it will be accessible for analysis. Is the data available for a one-time pull? Can it be pulled on demand, or is it generated according to a schedule? Depending on the type of analysis you want to perform, availability can determine your ability to analyze the data at all. For example, suppose our analysis needs daily updates to capture the most recent purchases, but due to system availability we can only access the data monthly. This could severely limit the type of analysis and responses we are able to assemble.

Lastly, the history of the data refers to how much history is available. Is the volume of transactions so large that the system only keeps 60 days' worth of data, or is it an ERP system that has more than 10 years of data? Data science typically drives a need for more history rather than less. If there is not much history to pull from, it may be hard to forecast or predict future events.

Dirty Data, Now What?

Now that we understand the importance of quality data, there are several tools available to assist with de-duplication of data or matching to solve for which record is the correct one. These tools also can transform the data based on business rules. For example, you might have a business rule that converts any fully spelled-out state into its abbreviation, such as California to CA.

These tools fall into the category of data cleansing or ETL tools. Typical tools include DataStage, Pentaho, Jaspersoft, Informatica, Ab Initio, Talend, KETL, and many others.

Data Types

Metadata

Before we dive into the rich and various types of data we can encounter, let's take a minute to talk about metadata. Metadata is a set of information about data, or simply put, data about data. In order to understand a particular element in a record, we need to understand how to interpret the data element. Metadata provides the information that helps someone interpret a piece of information.

According to Ralph Kimball, there are two types of metadata: technical and business. Business metadata is the definition of the element, business purpose, and calculation. This definition is focused more on the business user to provide an understanding of what the intent of the data element is for business use.

Example of Business Metadata
Name of Field: Customer ID
Business Purpose: Unique identity for each customer
Calculation: Sequentially assigned number based on when the customer either purchases a product or signs up for a loyalty program

Technical metadata comprises the technical components of a data element such as database location, granularity, indexes, data type, column name, and relationship between tables. Technical metadata is more relevant to the IT department's use of the data element.

Example of Technical Metadata
Data Type: Integer
Length: 9 digits long
Primary Key: Yes
Location: Customer Record
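
If it helps to see business and technical metadata side by side, here is one possible, purely illustrative way to catalog them together in code; the class and field names are assumptions, not a standard.

```python
from dataclasses import dataclass

# Hypothetical catalog entry combining business and technical metadata.
@dataclass
class MetadataEntry:
    field_name: str          # e.g., "Customer ID"
    business_purpose: str    # business metadata
    calculation: str         # business metadata
    data_type: str           # technical metadata
    length: int              # technical metadata
    is_primary_key: bool     # technical metadata
    location: str            # technical metadata

customer_id_meta = MetadataEntry(
    field_name="Customer ID",
    business_purpose="Unique identity for each customer",
    calculation="Sequentially assigned at purchase or loyalty sign-up",
    data_type="Integer",
    length=9,
    is_primary_key=True,
    location="Customer Record",
)
print(customer_id_meta)
```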

Metadata is an important resource for the business and IT organizations to govern and standardize across functional areas within a company. Existing and new elements should be tracked and cataloged. We talk more about metadata when we get further into our discussion of analytical data structure.

The World of Data

Data comes in many shapes and sizes, and different types of data are better suited to different types of analysis. Financial information is well suited for reporting, trending, and forecasting; operational data lends itself to statistical analysis; and marketing data to descriptive analytics. Let's review a few of the key data types you are likely to encounter.

Structured versus Unstructured Data

The degree of organization of the data dictates whether it is structured or unstructured. Structured data is easily organized into a table with columns, rows, and well-defined data types, and is typically found in relational databases, OLAP datasets, and other clearly organized tabular datasets.

Unstructured data lacks column-and-row organization and typically comprises text objects such as a blog, social media posts, or email messages. Techniques such as text mining are used to create organization from this type of information. Digital images are another form of unstructured data that can be analyzed for patterns, as found in diagnostic imaging.

Operational versus Analytical Data

Operational processes typically create rich flows of data in a company. Operational databases are structured and optimized for quickly inserting and retrieving individual records. For example, a high-volume point-of-sale system is optimized to get data into the database quickly so that the purchase transaction can be processed promptly without making the customer wait. Operational databases are highly relational in nature, with many long, thin tables. These table structures are designed to be highly efficient for transactions but are not easily accessed for analytics.

Analytical data structures are optimized to retrieve large volumes of data needed for analysis of trends and patterns. They typically require fewer joins to other tables and may have repeating values in them. These structures often have pre-aggregated information to allow for quick access to summarized information. Examples include data warehouses, OLAP cubes, and reporting data structures.

Data Objects

Analytical data is stored and accessed in data objects such as flat files, dimensional models, columnar databases, and OLAP cube structures. Each of these object types lends itself to a particular type of analysis or performance profile.

Flat files are big long files with many columns and rows. They usually have repeating values in the dataset to allow for quick calculations, summarizations, and filtering of information. They perform well for dashboard reporting and statistical analysis.

Dimensional models are relational data structures that are optimized for reporting systems. They typically have facts and dimensions that are organized to model the business function. The fact tables are organized to be long and thin with few repeating values. Facts are the quantitative measures in a business such as sales amount or unit quantity. Dimensions are wide with lots of columns and have repeating values. They are typically descriptive in nature, such as customer name, address, or product type.
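
To make the fact and dimension distinction concrete, here is a minimal, hypothetical star-schema sketch in pandas: a thin fact table of sales keyed to a wide, descriptive product dimension, joined to summarize sales by brand. The table and column names are illustrative only.

```python
import pandas as pd

# Thin, long fact table: quantitative measures plus foreign keys only.
fact_sales = pd.DataFrame({
    "transaction_id": ["45077-01", "45077-08"],
    "product_key":    [101, 202],
    "quantity":       [1, 1],
    "sales_amount":   [213.00, 42.00],
})

# Wide, descriptive dimension table with repeating attribute values.
dim_product = pd.DataFrame({
    "product_key":  [101, 202],
    "brand":        ["Nine West", "Nike"],
    "product_type": ["Blazer", "Shorts"],
})

# Resolve the foreign key to descriptive attributes and summarize by brand.
sales_by_brand = (
    fact_sales.merge(dim_product, on="product_key")
    .groupby("brand")["sales_amount"].sum()
)
print(sales_by_brand)
```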

OLAP cube objects are pre-aggregated datasets allowing the user the ability to slice-and-dice a large dimensional dataset quickly. The OLAP cube object is commonly used for financial information, allowing the user to view detail and aggregated information seamlessly.

Columnar databases are growing in popularity, especially for large datasets and big data technologies. Both relational and columnar databases use traditional query languages such as SQL, and both serve visualization tools well. However, by storing data in columns rather than in rows as a relational database does, a columnar database can answer a query faster by scanning only the columns it needs instead of a large set of entire rows.

Data Organization

Now that we have an understanding of the nature of data that we encounter, we turn to discussing the processes we need to implement to make the data useful to our analytical purposes.

Data Movement

The data that we would like to bring together for an analytic project is typically found in different places in many shapes and forms. If we are fortunate, most of it is housed in a well-maintained data warehouse where the data has been extracted from operational sources, cleansed, standardized, and quality checked. This process is typically referred to as ETL, an acronym for extract, transform, and load.

Granularity

Identifying the granularity of our data is one of the first efforts we undertake to understand the data prior to analyzing it. Granularity refers to the number of distinguishable values a field may contain. A high-granularity field contains many unique values, whereas a low-granularity field contains few. For example, in a table with three years of history, an Order Year field is low granularity, with only three possible values: 2013, 2014, and 2015. In the same table, an Order Time field with precision to the minute is much higher granularity, with up to 1,576,800 possible values in the three-year span.
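
A simple way to profile granularity is to count the distinct values in each field. The sketch below assumes pandas and a hypothetical orders table.

```python
import pandas as pd

# Hypothetical orders table used to profile granularity field by field.
orders = pd.DataFrame({
    "order_year":  [2013, 2014, 2015, 2015],
    "order_month": ["2013-01", "2014-06", "2015-03", "2015-07"],
    "order_id":    [1, 2, 3, 4],
})

# Fields with few distinct values are low granularity; many distinct values, high granularity.
print(orders.nunique().sort_values())
```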

The lowest level of dimensional field values, or combination of field values, that results in a single record is referred to as the grain of the data. It may be represented by a single field, the primary key, or a combination of key fields referred to as a composite primary key. Most fact tables have a time dimension in their grain, such as year, month, day, hour, or second.

Granularity limits what can be analyzed and what can be joined. For example, suppose we have a Customer Sales table that aggregates monthly sales at a customer segment level, such as High, Medium, or Low value. With this table, we cannot speak to the range of monthly revenue per customer within a segment because we have lost visibility into individual customers and can only identify them as a group.

Aggregated tables, while useful for delivering performance and simplicity of analysis, in many cases lead to frustration when diagnostic paths are blocked. It is difficult to predict in advance which data paths are required in each and every use case. As a result, the preferred solution is to attempt to stay as close as possible to the grain of the source data to the degree data storage space and processing power permit. Aggregated tables are useful as side tables to meet query response performance requirements for downstream reporting dashboards when detailed drill paths are not necessary.

Structuring for Analytics

If we have access to an analytical data mart or a schema on our corporate database, we typically structure our data in three layers, as illustrated in Figure 10.1. We discuss each layer in more detail later, but let's first review the structure at a high level. The bottom of the structure typically contains the least analytically complex data; it is usually associated with source operational systems or data warehouses. As we progress upward through the analytic and reporting layers, the transformations become increasingly complex and the tables more consolidated, although the granularity of the data may not decrease. That is, at the analytic and reporting table layers, we typically pre-join fact and dimension tables in order to improve performance for reporting and analysis tools but avoid aggregating unless absolutely necessary. From an analytic perspective it is generally preferable to retain the highest degree of granularity your database resources and storage space will permit.


Figure 10.1 Analytical Data Structure

The Base Layer

The base layer comprises the tables into which we load the data we extract from available data sources—operational systems, data warehouse, flat files, or external data sources such as cloud services. The goal of the base layer is to get the data from the various sources into one place. In order to maintain data integrity, auditing and quality checking processes should be put in place when moving data from source systems to our analytical structure. In addition, it is best to retain consistency with source field names, data types, and table structure at the base layer.

We might choose to employ some simple transformations at this level to de-normalize fact tables coming from the data warehouse. This means that we may pre-join our fact tables to some of the most common dimensional tables to replace foreign key values with human-understandable descriptive values. For example, we might replace numerical product ID 1234 with its descriptive element of “Blue Socks.” As mentioned earlier, de-normalization is less efficient from a data storage perspective, but it greatly facilitates the ability to comprehend the data.
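
Here is a hedged sketch of that base-layer de-normalization, assuming pandas and hypothetical keys and descriptions: the pre-join replaces the numeric product foreign key with its human-readable description.

```python
import pandas as pd

# Hypothetical fact and dimension tables for the base layer.
fact = pd.DataFrame({"order_id": [1, 2], "product_id": [1234, 5678], "amount": [12.00, 30.00]})
dim_product = pd.DataFrame({"product_id": [1234, 5678], "product_desc": ["Blue Socks", "Red Scarf"]})

# De-normalize: pre-join and drop the foreign key in favor of the description.
base_orders = fact.merge(dim_product, on="product_id", how="left").drop(columns="product_id")
print(base_orders)  # order_id, amount, product_desc ("Blue Socks", ...)
```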

The Analytic Layer

We refer to the second layer as the analytic layer, comprising the tables built from the base layer. These tables transform the data from the base layers into structures that support our analysis. Typical tables at this layer include measures enriched with a wide variety of dimensional attributes to analyze the behavior of the subject of analysis. There is usually some form of a customer table that collects attributes of the customer behavior, product or service types, descriptive geographic information, demographic information, analytical model scores, and so forth. Other common tables at this layer include Product, Billing, Customer Service, Orders, and Shipments tables to name a few. Each of these tables would be enriched as needed with dimensional attribute information to support the downstream analytics.

The Reporting Layer

In many cases, tables in the analytic layer provide the information sufficient to deliver the analysis needed. We can simply connect our analytic or data visualization tool directly to build our analysis. In other cases, the analysis may call for complex transformations that place a heavy performance burden on the front-end tool through which users access our analysis. In these cases, there is one more table type we need to develop in what we refer to as the reporting layer. Typical table types found at this level wrestle with issues of granularity such as fact-to-fact joins, hierarchical joins, cross-record comparisons, rolling 12-month averages, and year-to-date metrics.

Challenges

One of the biggest challenges when developing an analytic data mart is to determine which transformations or calculations to have the reporting tool perform and which ones to build into the analytic data mart ETL processes. Generally speaking, calculations performed at the reporting tool layer offer the greatest flexibility to provide dynamic analysis; however, there can be significant performance issues depending on the resources available. It is important to engage end-users in these decisions to determine the degree to which they are willing to trade off flexibility versus performance. Some users may place a high value on returning results quickly, whereas other users require dynamic ability to modify elements of the analysis more easily and are willing to wait a bit longer for results to return.

Regardless of whether we have access to an analytic data mart, a schema, or the internal data store, we inevitably produce data objects that fit into one of these categories. At the start of the analytic project, we plan adequate time to thoroughly understand the character of our data and determine the types of data objects we need to support our analytics.

Data Transformation

In addition to organizing our data for analytics, another critical process we need to undertake is transforming our source data to support the analytics we plan to conduct. Source data rarely arrives in exactly the form needed for analysis. Once we have ensured the correct extraction of the data, we apply several levels of transformations to ready it for analysis. Let's discuss the most basic types of transformations we encounter: ETL business rules, analytic and reporting business rules, and metric transformations.

ETL Business Rules Transformations

As we discussed earlier, ETL business rules apply transformations to the data to ensure conformity with corporate standards, such as standardizing country codes and state names. Other transformations commonly handled at the ETL level include removing extra spaces at the beginning or end of field values, converting to standard data types such as currency and time zone, and standardizing field names.
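
As an illustration of these ETL-level clean-ups, the sketch below (pandas assumed; column names and values are hypothetical) trims stray whitespace, standardizes a country code, and coerces fields to numeric and time-zone-aware types.

```python
import pandas as pd

# Hypothetical raw extract needing basic ETL clean-up.
raw = pd.DataFrame({
    "country":  ["  China", "CHN ", "USA"],
    "amount":   ["213.00", "158.0", "56"],
    "order_ts": ["2015-03-01 10:05", "2015-03-02 14:30", "2015-03-03 09:00"],
})

clean = raw.copy()
clean["country"] = clean["country"].str.strip().replace({"China": "CHN"})
clean["amount"] = pd.to_numeric(clean["amount"])
# Parse timestamps and localize them to UTC as a stand-in for time-zone standardization.
clean["order_ts"] = pd.to_datetime(clean["order_ts"], utc=True)
print(clean.dtypes)
```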

Analytic and Reporting Business Rules Transformations

In addition to data type and field transformations, we also transform our data to comply with business rules associated with our topic of analysis. It is important that we thoroughly understand the business rules governing our analysis prior to embarking on our analytic journey to ensure that we set out properly equipped.

For example, we may be analyzing the performance of a marketing campaign with a test and control group of customers. The test group received a marketing email promoting a program, whereas the control group did not. We want to know whether the test group was more likely than the control group to purchase the product as a result of the marketing campaign. Customers received the email throughout the month of March, and we have a monthly sales report of product purchases by customer from March through May. The agreed-upon business rule we will build into our data transformations is to attribute to the campaign any purchase made by a test-group customer within 60 days of receiving the email, on the assumption that the purchase was influenced by the campaign.
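
A minimal sketch of how that 60-day attribution rule might be applied, assuming pandas and hypothetical customer IDs and dates; control-group customers simply have no email date, so their purchases are never attributed.

```python
import pandas as pd

# Hypothetical email sends (test group) and purchases (test and control customers).
emails = pd.DataFrame({
    "customer_id": [598239, 598241],
    "email_date":  pd.to_datetime(["2015-03-05", "2015-03-20"]),
})
purchases = pd.DataFrame({
    "customer_id":   [598239, 598241, 598242],
    "purchase_date": pd.to_datetime(["2015-04-10", "2015-06-15", "2015-04-01"]),
})

attributed = purchases.merge(emails, on="customer_id", how="left")
days_since_email = (attributed["purchase_date"] - attributed["email_date"]).dt.days
# Purchases within 60 days of the email are flagged; missing email dates evaluate to False.
attributed["campaign_influenced"] = days_since_email.between(0, 60)
print(attributed)
```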

We need to examine the stipulated analytics and reporting business rules early in the process to ensure that we have the data we need to perform the metric transformations required by the project.

Metric Transformations

The data measures that come directly from our data sources will typically be straightforward counts or measures of quantity, size, or volume. In order to discern patterns, trends, or relationships, we apply mathematical transformations to our base measures to develop metrics that enable insight into relationships, movement, and distribution. These transformations provide the power to our analysis. Defining, calculating, testing, and interpreting the transformations needed for the analysis can represent one of the most time-consuming aspects of the analysis. The analytic and reporting business rules of the project determine the mathematical transforms we apply to the measures and metrics of our project.

Common mathematical transformations we discuss are:

  • Rate metrics
  • Distribution metrics
  • Ratio or index metrics
  • Compound metrics
  • Velocity metrics
  • Incremental or variance metrics
  • Statistical transformations
  • Algebraic transformations

Rate Metrics

The most common type of transformation involves calculating rates such as quantity per unit of time or size per unit of product. Price or cost per unit are commonly found rate metrics, as is production quantity per day.

\[ \text{Rate} = \frac{\text{Quantity}}{\text{Units of Time or Product}}, \qquad \text{e.g., } \text{Cost per Unit} = \frac{\text{Total Cost}}{\text{Number of Units}} \]

Distribution Metrics

Distribution metrics measure the relationship of the part-to-whole such as percent of total or cumulative percent. Service revenue as a percent of total revenue is one example.

\[ \text{Percent of Total} = \frac{\text{Part}}{\text{Whole}} \times 100, \qquad \text{e.g., } \frac{\text{Service Revenue}}{\text{Total Revenue}} \times 100 \]

Ratio or Index Metrics

Ratio or index metrics involve dividing one measure or metric by another, such as a price efficiency metric, which measures the relationship between an effective product price and a benchmark price. If the ratio is greater than 1, then the product is commanding a higher price in the market relative to the benchmark and conversely if it is lower than 1.

\[ \text{Price Efficiency Index} = \frac{\text{Effective Product Price}}{\text{Benchmark Price}} \]

Compound Metrics

Whereas rate metrics typically divide one measure by another, compound metrics sum or multiply one metric by another. Compound metrics typically found in the business environment include number of man-hours for a project, which is calculated by the number of people assigned to a project multiplied by the hours per person to be spent on the project. FTE (full-time equivalent) is a related metric that sums the number of hours each person on a project or team spends on the project and then divides by a benchmark number of hours representative of a single full-time employee.

\[ \text{Man-Hours} = \text{People Assigned} \times \text{Hours per Person}, \qquad \text{FTE} = \frac{\sum \text{Hours Worked}}{\text{Full-Time Benchmark Hours}} \]

Velocity Metrics

Velocity metrics capture the rate of change or movement of a metric from one period to the next. These metrics are typically represented as period-over-period growth rates. Annual revenue growth would be a commonly found value. Compound annual growth rate (CAGR) is another velocity metric commonly found in financial reporting. We cover this topic in the Data Science chapter (Chapter 9) as well.

\[ \text{Growth Rate} = \frac{\text{Value}_{t} - \text{Value}_{t-1}}{\text{Value}_{t-1}}, \qquad \text{CAGR} = \left( \frac{\text{Ending Value}}{\text{Beginning Value}} \right)^{1/n} - 1 \]

Incremental or Variance Metrics

Metrics that measure the difference between two measures are incremental metrics. Examples include variance of actuals to budget or the difference in sales between a group of customers offered a marketing promotion and a control group that did not receive the marketing offer or treatment. This type of incremental analysis is commonly referred to as lift analysis.

\[ \text{Variance} = \text{Actual} - \text{Budget}, \qquad \text{Lift} = \text{Test Group Sales} - \text{Control Group Sales} \]

Statistical Transformations

With statistical transformations, we start to move into the area of data science and data mining. There are numerous transformations developed for statistical analysis. Statistical metrics most commonly found in the business world include mean, median, standard deviation, standard error, z-score, r-squared, p-value, and t-value.

\[ z = \frac{x - \mu}{\sigma} \quad \text{(z-score, one example of the statistics listed above)} \]

Algebraic Transformations

Algebraic transformation uses more advanced forms of algebra or calculus, such as exponents, square roots, and logarithmic transformations. Net Present Value (NPV) is a typical algebraic transformation found in the business world. NPV represents the current value of a stream of future expected cash flows, taking into consideration the time value of money.

\[ \text{NPV} = \sum_{t=0}^{n} \frac{CF_t}{(1 + r)^t} \]
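
As a quick worked example, the sketch below implements the NPV formula above in Python with hypothetical cash flows: an initial outlay of 1,000 followed by three annual inflows, discounted at 10 percent.

```python
def npv(rate: float, cash_flows: list[float]) -> float:
    """Discount each period's cash flow back to present value and sum."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical project: -1,000 today, then 400, 500, and 600 over three years at a 10% rate.
print(round(npv(0.10, [-1000.0, 400.0, 500.0, 600.0]), 2))  # roughly 227.65
```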

Summary

In this chapter we have covered one of the more difficult but important aspects of analytics: ensuring that the data is of good quality, that we have the appropriate data structure for our analysis, that it is properly organized, and that we have applied the mathematical transformations needed to expose underlying trends and patterns of behavior. This stage of a project can be the most tedious, and it is tempting to cut corners and skip steps in order to get on with the more enjoyable work of designing charts and visualizations. However, we can be sure that missteps or skipped steps at this stage of the project will be regretted further downstream.
