Source data for business analytics (BA) has historically been created primarily by the company's operational systems: accounting entries, for instance, are created in the financial management system, and sales data (order data) is created through order pages on the company's Web site. It is at this point of creation that data quality is of the utmost importance.
The trend is moving toward also exploiting data that is generated outside the company's operational systems. Product sensor data (on customers' consumption patterns) is increasingly being sent back from, for example, televisions, hearing aids, or apps. Additionally, social media data is used in many instances, for example, to understand market trends or how customers and the public view the organization (for human resources [HR] purposes, among others). This chapter will answer the question: How does a business collect source data? We will go through typical data-generating systems in the business's immediate environment, and we will also look at the difference between primary and secondary data, as well as between external and internal analyses. We will look at initiatives to improve the data quality of source systems. Finally, we will present a way in which a business can prioritize which source systems to collect project-related data from.
An interesting observation is that primary data from source systems meets an information need for a particular target group in the business. When the same data then becomes secondary data in the data warehouse framework, it meets a different information need for a different target group.
This chapter, and the concept of source data, might seem less relevant as a topic for this book than the topics of the previous chapters. However, consider that we are about to use information as a strategic, innovative asset. This means we should know our strategy in terms of which competitive advantages we want to gain in the long run and which issues we want to overcome in the short run. We should also know what improved operational procedures and improved decision support our potential data sources can give us. When future and present business needs are linked with potential and present data sources, we are able to see information as a strategic asset and lead our business with confidence into the future. The point is that strategies do not come out of nowhere. They are based on planning processes that are no better than the planners. If planners do not see the potential in source data, they cannot create strategies that take information into account. If an organization is headed by people whose thinking is outdated, who rely on the same old tricks as the industrial winners of the 1990s, consider this: Can these leaders carry the organization through the analytical age to come?
Source system is not an absolute label that applies to some systems and not others. When we use the term source systems, our starting point is a given data warehouse: source systems are the data sources on which that data warehouse is based. Many companies have several data warehouses that are more or less integrated, in such a way that the data warehouses can also function as source systems to each other.
When we talk about data-generating systems, however, we can specify which systems create data for the first time and which do not. A checkout register is, for example, a data-generating system: when it scans products, it also generates data files, and these files in turn tell the store which products are leaving the store, at which times, and at which prices. When the day is over, the customer has gone home, and the register is balanced, the store can choose to delete the data in the register, but we do not always want to do that, because this data can be used for many other things. When we choose to save the information, the data-generating system becomes a source system for one or several specific data warehouses. Based on this data warehouse information, we can carry out a large number of analyses and business initiatives (e.g., inventory management, supply chain management, earnings analyses, multi-purchase analyses, etc.).
New data is, in other words, not generated in a data warehouse. Data in a data warehouse comes from somewhere else, and is saved based on business rules and generated to meet the company's information requirements. Just as in the previous chapter, we have listed a number of source systems to give an impression of what source systems might be and how they can create value. Keep in mind that neither the list of source systems nor their value‐creating potential is exhaustive. Chapter 7, which looks at the organization of business intelligence (BI) competency centers, will provide more inspiration through ways in which to achieve strategic influence.
Some examples of data‐generating sources are:
The marketing function can also use this data to see how well "above-the-line" media campaigns are received within days, rather than months. The technique is based on curves showing the historical relationship between social media attention and actual sales. By tracking where a campaign sits on such a curve, we will be able to estimate the future campaign outcome at a very early point.
HR can use social media data for tracking candidates and how employees present the organization to the public.
Social network analysis can also be used to understand who influences whom in regard to public opinion of the company, with the purpose of targeting those influencers.
In the world of banking, a notification to a customer could also be sent if a credit card is used more than 30 miles from the position of the phone, as people usually carry both their phones and credit cards with them at all times.
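Such a rule can be sketched in a few lines of Python using the great-circle (haversine) distance. The coordinates, function names, and the 30-mile threshold parameter below are our own illustrative assumptions, not a description of any bank's actual system:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in miles
    r = 3959.0  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def should_notify(card_pos, phone_pos, threshold_miles=30.0):
    # Flag the transaction if the card swipe and the phone are too far apart
    return haversine_miles(*card_pos, *phone_pos) > threshold_miles

# A card swipe in New York while the phone is in Boston is well over 30 miles
print(should_notify((40.71, -74.01), (42.36, -71.06)))  # True
```

The same distance check could of course be tuned per customer, since "usually carries the phone" holds more strongly for some customers than others.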
Now that we have the source information, the question becomes: How do we use which information? An efficient way of solving this problem is to list all data-generating and data-storing systems that may contain information that could potentially create value for the project at hand. Each individual data source is then assessed along two dimensions: how relevant its information is to the task at hand, and how accessible that information is.
Sometimes we may find ourselves in situations where we decide to disregard relevant information if this information is too difficult to access. Similarly, we may have easily accessible information with only a marginal relevance to the task at hand. This way of prioritizing information is, for instance, used in data mining, particularly in connection with customer information, which may come from countless sources. For example: Say that we want to create a profile on a monthly basis of customers who leave us or cancel their subscriptions. Based on this profile, we wish to show who is in the group that is at high risk of canceling next month, and seek to retain these customers. In this case, call lists must be ready within, say, 40 days. This also means—due to time considerations alone—that all the data from the data‐generating source systems can't be part of the analyses, and we therefore have to prioritize.
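This prioritization can be sketched in a few lines of Python. The source names, the 0-to-10 scores, and the threshold values below are all invented for the illustration:

```python
# Each candidate source is scored 0-10 on the two assessment dimensions:
# relevance of its information and accessibility of that information.
sources = {
    "billing_system":   {"relevance": 9, "accessibility": 8},
    "call_center_log":  {"relevance": 7, "accessibility": 4},
    "raw_web_log":      {"relevance": 6, "accessibility": 2},
    "purchased_survey": {"relevance": 3, "accessibility": 9},
}

def prioritize(sources, min_relevance=5, min_accessibility=3):
    # Keep only sources that clear both thresholds (the selected sources),
    # then rank the shortlist by combined score, best first.
    shortlist = {
        name: s for name, s in sources.items()
        if s["relevance"] >= min_relevance and s["accessibility"] >= min_accessibility
    }
    return sorted(shortlist,
                  key=lambda n: -(shortlist[n]["relevance"] + shortlist[n]["accessibility"]))

print(prioritize(sources))  # ['billing_system', 'call_center_log']
```

The hard-to-access web log and the marginally relevant survey fall outside the selection, exactly the trade-off described above; loosening either threshold pulls them back in when time allows.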
In Exhibit 6.1 we have placed the data sources that we choose to use in connection with the project in the gray area. We therefore have a clear overview of which data sources we have selected and which we have discarded. But the model gives cause for further deliberations. If, in the course of the project, we find that we have time to include additional data sources—or if we find that we are running out of time—the model can help us prioritize.
The model also tells us how the project can be expected to develop over time in terms of data sources. It's worth repeating that in connection with BA projects we should think big, start small, and deliver fast. This model enables us to maintain the general overview while delivering results quickly. The general overview, however, could also include some deliberations about whether the business should include, for example, Web logs in its data warehouse in the future. Web logs contain useful information in relation to the given problem and possibly also to other problems, but they are inaccessible.
The model therefore repeats one of the arguments for having a data warehouse: It makes data accessible. In relation to Exhibit 6.2, this means that we move the circle toward the right if we make data more accessible. Or we could say that we're creating a new circle that is positioned further to the right, since we now have two ways of accessing the same information.
The model may also highlight the problem of information loss in connection with data transformations. If data is not stored correctly in terms of user needs, the information potentially loses value. For example, if we are an Internet-based company wishing to clarify how customers navigate our Web site, we can see this from the raw Web log. If we choose to save in our data warehouse only the information about which pages customers have viewed, we will be able to see only where the customers have been, not how they moved between the pages. We have then stored information incorrectly in terms of our needs and have lost information, with potential consequences for our business users.
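The loss can be illustrated in a few lines of Python; the log entries and field names are made up for the example:

```python
# One visitor's raw web log preserves the click order; a warehouse table
# that stores only "pages viewed" does not.
raw_log = [
    {"ts": "10:01", "page": "/home"},
    {"ts": "10:02", "page": "/product/42"},
    {"ts": "10:03", "page": "/home"},
    {"ts": "10:04", "page": "/checkout"},
]

# The navigation path is recoverable from the raw log ...
path = [hit["page"] for hit in raw_log]
print(path)  # ['/home', '/product/42', '/home', '/checkout']

# ... but if we store only the distinct pages viewed, the "back to /home
# before checkout" behavior is lost, along with repeat-view counts.
pages_viewed = set(path)
print(sorted(pages_viewed))  # ['/checkout', '/home', '/product/42']
```

The set of pages tells us where the customer has been; only the ordered path tells us how the customer moved between pages.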
Finally, the model also repeats the advantages of combining data correctly, because this enables us to obtain synergies. If we combine Web log information with master data, such as users' age, gender, and other information, we can carry out detailed studies of different groups of users, and thus segmentations, which means we are getting even more value from our source data. This is also often referred to as one version of the truth, as opposed to the many versions of the truth that analysts create when each, in his or her own way, combines data from a fragmented system landscape into reports (see Exhibit 6.3).
In large organizations we often see a two-tier BA function: the market analysts and the data warehouse analysts. The two groups are sometimes called, respectively, external analysts and internal analysts, based on their information sources. External analysts typically work with questionnaire analyses and interviews involving direct contact with customers. This is what we call primary data: data collected for a given purpose. If we want to know which customers are disloyal, we ask them. One of the problems with this kind of analysis is that it is costly to send questionnaires to an entire customer base every quarter. There is also the matter of potentially annoying our customers with constant questioning. Note that external analysts also often purchase standard market reports from other companies, in which case we would refer to the external analysts as users of secondary data.
Internal analysts, who start from internal data sources in the data warehouse, are also able to come up with suggestions as to which customers are loyal or disloyal. Their information comes from the previously mentioned data mining models for predicting churn (see Chapters 3 and 4), in which customers are profiled based on their tendency to break off their relationship with a business. Churn prediction models may take several months to develop and automate, but once they are completed, an in-depth analysis of which customers can be expected to leave the business, when, and why can be made in a matter of hours. Moreover, these models have the advantage of providing answers for all customers, so to speak, unlike questionnaire analyses, which are often completed by no more than 20 to 30 percent of customers within three weeks.
So, which solution do we choose? At the end of the day, the important thing is which solution is going to be more profitable in the long run. Do the internal analysts have the information about customers that can describe why these customers canceled their relationship with us? If not, the analysis is not going to add much value. In connection with some tasks we, the authors, performed for a major telecom company, we sent private customers a questionnaire about their loyalty, the results of which were used alongside a churn prediction model. Based on the questionnaire analysis, we divided the respondents into four groups, depending on the scores they gave themselves for loyalty. Similarly, we could categorize our data based on the percentage that describes the risk of losing a customer next month, which is one of the main results of a churn analysis. We divided the entire customer base into four segments according to risk score from the churn analysis, and made the segments proportionally as large as the groups that came out of the questionnaire analysis. We then compared the efficiency of the two methods and were able to conclude that they were equally good at predicting which customers would leave next month. So the choice here was simple: data mining provided a score for all customers within 24 hours, without annoying any customers, and at considerably lower cost. In other situations, when we have a smaller customer base and do not have as much information about customers as a telecom company does, questionnaire analyses may be the better option. What's important here is that it is not a question of either/or, or necessarily of both/and, but of what is more profitable in the given situation.
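The segmentation step of that comparison can be sketched as follows. The risk scores and group proportions below are invented for the example and are not the telecom company's figures:

```python
# Each customer gets a churn risk score (probability of leaving next month);
# here we fabricate 100 customers with evenly spread scores.
churn_scores = {f"cust_{i}": i / 100 for i in range(100)}

# Relative sizes of the four loyalty groups from the questionnaire
# (most loyal first); they must sum to 1.
group_shares = [0.40, 0.30, 0.20, 0.10]

def segment(scores, shares):
    # Rank customers from lowest to highest risk, then cut the ranking
    # into slices whose relative sizes match the questionnaire groups.
    ranked = sorted(scores, key=scores.get)
    segments, start = [], 0
    for share in shares:
        size = round(share * len(ranked))
        segments.append(ranked[start:start + size])
        start += size
    return segments

segments = segment(churn_scores, group_shares)
print([len(s) for s in segments])  # [40, 30, 20, 10]
```

With both methods producing four groups of matching sizes, comparing their hit rates on next month's actual cancellations becomes a straightforward side-by-side count.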
Likewise, it's important to look at the two information sources as supplementing each other, rather than as competing with each other in the organization. What really matters is not where we get our information, but how we apply it. As an example, say that we're working with the implementation of a CRM strategy with the overall objective of increasing our average revenue per customer. In this case, it doesn't matter whether it's a so‐called basket analysis that constitutes the basis for decision or whether it's in‐depth interviews or questionnaire analyses that form the foundation of the added‐sales strategy we're implementing. What matters is that we make the right decisions in our cross‐sales strategy.
In some cases, having overlapping information can even be an advantage. In connection with data mining models, which seek to predict which customers the company will lose, when, and why, an exit analysis that interviews, on a monthly basis, a number of customers who have canceled their commitment is able to systematically validate whether the statistical model is adequate. If our interviewing tells us that a large number of customers are dissatisfied with the treatment they receive in our call center, then we know that our statistical model should include call center information, which also hints at how the data should be cut for the prediction models. In this case, the external analysis function is thus able to support the analysis with a validation of the models.
In our discussion of data quality, we explained how organizations with high data quality use data as a valuable asset that ensures competitiveness, boosts efficiency, improves customer service, and drives profitability. Conversely, organizations with poor data quality spend much time working with contradictory reports that deviate from business plans (budgets), which leads to misguided decisions based on dated, inconsistent, and erroneous figures. There is, in other words, a strong business case for improved data quality. The question in this section is how organizations can work efficiently to improve the data quality in their source systems, at the point where data is created.
Poor data quality in source systems often becomes evident in connection with data profiling, when data is combined in the data warehouse and the trail leads from there back to the source system. To improve data quality efficiently, we need to start at the source, with validation. For instance, it should not be possible to enter information in the ERP system without selecting an account; filling in the account field must be obligatory. If it is not, mistakes will sometimes be made that compromise financial reporting. For sales transactions, both customer number and customer name must be filled in; if these details are not registered, we cannot know, for example, where to send the goods. Data quality can typically be improved significantly by making it obligatory to fill in important fields in the source systems. Business transactions should simply not go through unless all required fields are completed.
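As a minimal sketch of such validation, assuming hypothetical field names rather than any particular ERP system's schema:

```python
# Required fields for a sales transaction; names are illustrative.
REQUIRED_FIELDS = {"account", "customer_number", "customer_name", "amount"}

def validate(transaction):
    # Return, in sorted order, the required fields that are missing or empty
    return sorted(
        f for f in REQUIRED_FIELDS
        if not str(transaction.get(f, "")).strip()
    )

def post_transaction(transaction, ledger):
    # The business transaction does not go through unless validation passes
    missing = validate(transaction)
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    ledger.append(transaction)

ledger = []
post_transaction({"account": "4010", "customer_number": "C-17",
                  "customer_name": "Acme", "amount": 99.0}, ledger)
print(len(ledger))  # 1
```

The essential design point is that the check runs where data is created, so an incomplete entry is rejected before it ever reaches the data warehouse.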
Another well-known data quality problem arises when the same data is entered twice into one or more source systems. In many international organizations, customers are set up and maintained both in a source system that uses the local language and alphabet and in an English-language system. The first system can handle special characters, such as the sharp s (ß) used in German or the special letters used in Scandinavia; the other cannot. The solution is, of course, to design the systems so that customer data can be entered and maintained in one place only.
The keys to improved data quality in source systems are to improve the company's validation procedures when data is created, and to hold firmly to the principle of creating and maintaining data in one place only.
In this chapter we went through typical data-generating systems in the business's immediate environment, the difference between primary and secondary data, and the distinction between external and internal analyses. We looked at initiatives to improve the data quality of source systems. Finally, we presented a way in which a business can prioritize which source systems to collect project-related data from.
We also explained that if we do not see the potential in source data, we will not be able to lead our business with confidence into the future using information as a strategic resource.