Chapter 6
The Company's Collection of Source Data

Source data for business analytics (BA) has historically been created primarily by the company's operational systems—accounts entries are, for instance, created in the financial management system, and sales data (order data) is created through order pages on the company's Web site. Data quality is of the utmost importance here, because this is where data is created.

The trend is moving toward also exploiting data that is generated outside the company's operational systems. Product sensor data (on customers' consumption patterns) is increasingly being sent back from, for example, televisions, hearing devices, or apps. Additionally, social media data is used in many instances, for example, to understand market trends or how customers view the organization, as well as for human resources (HR) purposes. This chapter will answer the question: How does a business collect source data? We will go through typical data-generating systems in the business's immediate environment, and we'll also look at the difference between primary and secondary data, as well as external and internal analyses. We'll be looking at initiatives to improve the data quality of source systems. Finally, we'll present a way in which a business can prioritize from which source systems to collect project-related data.

An interesting observation is that primary data from source systems meets an information need for a particular target group in the business. When the same data then becomes secondary data in the data warehouse framework, it meets a different information need for a different target group.

This chapter—and the concept of source data—might seem less relevant as a topic for this book than the topics of the previous chapters. However, consider that we are about to use information as a strategic, innovative asset. This means we should know our strategy in terms of which competitive advantages we want to gain in the long run and which issues we want to overcome in the short run. We should also know which improved operational procedures and which improved decision support our potential data sources can give us. When future and present business needs are linked with potential and present data sources, we are able to see information as a strategic asset and lead our business with confidence into the future. The point is that strategies do not come out of nowhere. They are based on planning processes that are no better than the planners. If planners do not see the potential in source data, they cannot create strategies that take information into account. If an organization is headed by those who are outdated in their thinking, relying on the same old tricks as the industrial winners back in the 1990s, consider this: Can these leaders carry the organization through the analytical age to come?

WHAT ARE SOURCE SYSTEMS, AND WHAT CAN THEY BE USED FOR?

Source system is not an absolute label that applies to some systems and not others. When we use the term source systems, our starting point is a given data warehouse: source systems are the data sources on which that data warehouse is based. Many companies have several data warehouses that are more or less integrated, in such a way that the data warehouses can also function as source systems to each other.

When we talk about data-generating systems, we can, however, specify which systems create data for the first time and which don't. A checkout register is, for example, a data-generating system: when it scans products, it also generates data files, and these files in turn tell the store which products are leaving the store, at which time, and at which price. When the day is over, the customer has gone home, and the register is balanced, the store can choose to delete the data in the register—but we don't always want to do that, because this data can be used for many other things. When we choose to save the information, the data-generating system becomes a source system for one or several specific data warehouses. Based on this data warehouse information, we can carry out a large number of analyses and business initiatives (e.g., inventory management, supply chain management, earnings analyses, multi-purchase analyses).

New data is, in other words, not generated in a data warehouse. Data in a data warehouse comes from somewhere else, and is saved based on business rules and generated to meet the company's information requirements. Just as in the previous chapter, we have listed a number of source systems to give an impression of what source systems might be and how they can create value. Keep in mind that neither the list of source systems nor their value‐creating potential is exhaustive. Chapter 7, which looks at the organization of business intelligence (BI) competency centers, will provide more inspiration through ways in which to achieve strategic influence.

Some examples of data‐generating sources are:

  • Billing systems. These systems print bills to named customers. By analyzing this data, we can carry out behavior‐based segmentations, value‐based segmentations, and the like.
  • Social media data. This data can help take the temperature of individuals and groups. It can be very useful for staff working with corporate social relationship management, as it will give input on how the public and key influencers see the company. Text mining that analyzes which positive and negative words an organization is associated with is a good starting point for this kind of market surveillance.

    The marketing function can also use this data to see how well “above the media line” campaigns are received within days, rather than months. The technique is based on curves showing the historical relationship between social media attention and actual sales. By tracking where a campaign is on a curve, we will at a very early point be able to estimate the future campaign outcome.

    HR can use social media data for tracking candidates and how employees present the organization to the public.

    Social network analysis can also be used to understand who has an impact on whom in regard to public opinion of the company, with the purpose of targeting influencers.

  • Wikipedia data and similar databases. These databases can help intelligent robots that carry out customer dialogue to understand complex relationships; for example, a reference to Steve Jobs may have something to do with an Apple device.
  • Geo data. This data, combined with the location of an app user, can create the foundation for a series of new services: we can let users know that they are now close to one of our cafes, and that by showing this message they can get a special discount.

    In the world of banking, a notification to a customer could also be sent if a credit card is used more than 30 miles from the position of the phone, as people usually carry both their phones and credit cards with them at all times.

  • Internet of Things data. More and more devices can now transmit sensor data. This data focuses on how the devices are used, and could range from hearing devices to televisions. The data can then in turn be used for product innovation or service improvements.
  • Reminder systems. These systems send out reminders to customers who do not settle their bills on time. By analyzing this data, we can carry out credit scoring and treat our customers based on their payment records.
  • Debt collection systems. These systems send statuses on cases that have been transferred to external debt collectors. This data provides the information about which customers we do not wish to have any further dealings with, and which should therefore be removed from customer relationship management (CRM) campaigns until a settlement is reached.
  • CRM systems. These systems contain history about customer calls and conversations. This is key information about customers; it can provide input for analyses of complaint behavior and thus what the organization must do better. It can also provide information about which customers draw considerably on service resources and therefore represent less value. It is input for the optimization of customer management processes (see “Optimizing Existing Business Processes” in Chapter 3). It's used in connection with analyses of which customers have left and why.
  • Product and consumption information. This information can tell us something about which products and services are sold out over time. If we can put a name to individual customers, this information will closely resemble billing information, only without amounts. Even if we are unable to put a name to this information, it will still be valuable for multi‐purchase analyses, as explained in “The Product and Innovation Perspective” in Chapter 2.
  • Customer information. These are names, addresses, entry times, any cancellations, special contracts, segmentations, and so forth. This is basic information about our customers, for whom we want to collect all market information. This point was explained from the customer relations perspective in Chapter 3.
  • Business information. This is information such as industry codes, number of employees, or accounting figures. It is identical to customer information for companies operating in the business‐to‐business (B2B) market. This information can be purchased from a large number of data suppliers, such as Dun & Bradstreet, and is often used to set up sales calls.
  • Campaign history. Specifically, who received which campaigns when? This is essential information for marketing functions, since this information enables follow‐up on the efficiency of marketing initiatives. If our campaigns are targeted toward named customers, and we subsequently are able to see which customers change behavior after a given campaign, we are able to monitor our campaigns closely. If our campaigns are launched via mass media, we can measure effect and generate learning through statistical forecasting models. If this information is aggregated over more campaigns, we will learn which campaign elements are critical, and we will learn about overall market development as well.
  • Web logs. This is information about user behavior on the company's Web site. It can be used as a starting point to disclose the number of visitors and their way of navigating around the Web site. If the user is also logged in or accepts cookies, we can begin to analyze the development of the use of the Web site. If the customer has bought something from us, it constitutes CRM information in line with billing information.
  • Questionnaire analyses performed over time. If we have named users, this will be CRM information that our customers may also expect us to act on. Questionnaire surveys can be a two‐edged sword, however; if we ask our customers for feedback on our service functions, for instance, they will give us just that, expecting us to then adjust our services to their needs.
  • Human resources information about employees, their competencies, salaries, history, and so on. This information is to be used for the optimization of the people side of the organization. It can also be used to disclose who has many absences due to illness, and why. Which employees are proving difficult to retain? Which employees can be associated with success as evaluated by their managers? This information is generally highly underrated in large organizations and public enterprises in particular, which we will substantiate by pointing out that all organizations have this information and that the scarce resource for many organizations is their employees. Similarly, hour registration information can be considered HR‐related information. When hour registration information (consumption of resources) is combined with output information from, for instance, the enterprise resource planning (ERP) system, we can develop a number of productivity key performance indicators (KPIs).
  • Production information. This kind of information can be used to optimize production processes, stock control, procurement, and so on. It is central to production companies competing on operational excellence, as described in Chapter 2.
  • Accumulation of KPIs. These are used for monitoring processes in the present, but can later be used for the optimization of processes, since they reveal the correlations between activities and the resulting financial performance.
  • Data mining results. These results, which may be segmentations, added sales models, or loyalty segmentation, provide history when placed in a data warehouse. Just as with KPIs, this information can be used to create learning about causal relations across several campaigns and thus highlight market mechanisms in a broader context.
  • Information from ERP systems. This information includes accounting management systems in which entries are made about the organization's financial transactions for the use of accounting formats. It can be related to KPI information, if we want to disclose correlations between initiatives, and whether results were as expected.
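The geo-data idea mentioned above—notifying a customer when a credit card is used more than 30 miles from the phone's position—can be sketched with a great-circle distance check. This is a minimal illustration, not a production fraud rule; the coordinates and helper names are invented for the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def should_notify(phone_pos, card_pos, threshold_miles=30.0):
    """Flag a card transaction made farther from the phone than the threshold."""
    return haversine_miles(*phone_pos, *card_pos) > threshold_miles

# Phone in Copenhagen, card used in Aarhus (well over 30 miles away): flag it.
print(should_notify((55.676, 12.568), (56.157, 10.210)))  # True
```

In practice the threshold, and whether to notify at all, would be tuned against false-positive rates, since phones and cards do occasionally travel separately.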

WHICH INFORMATION IS BEST TO USE FOR WHICH TASK?

Now that we have the source information, the question becomes: How do we use which information? An efficient way of approaching this is to list all data-generating and data-storing systems that may contain information that could potentially create value for the project at hand. Each individual data source is then assessed along the following two dimensions:

  1. How useful is the information?
  2. How accessible is the information?

Sometimes we may find ourselves in situations where we decide to disregard relevant information if this information is too difficult to access. Similarly, we may have easily accessible information with only a marginal relevance to the task at hand. This way of prioritizing information is, for instance, used in data mining, particularly in connection with customer information, which may come from countless sources. For example: Say that we want to create a profile on a monthly basis of customers who leave us or cancel their subscriptions. Based on this profile, we wish to show who is in the group that is at high risk of canceling next month, and seek to retain these customers. In this case, call lists must be ready within, say, 40 days. This also means—due to time considerations alone—that all the data from the data‐generating source systems can't be part of the analyses, and we therefore have to prioritize.
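The two-dimensional assessment can be captured in a simple scoring sketch. The source names, scores, and cutoffs below are purely hypothetical; the point is only to show how usefulness and accessibility jointly determine which sources land in the selected "gray area" and which are deferred.

```python
# Hypothetical (usefulness, accessibility) scores on a 1-5 scale.
sources = {
    "billing system": (5, 4),
    "CRM call history": (4, 4),
    "raw web logs": (4, 1),        # useful, but hard to access in time
    "questionnaire archive": (2, 3),
}

def prioritize(sources, min_useful=3, min_accessible=3):
    """Keep sources that are both useful enough and accessible enough."""
    selected = {name for name, (u, a) in sources.items()
                if u >= min_useful and a >= min_accessible}
    deferred = set(sources) - selected
    return selected, deferred

selected, deferred = prioritize(sources)
print(sorted(selected))  # ['CRM call history', 'billing system']
```

If the project gains or loses time along the way, the cutoffs can simply be moved, which mirrors how the model in Exhibit 6.1 helps reprioritize.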

In Exhibit 6.1 we have placed the data sources that we choose to use in connection with the project in the gray area. We therefore have a clear overview of which data sources we have selected and which we have discarded. But the model gives cause for further deliberations. If, in the course of the project, we find that we have time to include additional data sources—or if we find that we are running out of time—the model can help us prioritize.


Exhibit 6.1 Model for the Prioritization of Data Sources in Connection with Specific BA Projects

The model also tells us how the project can be expected to develop over time in terms of data sources. It's worth repeating that in connection with BA projects we should think big, start small, and deliver fast. This model enables us to maintain the general overview while delivering results quickly. The general overview, however, could also include some deliberations about whether the business should include, for example, Web logs in its data warehouse in the future. Web logs contain useful information in relation to the given problem and possibly also to other problems, but they are inaccessible.

The model therefore repeats one of the arguments for having a data warehouse: It makes data accessible. In relation to Exhibit 6.2, this means that we move the circle toward the right if we make data more accessible. Or we could say that we're creating a new circle that is positioned further to the right, since we now have two ways of accessing the same information.


Exhibit 6.2 Loss of Information through Transformations

The model may also highlight the problem of loss of information in connection with data transformations. If data is not stored correctly in terms of user needs, information potentially loses value. For example, if we are an Internet-based company wishing to clarify how customers navigate our Web site, we can see this from the raw Web log. If we choose to save in our data warehouse only the information about which pages customers have viewed, we will be able to see only where the customers have been, not how they moved around between the pages. We have therefore stored information incorrectly in terms of our needs, losing information, with potential consequences for our business users.
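This loss of information can be made concrete with a toy session. Aggregating a raw click path into a set of visited pages preserves where the customer has been, but the back-and-forth navigation between pages is irrecoverably gone. The log entries below are made up for illustration.

```python
# A made-up raw web-log session: the ordered click path of one visitor.
raw_session = ["/home", "/products", "/home", "/products", "/checkout"]

# Storing only which pages were viewed (a set) discards the ordering...
pages_viewed = set(raw_session)

# ...whereas the raw sequence still shows every page-to-page transition,
# including the repeated switching between /home and /products.
transitions = list(zip(raw_session, raw_session[1:]))

print(sorted(pages_viewed))  # ['/checkout', '/home', '/products']
print(len(transitions))      # 4
```

Once only `pages_viewed` is stored, no later analysis can reconstruct `transitions`, which is exactly the transformation loss the model warns about.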

Finally, the model also repeats the advantages of combining data correctly because this enables us to obtain synergies. If we combine Web log information with master data, users' ages, gender, and any other information, we can carry out detailed studies for different groups of users—and thus segmentations—that mean we are getting even more value from our source data. This is also often referred to as one version of the truth, as opposed to the many versions of the truth that analysts create, when each in his or her own way combines data from a fragmented system landscape into reports (see Exhibit 6.3).

A diagram of a model for synergy through the combining of data with Availability on the horizontal axis, Usability on the vertical axis, and three open circles plotted in a box with two arrows from two circles pointing to one circle at the top right.

Exhibit 6.3 Synergy through the Combining of Data
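The synergy from combining data correctly amounts, mechanically, to a join on a shared customer key followed by segment-level breakdowns. The records and field names below are illustrative assumptions, not a real schema.

```python
# Hypothetical master data and web-log events, keyed by user id.
master = {101: {"age": 34, "gender": "F"},
          102: {"age": 52, "gender": "M"}}
weblog = [{"user": 101, "page": "/offers"},
          {"user": 102, "page": "/support"}]

# Enrich each log event with the user's master-data attributes (a join).
enriched = [{**event, **master[event["user"]]} for event in weblog]

# The combined data now supports segment-level views, e.g. pages per gender.
by_gender = {}
for row in enriched:
    by_gender.setdefault(row["gender"], []).append(row["page"])

print(by_gender)  # {'F': ['/offers'], 'M': ['/support']}
```

Doing this join once, in the data warehouse, is what gives every analyst the same "one version of the truth" instead of each analyst stitching the sources together differently.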

WHEN THERE IS MORE THAN ONE WAY TO GET THE JOB DONE

In large organizations we often see a two-tier BA function, namely the market analysts and the data warehouse analysts. The two groups are sometimes called, respectively, external analysts and internal analysts, based on their information sources. External analysts typically work with questionnaire analyses and interviews, in direct contact with customers. This is what we call primary data, which is data collected for a given purpose. If we want to know which customers are disloyal, let's ask them. One of the problems with this kind of analysis is that it is costly to send out questionnaires to an entire customer base every quarter. It's also a matter of potentially annoying our customers with the constant questioning. Note that external analysts also often purchase standard market reports from other companies, in which case we would refer to the external analysts as users of secondary data.

Internal analysts, who take their point of departure in internal data sources in the data warehouse, are also able to come up with suggestions as to which customers are loyal or disloyal. Their information comes from the previously mentioned data mining models to predict churn (see Chapters 3 and 4), where customers are profiled based on their tendency to break off their relationship with a business. Churn predictive models may take several months to develop and automate, but after they are completed, an in-depth analysis of which customers are expected to leave the business, when, and why can be made in a matter of hours. Moreover, these models have the advantage of providing answers from all customers, so to speak, unlike questionnaire analyses, which are often completed by no more than 20 to 30 percent of customers in three weeks' time.

So, which solution do we choose? At the end of the day, the important thing is which solution is going to be more profitable in the long run. Do the internal analysts have the information about customers that can describe why these customers canceled their relationship with us? If not, the analysis is not going to add much value. In connection with some tasks we, the authors, performed for a major telecom company, we sent private customers a questionnaire about their loyalty, the results of which were used to carry out a churn predictive model. Based on the questionnaire analysis, we divided the respondents into four groups, depending on the scores they gave themselves for loyalty. Similarly, we could categorize our data based on the percentage that describes the risk of losing a customer next month, which is one of the main results of a churn analysis. We divided the entire customer base into four segments according to risk score from the churn analysis, making each group the same relative size as the groups that came out of the questionnaire analysis. We then compared the efficiency of the two methods and were able to conclude that they were equally good at predicting which customers would leave next month. So the choice here was simple. Data mining provided a score for all customers within 24 hours without annoying any customers—and at considerably lower cost. In other situations, when we have a smaller customer base and do not have as much information about customers as a telecom company does, questionnaire analyses may be the better option. What's important here is that it's not a question of either/or, or necessarily of both/and, but of what is more profitable in the given situation.
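The segmentation step in that comparison can be sketched as follows: rank customers by churn-model risk score, then cut the ranking into groups whose relative sizes match the questionnaire segments. The customer ids, scores, and proportions below are invented for illustration.

```python
# Hypothetical churn-model risk scores per customer (higher = more at risk).
scores = {"a": 0.91, "b": 0.75, "c": 0.40, "d": 0.35, "e": 0.20, "f": 0.05}

# Relative sizes of the four loyalty groups from the questionnaire analysis.
proportions = [0.17, 0.17, 0.33, 0.33]

def segment_by_score(scores, proportions):
    """Split customers, ranked by risk score, into groups of matching sizes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    groups, start = [], 0
    for p in proportions:
        size = round(p * len(ranked))
        groups.append(ranked[start:start + size])
        start += size
    groups[-1].extend(ranked[start:])  # rounding remainder joins the last group
    return groups

groups = segment_by_score(scores, proportions)
print(groups)  # [['a'], ['b'], ['c', 'd'], ['e', 'f']]
```

With both methods producing like-sized groups, their hit rates on actual cancellations the following month can be compared directly, which is the comparison described above.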

Likewise, it's important to look at the two information sources as supplementing each other, rather than as competing with each other in the organization. What really matters is not where we get our information, but how we apply it. As an example, say that we're working with the implementation of a CRM strategy with the overall objective of increasing our average revenue per customer. In this case, it doesn't matter whether it's a so‐called basket analysis that constitutes the basis for decision or whether it's in‐depth interviews or questionnaire analyses that form the foundation of the added‐sales strategy we're implementing. What matters is that we make the right decisions in our cross‐sales strategy.

In some cases, having overlapping information can even be an advantage. In connection with data mining models, which look to predict which customers the company will lose when and why, an exit analysis that, on a monthly basis, asks a number of customers who have canceled their commitment why they left is able to systematically validate whether the statistical model is sufficient. If our interviewing tells us that a large number of customers are dissatisfied with the treatment they receive in our call center, then we know that our statistical model should include call center information, and we get hints about how the data should be cut for the prediction models. In this case, the external analysis function is thus able to support the analysis with a validation of the models.

WHEN THE QUALITY OF SOURCE DATA FAILS

In our discussion of data quality, we explained how organizations with high data quality use data as a valuable asset that ensures competitiveness, boosts efficiency, improves customer service, and drives profitability. Conversely, organizations with poor data quality spend much time reconciling contradictory reports and deviations from business plans (budgets), which leads to misguided decisions based on dated, inconsistent, and erroneous figures. There is, in other words, a strong business case for improved data quality. The question in this section is how organizations can work efficiently to improve the data quality in their source systems (when data is created).

Poor data quality in source systems often becomes evident in connection with profiling when data is combined in the data warehouse, and the trail leads from there to the source system. To improve data quality efficiently, we need to start at the source with validation. For instance, it should not be possible to enter information in the ERP system without selecting an account—it must be obligatory to fill in the account field. If this is not the case, mistakes will sometimes be made that compromise financial reporting. In terms of sales transactions, both customer number and customer name must be filled in. If these details are not registered, we can't know, for example, where to send the goods. Data quality can typically be improved significantly by making it obligatory to fill in important fields in the source systems. Business transactions should simply not go through unless all required fields are completed.
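Validation at the source, as described above, amounts to rejecting a transaction whenever a required field is empty. The sketch below illustrates the principle; the field names are examples, and a real source system would enforce this in its entry forms or database constraints rather than in application code like this.

```python
# Example required fields for a sales transaction (illustrative names).
REQUIRED_FIELDS = ("account", "customer_number", "customer_name")

def validate(transaction):
    """Reject the transaction unless every required field is filled in."""
    missing = [f for f in REQUIRED_FIELDS if not transaction.get(f)]
    if missing:
        raise ValueError(f"transaction rejected; missing fields: {missing}")
    return transaction

# A complete transaction passes through unchanged.
ok = validate({"account": "4010",
               "customer_number": "C-77",
               "customer_name": "Acme"})
```

The essential design choice is that the check happens when data is created: a transaction that fails validation never enters the source system, so the error never propagates to the data warehouse.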

Another well-known data quality problem arises when the same data is entered twice into one or more source systems. In many international organizations, customers are set up and maintained both in a source system in the local language and alphabet and in an English-language system. The first system can handle special letters such as the German ß or the special characters used in Scandinavia; the other can't. The solution is, of course, to design the systems so as to ensure that customer data can be entered and maintained in one place only.

The keys to improved data quality in source systems are to improve the company's validation procedures when data is created, and to hold firm to the principle that data is created and maintained in one place only.

SUMMARY

In this chapter we went through typical data-generating systems in the business's immediate environment and the difference between primary and secondary data, as well as external and internal analyses. We looked at initiatives to improve the data quality of source systems. Finally, we presented a way in which a business can prioritize which source systems to collect project-related data from.

We also explained that if we do not see the potential in source data, we will not be able to lead our business with confidence into the future using information as a strategic resource.
