Chapter 4
Business Analytics at the Analytical Level

As in all other levels in the BA model, we are experiencing—and can continuously expect—changes in the near future. The term artificial intelligence describes a machine's ability to perform intelligent behavior, such as decision‐making or speech recognition. Just think back to 1997, when the computer Deep Blue beat chess grandmaster Garry Kasparov through sheer brute‐force calculation. In 2011, another IBM computer became a Jeopardy champion through its ability to understand simple forms of human questioning. At the time, these feats surprised people; today, they belong to the areas where we accept that algorithms can outsmart humans.

Internet titans such as Facebook and Google are investing heavily in artificial intelligence to better recognize, understand, and serve their users with online offerings. Other examples include personal assistant devices and apps such as Alexa, Cortana, and Siri, Web search predictions, and movie suggestions on Netflix.

Another area where we can expect change within the next 20 years is quantum computing. We will leave the technical explanation to the experts, but the heart of the matter is that once we can master the stabilization of 300 atoms at the same time, a single computer will have more computational power than all of today's computers combined. Today's challenge, however, is to create an algorithm that can keep these atoms exactly in place.

To tackle the challenge of developing this algorithm, a scientist at Aarhus University tried brute computational force, but failed. Currently, scientists are using gamification as the answer. This means that anyone can go to a Web page and try, with his or her PC, to keep an atom in place; scores are earned depending on how good the player is. The scientists can then train the algorithms on only the highest‐scoring of these millions of attempts. This way, we can mix human creativity—the ability to discard the irrelevant data and gain deep insights—with the calculative force of computing.

On the analytics market today, we already see some of these trends manifested in commercial products. Robots can understand human speech and unstructured text. Usually their comprehension is based only on keywords, but some are starting to use natural language processing, meaning the ability to understand the context of the words, including the personality and mood of the human delivering the message. The trend, in other words, is toward a more humanized way of using decision support systems. We can write a message to our business intelligence (BI) system (e.g., “Give me the ten most costly production processes per produced unit”) and it will find the most likely answer to our question. Additionally, we can expect customer support to be digitalized: first to support the agents, later to allow for full customer self‐service.

With this brief introduction to future trends in computational power, we begin Chapter 4, which is concerned with what we can expect an advanced analytical organization to be able to do today. This chapter describes the third level in the business analytics (BA) model that constitutes the underlying principle of this book. Chapters 2 and 3 explained the kinds of information an organization typically asks for at a strategic level and the requirements for information this leads to at the department level.

In this chapter, we'll be taking a closer look at the different analytical methods that generate and deliver the required information and knowledge. We will not be discussing the technical aspects of a delivery, as this will be included in Chapter 5, which discusses the data warehouse; instead, we will focus entirely on which methods can generate which types of decision support for the business.

The purpose of this chapter is to create a basis for dialogue between the company and the analyst. The chapter represents a menu that provides the company with an overview of which types of information and knowledge they can ask for and equally provides the analyst with an understanding of how the dish (the analytical method) is prepared and from which ingredients (data). To support this process, we have included an outline for a specification of requirements, giving an overview of which issues need to be covered in the dialogue.

The size of the menu is always debatable; the chef thinks he or she offers plenty of choice, while customers want to see as many dishes as possible. We have chosen a menu size that corresponds to what anyone can reasonably expect an analyst to master method‐wise, or what an analyst would be able to learn during two to three weeks of training. Generally speaking, however, we assume that Microsoft Excel in its present form (2017) is just a spreadsheet (albeit one capable of providing basic statistics), and that a quantitative analyst needs a statistical program. In the current market, it is our opinion that the leading analytical software products are R, SAS, and IBM SPSS. Note, too, that both SAS and IBM offer short courses in the analytical methods that are introduced in this chapter.

This chapter breaks down different types of knowledge and information in a way that makes it possible for the company to formulate exactly what it wants from its BA function. Furthermore, we are listing the different analytical methods that can produce the required input to the company—that is, we are translating information requirements into “analyst language.” Note that readers of this book have access to the Web site BA‐support.com, which consists of a large number of statistical examples and an interactive statistics book. Both can provide guidance to analysts in the search for which methods to use under which circumstances.

The focus in this chapter is not on methods or statistics, but rather on demonstrating the connection between the BA function and the deliveries of information and knowledge that the BA function must subsequently produce. The business wants information and knowledge, while analysts conduct data mining and provide both statistics and tables based on data.

DATA, INFORMATION, AND KNOWLEDGE

In this book, we distinguish between the three concepts: data, information, and knowledge. This chapter in particular emphasizes this distinction, which is why we'll go through the terms briefly. Data is defined as the carrier of information. Data, as such, seldom delivers, line by line, fact by fact, or category by category, any value to the user. An example of a piece of data could be “bread” or “10.95.” Data is often too specific to be useful to us as decision support. It is a bit like reading through a data warehouse from A to Z, and then expecting to be able to answer every question. We are deep down at a detailed level, where we simply can't see the forest for the trees. Besides, data in a data warehouse is not structured in any single way that makes sense; rather, it could potentially make sense in many contexts.

Information is data that has been aggregated to a level where it makes sense for decision support in the shape of, for instance, reports, tables, or lists. An example of information could be that the sales of bread in the last three months have been respectively $18,000, $23,000, and $19,000. We can generate this information in the BA department and then deliver it to the person who is responsible for bread sales, and this person can then analyze the information, draw conclusions, and initiate the actions that are deemed relevant. When our deliveries consist of information, we are able to automate the process. This requires initial resources, but only once; it doesn't take a lot of analyst resources thereafter.

When we say that the BA department generates knowledge, this means that it delivers not just information to the user, but information that has been analyzed and interpreted. The BA department might, for example, offer some suggestions as to why bread sales have fluctuated in the last three months. Reasons could be seasonal fluctuations, campaigns, new distribution conditions, or competitors' initiatives. It is therefore not a question of handing the user a table, but of supplementing this table with a report or a presentation. This also means that when the BA department delivers knowledge, it is not the result of an automated process, as in connection with report generation, but rather of a process that requires analysts with quantitative methods and business insight.
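To make the distinction between data and information concrete, here is a minimal sketch (in Python with pandas; any reporting tool could do the same) of how raw transaction data such as "bread" and 10.95 is aggregated into information such as monthly bread sales. The column names and figures are purely illustrative.

```python
import pandas as pd

# Raw data: one row per sales transaction (the "carrier of information")
transactions = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "product": ["bread", "milk", "bread", "bread", "bread", "milk"],
    "amount":  [10.95, 8.50, 12.00, 9.95, 11.50, 8.50],
})

# Information: the same data aggregated to a level that supports a decision,
# e.g., monthly bread sales handed to the person responsible for bread
monthly_bread_sales = (
    transactions[transactions["product"] == "bread"]
    .groupby("month", sort=False)["amount"]
    .sum()
)
print(monthly_bread_sales)
```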

ANALYST'S ROLE IN THE BUSINESS ANALYTICS MODEL

Of course, organizations vary, but generally speaking there are certain requirements that we expect analysts to meet, and therefore certain competencies that must be represented if we want a smoothly running analytics function. We'll take a closer look at the implications of whether these competencies are covered by each individual analyst or covered jointly across the analytics team or the rest of the organization. What is important is that the competencies are present, since without them the BA function will not be able to link the technical side of the organization to the business side. We're talking about linking two completely different perspectives, as illustrated in Exhibit 4.1. Technicians have a tendency to perceive the organization as a large number of technologies that together constitute a systems structure through which data from the source systems moves. This perspective is incompatible with a business perspective, which sees the organization as a large number of value‐adding processes that ultimately deliver different types of services or products to its customers.


Exhibit 4.1 The Analyst's Role in the Business Analytics Model

The danger of assuming the technically oriented perspective is that operating and maintaining the company's technical systems structure might end up being an objective in itself. The consequence is what is called “a data warehouse with a life of its own,” independent of the rest of the organization's need for information. Symptoms include huge volumes of data of such poor quality or lack of relevance that they are useless to the organization. Such a situation means that the investment made by the organization is in fact merely a cost, as nothing valuable comes of it. Other symptoms might be that every time we want to enter new information in the data warehouse, we just can't, because the technical side is working on a project that the business has not asked for. The system is thus using all its resources on self‐maintenance and has none left for serving the business. Again, we've got an investment that is yielding no return.

At an operational level, symptoms include the delivery of front ends that are not user‐friendly. The front ends might show the required information, but not in a way that is practical from a user's point of view, because the analysts forgot to ask the users what they wanted and how this data or information could be fitted into how they work. If these symptoms are noted, we often find a general reluctance to use data warehouse information, too. It follows, then, that the business does not take the time to enter data in a thorough way, for example when salespeople meet with customers. The result is a further fragmentation of data and consequently a reluctance to use the data warehouse. Over time, an internal decision structure emerges that is not based on data warehouse information unless it is strictly necessary, and a large part of the argument for creating a data warehouse then disappears. When the decision was made to invest in a data warehouse, the purpose was to improve general decision behavior, which is the value‐adding element of a data warehouse. So, again, we see an investment that is yielding no return.

All this can happen when the company does not have an information strategy that clearly uses the data warehouse as a means of attaining business objectives. So, first and foremost, we need to ensure we've got a process, an information strategy, that ensures coordination and future planning between the business and the data warehouse. At the same time, we need to make some demands on the analyst's tool kit (i.e., we make some requirements regarding specific analytical competencies that are present in the company). The old rule applies here that a chain is only as strong as its weakest link. A company might possess perfect data material on the one hand and some clearly formulated requirements for information on the other, but the overall result will be only as good as the analysts are able to make it. Some companies invest millions per year in their data warehouse and yet hire analysts who are really only data managers or report developers, which means that they are unable to contribute any independent analytical input but can merely deliver reports, tables, or lists within a few days. The company therefore spends millions per year on technology and its maintenance and receives only a few reports, tables, and lists, because it never invested sufficiently in the people side of an information system.

THREE REQUIREMENTS THE ANALYST MUST MEET

Based on these premises, we can specify three clear requirements of our analysts, their competency center, or their performance of individual tasks:

  • Business competencies
  • Tool kit that is in order (method competencies)
  • Technical understanding (data competencies)

Business Competencies

First of all, the analyst must understand the business process he or she is supporting and how the delivered information or knowledge can make a value‐adding difference at a strategic level. When we talk about a strategic level in this context, it is also implied that we need analytical competencies: the analyst understands and is able to convey to the business the potential of using the information as a competitive parameter. This is essential if the BA function is to participate independently and proactively in value creation, and it is likewise essential if we are to talk about data as a strategic asset. The analyst needs to have or be given a fundamental business insight in relation to the deliveries that are to be made. This insight is necessary so that the analyst stands a chance of maximizing his or her value creation. The analyst must also be able to independently optimize the information or the knowledge in such a way that the user is given the best possible decision support. This also enables analysts to approach individual business process owners on an ongoing basis and present them with knowledge generated in other contexts. Finally, the analyst needs to be capable of having a continual dialogue with the business, as well as of detecting and creating synergies across functions.

Analysts must be able to see themselves in the bigger context as illustrated in the following story about the traveler and the two stonemasons. The traveler met one stonemason, asked him what he was doing, and got the reply that he was cutting stones, that each had to be 15 by 15 by 15, and that he had to deliver 300 stones a day. Later on, the traveler came across another stonemason and asked the same question, but here, the stonemason replied: “I am building the largest and most beautiful cathedral in all of the country, and through this cathedral, good tidings will be spread throughout the land.” In other words, analysts must be able to see their function in the broader picture, so they are not only performing a number of tasks, but are able to get the biggest possible value from the volume of information and knowledge they obtain and develop every single day.

Tool Kit Must Be in Order (Method Competencies)

An analyst's answer, regardless of the question, should never be simply, “I'll give you a table or a report.” Of course, a table can be the right solution at times, but tables can be enormous. It is therefore a reasonable requirement that the analyst be able to make suggestions about whether statistical testing is needed to show any correlations that might be present in the tables. The analyst might also be able to visualize the information in such a way that the user gets an overview of all the data material in the first place.

Moreover, the analyst must be able to deliver more than information in a model and take part in the analysis of this information to ensure that the relevant knowledge is obtained. Another important aspect of the analyst's role is ensuring that the users of the information derive the right knowledge from it. We cannot even begin to count the times we've been sitting at a presentation with bar charts of different heights, where people go for the red segment because it has the highest average score. In this context, it could be pointed out that it would have been extraordinary if the averages had been exactly the same. This brings up the question of how different the averages must be before we're allowed to conclude that there is a difference and therefore a basis for a new segment and new business initiatives. The problem is that the decision has not been subjected to quality assurance via a simple statistical test. Such a test answers the question: if we make a decision based on these figures, are we likely to draw the wrong conclusion? Note here that we are not proposing that the analyst be required to explain covariance matrices; that is not so important these days, when we have software for all the calculations. The requirement is rather that he or she have a basic knowledge of which test to use when and be able to draw the right conclusions from the test. As mentioned earlier in this chapter, this is knowledge that is communicated via two‐ to five‐day training courses run by leading suppliers of analytical software (not traditional business intelligence software).
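As an illustration of what such a simple statistical test could look like, here is a minimal sketch in Python comparing the average scores of two segments with a two‐sample t‐test. The segment names and numbers are invented; the point is only that the test tells us how likely a difference of this size is to arise by coincidence.

```python
from scipy import stats

# Scores observed in two segments (illustrative numbers only)
red_segment  = [7.2, 6.8, 7.5, 7.1, 6.9, 7.4, 7.0, 7.3]
blue_segment = [6.9, 7.1, 6.7, 7.2, 6.8, 7.0, 6.6, 7.1]

# Two-sample t-test: how likely is a difference in averages this large
# if the two segments in fact behave the same?
t_stat, p_value = stats.ttest_ind(red_segment, blue_segment)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: the difference is unlikely to be a coincidence")
else:
    print(f"p = {p_value:.3f}: the averages may differ by chance alone")
```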

Another problem that we often encounter among analysts is that they are reluctant to work with software that is new to them. This means that they have a tendency to define themselves as software programmers rather than as analysts. The bottom line is that we have approximately three to four vendors of relevant analytical software, offering ten to twelve software packages, and the key to finding the optimum combination is a continual search‐and‐learn process. It is not about what the individual analyst has by now become familiar with and is comfortable programming and clicking around in. It is worth pointing out that most analytical software packages across vendors work well together. If a company has a software package that does not integrate well with other software, the company should consider replacing it, because it can limit the analysts' capabilities. All software packages differ enormously on dimensions such as price, pricing method, user friendliness (crucial to how fast an analysis can be performed), integration with data sources, ability to solve specific problems, guarantees of future updates, ability to automate reports, technical support, ability to make presentable output, analytical support, and training courses. If in doubt, start by taking a course in any given software, and then decide whether it is worth buying.

Technical Understanding (Data Competencies)

The final requirement that we must make of analysts is that they have a basic understanding of how to retrieve and process data. Again, this is about how to structure processes: just as analysts sometimes have to draw on support from the business in connection with the creation of information and knowledge, they must also be able to draw directly on data warehouse competencies. If, for instance, an analyst needs new data in connection with a task, it's no good if he or she needs several days to figure out how Structured Query Language (SQL) works, what the different categories mean, or whether value‐added tax is included in the figures. We therefore need the data warehouse to have a support function where people understand their role in the BA value chain. At the same time, analysts spend about 80 percent of their time retrieving and presenting data, so we also have to place some clear demands on the analysts' own competencies in connection with data processing.

In conclusion, analysts need to master three professional competencies to be successful: business, method, and data. We can add to this certain key personal competencies: the ability to listen and to convince. These are necessary if a task is to be understood, discussed with all involved parties, and delivered in such a way that it makes a difference to business processes and thereby becomes potentially value‐adding.

All in all, it sounds as if we need a superhero. And that might not be far off, considering the fact that this is the analytical age. And, if we recognize information as a potential strategic asset, then this is another area in which we need to invest, both in the public education sector and in individual companies. Note, however, that these personal and professional skills do not necessarily need to be encompassed in a single person; they just need to be represented in the organization and linked when required. We will discuss this in more detail in Chapter 7, where we discuss BA in an organizational context.

REQUIRED COMPETENCIES FOR THE ANALYST

An analyst derives only a fraction of the potential knowledge if he or she fails to use the correct analytical methodology. Analysts can therefore generate considerable loss in value if they are the weak link in the process.

In Chapter 7 we will also discuss how to set up processes that make the analyst more efficient. For example, we once were given the task of developing an analytical factory for a large telecom provider. Our work reduced the average time it took to develop an analytical model from approximately two months to less than six hours (organizational sign‐off included).

Analytical Methods (Information Domains)

In the previous section, we discussed the analyst's role in the overall BA value chain, which stretches from collecting data in the technical part of the organization to delivering information or knowledge to the business‐oriented part of the organization. We outlined some requirements of the analytical function, one of which was that it must function as a bridge between the technical side and the business side of the organization and thereby form a value chain or a value‐creating process.

Another requirement is that the analytical function must possess methodical competencies to prevent loss of information. Loss of information occurs when the accessible data in a data warehouse could, if retrieved and analyzed in an optimum way, deliver business support of a certain quality, but fails to do so because this quality is compromised. Reasons for this might be the simple failure to collect the right information, which might, in turn, be due to lack of knowledge about the data or lack of understanding of how to retrieve it.

But errors might also be traced to the analyst not having the necessary tool kit in terms of methodology. When this is the case, the analyst derives only a fraction of the knowledge that is potentially there. If we therefore imagine that we have a number of analysts who are able to extract only 50 percent of the potential knowledge in the data warehouse in terms of business requirements, we have a corresponding loss from our data warehouse investment. When we made the decision to invest in a data warehouse based on our business case, we naturally assumed that we would obtain something close to the maximum knowledge. Instead, we end up getting only half the return on our investment. That means that the data warehouse investment in the business case should have been twice as big. If we look at the business case from this perspective, it might not have been a profitable decision to acquire a data warehouse, which means the investment should not have been made. Analysts can therefore generate considerable loss in value if they are the weak link in the process.

Therefore, in the following section we have prepared a list of methods that provide the BA department with a general knowledge of the methodological spectrum, as well as a guide to finding ways around it.

How to Select the Analytical Method

In Chapter 3, we performed a so‐called strategy mapping process (i.e., we presented a method where we started with some strategic objectives and ended up with some specific information requirements). Now we will pick up this thread. We will perform an information mapping process, where we start with some specific information requirements and proceed to identify which specific analytical techniques will deliver the required knowledge or the desired information.

The aim is to present a model that can be used in the dialogue between management, who wants information, and the analyst, who must deliver it. In the introduction to this chapter, we said that we would be delivering a menu. What we want to deliver here, too, are some key questions to ensure that the dialogue between analyst and recipient provides an overview of how this menu is designed to facilitate the right information being ordered. More specifically, this means that we divide potential BA deliveries into four information types (see Exhibit 4.2), deliver the questions that will help clarify which information types are the most relevant, and go through the four information types one by one. Concentrate on the type that is relevant.


Exhibit 4.2 The Three Imperatives in Connection with Choice of Methods and Information Mapping

In terms of perspective, we start with a business perspective and finish with an analytical perspective. We begin, for example, by requesting information about which customers will be leaving us in the next month, and finish, perhaps, with the answer that a neural network will be a good candidate in terms of selecting a method of delivering results. The business‐oriented reader who wants to understand more about scalability levels, say, can log on to BA‐support.com, where we have included an interactive statistics book, along with a number of examples and case studies, as well as contact details for the authors of this book.

The Three Imperatives

We obviously are not suggesting that the analyst read through this whole text every time he or she needs to determine which methods to use to deliver which information or which knowledge. The idea is that the analyst has read the text beforehand, and is able to implicitly draw from it in his or her dialogue with the business. The following three points can be useful in selecting the relevant method:

Question 1: Determine with the process owner whether quantitative analytical competencies or data manager and report developer competencies are required. Analytical competencies here mean statistical, exploratory data mining, and operations research skills, applied with the objective of generating knowledge and information. Data manager or report developer competencies refer to the ability to retrieve and present the right information in list or table form. Data manager or report developer competencies are therefore about retrieving and presenting the right information in the right way, without any kind of interpretation of this information via analytical techniques. One scenario might be that a number of graphs are generated in connection with delivery, providing a visualized overview of the information in the table but without any test to help the user prioritize this information. In other words, data managers or report developers deliver information and leave its interpretation to its users. Of course, there are examples of data managers or report developers who produce tables or reports and then prepare a business case based on this information. However, this does not make them quantitative analysts; rather, it's a case of wearing several hats. So, we are here talking about data manager or report developer competencies, and tasks within this domain are solved by wearing the controller hat, so to speak.

Analytical competencies are used if, for example, the user wants to find the answer to, “Is there a correlation between how much of a raise we give our employees and the risk of employees leaving the company within one year?” In this case, the data manager or report developer will be able to deliver only a table or report that groups employees according to the size of their pay increase and shows what percentage within each group have changed jobs. The analyst (with a statistical solution) will be able to say, “Yes, we can say with 99 percent certainty that there is a correlation.” The analyst is therefore creating not only information, but also knowledge.
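As a sketch of how the analyst could arrive at such a statement, the example below runs a chi‐square test of independence on a table of pay‐raise groups versus attrition. The counts are invented, and the “99 percent certainty” phrasing corresponds loosely to one minus the p‐value.

```python
from scipy.stats import chi2_contingency

# Rows: pay-raise groups (low, medium, high); columns: [left within a year, stayed]
# Counts are purely illustrative
observed = [
    [40, 160],   # low raise
    [25, 175],   # medium raise
    [10, 190],   # high raise
]

chi2, p_value, dof, expected = chi2_contingency(observed)
certainty = (1 - p_value) * 100
print(f"p-value = {p_value:.4f}")
print(f"Roughly {certainty:.0f}% certainty that raise size and attrition are related")
```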

If the user wanted answers to questions like, “Do any of our customers have needs that resemble each other? If so, what are those needs?” then the data manager or report developer would be faced with a big challenge. He or she must now prepare reports and tables showing everyone who bought product A as well as which other products they purchased, too. There is a similar reporting need for products B, C, and on through the last product. Detecting correlations can become a large and complex puzzle. And the interpretation therefore depends on the eye of the beholder. The analyst (explorative analytics) will, via cluster models, identify different customer groups that have comparable consumption patterns and then segment the customer base, based on the identified clusters.
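The following sketch illustrates the explorative route with a k‐means clustering of consumption patterns. The book does not prescribe a particular algorithm; the column names and the choice of three clusters are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Consumption pattern per customer (illustrative columns and figures)
customers = pd.DataFrame({
    "spend_product_a": [120, 5, 0, 110, 8, 2, 95, 0],
    "spend_product_b": [10, 80, 75, 15, 90, 70, 5, 85],
    "spend_product_c": [0, 5, 60, 2, 3, 55, 1, 50],
})

# Standardize so no single product dominates, then look for comparable groups
X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Each customer is assigned to a segment based on the identified clusters
customers["segment"] = kmeans.labels_
print(customers.groupby("segment").mean())
```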

If the user wanted an answer to a question like, “Which customers are going to leave us next month and why?” the data manager or report developer would deliver a large number of tables or reports that, based on the available information about customers, can deliver a percentage figure of how many customers stayed and how many discontinued their customer relations. The analyst (data mining analytics with target variables) will be able to deliver models describing the different customer segments who often discontinue their customer relations, as well as pinpointing which specific customers must be expected to leave the company next month.
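A minimal sketch of the target‐variable route could look as follows: a decision tree is trained on historical churn data and then used to score current customers. All columns and figures are hypothetical, and the choice of algorithm is ours for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Historical customers: behavioral variables plus the target "churned" (1 = left us)
history = pd.DataFrame({
    "months_as_customer": [3, 48, 12, 60, 6, 36, 2, 24],
    "support_calls":      [5, 0, 2, 1, 6, 0, 7, 1],
    "monthly_spend":      [20, 80, 45, 90, 15, 70, 10, 55],
    "churned":            [1, 0, 0, 0, 1, 0, 1, 0],
})

features = ["months_as_customer", "support_calls", "monthly_spend"]
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(history[features], history["churned"])

# Score current customers: estimated probability of leaving next month
current = pd.DataFrame({
    "months_as_customer": [4, 50],
    "support_calls":      [6, 0],
    "monthly_spend":      [18, 85],
})
current["churn_probability"] = model.predict_proba(current[features])[:, 1]
print(current.sort_values("churn_probability", ascending=False))
```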

Question 2: Determine whether hypothesis‐driven analytics or data‐driven analytics can be expected to render the best decision support. What we call hypothesis‐driven analytics could also be called the statistical method domain (note that descriptive statistics such as summations, means, minimums, maximums, or standard deviations are within the data manager domain), and its primary purpose is to create knowledge about correlations between different factors, such as age and purchasing tendencies or pay increase and job loyalty.

One of the problems in using traditional statistical tests is that 1 in 20 times a correlation will be found that does not actually exist. This is because we are working with a significance level of 5 percent, which in turn means that if we are 95 percent certain, we conclude that there is a correlation. In 1 in 20 tests between variables that have nothing to do with each other, we will therefore find a statistical correlation anyway, corresponding to the 5 percent. To minimize this phenomenon, a general rule is applied: to ensure the quality of the conclusions, the relationships we test must have theoretical relevance. Note also that these tests are performed only when we have a sample and want to show some general correlations in the population it describes. If we have the entire population, there is no reason to test whether men are earning more than women; that is simply a question of looking at the average figures in a standard report.

Data‐driven methods also have the purpose of creating knowledge about some general correlations, but are focused more strongly on creating models for specific decision support at the customer or subscriber level. The big difference between data mining and explorative analytics on the one hand, and hypothesis statistics on the other lies in how we conduct quality assurance testing on our results. Data mining is not theoretically driven like statistics; it is data driven. This means that data mining analysts will typically let the algorithms find the optimum model, without any major theoretical restrictions. The quality of the model then depends on how it performs on a data set that is set aside for this validation process.
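The following sketch illustrates this validation idea on synthetic data: the model is trained on one part of the data and judged by how well it predicts on the part that was set aside. The library calls and the synthetic data are ours for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a customer table with a target variable
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# Set part of the data aside; the model never sees it during training
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The quality of the model is judged on the held-out validation set
score = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
print(f"Validation AUC: {score:.2f}")
```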

To a certain extent, however, there is an overlap between some models, since we can conduct quality assurance on results by asking for theoretical relevance before even bothering to test the correlations. Similarly, we can develop models via a data‐driven process and then subsequently test whether the correlations shown by the models can be generalized in a broader sense, by examining how successful they are in making predictions on data sets other than the ones on which they were developed.

As explained earlier, the big difference between hypothesis analytics and data‐driven analytics is how quality assurance testing is conducted on their results. How do we know which route to take to reach our target? In the following section, we'll list a number of things to be aware of when choosing which route to take. Note here that it isn't important whether we choose one method or the other. Rather, the important thing is to generate the right information or the right knowledge for the company's subsequent decision making. Generally speaking, the target is the main thing, although we're here looking at the means.

If the aim is to generate knowledge to be used in a purely scientific context, the answer is unambiguously to adopt the hypothesis‐driven approach. It's not really a question of what gives the best results, but rather it's a question of completing the formalities to ensure that others with the same data and the same method can get the same results and can relate critically to these. This is possible when using statistical analytics, but not when using data mining analytics because they are based on sampling techniques. We will look at these in the section on data mining. If colleagues are to be able to re‐create the results in connection with the validation of generated knowledge at higher levels in the organization, the arguments for the hypothesis‐driven approach are very strong.

Hypothesis‐driven analytics are preferred if we just want to describe correlations between pairs of variables. It is simply a question of getting an answer to whether the correlations we find can be ascribed to coincidences in our sample or whether we can assume that they vary as described in our theory. Typical questions here could be:

  • Did a campaign have any effect? Yes or no?
  • Do men spend more than women?
  • Are sales bigger per salesperson in one state than in another?

Data‐driven analytics are typically preferred for tasks that are complex for different reasons: customer information, for example, is data that constantly changes, or there are large amounts of data and limited initial knowledge about correlations in the data material. This often creates a situation where analysts within a company are drowning in data, while the rest of the organization is thirsting for information and knowledge, since the analysts' speed of analysis simply cannot keep up with the need for knowledge based on ever‐changing, near‐real‐time data. Business environments increasingly find themselves in situations in which enormous amounts of customer information are accumulated, but they are finding it difficult to unlock this information in a way that adds value.

A classic example could be a campaign that has been prepared and sent to all customers. Some customers have accepted the offer, and others haven't. The questions now are:

  • What can we learn from the campaign, and how can we make sure that the next campaign offers something that the rest of our customers will be interested in?
  • We've got mountains of customer information lying about, but what part of this information contains the business‐critical knowledge that can teach us to send relevant campaigns to relevant customers?

Data‐driven analytics are relevant here, because we do not know which data we should be examining first. We obviously have some pretty good ideas about this, but no actual knowledge. We have another problem, which is that next month when we prepare our next campaign, we'll be none the wiser. Our customer information has been updated since last time, and the campaign is a different one.

It makes sense, too, to look at our internal competencies and analytical tools. If we look at the problem from a broader perspective, it is, of course, possible that we will not choose a data mining solution, because it might be an isolated exercise that will require relatively large investments.

If we have now decided that we need the hypothesis‐driven approach, we can proceed to the next section. Likewise, we can proceed to the next question if we feel confident that the data‐driven types of analytics are the right ones for us. If we are still not sure, because the knowledge we want to generate can be created both ways, we have a choice. We should consider which of the two requires fewer resources and is more accessible to the user. Note that most data‐mining tools can automate large parts of the process, so if we have an analysis that is going to be repeated many times, these tools can render some significant benefits. Equally, we could consider whether we can kill more birds with one stone. A data mart developed to identify which customers will leave, when, and why will also be useful in other contexts and will therefore render considerable time savings in connection with ad hoc tasks. Thus a simple question, such as which segments purchase which products, can be answered in as little as five minutes when reusing the data‐mining mart as a regular customer mart. The alternative response time would be hours, because it involves making the SQL from scratch, merging the information, and validating the results.

Question 3: Determine whether the data‐driven method has the objective of examining the correlation between one given dependent variable and a large number of other variables, or whether the objective is to identify different kinds of structures in data. If we begin by describing situations where we have a target variable, we would want to describe this variable via a model. We could be an insurance company that has collected data via test samples about which claims are fraudulent and which are true. Based on this information, we can train a model to understand when we have a fraudulent claim and when we don't. From that point forward, the model can systematically help us identify and follow up on past as well as future cases that are suspicious. We therefore have a target variable—“Was it fraudulent or not?”—and a number of other variables that we can use to build a model. These variables might describe factors such as which type of damage, under which circumstances, which types of people report them, whether there have been frequent claims, and so on.

A target variable might also be the right price of a house. If we are a mortgage lender, we can make a model based on historical prices that illustrates the correlations between the price of the house and factors such as location, size, when it was built, and so forth. This means we can ask our customers about these factors and calculate the value of the house and the derived security it constitutes for us as lenders, thus saving us sending a person out to evaluate it.
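A minimal sketch of such a pricing model, using an ordinary linear regression on invented historical sales, could look like this; the variables are stand‐ins for size, year built, and location, and all figures are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Historical sales: the target variable is the price (figures are invented)
houses = pd.DataFrame({
    "size_m2":    [70, 120, 95, 150, 60, 110],
    "year_built": [1965, 1998, 1985, 2010, 1950, 2005],
    "km_to_city": [12, 3, 8, 2, 20, 5],
    "price":      [180000, 420000, 290000, 550000, 140000, 460000],
})

features = ["size_m2", "year_built", "km_to_city"]
model = LinearRegression().fit(houses[features], houses["price"])

# Estimate the value (and derived security) of a new house without sending anyone out
new_house = pd.DataFrame({"size_m2": [100], "year_built": [1990], "km_to_city": [6]})
print(f"Estimated price: {model.predict(new_house)[0]:,.0f}")
```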

Another target variable might be customer satisfaction. If we send out a questionnaire to a large number of customers and then divide the customers into groups according to satisfaction level, we can make a model that combines satisfaction scores with our internal data warehouse information about the customers. We can then train the model to understand the correlations and, based on the model, we can score all the customers who did not complete the questionnaire. We then end up with an estimated satisfaction score, which we can use as a good substitute.

As opposed to data mining techniques that build on target variables, we now see a large number of analytical techniques that look for patterns in data. The techniques that we have included here are techniques for data reduction. These are typically used if we have a large number of variables, each carrying little information, and we want to reduce them to a smaller number of variables (without losing the information value) and interpret and isolate different kinds of information. For example, we might have a survey with 50 questions about our business, and we know that there are only three to five things that really matter to the customers. These techniques can then tell us how many factors actually mean something to our customers and what these factors are.
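As an illustration, the sketch below uses principal component analysis on simulated survey answers that are driven by three underlying factors; classical factor analysis would serve the same purpose, and all data here is simulated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulate 200 customers whose 50 answers are driven by 3 underlying factors plus noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))        # the "things that really matter"
loadings = rng.normal(size=(3, 50))        # how each factor colors each question
answers = factors @ loadings + rng.normal(scale=0.5, size=(200, 50))

# Principal component analysis: how many factors carry most of the information?
X = StandardScaler().fit_transform(answers)
pca = PCA(n_components=10).fit(X)

for i, share in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Component {i}: explains {share:.0%} of the variation")
```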

Cluster analysis can also divide customers into comparable groups based on patterns in data. We do not know beforehand how many homogeneous groups or clusters we've got, but the model can tell us this, along with their characteristics, and can also make a segmentation of our customers based on the model.

Cross‐sales and up‐sales models also look for patterns in data and can provide us with answers to questions about which products customers typically buy in combination and how their needs develop over time. They make use of many different types of more or less statistical algorithms, but are characterized by the fact that they do not learn the correlation between one single variable and a large number of others. As a supplement to these models, data mining models with target variables work well where the target variable describes those who have purchased a given product compared with those who haven't. The rest of the customer information is then used to profile the differences between the two groups.
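A very simple sketch of the pattern‐seeking idea behind cross‐sales models is to count which product pairs occur together in the same baskets; real implementations typically use association‐rule algorithms, and the baskets below are invented.

```python
from collections import Counter
from itertools import combinations

# One basket per customer visit (invented transactions)
baskets = [
    {"bread", "butter", "cheese"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "cheese"},
    {"beer", "chips", "bread"},
    {"butter", "cheese"},
]

# Count how often each product pair is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Which product combinations do customers typically buy together?
for pair, count in pair_counts.most_common(3):
    share = count / len(baskets)
    print(f"{pair[0]} + {pair[1]}: bought together in {share:.0%} of baskets")
```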

Following the discussion of the three imperatives that must be considered in order to identify which information domain to use in connection with the information strategy, we will now go through the general methods we've chosen to include. We want to emphasize once again that this is not a complete list of all existing methods, nor is this a book about statistics. What we are listing are the most frequently used methods in BA.

Descriptive Statistical Methods, Lists, and Reports

If you answered yes to data manager or report developer or controller competencies previously (see Exhibit 4.2, Question 1), this section will provide you with more detail.

Since popular terminology distinguishes between lists, which the sales department, for instance, uses to make their calls, and reports, which typically show some aggregated numeric information (averages, counts, shares, etc.), we have chosen to make the same distinction in our heading. Technically speaking, it doesn't make much difference whether the cells in the table consist of a long list of names or some calculated figures. In the following, we will simply refer to them all as reports, as an overall term for these types of deliveries.

We have chosen to define reporting in a BA context as “selection and presentation of information, which is left to the end user to interpret and act on.” From a statistical perspective, we call this descriptive statistics; information is merely presented, and no hypothesis tests or explorative analyses of data structures are performed.

This form of transfer of information to customers is by far the most common in companies because after a number of standard reports are established, they can be automated. Ad hoc projects are different because they require the investment of human resources in the process each time. Moreover, if we look at the typical definition of BA, “to ensure that the right users receive the right information at the right time,” this describes what we typically want to get from a technical BA solution in the short run. This also tells us about the most common purpose of having a technical data warehouse and a reporting solution (i.e., to collect information with a view to turning it into reports). We also control users' reading access to these reports. Finally, we ensure that reports are updated according to some rule (e.g., once a month). Alternatively, the reports might be conditional, which means that they are updated, and the users are advised of this, if certain conditions are met. These might be conditions such as a customer displaying a particular behavior, at which point the customer executive is informed of the behavior along with key figures. Alternatively, as is known in business activity monitoring (BAM), in cases where certain critical values are exceeded, the report on this process is then updated and the process owner is informed.

Ad Hoc Reports

Ad hoc reports are the type of delivery required by the customer if we have information that we need in connection with, for instance, a business case, or a suspicion or critical question that must be confirmed or denied. We might, for instance, have a suspicion that the public sector segment rejects certain products that we produce, and we therefore need a report on this particular problem.

The procedure for establishing this type of project is completely straightforward and is based on the recipient in the business, as a minimum, designing the table he or she requires. The advantage is that the recipient contemplates which information he or she needs and in which form. Will averages suffice, or are variance targets needed? Revenue might have to be broken down into categories from 0 to 100, 100 to 200, and above, and then we just need to know how many customers exist in each category. Besides, there might well be a number of considerations concerning the data on which to build the analysis. In connection with the above example, where we divide the customers into categories, we might consider whether to include semipublic institutions such as sport centers or independent institutions in our analysis. Also, does the analysis apply only to companies that are not in a dunning process and that have been active customers with us for the past two years? It might seem like a lengthy process, but this kind of requirement specification ensures that the first delivery from the analyst is correct. As most analysts will know, there are two kinds of internal customers: the ones for whom we can perform a task in one attempt, and the ones with whom we need to go through at least three attempts.
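As an illustration of the revenue example above, the sketch below breaks customers into the categories 0 to 100, 100 to 200, and above, and counts how many fall into each; the figures are invented.

```python
import pandas as pd

# Revenue per customer (illustrative figures)
revenue = pd.Series([45, 120, 310, 87, 150, 95, 220, 60, 175, 400])

# Break revenue into the agreed categories and count customers in each
categories = pd.cut(
    revenue,
    bins=[0, 100, 200, float("inf")],
    labels=["0-100", "100-200", "above 200"],
)
print(categories.value_counts().sort_index())
```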

One of the trends within ad hoc reporting today is to push it back out to the users. This has been tried in the past with limited success, since the traditional BI systems were too cumbersome to work with. Historically, when a creative discussion needed some facts, it would take days before those facts were available, because the local self‐service expert was busy doing something else and was considered too technical to be invited to a creative discussion. The end result was, of course, that the fact‐based element would not be embedded in creative discussions, but merely used to validate assumptions. Today's market‐leading systems allow for in‐memory calculations, meaning that if we manually decide to run a test, we get the results immediately, since the results have already been calculated and are ready to be presented from the memory of the server.

Another trend involves early attempts to make BI systems understand speech or simple written questions in order to set up a report. This is matched with an environment that makes it easier to dig deeper into data through drag‐and‐drop‐style user front ends, intuitive graphs, and simple data mining algorithms that can indicate to the user where there are more trends in the data to be explored.

With this trend of increasing user‐friendliness, we must expect that at least a large part of ad hoc reporting and semi‐deep data analysis will sooner or later move away from the analytical department. This also means that it increasingly becomes a task of the analytical department to promote an analytical culture in users' daily work. Moreover, the analytical department must teach users to work with the analytical self‐service system and set the threshold for when analytics becomes too complex and the department itself must take over.

Manually Updated Reports

Manually updated reports are normally used in connection with projects and therefore have a limited lifetime. This short‐term value makes it financially unviable to put these reports into regular production. Alternatively, the reports might come about because certain users do not have access to the company's reporting systems or simply can't make sense of them.

Other times, these reports are chosen as a solution because their requirements keep changing, or the dimensions change. Poor data quality might also be at the root of this: a table might need manual sorting every time, or the analyst might need to add some knowledge to it. Finally, there might be technical reasons why the business can't deliver anything apart from this type of report. It is not an unknown phenomenon, either, for analysts to have trained executives to expect reports handed over in person—for the sake of the attention!

Even though the reports are typically initiated on a project basis, they do have a tendency to become a regular delivery. When the business user has worked with the report, it's only natural that he or she sees useful purposes in this new perspective and requests that the report be delivered on an ongoing basis—say, once a month. In principle, this is fine; it simply confirms that the BA function is delivering useful information. However, there are other things to take into consideration.

It's a question of resources. An analyst's time is precious. The more time an analyst spends on preparing a report, the less he or she has for other projects. It is not uncommon for an analyst to be almost drowning in his or her own success. Specifically, this means that we have an analyst who uses all his or her time at work on updating standard reports, which he or she once created for the users. If we let this continue, two things will happen. First, we achieve no further development of the knowledge that the analyst could otherwise contribute. Second, the entire information flow in the company stops when the analyst changes jobs because he or she has had enough of all the routine tasks.

In a broader organizational context, this kind of ungoverned reporting inevitably brings about different reporting conditions and thereby different versions of the same truth. Some people in the organization will know more than others, and these people will exchange information, and the organization thus establishes different levels of knowledge. Another consequence of this kind of ungoverned reporting is that the investments that were made in an automated reporting system will become more or less superfluous.

The solution to this conflict between analysts and the people responsible for the automated reporting systems is not that the analysts refuse to prepare repeat reports, but that reports are continuously transferred to the automated systems. The analyst could receive a guarantee from those responsible for the automated processes that they will generate all standard reports. However, some reports are so complex that they cannot be fully automated. There might be estimated decisions in connection with forecasts that the analyst needs to weigh in on—as we know, there are no rules without exceptions. In any event, it could still be discussed whether the user of the report should be the one doing that calculation, supported as best as possible by the automated processes.

Automated Reports: On Demand

This type of report is typically delivered in connection with data warehouse implementations and is based on users having access to a multitude of information that is updated on a regular basis.

There are no routines in place, however, as to whether those who have access actually read the reports, which is what is meant by the expression on demand (only when the user requests it). Typically, the technical solution consists of an individualized user interface, controlled by the user's login, that ensures that the user views relevant information only and that any personal information (e.g., salary, illness) is not publicly accessible in the organization.

One of the advantages of most types of automated reports is that they are not static. Most of them are interactive, which means that the user can drill down into the details by breaking down a given report into further dimensions. If we have a report describing revenue in the different national regions, we can ask the report to break down sales into which stores sold for how much, or which product groups generated which revenue. When talking about interactive reports, we can more specifically say that we gain access to a multitude of data or a data domain (the revenue), which provides users with the opportunity to analyze via a number of dimensions (regions, stores, products, etc.). For details about dimensions, see Chapter 5. The visualization of reports is something we will typically get from most front‐end solutions, where a front end is the user interface to the technical solution. So we are not only getting table reports, but we can also visualize this information, which can be an extremely time‐saving function, for instance, in connection with reports that perform general monitoring of market trends over time. A graph typically gives a better overview of trends than does a series of numbers (see Chapter 5 for more).
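The drill‐down idea can be illustrated with a small sketch in which one revenue data domain is analyzed across the dimensions region, store, and product group; all names and figures are invented, and a real front end would of course do this interactively.

```python
import pandas as pd

# One shared data domain: revenue with its dimensions (invented figures)
sales = pd.DataFrame({
    "region":        ["North", "North", "North", "South", "South", "South"],
    "store":         ["N1", "N1", "N2", "S1", "S2", "S2"],
    "product_group": ["Dairy", "Bakery", "Dairy", "Bakery", "Dairy", "Bakery"],
    "revenue":       [1200, 800, 950, 1100, 700, 900],
})

# Top-level report: revenue per region
print(sales.groupby("region")["revenue"].sum())

# Drill down: which stores in each region sold for how much
print(sales.groupby(["region", "store"])["revenue"].sum())

# Or break the same domain down by product group instead
print(sales.groupby(["region", "product_group"])["revenue"].sum())
```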

Automated Reports: Event Driven

This type of report works like the on‐demand reports, with the one difference that it reminds the user when to read it. The event that triggers a report can be anything from the passing of a time interval to the fact that some critical values have been exceeded in the data. When it's a case of time intervals being exceeded, there is not much difference between this reporting form and the on‐demand reporting form, where we must assume that the report is read at regular intervals. In cases where certain critical values are exceeded, the report starts representing an alarm, too. If, in connection with production reports, for instance, we discover that more than 3 percent of the produced items have errors, the report will first of all sound the alarm to the production executive, giving him or her the opportunity to react quickly.

In continuation of the discussion of lag information in an information strategy, a useful application of this type of reporting is to investigate whether some of the established key performance indicators (KPIs) are over or under a critical level. Levels are often already defined in connection with KPI reporting, so that the technical solution that automates the reporting can add so‐called traffic lights or smileys showing whether a process is on track or not. The advantage of such a solution is that the report itself contacts its users when problems occur so that these can be solved at short notice, rather than users discovering the problems at the end of the month when the new figures are published.
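A minimal sketch of the traffic‐light logic could look as follows; the thresholds and the notification step are placeholders, and in practice the logic would live inside the reporting solution itself.

```python
def kpi_status(value, warning_level, critical_level):
    """Return a traffic-light status for a KPI where lower is better (e.g., error rate)."""
    if value >= critical_level:
        return "red"
    if value >= warning_level:
        return "yellow"
    return "green"

# Example: share of produced items with errors (critical level of 3 percent, as in the text)
error_rate = 0.034
status = kpi_status(error_rate, warning_level=0.02, critical_level=0.03)

if status != "green":
    # Placeholder for the actual alert (e-mail, dashboard flag, etc.)
    print(f"ALERT to production executive: error rate {error_rate:.1%} is {status}")
```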

Event‐driven reporting is thought to have a great future, a future in which relevant information presents itself to the individual user at the right time. In fact, that is something that we are able to do already to some extent. But the instance in which the underlying intelligence is specifying the right information at the right time will become much more refined, as described in Chapter 9, which covers pervasive BA.

Reports in General

In previous sections, we discussed the difference between lead and lag information, and pointed out that lag information will typically be distributed via reports. This means that it must be a requirement that an information strategy includes a set of reports that, via the measuring of critical business processes, is able to provide support for the chosen business strategy. This also means that the reports, taken together, cover the whole area while remaining mutually exclusive. Our processes will thus be monitored, and we will know precisely who is responsible for any corrective actions.

This means that we need the reports to reconcile with one another at higher as well as lower levels, as illustrated in Exhibit 4.3. If, for instance, we have a report describing monthly sales figures and a report showing daily sales figures, the two need to reconcile. This brings about a need for one central data warehouse that feeds both reports. It stands to reason that if one report is built on figures from the finance department and another is built from daily aggregated till reports, the two reports will never reconcile. It is therefore important to understand that we must choose one version of the truth when establishing a reporting system, even though we could easily define many. Equally, consistency is crucial when choosing the dimensions for generating the reports. If we break down the monthly reports into regions, we must break down the corresponding daily reports into the same regions.


Exhibit 4.3 Demands to the Reporting Are Hierarchically and Internally Aligned

HYPOTHESIS‐DRIVEN METHODS

When working with hypothesis‐driven methods, we use statistical tests to examine the relationship between two variables—let's say gender and lifetime. The result of the test is a number between 0 and 1, describing the risk of being wrong if we conclude, based on the data material, that there is a relationship between gender and lifetime. The rule is that if the value we find is under 0.05—that is, 5 percent—then the likelihood of being wrong is so small that we conclude there is a relationship. However, this also means that if we perform 20 tests between variables that have nothing to do with each other, then on average one of them will still come out as a statistically significant correlation (1/0.05 = 20). This is why it's a general requirement that we do not simply hold all sorts of variables up against each other, but start from some initial idea of the relationship. This doesn't change the fact, of course, that roughly every 20th test between two unrelated variables will show a statistically significant relationship anyway, but it does remove some of the incorrect knowledge we would otherwise be generating.
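The arithmetic above can be illustrated with a small simulation: testing 20 pairs of unrelated variables at the 5 percent level will, on average, produce about one "significant" result. The sketch below uses purely random data and makes no assumptions about any real data set.

```python
# Simulation of the multiple-testing point: 20 tests between unrelated
# variables at the 5 percent level yield roughly one false positive on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_tests, alpha = 1_000, 20, 0.05

false_positives = 0
for _ in range(n_tests):
    x = rng.normal(size=n_obs)          # two variables with no true relationship
    y = rng.normal(size=n_obs)
    _, p_value = stats.pearsonr(x, y)
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} of {n_tests} unrelated pairs came out 'significant'")
# Expected value is n_tests * alpha = 1, which is why a prior hypothesis matters.
```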

In a BA context, this means that if we want knowledge about our customers, we first have to go through a process of identifying which variables we want to include in the analysis, as well as which relations between the variables it makes sense to test. This is exemplified in Exhibit 4.4, where statistics in a BA context are typically about identifying the relevant data and testing for relevant correlations. Based on identified significant relationships between the variables, we can make an overall description as a conclusion on our analysis.


Exhibit 4.4 Illustration of Tests between Two Variables in Our Data Sets

Tests with Several Input Variables

There are tests that can handle several input variables at a time. The advantage of these tests is that they can reveal any synergies between the input variables. This is relevant if, for instance, a company is contemplating changing the pricing of a product and combining this change with a sales campaign. Both these steps are likely to have a positive effect on sales, but suppose there is a cumulative effect in undertaking the two initiatives at the same time. It is not enough, therefore, to carry out two tests: one that shows the correlation between price and sales of a product, and one that shows the correlation between campaign launch and sales of the same product. We also need to investigate a third dimension: the synergy between price reduction and campaign launch on the one hand and sales on the other.
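A sketch of how such a synergy could be tested is shown below, using an ordinary least squares model with an interaction term between price and campaign; the tiny data set and its column names are invented for illustration.

```python
# Sketch of testing for a synergy (interaction) between a price change and a
# campaign using ordinary least squares on invented data.
import pandas as pd
import statsmodels.formula.api as smf

# One row per period: the price, a 0/1 campaign flag, and the resulting sales
df = pd.DataFrame({
    "price":    [10, 10, 9, 9, 10, 9, 10, 9, 9, 10],
    "campaign": [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "sales":    [100, 130, 115, 170, 98, 165, 135, 112, 168, 102],
})

# price:campaign is the interaction term that captures the cumulative effect
model = smf.ols("sales ~ price + campaign + price:campaign", data=df).fit()
print(model.summary())   # a significant interaction coefficient indicates synergy
```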

Which test to choose depends on the dependent variable (the variable that we want to learn something about), which in the previous example is sales. In the field of statistics, we distinguish carefully between the scale types of the dependent variable, as this determines which method to use. If we are after an estimate (an interval dependent variable), this could be in connection with the need for knowledge about the correlation between the price of a house on the one hand and everything that determines this price on the other (how old it is, when it was last renovated, number of square meters, size of the overall property, insulation, and the like).

The variable we want to know something about is characterized by the fact that it makes sense to calculate its average—that is, to add and divide it. The most commonly used method in this context is called linear regression analysis, and it describes the correlation between an interval variable and a number of input variables. Forecasting techniques, which look for correlations over time, also typically belong in this category. Forecasting techniques are based on looking at the correlation between, say, sales over time and a large number of input variables, such as price level, our own and others' campaigns, product introductions, seasons, and so on. Based on this correlation, we can conclude which factors determine sales over time, whether there are any synergies between these factors, and how much of a delay there is before they take effect. If we are running a TV commercial, when do we see its effect on sales, and how long does the effect last? If we have this information, we can subsequently begin to plan our campaigns in such a way that we achieve maximum effect per invested marketing dollar.
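One hedged way to examine such delayed effects is to include lagged spend as additional input variables, as in the following sketch; the simulated weekly data, column names, and choice of two lags are assumptions, not a prescribed model.

```python
# Sketch of estimating how long a TV commercial keeps affecting sales by adding
# lagged spend as input variables; data is simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
weeks = 104
tv_spend = rng.gamma(shape=2.0, scale=50.0, size=weeks)
# Simulated sales in which the TV effect decays over roughly two weeks
sales = 500 + 0.8 * tv_spend + 0.4 * np.roll(tv_spend, 1) + rng.normal(0, 25, weeks)

df = pd.DataFrame({"sales": sales, "tv": tv_spend})
df["tv_lag1"] = df["tv"].shift(1)
df["tv_lag2"] = df["tv"].shift(2)

model = smf.ols("sales ~ tv + tv_lag1 + tv_lag2", data=df.dropna()).fit()
print(model.params)   # the lag coefficients show how quickly the effect fades
```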

Forecasting is thus used for two things: (1) to create projections of trends, and (2) to learn from historical correlations. Forecasting methods are therefore extremely valuable tools for optimizing processes, where we want to know, based on KPIs, how we can best improve our performance. Sales campaigns utilize these methods because companies need to measure which customers received their message. This is a well‐known approach for companies investing in TV commercials, which know only how many commercial slots they have bought and where, and want to measure any subsequent effect on sales. In addition, forecasting models play an important role in explaining the synergies among different advertising media, such as radio, TV, and billboards, so that we can find the optimum combination.

If we want to create profiles (binary dependent variables, meaning there are only two outcomes, e.g., “yes or no,” or “new customer profile or old customer profile”) using BA information, this might be a case of wanting a profile of the new customers we acquire relative to our old ones, or an analysis of which employees gave notice in the last year. What we want is to disclose which input variables contribute to describing the differences between Group A and Group B, where membership of Group A or Group B is the dependent variable. If we take the example of employees leaving the business in the last year, information such as age, gender, seniority, absence due to illness, and so forth might describe the difference between the two groups. In this context, the method that is typically used is a binary regression analysis.
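A minimal sketch of such a binary regression follows, using a logistic model on synthetic employee data; the columns (age, seniority, sick_days) and the simulated attrition pattern are assumptions and are not taken from any real case.

```python
# Sketch of a binary (logistic) regression profiling employees who left versus
# those who stayed, fitted on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2_000
hr = pd.DataFrame({
    "age":       rng.integers(20, 65, n),
    "seniority": rng.integers(0, 30, n),
    "sick_days": rng.poisson(6, n),
})
# Simulate leaving as more likely with many sick days and low seniority
logit_p = -2.0 + 0.15 * hr["sick_days"] - 0.05 * hr["seniority"]
hr["left"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.logit("left ~ age + seniority + sick_days", data=hr).fit(disp=False)
print(model.summary())
# Positive coefficients identify input variables that raise the odds of leaving,
# i.e. what separates Group A (leavers) from Group B (stayers).
```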

In some cases, we want to explain a ranking (an ordinal dependent variable), for example when we want to know more about satisfaction scores, which will typically be labeled something like “very happy,” “happy,” “neutral,” “unhappy,” or “very unhappy.” A ranked variable is therefore characterized by a given number of possible answers that can be ordered, but that we cannot average. Although many people code ranked variables from 1 to 5, it is statistically and methodically wrong to do so.

If, for instance, we want to understand which of our customers are very satisfied with our customer service, we could look at the correlation between gender, age, education, and history on the one hand and their satisfaction score on the other, using a method called ordinal regression analysis. A similar analysis is used if, in another example, we want to analyze our customer segments, provided these segments are value based and thereby rankable.
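The sketch below shows what such an ordinal regression could look like with statsmodels' OrderedModel (available in recent versions of the library); the synthetic satisfaction data, the category labels, and the input variables are assumptions.

```python
# Hedged sketch of an ordinal regression on ranked satisfaction scores,
# using synthetic data and statsmodels' OrderedModel.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(5)
n = 1_000
age = rng.integers(18, 80, n)
support_calls = rng.poisson(2, n)
latent = 0.02 * age - 0.5 * support_calls + rng.normal(0, 1, n)

levels = ["very unhappy", "unhappy", "neutral", "happy", "very happy"]
score = pd.qcut(latent, 5, labels=levels)          # ranked, but not interval scaled

model = OrderedModel(pd.Series(score),
                     pd.DataFrame({"age": age, "calls": support_calls}),
                     distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
# The model respects the ranking without pretending that the gaps between
# "unhappy" and "neutral" are equal, which a plain 1-to-5 coding would imply.
```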

Finally, if we want to understand something about groups that cannot be ordered, we use a nominal dependent variable. Maybe we have some regional differences or certain groups of employees that we want to understand better. We can't simply rank regions and say that Denmark ranks better than Norway, with Sweden third. One analysis could focus on the different characteristics of our customers in the Norwegian, Danish, and Swedish markets, where our input variables could be gender, age, education, and purchasing history. In this case, we would typically use a generalized linear model (GLM) analysis.
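As one hedged illustration, the sketch below fits a multinomial logit—a common choice for nominal targets and a close relative of the generalized linear models mentioned above—to synthetic customer data; the markets, column names, and data are invented.

```python
# Sketch of a model for a nominal target: which market (Denmark, Norway,
# Sweden) a customer belongs to, fitted on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 1_500
customers = pd.DataFrame({
    "age":       rng.integers(18, 80, n),
    "purchases": rng.poisson(5, n),
    "market":    rng.choice(["Denmark", "Norway", "Sweden"], size=n),
})

X = sm.add_constant(customers[["age", "purchases"]])
model = sm.MNLogit(customers["market"], X).fit(disp=False)
print(model.summary())
# One set of coefficients per market (relative to a baseline) describes how the
# groups differ, without imposing any ranking on the countries.
```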

DATA MINING WITH TARGET VARIABLES

Data mining reveals correlations and patterns in data by means of predictive techniques. These correlations and patterns are critical to decision‐making because they disclose areas for process improvement. By using data mining, the organization can, for instance, increase the profitability of its interaction with customers. Patterns that are found using data mining technology help the organization make better decisions. Data mining is a data‐driven process. A data mining project often takes several weeks to carry out, partly because we are often working with large volumes of data (both rows and columns) as input for the models, and partly because we usually want to automate the process so that it can be performed in a matter of hours the next time; this holds even though the growth in computing power has shortened the process considerably. To perform this task, it's essential to have specialized data mining software that is managed by analysts rather than conventional data warehouse people. We also recommend that companies choose a software vendor who offers courses in the use of their software. Course fees pay for themselves quickly when data mining is being performed for the first time.

Exhibit 4.5 shows the data mining process in three steps: (1) creating a number of models, (2) selecting the best model, and (3) using the selected model.


Exhibit 4.5 The Three Steps of a Data Mining Process

The City of Copenhagen Municipality needed to find out which employees had stress or long‐term absence due to illness. The first step, therefore, included collecting a large amount of historical information about absences due to illness, pay level, organizational level, labor agreement, and so forth. In addition, we had information about who had had a long‐term absence and at what time. We were therefore able to create two groups: (1) those who had not had a long absence due to illness, and (2) those who had had a long absence within that time. By means of a number of algorithms (neural network, decision trees, and binary regression analyses, used here in a data mining context), we came up with profiles of the differences between the two groups, and thereby characterized those employees who had a long absence due to illness. The result was that we had a number of models, and it was impossible to say that one would definitely be better than the other, since they had been developed in different ways.

The purpose of the second step (selecting the best model) is to identify which model will render the best results on an unknown data set. An “unknown” data set has the same characteristics as the original data set on which the model was developed. This ensures that the model we choose is not merely the best at describing the data set on which it was developed, but can be generalized and applied to other data sets. For the City of Copenhagen Municipality in Denmark, this ensured that the model was able not only to explain the historical absence due to illness in the data set on which it was based, but also to produce good results on current data sets and thereby deliver efficient predictions on the data it would be used on in the future. The way we performed the testing was to let each model predict whether each of the profiles in the unknown data set would enter into a long absence due to illness. We then compared these predictions with whether the employees actually did become ill, and we were therefore able to see which of the models was best at explaining the general tendencies underlying long‐term illness in the City of Copenhagen Municipality.
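A minimal sketch of this second step is shown below: several model types are trained on one part of the data and compared on a held‐out part, and the model with the best validation performance is kept. The synthetic data stands in for the (unavailable) illness data set, and the choice of AUC as the yardstick is an assumption.

```python
# Sketch of model selection: train several candidate models and pick the one
# that predicts best on data it has not seen.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, n_features=12, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

candidates = {
    "binary regression": LogisticRegression(max_iter=1_000),
    "decision tree":     DecisionTreeClassifier(max_depth=5, random_state=1),
    "neural network":    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1_000, random_state=1),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")   # the highest-scoring model is kept
```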

When we reach this stage in the process, we can begin interpreting the models. How much we are able to interpret varies a great deal with the type of algorithm; nevertheless, this step gives us the opportunity to generate knowledge about the problem at hand. (See Exhibit 4.6.) Later in this chapter, we'll look at which methods provide access to which types of knowledge.

Diagram: a time line in which the historical data set is compared with the actual (current) data set, with arrows pointing to the selected Model 3.

Exhibit 4.6 History Is Used to Make Predictions about Illness in the City of Copenhagen Municipality

The third step in the process is to create one‐to‐one information on which to act. In the case of the City of Copenhagen Municipality, we had the opportunity to work with a current data set, identical to the historical data set, only more recent, so we didn't know who would get ill. The model was able to identify employees with an increased risk of entering into a long absence due to illness in the coming period of time, and the model could therefore inform us about which specific employees managers should be extra aware of.

As illustrated in Exhibit 4.7, the result of a data mining process is that we create both knowledge and new information on which we can act. Depending on the chosen method, the data might be divided into several subgroups, and further subdivisions might arise. Describing the result as a new variable that is delivered to our users as a list is, however, an abstraction, because we often want to automate the information creation process by implementing the model in a data warehouse with a view to continuous scoring of data, with actions executed automatically based on the model.


Exhibit 4.7 Prediction Using the Data Mining Method

One example could be a solution that the authors made for a large telecom company. Based on information found in the customer relationship management (CRM) system of the company's call center, we found a number of clear correlations between the inquiries from corporate customers and whether they canceled their subscriptions shortly after. These inquiries could be about whether they could get a discount, or whether they could get a good deal on some new phones, as the old ones were getting quite worn. Such questions were clear indications that customers were shopping around, and that they should be contacted immediately by a person with great experience in corporate solutions. Consequently, we created a data mining model to identify the most important danger signals. Based on the model, an automated electronic service was generated that scanned the data warehouse of the call center every five minutes. If any “critical” calls were found in the logs from conversations with customers—for example, a customer asking for a good deal combined with the fact that the contract was about to expire—then the person responsible for this customer would automatically receive an e‐mail. With reference to the reporting section of this chapter, this is essentially an event‐driven report generated by a data mining algorithm.

What we see in the call center example is what those in process optimization call an externally executed action, which means that it is not us, but something else, that initiates a process—in this case, a critical call. With the ongoing accumulation of information about customers in a market that reacts ever faster, this is a trend we will probably see more of in the future. It already exists on a small scale in so‐called marketing automation programs: when a customer changes her surname, for example, it is assumed that she has gotten married, and she is therefore sent an offer for family insurance. Similarly, if a new address is registered on a store's Web site, certain programs will calculate where the nearest store is and automatically send out an e‐mail with this information. These are the beginnings of what is called pervasive business analytics, which is built on you and me receiving relevant information—based on our behavior and other information—at the moment it is assumed that we need it. We will discuss pervasive business analytics in detail in Chapter 9.

Text mining is not an area we will go into in this book. Text mining is still a somewhat exotic variant of analytics in which, for example, keywords are identified and related to events, which then adds insight for the analyst. In the near future, we expect text mining to be used to predict which answers should be given to which questions, enabling robots and self‐service systems to give the right responses. Currently, a variety of systems (based on simple keywords or on more complex language analysis) are in use to generate these relationships. We have, however, decided to keep text mining out of the scope of this book, since it is beyond what an advanced analyst can be expected to do today, and since it is unclear whether this analytical capability will fall under the domain of general analytics or become a more specialized field, as we know it today from image recognition, sound recognition, and other sensory recognition systems.

Data Mining Algorithms

In the field of statistics, a precedent has been established for choosing which specific statistical methods or algorithms to use based on the data we are holding in one hand and the conclusion we would like to be holding in the other. In data mining, there is a tendency to prefer the model that will render the best results on an unknown data set (i.e., the model that can generate the best new column for predictive purposes, as shown in Exhibit 4.7). In the following section, we will therefore go through the most popular techniques and group them according to the types of problems they can solve.

The purpose of data mining with a target variable will always be to explain this target variable. It is comparable to a statistical test, where we have a dependent variable and many explanatory ones. The only difference is terminology, where data mining uses the terms target variable and input variables. As with statistics, our target variables, and thus what we are trying to explain and predict, might be an estimate, a profile, a ranking, or a grouping.

The business problems covered by the four types of target variables have already been explained in the section of this chapter titled “Tests with Several Input Variables.” The most common techniques are neural networks and decision trees. Neural networks are characterized as being fast to work with. They do have the significant weakness, however, of being practically impossible to interpret and communicate because of their high level of complexity. Decision trees are easier to interpret and have the added advantage of being interactive, giving the analyst the scope for constant adaptation of the model, according to what he or she thinks will improve the results. Interactive decision trees can be compared with online analytical processing (OLAP) cubes or pivot tables, where the analyst can continue to drill deeper and deeper into the required dimensions. The analyst constantly receives decision support in the form of statistical information about the significant or incidental nature of the discovered differences. In our telecom case study at BA‐support.com, we give an example of how to use and read a decision tree.

Various kinds of regression analyses are used in data mining, too. The methods are typically developed in the same way as in the statistics field, but in a data mining context, the models are evaluated on an equal footing with decision trees and neural networks based on whether they are able to deliver efficient predictions in unknown data sets.

EXPLORATIVE METHODS

In BA, we typically see four types of explorative analyses. These are methods for data reduction, cluster analysis, cross‐sell models, and up‐sell models.

In connection with explorative models, we leave it to the algorithms to discover tendencies in the data material. The methods are therefore data driven, but there are no target variables that we want to model. Consequently, there is no way to conduct quality assurance on our models by testing them on unknown data sets. The quality assurance typically consists of the analysts evaluating whether the identified patterns make sense, which is the reverse of what we know from statistics, where the theory precedes the test.

Another way of assuring the quality of our models is, for example, to let the same algorithm make a model on another and similar data set and, if the algorithm comes up with the same model, we can presume that it is not a coincidence in the given data material in combination with the algorithm that gives the result. Alternatively, we could let two different algorithms analyze a data set and, if they produce comparable solutions, we could presume that it is the result of some underlying patterns in the data and not a coincidence in the interaction between the individual algorithm and the data set.

Data Reduction

The reason for performing data reduction might seem somewhat abstract, but data reduction does have its advantages, as we will show in the following section. In specific terms, we take all the information in a large number of variables and condense it into a smaller number of variables.

In the field of statistics, data reduction is used in connection with analyses of questionnaire information, where we've got a large number of questions that are actually disclosing information only about a smaller number of factors. Instead of a questionnaire with, say, 20 questions about all kinds of things, we can identify how many dimensions are of interest to our customers and then ask about only these. We could therefore move from measuring customer satisfaction using 20 variables to measuring only the five variables that most precisely express our customers' needs. These five new variables will also have the advantage of having no internal correlation. That is ideal input for a subsequent cluster analysis, where many variables sharing the same information (high correlation) affect the clustering model in a way that we do not want.

Data reduction is typically used when there are many variables that each contain only a little information that is relevant to what we need. Using this method, we can try to condense the information into a smaller number of variables, in the hope that the new variables contain a concentrate of the relevant information and that this makes a positive difference. The most popular method for data reduction is principal component analysis (PCA), which in practice is often used interchangeably with explorative factor analysis. Correspondence analysis is also quite commonly used.
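The sketch below illustrates the idea on synthetic questionnaire data: 20 correlated questions are condensed into 5 uncorrelated components with PCA; the data, the number of questions, and the choice of 5 components are assumptions.

```python
# Sketch of data reduction with principal component analysis on synthetic
# questionnaire data: 20 correlated items condensed into 5 components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
factors = rng.normal(size=(500, 5))                            # 5 underlying dimensions
loadings = rng.normal(size=(5, 20))
answers = factors @ loadings + rng.normal(0, 0.5, (500, 20))   # 20 observed questions

pca = PCA(n_components=5)
components = pca.fit_transform(StandardScaler().fit_transform(answers))
print(pca.explained_variance_ratio_.round(2))   # how much information each new variable keeps
# The 5 components are uncorrelated by construction, which makes them ideal
# input for a subsequent cluster analysis.
```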

Cluster Analysis

Other types of explorative analyses that are frequently used in BA are cluster analyses. Instead of working with a very large number of individual customers, we can produce an easy‐to‐grasp number of segments, or clusters, for observation. There are numerous methods for this, but they all basically use algorithms to group observations that are similar. In statistics, cluster analyses are typically used to investigate whether there are any natural groupings in the data, in which case analyses can be performed on the separate clusters; in data mining, the identified clusters are typically used as input if this improves the predictability of the model in which they are included. Finally, the purpose of the analysis might be the segmentation per se, as this gives us an indication of how we can make natural divisions of segments based on information about our customers' response and consumption.

In terms of the relationship between data reduction and cluster analyses, data reduction facilitates the process of reducing a large number of variables to a smaller number. The cluster analysis also simplifies data structures by reducing a large number of rows of individual customers to a smaller number of segments. For this exact reason, the two methods are often used in combination with questionnaires, where data reduction identifies the few dimensions that are of great significance, and the cluster analysis then divides the respondents into homogenous groups.
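A small sketch of that combination follows: respondents are clustered on a handful of uncorrelated component scores (such as those produced by the data reduction sketch above); the synthetic scores and the choice of four segments are assumptions.

```python
# Sketch of clustering respondents into segments on a few uncorrelated
# component scores; the scores are simulated for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
components = rng.normal(size=(500, 5))     # stand-in for 5 PCA scores per respondent

kmeans = KMeans(n_clusters=4, n_init=10, random_state=7)
segments = kmeans.fit_predict(components)
print(np.bincount(segments))               # how many respondents fall into each segment
```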

Cross‐Sell Models

Cross‐sell models are also known as basket analysis models. These models will show which products people typically buy together. For instance, if we find that people who buy red wine frequently buy cheese and crackers, too, it makes sense to place these products next to each other in the store. This type of model is also used in connection with combined offers. They are used, too, when a company places related pieces of information next to each other on its Web site, so that if a customer wants to look at cameras, he or she will find some offers on electronic storage media, too. Amazon.com is a case in point: If a user wants to look at a book, he or she will at the same time be presented with a large number of other relevant books. The other “relevant” books are selected on the basis of historical knowledge about which books other users have purchased in addition to the book the customer is looking at.
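The sketch below illustrates the underlying co‐occurrence idea with a few invented baskets; real cross‐sell engines would use association‐rule measures such as support, confidence, and lift rather than raw pair counts.

```python
# Minimal basket-analysis sketch: count how often product pairs appear in the
# same transaction; frequent pairs are candidates for combined offers.
from collections import Counter
from itertools import combinations

baskets = [
    {"red wine", "cheese", "crackers"},
    {"red wine", "cheese"},
    {"beer", "crisps"},
    {"red wine", "crackers"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)   # e.g. ('cheese', 'red wine') bought together twice
```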

Up‐Sell Models

Up‐sell models are used when a company wants to create more sales per customer by giving the individual customer the right offer at the right time. These models are based on the notion that a kind of consumption cycle exists. A time perspective has been added here: we are not looking at the contents of the shopping basket once; instead, we are looking at the contents of the shopping basket over time. If, for example, we find that people who at one point have had one kind of sofa will get another specific sofa at a later stage, we will want to promote the new type of sofa at suitable intervals after the first sofa has been purchased. In the software industry, the method is used to discover who will buy upgrades of software at an early stage. Based on this information, a vendor can endeavor to penetrate the market with new versions. Upselling is also a strategy to sell a more expensive or newer version of a product that the customer already has (or is buying), or to add extra features or add‐ons to that product. The BMW site enables users to configure their cars before purchasing. Users have the option of upgrading anything from the seats to the wheels for an additional cost, and they can immediately see what those upgrades would look like. Another example is Spotify, which offers a free account but encourages users to subscribe to its Premium account.

BUSINESS REQUIREMENTS

Imagine an analyst or a controller from a BA department who sits at a desk, looking a colleague from a business‐oriented function in the eye. The business user asks, “So what can you do for me?” There are, of course, plenty of good answers to this excellent question, as we shall see later in this chapter. One point that is crucial for the analyst to make, however, is, “You can deliver business requirements.”

Gathering business requirements is an interpreting and communicating task that is a substantial part of an analyst's tool kit. Furthermore, it's a “shelf product” in most large consultancy firms. The requirement for the analyst is to be able to understand and translate the business user's thoughts and needs into something that can be answered through analyst, data manager, or report developer competencies. The purpose is to deliver something that can be used by the business to improve processes, and that corresponds with the business strategy as well.

Producing business requirements requires a sound knowledge of business issues and processes as well as insight into the company's data warehouse and other IT infrastructure. As previously mentioned, one of the analyst's key competencies is to be able to build a bridge between business process and the technical environment. He or she has, so to speak, a foot in each camp, as illustrated in the BA model from Chapter 1. A good point of departure for the delivery of business requirements is a thorough interview in which the business user is interviewed by the analyst from the BA department. Business requirements can be built in many ways. In this book, we use a three‐tier structure, which includes definition of the overall problem, definition of delivery, and definition of content.

Definition of the Overall Problem

Based on this definition, the analyst must be able to place the specific task in a broader context, and thereby prioritize it in relation to other tasks. Consequently, the commissioner of the task must be able to explain for which business processes the given task will be adding value, if the task is to be prioritized based on a business case. Alternatively, the commissioner must relate the task to a strategic initiative. Otherwise, the analyst must be extremely careful in taking on the task, because if it's not adding value, and not related to the business strategy, he or she must question the justification of the task.

Definition of Delivery

The requesting party (recipient) must specify in which media the analysis must be delivered (HTML, PDF, Excel, PowerPoint, Word, etc.) and whether the delivery must include an explanation of the results, or whether these are self‐explanatory.

Deliveries that include automated processes, such as on‐demand reports or continuous lists of customers to call, need to have a clear agreement on roles and responsibilities. Who shall have access to the reports? To whom is the list to be sent in sales? Similarly, an owner of the reports must be specified, either as a function or a person, who is responsible for ensuring that the business requirements are based on the BA information, and who is to be notified of any errors, changes, or breakdowns in the automated delivery. The reason for this is that automated reports are not a static entity; changes might be made to the data foundation on which they are built, which means that they are no longer structured in the optimum way. In addition, the technology on which they are built might be phased out. Errors are inevitable over time, but if effective communication is in place, damage to business procedures can be prevented. The question is not whether an incorrect report will be delivered at some stage, because that is to be expected; the question is how efficiently we deal with the situation when it arises.

Other questions to clarify about delivery are the time of delivery (on demand), the circumstances under which the report is updated (event driven), and whether users are to be notified when it is updated.

Definition of Content

In connection with report solutions, the content part of a BA project is very concrete, since it is about designing the layout as well as defining the data foundation. As mentioned in the section about on‐demand reports, this type of report is not just a static document, but a dynamic data domain—it could be sales figures—that we want to break down into a number of dimensions, such as by salesperson, department, area, or product type. Users often have problems grasping this new functionality, so we have to remember to train them in how to work with the dynamic reporting tool, too.

Data quality is also a subject that needs discussing, since our data warehouse may contain imprecise or incorrect information that we have chosen to live with for various reasons. What is the acceptable level of accuracy? Can we live with a margin of error of 5 percent between the daily reporting and the monthly figures from the finance department? In cases where we don't have the desired information quality in our data warehouse, this question is essential, since it determines how many resources we must spend on sorting out our data before we dare to make decisions based on the derived reports. A company must know the quality of its data and either live with it or do something about it. Unfortunately, we often see that the data quality is known to be poor and the data warehouse is therefore left unused. That is an unfortunate waste of resources.
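As a small illustration of how such a tolerance could be monitored, the sketch below reconciles aggregated daily revenue against the monthly finance figure and flags months that breach an assumed 5 percent margin; the numbers and column names are invented.

```python
# Sketch of a reconciliation check: do aggregated daily figures stay within an
# agreed margin of error of the monthly finance figure?
import pandas as pd

daily = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-25"]),
    "revenue": [48_000, 50_500, 47_000, 52_000],
})
monthly_finance = pd.DataFrame({
    "month":   ["2024-01", "2024-02"],
    "revenue": [100_000, 92_000],
})

daily_totals = (daily.groupby(daily["date"].dt.to_period("M").astype(str))["revenue"]
                     .sum()
                     .rename("daily_revenue")
                     .reset_index()
                     .rename(columns={"date": "month"}))

check = monthly_finance.merge(daily_totals, on="month")
check["gap_pct"] = (check["daily_revenue"] - check["revenue"]).abs() / check["revenue"]
print(check[check["gap_pct"] > 0.05])   # months breaching the agreed 5 percent margin
```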

In Exhibit 4.8, we have given an example of how a company may use a mind map to collect the relevant information in connection with business requirements for the reporting solution from the radio station case study in Chapter 1.


Exhibit 4.8 Business Requirements for the Radio Station Case Study (Chapter 1), Visualized with a Mind Map

For tasks that, to a great extent, require quantitative analytical competencies, the business requirements specification will take the form of an ongoing dialogue. Unlike reports, where the user interprets the results, quantitative analytical tasks are interpreted by the analyst. Naturally, this means that unforeseen problems might be encountered in the process; these must be discussed and dealt with. The initial preparation of business requirements must state clearly who possesses the key business competencies for answering any questions, and must make sure that these people are available as needed.

Similarly, in connection with major projects, such as data mining solutions, certain subtargets should be agreed on to facilitate a continuous evaluation of whether the performance of the overall task is on track in relation to its potential value creation as well as whether more resources and competencies should be added.

SUMMARY

In this chapter, we have looked at the analyst's role in the BA model, which was defined in Chapter 1. The analyst is a bridge builder between the company and its technical environment. Purchasing BA software is not sufficient to secure successful BA initiatives; the company must take care to invest in the human aspect of its information system as well.

Generally speaking, the analyst possesses business insight, technical insight, and the ability to choose the correct methodical approach and presentation form. In other words, the analyst's tool kit must be in order. And there is a fourth item for the list of competencies: the analyst must possess the ability to deliver a business requirements document.

We have looked at the four information domains in the analyst's methodical field and suggested when it is beneficial to deploy each of them. The descriptive statistical information domain is the typical work area for the analyst, and it is usually presented to users in reports. We have discussed different forms of reports. It is characteristic of this method, however, that the individual viewer or business user is the person who interprets and transfers this information into knowledge. In other words, users themselves have to create knowledge from the information, which means that absolute knowledge is not being created in this information domain, but rather relative knowledge. These tasks are performed wearing the controller hat (data manager or report developer competencies).

The analyst creates absolute knowledge in the information domains: statistical tests, data mining, and explorative analytics. These analytical methods are used for creating knowledge about correlations between variables and identifying patterns in data in order to be able to predict, for instance, the scope for cross‐selling and up‐selling. The presented information from these methodical processes is not meant to be subsequently interpreted and transferred into knowledge by individual business users, because these analytical results are indisputable.
