CHAPTER 7
Economic Impact

INTRODUCTION

The investment in big data and analytics entails various economic challenges, which we address in this chapter. First, we elaborate on the economic value of both technologies by zooming into the total cost of ownership (TCO) and return on investment (ROI). It will become clear that, in the current setting, it is difficult to quantify these two key measures accurately. Next, we review key economic considerations such as in-sourcing versus outsourcing, on-premise versus in the cloud, and open-source versus commercial software solutions. Obviously, selecting from among these options should be done with due diligence, given their impact on the TCO and ROI. The chapter concludes by giving some recommendations about how to improve the ROI by considering new sources of data, improving data quality, involving senior management, choosing the right organization format, and establishing cross-fertilization between business units.

ECONOMIC VALUE OF BIG DATA AND ANALYTICS

Total Cost of Ownership (TCO)

The total cost of ownership (TCO) of an analytical model refers to the cost of owning and operating the analytical model over its expected lifetime, from inception to retirement. It should consider both quantitative and qualitative costs and is a key input to make strategic decisions about how to optimally invest in analytics. The costs involved can be decomposed into acquisition costs, ownership and operation costs, and post-ownership costs, as illustrated with some examples in Table 7.1.

Table 7.1 Example Costs for Calculating Total Cost of Ownership (TCO)

Acquisition Costs
  • Software costs, including initial purchase, upgrade, intellectual property, and licensing fees
  • Hardware costs, including initial purchase price and maintenance
  • Network and security costs
  • Data costs, including costs for purchasing external data
  • Model developer costs such as salaries and training

Ownership and Operation Costs
  • Model migration and change management costs
  • Model setup costs
  • Model execution costs
  • Model monitoring costs
  • Support costs (troubleshooting, helpdesk,…)
  • Insurance costs
  • Model staffing costs such as salaries and training
  • Model upgrade costs
  • Model downtime costs

Post-Ownership Costs
  • De-installation and disposal costs
  • Replacement costs
  • Archiving costs

The goal of TCO analysis is to get a comprehensive view of all costs involved. From an economic perspective, this should also include the timing of the costs through proper discounting using, for example, the weighted average cost of capital (WACC) as the discount rate. Furthermore, it should help identify any potential hidden and/or sunk costs. In many analytical projects, the combined cost of hardware and software is subordinate to the human resources cost that comes with the development and usage of the models, such as training, employment, and management costs (Lismont et al. 2017). The high share of personnel cost can be attributed to three phenomena: an increase in the number of data scientists, a higher use of open-source tools (see below), and cheaper data storage and sharing solutions.
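
To make the role of discounting concrete, the following minimal Python sketch computes a discounted TCO for a hypothetical analytical model; the 10% WACC, the five-year horizon, and all cost figures are purely illustrative assumptions.

```python
# Minimal sketch: discounting yearly cost outlays with the WACC as discount rate.
# All figures (costs, 10% WACC, five-year horizon) are hypothetical.

wacc = 0.10  # weighted average cost of capital used as discount rate

# Year 0 holds the acquisition costs; years 1-4 hold ownership/operation costs;
# year 5 holds post-ownership costs (e.g., de-installation, archiving).
yearly_costs = [250_000, 80_000, 80_000, 90_000, 90_000, 30_000]

discounted_tco = sum(
    cost / (1 + wacc) ** year for year, cost in enumerate(yearly_costs)
)

print(f"Undiscounted TCO: {sum(yearly_costs):,.0f}")
print(f"Discounted TCO:   {discounted_tco:,.0f}")
```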

TCO analysis allows cost problems to be pinpointed before they become material. For example, the change management costs to migrate from a legacy model to a new analytical model are often largely underestimated. TCO analysis is a key input for strategic decisions such as vendor selection, in-sourcing versus outsourcing, on-premise versus in the cloud solutions, overall budgeting, and capital calculation. Note that when making these investment decisions, it is also very important to include the benefits in the analysis, since TCO only considers the cost perspective.

Return on Investment (ROI)

Return on investment (ROI) is defined as the ratio of the net benefits or net profits over the investment of resources that generated this return. The latter essentially comprises the total cost of ownership (see above) and all follow-up expenses such as costs of marketing campaigns, fraud handling, bad debt collection, and others. ROI analysis is an essential input to any financial investment decision; it offers a common firm-wide language to compare multiple investment opportunities and decide which one(s) to go for.
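
As a small worked example of this definition, the sketch below computes the ROI ratio for a hypothetical project; all figures are assumptions used only to illustrate the arithmetic.

```python
# Hypothetical example of the ROI ratio: net benefits over invested resources.
total_cost_of_ownership = 500_000   # assumed TCO of the analytical model
follow_up_costs = 150_000           # e.g., campaign, collection, fraud-handling costs
gross_benefits = 1_000_000          # assumed benefits generated by the model

investment = total_cost_of_ownership + follow_up_costs
net_benefit = gross_benefits - investment
roi = net_benefit / investment

print(f"ROI: {roi:.0%}")  # roughly 54% on these assumed figures
```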

For companies like Facebook, Amazon, Netflix, Uber, and Google, a positive ROI is obvious since they essentially thrive on data and analytics. Hence, they continuously invest in new analytical technologies, since even a small incremental insight can translate into competitive advantage and significant profits. The Netflix competition is a nice illustration of this: Netflix provided an anonymized dataset of user ratings for films and offered $1 million to any team of data scientists that could beat its own recommender system by at least 10%.

For traditional firms in the financial services, manufacturing, healthcare, and pharmaceutical sectors, among others, the ROI of big data and analytics may be less clear-cut and harder to determine. Although the cost component is usually not that difficult to approximate, the benefits are much harder to quantify precisely. One reason is that the benefits may be spread over time (short term, medium term, long term) and across the various business units of the organization. Analytical models offer the following benefits:

  • Increase sales (e.g., as a result of a response modeling or up/cross-selling campaign).
  • Reduce fraud losses (e.g., as a result of a fraud detection model).
  • Reduce credit defaults (e.g., as a result of a credit scoring model).
  • Identify new customer needs and opportunities (e.g., as a result of a customer segmentation model).
  • Automate or enhance human decision making (e.g., as a result of a recommender system).
  • Develop new data-driven services and business models (e.g., data poolers that gather data and sell the results of analyses).

When human decision making is fully automated, the resulting savings on current and future staffing can be quantified fairly directly. However, when analytics merely enhances human performance, the benefits are less tangible and thus harder to quantify. In fact, many analytical models yield intangible benefits that are substantial yet hard to include in an ROI analysis. Think about social networks: analytically modeling word-of-mouth effects (e.g., in a churn or response setting) can have material economic impact, but the precise value thereof is hard to pin down. The benefits may also be spread across multiple products, channels, and periods of time. Think about a response model for mortgages. Successfully attracting a mortgage customer could create cross-selling effects toward other bank products (e.g., checking accounts, credit cards, insurance). Furthermore, since a mortgage is a long-term engagement, the relationship may deepen over time, thereby contributing to the customer's lifetime value (CLV). Disentangling all these profit contributions is clearly challenging, which complicates calculating the ROI of the original mortgage response model.

A vast majority of implementations of big data and analytics have reported significant returns. A 2014 study by Nucleus Research1 found that organizations obtained a return of $13.01 for every dollar invested, up from $10.66 in 2011. PredictiveanalyticsToday.com2 conducted a poll from February 2015 to March 2015 with 96 valid responses; the results are displayed in Figure 7.1. From the pie chart, it can be concluded that only a minority (10%) reported no ROI of big data and analytics. Other studies have also reported strong positive returns, although the ranges typically vary.

Figure 7.1 ROI of big data and analytics (PredictiveanalyticsToday.com 2015).

Critical voices have been heard as well, some even questioning whether investments in big data and analytics yield positive returns. The reasons often boil down to the lack of good-quality data, management support, and a company-wide data-driven decision culture, as we discuss in a later section.

Profit-Driven Business Analytics

ROI was defined in the previous section as the ratio of the net benefits or net profits over the investment of resources that generated this return. The investment of resources essentially equals the total cost of ownership discussed earlier in this chapter. To increase ROI, opportunities indeed exist to lower the TCO, for example by outsourcing or by adopting cloud-based or open-source solutions, as discussed in detail in the following sections. In our view, however, the largest potential for boosting ROI lies in tuning the analytical models themselves to maximize the net benefits or net profits. Providing approaches and guidance for doing so is exactly the main objective of this book. As explicitly stated in the introductory chapter, we aim to facilitate and support the adoption of a profit-driven perspective toward the use of analytics in business. To this end, a range of value-centric approaches have been discussed that explicitly account for and maximize the profitability of the eventual analytical model throughout the subsequent steps of the analytics process model introduced in Chapter 1. Several illustrative examples have been provided in this book, and many more on the accompanying website,3 showing how these profit-driven approaches can be applied in real-life cases and how the resulting profit differs from that of the standard analytical approaches currently in use.

KEY ECONOMIC CONSIDERATIONS

In-Sourcing versus Outsourcing

The growing interest in and need for big data and analytics, combined with the shortage of skilled talent and data scientists in Western Europe and the United States, has raised the question of whether to outsource analytical activities. This need is further amplified by competitive pressure to reduce time to market and lower costs. Companies can choose to in-source, building the analytical skillset internally at the corporate or business-line level; to outsource all analytical activities; or to go for an intermediate solution whereby only part of the analytical activities is outsourced. The dominant players in the analytics outsourcing market are India, China, and Eastern Europe, with some other countries (e.g., the Philippines, Russia, South Africa) gaining ground as well.

Various analytical activities can be considered for outsourcing, ranging from the heavy-lifting grunt work (e.g., data collection, cleaning, and preprocessing), the setup of analytical platforms (hardware and software), and training and education, to the more complex analytical model development, visualization, evaluation, monitoring, and maintenance. Companies may choose to grow conservatively and outsource analytical activities step by step, or immediately go for the full package of analytical services. It goes without saying that the latter strategy is inherently riskier and should thus be evaluated more carefully and critically.

Despite the benefits of outsourcing analytics, it should be approached with a clear strategic vision and critical reflection, with awareness of all the risks involved. First of all, outsourcing analytics differs from outsourcing traditional ICT services: analytics concerns a company's front-end strategy, whereas many ICT services are part of a company's back-end operations. Another important risk is the exchange of confidential information. Intellectual property (IP) rights and data security issues should be clearly investigated, addressed, and agreed on. Moreover, since all companies have access to the same analytical techniques, they are differentiated only by the data they provide. Hence, an outsourcer should provide clear guidelines and guarantees about how intellectual property and data will be managed and protected (using, e.g., encryption techniques and firewalls), particularly if the outsourcer also collaborates with other companies in the same industry sector. Another important risk concerns the continuity of the partnership. Offshore outsourcing companies are often subject to mergers and acquisitions, not seldom with other outsourcing companies that work for the competition, thereby diluting any competitive advantage realized. Furthermore, many of these outsourcers face high employee turnover due to intensive work schedules, the boredom of performing low-level activities on a daily basis, and aggressive headhunters chasing these hard-to-find data science profiles. This attrition problem seriously inhibits building a thorough, long-term understanding of a customer's analytical business processes and needs. Another often-cited complexity concerns the cultural mismatch (e.g., time management, different languages, local versus global issues) between the buyer and the outsourcer. Exit strategies should also be clearly agreed on. Many analytical outsourcing contracts have a maturity of three to four years; when these contracts expire, it should be clearly stipulated how the analytical models and knowledge will be transferred to the buyer to ensure business continuity. Finally, the shortage of data scientists in the United States and Western Europe is also a constraint, and may be even more severe, in the countries providing outsourcing services. These countries typically have universities with good statistical education and training programs, but their graduates often lack the business skills, insights, and experience needed to make a strategic contribution with analytics.

Given these considerations, many firms are quite skeptical about outsourcing and prefer to keep all big data and analytics in house. Others adopt a partial outsourcing strategy, whereby baseline, operational analytical activities such as query and reporting, multidimensional data analysis, and OLAP are outsourced, whereas the advanced descriptive, predictive, and social network analytical skills are developed and managed internally.

On Premise versus the Cloud

Most firms started to develop their first analytical models using on-premise architectures, platforms, and solutions. However, given the significant amount of investment in installing, configuring, upgrading, and maintaining these environments, many companies have started looking at cloud-based solutions as a budget-friendly alternative to further boost the ROI. In what follows, we elaborate on the costs and other implications of deploying big data and analytics in the cloud.

An often-cited advantage of on-premise analytics is that you keep your data in-house, giving you full control over it. However, this is a double-edged sword, since it also requires firms to continuously invest in high-end security solutions to thwart increasingly sophisticated data breach attacks by hackers. It is precisely because of this security concern that many companies have started looking at the cloud. Another driver concerns the scalability and economies of scale offered by cloud providers, who pledge to provide customers with state-of-the-art platforms and software solutions. The computation power needed can be entirely tailored to the customer, whether it is a Fortune 500 firm or a small or medium-sized enterprise (SME). More capacity (e.g., servers) can be added on the fly whenever needed. On-premise solutions, by contrast, require firms to carefully anticipate the computational resources needed and invest accordingly, and the risk of over- or under-investment significantly jeopardizes the ROI of analytical projects. In other words, scaling up or down on premise is a lot more tedious and costly.

Another key advantage relates to the maintenance of the analytical environment. On-premise maintenance cycles typically run around 18 months. These can get quite costly and create business continuity problems because of backward compatibility issues, newly added features, removed features, and new integration efforts, among others. When using cloud-based solutions, all of these issues are taken care of by the provider, and maintenance or upgrade projects may even go unnoticed.

The low-footprint access to data management and analytics capabilities also positively impacts time to value and accessibility. As mentioned, there is no need to set up expensive infrastructure (e.g., hardware, operating systems, databases, analytical solutions), upload and clean data, or integrate data; using the cloud, everything is readily accessible. This significantly lowers the entry barrier to experimenting with analytics, trying out new approaches and models, and combining various data sources in a transparent way. All of this contributes to the economic value of analytical modeling and facilitates the serendipitous discovery of interesting patterns.

Cloud-based solutions catalyze improved collaboration across business departments and geographical locations. Many on-premise systems are loosely coupled or not integrated at all, thereby seriously inhibiting any firm-wide sharing of experiences, insights, and findings. The resulting duplication of effort negatively impacts the ROI at the corporate level.

From the above discussion, it becomes clear that cloud-based solutions have a substantial impact on the TCO and ROI of your analytical projects. However, as with any new technology, it is advised to approach them with a thoughtful strategic vision and the necessary caution. That is why most firms have started adopting a mixed approach, gently migrating some of their analytical models to the cloud so as to get their feet wet and see both the potential and the caveats of this technology. It can, however, be expected that, given the many (cost) advantages offered, cloud-based big data and analytics will continue to grow.

Open-Source versus Commercial Software

The popularity of open-source analytical software such as R and Python has sparked the debate about the added value of commercial tools such as SAS, SPSS, and Matlab, among others. In fact, commercial and open-source software each has its merits, which should be thoroughly evaluated before any software investment decision is made. Note that they can, of course, also be combined in a mixed setup.

First of all, the key advantage of open-source software is that it is obviously available for free, which significantly lowers the entry barrier to its use. This may be particularly relevant to smaller firms that wish to kick off with analytics without making investments that are too big. However, this clearly poses a danger as well, since anyone can contribute to open source without any quality assurance or extensive prior testing. In heavily regulated environments such as credit risk (the Basel Accord), insurance (the Solvency Accord), and pharmaceutics (the FDA regulation), the analytical models are subject to external supervisory review because of their strategic impact on society—which is now bigger than ever before. Hence, in these settings, many firms prefer to rely on mature commercial solutions that have been thoroughly engineered and extensively tested, validated, and completely documented. Many of these solutions also include automatic reporting facilities to generate compliant reports in each of the settings mentioned. Open-source software solutions come without any kind of quality control or warranty, which increases the risk of using them in a regulated environment.

Another key advantage of commercial solutions is that the software offered is no longer centered on dedicated analytical workbenches (e.g., for data preprocessing and data mining), but on well-engineered, business-focused solutions that automate the end-to-end chain of activities. As an example, consider credit risk analytics, which runs from initially framing the business problem, through data preprocessing and analytical model development, to model monitoring, stress testing, and regulatory capital calculation (Baesens et al. 2016). Automating this entire chain of activities using open source would require various scripts, likely originating from heterogeneous sources, to be matched and connected, resulting in a melting pot of software whose overall functionality can become unstable and unclear.

Contrary to open-source software, commercial software vendors also offer extensive help facilities such as FAQs, technical support hotlines, newsletters, and professional training courses, to name a few. Another key advantage of commercial software vendors is business continuity. More specifically, the availability of centralized R&D teams (as opposed to worldwide, loosely connected open-source developers) that closely follow up on new analytical and regulatory developments provides a better guarantee that new software upgrades will offer the required facilities. In an open-source environment, you need to rely on the community to contribute voluntarily, which provides less of a guarantee.

A disadvantage of commercial software is that it usually comes in prepackaged, black-box routines, which, although extensively tested and documented, cannot be inspected by the more sophisticated data scientist. This is in contrast to open-source solutions, which provide full access to the source code of each of the scripts contributed.

Given the previous discussion, it is clear that both commercial and open-source software have strengths and weaknesses. Hence, it is likely that both will continue to coexist, and interfaces should be provided for both to collaborate, as is the case for SAS and R/Python, for example.

IMPROVING THE ROI OF BIG DATA AND ANALYTICS

New Sources of Data

The ROI of an analytical model is directly related to its predictive and/or statistical power, as extensively discussed in Chapters 4 to 7 of this book. The better an analytical model can predict or describe customer behavior, the more effective and the more profitable the resulting actions will be. In addition to adopting profit-driven analytical approaches, one way to further boost ROI is by investing in new sources of data, which can help to further unravel complex customer behavior and improve key analytical insights. In what follows, we briefly explore various types of data sources that may be worth pursuing in order to squeeze more economic value out of analytical models.

A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections (e.g., family or friends). Explicit networks can be readily distilled from underlying data sources (e.g., call logs), and their key characteristics can then be summarized using featurization procedures, as discussed in Chapter 2. In our previous research (Verbeke et al. 2014; Van Vlasselaer et al. 2017), we found network data to be highly predictive for both customer churn prediction and fraud detection. Implicit networks or pseudo networks are a lot more challenging to define and featurize.

Martens and Provost (2016) built a network of customers where links were defined based on which customers transferred money to the same entities (e.g., retailers) using data from a major bank. When combined with non-network data, this innovative way of defining a network based on similarity instead of explicit social connections gave a better lift and generated more profit for almost any targeting budget. In another, award-winning study they built a geosimilarity network among users based on location-visitation data in a mobile environment (Provost et al. 2015). More specifically, two devices are considered similar, and thus connected, when they share at least one visited location. They are more similar if they have more shared locations and as these are visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, or to improve online interactions by selecting users with similar tastes. Both of these examples clearly illustrate the potential of implicit networks as an important data source. A key challenge here is to creatively think about how to define these networks based on the goal of the analysis.
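
As a rough sketch of the general idea, and not of the exact procedures used in the studies above, the following Python snippet links devices that share visited locations and weights shared locations inversely to their popularity; the devices, locations, and weighting scheme are made-up illustrations.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical device -> visited-location data; the weighting scheme below is a
# simplification of the geosimilarity idea (more shared, less-visited locations
# imply a stronger link), not the exact method from Provost et al. (2015).
visits = {
    "device_A": {"loc_1", "loc_2", "loc_3"},
    "device_B": {"loc_2", "loc_3"},
    "device_C": {"loc_3", "loc_4"},
}

# Count how many devices visit each location (popular locations are less informative).
location_popularity = defaultdict(int)
for locations in visits.values():
    for loc in locations:
        location_popularity[loc] += 1

# Connect two devices when they share at least one location; weight shared
# locations inversely to their popularity.
edges = {}
for dev_a, dev_b in combinations(visits, 2):
    shared = visits[dev_a] & visits[dev_b]
    if shared:
        edges[(dev_a, dev_b)] = sum(1.0 / location_popularity[loc] for loc in shared)

print(edges)  # device pairs with their similarity weights
```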

Data are often branded as the new oil, and data-pooling firms capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results. Popular examples are Equifax, Experian, Moody's, S&P, Nielsen, and Dun & Bradstreet, among many others. These firms consolidate publicly available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses (e.g., the geographical distribution of credit default rates in a country, average churn rates across industry sectors), build generic scores (e.g., the FICO score in the United States), and sell these to interested parties. Because of the low entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms (e.g., SMEs) to take their first steps in analytics. Besides commercially available external data, open data can also be a valuable source of external information; examples are industry and government data, weather data, news data, and search data (e.g., Google Trends). Both commercial and open external data can significantly boost the performance, and thus the economic return, of an analytical model.

Macroeconomic data are another valuable source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macroeconomic up- or downturns can have a significant impact on the performance and thus ROI of the model. The state of the macroeconomy can be summarized using measures such as gross domestic product (GDP), inflation, and unemployment. Incorporating these effects will allow us to further improve the performance of analytical models and make them more robust against external influences.
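
As an illustration of how such effects could be incorporated, the sketch below attaches hypothetical quarterly GDP growth and unemployment figures to customer snapshots by observation date using pandas; the column names, dates, and values are assumptions.

```python
import pandas as pd

# Hypothetical customer snapshot and quarterly macroeconomic indicators.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "snapshot_date": pd.to_datetime(["2016-02-15", "2016-05-10", "2016-08-20"]),
    "churn": [0, 1, 0],
})
macro = pd.DataFrame({
    "quarter_start": pd.to_datetime(["2016-01-01", "2016-04-01", "2016-07-01"]),
    "gdp_growth": [0.4, 0.5, 0.3],
    "unemployment": [8.1, 7.9, 8.0],
})

# Attach the most recent macro observation preceding each snapshot date.
enriched = pd.merge_asof(
    customers.sort_values("snapshot_date"),
    macro.sort_values("quarter_start"),
    left_on="snapshot_date",
    right_on="quarter_start",
    direction="backward",
)
print(enriched[["customer_id", "gdp_growth", "unemployment"]])
```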

Textual data are also an interesting type of data to consider. Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation. Textual data are difficult to process analytically since they are unstructured and cannot be directly represented in a matrix format. Moreover, these data depend on the linguistic structure (e.g., type of language, relationships between words, negations) and are typically quite noisy due to grammatical or spelling errors, synonyms, and homographs. However, they can contain very relevant information for your analytical modeling exercise. Just as with network data, it is important to find ways to featurize text documents and combine them with your other structured data. A popular way of doing this is by using a document term matrix indicating which terms (similar to variables) appear, and how frequently, in which documents (similar to observations). This matrix will be large and sparse, so dimension reduction is very important; the following activities illustrate how it can be achieved (a small illustrative code sketch follows the list):

  • Represent every term in lower case (e.g., PRODUCT, Product, product become product).
  • Remove terms that are uninformative, such as stop words and articles (e.g., the product, a product, this product become product).
  • Use synonym lists to map synonym terms to one single term (product, item, article become product).
  • Stem all terms to their root (products, product become product).
  • Remove terms that only occur in a single document.
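
A minimal sketch of these cleaning steps, applied to a few hypothetical product reviews, is shown below; the stop-word list, synonym map, and plural-stripping rule are deliberately simplistic stand-ins for a full-fledged text-mining library.

```python
# Minimal sketch of the cleaning steps listed above, applied to hypothetical reviews.
documents = [
    "The Product is a great item",
    "This product is a great article",
    "Terrible products, terrible service",
]

stop_words = {"the", "a", "this", "is"}
synonyms = {"item": "product", "article": "product"}

tokenized = []
for doc in documents:
    tokens = []
    for raw in doc.split():
        term = raw.lower().strip(",.")   # lower case, drop trailing punctuation
        if term in stop_words:           # remove uninformative stop words
            continue
        term = synonyms.get(term, term)  # map synonyms to a single term
        if term.endswith("s"):           # crude stand-in for stemming
            term = term[:-1]
        tokens.append(term)
    tokenized.append(tokens)

# Keep only terms that occur in more than one document.
doc_frequency = {}
for tokens in tokenized:
    for term in set(tokens):
        doc_frequency[term] = doc_frequency.get(term, 0) + 1
vocabulary = sorted(term for term, df in doc_frequency.items() if df > 1)

# Build the document term matrix (rows are documents, columns are terms).
dtm = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
print(vocabulary)  # ['great', 'product']
print(dtm)         # [[1, 2], [1, 2], [0, 1]]
```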

Even after the above activities have been performed, the number of dimensions may still be too big for practical analysis. Singular value decomposition (SVD) offers a more advanced way to do dimension reduction (Meyer 2000). SVD works similarly to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors (also called latent concepts), which are linear combinations of the original terms. These reduced dimensions can then be added as new features to your existing structured dataset.
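
The sketch below illustrates this step with scikit-learn, building a document term matrix with CountVectorizer and reducing it to two latent concepts with TruncatedSVD; the example documents and the choice of two components are assumptions for illustration purposes.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; in practice these would be the cleaned texts from above.
documents = [
    "great product great price",
    "terrible product terrible service",
    "great service great price",
]

# Build the (large and sparse, in realistic settings) document term matrix.
dtm = CountVectorizer().fit_transform(documents)

# Summarize the matrix into two singular vectors ("latent concepts"); these reduced
# dimensions can then be appended as new features to the structured dataset.
svd = TruncatedSVD(n_components=2, random_state=42)
latent_features = svd.fit_transform(dtm)

print(latent_features)                 # one row of two latent features per document
print(svd.explained_variance_ratio_)   # variance captured by each latent concept
```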

Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered as well. To successfully leverage these types of data in your analytical models, it is of key importance to carefully think about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata are taken into account; for example, not only the image itself might be relevant, but also who took it, where, and at what time. This information could be very useful for fraud detection.

Data Quality

Besides volume and variety, the veracity of the data is also a critical success factor to generate competitive advantage and economic value from data. Quality of data is key to the success of any analytical exercise since it has a direct and measurable impact on the quality of the analytical model and hence its economic value. The importance of data quality is nicely captured by the well-known GIGO or garbage in, garbage out principle: Bad data yield bad analytical models (see Chapter 2).

Data quality is often defined as "fitness for use," which illustrates the relative nature of the concept (Wang et al. 1996). Data that are of sufficient quality for one use may not be appropriate for another. For example, the degree of accuracy and completeness required for fraud detection may not be required for response modeling. More generally, data that are of acceptable quality in one application may be perceived to be of poor quality in another application, even by the same users. This is mainly because data quality is a multidimensional concept in which each dimension represents a single construct and comprises both objective and subjective elements. Therefore, it is useful to define data quality in terms of its dimensions, as illustrated in Table 7.2 (Wang et al. 1996).

Table 7.2 Data Quality Dimensions (Wang et al. 1996)

Each dimension is defined as the extent to which…

Intrinsic
  • Accuracy: data are regarded as correct.
  • Believability: data are accepted or regarded as true, real, and credible.
  • Objectivity: data are unbiased and impartial.
  • Reputation: data are trusted or highly regarded in terms of their source and content.

Contextual
  • Value-added: data are beneficial and provide advantages for their use.
  • Completeness: data values are present.
  • Relevancy: data are applicable and useful for the task at hand.
  • Timeliness: data are available at the right moment for the task at hand.
  • Appropriate amount of data: the quantity or volume of available data is appropriate.

Representational
  • Interpretability: data are in appropriate language and units and the data definitions are clear.
  • Representational consistency: data are represented in a consistent way.
  • Concise representation: data are represented in a compact way.
  • Ease of understanding: data are clear, without ambiguity, and easily comprehended.

Accessibility
  • Accessibility: data are available or can be easily and quickly retrieved.
  • Security: access to data can be restricted and hence kept secure.

Most organizations are becoming aware of the importance of data quality and are looking at ways to improve it. However, this often turns out to be harder than expected, more costly than budgeted, and definitely not a one-off project but a continuous challenge. The causes of data quality issues are often deeply rooted within the core organizational processes and culture, as well as in the IT infrastructure and architecture.

Whereas often only data scientists are directly confronted with the consequences of poor data quality, resolving these issues and, more importantly, their causes typically requires cooperation and commitment from almost every level and department within the organization. It most definitely requires support and sponsorship from senior executive management to increase awareness and set up data governance programs that tackle data quality in a sustainable and effective manner, as well as to create incentives for everyone in the organization to take up their responsibilities.

Data preprocessing activities such as handling missing values, duplicate data, or outliers (see Chapter 2) are corrective measures for dealing with data quality issues. These are, however, short-term remedies with relatively low cost and moderate return. Data scientists will have to keep applying these fixes until the root causes of the issues are resolved in a structural way. In order to do so, data quality programs need to be developed that aim at detecting the key problems. This will include a thorough investigation of where the problems originate from, in order to find and resolve them at their very origin by introducing preventive actions as a complement to corrective measures. This obviously requires more substantial investments and a strong belief in the added value and return thereof. Ideally, a data governance program should be put in place assigning clear roles and responsibilities with respect to data quality. Two roles that are essential in rolling out such a program are data stewards and data owners.

Data stewards are the data quality experts who are in charge of assessing data quality by performing extensive and regular data quality checks. They are responsible for initiating remedial actions whenever needed. A first type of action to be considered is the application of short-term corrective measures as already discussed. Data stewards are, however, not in charge of correcting the data themselves: This is the task of the data owner. Every data field in every database of the organization should be owned by a data owner, who is able to enter or update its value. In other words, the data owner has knowledge about the meaning of each data field and can look up its current correct value (e.g., by contacting a customer, by looking into a file). Data stewards can request data owners to check or complete the value of a field, as such correcting the issue. A second type of action to be initiated by a data steward concerns a deeper investigation into the root causes of the data quality issues that were detected. Understanding these causes may allow the designing of preventive measures that aim at eradicating data quality issues. Preventive measures typically start by carefully inspecting the operational information systems from which the data originate. Based on this inspection, various actions can be undertaken, such as making certain data fields mandatory (e.g., social security number), providing drop-down lists of possible values (e.g., dates), rationalizing or simplifying the interface, and defining validation rules (e.g., age should be between 18 and 100). Implementing such preventive measures will require close involvement of the IT department in charge of the application. Although designing and implementing preventive measures will require more efforts in terms of investment, commitment, and involvement than applying corrective measures would, they are the only type of action that will improve data quality in a sustainable manner, and as such secure the long-term return on investment in analytics and big data!
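
As a minimal sketch of what such preventive validation rules could look like in code, consider the example below; the mandatory fields, the age bounds, and the record layout are illustrative assumptions.

```python
# Minimal sketch of preventive data-quality validation rules; field names,
# mandatory fields, and bounds are illustrative assumptions.
MANDATORY_FIELDS = {"social_security_number", "birth_date"}

def validate_record(record):
    """Return a list of data quality violations for a single input record."""
    violations = []
    for field in MANDATORY_FIELDS:
        if not record.get(field):
            violations.append(f"missing mandatory field: {field}")
    age = record.get("age")
    if age is not None and not 18 <= age <= 100:
        violations.append(f"age out of range: {age}")
    return violations

# Example: a record with a missing mandatory field and an out-of-range age.
print(validate_record({"social_security_number": "", "birth_date": "1990-05-01", "age": 17}))
```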

Management Support

To fully capitalize on big data and analytics, these technologies should have a seat on the board of directors. This can be achieved in various ways: either an existing chief-level executive (e.g., the CIO) takes on the responsibility, or a new CXO function is defined, such as chief analytics officer (CAO) or chief data officer (CDO). To guarantee maximum independence and organizational impact, it is important that the latter reports directly to the CEO instead of to another C-level executive. A top-down, data-driven culture, in which the CEO and his or her direct reports make decisions inspired by data combined with business acumen, will catalyze a trickle-down effect of data-based decision making throughout the entire organization.

The board of directors and senior management should be actively involved in the analytical model building, implementation, and monitoring processes. Of course, one cannot expect them to understand all underlying technical details, but they should be responsible for sound governance of the analytical models. Without appropriate management support, analytical models are doomed to fail. Hence, the board and senior management should have a general understanding of the analytical models. They should demonstrate active involvement on an ongoing basis, assign clear responsibilities, and put into place organizational procedures and policies that will allow the proper and sound development, implementation, and monitoring of the analytical models. The outcome of the model monitoring exercise must be communicated to senior management and, if needed, accompanied by appropriate (strategic) response. Obviously, this requires a careful rethinking of how to optimally embed big data and analytics in the organization.

Organizational Aspects

In 2010, Davenport, Harris, and Morison wrote:

As mentioned before, investments in big data and analytics only bear fruit when a company-wide data culture is in place to actually do something with all these new data-driven insights. If you were to put a team of data scientists in a room and feed them data and analytical software, the chances are pretty small that their analytical models and insights would add economic value to the firm. A first hurdle concerns the data, which are not always readily available; a well-articulated data governance program is a good starting point (see above). Once the data are available, any data scientist will be able to derive a statistically meaningful analytical model from them. However, this does not necessarily imply that the model adds economic value, since it may not be in sync with the business objectives (see Chapter 3). And even if it were in sync, how do we sell it to our business people so that they understand it, trust it, and actually start using it in their decision making? This implies delivering insights in a way that is easy to understand and use, for example by representing them in simple language or intuitive graphics.

Given the corporate-wide impact of big data and analytics, it is important that both gradually permeate into a company's culture and decision-making processes, as such becoming part of a company's DNA. This requires a significant investment in terms of awareness and trust that should be initiated top-down from the executive level as discussed above. In other words, companies need to thoroughly think about how they embed big data and analytics in their organization in order to successfully compete using both technologies.

Lismont et al. (2017) conducted a worldwide, cross-industry survey of senior-level executives to investigate modern trends in the organization of analytics. They observed various formats used by companies to organize their analytics. Two extreme approaches are centralized, where a central department of data scientists handles all analytics requests, and decentralized, where all data scientists are directly assigned to the respective business units. Most companies opt for a mixed approach combining a centrally coordinated center of analytical excellence with analytics organized at the business unit level. The center of excellence provides firm-wide analytical services and implements universal guidelines in terms of model development, model design, model implementation, model documentation, model monitoring, and privacy. Decentralized teams of one to five data scientists are then added to each of the business units for maximum impact. A suggested practice is to rotate data scientists across the business units and the center so as to foster cross-fertilization between the different teams and applications.

Cross-Fertilization

Big data and analytics have matured differently across the various business units of an organization. Triggered by the introduction of regulatory guidelines (e.g., Basel II/III, Solvency II) and driven by significant returns and profits, many firms (particularly financial institutions) have been investing in big data and analytics for risk management for quite some time. Years of analytical experience and refinement have contributed to very sophisticated models for insurance risk, credit risk, operational risk, market risk, and fraud risk. The most advanced analytical techniques, such as survival analysis, random forests, neural networks, and (social) network learning, have been used in these applications. Furthermore, these analytical models have been complemented with powerful model monitoring frameworks and stress testing procedures to fully leverage their potential.

Marketing analytics is less mature, with many firms starting to deploy their first models for churn prediction, response modeling, or customer segmentation. These are typically based on simpler analytical techniques such as logistic regression, decision trees, or K-means clustering. Other application areas, such as HR and supply chain analytics, are starting to gain traction, although not many successful case studies have been reported yet.

The disparity in maturity creates tremendous potential for cross-fertilization of model development and monitoring experiences. After all, classifying whether a customer is creditworthy in risk management is analytically the same task as classifying a customer as a responder, or not, in marketing analytics, or classifying an employee as a churner, or not, in HR analytics. The data preprocessing issues (e.g., missing values, outliers, categorization), classification techniques (e.g., logistic regression, decision trees, random forests), and evaluation measures (e.g., AUC, lift curves) are all similar. Only the interpretation and usage of the models differ. Additionally, some tuning and adaptation is needed in the setup and in gathering the "right" data: What characteristics are predictive of employee churn? How do we define employee churn? How far in advance can or must we predict it? The cross-fertilization also applies to model monitoring, since most of the challenges and approaches are essentially the same. Finally, gauging the effect of macroeconomic scenarios using stress testing (a common practice in credit risk analytics) is another example of sharing useful experiences across applications.

To summarize, less mature analytical applications (e.g., marketing, HR and supply chain analytics) can substantially benefit from many of the lessons learned by more mature applications (e.g., risk management) as such avoiding many rookie mistakes and expensive beginner traps. Hence, the importance of rotational deployment (as discussed in the previous section) to generate maximum economic value and return is clear.

CONCLUSION

In this chapter, we zoomed into the economic impact of analytical models. We first provided a perspective on the economic value of big data and analytics by discussing total cost of ownership (TCO), return on investment (ROI), and profit-driven business analytics. We elaborated on some key economic considerations, such as in-sourcing versus outsourcing, on-premise versus in the cloud configurations, and open-source versus commercial software. We also gave some recommendations about how to improve the ROI of big data and analytics by exploring new sources of data, safeguarding data quality and management support, careful embedding of big data and analytics in the organization, and fostering cross-fertilization opportunities.

REVIEW QUESTIONS

Multiple Choice Questions

  Question 1. Which of the following costs should be included in a total cost of ownership (TCO) analysis?
    1. acquisition costs
    2. ownership and operation costs
    3. post-ownership costs
    4. all of the above

  Question 2. Which of the following statements is not correct?
    1. ROI analysis offers a common firm-wide language to compare multiple investment opportunities and decide which one(s) to go for.
    2. For companies like Facebook, Amazon, Netflix, and Google, a positive ROI is obvious since they essentially thrive on data and analytics.
    3. Although the benefit component is usually not that difficult to approximate, the costs are much harder to precisely quantify.
    4. Negative ROI of big data and analytics often boils down to the lack of good quality data, management support, and a company-wide data driven decision culture.

  Question 3. Which of the following is not a risk when outsourcing big data and analytics?
    1. need for all analytical activities to be outsourced
    2. exchange of confidential information
    3. continuity of the partnership
    4. dilution of competitive advantage due to, for example, mergers and acquisitions

  Question 4. Which of the following is not an advantage of open-source software for analytics?
    1. It is available for free.
    2. A worldwide network of developers can work on it.
    3. It has been thoroughly engineered and extensively tested, validated, and completely documented.
    4. It can be used in combination with commercial software.

  Question 5. Which of the following statements is correct?
    1. When using on-premise solutions, maintenance or upgrade projects may even go by unnoticed.
    2. An important advantage of cloud-based solutions concerns the scalability and economies of scale offered. More capacity (e.g., servers) can be added on the fly whenever needed.
    3. The big footprint access to data management and analytics capabilities is a serious drawback of cloud-based solutions.
    4. On-premise solutions catalyze improved collaboration across business departments and geographical locations.

  Question 6. Which of the following are interesting data sources to consider to boost the performance of your analytical models?
    1. network data
    2. external data
    3. unstructured data such as text data and multimedia data
    4. all of the above

  Question 7. Which of the following statements is correct?
    1. Data quality is a multidimensional concept in which each dimension represents a single construct and also comprises both objective and subjective elements.
    2. Data preprocessing activities such as handling missing values, duplicate data, or outliers are preventive measures for dealing with data quality issues.
    3. Data owners are the data quality experts who are in charge of assessing data quality by performing extensive and regular data quality checks.
    4. Data stewards can request data scientists to check or complete the value of a field, as such correcting the issue.

  Question 8. To guarantee maximum independence and organizational impact of analytics, it is important that
    1. The chief data officer (CDO) or chief analytics officer (CAO) reports to the CIO or CFO.
    2. The CIO takes care of all analytical responsibilities.
    3. A chief data officer or chief analytics officer is added to the executive committee, who directly reports to the CEO.
    4. Analytics is supervised only locally in the business units.

  Question 9. What is the correct ranking of the following analytics applications in terms of maturity?
    1. marketing analytics (most mature), risk analytics (medium mature), HR analytics (least mature)
    2. risk analytics (most mature), marketing analytics (medium mature), HR analytics (least mature)
    3. risk analytics (most mature), HR analytics (medium mature), marketing analytics (least mature)
    4. HR analytics (most mature), marketing analytics (medium mature), risk analytics (least mature)

  Question 10. Which of the following activities could be considered to boost the ROI of big data and analytics?
    1. investing in new sources of data
    2. improving data quality
    3. involving senior management
    4. choosing the right organization format
    5. establishing cross-fertilization between business units
    6. all of the above

Open Questions

  Question 1. Conduct a SWOT analysis for the following investment decisions:
    1. In- versus outsourcing analytical activities
    2. On-premise versus in the cloud analytical platforms
    3. Open-source versus commercial analytical software

  Question 2. Give examples of analytical applications where the following external data can be useful:
    1. macroeconomic data (e.g., GDP, inflation, unemployment)
    2. weather data
    3. news data
    4. Google Trends search data

  Question 3. Discuss the importance of data quality. What are the key dimensions? How can data quality issues be dealt with in the short term and the long term?

  Question 4. How can management support and organizational format contribute to the success of big data and analytics? Discuss ways to generate cross-fertilization effects.

  Question 5. Read the following article: B. Baesens, S. De Winne, and L. Sels, "Is Your Company Ready for HR Analytics?" MIT Sloan Management Review (Winter 2017). See mitsmr.com/2greOYb.
    1. Summarize important cross-fertilization opportunities between customer and HR analytics.
    2. Which techniques from customer churn prediction could be useful for employee churn prediction?
    3. What about customer segmentation?
    4. Which model requirements are different in HR analytics compared to customer analytics?
    5. Illustrate with examples.

NOTES

REFERENCES

  1. Ariker, M., A. Diaz, C. Moorman, and M. Westover. 2015. "Quantifying the Impact of Marketing Analytics." Harvard Business Review (November).
  2. Baesens, B., D. Roesch, and H. Scheule. 2016. Credit Risk Analytics: Measurement Techniques, Applications and Examples in SAS. Hoboken, NJ: John Wiley & Sons.
  3. Davenport, T. H., J. G. Harris, and R. Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Review Press.
  4. Lismont, J., J. Vanthienen, B. Baesens, and W. Lemahieu. 2017. "Defining Analytics Maturity Indicators: A Survey Approach." Submitted for publication.
  5. Martens, D., and F. Provost. 2016. "Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics." MIS Quarterly 40 (4): 869–888.
  6. Meyer, C. D. 2000. Matrix Analysis and Applied Linear Algebra. Philadelphia: SIAM.
  7. Provost, F., D. Martens, and A. Murray. 2015. "Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design." Information Systems Research 26 (2): 243–265.
  8. Van Vlasselaer, V., T. Eliassi-Rad, L. Akoglu, M. Snoeck, and B. Baesens. 2017. "GOTCHA! Network-based Fraud Detection for Security Fraud." Management Science, forthcoming.
  9. Verbeke, W., D. Martens, and B. Baesens. 2014. "Social Network Analysis for Customer Churn Prediction." Applied Soft Computing 14: 341–446.
  10. Wang, R. Y., and D. M. Strong. 1996. "Beyond Accuracy: What Data Quality Means to Data Consumers." Journal of Management Information Systems 12 (4): 5–34.