The Data Deluge

According to IDC, the amount of data in the world is doubling every two years.1

Anywhere from 5 to 20 percent of the nation's population contracts the flu each year, causing roughly 36,000 deaths annually.2 Those numbers were sufficient to earn the common flu a place among the top 10 killers in the United States in 2010—beating suicide, homicide, and other lethal forces in the process.3 As with any epidemic, early detection and warning are critical to containing contagion. Enter the Centers for Disease Control and Prevention (CDC), part of the U.S. Department of Health and Human Services and responsible, in part, for notifying the public of potential epidemics. It does so by monitoring data collected and compiled from thousands of health care providers, laboratories, and other sources. But, in 2008, an alternative to the CDC arrived—one that relied not on medical records from esteemed sources but on the seemingly banal searches typed by millions of everyday consumers. That year, Google debuted its Google Flu Trends service, designed to mine and analyze the millions of search terms entered through its engine in an attempt to predict the threat of an epidemic. Even more boldly, the company suggested that it might be able to detect regional outbreaks a week to 10 days before the authority on epidemics, the CDC. As evidence for the claim, the company pointed to February of that year, when the CDC reported a spike in flu cases in the mid-Atlantic states, yet Google's search analytics had revealed an increase in flu-related search terms fully two weeks before the CDC's release.

In 2010, Southeastern Louisiana University released data confirming that the popular microblogging site Twitter was also capable of beating the CDC. Aron Culotta, assistant professor of computer science, and two student assistants analyzed more than 500 million Twitter messages over an eight-month period in 2009. By using a small number of keywords to track the rate of flu-related messages on the site, the team was able to forecast future flu rates with a 95 percent correlation to the national health statistics compiled by the CDC. According to Culotta, Twitter's predictive approach beat the CDC not only on speed but on cost as well:

A micro-blogging service such as Twitter is a promising new data source for Internet-based surveillance because of the volume of messages, their frequency and public availability. This approach is much cheaper and faster than having thousands of hospitals and health care providers fill out forms each week.... The Centers for Disease Control produces weekly estimates, but those reports typically lag a week or two behind. This approach produces estimates daily.4
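
The underlying approach is simple enough to sketch: compute the weekly share of messages containing flu-related keywords, then correlate that rate with an official reference series. The fragment below is a minimal illustration under stated assumptions; the keyword list, the sample messages, and the reference figures are invented, and the published studies use far richer models.

```python
# Illustrative keyword-based surveillance sketch (all data invented).
# Counts the weekly share of messages mentioning flu-related terms and
# correlates that rate with an official reference series.
from statistics import correlation  # Python 3.10+

KEYWORDS = {"flu", "fever", "cough", "sore throat"}  # assumed keyword list

messages_by_week = [  # toy message samples, one list per week
    ["feeling fine today", "great game last night"],
    ["home with a fever and cough", "flu season is brutal", "coffee time"],
    ["everyone at work has the flu", "sore throat again", "flu shot booked"],
]
reference_rates = [0.9, 2.1, 3.4]  # hypothetical weekly influenza-like-illness rates

def keyword_rate(messages):
    """Fraction of messages containing at least one flu-related keyword."""
    hits = sum(any(k in m.lower() for k in KEYWORDS) for m in messages)
    return hits / len(messages)

weekly_rates = [keyword_rate(week) for week in messages_by_week]
print("keyword rates by week:", weekly_rates)
print("correlation with reference series:",
      round(correlation(weekly_rates, reference_rates), 2))
```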

The potential opportunities associated with how existing data could be mined and exploited were sufficient to capture the attention of the United Nations, which established its Global Pulse initiative in 2009. The program is designed “to explore opportunities for using real-time data to gain a more accurate understanding of population wellbeing, especially related to the impacts of global crises.”5 Apparently, the data extracted from social media updates can go much further than predicting the latest disease epidemic—as incredible as that outcome may be on its own. As it turns out, social media can help prognosticate economic epidemics as well. As part of a joint engagement between the United Nations and SAS, the organizations monitored two years' worth of social media data from the United States and Ireland, comprising half a million blogs, forums, and news sites. The organizations sought to correlate the mood of social chatter with official unemployment figures to determine if the former could prophesy the latter. The analysis reached a fascinating conclusion: increased chatter about cutting back on groceries, downgrading one's automobile, and increasing use of public transportation carried tangible value in predicting an unemployment spike. By assessing even subtle changes in the tone of conversations, the organizations were able to estimate how far in advance an unemployment surge would arrive. For example, in the United States, a rise in “hostile” or “depressed” mood occurred four months before the unemployment increase. In Ireland, increases in “anxious” unemployment chatter preceded an unemployment spike by five months. Increased “confused” chatter came three months before the unemployment gain, whereas “confident” chatter significantly decreased two months out.6
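
The lead-time findings suggest a simple way to think about such analyses: shift a monthly mood index forward by k months and ask which lag lines up best with the unemployment series. The sketch below is a toy lagged-correlation check with invented numbers; the actual UN/SAS methodology is considerably more sophisticated.

```python
# Illustrative lagged-correlation sketch (hypothetical monthly series).
# Shifts a "mood" index by k months and measures how well each shifted
# version tracks subsequent unemployment, to estimate lead time.
from statistics import correlation  # Python 3.10+

mood = [0.10, 0.12, 0.20, 0.35, 0.50, 0.55, 0.52, 0.60, 0.62, 0.70, 0.72, 0.75]
unemployment = [5.0, 5.0, 5.1, 5.1, 5.2, 5.4, 5.8, 6.3, 6.6, 6.8, 7.0, 7.1]

def lagged_correlation(signal, target, lag):
    """Correlation between signal[t] and target[t + lag]."""
    return correlation(signal[: len(signal) - lag], target[lag:])

best = max(range(1, 7), key=lambda k: lagged_correlation(mood, unemployment, k))
print(f"mood index leads unemployment by ~{best} months "
      f"(r = {lagged_correlation(mood, unemployment, best):.2f})")
```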

If harnessing existing data is sufficient for predicting major health and well-being trends, what does that suggest for an enterprise seeking the same simply to make better decisions? The digital trail composed of millions of keystrokes, social networking and location updates, channel changes, user-generated and surveillance photographs, and inventory movements creates a staggering amount of data for enterprises potentially to exploit each year. In fact, the precipitous growth in data has become both opportunity and challenge for businesses in recent years, with more and more data becoming digitized—from a paltry 0.8 percent in 1986 to a staggering 94 percent by 2007. Today, 99.9 percent of all new information is digital, adding to the data treasure trove that can be mined and monetized by resourceful enterprises.7 In fact, there is growing evidence to suggest that companies can work smarter by harnessing data to make informed and actionable decisions. In 2011, after studying the success of 179 large publicly traded firms, researchers from MIT and the University of Pennsylvania determined that “big data”-driven decision making is associated with a 5 to 6 percent rise in productivity. In addition, the benefits went beyond mere productivity to include other financial performance indicators, such as asset utilization, return on equity, and market value.8 The finding is corroborated by an Economist Intelligence Unit study commissioned by Capgemini in 2012, which surveyed more than 600 executives from across the globe on the topic. Nine in 10 respondents agreed that data is now an essential factor of production, gaining parity with land, labor, and capital. On average, they reported that big data had improved their organizations' performance over the past three years by 26 percent, and they are bullish that it will do so by an average of 41 percent in the next three years.9

Yet, for all its hype and potential, “big data” has failed to produce big results in the majority of enterprises. More than half the respondents in the Capgemini study say that big data management is not viewed strategically at senior levels of the organization.10 In addition, when Alcatel-Lucent took the pulse of a solution that would analyze the digital footprint of an enterprise and its customers to assist in better decision making, the reaction ran the gamut from optimism to apathy among the 200 respondents, who represented a variety of functions in large and small U.S. firms:

A big data customer-facing solution will save man hours and will make the workplace more efficient. This is a brilliant idea to make the workplace more efficient and profitable. I think it will help for sure for customers to get manager attention with their needs and problems. Time saving and improving productivity is what retailers need. I like it. [Retail employee]

I don't think this would help my business. I can see how this would be a big help for large business. For a large organization, this sounds like a great tool to manage and make information and data more useful. [Small business owner]

Helping to explain the volatility of responses is the dilemma of big data itself. That is, although there are real opportunities facing companies capable of mining their digital data reserves, the challenges are simply too overwhelming for many to succeed. According to McKinsey, 15 out of 17 sectors in the United States have more data stored per company than the U.S. Library of Congress. They project a 40 percent increase in global data generated each year compared with only a 5 percent growth in global IT spending.11 And, although nearly all of the new data produced is digital, thereby making it more available for rapid decision-making capabilities, the reality is that most of it is unstructured—with documents, images, videos, and e-mail comprising the majority of existing data in most organizations and contributing most of the growth.12 According to the Capgemini study, 40 percent of executives complain they have too much unstructured data.13

Despite the challenges, enterprising companies are forging ahead, spurred by the promise of superhuman decision-making capabilities. Once a concept reserved for the wild imaginations of sci-fi authors, the notion of computers outsmarting human beings has made real headlines, most notably when Watson, an artificial intelligence system developed by IBM on 90 distributed computers, defeated returning champions Ken Jennings and Brad Rutter on the popular game show Jeopardy! It's not altogether impressive that a computer can apply algorithms to render answers to questions from well-indexed data sets. However, what made Watson so unique and revolutionary was its ability to do the same for unstructured data—the very bane of organizations attempting to make their growing loads of dirty data actionable. As such, Watson's victory was sufficient to capture the attention of couch potatoes and business decision makers alike. Watson's fame earned it more than a place in game show record books; it also secured Watson employment at Citigroup as the company's newest sales “recruit.” By using the scores of data available in the company's coffers, Watson will make intelligent recommendations as to what products or services (including loans and credit cards) should be offered to customers.14 As it so happens, product recommendations are a boon to other businesses that rely on such data manipulation to make informed suggestions to customers. As an example, at one point, Amazon reported that 30 percent of its sales resulted from its recommendation engine.15 In addition, for organizations capable of converting structured and unstructured data into meaningful information, the results can yield far more than revenue growth. Partners HealthCare, the largest healthcare provider in Massachusetts, is reusing medical data originally collected for clinical purposes, encompassing structured and unstructured formats, to accelerate medical research dramatically. As former CIO John Glaser explained, “We can cut the cost of research by a factor of five, and the time required by a factor of 10. This is a big deal. And even if those [improvements] are halved, this is still a really big deal.”16 Perhaps these early pioneers give hope to other companies aspiring to the same superhuman results, leading at least one analyst to project that the big data technology and services market will grow to roughly $17 billion by 2015, expanding at seven times the growth rate of the overall IT category.17
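
The recommendation engines behind numbers like Amazon's can be illustrated, in their simplest form, by item co-occurrence: suggest whatever is most often bought alongside what the customer already has. The toy sketch below is nothing like a production recommender, and all of the order data is invented.

```python
# Minimal item-co-occurrence recommender (toy data, illustration only).
from collections import Counter
from itertools import combinations

orders = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
    {"laptop", "laptop bag"},
]

# Count how often each ordered pair of items appears in the same order.
co_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, top_n=2):
    """Items most frequently bought together with `item`."""
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [b for b, _ in scores.most_common(top_n)]

print(recommend("laptop"))  # e.g. ['laptop bag', 'mouse']
```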

But although the big data movement is not without its success stories, these examples are currently more the exception than the rule. For struggling organizations wondering what to make of this latest craze, history can be an interesting precursor to future events. McKinsey charts four waves of IT adoption in the United States, each with a varying degree of impact on productivity growth. Within each tranche, the firm estimates the productivity gains associated with IT improvements versus managerial innovation. In the “mainframe” era (from 1959 to 1973), although annual U.S. productivity growth overall was high at 2.82 percent, IT's contribution to output was rather modest, along with its share of overall capital expenditure. During the era of “minicomputers and PCs” (from 1973 to 1995), the U.S.'s overall productivity decreased to 1.49 percent, but IT's contribution grew significantly to more than 40 percent of the measured output. The era of “Internet and Web 1.0” (from 1995 to 2000) was characterized by deepening IT spend and significant managerial innovations that leveraged previous IT investments. During this period, overall U.S. productivity burgeoned to 2.7 percent, with IT's contribution ballooning to nearly 60 percent of the reported output. The fourth and final era, what McKinsey terms that of “mobile devices and Web 2.0,” shows a slight decrease in overall productivity (at 2.5%), with the contribution from managerial innovation again taking the lion's share of credit (at more than 60%). McKinsey's assessment reveals a time lag between increased IT expenditures and the associated return on these investments in the form of managerial innovation. The result is higher levels of overall productivity for firms able to leverage their technology infrastructure. The firm uses this rationale to help explain, in part, the lack of significant empirical evidence linking data intensity, or capital spent on data investments, with productivity in specific sectors. McKinsey suggests that the evidentiary void is not due to a lack of causality but is merely the result of the same time delay seen in the previous waves of IT adoption.18 However, what companies can learn from this history lesson is that, to make big data pay off in a big way, they will require both IT expertise and managerial innovation. Although this may seem intuitive at face value, there is probably no greater example of an IT investment requiring managerial buy-in and collaboration with IT than the one mandated by the big data phenomenon.

Unfortunately, for the many organizations where functional silos reign supreme, fostering collaboration between business leaders and the IT unit is complicated at best. Marketing, Finance, Human Resources, Operations, Sales, and IT may be aligned to overarching company strategy, but each department typically owns specific metrics for getting there. IT, for its part, has traditionally been the steward of the firm's technology infrastructure, including the data centers and computing horsepower therein. After all, IT professionals have the technical expertise required to determine how a company's data assets are processed, stored, and retrieved. But the staggering data load on its own is sufficient to stretch IT's technical skills to new extremes. According to IDC, users created and replicated 1.8 trillion gigabytes (GB) of data in 2011, equivalent to filling more than 57 billion 32 GB Apple iPads—enough to build a wall of iPads equal in length to the Great Wall of China and twice its height. The study asserts:

Over the next decade (by 2020), IT departments worldwide will experience [growth of] 10 times the number of servers (virtual and physical), 50 times the amount of information to be managed, [and] 75 times the number of files or containers that encapsulate the information in the digital universe, which is growing even faster than the information itself as more and more embedded systems, such as sensors in clothing, in bridges, or medical devices [proliferate].19

Yet, despite the vast increase in technology infrastructure required, the growth in IT personnel to manage this data tsunami is nowhere near keeping pace. IDC expects the number of IT professionals to grow by only 1.5 times over the same period.20

But, beyond the technical gymnastics required to move and store bits and bytes securely and efficiently at a torrid pace, the promise of big data requires understanding what data is useful and how it can be interpreted to produce action. As capable as IT professionals may be in addressing the nontrivial technical challenges, they lack the know-how to separate the wheat from the chaff in data interpretation and analysis—such proficiency is the dominion of the business leaders residing in their functional silos. Not only does this new paradigm require closer collaboration between IT and business leaders than perhaps ever before, but the burden itself may prove too great for passionate teamwork alone to address. McKinsey foretells a creeping labor shortage that may leave many organizations paralyzed by an inability to turn data into insights. Unfortunately, this gap is the exclusive problem of neither IT nor the functional leaders it supports. By 2018, McKinsey warns, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills—as well as a shortfall of 1.5 million managers and analysts with the expertise to use such data to make effective decisions.21

Even if blessed with the right number of people possessing the right skills who work across functional silos effectively, companies also risk falling victim to making bad decisions—using big data as their crutch—all the same. Whether manifesting itself in statistical scourges (such as multicollinearity, whereby highly correlated input variables make it difficult to determine the predictive value of any individual input) or in simple human bias in interpreting the data, the field is littered with potential landmines. Using big data as their sword, firms may enter battle with erroneous conclusions based on a flawed assumption that patterns in a data set, such as correlation, on their own suggest causation. Consider some of the more ridiculous examples demonstrating that correlation does not imply causation: Shark attacks and the sale of ice cream have been positively correlated (not because sharks have a penchant for ice cream but, rather, because the two variables exhibit a common response to the warmer season); the number of cavities in children and the size of their vocabularies have been positively correlated (with neither having anything to do with the other but with both being associated with the maturing age of the child in question); or the confounding mystery of rising skirt hemlines with corresponding increases in the stock market. In some cases, these correlations can appear very strong, further compounding the risk of faulty interpretation. For example, Google Correlate is a tool that helps identify best-fit correlations between two sets of terms in online search patterns. Since 2004, searches for “weight gain” and “apartments for rent” in the United States have shown a correlation of 0.9, a remarkably close fit, although it is a mystery how one variable could cause the other.22
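
The correlation trap is easy to reproduce: any two series that merely drift in the same direction over time will correlate strongly, whether or not one has anything to do with the other. The sketch below fabricates two unrelated upward-trending series and shows the high coefficient that falls out.

```python
# Two unrelated series that both drift upward over time will show a high
# correlation, illustrating why correlation alone does not imply causation.
import random
from statistics import correlation  # Python 3.10+

random.seed(42)
months = range(60)

# "ice cream sales" and "shark sightings": independent noise on shared upward trends
ice_cream = [100 + 2 * t + random.gauss(0, 10) for t in months]
sharks = [5 + 0.1 * t + random.gauss(0, 1) for t in months]

r = correlation(ice_cream, sharks)
print(f"correlation between unrelated trending series: r = {r:.2f}")
# A high r here reflects the shared temporal trend, not causation.
```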

Although big data is not without big problems, there are some early lessons that may prove useful to companies hoping to turn big data into bigger results:

  • Create a culture of data-driven actions—In the 2012 Alcatel-Lucent study, nearly two-thirds of the most successful companies (as measured by self-reported feedback on a variety of financial metrics, including sales, profitability, growth, and employee retention) are also those that respond well to changes in the market environment, according to their employees (compared with only 39% of failing companies whose employees stated the same). Paradoxically, these successful companies are also much more likely to involve multiple people in decision making (50% of successful companies vs. 30% of their failing counterparts). Typically, the more heads involved in decision making, the slower the firm's speed to market—the classic result of bureaucracy-laden organizations. However, these successful firms are somehow managing to respond well to changes in their environment while enlisting multiple inputs. They also manage to do it without working much harder than their counterparts, as measured by respondents' answers to the average hours worked per week at their companies—employees in top-performing enterprises report working an average of 44 hours per week, compared with 42.5 hours for employees in failing firms.

    A culture focused on data insights and powered by automated and guided decision making may deserve part of the credit. In the Capgemini study, two-thirds of executives agree that there is not enough of a “big data culture” in their organizations. According to nearly 60 percent of them, the biggest impediment to cultural transformation is the resolute bastion of organizational silos that still defines so many companies. Culture is a concern for every employee of a company, and, to pull this objective off, senior business leaders and IT decision makers must create deeper and more creative linkages between their teams than ever before. It will require IT to become more business-savvy and business leaders to up their technical knowledge. For those organizations able to crack the culture code, the competitive advantage can be lasting and meaningful, precisely because this is not a problem that will be easily solved by most enterprises.

  • Prepare for a recruitment war—Second only to the cultural challenge is the clear dearth of professionals capable of analyzing and activating corporate data—a problem that will only be exacerbated in coming years as data growth continues at exploding rates. According to the 2012 Alcatel-Lucent study, a minority of upcoming college graduates have taken more than a few statistics courses, reflecting a deficit in qualified talent able to process volumes of data with business insight.

    Resourceful companies are facing the challenge head-on, engaging their recruitment forces for an all-out war and gunning for talent at a very young age. For example, SAS created Curriculum Pathways, a web-based tool for teaching data analytics to high school students, a group among which interest in careers in science, technology, engineering, or math (STEM) is woefully low in the United States. The course has been running for 12 years across 18,000 American schools. The company has also developed analytics courses with several universities to seed the next generation of data analysts.23 In the big data game, there's no such thing as starting too early in building a talent pipeline that can eventually be converted into hires.

  • Determine your risk quotient—There's a wide chasm between using data to automate decision making and using it merely to support decision making. In the Capgemini study, big data is used to support decision making 58 percent of the time on average, whereas it is applied to automate decisions 29 percent of the time.24 A company's approach to its own data philosophy will depend on its tolerance for the risk of making a poor decision versus the speed it loses by not making decisions automatically.

    For Citi, the risk assessment varies to properly reflect the costs of making a poor decision. A “false positive” (an automation error in which the system incorrectly rejects a loan based on preset parameters) can be corrected with a simple phone call to the consumer in question. However, with corporate clients, the stakes and risks increase considerably, thereby decreasing the tolerance for bad decisions. According to Michael Knorr, head of integration and data services at Citi, “Suppose that a ship cannot leave a port due to late payment, and suddenly all the bananas go rotten; from a commercial perspective, this involves a much higher risk because the amounts are much larger. The human element and review by somebody for larger amounts of money won't go away.”25 Companies must establish the risk parameters for various decision rules within the organization, automate decisions where possible, and oversee system-generated recommendations where necessary; one possible routing rule is sketched below.
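
    One way to encode such a risk quotient is a routing rule: decisions whose potential cost of error falls below a threshold are automated, while higher-stakes or lower-confidence decisions are queued for human review. The thresholds and fields below are hypothetical and would need to reflect an organization's actual risk appetite.

```python
# Illustrative risk-based routing of decisions: automate low-stakes calls,
# escalate high-stakes ones for human review. All thresholds are hypothetical.
from dataclasses import dataclass

AUTO_APPROVE_LIMIT = 25_000     # assumed: below this, the system decides alone
HUMAN_REVIEW_LIMIT = 1_000_000  # assumed: above this, a person must sign off

@dataclass
class CreditDecision:
    customer: str
    amount: float
    model_score: float  # 0.0 (high risk) .. 1.0 (low risk)

def route(decision: CreditDecision) -> str:
    """Return how this decision should be handled."""
    if decision.amount <= AUTO_APPROVE_LIMIT:
        return "automate"            # cheap to correct if wrong
    if decision.amount >= HUMAN_REVIEW_LIMIT or decision.model_score < 0.6:
        return "human review"        # costly errors warrant oversight
    return "automate with audit"     # system decides, sampled for review

print(route(CreditDecision("retail customer", 5_000, 0.8)))       # automate
print(route(CreditDecision("corporate client", 2_500_000, 0.9)))  # human review
```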

  • Structure processes to structure data—Just because the volumes of data being generated are increasingly unstructured in nature does not mean that processes should follow suit. In fact, the opposite is the case. The more unstructured the data, the more structured the workflows around it must be to maximize its value. A team of university researchers that has explored the risks, opportunities, and case studies in this space recommends using manual or automated processes to apply metadata—tags that allow unstructured data to be categorized or manipulated, thereby taking on more of the desirable characteristics associated with structured data. Once the business has defined how unstructured data should be tagged and used, the IT function can respond in kind with the right tools to implement those decisions.26 A minimal tagging sketch follows.
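
    In practice, the recommendation amounts to attaching metadata to otherwise unstructured content so that it can be filtered and queried like structured records. The sketch below uses invented tag rules and documents purely for illustration.

```python
# Minimal sketch of attaching metadata tags to unstructured documents so they
# can be filtered like structured records. Tagging rules are invented examples.
TAG_RULES = {
    "contract": ["agreement", "hereby", "party"],
    "invoice": ["invoice", "amount due", "payment terms"],
    "support": ["ticket", "issue", "complaint"],
}

def tag_document(text):
    """Return the set of metadata tags whose keywords appear in the text."""
    lowered = text.lower()
    return {tag for tag, words in TAG_RULES.items()
            if any(w in lowered for w in words)}

documents = [
    "This Agreement is made by and between Party A and Party B...",
    "Invoice #4471: amount due within 30 days per payment terms.",
]
catalog = [{"text": d, "tags": sorted(tag_document(d))} for d in documents]
for entry in catalog:
    print(entry["tags"], "->", entry["text"][:40])
```
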
  • Demystify storage and network costs—Technically storing and transporting the volumes of data created each day by employees and customers creates challenges that would make an IT veteran weep. For this reason, the vast majority of data created within an enterprise is not stored or used—largely because of a global deficit of storage capacity, what some call the case of “the leaky corporation.”27 Companies would be wise to take a page from traditional service providers—particularly video service providers—that have made big business out of efficiently transporting and storing lots of data.

    In the most basic of explanations, a network architecture is composed of three main components—the content asset in question, the cost of storing that asset in a data warehouse, and the cost of transporting it to a user who needs it. In general, the closer the stored asset is to the user, the lower the transport costs to deliver it and the better the response time afforded the user (termed latency, a concept covered in Chapter 2). Conversely, the data asset may be centralized in a large data warehouse located further from the end user, where the efficiency gains in storing the content more than make up for the higher costs of backhauling the traffic across a longer network route.

    This concept is important when considering video on demand, a popular service offered by many video providers. In a typical video on demand library, there are popular releases and niche content. Based on the basics of network architecture, it typically makes more economic sense to store the more popular fare closer to the consumer, because this will yield a better quality of experience at a lower overall cost. Niche content, on the other hand, is better stored in centralized data centers, where the savings in storage can more than compensate for higher delivery costs. Enterprises can adopt the same principles when considering their own data sets.

    More mission-critical or time-sensitive data, especially data that is accessed on a regular basis, should be prioritized and stored accordingly. Less popular data sets can tolerate a lower quality of experience in exchange for better storage rates, given how infrequently the data is needed. In either case, the IT unit would do well to demystify and communicate the actual storage and transport costs for each business unit, so that decisions can be made about how much and which data to expunge, relocate, or capture.28 A simplified version of this cost calculus is sketched below.
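
    The placement logic reduces to a comparison of estimated monthly cost per asset: storage cost at a tier plus expected accesses multiplied by that tier's per-delivery cost. The unit prices below are invented; the point is only that frequently accessed assets tend to win at the edge and rarely accessed assets in the centralized tier.

```python
# Illustrative placement decision: edge storage is pricier per GB but cheap to
# deliver from; central storage is cheap per GB but costlier to deliver from.
# All unit costs are hypothetical.
EDGE = {"storage_per_gb": 0.10, "delivery_per_gb": 0.01}
CENTRAL = {"storage_per_gb": 0.02, "delivery_per_gb": 0.08}

def monthly_cost(tier, size_gb, accesses_per_month):
    """Storage cost plus expected delivery cost for one asset in one tier."""
    return (tier["storage_per_gb"] * size_gb
            + tier["delivery_per_gb"] * size_gb * accesses_per_month)

def best_tier(size_gb, accesses_per_month):
    edge = monthly_cost(EDGE, size_gb, accesses_per_month)
    central = monthly_cost(CENTRAL, size_gb, accesses_per_month)
    return ("edge", round(edge, 2)) if edge < central else ("central", round(central, 2))

print(best_tier(size_gb=5, accesses_per_month=500))  # popular title -> edge
print(best_tier(size_gb=5, accesses_per_month=1))    # niche title -> central
```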

  • Understand context; establish governance—Even when data is accurately predictive in modeling phenomena, it still may lead to faulty conclusions if context is not properly understood. In the case of Google Flu Trends, a team of medical experts compared data from the service with data from two surveillance networks and found that, although it did a very good job at predicting nonspecific respiratory illnesses that closely resemble the flu, it did not predict the actual flu very well. According to one of the researchers, “this year, up to 40 percent of people with pandemic flu did not have ‘influenza-like illness’ because they did not have a fever.... Influenza-like illness is neither sensitive nor specific for influenza virus activity—it's a good proxy, it's a very useful public-health surveillance system, but it is not as accurate as actual nationwide specimens positive for influenza virus.” In response, Google addressed its critics by reiterating that a person searching symptoms would have no way of knowing if he or she was the victim of flu, although “it's just as important to monitor symptoms as it is to monitor specific information.”29 Both sides of the argument are correct. It's a matter of understanding the context of the data being analyzed before drawing conclusions about its meaning.
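
    Sensitivity and specificity, the two measures invoked above, are straightforward to compute once a proxy signal (such as influenza-like illness) is compared against a gold standard (laboratory-confirmed flu). The counts below are fabricated purely to show the arithmetic.

```python
# Sensitivity/specificity of a proxy indicator vs. a gold standard
# (counts are fabricated for illustration).
true_positives = 60    # proxy flagged flu, lab confirmed flu
false_negatives = 40   # proxy missed flu the lab confirmed (e.g., no fever)
true_negatives = 800   # proxy negative, lab negative
false_positives = 100  # proxy flagged flu, but lab found another illness

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
# A proxy can be a useful surveillance signal while still falling well short
# of a laboratory-confirmed measure on both counts.
```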

Big data is often accompanied by a bigger debate—that of privacy for those being studied. In 2011, Alcatel-Lucent asked more than 5,000 U.S. consumers about their definition of privacy in today's networked age. When asked to choose which statement came closest to their view: (a) Privacy is the right to be left alone; or (b) Privacy is the right to control and manage what information about oneself is available to others, an overwhelming 78 percent of respondents selected the latter option. Boundaries are being redefined and contracts are in flux between customers and companies, and between employers and employees. Making use of data without compromising ethical standards or violating a coveted position of trust with key stakeholders will be central to companies attempting to monetize their data assets. Once governance in how an enterprise plans on using its own data is established, there is also the matter of determining if sharing such information with others may provide a buoy to the broader market collective. The more variables shared between firms, the more interesting the insights that may be derived. In response to this unfulfilled need, the United Nations' Global Pulse is championing its concept of “data philanthropy,” whereby “corporations [would] take the initiative to anonymize (strip out all personal information) their data sets and provide this data to social innovators to mine the data for insights, patterns and trends in realtime or near realtime.”30 Firms may be slow to embrace such a program, being preoccupied initially with managing their own data flood. Still, the United Nations' proactive and visionary approach will be there for companies, when and if their data needs mature to such a point.
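
The first step of the data philanthropy Global Pulse describes is stripping or irreversibly transforming personal identifiers before a data set leaves the company. A minimal sketch of that step, with invented field names, follows; real anonymization also has to guard against re-identification, which this does not address.

```python
# Minimal sketch of stripping/obscuring personal identifiers before sharing.
# Field names are invented; real anonymization must also consider
# re-identification risk (quasi-identifiers, small cells, etc.).
import hashlib

PERSONAL_FIELDS = {"name", "email", "phone", "street_address"}
PSEUDONYMIZE_FIELDS = {"customer_id"}  # keep linkability without identity

def anonymize(record, salt="rotate-me"):
    out = {}
    for field, value in record.items():
        if field in PERSONAL_FIELDS:
            continue  # drop direct identifiers entirely
        if field in PSEUDONYMIZE_FIELDS:
            out[field] = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
        else:
            out[field] = value
    return out

record = {"customer_id": 1017, "name": "J. Doe", "email": "jd@example.com",
          "region": "Northeast", "monthly_spend": 212.40}
print(anonymize(record))
```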

There is no shortage of data. Any organization—large or small—has within it a digital footprint capable of revealing the path to better and faster decisions. What is in short supply, however, is the ingenuity in considering how the data might be used differently, experimentation in measuring if hypotheses are warranted, and a perseverance to drive cultural change where information is revered, not feared. But, in perhaps the best example of how these forces can create the perfect environment for aspiring companies, look no further than Billy Beane, general manager of the baseball franchise Oakland A's and the subject of the Hollywood movie Moneyball. Beane and team revolutionized the way big data was used in baseball, arguably one of the most data-intensive sports. Rather than relying solely on the batting average or earned run average as the critical criteria in assessing a player's performance, Beane began looking at different metrics to predict the efficiency of the player, including base runs, on-base plus slugging percentage, and fielding independent pitching.31 These variables were available to any of Beane's competitors, although they remained underutilized until he identified the meaningful pattern that made such data an actionable precursor to a team's success. As the amount of data produced and consumed by private enterprises and their customers continues unabated, the opportunity for the next Beane to generate the next “moneyball” remains the territory of the most resourceful and data-driven companies.
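
Of the metrics Beane's staff leaned on, on-base plus slugging (OPS) is the easiest to reproduce: it is on-base percentage added to slugging percentage. The stat line below is invented, but the formulas are the standard definitions.

```python
# On-base plus slugging (OPS) from a toy stat line. Formulas are the standard
# definitions; the numbers are invented for illustration.
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base / plate appearances counted."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(h, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# Hypothetical season: 500 AB, 150 H (30 2B, 3 3B, 20 HR), 70 BB, 5 HBP, 4 SF
player_obp = obp(h=150, bb=70, hbp=5, ab=500, sf=4)
player_slg = slg(h=150, doubles=30, triples=3, hr=20, ab=500)
print(f"OBP={player_obp:.3f}  SLG={player_slg:.3f}  OPS={player_obp + player_slg:.3f}")
```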

ATTACK OF THE MACHINES

Since the beginning of automation, early prognosticators have warned of a cataclysmic outcome in which machines cannibalize the jobs of humans. In many cases, the foreboding was justified—consider the steep decline in manufacturing jobs with the advent of assembly automation. Yet, these jobs have typically been the domain of highly repetitive, mundane tasks, wherein robotic machinery provides speed and accuracy while freeing the human for higher-order cognitive functions. After all, computers were the manufactured products of human beings, giving the creators the benefit of the doubt in being considered the smarter “race.” Although this is a straightforward argument, it doesn't necessarily hold true. Andrew McAfee, principal research scientist at MIT and author of Race Against the Machine, looked at a group of 136 man-versus-machine studies and found humans on the winning side just eight times. The irony is that this may be the best outcome humans can hope for, given that the tests were run in an era preceding big data. In all likelihood, the reason the machines didn't do even better is that they didn't have enough data—a problem being rectified every day as more data is added to the equation. According to McAfee, “I kind of see our robot overlords and computer overlords getting smarter and smarter.”

Rather than retreating to a corner in a defensive crouch, McAfee recommends that enterprises and employees embrace the challenge of racing with the machines humans have created. And, for now, there are still problems that are better solved by human beings (such as scientific research that depends on a complex understanding of human proteins).32 Although big data may present a big threat to the average knowledge worker, it is more likely a case in which the skills of the American worker will evolve—those with a strong analytical foundation who are able to link data to more fundamental business implications will be in high demand. Increasingly, they will be joined by Watson and its ilk, more than capable of crunching volumes of structured and unstructured data in such a way that a better decision—whether on the part of a human or machine—is more likely to be made.
