CHAPTER 2
More Data . . . More Data . . . Big Data!

This chapter discusses what readers need to know about the important trend of big data if their organizations are to leverage big data to support operational analytics. Organizations have always collected more and more data as the years pass, but the pace has accelerated in recent years. It's not just that data sources are getting bigger, either. Today, data often comes in new formats and contains information that requires different analysis methods. Big data is the label that has been applied to this trend, which leads to the challenges of more data, from more sources, in different formats.

An organization must keep in perspective a number of concepts when starting to consider big data and how it will affect the organization's analytics processes. This chapter discusses a variety of the hype points surrounding big data that organizations sometimes get caught up in and some ways to prepare for big data and keep it in perspective. Big data isn't as scary as it first may seem. Understanding how big data fits into the picture is necessary in order to incorporate it successfully into operational analytics.

Cutting through the Hype

There is no doubt that a massive amount of hype has been built around big data. Organizations must cut through that hype and focus on what is really important. This section covers several concepts that help to do that. The content in this section is not meant in any way to diminish the importance or value of big data but rather to bring it back into the realm of reality. Developing realistic expectations should be the first step in the process of working with big data.

What's the Definition of Big Data? Who Cares!

One of the first questions I am often asked when I meet with a customer is “How do you define big data, Bill?” People seem preoccupied with defining big data.1 To see this firsthand, visit some of the LinkedIn groups devoted to big data. In each group, the question of how to define big data has come up, in some form or another, repeatedly over the past few years. One discussion thread I was involved in had dozens, if not hundreds, of responses to the question “What's the definition of big data?” That is extreme in a forum where a post is usually lucky to get a couple of responses. As the discussion went on, people were trying to outdo each other with one more nuance that may or may not fit into the definition of big data. It seemed silly and overly academic to me.

People are much too concerned about defining big data. In fact, I always like to propose what may be the shortest definition of big data anywhere. My preferred definition is a contrarian one of only two words, but I believe it to be the most relevant: “Who cares!” That may sound extreme at first. Why in the world would I say that? Let me explain.

If an organization's main concern is solving a business problem by implementing new operational analytics, it doesn't need to worry about the definition of big data. Here's why. The process that should be followed, and that organizations should have been following over the years, is simple. When you have a problem to solve, you should look around and ask this question: “What data, if collected, organized, and used within an analytics process, would improve the answers that we are able to generate to address our problem?” Once the necessary data is identified, at that point it is necessary to figure out how to collect, organize, and incorporate it into the analysis. But here's the key point. That first question of “Does this data have value for my business?” has absolutely nothing to do with the definition of the data. It could be big data, small data, or a bunch of spreadsheets.

By the time an organization realizes that it must make use of something that resembles big data, it is too late to worry about definitions; the data is needed. Perhaps the data is not well structured and there is a lot of it. It might just fit the famous “Volume, Variety, Velocity” framework that industry analyst firm Gartner helped to coin.2 Knowing that the data fits the Three Vs framework doesn't help, because once the data is needed there is no choice but to figure out how to make use of it; the fact that it may be big data is really irrelevant. I also always like to propose that the most important, but often overlooked, V related to big data is Value.3 The only reason to worry about the other characteristics is that it is believed there is value in the data and that it is worth the effort to collect and analyze it.

Don't misunderstand what I am saying. If an organization is dealing with data that fits the typical definitions of big data, then that will certainly influence the tools and techniques the organization must use to incorporate big data into its analytics processes. The important distinction here is that the choice of tools and techniques is a tactical implementation issue. The strategic question initially is simply “Is the information this data contains important?” Once that question is answered, an organization must do what it takes to put the data to work.

Don't get overburdened trying to understand what qualifies as big data and what doesn't. Just worry about incorporating the important data sources you've identified into your organization's analytics processes.

Start from the Right Perspective

The preceding topic implies that it is important to start from the right perspective. An organization can't start collecting data and storing it with hopes that one day a use for the data will be found. As Figure 2.1 illustrates, organizations should start with a business problem first and then let that business problem lead to the right data. Make the effort and incur the costs to acquire and use a data source once there is a reason to do so. In the world of big data, it is very easy to become overwhelmed by collecting every piece of data that can be found and worrying about how to drive value with it later. An organization can get so busy collecting data that it never gets around to doing anything with it.


Figure 2.1 Start from the Right Perspective

Starting with a business problem instead of the data sounds obvious, but I have seen many otherwise very smart, very careful organizations totally abandon this principle when it comes to big data. At first I was very much puzzled by this trend, but then I realized what is going on. There is such hype around big data as I write this in early 2014 that no one wants to be left out. Every board of directors is asking every chief executive officer, “What are you doing with big data?” Every CEO is asking every chief information officer and chief marketing officer and chief financial officer, “What are you doing with big data?” And each of those executives then asks his or her respective team, “What are you doing with big data?”

The only answer nobody wants to give is “Nothing yet” or “We are planning to do something but we're first going through the diligence of figuring out how to do it right.” Because of the hype, those are not acceptable answers. As a result, organizations are rushing headlong into big data. In some cases, organizations are starting very large, expensive big data initiatives without having a solid plan for how to make use of the investments. They're simply buying a bunch of storage and collecting a bunch of data and hoping that they'll figure it out as they go.

Here's the biggest problem with that approach: It gets you past this year's conversation just fine. You'll get the pat on the back for being on top of the big data trend and for “doing something.” However, what's going to happen 12 or 18 months down the road when the same person comes back and asks, “I see you applied a lot of resources to that big data project. What do we have to show for it?” If you didn't know up front what you were going to do with the data, you're probably going to have a hard time showing fast value on the back end. I'd hate to be the person who has to respond “Well, we jumped into big data aggressively as requested, but as yet we have nothing to show for it.”

Make sure your organization is disciplined as it gets into big data. Take a little extra time to start with a real business issue and develop a plan. Identify some specific analytics that can be built with the data. It won't take much extra time, but it will make the probability of success much higher. Don't get pressured by all the hype to abandon basic principles.

Is There a Big Data Bubble?

Amid all the hype around big data, the question often arises as to whether there's a big data bubble.4 Industry analyst firm Gartner put forth an official opinion in January 2013 that claimed big data was past the peak of the hype cycle and heading for the trough of disillusionment.5 A journalist called me after reading the Gartner article and asked if I thought big data was heading for a fall and a bubble was about to burst. I thought about the question and gave an answer that at first will seem self-contradictory but will make sense after I explain it. My answer was that in some ways, yes, there is a big data bubble. But in even more important ways, no, there's not. These views are summarized in the text and in Table 2.1.

Table 2.1 Is There a Big Data Bubble?

In These Ways, Yes:
  • Unrealistic expectations
  • Belief in easy buttons
  • Money thrown at companies in the space

In These Ways, No:
  • New information always adds power to analytics
  • Big data does yield value with effort
  • Real success stories exist

I do believe there's a big data bubble that's going to burst from one perspective. The problem stems from unrealistic expectations in the marketplace. Many people seem to believe that they can get into big data cheaply and easily and that there's an “auto-magical” button that they can press to get all of the answers to their questions delivered. That's always been a ridiculous assumption for any analytics endeavor. It's still ridiculous in the world of big data.

There is no easy button for big data! It will take time and effort to build analytics processes with big data just as it always has with any type of data. It likely will take even more time initially since big data is new. There will certainly be some very visible big data failures in the marketplace as a result of those wrong assumptions. I have seen some failures already starting to happen. To the extent that those initial failures help burst the hype bubble of unrealistic expectations, they will be good for everyone. This is because it is absolutely possible to succeed with big data and to make it operational. However, organizations must get into big data with realistic expectations in terms of cost, timing, and effort.

Now let's turn our attention to the ways in which there is not a big data bubble about to burst. Often people think that a bubble bursting means that an underlying premise was bogus to begin with. You can be sure that big data is not a bogus premise. Big data is going to have a very large impact on our future. I'll use an analogy to demonstrate why.

Think back to the Internet bubble in 1999 and 2000. There was a huge bubble for Internet companies, and a lot of people lost a lot of money. But there's an important point to understand. Go back and find news stories from late 1999 or 2000 at the very peak of the Internet hype. Then look at what the articles claimed regarding how the Internet would change our personal lives and how we do business. I'm confident that you'll find that the Internet has already exceeded even the wildest dreams of that era.

You see, the Internet bubble had nothing to do with the Internet being bogus or not holding all the promise (and more) that was being hyped at the time. Rather, the Internet bubble was about people thinking it would be too cheap, too fast, and too easy to realize those benefits. During the Internet bubble, a company could get funding as long as the founders threw an “i” or an “e” in front of the company's name. This sounds a lot like big data today to me. If I had started a company in 2013 and claimed that it was a cloud-based, big data, machine learning, analytics-as-a-service company, I probably would have rounded up some cash pretty quickly.

There will be market consolidation and there will be business failures in the big data space in the next few years. There will also be disillusionment as companies that dove in too quickly without realistic expectations realize their error. However, five to ten years down the road, big data will have had all of the impacts it has been purported to enable and much more. The impact from operational analytics based on big data is going to exceed anything being discussed today. Despite the cautions at the beginning of this section, organizations should not sit on the sidelines with big data. In fact, your organization absolutely must get into big data. Just do it intelligently and rationally.

Preparing for Big Data

Once an organization has set realistic expectations about big data, how does it prepare? What are some of the most important concepts to consider when developing a big data strategy? This section focuses on themes to help an organization prepare for big data after moving past the hype.

The Big Data Tidal Wave Is Here

There is no doubt that a tidal wave of data has come our way and that every organization is going to have to tame the wave in order to succeed. This was the theme of my book Taming the Big Data Tidal Wave.6 The reason I chose that title for the book is that the sea is a very good analogy for data. Imagine waves crashing on the shore. If you sit on an inner tube right where the wave crashes, you'll learn that even a wave not much above your waist has the power to flip you over backward. If you start sitting under really big waves, you can be injured by letting them crash upon you. So it is with data. Data, as it gains in volume, can become overwhelming and hard to handle. If you just let the wave of data hit you, it will simply knock you around and you won't accomplish anything with it.

What you must do is to figure out how to ride the wave, whether it is a wave in the ocean or a wave of data. When it comes to surfing in the ocean, we have surfboards. For someone who doesn't know anything about surfing, it is easy to think that a surfboard is a surfboard and that surfing is surfing. But that's not true. Visit a surf shop and look around. There are many different types of surfboards. There are long boards and short boards. There are different shapes. Some have fins and some don't. The reason surfers choose one board over the others has to do with what kind of wave they will ride, how skilled they are, and if they want to go for speed or want to do tricks.

Similarly, when it comes to data and analytics, outsiders often assume that all that is required is to just grab data, store it, and then analyze it with a tool. But anyone who understands analytics realizes there are many different types of tools and many different types of platforms that can give access to the data and allow it to be analyzed. Big data can certainly necessitate adding a few new tools into the mix, just as a surfer might need to add multiple boards over time. Just as there are more similarities in how to use two different surfboards than there are differences, so there are more similarities than differences when it comes to using different analytics tools and platforms for different types of data and analytics.

When an organization gets to the point of adding tools for big data, it will need people who know how to use the tools. If you give me the best surfboard to surf the best waves, I'll fall right off because I don't know how to surf. Expert surfers, however, will be fine if they are given a new surfboard on a new beach with a different size and shape of wave than they are used to. They may be a little wobbly at first, but within a few hours they'll be up and surfing just as strongly as they ever have. That's because the new board on the new beach for the new wave is an incremental change, not a quantum leap that can't be overcome. Similarly, expert analytics professionals already have the underlying skills to handle big data and simply need to tune their skills slightly for the new data and analysis requirements. Just as a surfer can adapt to any board on any beach, analytics professionals can adapt to any type of analysis for any type of data.

New Information Is What Makes Big Data So Powerful

What is it that makes big data so powerful and exciting? Why have I predicted that big data will have huge impacts? It is because of the new information that big data can provide.7 Big data sources often provide information to an organization that is novel in one or both of two dimensions. First, big data is often at a level of detail not seen before. Second, big data also often provides information that was not available before.

Let's consider how automobile manufacturers now use big data for predictive maintenance purposes. For many years, as cars broke down, an auto manufacturer would do its best to figure out why the cars had broken down and then work back to what may have caused the problem. Today, embedded sensors are providing intensive data during the development and testing of engines as well as from engines that have been sold once the car is released. Leveraging the sensor data, auto manufacturers can now often identify troublesome patterns before the damage is done and a car breaks down. This is called predictive maintenance.

With engine sensor data, it is now possible to identify early warnings of trouble. Does a certain part heat up before a failure? Does the battery lose a bit of voltage prior to a common electrical problem? Do some parts break in pairs or in sets rather than individually? The answers to these questions would never have been known before since there was no data available to provide the answers. Today the data is available, and it is being analyzed in detail.
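
To make this concrete, here is a minimal sketch, in Python, of the kind of check that engine sensor data makes possible. The readings, field names, and the 24-hour window are hypothetical illustrations for this book, not any manufacturer's actual process.

    from statistics import mean

    # Hypothetical readings: (vehicle_id, hours_before_failure, part_temp_c).
    # Vehicles that never experienced a failure carry hours_before_failure = None.
    readings = [
        ("car_001", 5, 118.0), ("car_001", 50, 96.0),
        ("car_002", 3, 121.5), ("car_002", 80, 95.5),
        ("car_003", None, 94.0), ("car_004", None, 97.0),
    ]

    # Compare part temperature shortly before a failure with the normal baseline.
    pre_failure = [temp for _, hrs, temp in readings if hrs is not None and hrs <= 24]
    baseline = [temp for _, hrs, temp in readings if hrs is None]

    lift = mean(pre_failure) - mean(baseline)
    print(f"Average temperature lift in the 24 hours before failure: {lift:.1f} C")

    # A lift well beyond normal variation suggests an early-warning signal worth
    # testing in a predictive maintenance model.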

The power of the sensor data in this case isn't just that it is more data. It is that the data contains entirely new information not available previously. If a problem is predicted before it happens, there is often time to get the issue fixed proactively before a breakdown occurs. This can result in higher customer satisfaction and lower warranty costs since cars spend less time in the shop and it is usually cheaper to avoid a problem than to perform repairs after a problem has occurred.

Traditionally, analytics professionals spent a lot of time fine-tuning existing models using a given set of data sources. Over time, analytics professionals try to incorporate the newest and latest modeling methodologies and to add new metrics derived from the data. This leads to incremental gains in the power of the models, and those efforts are worthwhile.

It is possible, however, to greatly increase the power of a given analytics process with one simple change. Organizations should deviate from the traditional tuning approach as soon as new information relevant to a problem is found. New information can be so powerful that once it is found, analytics professionals should stop worrying about improving existing models with existing information. Instead, they should focus immediately on incorporating and testing that new information.

Even a fairly simplistic use of new information can have impacts on the performance of an analytics process that go far beyond what's possible by tuning the process using existing information. Incorporate new information into a process as soon as possible, even if it can be done only roughly at first. After that is done, then return to tuning and improving the analytics incrementally. New information will beat new algorithms and new metrics based on existing information almost every time.

Seek New Questions to Ask

As an organization changes the breadth of data and tools it is using, it must make a point to look for new questions to ask as well as new ways to ask old questions. Often, when people find a new data source, they immediately think about how it can add additional power to existing solutions of old problems. But there are two other angles that need to be considered, as shown in Figure 2.2.

  • Add additional value to existing analytics processes
  • Identify new ways to solve existing problems
  • Identify entirely different problems to solve

Figure 2.2 Three Ways to Drive Value with Big Data

First, look for entirely new and different questions that can be addressed with the new information. This is a seemingly obvious suggestion, but it is easy for people to get into a rut and simply apply the data to the usual questions. An organization must put emphasis on looking for new opportunities with data. Second, look for new and better ways to address old questions. Do this by examining problems considered solved and thinking about whether they could be approached from a completely different direction through the incorporation of the new data. It just might improve the power of the insights generated.8 One helpful framework for pursuing these activities in the context of customer data is the concept of a dynamic customer strategy, as proposed by Jeff Tanner in the book Dynamic Customer Strategy: Big Profits from Big Data.9 That book can be a further reference for readers interested in the topic.

Asking new questions is a straightforward concept, so let's focus on an example of revisiting old questions in new ways using big data. In the healthcare industry, clinical trials are the gold standard. A clinical trial has the ultimate test and control structure through what is called a double-blind methodology. In a double-blind clinical trial, neither the patients nor the doctors know who's getting what treatments. It's a tightly controlled atmosphere, and it makes it possible to very precisely pinpoint the positive and negative effects of the treatment or drug being tested. However, after hundreds of millions of dollars and years of effort, a clinical trial will have 2,000 to 3,000 participants if it is lucky. That's not a large sample. It means that while a clinical trial can very precisely measure the things researchers know they want to measure up front, there's not enough data to test for a broad range of unexpected impacts.

What does this lack of sample lead to? Situations like those that occurred a few years ago, when multiple drugs from a class of painkillers known as COX-2 inhibitors, which includes Vioxx and Celebrex, ran into trouble. Researchers found that these drugs were associated with heart trouble at two to four times the normal rate.10 The issue wasn't identified in the original clinical trials, and it took several years after the products went on the market before the problem was identified.

Let's flash forward to today. Can we enhance clinical trials with big data even outside a controlled environment? In the near future, detailed electronic medical records will be the norm. Once a drug is released, it will be possible to monitor trends within the thousands, hundreds of thousands, or millions of people who start using that drug. It will be possible to analyze every combination of ailment that people have as they use the drug as well as every combination of other drugs and treatments that are taken alongside the drug. There will be people using the drug for things it wasn't supposed to be used for and with other drugs it wasn't supposed to be used with. These are specifically the things that would not be assessed in a clinical trial.

Using electronic medical histories, it will be possible to mine for unexpected positive and negative effects of a drug (while protecting patient privacy, of course). Granted, the data won't be from a fully controlled environment like a clinical trial. However, might it be possible to identify that something is happening, like the heart issues with Vioxx, much, much earlier? Further controlled studies may be required to validate the findings from the medical records, but researchers will know where to look much faster. The point is not that uncontrolled medical data will ever replace clinical trials; it is that researchers' ability to identify unexpected positive and negative effects of new drugs and treatments can rise immensely through the use of the uncontrolled data. All that is required is thinking about how to solve problems differently . . . even if they are already considered solved today.
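
For readers who like to see the mechanics, here is a minimal sketch of the kind of early screening just described, using hypothetical, de-identified records. It only illustrates comparing event rates between groups; it is not a clinical methodology.

    # Hypothetical, de-identified patient records for illustration only.
    patients = [
        {"on_drug": True,  "heart_event": True},
        {"on_drug": True,  "heart_event": True},
        {"on_drug": True,  "heart_event": False},
        {"on_drug": True,  "heart_event": False},
        {"on_drug": False, "heart_event": True},
        {"on_drug": False, "heart_event": False},
        {"on_drug": False, "heart_event": False},
        {"on_drug": False, "heart_event": False},
    ]

    def event_rate(group):
        """Share of a patient group that experienced the heart event."""
        return sum(p["heart_event"] for p in group) / len(group)

    drug_rate = event_rate([p for p in patients if p["on_drug"]])
    other_rate = event_rate([p for p in patients if not p["on_drug"]])

    # A rate that is far higher among users of the drug flags something worth a
    # controlled follow-up study; it does not prove causation on its own.
    print(f"Heart event rate is {drug_rate / other_rate:.1f}x the comparison group")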

Data Retention Is No Longer a Binary Decision

Big data necessitates a change to policies related to what data organizations collect, how they store it, and how long they store it. Until recently, it was too expensive to waste resources on anything but the most critical data. If data was important enough to collect, it was important enough to keep for a very long time, if not forever. With many big data sources, we must move away from the binary decision of “to collect or not to collect” and away from the default of keeping whatever is collected forever. A multiple-tier decision is necessary.

First, is it necessary to collect any part of a data source or not? Second, how much of the source should be collected and for how long should it be kept? Only a small portion of a big data source might be captured, and that portion may be stored for only a short time before deletion. Determining the right approach requires assessing the value of the data both today and over time.

To illustrate data that isn't worth collecting, imagine that you have a highly connected house with sensors all throughout. Every room has its own thermostat constantly sending the current temperature back to the central system so that each room's temperature can be kept stable. The thermostats will generate data continuously as they communicate with the central system, but is there any value in that data? The data has value for a very specific tactical purpose, but it's hard to imagine why it would be worth capturing the data for the long term. Millisecond-level temperatures just don't matter beyond the basic purpose of updating the system. That's okay. If a power company, for example, tried to store all of this detailed data from all of the homes and buildings under its purview, its storage capacity would be overwhelmed while nothing of value was provided.

It is also possible to perform analytics to reduce the data. Data reduction is the process of identifying fields of data that can be either ignored or combined so that there are fewer metrics to work with but little information is lost. For example, it may be found that adjacent rooms in your house always stay within half a degree of each other. Instead of storing readings for each room, just store one of the room's readings and associate it with a zone of the house rather than a specific room. That will cut down on data storage requirements without degrading the quality of information available for analytics.
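
Here is a minimal sketch of that zone-based data reduction idea. The rooms, zones, and readings are hypothetical; the point is simply that one stored value can stand in for several highly correlated ones.

    # Hypothetical mapping of rooms to zones, based on the observation that
    # rooms in the same zone track each other within half a degree.
    room_to_zone = {
        "kitchen": "downstairs", "living_room": "downstairs",
        "bedroom_1": "upstairs", "bedroom_2": "upstairs",
    }

    readings = [
        {"room": "kitchen", "temp": 21.4},
        {"room": "living_room", "temp": 21.2},
        {"room": "bedroom_1", "temp": 19.8},
        {"room": "bedroom_2", "temp": 20.1},
    ]

    # Keep one representative reading per zone instead of one per room.
    zone_readings = {}
    for r in readings:
        zone = room_to_zone[r["room"]]
        zone_readings.setdefault(zone, r["temp"])  # first room seen represents the zone

    print(zone_readings)  # {'downstairs': 21.4, 'upstairs': 19.8}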

Let's look at a scenario where data is critical for only a period of time. Railroads have sensors along the tracks monitoring the speed of trains as they go by. What I didn't know until recently is that the wheel temperatures of the train cars are also monitored. If the load in a car gets unbalanced so that there's more weight on one side than the other, it will cause the car to start to lean. That lean puts more weight on one side, which adds friction, which will heat up the wheels. If wheels start heating beyond a certain point, it's an indicator of a serious imbalance and a potential derailment. The railroads monitor wheels in real time as a train moves along the track. If a set of wheels heats beyond guidelines, the train will be stopped and someone will be sent to inspect and fix the load. This saves the railroad a lot of money in the long run because a derailment is a catastrophic and costly, if not deadly, event.

Let's turn attention to the data collected on wheel temperature and the time frame in which it's important. Consider a long train traveling 2,000 miles over a period of many days. At regular intervals, perhaps every 30 seconds, another measure of each wheel's temperature is taken. It is critically important to collect that data and analyze it right away to ensure that nothing's going wrong.

Now fast-forward a few weeks. The train experienced no issues and arrived safely. All wheel readings are within a half degree of the expected temperature. There's really no point in keeping the readings at that point. It might make sense to keep a sampling of trips where everything was fine against which exceptions can be compared. The data surrounding the trips where there was a wheel temperature problem can be kept virtually forever along with a small sample of uneventful trips. The rest of the data doesn't add further value.
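
A tiered retention policy like the one just described could be expressed very simply. The sketch below is illustrative only; the field name and the 2 percent sampling rate are assumptions, not any railroad's actual rule.

    import random

    def retention_decision(trip, sample_rate=0.02):
        """Illustrative retention rule for the wheel-temperature scenario.

        Keep every trip that triggered a temperature exception, keep a small
        random sample of uneventful trips as a comparison baseline, and let
        the rest be deleted.
        """
        if trip["had_wheel_exception"]:
            return "retain long term"
        if random.random() < sample_rate:
            return "retain as baseline sample"
        return "delete after trip review"

    trips = [
        {"trip_id": "T1001", "had_wheel_exception": False},
        {"trip_id": "T1002", "had_wheel_exception": True},
    ]
    for t in trips:
        print(t["trip_id"], retention_decision(t))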

Of course, there is still data that will make sense to keep for a long time. Banks or brokerages have a relationship with customers that can last years or decades. These organizations will want to keep records of every deposit that each customer makes and every e-mail exchanged with each customer. That will help provide better service over time and provide legal protection as well. In this case, data that gets collected is still kept virtually forever just as was done traditionally.

The key takeaway is that organizations are going to have to get used to assessing the collection, storage, and retention of data in a different fashion. It's uncomfortable at first to think of letting data slip past and intentionally deleting data that is captured. It is necessary in today's big data era, however.

The Internet of Things Is Coming

The concept of the Internet of Things (IOT) has been getting steadily more attention in 2013 and early 2014. The IOT refers to all of the “things” that will be online and communicating both with each other and with us. As sensors and communication technologies become cheaper and cheaper, more and more items will have the capability to assess surroundings and report information. We already see mundane items like refrigerators and clocks being connected to the Internet and regularly sending and receiving information.

The IOT has the potential to drive absolutely massive amounts of data. It may even outpace all of the other sources of big data. The interesting thing about much of the data generated by the IOT is that it is often very tactical. Any given communication is very short and may contain only simplistic information. For example, a clock may receive a time update from a trusted external source and then pass that information on to other clocks within a house over a home network. In aggregate, this produces a large amount of data, but much of it has very low, very tactical, very short-term value.

Many of the examples outlined in this book could be considered a part of the IOT. Once sensor data is involved, it is usually fair to assume the realm of the IOT has been entered. Both businesses and consumers will benefit from all of these devices talking to each other. As more and more of your possessions are able to communicate, new possibilities open up:

  • Your home will learn your preferences for lighting, heating, and more and then automatically make adjustments for you.
  • Items like light bulbs and air fresheners will be able to warn you when they will soon need replacing.
  • Grocery lists will be created automatically based on what you've consumed and what has passed its expiration date.
  • Video and audio content will follow you seamlessly from room to room, removing the need for you to turn anything on or off.
  • Sensors on or near your body will monitor and report sleep patterns, calorie usage, body temperature, and all sorts of other facts.

While the IOT may drive some of the biggest volumes of data, it will likely be filtered much more aggressively than most data. In fact, what we decide to keep may be fairly manageable. We'll let all of our things communicate freely on an ongoing basis and only capture critical pieces of those communications. We discuss this concept more in Chapter 6.

Soon the IOT will become a very hot and popular topic. It is impossible to do the topic justice with just this small introduction, but the topic can't be ignored. Similar to the big data phenomenon, books and articles on the IOT will soon abound. Readers who have interest should carefully monitor the progress of this trend. As many of the examples in this book illustrate, a lot of operational analytics will be driven by data that is sourced from all of the things around us. The IOT will become a component of virtually every organization's analytics strategies.

Putting Big Data in Context

How does big data fit? Why is it special? Where will big data go from here? Questions like these are common and arise in most organizations. As with anything that is relatively new, there is confusion and disagreement on what big data is all about. This section explores themes and concepts that must be understood to put big data in the correct context. Placing big data within the correct context will make it much easier to succeed when applying it to operational analytics.

It's Not So Much Big Data as It Is Different Data

As we discussed earlier in the chapter, what makes big data exciting is the new information it contains. As we also discussed earlier, many people think that what makes big data challenging is simply the fact that the volume of the data is so large. Volume is not really what makes many sources of big data stand out. What often is most challenging about big data is that the new information it contains is found within a different type or format of data and can require different analysis methodologies.

Most data historically collected for analysis in the business world was transactional or descriptive in nature and was well structured. This means that information was clearly identified and easy to read. For example, a column labeled Sales in a spreadsheet would contain dollar values. The less structured data that organizations had, such as written documents or images, was not considered for analysis purposes. With big data, organizations now come across new types and formats of data, many of which are not structured like traditional sources. Sensors spit out information in special formats. GPS data describes where people and things are in space. Information on the strength of the relationships between people or organizations is often what is desired. These are fundamentally different types of data both in terms of format and in terms of how the data must be analyzed. We talk about the different types of analysis in Chapter 7.

Analyzing a social network and assessing the number and strength of connections between people requires entirely different methodologies from predicting sales, for example. This “differentness” of big data can actually be a much bigger challenge than the “bigness” of the data. Why can it be challenging? Let's look at an example.

Consider an organization looking to do text analysis for the first time. Even to analyze just a few thousand e-mails, it is necessary to acquire a text analysis tool, to set up and configure the tool, and to define the text analysis logic that the organization would like applied. It requires just as much time and effort to initially create a text analysis process to handle 10,000 e-mails as it does to create a process to handle 10 million or 100 million e-mails. The same logic just has to scale as more e-mails are processed. Because text is a different type of data, it is necessary to go through a lot of preliminary setup work to get started, even for a very small volume of text data.

Of course, when the defined text analysis process is executed, 10,000 e-mails will process more quickly than 100 million. While it is necessary to scale the process as more volume is added, the underlying analytics logic is the same. Figuring out how to handle the differentness of a source of big data is often step one. Once the differentness is handled, then it is possible to move on to figuring out how to handle the differentness at scale.
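
A minimal sketch illustrates the point. The complaint keywords and e-mails below are hypothetical, and the flagging logic is deliberately simplistic; what matters is that the same logic applies whether it runs over 10 e-mails or 100 million.

    # Hypothetical keywords and e-mails used only to illustrate the point.
    COMPLAINT_TERMS = {"refund", "broken", "cancel", "disappointed"}

    def flag_complaint(email_text: str) -> bool:
        """Return True if the e-mail contains any complaint-related term."""
        words = set(email_text.lower().split())
        return bool(words & COMPLAINT_TERMS)

    emails = [
        "I am disappointed and want a refund",
        "Thanks for the quick delivery!",
    ]

    # The same function would simply be mapped over a far larger collection,
    # for example on a distributed platform, without changing the analysis logic.
    print([flag_complaint(e) for e in emails])  # [True, False]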

Big Data Must Be Scaled across Multiple Dimensions

The big data challenge that gets the most attention is the problem of scale. Specifically, the usual focus is the amount of data and the amount of processing required. However, other dimensions of scale, as illustrated in Figures 2.3 and 2.4, are also required if an organization is to implement analytics at an enterprise level, and especially if it will make those analytics operational.


Figure 2.3 Scaling Big Data: Typical Focus Dimensions


Figure 2.4 Scaling Big Data: Necessary Focus Dimensions

First, it is necessary to have scale in terms of the number and variety of users that access both the underlying data and the results of the analytics processes built on it. Tens or hundreds of thousands of employees might need to see various views of raw data and analysis results at any time. Enterprise platforms must be user friendly and also compatible with a wide range of tools and applications.

Second is a crucial need for scale in the dimension of concurrency. Concurrency refers to the number of users or applications that can access a given set of information at the same time. Concurrency at an enterprise level also means that as data is changing, users will receive consistent answers. As concurrency levels increase, the risks become quite large if a system isn't engineered to handle processing requests appropriately. For a large organization desiring to build operational analytics processes, it is necessary to have an environment where many different users and applications can interact with the same information simultaneously.

Third, there is the need for scalable workload management tools. With different user types submitting a wide variety of analysis requests with a layer of security on top, something must manage the workload. It is not a trivial task to balance many requests at once, and it is easy to forget this aspect of scalability. Creating a system that can effectively manage both very small, tactical requests and very large, strategic requests simultaneously is very difficult.

Last is the need to scale security protocols. An organization must be able to lock data down and control access as needed. Users must be allowed to see only those pieces of data that they are allowed to see. A large organization must have security built into its platforms in a robust fashion.

All of these dimensions of scale—data, processing, users, concurrency, workload management, and security—have to be present alongside each other from the start to succeed with operational analytics. Organizations that worry only about scaling the storage and processing dimensions will fail.

Getting the Most Value from Big Data

One of the most common mistakes I've seen organizations make as they try to incorporate big data into their analytics processes is that they consider big data a completely separate and distinct problem. Many companies are setting up an internal organization to focus specifically and only on big data.11 In fact, some organizations are going so far as to open new offices in Silicon Valley to handle their big data initiatives. That approach is asking for trouble because big data must be treated as simply another facet of an overarching data and analytics strategy. There should be a single, cohesive strategy to execute against that includes all data, big and small, as illustrated in Figures 2.5 and 2.6.


Figure 2.5 Big Data as Distinct Silo


Figure 2.6 Integrated Big Data

Let's explore a historical parallel that shows why not having a single data and analytics strategy will be problematic. When e-commerce came of age, many retailers did not think about e-commerce as another facet of their retail strategies. Instead, many retailers handled e-commerce as though it was something totally new. As a result, many retailers established a separate division to handle their e-commerce activities. In some cases, this division was also a separate legal entity. Those separate entities set up their own supply chain processes, their own product hierarchies, their own pricing policies, and so forth.

Fast forward to today. These same retailers now desire a single view of their business. They want not only a unified view of their e-commerce and traditional store environments but also a seamless experience for customers across channels. However, it is taking years and millions of dollars for retailers to reconcile what in some cases are completely incompatible hierarchies and systems.

Retailers 10 to 15 years ago correctly recognized that e-commerce would have new challenges. But they also should have recognized that e-commerce needed to fit within their overall retail strategy. Setting up their e-commerce business in a way that kept it integrated with the core business would have taken a little bit longer initially, but it would have saved a lot of time and money in the long run.

Make sure your organization doesn't make this same mistake with big data. Take the extra time up front to think through how big data will fit into your overall data and analytics strategy. This is important because no data source by itself will provide optimal value. Mixing various data sources together is the only way to maximize value. For example, it is necessary to mix sales data, web browsing data, demographic data, and more to fully understand a customer.

If an organization establishes separate systems and processes for big data without thinking up front about the need for integration, it will be that much harder to derive the needed value on the back end. Companies need to work toward a unified analytics environment that allows people to perform any type of analysis against any type or volume of data at any time. We discuss in much more detail how to make this a reality later in the book. Readers wishing to take a deep dive into getting the most value from big data as it relates to marketing should consider reading Big Data Marketing: Engage Your Customers More Effectively and Drive Value, written by my colleague Lisa Arthur.12

Back to the Future

A highly hyped concept around big data is the supposedly new world of nonrelational tool sets that are not based on a relational database and do not use SQL as the primary interface. SQL stands for Structured Query Language, and it has been called “the language of business” for years. Nonrelational tool sets do not leverage SQL exclusively, if at all. The premise behind the nonrelational movement is that there is a need for additional languages since SQL has, in many companies, been virtually the sole language of business. After all, why shouldn't businesses be multilingual? They should be. Furthermore, they should have been all along.

Let's get right to the fatal flaw in the hype. The fact is that nonrelational analytics is not a new concept. When I started in my analytics career, relational databases did not yet exist in the business world. There literally was no SQL. Therefore, everything we did to generate analytics was based on nonrelational methods. In my case, I usually leveraged tools from SAS. To people like me, SQL is actually the new kid on the block. Over time, we analytics professionals realized that SQL is a better way to go for certain kinds of problems and processing. There have also always been certain kinds of processing that analytics professionals have executed outside of an SQL environment.

With big data, what's really happening is that organizations have rediscovered the value of processing outside of an SQL context when it makes sense. As it happens, using nonrelational options makes sense much more often with many big data sources than with many traditional data sources. Many companies went too far and tried to fit all processing into an SQL paradigm. That was a mistake; organizations do need to incorporate other options into the mix. Just keep in mind that nonrelational options have always been available. It isn't that there was no need for nonrelational processing during the first decade of the 21st century. Rather, companies moved too far toward SQL. We can expect that SQL will remain the dominant approach for analyzing data in the future and that nonrelational analysis will be focused on specific needs.

Organizations should embrace the use of nonrelational tool sets where it is appropriate but can't for a minute think that doing so negates the need for SQL right alongside them. It is very easy to swing too far in the other direction, and many are at risk of doing just that today. In fact, for several years, many people advocated the death of SQL. In a case of massive opinion flip-flop, there is now a large movement to enable SQL-like functionality on a wide variety of nonrelational platforms, such as Hadoop. Once again, we're going back to the future. We talk more about this trend and how to leverage the right kind of processing in Chapters 5 and 6.

Big Data Is Going through a Maturity Curve

A lot of people talk to me about how big data feels overwhelming to them. There are so many new data sources and so many new things to do with the data that many organizations are just not sure how to begin and how to handle it all. Before despairing, consider the fact that big data is going through the same maturity curve that any new data source goes through.13 The reality is that the first time a new data source becomes available, it is always challenging. People aren't sure exactly how to best use the new data, what metrics to create from it, what data quality issues will be found, and so forth. However, over time, the handling of that data source becomes standardized.

Many years ago when I first started analyzing retail point-of-sale (POS) data, my team and I weren't sure how to best use the data to analyze customer behavior and drive better business results. We weren't even thinking yet about how to make analytics operational with the data. We had a lot of theories and ideas, but which of them would work hadn't yet been proven. We certainly hadn't standardized how we would input, prepare, and analyze the data. Over time, those analyzing POS data regularly did standardize all of those aspects. Today, POS data is considered easy to deal with, and it is applied to a wide range of problems.

Organizations must go through the same process outlined in Figure 2.7 with each new big data source. The fundamental difference with big data is that, in the past, one truly new and unique data source might be made available to an organization every few years. With big data, an organization may be faced with multiple new data sources all at once.

  • The quality of the data is not understood
  • The best methods to store and process the data are not identified
  • The most valuable metrics to create from the data are not known
  • The ability of the data to address business problems has not been proven
  • How the data overlaps and is distinct from other data has not been assessed

Figure 2.7 Challenges with Any New Data Source

Analytics professionals today can be tasked with trying simultaneously to analyze social media interactions, customer service interactions, web behavior, information from sensors, and more. This data may have to be leveraged together all at once within a single analytics process. In such a case, multiple new data sources going through the maturity curve all are being applied together. That is much more challenging than having a single new data source to worry about. Making matters worse, as we discussed previously, is the necessity to think not just about how to handle each data source by itself but how to connect them together.

Don't lose sight of the fact that working with new data is always difficult and always intimidating at first. There are always bumps in the road to get past. Inevitably, how to incorporate and analyze the data becomes largely standardized and everything is just fine. Then it is time to move on to the next new data source. That's exactly what is going to happen, and is already happening, with big data.

Big Data Is a Global Phenomenon

A final big data trend worth discussing is how consistent the views and maturity of big data are around the world.14 It is true that some organizations are farther ahead or farther behind in the adoption and maturity cycles. However, I've gone to several continents and I've talked to banks, insurance companies, retailers, government agencies, and more. What I've found is that everyone across the globe is struggling with almost the exact same issues. There are always local market considerations with respect to customs and regulations, but the fundamental business issues tend to be very consistent. Moreover, most people think that other industries and other parts of the world are well ahead of their own organization, even though often that really isn't true.

Math, statistics, analytics, and data don't really speak a specific language or belong to a specific culture; rather, they are universal in nature. A trend graph in China looks exactly the same as a trend graph in Spain and relays similar information. An average will be computed in India the same way as in Germany. A transaction record in Japan will have the same information as a transaction record in Brazil. The claim that big data is something that's a unique problem for an industry within a country is not true except in extremely rare instances.

Consider forming relationships with peers from a business just like yours elsewhere in the world. With social media today, it is easy to do. The other organization is probably struggling with the same problems your organization is. It isn't possible to get into a meaningful discussion with a direct competitor about how your organization analyzes data. However, it is quite possible to talk to somebody halfway around the world who poses no competitive threat. By sharing information and lessons learned, both organizations can benefit.

Whatever pains your organization is going through with big data, you can be sure that many others are going through the same pains as well. Over time, the solutions to those pain points will be found and the solutions will become widely known and implemented. Incorporating big data into operational analytics will become much easier and more commonplace. An organization doesn't necessarily have to be the first in the world to tackle something, but it shouldn't wait till the problem is fully solved either. At that point, the effort is nothing more than playing catch-up. Being the company following everyone else is not a winning approach.

Wrap-Up

The most important lessons to take away from this chapter are:

  • Don't worry about how to define big data. Worry about what data, whether big or small, is needed for your analytics. Definitions don't matter; results do!
  • Always start with specific business problems. Don't implement big data technology just to claim you are doing something with big data.
  • Despite excessive hype and unrealistic short-term expectations, big data is here to stay. Just as the Internet bubble didn't mean the Internet wasn't a huge opportunity, the big data bubble doesn't mean that big data isn't a huge opportunity.
  • What makes big data so exciting is the new information it provides. New information will beat out new algorithms almost every time.
  • Don't just use big data to improve existing analytics processes. Also look for ways that big data can solve old problems from a new perspective or can solve entirely new problems.
  • Expect the hype around the Internet of Things to rise rapidly in the coming years, but also expect to rethink data retention policies in order to handle new floods of data that have lower value.
  • The “differentness” of big data compared to traditional data can lead to more challenges than the “bigness” of big data.
  • Big data requires scale, but not just of processing and storage. Scalability is also needed for the dimensions of users, concurrency, workload management, and security.
  • Big data must be made a component of an overall data and analytics strategy. Big data can't be tackled effectively by itself.
  • After years of predictions that SQL was going to die, nonrelational platforms are now scrambling to implement SQL interfaces. Although this represents a huge flip-flop, it also reflects the reality of business needs.
  • Big data seems overwhelming today, but it is going through the same maturity curve as other data sources. Big data feels worse due to the number of new data sources coming at us all at once.
  • Around the world and across industries, most organizations perceive they are well behind with big data. In reality, few organizations are very far ahead today, which means few are far behind either.

Notes
