Chapter 2. Data Pipelines and Modeling

In the previous chapter, we looked at basic hands-on tools for exploring data, so we can now delve into the more complex topics of statistical model building and optimal control, or science-driven tools and problems. I will only touch on some topics in optimal control, since this book is about ML in Scala and not the theory of data-driven business management, which might be an exciting topic for a book of its own.

In this chapter, I will stay away from specific implementations in Scala and discuss the problem of building a data-driven enterprise at a high level. Later chapters will address how to solve the smaller pieces of the puzzle. Special emphasis will be given to handling uncertainty. Uncertainty usually comes in several flavors. First, there can be noise in the information we are provided with. Second, the information can be incomplete: the system may have some degree of freedom in filling in the missing pieces, which results in uncertainty. Finally, there may be variations in the interpretation of the models and the resulting metrics. This final point is subtle, as most classic textbooks assume that we can measure things directly. Not only may the measurements be noisy, but the definition of the measure may also change over time: try measuring satisfaction or happiness. Certainly, we can avoid the ambiguity by saying that we can optimize only measurable metrics, as people usually do, but this significantly limits the application domain in practice. Nothing prevents the scientific machinery from taking the uncertainty in the interpretation into account as well.

Predictive models are often built just for data understanding. Etymologically, a model is a simplified representation of actual complex buildings or processes, made for exactly the purpose of making a point and convincing people one way or another. The ultimate goal of predictive modeling, the kind of modeling I am concerned with in this book and in this chapter specifically, is to optimize business processes by taking the most important factors into account in order to make the world a better place. That sentence certainly has a lot of uncertainty entrenched in it, but it at least looks like a much better goal than optimizing a click-through rate.

Let's look at a traditional business decision-making process: a set of C-level executives makes decisions based on information that is usually obtained from a set of dashboards with graphical representations of the data in one or several databases. The promise of an automated data-driven business is to make most of these decisions automatically, given the uncertainties, while eliminating human bias. This is not to say that we no longer need C-level executives, but they will be busy helping the machines make the decisions instead of the other way around.

In this chapter, we will cover the following topics:

  • Going through the basics of influence diagrams as a tool for decision making
  • Looking at variations of pure decision-making optimization in the context of adaptive Markov decision processes and the Kelly Criterion
  • Getting familiar with at least three different practical strategies for the exploration-exploitation trade-off
  • Describing the architecture of a data-driven enterprise
  • Discussing major architectural components of a decision-making pipeline
  • Getting familiar with standard tools for building data pipelines

Influence diagrams

While the decision-making process can have multiple facets, a book about decision making under uncertainty would be incomplete without mentioning influence diagrams (Influence Diagrams for Team Decision Analysis, Decision Analysis 2 (4): 207–228), which help in analyzing and understanding the decision-making process. The decision may be as mundane as selecting the next news article to show to a user in a personalized environment, or as complex as detecting malware on an enterprise network or selecting the next research project.

As an example, consider a tourist deciding whether to take a river boat tour during her stay in Portland, Oregon. Depending on the weather, she may or may not go on the boat trip. We can represent this decision-making process as a diagram:


Figure 02-1. A simple vacation influence diagram representing a simple decision-making process. The diagram contains a decision node (Vacation Activity), observable and unobservable information nodes (Weather Forecast and Weather, respectively), and a value node (Satisfaction).

The preceding diagram represents this situation. The decision whether to participate in the activity is clearly driven by the potential to get a certain satisfaction, which is a function of the decision itself and the weather at the time of the activity. While the actual weather conditions are unknown at the time of trip planning, we believe there is a certain correlation between the weather forecast and the actual weather experienced during the trip, which is represented by the edge between the Weather and Weather Forecast nodes. The Vacation Activity node is the decision node; it has only one parent, as the decision is made solely based on Weather Forecast. The final node in the DAG is Satisfaction, which is a function of the actual weather and the decision we made during trip planning. Obviously, yes + good weather and no + bad weather are likely to have the highest scores. Both yes + bad weather and no + good weather would be bad outcomes, although the latter is probably just a missed opportunity and not necessarily a bad decision, given an inaccurate weather forecast.

The absence of an edge carries an independence assumption. For example, we believe that Satisfaction should not depend on Weather Forecast, as the latter becomes irrelevant once we are on the boat. Likewise, once the vacation plan is finalized, the actual weather during the boating activity can no longer affect the decision, which was made solely based on the weather forecast, at least in our simplified model, where we exclude the option of buying trip insurance.

The graph shows different stages of decision making and the flow of information (we will provide an actual graph implementation in Scala in Chapter 7, Working with Graph Algorithms). There is only one piece of information required to make the decision in our simplified diagram: the weather forecast. Once the decision is made, we can no longer change it, even if we have information about the actual weather at the time of the trip. The weather and the decision data can be used to model her satisfaction with the decision she has made.
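To make the flow of information concrete, here is a minimal sketch of how the vacation diagram could be evaluated: we update our belief about Weather from Weather Forecast using Bayes' rule and pick the Vacation Activity with the highest expected Satisfaction. All the numbers (the prior, the forecast accuracy, and the satisfaction scores) and all the names are hypothetical and chosen only for illustration; this is not the graph implementation promised for Chapter 7.

```scala
// A sketch of evaluating the vacation influence diagram.
// All probabilities and satisfaction scores are illustrative assumptions.
object VacationDecision {
  // Unobservable node: prior belief about the actual weather, P(weather)
  val weatherPrior: Map[String, Double] = Map("good" -> 0.6, "bad" -> 0.4)

  // Edge between Weather and Weather Forecast: P(forecast | weather),
  // keyed by (forecast, weather)
  val forecastGivenWeather: Map[(String, String), Double] = Map(
    ("sunny", "good") -> 0.8, ("rainy", "good") -> 0.2,
    ("sunny", "bad")  -> 0.3, ("rainy", "bad")  -> 0.7
  )

  // Value node: Satisfaction as a function of (decision, actual weather)
  val satisfaction: Map[(String, String), Double] = Map(
    ("yes", "good") -> 10.0, ("yes", "bad") -> -8.0,
    ("no", "good")  -> -2.0, ("no", "bad")  -> 5.0
  )

  // Bayes' rule: P(weather | forecast)
  def posterior(forecast: String): Map[String, Double] = {
    val joint = weatherPrior.map { case (w, p) =>
      w -> p * forecastGivenWeather((forecast, w))
    }
    val evidence = joint.values.sum
    joint.map { case (w, p) => w -> p / evidence }
  }

  // Expected Satisfaction of a decision, given only the forecast
  def expectedSatisfaction(decision: String, forecast: String): Double =
    posterior(forecast).map { case (w, p) => p * satisfaction((decision, w)) }.sum

  // Decision node: its only input is the forecast
  def bestDecision(forecast: String): String =
    Seq("yes", "no").maxBy(d => expectedSatisfaction(d, forecast))
}
```

With these assumed numbers, VacationDecision.bestDecision("sunny") returns "yes" and VacationDecision.bestDecision("rainy") returns "no". Note that bestDecision never sees the actual weather, which mirrors the missing edge from Weather to Vacation Activity in the diagram.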

Let's map this approach to an advertising problem as an illustration: the ultimate goal is to achieve user satisfaction with the targeted ads, which results in additional revenue for the advertiser. The satisfaction is a function of the user-specific environmental state, which is unknown at the time of decision making. Using machine learning algorithms, however, we can forecast this state based on the user's recent Web visit history and other information that we can gather, such as geolocation, browser-agent string, time of day, category of the ad, and so on (refer to Figure 02-2).

We are unlikely to measure the level of dopamine in the user's brain, which would certainly fall under the realm of measurable metrics and would probably reduce the uncertainty. We can, however, measure user satisfaction indirectly through the user's actions: either the fact that they responded to the ad, or even the time the user spent between clicks browsing relevant information. These measurements can be used to estimate the effectiveness of our modeling and algorithms. Here is an influence diagram, similar to the one for the vacation example, adjusted for the advertising decision-making process:


Figure 02-2. The vacation influence diagram adjusted to the online advertising decision-making case. The decisions for online advertising can be made thousands of times per second.

The actual process might be more complex, representing a chain of decisions, each one depending on a few previous time slices. An example is the so-called Markov decision process. In this case, the diagram is repeated over multiple time slices, with edges connecting nodes in adjacent slices.
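As a rough sketch of what repeating the diagram over time slices could look like, the following code unrolls a single-slice template into a list of edges. The node names (State, Forecast, Decision, Satisfaction) are hypothetical placeholders for the nodes in the figures, and the choice of edges between slices (the next state depending on the previous state and the previous decision) is my assumption for illustration, not something dictated by the diagrams above.

```scala
// A sketch of unrolling an influence diagram over several time slices.
// Node names and the inter-slice edges are illustrative assumptions.
object UnrolledDiagram {
  final case class Edge(from: String, to: String)

  // Edges within a single time slice t, mirroring the single-decision template
  def sliceEdges(t: Int): Seq[Edge] = Seq(
    Edge(s"State_$t", s"Forecast_$t"),
    Edge(s"Forecast_$t", s"Decision_$t"),
    Edge(s"State_$t", s"Satisfaction_$t"),
    Edge(s"Decision_$t", s"Satisfaction_$t")
  )

  // Assumed temporal links into slice t (t > 0): the new state depends on
  // the previous state and on the previous decision
  def temporalEdges(t: Int): Seq[Edge] = Seq(
    Edge(s"State_${t - 1}", s"State_$t"),
    Edge(s"Decision_${t - 1}", s"State_$t")
  )

  // Repeat the template for the given number of slices and link adjacent ones
  def unroll(slices: Int): Seq[Edge] =
    (0 until slices).flatMap { t =>
      sliceEdges(t) ++ (if (t > 0) temporalEdges(t) else Seq.empty[Edge])
    }
}
```

For example, UnrolledDiagram.unroll(3) produces four edges per slice plus two temporal edges for each of the two slice boundaries. Letting each slice depend only on the one immediately before it is the simplest choice; a process that depends on a few previous slices would add edges from earlier slices as well.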

Yet another example might be an Internet malware analytics system for an enterprise network. In this case, we try to detect network connections indicative of command and control (C2), lateral movement, or data exfiltration, based on the analysis of network packets flowing through the enterprise switches. The goal is to minimize the potential impact of an outbreak while disrupting the functioning systems as little as possible.

One of the decisions we might make is to reimage a subset of nodes, or at least to isolate them. The data we collect may contain uncertainty: many benign software packages may send traffic in suspicious ways, and the models need to differentiate between them based on the risk and potential impact. Another possible decision in this specific case is to collect additional information.

I will leave it to the reader to map this and other potential business cases to the corresponding diagram as an exercise. Let's consider a more complex optimization problem now.
