Chapter 3

Rule #1

Talk to the Users

Abstract

It is a mistake to view a data science problem as just a management exercise or just a technical issue. Some of the infrastructure required for success cannot be mandated or bought—some issues require attitudinal changes within the personnel involved in the project. In this part of the book Data Science for Software Engineering: Sharing Data and Models, some of those changes are listed. They include (1) talk to your users more than your algorithms; (2) know your maths and tools but, just as important, know your problem domain; (3) suspect all collected data and, for all such data, “rinse before use”; (4) data science is a cyclic process, which means you will get it wrong (at first) along the way to getting it right.

The most important rule in industrial data science is this: talk more to your users than to your algorithms.

Why is it most important? Well, the main difference we see between academic data mining research and industrial data scientists is that the former is mostly focused on algorithms and the latter is mostly focused on “users.”

Note that by this term “user,” we do not mean the end user of a product. Rather, we mean the community providing the data and domain insights vital to a successful project. Users provide the funding for the project and typically need to see a value-added benefit very early in the project.

At one level, talking to the people who hold the purse strings is just good manners and good business sense (because it is those people who might be willing to fund future projects). As the Mercury astronauts used to say, “No bucks, no Buck Rogers.”

But there is another, more fundamental, reason to talk to business users:

• The space of models that can be generated from any data set is very large. If we understand and apply user goals, then we can quickly focus a data mining project on the small set of most crucial issues.

• Hence, it is vital to talk to users in order to leverage their “biases” to better guide the data mining.

As discussed in this chapter, any inductive process is fundamentally biased. Hence, we need to build ensembles of opinions (some from humans, some from data miners), each biased in their own particular way. In this way, we can generate better conclusions than we could get by relying on any single bias.

The rest of this chapter expands on this notion of “bias,” as it appears in humans and data miners.

3.1 User biases

The Wikipedia page on “List of cognitive biases”1 lists nearly 200 ways that human reasoning is systematically biased. That list includes

• Nearly a hundred decision-making, belief, and behavioral biases such as attentional bias (paying more attention to emotionally dominant stimuli in one's environment and neglecting relevant data).

• Nearly two dozen social biases such as the worse-than-average effect (believing that we are worse than others at tasks that are difficult).

• Over 50 memory errors and biases such as illusory correlation (inaccurately remembering a relationship between two events).

As documented by Simons and Chabris [400], the effects of these human imperfections can be quite startling. For example, in the “Gorilla in our midst” experiment, subjects were asked to count how often a ball was passed between the members of a basketball team wearing white shirts. Nearly half the subjects (48%) were so focused on “white things” that they did not notice a six-foot tall hairy black gorilla walk slowly into the game, beat its chest, then walk out (see Figure 3.1). According to Simons and Chabris, humans can suffer from “sustained inattentional blindness” (a kind of cognitive bias) where they do not see effects that, to an outside observer, are glaringly obvious.

Figure 3.1 Gorillas in our midst. From [400]. Scene from the video used in the “invisible gorilla” study. Figure provided by Daniel Simons. For more information about the study and to see the video, go to www.dansimons.com or www.theinvisiblegorilla.com.

The lesson here is that when humans analyze data, their biases can make them miss important effects. Therefore, it is wise to run data miners over that same data in order to find any missed effects.

3.2 Data mining biases

When human biases stop people from seeing important effects, data miners can uncover those missed effects. In fact, the best results often come from combining the biases of the humans and the data miners (i.e., the whole may be greater than any of the parts).

That kind of multilearner combination is discussed below. First, we need to review the four biases of any data miner (a short sketch after this list shows how each bias appears as a concrete learner setting):

1. No data miner has access to all the data in the universe. Some sampling bias controls which bits of the data we offer to a miner.

2. Different data miners have a language bias that controls what models they write. For example, decision tree learners cannot write equations and Bayesian learners cannot write out categorical rules.

3. When growing a model, data miners use a search bias that controls what is the next thing they add to a model. For example, some learners demand that new rules cannot be considered unless they are supported by a minimum number of examples.

4. Once a model is grown, data miners also use an underfit bias that controls how they prune away parts of a model generated by random noise (“underfit” is also called “overfitting avoidance”).
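To make those four biases concrete, here is a minimal sketch, not taken from this book, of how each bias surfaces as an explicit choice when a data mining run is configured. It assumes the scikit-learn library; the helper name biased_learner and the particular parameter values are ours, for illustration only.

# A sketch (not from this book) of how the four biases appear as concrete
# choices when configuring a scikit-learn decision-tree learner.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def biased_learner(X, y):                     # hypothetical helper, for illustration only
    # 1. Sampling bias: we only ever learn from the rows we happened to collect
    #    and, here, from the 70% of them that land in the training split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    learner = DecisionTreeClassifier(         # 2. Language bias: the model must be a decision tree
        min_samples_leaf=20,                  # 3. Search bias: no new split without enough supporting examples
        ccp_alpha=0.01)                       # 4. Underfit bias: prune away doubtful subtrees
    return learner.fit(X_train, y_train), (X_test, y_test)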

For example, consider the task of discussing the results of data mining with management. For that audience, it is often useful to display small decision trees that fit on one slide, such as the following small decision tree that predicts who will survive the sinking of the Titanic (hint: do not buy third-class tickets).

(Figure: a small decision tree, learned from the Titanic data, predicting survival from passenger class and sex.)

To generate that figure, we apply the following biases:

• A language bias that generates decision trees.

• A search bias that does not grow the model unless absolutely necessary.

• An underfit bias that prunes any doubtful parts of the tree. This is needed because, without it, our trees may grow too large to show the business users.
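Here is a hedged sketch of those settings, assuming scikit-learn and a tiny hand-coded stand-in for the Titanic data (not the real passenger list): the language bias is the choice of a tree learner, and a depth limit plays the role of the underfit bias, keeping the tree small enough for one slide.

# Toy stand-in for the Titanic data (hypothetical encoding, for illustration only):
# class: 0=crew, 1=first, 2=second, 3=third; sex: 0=male, 1=female
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [2, 1], [0, 1], [3, 1], [1, 0], [2, 0], [0, 0], [3, 0]]
y = ["survived", "survived", "survived", "died", "died", "died", "died", "died"]

tree = DecisionTreeClassifier(max_depth=2)   # underfit bias: keep the tree slide-sized
print(export_text(tree.fit(X, y), feature_names=["class", "sex"]))

On realistic data we would also set a search bias, such as the minimum leaf size used in the earlier sketch.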

For other audiences, we would use other biases. For example, programmers who want to include the learned model in their program have a different language bias: they often prefer decision lists, that is, a list of conditions where the next condition is checked only if the previous one fails. Such decision lists can easily be dropped into a program. Here is a decision list learned from the S.S. Titanic data:

(sex = female) and (class = first) => survived
(sex = female) and (class = second) => survived
(sex = female) and (class = crew) => survived
=> died

So bias is a blessing to any data miner because, if we understand it, we can tune the output to the audience. For example, our programmer likes the decision list output because it is very simple to drop into (say) a “C” program:

/* assumes sex and class are C strings; strcmp() comes from <string.h> */
if (strcmp(sex, "female") == 0 && strcmp(class, "first") == 0)  return "survived";
if (strcmp(sex, "female") == 0 && strcmp(class, "second") == 0) return "survived";
if (strcmp(sex, "female") == 0 && strcmp(class, "crew") == 0)   return "survived";
return "died";

On the other hand, bias is a challenge because the same data can generate different models, depending on the biases of the learner. This leads to the question: is bias avoidable?

3.3 Can we avoid bias?

The lesson of the above example (about the Titanic) is that, sometimes, it is possible to use user bias to inform the data mining process. But is that even necessary? Is not the real solution to avoid bias altogether?

In everyday speech, bias is a dirty word. For example:

Judge slammed for appearance of bias

A Superior Court judge has barred an Ontario Court colleague from presiding over a man's trial after concluding he appeared biased against the defendant.

(www.lawtimesnews.com, September 26, 2011. See http://goo.gl/pQZkF).

But bias's bad reputation is not deserved. It turns out that bias is very useful:

• Without bias, we cannot say which bits of the data matter most to us.

• That is, without bias, we cannot divide data into the bits that matter and the bits that do not.

This is important because if we cannot ignore anything, we cannot generalize. Why? Well, any generalization must be smaller than the thing it generalizes (otherwise we might as well just keep the original thing and ignore the generalization). Such generalizations are vital to data mining. Without generalization, all we can do is match new situations to a database of old examples. This is a problem because, if the new situation has not occurred before, then nothing is matched and we cannot make any predictions. That is, while bias can blind us to certain details, it can also let us see patterns that let us make predictions about the future.

To put that another way:

Bias makes us blind but it also lets us see.

3.4 Managing biases

Bias is unavoidable and, indeed, central to the learning process. But how should we manage it?

Much of this book is concerned with automatic tools for handling bias:

• Above, we showed an example where different learners (decision tree learners and decision list learners) were used to automatically generate different models for different users.

• Later in this book, we discuss state-of-the-art ensemble learners that automatically build committees of experts, each with their own particular bias. We will show that combining the different biases of different experts produces better predictions than relying exclusively on the bias of one expert (a small sketch of the idea follows this list).
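That sketch, assuming the scikit-learn library rather than the specific ensemble methods covered later, simply lets several learners with different language biases vote on each prediction:

# A sketch of a committee of differently biased experts (assumes scikit-learn;
# the ensembles discussed later in this book are more sophisticated than this).
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

committee = VotingClassifier(estimators=[
    ("tree",  DecisionTreeClassifier(max_depth=3)),    # language bias: axis-parallel splits
    ("bayes", GaussianNB()),                           # language bias: one Gaussian per attribute and class
    ("knn",   KNeighborsClassifier(n_neighbors=5))],   # language bias: similar rows get similar labels
    voting="hard")                                     # each expert gets one vote
# Usage: committee.fit(X_train, y_train); committee.predict(X_new)

The design point is simply that each expert's bias covers for the blind spots of the others.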

But even without automatic support, it is possible to exploit expert biases to achieve better predictions. For example, at Microsoft, data scientists conduct “user engagement meetings” to review the results of a data mining session:

• Meetings begin with a quick presentation on the analysis method used to generate the results.

• Some sample data is then shown on the screen, at which point the more informed business users usually peer forward to check the data for known effects in that business.

• If the data passes that “sanity check” (i.e., it reflects the effects the business already knows about), the users start investigating the data for new effects.

Other signs of success for such meetings are as follows. In a good user engagement meeting:

• The users keep interrupting to debate the implications of your results. This shows that (a) you are explaining the results in a way they understand, and (b) your results comment on issues that concern the users.

• The data scientists spend more time listening than talking, as the users propose queries on the sample data or vigorously debate the implications of the displayed data.

• At subsequent meetings, the users bring their senior management.

• The users start listing more and more candidate data sources that you could exploit.

• After the meeting, the users invite you back to their desks, inside their firewalls, to show them how to perform certain kinds of analysis.

3.5 Summary

The conclusions made by any human- or computer-based learner will be biased. When using data miners, the trick is to match the biases of the learner with the biases of the users. Hence, data scientists should talk more, and listen more, to business users in order to understand and take advantage of those user biases.

Bibliography

[400] Simons DJ, Chabris CF. Gorillas in our midst: sustained inattentional blindness for dynamic events. Perception. 1999;28:1059–1074.




1 http://en.wikipedia.org/wiki/List_of_cognitive_biases.
