13

What’s at Stake?

When Amazon recommends products, Expedia lists travel options, Netflix suggests movies, or Google serves up a list of sites that could answer your question, they don’t have a button that says, “If this isn’t helpful, then click here to speak to one of our customer service agents.” But browse Tesla’s site for more than thirty or so seconds and a chat window pops up asking if you have any questions. Ask something complicated and you are quickly passed from a chatbot to a person who will answer your questions. That person will also try to get you to take a test-drive. This is how you know there is something different going on. Rather than leave the entire interaction to automation, Tesla inserts a human in the loop when the interaction becomes more difficult.

There are many differences between these companies that we could point to. For one, Amazon and Google have larger shares of their respective markets, so there is less risk you will go elsewhere. However, that motive doesn’t apply to Expedia. Amazon and Google have millions (or more) of options that they have to sort through, whereas Tesla has only a few models. But if you count all the options, Tesla has perhaps the same degree of complexity as Expedia. Even when you compare how much you might spend, Expedia and Tesla are similar. Someone with the income to purchase a $60,000 Tesla that will last five to six years is also likely to spend $10,000 a year on travel. So the obvious explanations don’t get us to the bottom of why one interaction is automated and the other is a hybrid with a knowledgeable person at the ready.

This chapter is devoted to stakes. We have already seen some of the factors that go into choosing whether to use prediction machines to fully automate a task. But here we highlight a potential cost of full automation when the stakes are high.1 Stakes are the expected losses that arise when there is an error in prediction. Throughout this book, when we talk about prediction, we have often been considering whether perfect prediction could help or not. That is actually a somewhat unrealistic exercise. The better question is whether some imperfect prediction could help, for the excellent reason that there are no perfect predictions.

Predictions fail by degrees and in different directions. Your weather app may give you a zero percent chance of rain tomorrow; if it then rains, the prediction was clearly incorrect. It may instead give you a 10 percent chance of rain; if it rains, it is hard to know whether the prediction was right or not. But the consequences of being incorrect may be very different, depending on what you are using that prediction for. If you’re thinking of going for a bicycle ride, then a 10 percent probability may be worth the risk. If you’re making the final call on whether to hold your wedding reception outside or in a tent, then being on the wrong side of that 10 percent has big consequences.

From this perspective, we can see why Tesla is not relying on AI prediction to guide its customer interactions. Compared with Amazon, Google, and Expedia, the stakes are higher. If an Amazon search doesn’t serve up what the consumer wants, the consumer will likely search again. Even so, if the search doesn’t work out for a given interaction, the consumer will likely be back for other stuff. But if consumers don’t find answers to their questions and then search for another brand or, even worse, end up going to a dealer where an incentivized salesperson is desperate for them not to leave without the keys to a brand-new car, Tesla won’t get that person back. Thus, while one could imagine an AI experience that could manage that consumer interaction for Tesla, in the end, if the prediction of what a consumer wants is wrong, Tesla loses the customer. The stakes are higher, and when the stakes are higher, the losses when relying wholly on imperfect AI predictions are higher.

We could conclude that when the stakes are higher, AI has to deliver better predictions so that errors don’t appear. However, while better AI may be more valuable in these situations, the message we want to convey in this chapter is that such a conclusion oversimplifies the situation.

Here we examine the relationship between stakes and AI prediction. One implication is that when the stakes are high, utilizing AI prediction involves complementary investments in measures that manage the additional risks that are created. That management will involve either some form of insurance—countering potentially adverse downside outcomes—or some form of protection, reducing the likelihood of those outcomes and containing the adverse consequences.

High versus Low Stakes

Amazon recommendations involve what we would call “low-stakes” transactions. If you search for “dog bowls,” and Amazon recommends the “PEGGY11 No Spill Non-Skid Stainless Steel Dog Bowl” with its five-star rating from 10,505 reviews for $18.89 (discounted from $18.99), that may well hit the mark.2 But then you realize the bowl is very large and may not suit your small shih tzu, so you search for “dog bowls small” instead. That leads you to the “Bone Dry Paw Patch & Stripes Ceramic Pet Bowl & Canister Collection, Small Bowl Set,” which gives you two bowls for just $13.05 (discounted from $14.99), and you are done.3

That was good news for you but apparently not for one reviewer who gave it a one-star rating:

They look much larger than they are in fact. There was nothing in the text to suggest how minuscule they are. They might be good for teacup breeds, but nothing bigger. Hopefully I will save all who read this some wasted time and energy.4

Something went gravely wrong for this customer. Amazon recommended the bowl but led the customer astray. It made the sale but disappointed the customer. You might think the customer would never return. Instead, they bought and reviewed nine more products after that poor experience, giving five of them one-star reviews and four of them favorable ones. Across ten years and eighty-seven reviews, this customer was very satisfied during their first year of reviewing. After that, their experience has been mixed.

It is hard to tell whether or how much these negative experiences have changed this person’s buying behavior on Amazon. But each of what we call “true positives” involved a buying suggestion that, when followed, led to high satisfaction, and each “false positive” led to low satisfaction. Amazon would prefer the former to the latter, but the latter were not devastating for it.

But there is a flip side to this evaluation: What about the things it did not recommend? Amazon has hundreds of dog bowls available, and it did not recommend most of them (in the sense of placing them high up in the search results). If these were “true negatives,” then had they been recommended and bought, they would have resulted in low satisfaction. But what if there was a “false negative,” a bowl that would have delighted this customer? Then Amazon missed an opportunity: had it recommended that bowl, it would have earned a favorable review.
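
To lay the four cases out explicitly, here is a minimal sketch in code. The mapping of outcomes to recommendation decisions follows the text; the labels and descriptions are ours, for illustration only.

```python
# The four possible outcomes of a recommendation, from Amazon's point of view.
# Keys are (was the product recommended?, would the customer have liked it?).
outcomes = {
    (True, True):   "true positive: recommended and a good match, so a sale and a happy customer",
    (True, False):  "false positive: recommended but a poor match, so a sale and a one-star review",
    (False, False): "true negative: not recommended, and rightly so",
    (False, True):  "false negative: never recommended, though it would have delighted the customer",
}

for (recommended, liked), description in outcomes.items():
    print(f"recommended={recommended}, good_match={liked}: {description}")
```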

One question is whether Amazon could have done better. A deep dive into the reviews shows that this person had bought dog products before and complained about size. In a review from years earlier, this person mentioned that their dog is a Siberian husky, which is a substantial beast. This gives us a clue that this person was quite a dog lover. Amazon had recommended the tiny bowls for this same dog.5 Someone who even vaguely knew their customer could have avoided this whole debacle.

The point of this exercise (you may be forgiven for wondering) is that Amazon’s recommendation engine does well, but it makes mistakes. There is room and opportunity for improvement, as our cursory examination showed.6 The consequences of an error were low. The customer was more satisfied than not and has kept shopping for over a decade at this point. False positives result in damaging reviews, and false negatives result in missed opportunities for even happier customers and more positive reviews, but in this particular case, neither was enough to deter the customer from continuing to shop at Amazon.

Contrast that with the challenge facing Facebook when it tries to remove offensive material from people’s timelines. Facebook’s news feed is put together by an AI predicting what a person would like to see. The data driving that prediction includes what the person has interacted with (and perhaps liked) in the past, as well as data on who that person is and what their friends have liked. If someone that person follows posts something, Facebook will make that post prominent in the news feed. And therein lies the problem: Facebook wants to rely on posts from the people someone follows to assign priority, but relying on that becomes a problem when those people may post offensive material. What constitutes offensive material is hard to describe, even if people know it when they see it.

In 2018, Facebook acknowledged that it was unable to use AI prediction to identify offensive content without considerable error. How much error was that? According to Facebook:

  • We took down 21 million pieces of adult nudity and sexual activity in Q1 2018—96% of which was found and flagged by our technology before it was reported. Overall, we estimate that out of every 10,000 pieces of content viewed on Facebook, 7 to 9 views were of content that violated our adult nudity and pornography standards.
  • For graphic violence, we took down or applied warning labels to about 3.5 million pieces of violent content in Q1 2018—86% of which was identified by our technology before it was reported to Facebook.
  • For hate speech, our technology still doesn’t work that well and so it needs to be checked by our review teams. We removed 2.5 million pieces of hate speech in Q1 2018—38% of which was flagged by our technology.7

Guy Rosen, Facebook’s VP of product management, explained the challenge:

[W]e have a lot of work still to do to prevent abuse. It’s partly that technology like artificial intelligence, while promising, is still years away from being effective for most bad content because context is so important. For example, artificial intelligence isn’t good enough yet to determine whether someone is pushing hate or describing something that happened to them so they can raise awareness of the issue.8

Interestingly, in terms of “false positives” (content that was predicted to be appropriate but, in fact, wasn’t), that is a very high rate of predictive success. The issue wasn’t that the AI was doing a bad job; in fact, it was surprisingly good. But that wasn’t enough, because these decisions were very high stakes.

To Facebook, the consequences of a false positive were much higher than those of a false negative. In the latter case, a piece of acceptable content was blocked, which may have led to a disgruntled user. In the former case, offensive content that was allowed through may have upset many users. That imbalance meant that Facebook could not just rely on AI to make the final call. Instead, it used AI to flag content that might be inappropriate, setting the bar low enough that the flagged set contained more appropriate than inappropriate content, and then gave humans in the loop the final call.
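
In pseudocode terms, that workflow looks something like the sketch below. The threshold value and function names here are invented for illustration, and the real pipeline is obviously far more elaborate; the point is only that the AI’s score decides whether a human looks at the content, not whether it comes down.

```python
# Illustrative human-in-the-loop triage: the AI's score routes content,
# but a human reviewer makes the final call on anything flagged.
# The 0.1 threshold is a made-up example of a deliberately "low bar": it
# flags far more acceptable content than offensive content, trading
# reviewer workload for fewer offensive posts slipping through.

FLAG_THRESHOLD = 0.1

def route_content(post, predicted_offensiveness, human_review_queue):
    """Publish low-risk content automatically; send everything else to a person."""
    if predicted_offensiveness < FLAG_THRESHOLD:
        return "publish"
    human_review_queue.append(post)  # the final decision rests with a moderator
    return "hold for human review"

queue = []
print(route_content("vacation photo", 0.02, queue))   # publish
print(route_content("borderline post", 0.35, queue))  # hold for human review
```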

Facebook employs 15,000 people as content moderators to do that job.9 This is a significant fraction of Facebook’s 60,000 total employees.10 And by all accounts, it is neither a well-paid nor pleasant job, as those people get to see all of the content Facebook doesn’t want its users to see.11 Because the thresholds for AI performance are so high, and because others are continually working to sidestep that AI, the false-positive rate may never get low enough to eliminate that particular job.12

Stakes and Decisions

We can express the notion of high- versus low-stakes predictions in decision trees. Figure 13-1 shows the decision trees facing both Amazon and Facebook. Amazon faces a decision of whether to recommend or not. For a given product, the AI prediction says that there is either a 90 percent chance or a 10 percent chance it is a good match for the customer. Obviously, if it is 90 percent, Amazon will recommend the product; otherwise, it won’t. As the prediction isn’t perfect, there is a 10 percent chance of a false positive (recommending a product the customer doesn’t like) and a 10 percent chance of a false negative (failing to recommend a product the consumer will like). In those cases, Amazon’s payoff falls by half to 100 (rather than 200). Thus, in adopting the AI, there is always an expected payoff shortfall of 10 (= 0.1 × 100).

FIGURE 13-1

Stakes for Amazon and Facebook

[image not shown]

By contrast, for Facebook, the stakes are very different. The AI prediction has the same error rate as for Amazon, with a false-negative and false-positive rate of 10 percent. But, in this case, if Facebook adopts the AI, then, taking into account potential losses, it earns a payoff of –10, on average. Suppose that, if it did not use AI, Facebook could always identify offensive content perfectly using human curators. In that case, it would never have any losses and, thus, would earn a positive payoff if most content was acceptable. The inaccuracy of the AI in this case makes it unattractive for Facebook to adopt because the stakes involved in getting it wrong—upsetting a user with offensive content or upsetting a sharer who had acceptable content—are relatively high compared to the payoff from getting it right.
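
To make the arithmetic behind figure 13-1 concrete, here is a minimal sketch of the expected-payoff comparison. The Amazon numbers (a payoff of 200 on a correct call, 100 on an error, and a 10 percent error rate) come from the text; the Facebook payoff and loss values are illustrative assumptions chosen only so that the average works out to the –10 mentioned above, since the figure itself is not reproduced here.

```python
# Expected payoff when a prediction is wrong with probability p_error.
def expected_payoff(p_error, payoff_correct, payoff_error):
    return (1 - p_error) * payoff_correct + p_error * payoff_error

# Amazon: an error halves the payoff, so the expected shortfall is 0.1 * 100 = 10.
amazon = expected_payoff(p_error=0.1, payoff_correct=200, payoff_error=100)
print(round(amazon, 2), round(200 - amazon, 2))   # 190.0 10.0

# Facebook (illustrative numbers): a correct call is worth little, while
# letting offensive content through is very costly, so the same 10 percent
# error rate drags the average payoff below zero.
facebook = expected_payoff(p_error=0.1, payoff_correct=10, payoff_error=-190)
print(round(facebook, 2))                         # -10.0
```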

What this means is that two AIs (in our examples, Amazon’s product recommendation AI and Facebook’s content moderation AI) can be equally effective in terms of their error rates, but because of the stakes involved, one can be deployed and relied on in a low-stakes environment while the other requires substantial additional human resources in a high-stakes one. When AI predictions are faster and cheaper but not better than human ones, AI adopters need to be careful. When the stakes are high, even more care is needed. At Facebook, the AI enables the content moderators but does not replace them. In contrast, when the consequences of a mistake are low, faster and cheaper might be enough. Amazon does not require human salesclerks to help its customers shop.

Introducing the Personal Touch

In many cases, the hope in adopting AI prediction is that you can introduce more products. Greater product variety means you can find consumers who are more likely to see personal value in what you are offering. Indeed, taken to the extreme, AI prediction has offered the hope of complete product personalization: that is, you can supply as many products as there are people. This, however, while a seductive notion, places considerable weight on what AI can actually deliver. As the music-streaming service Spotify found out, just having the ingredients to provide a personalized product experience does not mean that consumers actually end up taking advantage of it. Instead, you need to think about what AI is doing in the context of all the activities involved in using a product.

Music streaming means that you can play virtually any music, anytime, and in any order. This is so commonplace that it is hard to remember what came before. Previously, you were limited to the collection of music you owned, which, save for the mixtape, meant you played albums in your collection or listened to radio that was curated for its entire audience. Music streaming meant that you could personalize to your heart’s content. For die-hard music fans, it was a dream. For the rest of us, it was an option, but one we rarely exercised.

In so many respects, what AI prediction offers is the ability to predict a person’s tastes. Thus, a music-streaming service, which has the potential to offer as many products as there are people, has to actually deliver that product. For a service such as Spotify, at the outset, this meant giving people the tools to build their own playlists and be as personal as they chose to be. In reality, personalization requires effort. Predicting your own tastes and building playlists to satisfy them takes time and effort. And since no one knows all music, especially new music, what ends up happening is that people listen to old staples: “Spotify was a powerful product—it gave you access to almost all the world’s music. But it wasn’t a very helpful product for those who didn’t already have that time or knowledge. In fact, for them it felt like a lot of work.”13

Instead of personalization increasing the diversity of music listened to, leaving curation in the hands of listeners led to a high concentration of listening among just a few very popular artists. In the end, Spotify was a radio station with very little personalization. And if that was all it was, why would people pay for it?

Before investing in AI, Spotify tried to use the playlists generated by aficionados to build recommendations for others. However, that turned out to be imperfect because those playlists were personalized. They made sense to the people who built them, but if others used them, they would occasionally seem strange. For instance, there might be a Christmas song dropped in among heavy rock tracks. Spotify called it the “WTF problem.” Those playlists were not going to scale. The WTF problem was an indication that the stakes were too high.

Turning to AI prediction, the playlists became not the recommendation but the training data. Put this together with information about how people described songs on the internet, and you could break songs down and identify common elements within them. You could then use all this to predict which songs made sense together, eliminate the WTFs in the mix, and serve up recommendations that satisfied listeners. With this, in 2015, Spotify launched its “Discover Weekly” playlist, which became popular among its listeners. It was personalized, broadened tastes, and gave Spotify users an experience that was otherwise not readily available.
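
As a rough illustration of how playlists can serve as training data, here is a minimal sketch based on song co-occurrence across playlists. The playlist data and the simple counting scheme are assumptions made for illustration; Spotify’s actual models are far more sophisticated and also draw on how songs are described across the internet.

```python
from collections import defaultdict
from itertools import combinations

# Toy playlists standing in for the user-generated ones used as training data.
playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
    ["song_a", "song_c", "song_e"],
]

# Count how often each pair of songs appears in the same playlist.
co_counts = defaultdict(int)
for playlist in playlists:
    for s1, s2 in combinations(sorted(set(playlist)), 2):
        co_counts[(s1, s2)] += 1

def related_songs(song, k=3):
    """Songs that most often co-occur with `song`, i.e., 'make sense together' with it."""
    scores = defaultdict(int)
    for (s1, s2), n in co_counts.items():
        if song == s1:
            scores[s2] += n
        elif song == s2:
            scores[s1] += n
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(related_songs("song_c"))  # candidates to recommend alongside song_c
```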

Interestingly, because the AI prediction in Discover Weekly was influenced by the playlists in Spotify’s own system, Discover Weekly was still mostly popular among users who already had very diverse tastes. Spotify still had the issue of how to scale personalized playlists to most of its users. The problem was that Spotify was still just looking at the music. But it realized that, for most people, there was a reason for, or a context to, their listening. They were exercising, relaxing, driving in the car, or doing any number of other things. This, however, was not something that could easily be resolved by AI prediction. But a person could judge that.

Thus, Spotify engaged editors who understood context well. Those editors could identify the seven hundred songs most likely to be “sung in the car.” But rather than having those editors choose fifty, as a radio editor might, the AI prediction engine ranked the seven hundred songs according to each user’s personal tastes. This curated set of candidate songs reduced the stakes: even if the personalization wasn’t perfect, a mistake would still mean a song that fit the moment. The playlist went from “the fifty songs you would like to hear, without context” to “the fifty songs we assess you would want to sing in the car or would like to hear.” Spotify’s users embraced that concept, and the resulting product was a widespread hit.
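
A minimal sketch of that division of labor, with song names and taste scores invented for illustration: human editors supply the context-appropriate candidate pool, and the predicted taste score is used only to rank within it, so even a ranking mistake still surfaces a song that fits the moment.

```python
# Hypothetical illustration: editors curate the candidate pool; the
# prediction model only ranks within it.

# Editor-chosen songs judged to fit the "sing in the car" context
# (in practice, roughly seven hundred of them).
editor_pool = ["song_a", "song_b", "song_c", "song_d", "song_e"]

# A stand-in for the model's predicted affinity of one user for each song.
predicted_taste = {"song_a": 0.91, "song_b": 0.40, "song_c": 0.75,
                   "song_d": 0.62, "song_e": 0.88}

def personalized_playlist(pool, taste, length=50):
    """Rank the editor-curated pool by predicted taste and keep the top songs."""
    ranked = sorted(pool, key=lambda song: taste.get(song, 0.0), reverse=True)
    return ranked[:length]

print(personalized_playlist(editor_pool, predicted_taste, length=3))
# ['song_a', 'song_e', 'song_c']: every candidate already fits the context,
# so a misranked song is still a low-stakes mistake.
```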

Judgment Reveals Stakes

When radar was first introduced in World War II, it was tough to interpret. The idea was that radar could detect enemy aircraft so defenders could prepare for their arrival sooner. But there was a problem. How could you be sure that what you were seeing was an enemy aircraft rather than some other anomaly or, worse, one of your own planes? You couldn’t. Radar operators had to get a feel for those other possibilities and make a call. But no matter how good they were at predicting, there was always a high probability of a false alarm (it wasn’t an enemy plane) or a miss (an enemy plane that wasn’t identified). That raised the question: How sure do you want to be before making a call? 50:50, 70:30, 80:20?

We can reframe this in terms of loss functions. A loss function captures not just how accurate a prediction is but also the consequences of following that prediction with the action you take, relative to the stakes. In the self-driving car case, the engineer needs to determine what level of confidence that the object is a person should trigger a stop or swerve action. The higher that level, the fewer false positives (unnecessary stops) there will be, but at the cost of more false negatives (failures to stop for a person). The same issue exists for the radar-signaling problem.
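
One standard way to formalize this is to act whenever the expected loss of acting is lower than the expected loss of not acting, which pins the confidence threshold to the relative costs of the two errors. The sketch below uses invented costs purely for illustration; it is not the calculation any particular carmaker uses.

```python
# Illustrative only: the costs are made up to show how relative error
# costs determine the confidence threshold for taking the action.

cost_false_positive = 1.0      # stopping or swerving when no one is there
cost_false_negative = 1000.0   # failing to stop when a person is there

# Expected loss of not stopping: p * cost_false_negative
# Expected loss of stopping:     (1 - p) * cost_false_positive
# Stopping is worthwhile when p * cost_false_negative > (1 - p) * cost_false_positive,
# i.e., when p exceeds the threshold below.
threshold = cost_false_positive / (cost_false_positive + cost_false_negative)
print(round(threshold, 4))     # ~0.001: even a faint chance it is a person triggers a stop

def should_stop(p_person):
    """Trigger the stop or swerve action when predicted probability exceeds the threshold."""
    return p_person > threshold
```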

The relevant stakes to use come from judgment, which, as we have already noted, is determined wholly by humans. While judgment tells you what the value of different possibilities is overall, stakes focus on one particular aspect of that judgment—the relative consequences of errors. Thus, in deploying AI, you want judgment to come from the right person.

For cars, pre-self-driving, judgment rested with drivers, who would see objects and take action accordingly. They could make mistakes, but they could be relied on not to want to hit other people, and so the system worked. What we could not know was each driver’s prediction error, which we would need in order to calculate the stakes involved. Somehow the combination just worked out.

For a self-driving car, you need to measure the prediction error and determine the stakes. This means that, for the loss function, you have to quantify judgment. It requires being explicit about the risk of loss of life versus the usefulness of the car.

This unpleasant and ethically fraught task is why we may never rid ourselves of the human at the wheel. For radar detection, alerts can be given, but real actions require a human sign-off. As anyone who remembers the movie WarGames knows, you don’t really want decisions based on radar interpretation to be fully automated. Instead, the prediction is separate from the decision, and the decision is ultimately made by a person.

Thus, while we have framed the choice of whether to adopt AI prediction in terms of the level of stakes, the choice of whether to fully automate a decision on the basis of AI prediction turns on how measurable the stakes are. In the end, as of 2022, we are still left to wonder whether we will achieve the measurability needed to allow cars to drive themselves.

KEY POINTS

  • No prediction—AI or otherwise—is perfect. Sometimes there are errors. Furthermore, there are different types of errors. For example, a simple binary prediction has two types of errors: false positives and false negatives. Different error types may incur different costs, depending on the decision they inform. So, when determining how to best use predictions in decision-making, businesses must consider the costs of different types of errors.
  • The costs of errors determine the stakes of the prediction. In low-stakes situations, if you follow an incorrect prediction, the costs are relatively low (for instance, recommending the wrong product to an Amazon customer). Other decisions involve high stakes (for instance, flagging Facebook content as safe when it is not). The returns to automating low-stakes decisions are usually higher than for high-stakes decisions, where the benefits to human oversight often outweigh the costs.
  • Whether a decision is low or high stakes depends on judgment—humans determine the cost of errors for every decision. Judgment is even more important when the predictions being relied on are more imperfect.