Chapter 5. Case study: click prediction for online advertising

This chapter covers

  • A real, large-scale intelligent system
  • Targeting users with web-browsing data
  • Ranking users’ propensity to interact using logistic regression

Online click prediction is a specific problem in the world of advertising and is an excellent example of a high-velocity, high-throughput problem where decisions must be made with a very low latency. In general, this class of problems has a large number of applications. Online trading, website optimization, social media, the Internet of Things, sensor arrays, and online gaming all generate high-velocity data and can benefit from fast processes that make decisions in light of the most recent information possible.

Every time you open a browser and navigate to a web page, thousands if not millions of decisions are made in order to determine what ads to put in front of you. These decisions are made by communicating across many data stores and by understanding whether a specific ad will have a positive impact on the user. This has to happen quickly, because decisions must be made before the page can finish loading.

From an advertiser’s standpoint, we can only proxy a positive impact through interaction with an ad. So, we try to predict the likelihood of interaction with an ad on a web page, based on all of the information surrounding a user and the context. Due to the volume and velocity of data generated on the web and the requirement for a rapid answer, this isn’t a simple problem.

The solutions presented here should be equally applicable to different applications. For example, if we consider a smart city, where all cars are equipped with sensors that report their speed and chosen route, low-latency decisions could be used to open/close contraflow lanes or reroute traffic to keep it moving. This application follows a similar problem pattern. Can we take a huge amount of data that’s arriving with high velocity and build models around this to perform decisioning quickly? Luckily, the answer is yes! In order to properly explore this solution, we’re going to have to delve a little into the specifics of online advertising. For maximum generality, we’ll try to keep advertising specifics to a minimum and focus on the intelligent algorithms, drawing parallels with other applications as much as possible.

5.1. History and background

Opinions regarding the practice of online advertising are polarized among web users. Despite this, it does look like online advertising is here to stay. Spending has seen year-on-year increases of around 5%: it was set to reach $189.38 billion US in 2015 and to rise steadily to a massive $220.55 billion US by 2018.[1] As more money moves from traditional media channels such as print and TV to the digital world, and as advertisers begin to use the new ways in which we interact with media and our devices, technology will play an even more prevalent role.

1

“Total US Ad Spending to See Largest Increase Since 2004,” eMarketer, July 2, 2014, http://mng.bz/An1G.

The initial growth of online advertising was at least in part due to the increased opportunity for performance measurement.[2] Advertisers could for the first time measure exactly how many users they had reached and, more important, their subsequent behavior. This was a boon for the industry, because with traditional media channels the efficacy of advertising could only be measured through the indirect effect it had on aggregate sales. It was now possible to determine whether a particular user bought a product and to see their exact behavior leading up to the purchase. It also became possible to test the effectiveness of ads in place. Advertisers could now test many different ads, images, and pieces of copy in real time, discounting those that didn’t perform well and increasing budgets for those that did. In general, optimization problems such as this one have many applications outside advertising.

2

Bill Gurley, “The End of CPM,” Above the Crowd, July 10, 2000, http://mng.bz/Vq6D.

Ultimately, what was witnessed was a drive toward data-driven decisioning. More and more data could be captured and used to better target users. This in turn led to the development of a whole new ecosystem created to service both buyers and sellers in this environment. In this chapter, we’ll introduce the concepts of the exchange and the demand-side platform. We’ll develop these concepts as the chapter progresses, and you’ll learn how you can use the information collected about a user to increase the odds of interaction with an ad.

Figure 5.1 provides a graphical overview of the ecosystem. You can see that the exchange is the medium through which publishers (websites such as theguardian.com, huffingtonpost.com, and more) can sell advertising space (slots and spaces of a predefined size) to individual advertisers (think companies like Nike, Adidas, O2, and Vodafone). Advertisers often buy through what’s known in the industry as a demand-side platform (DSP).

Figure 5.1. The online advertising ecosystem. Publishers make their content (advertising slots) available through an exchange that allows access programmatically—that is, through an intelligent algorithm. A demand-side platform aggregates demand for content from advertisers and buys on their behalf.

A DSP acts on behalf of multiple advertisers and combines buying power to reach the very best publishers on the web. There are many examples of DSPs on the web, such as Turn (www.turn.com), MediaMath (www.mediamath.com), and DataXu (www.dataxu.com). These platforms are a conduit through which it’s possible to buy advertising. We can easily draw parallels with the world of equity trading. Not everybody is allowed on the trading floor, so representatives act on behalf of many individuals who wish to buy. In this parallel, the DSP is equivalent to the trader, where the advertising exchange is equivalent to the stock exchange. In the first case, the DSP aggregates demand, just as an individual trader does. In the second case, both exchanges act as a platform for assets to be bought and sold.

In both the advertising and equity markets, trading is now more often performed by an intelligent algorithm than by a human. In the world of finance, this is often referred to as high-frequency trading. In the advertising world, it’s known as programmatic buying, because advertising space is bought through the use of a computer program (an intelligent algorithm) as opposed to in person.

An intelligent algorithm is used to determine the quality of the user and ad available for purchase, and this ultimately impacts the price that’s bid (and subsequently paid) to place an ad in front of that user. In the following sections, we’ll look at a specific algorithm used to determine quality, but you first need to know a bit more about the process by which an exchange makes media available for trading. We’ll then provide some real data and discuss some of the specific complications that arise when training large quantities of data. Without further ado, let’s jump in!

5.2. The exchange

Figure 5.1 presented a high-level overview of how ad placements (the space in which an ad is placed on a website) are bought and sold. The exchange cloud hides a high degree of complexity. In this section, we’ll lift the lid on this complexity and provide a stripped-back version that illustrates the process involved blow by blow. Figure 5.2 provides a graphical overview of this process.

Figure 5.2. A graphical overview of the steps involved in buying ad placements on an exchange. Events should be read from top to bottom. First, a cookie match occurs. This can be performed either exchange side or DSP side. Once the DSP has a unique identifier that it can use, it uses this to look up additional information about the user and construct a bid price (the subject of section 5.4.5 onward). If the bid is successful, bid notification is returned for logging by the DSP. The next stage is for the DSP to specify the ad to be served. Physically, this can be served by the exchange itself or by the DSP. Events in the ad are fired directly back to the DSP regardless of where the ad is hosted.

5.2.1. Cookie matching

This is the first step that occurs every time a page that contains ads is loaded. A cookie is a unique identifier for a user for each domain (or website/web service) that’s held in the browser. It’s guaranteed to be unique, but it lasts for only a specific lifetime or until a user flushes their browser cookie cache. Because of strict security enforced by the browser, web services can’t read each other’s cookies; so if an exchange wants to pass a DSP a reference for an individual, so that the DSP can decide whether to place an ad in front of that user, one of two things must happen.

Exchange-side cookie matching

The first option relies on a unique mapping between DSP cookies and the exchange’s cookies, hosted by the exchange itself. In this scenario, whenever a page is loaded, the exchange is contacted and looks up the appropriate cookie ID for the DSP. The details of the ad placement are then returned to the DSP, along with that DSP’s unique identifier for the user. It’s then up to the DSP to look up further information about this user from its stores and decide what to do with this opportunity. The benefit of this setup is that you don’t need to host a service yourself to do the cookie translation, but the service will more than likely cost you extra with the exchange.

DSP-side cookie matching

The second option relies on the DSPs themselves hosting this service. Whenever a page is loaded, the exchange contacts the DSP with its cookie ID (and additional information about the ad placement), and you perform the lookup. This obviously adds extra complexity to your decisioning stack due to the time-critical nature of lookups (exchange responses must occur within milliseconds). It might cost you less in the short term, but the service must be developed and maintained.

In either case, the result is exactly the same for the DSP. Every time a page is loaded for which there’s an opportunity through the exchange, the DSP obtains a unique identifier for the user that can be used to look up further information. The DSP also obtains specific information about the ad placement that’s available. What the DSP does with it gets to the heart of the intelligent algorithm presented in this chapter. In order to perform well against the other DSPs in the space, it’s imperative that it uses all of its historical information to predict the future behavior of this user. This allows the DSP to pay the right price for this user, maximizing its chances of serving an ad and minimizing the chances of wasting money.

5.2.2. Bid

Given a unique identifier for the user and some general information about the ad placement under offer, the DSP now has to decide two things: first, whether to place an ad in this placement; and second, if it does, how much it’s willing to pay for the ad.

This seemingly simple question hides a raft of complexity. If a DSP is paid on a cost-per-click basis, then you might wish to bid based on the expectation of a click (and therefore monetary return). This basic approach is the subject of section 5.4.4 onward, but it’s possible there are other objectives to meet. Advertising campaigns of different sizes have different budgets, and it may be important to those advertisers to meet their budget or to pace their spending evenly throughout the running time of the campaign. Campaigns may also be in contention with each other for the same space on certain publishers. They may not be interested in clicks at all but rather in conversions (where a user clicks and subsequently buys something). So, you see, what appears at first to be simple is actually a difficult multi-objective optimization problem!

5.2.3. Bid win (or loss) notification

Once a DSP has decided whether it will bid and how much, and has placed the bid, the next thing that occurs is a notification from the exchange specifying whether the DSP was successful. In the event of failure, no further action is required: because the DSP didn’t win the placement, the loss is simply logged, and that will be the end of communication for this particular opportunity.

If the DSP was successful in securing the ad placement, then the cost of this placement is returned to the DSP. The majority of exchanges use what is known as a second price, or Vickrey, auction[3] in that the winner (highest bidder) of the auction pays only the price of the second-highest bid. Consequently, this final price must be returned to the DSP for logging before the next phase proceeds.

3

Vijay Krishna, Auction Theory, 2nd ed. (Academic Press, 2009).

5.2.4. Ad placement

Ad placement is the most straightforward of all of the phases; once an ad placement has been won, the ad must then be rendered in it. All exchanges use a defined set of ad sizes[4] for which each advertiser and each campaign has a pre-prepared set of ads (known sometimes as creatives). Some exchanges go as far as to have these creatives preapproved, so that the exchange can check these for quality and brand safety before a bid is even made.

4

Internet Advertising Bureau, “IAB Standard Ad Unit Portfolio,” www.iab.com/guidelines/iab-standard-ad-unit-portfolio.

As with cookie matching, ads can be served by DSPs themselves or by the exchange. The latter has clear safety benefits for the exchange but will come with cost implications for the DSP.

5.2.5. Ad monitoring

The final stage of the auction sequence allows us to gather data on the ad that was eventually shown. This information is invaluable for several reasons. The first is that advertisers need to know whether an ad has been interacted with—that is, if it has been clicked, watched, or shared. Without this, we’d have no metric by which the clients can measure success, nor could we generate an invoice based on these metrics! The second is that this information is crucial to us as a DSP. If we know what interests a user, then we can build and target ads that lead to more positive responses for that user and to users similar to this one. Monitoring is both an ongoing requirement and the means to operate effectively as a business.

5.3. What is a bidder?

Understanding the process of bidding is the first step toward understanding the context in which an intelligent algorithm must operate. So far, we’ve only provided a high-level overview of how we buy ads on an exchange. All we know is that we’re given some information by the exchange and that we must return a bid price in order to secure this user and this placement. In practice, this is achieved by a bidder that loosely follows the design presented in figure 5.3.

Figure 5.3. The bidder consists of an outer shell that’s responsible for performing the crucial tasks associated with bidding and the decision engine that’s responsible for taking into account all data associated with the bid.

The bidder is made up of an outer shell and an inner core. The outer shell is responsible for performing the crucial tasks associated with bidding, such as retrieving user information and responding to the exchange in a timely manner. At the inner core of the bidder is the decision engine, responsible for taking into account all user, placement, and context data and returning an estimate of click probability. In the rest of this section, we’ll look at some requirements for a bidder (the outer shell) before delving further into the details of the decision engine (the inner core) in section 5.4.

5.3.1. Requirements of a bidder

In general, we can evaluate a bidder against its performance in the following categories. Although the list isn’t exhaustive, the items on it are of utmost importance for high-velocity, low-latency applications. We discuss the ramifications of these and their possible tradeoffs here:

  • Speed— This is arguably the most important of these categories. A bidder must be fast. All exchanges place an extremely tight time requirement for response on their partners. From initial page load to the auction occurring and ad placement being filled, the typical duration is somewhere in the region of 100 milliseconds. Failing to meet this time requirement consistently can lead to missed opportunities and, in the worst case, to being barred from the exchange! The exchange has a reputation to uphold with its publishers and can’t risk the possibility of empty ad slots that not only look unsightly on the publisher’s page but also represent missed revenue for them.
  • Simplicity— One way to achieve speed is through adopting a simple design. The bidder should perform only one function and should be easy to debug and maintain. Remember, once your bidder is live and participating in auctions, then every moment it’s offline represents a loss of revenue for both you and your advertisers.
  • Accuracy— Exchanges contain a lot of inventory. Part of the difficulty of participating in an exchange is that it’s easy to buy lots of poor inventory, and to buy it very quickly! For this reason it’s extremely important that your bidder be accurate. Over time, you should be able to see a positive correlation between the price you bid and the performance of the ads you buy.
  • Extensibility— There are many exchanges available through which inventory can be bought. Some inventory is available through only one exchange, whereas others are available on many. In order to maximize your reach as a DSP, you’ll want to integrate with as many exchanges as possible. For this reason, extending your bidder should be easy. Don’t be tempted to tightly couple your bidder with the first exchange that you integrate with, but keep your solution as generic as possible.
  • Evaluation— Once your bidder is in place and performing reasonably well, you’ll want to start improving your decisioning to try to obtain the best users at the lowest cost. Every change to your bidder and its associated decisioning must be monitored to ensure that changes have the right effect. Often the results of a change can only be seen over days and weeks, not minutes and hours.

5.4. What is a decisioning engine?

In the previous sections, we introduced the basics of the problem of buying ad placements online through an exchange and how this is performed in practice through the use of a bidder. In this section, we’re going to concentrate on a specific aspect of the bidder. We’ll uncover how to take the raw data provided by the exchange and convert this into a value you’re willing to pay to deliver an ad. Let’s look at the types of information you can use to generate these probabilities.

5.4.1. Information about the user

In reality, the only information provided about a user is a cookie identifier. It’s up to the DSP to record and store any further information. When someone is seen for the first time, no information is available about them whatsoever! If they interact with an ad they’re shown, this will be stored, as will any other interaction that occurs in the scope of a DSP’s advertising: for example, if the user starts a video or cancels a full-screen advertisement. These small interactions are stored away and looked up each time this user is seen in the wild. This allows a DSP to build up a picture of the user behaviors that are associated with a likelihood to click, as well as those that aren’t.

5.4.2. Information about the placement

Quite separate from information about the user, information about the ad placement can be predictive of a click. Independent of the user, certain sizes of ads tend to perform well, as do certain publishers. For example, ads on premium news sites such as theguardian.co.uk may have a higher click-through rate (CTR) than those on small business retail outlets.

5.4.3. Contextual information

Finally, there are sources of information that don’t fall into either user or placement groups. Such information is typically harder to use but is no less valuable. For example, we know that long weekends such as Black Friday[5] and Cyber Monday[6] typically lift CTR across the board and that sunnier days tend to yield fewer clicks than rainy days, when people may venture out and shop at brick-and-mortar stores rather than stay in and buy online!

5

BBC News, “Black Friday: Online Spending Surge in UK and US,” November 27, 2015, www.bbc.co.uk/news/business-34931837.

6

BBC News, “Cyber Monday Adds to Online Sales Surge,” November 30, 2015, www.bbc.co.uk/news/business-34961959.

5.4.4. Data preparation

The first step in building a decisioning engine is to prepare your data. Given that the measurement of success is the user clicking the ad, essentially you have two outcomes: a positive one, where the user clicks the ad, and a negative one, where the user doesn’t. Thus, you require training instances for each. Because the volume of data received is very large, this might be problematic, not just from the point of view of storage but also of model training. You need a way to reduce the size of this data without losing the patterns contained within. This is achieved by down-sampling—that is, randomly selecting a smaller subset of the data.

You generate a new, down-sampled dataset in the following way. First, you keep all ad impressions that resulted in a click. These precious data points can’t be down-sampled, but because they represent only a small fraction of all impressions shown (an industry average click-through rate is much less than 1%), you don’t need to. Instead, you can heavily down-sample the impressions that didn’t lead to a click, leaving a dataset of a much more manageable size. This also has the effect of rebalancing your dataset. So, instead of having a very large dataset with very few positive impressions (which few classification or regression techniques will manage well), you now have a much more balanced dataset in which the positive instances represent a larger proportion of the total. We’ll discuss the importance of balancing your dataset further in section 5.5.2.

5.4.5. Decisioning engine model

In chapter 4, we introduced logistic regression and used it to perform fraud detection on a very small dataset. This same approach has been used successfully in many DSPs to calculate the probability of a user interacting with an ad. In the previous chapter, we placed a hard threshold on the output of the logistic regression model in order to create a classifier; but in this chapter, we’ll use the output of the logistic regression model as is, taking the output probability as the likelihood of a click. Recall that the general form of the logistic regression model is given as follows:

ln(y / (1 − y)) = β0 + β1x1 + β2x2 + … + βnxn

where y is the probability of the event under investigation, 1 − y the probability that the event doesn’t occur, xi the feature values, βi for i > 0 the feature coefficients, and β0 the intercept. Thus, in order to determine the log odds of the likelihood of clicking, we’ll train a model using the data prepared as per section 5.4.4 and calculate the log odds (and subsequently the probability) of a click at the time of bid request by substituting the request’s feature values into the previous equation.
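To make the substitution concrete, here’s a minimal sketch of scoring a bid request against a set of learned coefficients. The feature names, coefficient values, and intercept below are invented purely for illustration; they aren’t taken from the chapter’s data.

import math

def click_probability(features, coefficients, intercept):
    # features: dict of feature name -> value for this bid request
    # coefficients: dict of feature name -> learned coefficient (beta_i)
    # intercept: the learned intercept term (beta_0)
    log_odds = intercept + sum(coefficients.get(name, 0.0) * value
                               for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Invented example: a premium site and a user with three past clicks
print(click_probability({'site_premium': 1, 'past_clicks': 3},
                        {'site_premium': 0.8, 'past_clicks': 0.2},
                        intercept=-4.0))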

5.4.6. Mapping predicted click-through rate to bid price

Assuming that we have a trained model as given previously and that we’ve just received a bid request from the exchange, mapping the output of our model to a bid value is a fairly straightforward process. We’ll define a few terms first to make this easier:

  • CPI— Cost per impression
  • CPM— Cost per 1,000 impressions
  • CTR— Click-through rate (∑ clicks / ∑ impressions)
  • CPC— Cost per click

In general, exchanges accept bids on a CPM basis. Rather than specify how much we’re willing to pay for a single request, we return how much we’d pay for 1,000 such requests. Assuming a client provides us with a CPC target (how much they’re willing to pay for a click), we have a well-defined problem:

Bid CPM = CPC × E(CTR) × 1,000

Bid CPM = CPC × y × 1,000

That is, our returned bid should be proportional to the probability of a click (given the information known about the placement and the user) multiplied by the cost of a single click, multiplied by 1,000.
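A minimal sketch of this mapping, assuming we already have the model’s predicted click probability and the client’s CPC target (the function and variable names are illustrative, not part of the chapter’s code):

def bid_cpm(predicted_ctr, cpc_target):
    # predicted_ctr: the model's estimated probability of a click (y)
    # cpc_target: what the client is willing to pay for a single click
    return cpc_target * predicted_ctr * 1000.0

# Example: a 0.1% predicted CTR and a $2.00 CPC target give a $2.00 CPM bid
print(bid_cpm(0.001, 2.00))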

5.4.7. Feature engineering

Sections 5.4.1 through 5.4.3 described the basic features that may be available at the time of a bid request, but these aren’t always used in their basic form. Such features are used to generate several hundred more features, many of which then become binary indicators of a particular user status. For example, we could cross the domains visited with an ad-size feature to gain an individual feature for every particular combination, as shown in the sketch that follows. This removes the assumption that the domain and placement ad size are independent predictors of a click and allows the model to pick out excellent-performing ad placements on a site-by-site basis. Many DSPs and algorithm designers in this space will be understandably guarded about such details because these represent their secret sauce! Data science teams are constantly striving to improve the performance of their algorithms through feature engineering.
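As a simple illustration of such a cross, consider combining a domain field with an ad-size field into a single indicator feature. The field names here are hypothetical; the real Criteo features used later in this chapter are anonymized.

def cross_domain_ad_size(bid_request):
    # bid_request: dict of raw bid-request fields (hypothetical field names)
    domain = bid_request.get('domain', 'unknown')
    ad_size = bid_request.get('ad_size', 'unknown')
    # One binary indicator per (domain, ad size) combination, rather than
    # treating the two fields as independent predictors of a click
    return {'domain_x_adsize^%s_%s' % (domain, ad_size): 1}

print(cross_domain_ad_size({'domain': 'theguardian.com', 'ad_size': '300x250'}))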

5.4.8. Model training

Because of the large volume of training data available, it’s important to be able to perform out-of-core learning. Although Python with scikit-learn is great for data exploration and applications with a modest amount of data, it’s not suitable at all when the amount of this data increases dramatically. This is because the implementations in scikit-learn require that all data being operated over must reside in memory. Consequently, we chose to adopt John Langford’s open source Vowpal Wabbit (VW, https://github.com/JohnLangford/vowpal_wabbit). VW is currently receiving a high degree of interest and is being actively developed by many in the machine-learning community. We’ll provide a brief primer on its use so that you can evaluate some exchange data, but we encourage you to read the documentation on GitHub.[7],[8]

7

John Langford, “Vowpal Wabbit command-line arguments,” http://mng.bz/YTlo.

8

John Langford, Vowpal Wabbit tutorial, http://mng.bz/Gl40.

5.5. Click prediction with Vowpal Wabbit

As just mentioned, Vowpal Wabbit is a fast machine-learning library that’s able to train models without reading all data into memory. In general, VW supports a class of algorithms known as linear supervised learning (although VW does now provide some support for nonlinear algorithms). They’re supervised in the sense that the classes that are to be learned are provided and linear in the sense that the general form of the model is linear, as you saw earlier.

In this section, we’ll use the power of VW and use some real-world data generated by Criteo, a leading DSP. Criteo has made the data available in order to spur research in this area, but it comes with a catch. The data is absent of any semantics, so we don’t know what each field refers to or the meaning of the values contained within. Download the full dataset (4.5 GB),[9] and we’ll outline the steps to get the classifier up and running and navigate around some of the pitfalls that might beset the would-be practitioner.

9

Criteo, “Kaggle Display Advertising Challenge Dataset,” http://mng.bz/75iS.

First you’ll need to install VW; you can find detailed instructions for this on the GitHub page[10] as well as in the prerequisites file accompanying this book. Because VW is an active research project, documentation can sometimes lag behind the most recent version of the code. Thus the most up-to-date information can often be found on the mailing list, where you’ll also be able to interact with the developers of VW.

10

John Langford, Vowpal Wabbit wiki, http://mng.bz/xt7H.

VW is accessed from the command line and is controlled, for the most part, via command-line switches. It’s an extremely powerful platform, and we’ll touch on only some of its features here. As ever, the GitHub page and mailing list will have the most detailed information, and I strongly urge you to keep checking there if you’re going to develop further applications with VW.

5.5.1. Vowpal Wabbit data format

Vowpal Wabbit uses a specific file format that is very flexible but can be difficult to understand at first. Before we get started, we need to convert the file provided to us by Criteo into one that Vowpal Wabbit can understand. A sample of the raw data provided by Criteo is given in figure 5.4.

Figure 5.4. Overview of the raw data provided by Criteo. The first field relates to the event we’re trying to predict. This relates to the user performing some event, such as clicking the ad. The next 13 columns are integer properties with count semantics; a higher number means some event has been performed more times than if a lower number is seen. The next 26 features are categorical and most likely represent some property of the bid request: for example, the site from which the bid request was made.

Criteo provides 13 features that each take an integer value, plus 26 categorical features. The data is presented such that for a single log line, all integer features occur first, followed by the categorical features.

According to the “readme” file provided with the data, the integer features consist mostly of count information (for example, the number of times a user has performed something), whereas the values of the categorical features have been hashed into 32 bits. This means that although the integer values have integer semantics (a higher number means an event has been performed more times than a lower number), we can’t ascribe any such semantics to the categorical values. We might imagine that each categorical value represents some property: for example, which site the user is on at the time of the bid request. Note that hashing ensures consistency across bid requests—that is, the same hashed value implies the same non-hashed value. So although we don’t know the meaning of a particular hash, the meaning is consistent every time we see that hashed value.

To be compatible with the VW file format, we first need to change the class labels. VW refers to positive instances with a 1 and negative instances with a -1. More important, we must change the format of each log line to use the pipe (|) to separate features and specify to VW which features are categorical and which are continuous. This latter point is extremely important, because if we don’t distinguish between the two, our model won’t use the count semantics just mentioned. Each unique count value will be treated as a label, and the model will lose the ability to understand that all the count values are related through integer semantics: for example, that 2 is greater than 1, 3 is greater than 2, and so on. If you look in the accompanying folder for this chapter, you’ll find that we’ve already written this code for you!
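To give a feel for what that script does, here’s a heavily simplified sketch of the per-line transformation (the real script also handles the train/test split and a few edge cases):

def criteo_line_to_vw(line):
    # Column 0 is the click label, columns 1-13 are integer counts,
    # and columns 14-39 are hashed categorical values (tab separated).
    fields = line.rstrip('\n').split('\t')
    parts = ['1' if fields[0] == '1' else '-1']           # VW wants 1 / -1 labels
    for i, value in enumerate(fields[1:14], start=1):     # integer features
        if value != '':
            parts.append('|i%d c:%s' % (i, value))        # named continuous feature
    for i, value in enumerate(fields[14:40], start=1):    # categorical features
        if value != '':
            parts.append('|c%d %s' % (i, value))          # categorical indicator
    return ' '.join(parts)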

In order to proceed with the example in this chapter, you’ll need to download dac.tar.gz from Criteo,[11] untar/gz the file, and extract the train.txt data file from the dac subdirectory. Next, run the included Python script against the training file using the following command:

11

Criteo, “Kaggle Display Advertising Challenge Dataset,” http://mng.bz/75iS.
python ch5-criteo-process.py </path/to/train.txt>

This will create two files called train_vw_file and test_vw_file. These are created in the current directory of the aforementioned script. Note that if you wish, you can change the output files from this script by providing two additional parameters:

python ch5-criteo-process.py </path/to/train.txt> </path/to/train_vw_file>
</path/to/test_vw_file>

Each resulting file contains data in the following format:

-1 |i1 c:1 |i2 c:1 |i3 c:5 |i4 c:0 |i5 c:1382 |i6 c:4 |i7 c:15 |i8 c:2 |i9
c:181 |i10 c:1 |i11 c:2 |i13 c:2 |c1 68fd1e64 |c2 80e26c9b |c3 fb936136
|c4 7b4723c4 |c5 25c83c98 |c6 7e0ccccf |c7 de7995b8 |c8 1f89b562 |c9
a73ee510 |c10 a8cd5504 |c11 b2cb9c98 |c12 37c9c164 |c13 2824a5f6 |c14
1adce6ef |c15 8ba8b39a |c16 891b62e7 |c17 e5ba7672 |c18 f54016b9 |c19
21ddcdc9 |c20 b1252a9d |c21 07b5194c |c23 3a171ecb |c24 c5c50484 |c25
e8b83407 |c26 9727dd16

Let’s look at this in detail. In general, attributes of this data point are separated by the pipe, or vertical bar (|). Note that the first entry is -1. This is the target class for this datum. This data was extracted from an ad that didn’t end up receiving a click from the user it was targeted at; positive 1 would denote that a click did occur.

Notice the human-readable tags: i1, i2, i3, and so on. These are called namespaces. We’ve chosen namespaces beginning with i to denote integer features and those beginning with c to denote categorical features. Namespaces in VW are an advanced feature that allows you to experiment with the combination of different spaces. This would allow you, for example, to generate model coefficients for derived features with a simple command-line switch: if you knew that namespace c1 represented the site that the bid request was seen on and namespace c2 represented the ad unit size, you could easily include “ad size per site” as a new feature in the model. Also notice that some namespaces include an additional piece of information, denoted by c:. This is the name of the feature in the namespace. Where no name is specified, its name is implied by the value. This allows you to express the difference between categorical variables and continuous ones. Let’s take a concrete example to ensure that you understand this.

The c1 namespace in the previous example contains the feature 68fd1e64. Implicit to this is a value, 1, denoting that feature 68fd1e64 is present; thus, it’s categorical. We could equivalently write this as 68fd1e64:1. In the example of i1, the namespace has a feature, c. This has associated with it a continuous value, 1, denoting the count of the number of events (of name c) relating to i1 that have occurred. In practice, this distinction is important because it means the difference between training a single coefficient for a continuous variable (the correct way to proceed for integer semantics) and training a coefficient for every seen value of the variable (incorrect!). We won’t go further into this here, but I encourage you to read the associated documentation on namespaces on the project wiki.[12]

12

John Langford, Vowpal Wabbit input format, http://mng.bz/2v6f.

How features are represented in VW

VW uses a bit vector of a specified size called a feature table, and each feature is represented by hashing into this vector. Weights are learned, not per feature, but per entry within the vector.

This has a few benefits, including a reduced fixed-size learning space, but it means collisions can occur (where two different features trigger the same bit). In general, provided an appropriate vector size is chosen, the impact of this on the accuracy of the trained classifier is negligible. It also means that to uncover the impact of a feature (analyze its learned weights), we need to hash into the bit vector and perform the dot product of the vector and the weights. We’ll encounter this later.
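As a rough illustration of the idea (VW uses its own hashing scheme; this sketch only shows the principle of hashing features into a fixed-size table):

BITS = 22                     # a 22-bit feature table has 2**22 slots
TABLE_SIZE = 2 ** BITS

def feature_slot(namespace, feature):
    # Two different features can land on the same slot (a collision);
    # a weight is learned per slot, not per original feature
    return hash(namespace + '^' + feature) % TABLE_SIZE

print(feature_slot('c1', '68fd1e64'))
print(feature_slot('c24', 'c5c50484'))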

We mentioned earlier that you need to create two files. You create these files directly from the Criteo training data (train.txt). Because the Criteo data was originally compiled for a competition, the test data file (test.txt) that comes with the package doesn’t contain any labels. This would make it impossible for us to discuss the efficacy of our classifier without submitting it to the competition. In order to create our own test set, we further split the train.txt set into two files at the rate of 70/30: 70% of the training data is placed in a new training dataset file (train_vw_file), whereas we hold the remaining 30% in a test file (test_vw_file) to evaluate our solution. Now that you understand the format of the data that we’re working with and its internal representation in VW, and you have both a training set and a testing set, let’s proceed with data preparation.

5.5.2. Preparing the dataset

First, we’d like to understand how balanced the data is. By balanced, we mean is the proportion of positive samples roughly equal to the proportion of negative samples? This is an important step because the training algorithm we’ll use is sensitive to such properties of the data. Without a balanced dataset, the coefficients of the logistic model will be updated to reflect the negative samples more often than the positive samples. Put another way, the negative samples will have more importance than the positive ones. This isn’t ideal and can be best tackled by forcing the data to be balanced before training.

We can ensure that this is true with a bit of command-line manipulation. A word to the wise: these files are large, and editing them in your favorite text editor may cause your memory to fill up and your application to crash! Take a look at the following listing for an appropriately safe method to do this.

Listing 5.1. Determining how balanced the data is
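The listing body isn’t reproduced here, but a minimal sketch of the commands described below might look like this (using the training file created in section 5.5.1):

wc -l train_vw_file            # total number of training examples
grep -c '^-1' train_vw_file    # number of negative examples (no click)
grep -c '^1' train_vw_file     # number of positive examples (click)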

Here we use some simple command-line utilities to determine the number of positive and negative examples in the training set. Because every training example is on its own line in the data file, the first line tells us how many training points are available in the training set. The second and third lines use grep to select and count only those lines starting with negative and positive outcomes, respectively. We see that there are approximately three times more negative examples than positive ones. Note that this isn’t representative of the general click-through rate. Clickers generally represent a much smaller proportion of the population; we believe this dataset has previously been down-sampled by Criteo by an undisclosed amount.

We’ll also need to shuffle our data. Why? VW performs what is known as online learning. Remember, it’s out of core, in that the whole dataset doesn’t need to be in memory. In order to learn effectively on a line-by-line basis, data must be drawn randomly from the underlying distribution. The simplest way to achieve this is to shuffle the input data before running over it line by line. For more information on stochastic gradient descent (SGD), the online method we’ll use to train our model, the following papers will be useful: Bottou, “Online Learning and Stochastic Approximations,”[13] and Bottou, “Stochastic Gradient Tricks.”[14]

13

Leon Bottou, “Online Learning and Stochastic Approximations,” in Online Learning and Neural Networks, by David Saad (Cambridge University Press, 1998).

14

Leon Bottou, “Stochastic Gradient Tricks,” in Neural Networks: Tricks of the Trade, ed. Gregoire Montavon, (Springer-Verlag, 2012): 421–36.

To balance and shuffle our dataset for training, we’ll continue to work at the command line. The next listing provides the command to extract, shuffle, and create the dataset.

Listing 5.2. Balancing, shuffling, and creating the training dataset
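Again, the listing body isn’t reproduced here; a sketch of the commands the following paragraph describes might look roughly like this (the intermediate file names are assumptions):

grep '^1 ' train_vw_file > positive_examples.dat              # keep every positive example
grep '^-1 ' train_vw_file | awk 'NR % 3 == 0' \
    > negative_examples_down.dat                              # keep roughly 1 in 3 negatives
cat positive_examples.dat negative_examples_down.dat | sort -R \
    > all_examples_shuffled.dat                               # combine and shuffle
awk 'NR % 10 == 0' all_examples_shuffled.dat \
    > all_examples_shuffled_down.dat                          # further down-sample by 10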

This listing uses several more command-line tools to down-sample and shuffle our data, creating a balanced dataset. As in listing 5.1, we use grep to select those lines of the training file that we’re interested in (positive or negative samples) before using sort to shuffle the data into an output file. We use awk to pick every third line from the negative examples and rely on redirection to move data between files. Listing 5.2 is heavily annotated for those interested in the precise details, but the result of running this code snippet is the creation of a balanced and shuffled training set called all_examples_shuffled.dat; this training set is then further down-sampled by 10 to create a more manageable dataset, all_examples_shuffled_down.dat. Both of these datasets are balanced, but one is a (smaller) subset of the other. The rest of this chapter uses the more aggressively down-sampled dataset, because it will return a trained model much more quickly. But we must note that because VW is out of core, you should have no problem training on either dataset; the only difference will be the time taken to complete, and possibly the size of the feature table required (discussed shortly). I’ll leave it to you to re-create the training and testing steps using the larger training set.

With our data shuffled and ready to train, we can now proceed to call VW. The next listing provides the code to train and to create a human-readable view of our model.

Listing 5.3. Training a logistic regression model using VW[15]

15

This has been chosen because on analysis of the dataset used here, there are more than 2.1 million individual features. To be able to represent these uniquely, you’d require a binary feature vector of size at least log2(2.1 × 106), which, rounded up to the nearest integer, is 22. Note that when using the full, non-down-sampled data, this number may have to be revised. You can easily count the number of features in this case by running a full pass of listing 5.3 over your new dataset and performing a line count on readable.model.
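The listing body isn’t reproduced here; based on the description that follows, the commands might look roughly like this (the input file name is carried over from listing 5.2):

vw -d all_examples_shuffled_down.dat --loss_function=logistic -c -b 22 -f model.vw
vw -d all_examples_shuffled_down.dat -t -i model.vw --invert_hash readable.model
awk 'NR > 9' readable.model | sort -t ':' -k 3 -g -r | head -n 20 > readable_model_sorted_top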

The first line trains our model using VW. Several command-line switches perform specific functions, and more information regarding these can be found in Langford, “Vowpal Wabbit command-line arguments,” cited previously. In order to perform logistic regression, we need to tell VW to use the logistic loss function to calculate the difference between the desired output and the prediction. This will allow SGD to modify the coefficients to converge on a solution whereby the linear combination of the coefficients and the features approaches the log-odds of the target variable. The -c tells VW to use a cache. This stores the input data in a format that’s native to VW, making subsequent passes over the data much faster. The -b switch tells VW to use 22 bits for the feature vector, which is appropriate for our dataset. The -f switch tells VW to store the resultant model (essentially the coefficients of the model) in a file called model.vw.

If you recall from the previous chapter, the basics of model training are simple to understand: we update the coefficients to minimize the error in prediction—that is, the loss function. To do this effectively, VW stores the features in a hashed format. So, every feature, f, is represented by an associated position in the feature table given by its hash, h(f). Thus, it’s quick to look up a feature to change its coefficients on each learning pass. This has an associated advantage known as the hashing trick,[16] which allows very large feature spaces to be stored using a fixed-size representation. This has the effect that learning cost is capped without too much impact on the predictive power of the data.

16

Olivier Chapelle, Eren Manavoglu, and Romer Rosales, “Simple and Scalable Response Prediction for Display Advertising,” ACM Transactions on Intelligent Systems and Technology 5, no. 4 (2015): 61.

Essentially, model.vw contains a vector of learned weights; but, unfortunately, to human eyes it doesn’t make much sense, because we don’t know how the human-readable features relate to this vector. As machine-learning practitioners, we’re interested in the human-readable features so we can gain some intuition as to what features the model thinks are important. This is the focus of the second line of listing 5.3. We call vw again in test mode (-t) using the model we’ve already trained. In general, you should never test on the same data you train on, but in this case we’re not interested in the results! By running through the training data in test mode a single time with the --invert_hash option set, we build a table that links the human-readable feature name to the effective impact on the classifier, through the weights associated with the hashed feature value.

Because there are so many learned values, and they’re written out to the readable.model file in no particular order, before we can analyze the coefficients we must sort this file. We do this using the command-line tools awk, sort, and head. Essentially, we throw away the first nine rows (which include header information and preamble) and then sort the remaining rows on the third column, telling sort that the columns are denoted by the colon (:). The final output is sent to readable_model_sorted_top, and we reproduce the top 20 coefficients in the listing that follows. The features you see here are those that, if set (holding all other features constant), cause the greatest increase to log odds.

Listing 5.4. Human-readable output from VW

Listing 5.4 presents a snippet of the human-readable output. Values after the ^ character are the values of the feature in the raw data; the second and third columns present the hash value and the coefficient, respectively. We’ve presented only the top 20 coefficients here (there are over 2 million in the down-sampled set, probably because the domain is included in the dataset and every individual domain is encoded as its own feature), but we can already see some interesting trends. The most important feature appears to be a c24, followed by a c3, and then a c12. So in general, each of these three features causes the largest increase in log odds if set (in the order presented), assuming all other features are held constant. Note that none of the features with integer semantics appear here.

You may also notice a repetition of weight values for some of the features in this list. This is because those features activate the same bits of the feature vector after they’re hashed (hash collision). You can see this: the number after the first colon is the same for several lines of listing 5.4. Consequently, their impact is unified into the single same value when the dot product between the feature vector and the learned weights is performed.

In reality, this isn’t ideal, although it might not be so bad—the impact is averaged by the learner. In essence, if either of these features is present, a net positive or negative effect is added to the log likelihood of the event that is to be predicted.[17]

17

More information about feature hashing and collision avoidance can be found at http://mng.bz/WQi7.

Unfortunately, because Criteo anonymized the data, we’re unable to decipher what these features relate to. If we could, though, it would be great from an analytics point of view and would make a wonderful story! What we’re really interested in determining is how well our model performs, and (luckily!) we can do this without decoding the features. Recall that in a previous chapter we introduced the AUC, or the area under the receiver operating characteristic (ROC) curve, whereby a value close to 1 represents a perfect classifier. We’ll now apply this methodology to an unseen test set, but first we must use VW and our previously trained model to obtain some predictions from our test data.

5.5.3. Testing the model

The following listing provides the code to evaluate a test set against a trained model and to generate both the probabilities and ground truth that will be used in the next step to create a ROC curve.

Listing 5.5. Testing against a trained model in VW
vw -d test_vw_file \
    -t \
    -i model.vw \
    --loss_function=logistic \
    -r predictions.out

~/dev/vowpal_wabbit/utl/logistic -0 predictions.out > probabilities.out

cut -d ' ' -f 1 test_vw_file | sed -e 's/^-1/0/' > ground_truth.dat

The first command uses the -d option to specify the data file, along with the -t option for testing (no learning). The -i option loads the previously trained model into memory, and we specify the -r switch to output the raw predictions. The next line uses a helper function to transform these raw predictions (which are the raw linear scores, that is, the log odds) into their associated probability, as specified by y in section 5.4.5. Note that you’ll most likely need to change the path to the helper function on your system. The logistic script can be found under the utl subdirectory of the VW repository (http://mng.bz/pu5U) originally checked out as per the book’s prerequisites/requirements file.

The resulting probabilities are stored in the probabilities.out file. The final line of listing 5.5 separates the columns of the test file by splitting on white space and selecting the first column, which contains the ground-truth class data. This is then piped to sed, which replaces any occurrences of -1 (how VW deals with negative classes in logistic regression) with 0. The output is sent to the ground_truth.dat file.

We’re now finally at the position where we can evaluate the efficacy of our model against the test data. The next listing shows the scikit-learn code that’s required to generate the AUC and a graph of the ROC curve, shown in figure 5.5.

Figure 5.5. The receiver operating characteristic for our model based on the supplied test data. The area under the curve is given as 0.76.

Listing 5.6. Generating the ROC curve with scikit-learn
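The original listing isn’t reproduced here; the following is a minimal sketch along the same lines, reading the files produced by listing 5.5 and following the ROC example from the scikit-learn documentation (the exact file parsing may need adjusting to your output):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Read in the ground truth (0/1 labels) and the predicted probabilities
with open('ground_truth.dat') as f:
    y_true = [int(line.split()[0]) for line in f if line.strip()]
with open('probabilities.out') as f:
    y_score = [float(line.split()[0]) for line in f if line.strip()]

# Extract the data for the ROC curve and compute the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Plot the curve, following the scikit-learn documentation example
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')
plt.show()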

After the imports, the first few lines here are essentially preamble, reading in the ground truth and the probabilities required in order to generate the ROC curve. The data for the ROC curve is extracted in a single line. The rest of the code generates the graph and has been taken from the scikit-learn documentation.[18] Figure 5.5 shows the output from this code.

18

scikit-learn, “Receiver Operating Characteristic (ROC),” http://mng.bz/9jBY.

The results show a value of 0.76 for the area under the curve based on our trained model and held-out test data. To translate this into a more intuitive number, if we chose a threshold corresponding to the point on the ROC curve closest to the top-left corner of figure 5.5 and used this threshold as a hard classifier to choose between clickers and non-clickers, we’d have a true positive rate of about 0.7 (that is, about 70% of actual clickers were correctly identified as clickers) and a false positive rate of about 0.3 (that is, 30% of non-clickers were incorrectly identified as clickers). To obtain these numbers, find the point closest to the upper-left corner on the upper curve in the figure, and read off the values on the y- and x-axes, respectively. In reality, however, models such as these are used not to classify impressions that will be clicked but to determine how much money should be paid to show an ad, as per the methodology in section 5.4.6.

5.5.4. Model calibration

In the previous section, we alluded to the fact that we don’t use the output of our trained model to create a classifier per se; rather, we use the output of the logistic regression model to estimate the probability of a click occurring. But if you remember, to achieve efficient model training, we changed the balance of positive and negative events in the data, by down-sampling negatives, so that the two classes represented roughly even proportions of the training data. Thus, in order to obtain outputs from the model that are representative of the underlying probabilities of a click, we must calibrate the model.

The method we use here is taken from the paper by Olivier Chapelle et al.[19] and is based on some simple observations. First, recall that, in our model, the log odds of the positive event occurring are linearly related to the input data via the learned weights:

ln( Pr(click | x) / Pr(no click | x) ) = β0 + β1x1 + … + βnxn

19

Chapelle et al., “Simple and Scalable Response Prediction for Display Advertising,” ACM Transactions on Intelligent Systems and Technology 5, no. 4 (January 2014): 1–34.

where we now use the model probability in the numerator and denominator of this equation. Consider now only the distribution of the data. If we apply Bayes’ rule to the top and bottom of the odds and simplify, we obtain the following relationship:

Pr(click | x) / Pr(no click | x) = ( Pr(x | click) × Pr(click) ) / ( Pr(x | no click) × Pr(no click) )

Using the fact that only negative instances of a click are down-sampled and, as in Chapelle’s paper, using Pr′ to denote the resultant probability distribution after sampling the data down, this may be equivalently written as follows

Pr′(click | x) / Pr′(no click | x) = ( Pr(x | click) × Pr(click) ) / ( r × Pr(x | no click) × Pr(no click) )

where r is the rate by which the negative samples have been down-sampled.[20] This leads us to the relationship between the odds in the original dataset and those in the down-sampled dataset:

Pr(click | x) / Pr(no click | x) = r × ( Pr′(click | x) / Pr′(no click | x) )

20

For example, starting with a dataset of 1,000 positive examples and 9,000 negative examples, if you were to sample the negative examples to 1,000 and create a balanced dataset, then you would have employed a value of r = 1/9.

Thus, we return to our original model equation,

ln( Pr(click | x) / Pr(no click | x) ) = ln(r) + ln( Pr′(click | x) / Pr′(no click | x) )

or equivalently

ln( Pr(click | x) / Pr(no click | x) ) = ln(r) + β′0 + β′1x1 + … + β′nxn

where the primed coefficients are those learned from the down-sampled data.

You can train on a down-sampled dataset, removing only negative training instances; but if you wish to obtain output from the model that reflects the probability of the positive event accurately, you must complete a further step.

As you can see, the trained model will have an intercept that’s ln(r) lower than the one that would be obtained if you hadn’t used a down-sampled set. So if you wish to correct for this, you must add ln(r) to the resultant score. Don’t forget, though, that although we’ve down-sampled the data, it was also down-sampled by an unknown amount by Criteo! Consequently, we’re unable to calibrate the model fully without additional information.
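To make the correction concrete, here’s a minimal sketch under the assumption that you know the down-sampling rate r you applied yourself (as noted, Criteo’s own rate is unknown, so this only corrects for your part of the down-sampling):

import math

def calibrated_probability(raw_log_odds, r):
    # raw_log_odds: the linear score from the model trained on down-sampled data
    # r: the fraction of negative examples kept during down-sampling
    corrected = raw_log_odds + math.log(r)   # add ln(r) back to the score
    return 1.0 / (1.0 + math.exp(-corrected))

# Example: a raw score of 0.5 from a model trained with 1 in 3 negatives kept
print(calibrated_probability(0.5, 1.0 / 3.0))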

In this section, we covered a great deal and discussed the basics of performing online ad-click prediction. Note that there’s a lot more to this problem than building a good model! In general, you must consider the competing needs of many different advertisers, their metrics for success, and your profit margins. Building an effective model is only the start, and many advertising-tech companies have entire teams of data scientists working on these problems. In the next section, we’ll cover just a few of the complexities you’d encounter when deploying a model such as this.

5.6. Complexities of building a decisioning engine

As the idiom says, the devil is in the details. In order to build an effective click-prediction engine, we must address various issues and answer various questions. In this section, we’ll pose some open questions for you to think about—these are very much still open research questions being addressed by the ad-tech machine-learning community at the time of writing. In the following paragraphs, we’ll often frame the discussion from an advertising perspective, but these problems are equally applicable to models in any domain.

First, let’s touch on the issue of training intervals and frequency. How often do we train a model, and how long does it remain valid? You might wish to build a model on day n and apply it to day n+1, but what if today’s model isn’t representative of tomorrow? You might train the model on a week of data, or build specific models for particular days of the week; these could provide better performance on a day-to-day basis, but they might not. You need a way to track or monitor the performance over time.

This leads us to the question of drift. If you build a model on day n and apply this to subsequent days, n+1,n+2, and so on, you’ll most likely see that the model drifts. In other words, the performance of the model decreases over time. This is because the underlying behavior of the population is changing. In the advertising world, publishers change their content, certain sites become unavailable, and even the weather changes! Ultimately, unless you’re learning in real time and updating your model with every data point as you see it, the model is out of date the moment it’s trained, and it will become increasingly out of date the longer it’s used.

This brings us to the question of a training pipeline. Remember, data is constantly flowing into our systems from millions of interactions on the web. As this happens, we need to capture, direct, store, preprocess, and process continuously, generating training datasets every single moment, hour, day, and week. This is a complex task, and it requires significant engineering effort. If you consider this book’s appendix, you should see that collecting and massaging your data is the first step in a pipeline of steps that make up a larger intelligent application. This pipeline may contain a data-ingestion step, a data-cleansing step, a down-sampling step, a training step, and a release step, each stage firing after the completion of the last and running over a fixed amount of data (say, an hour of data at a time).

5.7. The future of real-time prediction

In this chapter, we provided an example of a real-world problem that requires you to make decisions in real time from a huge amount of data. Click prediction is a specific problem in the world of advertising, but many of its solutions are easily applicable to other domains. I believe the future of real-time prediction will be improved in two important areas: through real-time data acquisition and processing and through adaptive and responsive models.

Taking the first in hand, we’ve already discussed the importance of a data pipeline—a way of acquiring, shaping, and processing data to get it ready for model training and deployment. To date, many systems built to achieve this task are based on batches of data moving from one step to the next, because most data-processing systems have been built in this manner! Only recently have stream-processing engines such as Storm, Storm Trident, and Spark Streaming become available, not to mention the exciting work on MillWheel[21] and FlumeJava[22] at Google. These systems have the potential to change the game for prediction by lowering the latency between an event being captured and a model being trained to incorporate this behavior. This low latency can be achieved because each data point can flow through the system without having to wait for the whole batch that contains it to complete.

21

Tyler Akidau et al., “MillWheel: Fault-Tolerant Stream Processing at Internet Scale,” The 39th International Conference on Very Large Data Bases (VLDB, 2013): 734–46.

22

Craig Chambers et al., “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” ACM SIGPLAN Conference on Programming Language Design and Implementation (ACM, 2010): 363–75.

This leads us to our second area of importance, adaptive and responsive modeling. By reducing the latency of data flowing around the system, data can be delivered to the model-training step much earlier. If we can then reduce the latency between this data point and a model reflecting this data point, we can act on events before we were otherwise able to. This might lead us away from batch-training methods to truly online learning—not just in the sense that models are updated sequentially in an out-of-core fashion, but in that they’re updated sequentially as the data arrives.

5.8. Summary

  • We’ve built an intelligent algorithm for online click prediction.
  • We provided a complete introduction to a very important application on the intelligent web. Few applications have received, and continue to receive, this level of attention from the ML community.
  • You learned about the use of an out-of-core machine-learning library called Vowpal Wabbit that allows training to occur without reading all the data into memory.
  • We discussed the future of real-time click prediction on the web.