List of Figures

Chapter 1. Building applications for the intelligent web

Figure 1.1. Overview of an intelligent algorithm. Such algorithms display intelligence because the decisions they make change depending on the data they’ve received.

Figure 1.2. Graphical overview of one aspect of the Google Now project. In order for Google Now to make predictions about your future locations, it uses a model about past locations, along with your current position. Priors are a way to distill initially known information into the system.

Figure 1.3. The intelligent-algorithm lifecycle

Figure 1.4. Taxonomy of intelligent algorithms

Figure 1.5. The machine-learning data flow. Data is used to train a model, which can then be applied to previously unseen data. In this case, the diagram illustrates classification, but the diagrams for clustering and regression are conceptually similar.

Figure 1.6. ROC curves for two theoretical classifiers. As we modify the algorithm parameter, we see a change in behavior against a given dataset. The closer the trace gets to touching the top-left corner of the diagram, the closer that classifier is to being ideal, because there TPR=1 and FPR=0 (equivalently, TNR=1, since TNR=1 – FPR).
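For readers who want to trace such a curve themselves, here is a minimal sketch using scikit-learn; the dataset and classifier are illustrative stand-ins, not the book's examples.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Illustrative stand-in data; the book's own dataset isn't reproduced here.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # P(class = 1) for each test point

# roc_curve sweeps the decision threshold, which is what "modifying the
# algorithm parameter" traces out in the figure.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", auc(fpr, tpr))                    # 1.0 would be the ideal classifier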

Chapter 2. Extracting structure from data: clustering and transforming your data

Figure 2.1. Visualizing structure in data. The x- and y-axis values are arbitrary and unimportant. The plot shows three different classes of data, each colored differently. You can see that the x and y values cluster around certain areas of the (x, y) plot depending on their class. You can also see that higher values of x correlate with higher values of y regardless of class. Each of these properties relates to the structure of the data, and we'll cover each in turn in the remainder of this chapter.

Figure 2.2. The curse of dimensionality: as the number of dimensions grows, every point tends to be nearly the same distance from every other point.

Figure 2.3. A number line containing two possible clusters in a single dimension. An entry indicates the existence of that number in the dataset. Data points represented by the same shape indicate one possible intuitive clustering of the data.

Figure 2.4. Initial assignment of the means of the k clusters. The mean of cluster k1 is denoted by a triangle, and a circle denotes the mean of cluster k2.

Figure 2.5. Initial assignment of data points to clusters. Those data points assigned to cluster k1 are shaded green (diamonds), and those assigned to cluster k2 are shaded red (squares).

Figure 2.6. Updated means given the new cluster assignment. Cluster centroids have been updated to better represent the data points that have been assigned to them.

Figure 2.7. Final assignment of the data points to the clusters. Final cluster centroids are given by the triangle and circle, respectively. Data points below 6.8 have been assigned to cluster k1, and data points above this have been assigned to cluster k2. Additional iterations of the k-means algorithm won’t change this clustering.
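The iteration that figures 2.4–2.7 walk through can be written in a few lines. This is a hand-rolled 1-D sketch with invented data points and initial means, not the book's listing.

import numpy as np

data = np.array([1.0, 1.5, 2.0, 2.5, 7.0, 7.5, 8.0, 9.0])  # invented 1-D points
means = np.array([1.0, 2.0])               # initial guesses for the k=2 centroids

for _ in range(10):
    # Assignment step: attach each point to its nearest centroid.
    labels = np.argmin(np.abs(data[:, None] - means[None, :]), axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    new_means = np.array([data[labels == k].mean() for k in range(2)])
    if np.allclose(new_means, means):      # converged; further iterations change nothing
        break
    means = new_means

print(means, labels)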

Figure 2.8. K-means clustering of the Iris dataset. This is achieved through the execution of listing 2.5. Note that k-means represents the centroids of the clusters in n-dimensional space, where n is the number of features (four, in the case of the Iris dataset). For visualization purposes, we plot all combinations of features against each other.

Figure 2.9. Histogram of a normal distribution, in this case the height of 334 fictitious people. The modal (most frequently occurring) bin is centered at 180 cm.

Figure 2.10. Fictional data on the height of 334 people. The sample probability of a given person falling in a given height range is shown in red (the bar on the left in each pair). The probability from a Gaussian model, with parameters μ=180, σ=28, is shown in green (the bar on the right).
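A sketch of the comparison the figure makes, using the caption's parameters (334 people, μ=180, σ=28); the raw heights and the 10 cm bucket width are assumptions, since the book's data isn't reproduced here.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(180, 28, size=334)       # synthetic stand-in for the book's data

edges = np.arange(100, 261, 10)               # assumed 10 cm buckets
counts, _ = np.histogram(heights, bins=edges)
sample_p = counts / counts.sum()              # sample probability per bucket (red bars)

model = norm(loc=180, scale=28)
# Model probability of landing in each bucket: CDF difference at the edges (green bars).
model_p = model.cdf(edges[1:]) - model.cdf(edges[:-1])

for lo, sp, mp in zip(edges[:-1], sample_p, model_p):
    print(f"{lo:3.0f}-{lo + 10:3.0f} cm  sample={sp:.3f}  model={mp:.3f}")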

Figure 2.11. Probability distributions of height for men and women. Note that these probabilities are conditioned on the fact that gender is known: so, for example, given that we know a particular person is a woman, her probability of having a height in a particular bucket can be read off the y-axis.

Figure 2.12. Output from listing 2.6. Each panel shows the 4-D Gaussian clustering of the Iris dataset mapped onto two of the axes. The colored clusterings are representations of the underlying four-dimensional Gaussian in two dimensions only.

Figure 2.13. The two principal components of a two-dimensional feature set. The eigenvector u has the largest eigenvalue because it captures the direction of greatest variance. The eigenvector v has a smaller eigenvalue by comparison. The two vectors are orthogonal: that is, at 90 degrees to each other. In general, for higher dimensions, the dot product of any pair of eigenvectors will equal zero.

Figure 2.14. Eigenvector decomposition of the Iris dataset using the first two eigenvectors. The x-axis shows the direction in which the data has maximal variance. This is the first eigenvector, and its corresponding eigenvalue is maximal relative to all solutions given this data. The y-axis represents the second eigenvector, and its corresponding eigenvalue has the second-largest value of all solutions given this data. Points in the scatter plot are the original data transformed into the new basis defined by these first two eigenvectors.
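One way to produce such a projection is with scikit-learn's PCA, which performs the same eigendecomposition; whether this matches the book's listing exactly is an assumption.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)                  # keep the first two eigenvectors
projected = pca.fit_transform(iris.data)   # each row: coordinates in the new basis

# The first component captures the most variance, the second the next most.
print(pca.explained_variance_ratio_)       # roughly [0.92, 0.05] for Iris
print(projected[:5])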

Chapter 3. Recommending relevant content

Figure 3.1. The similarity between two users can be measured by evaluating the extent of overlap (if any) between the two lines in plots like this. Thus, Frank and Constantine (top) are more similar than Frank and Catherine (bottom).

Figure 3.2. Naïve similarity curves as functions of the Euclidean distance
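A naïve similarity can be any function that decays with Euclidean distance; the particular form below (1 divided by 1 plus the distance) is an assumption for illustration.

import numpy as np

def naive_similarity(a, b):
    d = np.linalg.norm(np.asarray(a) - np.asarray(b))  # Euclidean distance
    return 1.0 / (1.0 + d)    # 1.0 for identical ratings, decaying toward 0

print(naive_similarity([4, 5, 5], [4, 5, 5]))   # 1.0: identical users
print(naive_similarity([4, 5, 5], [1, 2, 3]))   # smaller: dissimilar users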

Figure 3.3. The number of films that have been rated by user 10 as 4 or higher. These are broken out by genre, so user 10 has rated more than 40 comedy films with a 4 or a 5. Only the top 19 genres are displayed here.

Figure 3.4. The number of films that user 10 has rated lower than 3, broken out by genre. In this case, the user has rated three comedy films and two action/sci-fi films below 3 and has watched films from several other categories, rating them with a score of 1.

Figure 3.5. Number of ratings >3 aggregated by genre over all users in the set for whom we are to recommend Star Wars: The Phantom Menace

Chapter 4. Classification: placing things where they belong

Figure 4.1. The basic elements of a reference structure (a rudimentary ontology). Ellipses denote concepts, whereas rounded rectangles denote attributes. Rectangles denote instances. Concepts inherit attributes from above.

Figure 4.2. An overview of classification algorithms based on their design

Figure 4.3. The lifecycle of a classifier: training, validation (or testing), and use in production

Figure 4.4. Graphical illustration of a linear model. The data points show a clear linear relationship between x and y. The straight line is the line of best fit, modeled by y=mx+c.

Figure 4.5. Calculating the residuals, or the distance between the model and the data. In this illustration, we use the vertical distance in y as a measurement of the error, but other methods are available.
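A sketch of fitting y=mx+c by least squares and reading off the vertical residuals; the data is synthetic, as the book's points aren't reproduced here.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)   # noisy linear data

m, c = np.polyfit(x, y, 1)         # slope and intercept of the line of best fit
residuals = y - (m * x + c)        # vertical distance from each point to the line

print(f"m={m:.2f}, c={c:.2f}, sum of squared residuals={np.sum(residuals ** 2):.2f}")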

Figure 4.6. The response of y with x for a logistic curve. This particular curve is centered on x=10 and has been plotted using 20 samples uniformly distributed on x. Note the larger differences in y for values of x around 10 and much smaller differences in y around x=0 and x=20.
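The curve can be regenerated in a couple of lines; the unit steepness used below is an assumption.

import numpy as np

x = np.linspace(0, 20, 20)                # 20 samples uniformly distributed on x
y = 1.0 / (1.0 + np.exp(-(x - 10)))       # logistic response centered on x=10

# Differences in y are largest near x=10 and negligible near x=0 and x=20.
print(np.round(np.diff(y), 3))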

Figure 4.7. Data for fraud detection

Figure 4.8. Output of the model plotted against the test-data ground truth. Data labels may take only the values 1 or 0, indicating fraud and not fraud, respectively. Log-probability takes values from -∞ to 0.0 (left graph). Exponentiating the log-probability gives our estimated probability of fraud, which lies in the range 0.0 to 1.0. We note that most cases are well classified: the probability of fraud output by the model (right) is near the value of the label (where fraud is given by a label of 1).
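The relationship between the two panels is just exponentiation. Below is a sketch with scikit-learn, whose classifiers expose log-probabilities directly; the stand-in data is not the book's fraud set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for fraud data
clf = LogisticRegression().fit(X, y)

log_p = clf.predict_log_proba(X[:5])[:, 1]   # values in (-inf, 0.0], as in the left panel
p = np.exp(log_p)                            # back to [0.0, 1.0], as in the right panel
print(log_p, p)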

Chapter 5. Case study: click prediction for online advertising

Figure 5.1. The online advertising ecosystem. Publishers make their content (advertising slots) available through an exchange that allows access programmatically—that is, through an intelligent algorithm. A demand-side platform aggregates demand for content from advertisers and buys on their behalf.

Figure 5.2. A graphical overview of the steps involved in buying ad placements on an exchange. Events should be read from top to bottom. First, a cookie match occurs. This can be performed either exchange side or DSP side. Once the DSP has a unique identifier that it can use, it uses this to look up additional information about the user and construct a bid price (the subject of section 5.4.5 onward). If the bid is successful, bid notification is returned for logging by the DSP. The next stage is for the DSP to specify the ad to be served. Physically, this can be served by the exchange itself or by the DSP. Events in the ad are fired directly back to the DSP regardless of where the ad is hosted.

Figure 5.3. The bidder consists of an outer shell, responsible for performing the crucial tasks associated with bidding, and a decision engine, responsible for taking into account all data associated with the bid.

Figure 5.4. Overview of the raw data provided by Criteo. The first field relates to the event we’re trying to predict. This relates to the user performing some event, such as clicking the ad. The next 13 columns are integer properties with count semantics; a higher number means some event has been performed more times than if a lower number is seen. The next 26 features are categorical and most likely represent some property of the bid request: for example, the site from which the bid request was made.

Figure 5.5. The receiver operating characteristic for our model based on the supplied test data. The area under the curve is given as 0.76.

Chapter 6. Deep learning and neural networks

Figure 6.1. Visualizing a deep network for recognizing cars. Some graphical content reproduced from Andrew Ng’s talk on the subject, cited previously. A base training set of pictures is used to create a basis of edges. These edges can be combined to detect parts of cars, and these parts of cars can be combined to detect an object type, which is in this case a car.

Figure 6.2. On the left, we provide a schematic of a biological human neuron. On the right, we show a biologically inspired artificial neuron implemented using weighted summation, an inhibitory input, and a threshold.

Figure 6.3. The MCP (McCulloch-Pitts) neuron as a 2-D linear model without inhibitory input. The weights of the model correspond to the coefficients of a linear model. In this case, our neuron supports only a single input, and the weight has been set to 1 for illustration. Given a threshold of 0, all inputs with a value less than or equal to 0 would inhibit the neuron from firing, whereas all inputs with a value greater than 0 would cause the neuron to fire.

Figure 6.4. The perceptron. Inputs x1 through xn are received and multiplied by their associated weight, with perceptron bias, w0, being added in afterward. This output, given by a, is then passed through a threshold function to obtain the output.
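The forward computation the figure describes fits in a few lines; the weights below are assumed for illustration.

import numpy as np

def perceptron(x, w, w0):
    a = np.dot(w, x) + w0        # weighted summation plus the bias w0
    return 1 if a > 0 else 0     # hard threshold on the output

w = np.array([0.5, -0.4])        # assumed weights for a two-input perceptron
print(perceptron(np.array([1.0, 0.2]), w, w0=-0.1))  # fires: a = 0.32 > 0
print(perceptron(np.array([0.1, 1.0]), w, w0=-0.1))  # silent: a = -0.45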

Figure 6.5. Graphical overview of the data for a single perceptron. Data with class label 1 is represented by a round dot, whereas data with class label 0 is represented by a star. It’s the aim of the perceptron to separate these points.

Figure 6.6. The projection of our separating plane onto the viewing plane (y = 0). All points to the top right of the figure satisfy the constraint w · x > 0, whereas points to the bottom left of the line satisfy the constraint w · x < 0.

Figure 6.7. A multilayer perceptron. Read from top to bottom, it consists of an input layer (vector of values x), a number of hidden layers, and an output layer that returns the vector y.

Figure 6.8. Graphical representation of the XOR function. Positive classes are specified with circles, whereas negative classes are specified with stars. No single hyperplane can separate these data points into two sets, so we say that the dataset isn't linearly separable.

Figure 6.9. A two-layer perceptron can separate a nonlinearly separable function (XOR). Values on the connections demonstrate the weight of those connections. The introduction of the bias term ensures correct operation with an activation threshold of 0. Conceptually, the two hidden neurons correspond to two hyperplanes. The final combining perceptron is equivalent to a logical AND on the output of the two hidden neurons. This can be thought of as picking out the areas of the (x1,x2) plane for which the two hidden neurons activate together.

Figure 6.10. Separating a nonlinearly separable dataset using the neural network given in figure 6.9. Notice that where it intersects the viewing plane, the left-hand hidden neuron of figure 6.9 creates the bottom line, whereas the right-hand neuron creates the top line. The left-hand neuron fires an output of 1 only when the input is above the bottom line, whereas the right-hand neuron fires an output of 1 only when the input is below the top line. The final combining neuron fires y1=1 only when both of these constraints are satisfied. Thus, the network outputs 1 only when the data points lie in the narrow corridor between the bottom and top lines.
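One weight assignment that realizes this construction (not necessarily the weights printed in figure 6.9) is sketched below: two hidden threshold neurons define the bottom and top lines, and an AND-like combiner picks out the corridor between them.

def step(a):
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # fires when the input is above the bottom line
    h2 = step(1.5 - (x1 + x2))   # fires when the input is below the top line
    return step(h1 + h2 - 1.5)   # AND: output 1 only inside the corridor

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0: the XOR function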

Figure 6.11. Output profiles for several activation functions over the same range as provided in figure 6.3. Activation profiles demonstrated are square root, logistic, negative exponential, and hyperbolic.

Figure 6.12. Overview of the backpropagation example. Given a set of inputs x1, x2 and target variable y1 that follow the XOR function over x1 and x2, can we learn the values of w that minimize the squared difference between the training values and the network output? In this example, we use the logistic activation function: y = 1/(1 + e^(-a)).
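A compact backpropagation sketch for this setup: logistic activations throughout and a squared-error loss. The architecture matches the figure, but the initialization, learning rate, and iteration count are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1 = rng.normal(size=(2, 2)); b1 = np.zeros((1, 2))  # two hidden units
W2 = rng.normal(size=(2, 1)); b2 = np.zeros((1, 1))  # one output unit
lr = 1.0                                             # assumed learning rate

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)              # forward pass, output layer
    d_out = (out - y) * out * (1 - out)     # chain rule on 0.5 * (out - y)^2
    d_h = (d_out @ W2.T) * h * (1 - h)      # error propagated back to hidden layer
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0, keepdims=True)

# Should approach [0, 1, 1, 0]; XOR nets can occasionally stall in a local minimum.
print(out.round(2))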

Figure 6.13. Graphical overview of a Restricted Boltzmann Machine. An RBM is a bipartite graph between hidden and visible units, with each element of each partition fully connected to other units in the opposing partition. We'll use h to refer to the vector composed of the hidden units and v to refer to the vector composed of the visible units. Note that each node may have an associated bias weight; these aren't represented here for simplicity.

Figure 6.14. A graphical representation of the weights between the hidden and visible units in our RBM. Each square represents a single hidden unit, and the 64 grayscale values within represent the weights from that hidden unit to all the visible units. In a sense, this dictates how well that hidden unit is able to recognize images like the one presented.
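One way to train an RBM whose per-hidden-unit weights can be rendered as 8x8 tiles like these is scikit-learn's BernoulliRBM on the 8x8 digits; the book's own implementation may differ.

from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

digits = load_digits()                   # 8x8 images, so 64 visible units
X = digits.data / 16.0                   # scale pixel values into [0, 1]

rbm = BernoulliRBM(n_components=25, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Each row of components_ holds one hidden unit's weights to all 64 visible units.
print(rbm.components_.shape)             # (25, 64)
tile = rbm.components_[0].reshape(8, 8)  # viewable as one grayscale square of the figure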

Chapter 7. Making the right choice

Figure 7.1. Formalizing the problem definition. In this situation, our actor is faced with a decision, and, depending on their choice, they reach an outcome. This outcome has an associated reward that is returned to the actor. Given that the user is exposed to this decision continually, how do they formulate a strategy to obtain the maximum reward (or, equivalently, the minimum cost)?

Figure 7.2. We're given fixed values for the conversion rates of the A/B groups, the relationship between the number of users in the A and B groups, and the z value. Assuming the conversion rates wouldn't change as we collected more data, we'd need around 3,000 users in each group to reach a confidence level of 70%. This rises to around 5,000 per group for 80%, 7,500 for 90%, and 12,000 for 95%.
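A back-of-the-envelope version of this calculation is sketched below; whether it is the exact formula behind the figure is an assumption, and the conversion rates used are placeholders rather than the figure's values.

from scipy.stats import norm

def n_per_group(p_a, p_b, confidence):
    z = norm.ppf(confidence)                   # z value for the desired confidence level
    var = p_a * (1 - p_a) + p_b * (1 - p_b)    # combined binomial variance
    return (z / (p_a - p_b)) ** 2 * var        # users needed in each group

for conf in (0.70, 0.80, 0.90, 0.95):
    print(conf, round(n_per_group(0.10, 0.11, conf)))   # required n grows sharply with confidence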

Figure 7.3. The multi-armed bandit problem. A user is faced with a number of one-armed bandits, each of which has a different probability of paying out. The user doesn’t know what these probabilities are and may only uncover them through playing the machines. What strategy should the user employ in order to maximize their return? The user must explore the space in order to determine the probabilities of paying out, but they must also exploit machines with high probabilities of paying out in order to maximize their return.

Figure 7.4. Modeling the knowledge about the payout rate of three bandits using the Bayesian bandit method. The mean payout rates for arms 1, 2, and 3 are 0.1, 0.3, and 0.4, respectively. Bandit 1 has a lower mean but a much wider variance. Bandit 2 has a higher mean and smaller variance. Bandit 3 has an even higher mean and a smaller variance. In order to choose the bandit to play, each distribution is sampled, and the bandit corresponding to the distribution from which the highest sample was drawn will be picked. After selection, the bandit is played and the respective distribution is updated. This has the effect that even bandits with a low payout rate have a chance to redeem themselves if their mean rate is uncertain (high variance).

Figure 7.5. Probability density function over the beliefs in payout probabilities for each bandit. Note that the three plots are identical and that the area under the curve is equal to 1. This illustrates our belief that all payout probabilities are equally likely for each bandit. This will change as we observe the bandits in action.
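The sampling loop behind figures 7.4–7.7, in sketch form: start from the uniform Beta(1,1) beliefs shown here, sample each posterior, play the arm with the highest draw, and update. The true payout rates are those quoted in figure 7.7.

import numpy as np

rng = np.random.default_rng(0)
true_p = [0.1, 0.3, 0.8]                   # unobservable payout probabilities (figure 7.7)
wins = np.ones(3); losses = np.ones(3)     # Beta(1,1): the uniform priors of this figure

for _ in range(1000):
    samples = rng.beta(wins, losses)       # one draw from each bandit's belief
    arm = int(np.argmax(samples))          # play the bandit with the highest draw
    reward = int(rng.random() < true_p[arm])
    wins[arm] += reward                    # posterior update: Beta(wins, losses)
    losses[arm] += 1 - reward

print(wins / (wins + losses))              # posterior means; the best arm's approaches 0.8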

Figure 7.6. Total expected regret in a single execution of 1,000 plays. Knowledge is acquired only through play (exploration). Consequently, it’s likely that suboptimal decisions will be made on the path to learning the true payout probabilities.

Figure 7.7. Posterior probability distributions after a single run of 1,000 plays. This represents our belief over the possible true distributions of the bandits, which have an unobservable true probability of payout equal to 0.1, 0.3, and 0.8, respectively.

Figure 7.8. Plotting the regret for 100 runs of a 1,000-play experiment. Due to the probabilistic nature of Bayesian bandits, the outcome isn’t guaranteed to be the same each time the experiment is run.

Figure 7.9. Average cumulative expected regret at each step, over 100 iterations. Note that the gradient decreases as the number of steps increases. This is because the Bayesian bandit makes comparatively better choices as the experiment continues. Consequently, it's less likely to incur additional regret.

Figure 7.10. Cumulative expected regret for five experiments, each using three bandits. The overall probabilities of payouts are given in the legend. You can see a tendency toward higher regret and longer exploration periods where the true payout probabilities of the bandits are closer together.

Figure 7.11. Cumulative expected regret for five experiments using scaled-down probabilities. Consider this graph in relation to figure 7.10. Note the inverse relationship and slower growth due to the reduced probabilities.

Figure 7.12. Cumulative expected regret for 10 experiments, each with an increasing number of bandits. As we increase the number of bandits, their probabilities of payout are spread equally through the space, excluding the zero-probability bandit.

Figure 7.13. A contextual bandit. The payout probability isn’t fixed and is dependent on the context or situation. One way to think about this is that the player has assigned to them a vector of attributes at a given time, and, depending on those attributes, the payout probabilities of the machine will change. The best strategy must minimize the cumulative regret with this in mind.

Figure 7.14. The adversarial bandit problem. In this variant of the MAB problem, solutions make no assumptions about the underlying distributions of the rewards. Instead, the problem is modeled as a game between a player and an adversary. At each time step, the adversary selects a reward vector before the player chooses an arm. In one variant of the game (full information), the player gets to see the full reward vector; in another (partial information), the player sees only the reward of their chosen bandit.

Appendix. Capturing data on the web

Figure A.1. Overview of a demand-side platform. DSPs use their knowledge of the user to provide better interactions. Consequently, they can arbitrage content bought on a CPM basis, selling this on a CPC basis.

Figure A.2. Overview of the log-processing architecture of a typical DSP. Web page interactions are recorded by the browser and sent back to the DSP’s server(s) via HTTP. Depending on the interaction type, these will be written to different files and stored in different locations. Over time, these files are collated and imported to a database for downstream processing.

Figure A.3. Anatomy of a topic, taken from the Kafka documentation. Topics consist of a partitioned log, which can only be appended to. Records within a partition are immutable: once committed, they can't be changed.

Figure A.4. The different levels of acknowledgment in Kafka. From left to right, they offer increasing degrees of durability at the cost of acknowledgment latency.

Figure A.5. Kafka replication, reproduced from Rao. In this example, we have four brokers and one topic with two partitions. Broker 1 leads partition 1, and broker 4 leads partition 2.

Figure A.6. Our final Kafka cluster. In this example, we use a low-level producer to perform our own partitioning of the stream of data. This cluster operates with three brokers, three partitions, and a replication factor of 3. Two consumer groups exist: one with a single consumer, called group-1, and another with three consumers, known as group-2.
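Below is a sketch of GUID-keyed production with the kafka-python client, which achieves per-user partitioning via the default key hash rather than the custom partitioner described above; the broker address and topic name are placeholders.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # placeholder broker address
    acks="all",                           # wait for the full in-sync replica set
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"guid": "user-1234", "type": "click"}
# The same key always hashes to the same partition, so all of a user's
# events stay together and per-user ordering is preserved.
producer.send("clicks", key=event["guid"].encode("utf-8"), value=event)
producer.flush()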

Figure A.7. In this example, a consumer group of spouts consumes from a Kafka topic that has been partitioned by GUID. This ensures that each consumer spout receives all the data for a given user. Through Storm's field grouping, we can also partition data on GUID as we send it to one of several deduplication bolts that operate on the incoming clicks. Field grouping again on GUID, we can send the results to one of several producers that write them back into a new, deduplicated topic, again partitioned by GUID.

Figure A.8. Overview of Kafka plus Hadoop integration. Camus is used to generate MapReduce jobs, which extract data from Kafka in parallel, placing the data in HDFS. Camus jobs must be executed periodically; hence the scheduler. High-level languages such as Hive and Pig can then be used to query the imported data.
