A.3. Statistics and machine learning

A.3.1. Statistics terms

Explain the terms mean, median, and mode to an eight-year-old.

Allan Butler

Example answer

Mean, median, and mode are three different types of averages. Averages let us understand something about a whole set of numbers with just one number that summarizes something about the whole set.

Suppose that we did a poll of your class to see how many siblings each person has. You have five people in your class, and let’s say you find that one person has no siblings, one has one, one has two, and two have five.

The mode is the most common number of siblings. In this case, that’s 5: two people have five siblings, while each other number of siblings belongs to only one person.

To get the mean, you get the total number of siblings and divide that by the number of people. In this case, we add 0 + 1 + 2 + 5 + 5 = 13. You have five people in the class, so the mean is 13/5 = 2.6.

The median is the number in the middle if you line them up from smallest to largest. We’d make the line 0, 1, 2, 5, 5. The third number is in the middle, and in our case, that means the median is two.

We see that the three types of averages come up with different numbers. When do you want to use one instead of the other? The mean is the most common, but the median is helpful if you have outliers. Suppose that one person had 1,000 siblings! Suddenly, your mean gets much bigger, but it doesn’t really represent the number of siblings most people have. On the other hand, the median stays the same.

Notes

It’s unlikely that someone interviewing for a data science position won’t know about the different types of averages, so this question is really testing your communication skills rather than whether you get the definitions right (although getting them wrong is a red flag). In our answer, we used a simple scenario that an eight-year-old might encounter in real life. We recommend keeping the number of data points small; you don’t want to get tripped up doing the math for the mean or median because you’re trying to calculate them for 50 data points. If there’s a whiteboard in the room, it might be helpful to write out the numbers to keep track of them. As a bonus, you can add, as we did, when you might want to use one type of average instead of another.
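
If you want to double-check the arithmetic in the example answer, here’s a minimal sketch in R. R has built-in mean() and median() functions but no built-in mode, so we tabulate the values ourselves:

siblings <- c(0, 1, 2, 5, 5)
mean(siblings)                                 # 2.6
median(siblings)                               # 2
counts <- table(siblings)                      # how often each value appears
as.numeric(names(counts)[which.max(counts)])   # 5, the mode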

A.3.2. Explain p-value

Can you explain to me what a p-value is and how it’s used?

Example answer

Imagine that you were flipping a coin and got 26 heads out of 50. Would you conclude that the coin wasn’t fair because you didn’t get exactly 25 heads? No! You understand that randomness is at play. But what if the coin came up heads 33 times? How do we decide what the threshold is for concluding that it’s not a fair coin?

This is where the p-value comes in. A p-value is the probability that, if the null hypothesis is true, we’d see a result as or more extreme than the one we got. A null hypothesis is our default assumption coming in, such as no differences between two groups, that we’re trying to disprove. In our case, the null hypothesis is that the coin is fair.

Because a p-value is a probability, it’s always between 0 and 1. The p-value is essentially a representation of how shocked we would be by a result if our null hypothesis is true. We can use a statistical test to calculate the probability that, if we were flipping a fair coin, we would get 33 or more heads or tails (both being results that are as extreme as the one we got). It turns out that probability, the p-value, is .034. By convention, people use .05 as the threshold for rejecting the null hypothesis. In this case, we would reject the hypothesis that the coin is fair.

With a p-value threshold of .05, we’re accepting that 5% of the time, when the null hypothesis is true, we’re still going to reject it. This is our false-positive rate: the rate of rejecting the null hypothesis when it’s actually true.
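
As a sanity check on that number, here’s a minimal sketch in R; the exact binomial test gives a p-value of roughly .03, close to the .034 from the approximation quoted above:

# Two-sided test: is 33 heads out of 50 flips consistent with a fair coin?
binom.test(x = 33, n = 50, p = 0.5)
# The same tail probability by hand: P(X >= 33) + P(X <= 17) for Binomial(50, 0.5)
pbinom(32, 50, 0.5, lower.tail = FALSE) + pbinom(17, 50, 0.5)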

Notes

This question is testing whether you both understand what a p-value is and can communicate the definition effectively. There are common misconceptions about the p-value, such as that it’s the probability that a result is a false positive. Unlike the averages question in the preceding section, it’s possible for someone to get this wrong. On the communication side, we recommend using an example to guide the explanation. Data scientists need to be able to communicate with a wide variety of stakeholders, some of whom have never heard of p-values and some of whom think they understand what they are but don’t. You want to show that you both understand p-values and can share that understanding with others.

A.3.3. Explain a confusion matrix

What’s a confusion matrix? What might you use it for?

Example answer

A confusion matrix lets you see, for a given model, how your predictions compare with the actual results. It’s a 2x2 grid with four parts: the number of true positives, false positives, true negatives, and false negatives. From a confusion matrix, you can calculate different metrics, such as accuracy (the percentage classified correctly as true positive or true negative) and sensitivity, otherwise known as the true positive rate: the percentage of actual positives correctly classified as such. Confusion matrixes are used in supervised learning problems in which you’re classifying or predicting an outcome, such as whether a flight will be late or whether a picture is of a cat or a dog. Let me draw an example one for the flight outcomes.

 

                    Actual late   Actual on-time
Predicted late           60             15
Predicted on-time        30            120

In this case, 60 flights that were predicted to be late actually were, but 30 predicted to be on time were actually late. That means our true positive rate is 60 / (60 + 30) = 2/3.

Seeing the confusion matrix instead of a single metric can help you understand your model performance better. Let’s say that for a different problem you just calculated the accuracy, for example, and found that you have 97% accuracy. That sounds great, but it could turn out that 97% of flights are on time. If the model simply predicted that every flight is on time, it would have 97% accuracy, as all the on-time ones are classified correctly, but the model would be totally useless!
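
As a minimal sketch, here’s how you could compute those two metrics in R from the counts in the matrix above:

# Counts from the flight example
tp <- 60    # predicted late, actually late
fp <- 15    # predicted late, actually on time
fn <- 30    # predicted on time, actually late
tn <- 120   # predicted on time, actually on time
(tp + tn) / (tp + fp + fn + tn)   # accuracy: 0.8
tp / (tp + fn)                    # sensitivity (true positive rate): 2/3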

Notes

This question tests whether you’re familiar with supervised learning models. It also tests whether you know different ways of evaluating the performance of models. In our answer, we shared two metrics that you could calculate from a confusion matrix, showing that you understand how it could be used, as well as a case in which seeing the whole matrix instead of just one metric is useful.

A.3.4. Interpreting regression models

How would you interpret these two regression model outputs, given the input data and model? This model is on a dataset of 150 observations of 3 species of flowers: setosa, versicolor, and virginica. For each flower, the sepal length, sepal width, petal length, and petal width are recorded. The model is a linear regression predicting the sepal length from the other four variables.

Input data to the model
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
Model call
model <- lm(Sepal.Length ~ ., iris)
Output 1
term              estimate std.error statistic  p.value
<chr>                <dbl>     <dbl>     <dbl>    <dbl>
(Intercept)          2.17     0.280       7.76 1.43e-12
Sepal.Width          0.496    0.0861      5.76 4.87e- 8
Petal.Length         0.829    0.0685     12.1  1.07e-23
Petal.Width         -0.315    0.151      -2.08 3.89e- 2
Speciesversicolor   -0.724    0.240      -3.01 3.06e- 3
Speciesvirginica    -1.02     0.334      -3.07 2.58e- 3
Output 2
variable          value
<chr>             <dbl>
r.squared         0.867
adj.r.squared     0.863
sigma             0.307
statistic           188
p.value        2.67e-61
df                    6
logLik            -32.6
AIC                79.1
BIC                 100
deviance           13.6
df.residual         144
Example answer

Looking at the summary, this seems to be a very good model; the R-squared is 0.867, meaning that the predictors explain 86.7 percent of the variance in sepal length. The predictors are all significant at the p less than .05 level. I see that the wider the sepal and the longer the petal, the longer the sepal, whereas wider petals are actually associated with shorter sepals. Both the versicolor and virginica species have negative coefficients, which means that we’d predict those species to have a smaller sepal length than the setosa species.

Suppose that we found a new flower, with a sepal width of 1, petal length of 2, petal width of 1, and that it was the virginica species. Our model would predict the sepal length to be the following: 2.17 + .496 * 1 + .829 * 2 – .315 * 1 – 1.02, which is about 3. Before using this model, though, I’d want to look at a few more diagnostics, such as whether the residuals are normally distributed, and I’d want to find a test set to see how it performs out of sample to make sure it’s not overfitting.
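
A quick way to check that calculation is to refit the model call shown above and use predict(); the new flower here is the hypothetical one from the answer:

model <- lm(Sepal.Length ~ ., iris)
new_flower <- data.frame(Sepal.Width = 1, Petal.Length = 2,
                         Petal.Width = 1, Species = "virginica")
predict(model, new_flower)   # about 3, matching the hand calculation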

Notes

The interviewer is looking for multiple things, and you can get points depending on how many you get right. In this case, the interviewer is checking whether you understand the model statistics (such as R-squared), as well as the estimates and their associated p-values. Although this information wasn’t explicitly asked for, in our answer, we added how we would use this model to predict the sepal length of a new flower. Finally, we added some information about the model we’d want to know before we started using it. This type of open-ended question is a good opportunity to hit what the interviewer is probably looking for and to add bonus information. Avoid trying too hard and ending up spending 20 minutes on a single question; show that you understand as many concepts as you can and then move on.

A.3.5. What is boosting?

What does the term boosting mean when referring to machine learning algorithms?

Example answer

Boosting refers to a whole class of machine learning algorithms built on combining many weak models into a single strong one. The idea is to train a weak model on the data, look for places where the model made errors, and train a second model of the same type that weights the data points where there were errors more heavily, hoping that the second model will fix some of the mistakes of the first. You repeat this process again and again until you hit some limit on the number of models. Then you use all of these models together to make the prediction. By having a large set of models, you’ll get a more accurate result than if you’d used a single model.

One very popular implementation of a boosting method is XGBoost, which is used heavily in both R and Python.
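
As a minimal sketch of what that looks like in practice, assuming the xgboost R package’s classic interface, with the iris data used earlier in this appendix recast as a made-up binary problem:

library(xgboost)
x <- as.matrix(iris[, 1:4])                 # flower measurements
y <- as.numeric(iris$Species == "setosa")   # made-up binary target
# Each round fits a new weak tree to the examples the previous trees got wrong
fit <- xgboost(data = x, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)
head(predict(fit, x))                       # predicted probabilities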

Notes

Boosting is an uncommon-enough term that it’s entirely possible that someone with a basic data science background won’t know exactly what it means. Thus, this question is more a test of seniority than of basic data science expertise. The question is also a little bit academic; you can imagine someone using XGBoost successfully in their code for years without thinking too deeply about how it works. This question is more “It’s nice to get it right, but not the end of the world if you don’t” than “If you don’t get this question right, you’re unlikely to get the job.”

A.3.6. Favorite algorithm

What’s your favorite machine learning algorithm? All right, can you explain it to me?

Jeroen Janssens

Example answer

My favorite machine learning algorithm is a recurrent neural network. I have been doing a lot of work with natural language processing lately, and recurrent neural networks are great models for classifying text quickly.

Do you know what a linear regression is? A neural network is like a linear regression, except that you have groups of linear regressions, and the output of one group is the input to the next. By tying all these linear regressions together into layers of models, you can make predictions much more accurately.

A recurrent neural network is a special case of a neural network that’s tuned for data that falls in sequences. When doing natural language processing on a block of text, the output partway through a sequence of words becomes an input for the model of the words that follow.
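
To make the layers-of-linear-regressions idea concrete, here’s a tiny base-R sketch of one forward pass through a two-layer network, with made-up weights and bias terms omitted for brevity. (One detail the analogy skips: a nonlinear function such as ReLU is applied between layers; without it, stacked linear regressions would collapse into a single linear regression.)

x  <- c(1.0, 2.0)                  # two input features
W1 <- matrix(rnorm(6), nrow = 3)   # layer 1: three "linear regressions"
h  <- pmax(W1 %*% x, 0)            # their outputs, passed through ReLU
W2 <- matrix(rnorm(3), nrow = 1)   # layer 2: one linear regression on h
W2 %*% h                           # the final prediction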

Notes

This question is one of the many you might get during an interview that are designed to see whether you can explain a complex idea in a simple way. What algorithm you choose for your answer isn’t nearly as important as being able to express how it works clearly. That said, this question is a great opportunity to highlight interesting past work you’ve done by expressing an algorithm that relates to the work and talking about it.

A.3.7. Training vs. test data

What is training data, and what is test data? What is your general strategy for creating these datasets?

Example answer

Training data is data that is used to train a machine learning model. Test data is data that is not used in training a machine learning model; instead, it is used to validate how well the model works. These datasets need to be separate because if data is used to train a model, the model can learn the correct result for the data and will be artificially good at fitting to it.

There are many ways to split training data and test data. My general approach is to take a small random sample, such as 10%, at the beginning of an analysis and use that as my test data for all my models while the other 90% is training data. When I’ve found a model I like that performs well enough, I retrain the model on all the data (both training and test) to get the most accurate model to deploy to production.
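
A minimal sketch of that 10% split in R, using the iris data from earlier as a stand-in for a real dataset:

set.seed(42)                       # make the split reproducible
test_rows  <- sample(nrow(iris), size = round(0.1 * nrow(iris)))
test_data  <- iris[test_rows, ]    # 10% held out for evaluation
train_data <- iris[-test_rows, ]   # 90% used to fit the model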

Notes

It’s really important to have a good explanation of the difference between training and test data, because understanding the distinction and how to think about it is a fundamental part of creating a machine learning model. That being said, there are many valid strategies for splitting your data. Besides random sampling, for example, you could use cross-validation to avoid biasing your model while training it on more data. So long as you have a logical explanation of why you chose a method, you should be good.

A.3.8. Feature selection

How would you do feature selection if you had 1,000 covariates and had to reduce them to 20?

Alex Hayes

Example answer

There are several different ways to do this. One possible solution, in the case of a prediction problem, is to use a lasso regression. A lasso regression is a special type of linear regression that applies a penalty to the size of the coefficients. By increasing the penalty term, you can force more and more coefficients to zero until only the 20 most important covariates remain. In this way, the model itself selects which covariates to keep. Although a lasso regression has a lower accuracy score on the training data than a linear regression with all the covariates, it has the benefit of using only a small number of them and may perform better on the test data, as lasso reduces the likelihood of overfitting.

You could also use a dimensionality reduction technique like principal component analysis (PCA) to reduce the dimensionality of the problem from 1,000 to 20. The lasso approach chooses 20 features out of the existing 1,000; PCA instead creates 20 new features that try to capture as much of the variation in the 1,000 as they can.
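
Here’s a minimal sketch of the lasso approach with the glmnet R package; X (a matrix with 1,000 covariate columns) and y (the outcome vector) are hypothetical stand-ins:

library(glmnet)
fit <- glmnet(X, y, alpha = 1)   # alpha = 1 requests the lasso penalty
# fit$df holds the number of nonzero coefficients at each penalty value;
# take the smallest penalty that still keeps at most 20 covariates
lambda_20 <- min(fit$lambda[fit$df <= 20])
coef(fit, s = lambda_20)         # the surviving coefficients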

Notes

There are many possible solutions to this question. One more is to use stepwise selection, removing covariates one at a time until you’re down to 20. You could even take many samples of sets of 20 features and choose the set that works best. Think of this question as less of a test of knowing the right approach to a problem and more a test of being able to show that you could find a solution if you faced this problem. This question is a test to make sure you won’t get stuck on the job. Can you think of anything you’d want to try? If so, great; you can go try it. If not, you may struggle when working on your own.

If you give multiple answers, be prepared to answer the follow-up question: when would you use one instead of the other? This question is a way to check whether you understand the techniques or just picked them because someone told you to use them. In this case, you could answer that there’s a trade-off between interpretability and capturing variability: lasso is easily interpretable, but PCA captures as much variability as possible. Which one you choose depends on what you’re looking to achieve with the analysis.

A.3.9. Deploying a new model

You developed a new model that performs better than your old model currently in production. How do you determine whether you should switch the model in production? How do you go about it?

Emily Spahn

Example answer

For me, the answer depends on a couple of factors in the environment. First, by what metric does the new model do better? Assuming that it’s overall accuracy, I’d check whether the model is sufficiently better that it’s worth swapping out the old one. If it’s only a percentage point better in accuracy, it may not be worth the effort of changing, because the effect might be negligible. Next, is there a risk in disrupting the current model? If the old model was deployed through a well-maintained pipeline with clear logging and testing, I’d probably make the swap, but if it was deployed by hand into a production system by a person who is no longer at the company, I’d probably hold off.

Finally, is there a way to A/B test the model first? Ideally, I’d like to have the old and new models run in parallel so that I could test for any problems with the new model or edge cases missed by it. No test system can cover everything from production, so being able to have it running for a select set of customers or inputs first would be ideal.

Notes

Deploying a model is often a labor-intensive and risky proposition for a company. This question determines whether you understand what that’s like and how you would approach the situation. A more-junior data scientist or machine learning engineer may feel that the right choice is to deploy the most accurate model as quickly as possible, but there are risks that need to be managed. If you have any experiences you can draw on (such as model deployments failing), this question is a great place to mention them. If you haven’t, that’s totally fine; just try to describe what you think might go wrong.

A.3.10. Model behavior

Given a model you developed, how would you design a metric to evaluate it from the end user’s perspective? How would you decide what errors are acceptable?

Tereza Iofciu and Bertil Hatt

Example answer

Standard model metrics like R-squared or accuracy can miss the end-user or business perspective. A classification model could be right 99% of the time, but the 1% of the time it’s wrong, it’s such a problem for the business that the model would never be used.

I find that the best way to evaluate a model is to try running an experiment with it. If I’m creating a model to cluster customers into segments, for example, I would present the clusters to marketing and have them try to do a test run of custom marketing to a sample set of customers from the different segments. I would compare how well the marketing performs with and without the customers segmented, and if there is a meaningful improvement, the model is a success. That’s totally different from using metrics about the model itself, such as how effectively it performs the segmentation, because those sorts of measures analyze only the model. Here, I’m actually analyzing how it performs compared with no model at all.

The downside of running an experiment with the model is that it’s often difficult to set up the experiment. Sometimes, you can’t split your customers into ones who get the model and ones who don’t. At other times, the effect of the model is so small that it wouldn’t show up in any KPIs that are easy to measure. But despite these difficulties, if it’s possible to run an experiment, that’s almost always the best approach.

Notes

This question is tricky because it’s very general, but to answer it, you need to talk about specifics. Your answer could vary dramatically for a predictive model versus an unsupervised model, or for work with marketing versus the operations department. You’ll want to talk a lot about the idea that statistical measures aren’t the same as the measures the business cares about; junior data scientists can get overly focused on maximizing the statistical measures and ignoring the business ones. But how you end up talking about these ideas is very open. As with many answers, if you can bring examples from your experiences, you can add a lot of depth.

A.3.11. Experimental design

(Question, answer, and notes by Ryan Williams)

You’re developing an app and want to determine whether a newly designed layout would be better than your current one. How would you structure a test to pick the better app layout?

Example answer

There are lots of different ways to answer the specifics of this question, but A/B tests generally follow this type of flow:

  1. Define what better means by picking the metric(s) you care about improving: active users, button clicks, impressions, and so on.
  2. Choose a null hypothesis based on your success metric, such as “Button clicks will be the same for all groups.” Use that hypothesis to run a power calculation, which will tell you how long you need to run the test to detect a change of a certain size (see the sketch after this list).
  3. Randomly split your population of app users into groups, and provide each group a different version of the app.
  4. After you’ve run the test for the length of time you decided on in step 2, evaluate whether you see a statistically significant difference between the two groups by using an appropriate statistical test (like a t-test).
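
For the power calculation in step 2, here’s a minimal sketch using power.prop.test from base R; the baseline click rate and the lift we hope to detect are made-up numbers:

# Users needed per group to detect a lift from a 10% to a 12% click rate
# at the conventional 5% significance level with 80% power
power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05, power = 0.80)
# Divide the resulting n by expected daily traffic to get the test duration
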
Notes

Questions like this one are common for data science roles on teams that are heavily involved in media measurement, app/web development, and so on. The interviewer usually just wants to know that you understand the purpose and general principles of A/B testing, especially for more-junior roles. Rather than getting bogged down in the specifics of stat testing (such as when to use a chi-square test instead of a t-test), we recommend sticking to a clear high-level approach when answering to demonstrate that you know how to design an experiment and determine causality.

A.3.12. Flaws in experimental design

(Question, answer, and notes by Ryan Williams)

Assume that you’ve done an A/B test to select a better app layout; what is a case in which you might not want to implement the new layout despite seeing a statistically significant improvement in the metric you’re testing?

Example answer

You wouldn’t want to implement the layout if you see it negatively affecting other important metrics (guardrail, or do-no-harm, metrics). An example might be a situation in which the metric you’re testing for is user click-throughs, and although you do see a significant improvement in click-throughs for users exposed to the new layout, you also see pages in the app taking longer to load in that layout. In this case, the degradation in app performance may not be worth the increase in click-throughs, because over time, the worse in-app experience may drive users away.

Notes

This question is very open-ended. What the interviewer wants to see is your recognition that just finding a low p-value isn’t always a good-enough reason to consider an experiment successful. It’s risky for a company to make changes to a live product like an app or website, and a single statistical test usually doesn’t encapsulate all the information needed to make the right decision. Some other reasonable answers to this type of question include an improvement that’s too small to justify the cost and risk of changing the app, or bias in the sampling/splitting methodology.

A.3.13. Bias in sampled data

(Question, answer, and notes by Ryan Williams)

What types of biases should you be aware of when using sample data? How can you tell whether a sample is biased?

Example answer

Many types of bias can affect sampled data. One of the most common biases in practical data science applications is selection bias (selecting your sample incorrectly). Selection bias can happen in scenarios such as selecting a random group of customers from a transaction-level table, which overrepresents customers with multiple transactions. Other types of common bias include survivorship bias (the sample overrepresents a group that made it past some preselection process) and voluntary response bias (the sample overrepresents a group that was more likely to volunteer information about themselves).

There are statistical methods that you can use to identify bias in a sample, like comparing the mean value from your sample with a known or expected mean of the population. You should also think rationally about the sampling process to identify biases, trying to answer this question: is there something about the way we’ve sampled this group that might make it different from the population we care about?
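
A tiny simulation of the transaction-table example makes that selection bias visible; the customers and counts here are made up:

# Customer A has 10 transactions, B has 2, C has 1
transactions <- data.frame(customer = rep(c("A", "B", "C"), c(10, 2, 1)))
set.seed(1)
picked <- sample(transactions$customer, 5)   # "random" customers drawn from rows
table(picked)                                # heavy-transaction customers dominate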

Notes

This question is meant to test your understanding of the limitations of working with data and drawing conclusions. It’s less important to understand specific terms, like selection bias and survivorship bias, than to understand the ways in which data can be limited or misleading. The interviewer wants to see that you understand the nuances of working with real-world data—all of which is biased in one way or another—and all the messiness that this data entails. Using data from an optional survey, for example, has a clear voluntary response bias. This doesn’t mean that the data is unusable, but it does mean that you should be aware of the bias, think about the consequences it has for your analysis, and take it into account in any conclusions you make.
