Chapter 6
Data Science Experiments and Evaluation of Their Results

So, you’ve come up with a promising question for your data, and you have formulated a hypothesis around it (in practice, you’ll probably have come up with a couple of them). Now what? Well, now it’s time to test it and see if the results are strong enough to make the alternative hypothesis you have proposed (Ha) a viable answer. This fairly straightforward process is something we will explore in detail in this chapter, before we delve deeper into it in the chapter that follows.

The Importance of Experiments

Experiments are essential in data science, and not just for testing a hypothesis. In essence, they are how the scientific method is applied: an empirical approach to acquiring and refining knowledge as objectively as possible. They are also the make-or-break way of validating a theory, and the most concrete differentiator between science and philosophy, as well as between data science and the speculation of “experts” in a given domain. Experiments may be challenging, and their results may not always comply with our expectations or agree with our intuition. Still, they are always insightful and gradually advance our understanding, especially in complex scenarios or cases where we don’t have sufficient domain knowledge.

In more practical terms, experiments are what make a concept we have grasped or constructed either a fact or a fleeting error in judgment. This not only gives us confidence in our perception, but also allows for a more rigorous and somewhat objective approach to data analytics. Many people have lost faith in the results of statistics, and for good reason: the conclusions drawn are often so dependent on assumptions that they fail to have value in the real world.

The pseudo-scientists who often make use of this tool do so to propagate their misconceptions rather than to do proper scientific work. In spite of all that, there are parts of statistics that are useful in data science, and one of them is the testing of hypotheses. Experiments are complementary to that, and since they are closely linked to these statistical tests, we’ll group them together for the purposes of this chapter.

How to Construct an Experiment

First of all, let’s clarify what we mean by experiments, since sci-fi culture may have tainted your perception of them. Experiments in data science usually take the form of a series of simulations you perform on a computer, typically in an environment like Jupyter (http://bit.ly/2oriE8B). Alternatively, you can conduct all of your experiments from the command line. The environment you use is not so important, though a notebook setting is best if you are comfortable with it, as it allows you to include descriptive text and graphics alongside the code, and offers several other options, such as exporting the whole thing as a PDF or an HTML file.

How you construct an experiment depends on the question at hand. If it’s a fairly straightforward question that you plan to answer and everything is formulated as a clear-cut hypothesis, the experiment will take the form of a statistical test or a series of such tests (in case you have several alternative hypotheses).

The statistical tests that are most commonly used are the t-test and the chi-square test. The former works on continuous variables, while the latter is suitable for discrete ones. Oftentimes you will employ several tests to gain a deeper understanding of the hypotheses you want to check, particularly if the probability value (p-value) of a test is close to the threshold.
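To make this concrete, here is a minimal sketch of both tests using Python’s SciPy library. The samples and the contingency table are made up purely for illustration; in practice you would plug in your own data.

    import numpy as np
    from scipy import stats

    # t-test: compare the means of two continuous samples (illustrative data)
    group_a = np.random.normal(loc=10.0, scale=2.0, size=50)
    group_b = np.random.normal(loc=11.0, scale=2.0, size=50)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

    # chi-square test: check the independence of two discrete variables,
    # summarized as counts in a contingency table (illustrative numbers)
    contingency = np.array([[30, 10],
                            [20, 40]])
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
    print(f"chi-square: chi2 = {chi2:.3f}, p = {p_value:.4f}")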

I recommend that you avoid statistical tests like the z-test unless you are certain that the underlying assumptions hold (e.g. that each variable follows a Gaussian distribution). If you are dealing with a more complex situation, you may need to run several simulations and then apply a statistical test to their outputs. We’ll look into this case in more detail in the section that follows.

Something that may seem obvious is that every scientific experiment has to be reproducible. This doesn’t mean that another experiment based on the same data is going to output the exact same results, but the new results should be close to the original ones. In other words, the conclusion should be the same no matter who performs the experiment, even if they use a different sample of the available data. If the experiment is not reproducible, you need to re-examine your process and question the validity of your conclusions. We are going to look more into this matter in the next chapter.

Another thing worth noting when it comes to constructing an experiment is that, just like everything else you do throughout the data science pipeline, your experiments need to be accompanied by documentation. You don’t need to write a great deal of text or anything particularly refined, as this documentation will probably never be seen by the project’s stakeholders; still, whoever does read it will want to get the gist of things quickly.

Therefore, the documentation of your experiment is better off being succinct and focusing on the essential aspects of the tasks involved, much like the comments in the code of your scripts. In addition, even if you don’t share the documentation with anyone, it is still useful since you may need to go back to square one at some point and re-examine what you have done. Moreover, this documentation can be useful when writing up your report for the project, as well as any supplementary material for it, such as slideshow presentations.

Experiments for Assessing the Performance of a Predictive Analytics System

When it comes to assessing the performance of a predictive analytics system, we must take a specific approach to setting up the experiments needed to check if the models we create are good enough to put into production. Although this makes use of statistical testing, it goes beyond that, as these experiments must take into account the data used in these models as well as a set of metrics.

One way to accomplish this is through the use of sampling methods. The most popular such method is K-fold cross-validation, a robust way of splitting the dataset into training and testing sets K times, minimizing the bias of the samples generated. For classification problems, the sampling can also take the class distribution into account, which is particularly useful if there is an imbalance between the classes. This sort of sampling is referred to as stratified sampling, and it is more robust than conventional random sampling (although stratified sampling has a random element to it too, just like most sampling methods).
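Below is a minimal sketch of stratified K-fold splitting with scikit-learn. The feature matrix X and label vector y are hypothetical placeholders, and the 80/20 class imbalance is chosen only to show that each fold preserves the class ratio.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(100, 5)               # 100 rows, 5 features (placeholder data)
    y = np.array([0] * 80 + [1] * 20)        # imbalanced classes (80/20)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        # each test fold preserves the 80/20 class ratio of the full dataset
        ratio = y[test_idx].mean()
        print(f"fold {fold}: test size = {len(test_idx)}, share of class 1 = {ratio:.2f}")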

When you are trying out a model and want to examine whether it is robust enough, you’ll need to train and test it several times on different subsets of the dataset, calculating a performance metric after each run. Then you can aggregate all the values of this metric and compare them to a baseline or a desired threshold value using a statistical test. It is strongly recommended that you gather at least 30 such values before attempting a statistical evaluation, since the reliability of the experiment’s results depends on the number of data points you have. Oftentimes this experimental setup is combined with the K-fold cross-validation method mentioned previously, for an even more robust result. Experiments of this kind are so rigorous that their results tend to be trustworthy enough to warrant a scientific publication (particularly if the model you use is something new or a novel variant of an existing predictive analytics system).
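A minimal sketch of this setup follows, assuming a hypothetical logistic regression model, a synthetic dataset, and an arbitrary accuracy threshold of 0.80; none of these choices come from the text itself. The one-sample t-test at the end checks whether the mean of the 30 collected scores sits significantly above that threshold.

    import numpy as np
    from scipy import stats
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    threshold = 0.80                          # desired minimum performance (assumed)
    scores = []
    for run in range(30):                     # at least 30 runs, as suggested above
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=run)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))

    # H0: the mean score equals the threshold; 'greater' asks whether it exceeds it
    t_stat, p_value = stats.ttest_1samp(scores, popmean=threshold, alternative="greater")
    print(f"mean score = {np.mean(scores):.3f}, p = {p_value:.4f}")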

Since simulations like these often take some time to run, particularly on a large dataset, you may want to include more than one performance metric. Popular choices are F1 for classification and MSE for regression or time-series analysis. In addition, if you have several models you wish to test, include all of them in the same experiment so that each one runs on the same data as the others. This subtle point, which is often neglected, adds another layer of robustness to your experiments, as no one can claim that the model you selected as the best performer simply lucked out. If you have enough runs, the probability of one model outperforming the others primarily by chance diminishes greatly.

A Matter of Confidence

Confidence is an important matter, not just as a psychological attribute, but also in how the questions we pose to the data are answered through this experiment-oriented process. The difference is that in the latter case, we can have a reliable measure of confidence, which plays a very important role in the hypothesis testing process, namely quantifying the answer. What’s more, you don’t need to be a particularly confident person to exercise confidence in your data science work when it comes to this kind of task. You just need to do your due diligence when tackling the challenge and be methodical in your approach to the tests.

Confidence is usually expressed as a heuristic metric that takes the form of a confidence score, with values ranging between 0 and 1. This corresponds to the probability of the system’s prediction being correct (or within an acceptable range of error). When it comes to statistical methods, this confidence score is closely tied to the p-value of the test involved: the lower the p-value, the higher the confidence. Yet no matter how well-defined these confidence metrics are, their scope is limited and depends on the dataset involved. This is one of the reasons why it is important to use a diverse and balanced dataset when trying to answer questions about the variables involved.

It is often the case that we need to pin down a given metric, such as the average value of a variable or its standard deviation. This variable may be mission-critical or a key performance indicator. In cases like this, we tend to opt for a confidence interval: a range of values for the metric we are examining, guaranteed at a given confidence level. More often than not, this confidence level is set to 95%, yet it can take any value, usually near that point. Whatever the case, its value directly depends on the p-value threshold we have chosen a priori (in this case, 0.05). Note that although classical statistics is often used to determine this interval, it doesn’t have to be this way. Nowadays, more robust, assumption-free methods, such as Monte Carlo simulations, are employed to derive the borders of a confidence interval for any distribution of data.
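One common simulation-based way to obtain such an interval without distributional assumptions is the bootstrap, sketched below on a deliberately non-Gaussian, made-up sample. The 2.5th and 97.5th percentiles of the resampled means correspond to a 95% confidence interval (i.e. a 0.05 threshold).

    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.exponential(scale=3.0, size=200)    # deliberately non-Gaussian, illustrative data

    # resample the data with replacement many times and record the mean each time
    boot_means = [rng.choice(values, size=len(values), replace=True).mean()
                  for _ in range(10_000)]
    lower, upper = np.percentile(boot_means, [2.5, 97.5])   # 95% confidence interval
    print(f"mean = {values.mean():.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")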

Regardless of what methods you make use of for answering a data science question, it is important to keep in mind that you can never be 100% certain about your answer. The confidence score is bound to be less than 1, even if you use a large number of iterations in your experiments or a large number of data points in general. For all practical purposes, however, a high enough confidence score is usually good enough for the project. After all, data science is more akin to engineering than pure mathematics, since just like all other applied sciences, data science focuses on being realistic and practical.

Embracing this ambiguity in the conclusions is a necessary step and a differentiator of the data science mindset from other, more precision-focused disciplines of science. In addition, this ambiguity is abundant in big data, which is one of the factors that make data science an invaluable tool for dealing with this kind of data. However, more in-depth analysis through further testing can resolve a large part of this ambiguity and shed some light on the complex problems of big data, making them a bit more ordered and concrete.

Finally, it is essential to keep in mind when answering a question that even if you do everything right, you may still be wrong in your conclusions. This counter-intuitive situation could be because the data used was of low veracity, or it wasn’t cleaned well enough, or maybe parts of the dataset were stale. All these possibilities go on to demonstrate that data science is not an exact science, especially if the acquisition of the data at hand is beyond the control of the data scientist, as is often the case.

This whole discipline has little room for arrogance; do not rely on fancy techniques more than on a solid understanding of the field. It is good to remember this, particularly when communicating your conclusions. You should not expect any groundbreaking discoveries unless you have access to large volumes of diverse, reliable, and information-rich data, along with sufficient computing resources to process it properly.

Evaluating the Results of an Experiment

The results of an experiment can often be obtained in an automated way, especially if the experiment is fairly simple. However, evaluating these results and deciding how best to act on them requires attention, because the evaluation and the interpretation of the results are closely intertwined. If done properly, there is little room for subjectivity. Let’s look into this in more detail by examining the two main types of experiments we covered in this chapter, statistical tests and evaluations of predictive analytics models, and how we can evaluate the results in each case.

If it is a statistical test whose results you wish to interpret, you just need to examine the various statistics that came along with it. The one that stands out the most, since it yields the most relevant information, is the p-value, which we talked about previously. This is a number between 0 and 1 (inclusive) that denotes the probability of observing a result at least as extreme as the one obtained if chance alone were responsible (i.e. the aggregation of various factors that were not accounted for, which contributed to the result even if you were not aware of their role or their existence). The reason this is so important is that even in a controlled experiment, it is possible that the result has been strongly influenced by variables that were not taken into account. If the end result can easily be explained by them, the p-value is high, and that’s bad news for our test. Otherwise, it should take a fairly small value (the smaller the better for the experiment). In general, we use 0.05 (5%) as the cut-off point, below which we consider the result statistically significant. If you find this threshold too lenient, you can make use of other popular values for it, such as 0.01, 0.001, or even lower ones.

If you are dealing with the evaluation of the performance of a classifier, a regressor, or some other predictive analytics system, you just need to gather all the values of the evaluation metric(s) you plan to use for all the systems you wish to test and put them into a matrix. Following this, you can run a series of tests (usually a t-test works very well for this sort of data) to find out which model’s performance, according to the given performance metrics, is higher than the others’ with statistical significance. The data you will be using corresponds to the columns of the aforementioned matrix. Make sure you calculate the standard deviation or the variance of each column in advance, though, since certain tests take whether the variances are equal as a parameter.
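Here is a minimal sketch of that workflow, assuming a matrix whose two columns hold made-up per-fold F1 scores for two hypothetical models. Levene’s test is used to decide whether the equal-variance assumption is reasonable, which in turn determines whether the standard or Welch’s version of the t-test is applied.

    import numpy as np
    from scipy import stats

    # each column holds one model's scores over the same 10 folds (illustrative numbers)
    scores = np.array([
        [0.81, 0.78], [0.83, 0.80], [0.79, 0.76], [0.84, 0.81], [0.82, 0.79],
        [0.80, 0.77], [0.85, 0.82], [0.83, 0.78], [0.81, 0.80], [0.84, 0.79],
    ])
    model_a, model_b = scores[:, 0], scores[:, 1]

    # Levene's test checks the equal-variance assumption before the t-test
    _, p_var = stats.levene(model_a, model_b)
    equal_var = p_var > 0.05
    t_stat, p_value = stats.ttest_ind(model_a, model_b, equal_var=equal_var)
    print(f"equal variances assumed: {equal_var}, t = {t_stat:.3f}, p = {p_value:.4f}")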

As we saw in the previous chapter, there are cases where a similarity metric makes more sense for answering a question. These kinds of metrics usually take values between 0 and 1, or between -1 and 1. In general, the higher the absolute value of the metric, the stronger the signal in the relationship examined. In most cases, this translates into a more affirmative answer to the question examined. Keep in mind that when using a statistical method to calculate the similarity (e.g. a correlation metric), you will end up not just with the similarity value, but also with a corresponding p-value. Although this is usually small, it is something you may want to consider in your analysis.
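As a small illustration, the sketch below computes a Pearson correlation on two made-up variables; note that SciPy returns both the coefficient and its associated p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 0.7 * x + rng.normal(scale=0.5, size=100)    # hypothetical related variable

    r, p_value = stats.pearsonr(x, y)
    print(f"correlation = {r:.3f} (p = {p_value:.4g})")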

It is good to keep in mind that whatever results your tests yield, they are not fool-proof. Regardless of whether they pass a statistical significance test, it is still possible that the conclusion is not correct. While the chances of this happening are small (the p-value is a good indicator of this), remembering it helps you maintain a sense of humility regarding your conclusions and avoid being taken by surprise if the unexpected occurs.

Finally, it is sometimes the case that we need to carry out additional iterations of our experiments, or even perform new tests to check additional hypotheses. This may seem frustrating, but it is a normal part of the process and to some degree expected in data science work. If by the end of the experiments you end up with more questions than answers to your original ones, that’s fine. This is just how science works in the real world!

Summary

Experiments are essential in checking a hypothesis in a scientific manner and gaining a better understanding of the dynamics of the variables examined. When it comes to simple questions that have clear-cut hypotheses, experiments take the form of a statistical test. The most common tests in data science are the t-test (continuous variables) and the chi-square test (discrete variables).

Although many experiments involve a statistical test on your data directly, they often require more work, such as when you need to assess the performance of a predictive analytics model. In this case, you need to perform a number of runs of your model, calculate a performance metric or two, and perform some statistical analysis on the aggregate of this metric’s values, usually through a t-test or something equivalent. K-fold cross-validation is a popular method for sampling the data in these kinds of experiments and allows for more robust results.

Confidence in the statistical analysis of results, be it from a straightforward statistical test or from the testing of the outputs of a simulation, is important. It is usually expressed through a probability value (p-value) and its relationship to a predefined threshold. The lower the p-value of a test, the higher the confidence in the result, which corresponds to rejecting the null hypothesis in favor of the alternative hypothesis.

Evaluating the results of an experiment is done through the confidence measure of the tests performed and its comparison with a given threshold, as well as through sensitivity analysis of the conclusions drawn.
