Chapter 5
Data Science Questions and Hypotheses

Although everyone talks about the data science pipeline, the data scientist’s toolbox, and all the great things that data science can do for you, few data science professionals bother talking about one of the most essential means of tackling data science problems: questions. This is probably because asking questions is less high-tech than other parts of the craft and is perceived as requiring less data science savvy. This is, however, yet another misconception about the field.

Questions in data science may stem from a project manager to some extent, but most of them come from you. The only way to find answers to your questions is through the data you have, so the questions need to be constructed and processed in such a way that they are answerable and able to yield useful information that can help guide your project. This whole process involves the creation of hypotheses, the formal scientific means for turning these questions into something that can be tackled in an objective manner.

In this chapter, we will look at what kinds of questions we can ask and which hypotheses correspond to them. Furthermore, in later chapters, we’ll delve deeper into this matter to see how we can turn these questions into a source of useful information through experiments and the analysis of the results.

Importance of Asking (the Right) Questions

Although the people who drive the data science projects are usually the ones who ask the key questions that need to be answered through these projects, you need to ask your own questions for two reasons:

  1. The questions your superiors ask tend to be more general and very hard to answer directly, leading to potential miscommunications or inadequate understanding of the underlying problem being investigated
  2. As you work with the various data streams at your disposal, you gain a better understanding of problems and can ask more informative (specialized) questions that can get into the heart of the data at hand

Now you may ask, “What’s the point of asking questions in data science if the use of AI can solve so many problems for us?” Many people who are infatuated with AI tend to think this way and, as a result, consider questions a secondary part of data science, if not something completely irrelevant. Although the full automation of certain processes through AI may be a great idea for a sci-fi film, it has little to do with reality. Artificial Intelligence can be a great aid in data science work, but it is not at a stage where it can do all the work for us. Regardless of how sophisticated AI systems are, they cannot yet ask questions that are meaningful or useful, nor can they communicate them to anyone in a comprehensive and intuitive way.

Sometimes it helps to think of such things with metaphors so that we obtain a more concrete understanding of the corresponding concepts. Think of AI as a good vehicle that can take you from A to B in a reliable and efficient manner. Yet, even if it is a state-of-the-art car (e.g. a self-driving one), it still needs to know where B is. Finding this is a matter of asking the right questions, something that AI is unable to do in its current state.

As insights come in all shapes and forms, it is important to remember that some of them can only be accessed by going deeper into the data. These insights also tend to be more targeted and valuable, so it’s definitely worth the extra effort. After all, it’s easy to find the low-hanging fruit of a given problem! To obtain these more challenging insights, you need to perform in-depth analyses that go beyond data exploration. An essential part of this endeavor is formulating questions about the different aspects of the data at hand. Failing to do that is equivalent to providing conventional data analytics (e.g. business intelligence or econometrics), which, although fine in and of itself, is not data science; passing it off as data science would undermine your role and reputation.

Naturally, all these questions need to be grounded in a way that is both formal and unambiguous. In other words, there needs to be some scientific rigor in them and a sense of objectivity as to how they can be tackled. That’s where hypotheses enter the scene, namely the scientific way of asking questions and expanding one’s knowledge of the problem studied. These are the means that allow for finding something useful, in a practical way, with the questions you come up with, while pondering on the data.

Finally, asking questions and formulating hypotheses underscore the human aspect of data science in a very hands-on way. Data may look like ones and zeros when handled by a computer, but it is more than that. Otherwise, everything could be fully automated by a machine (which it can’t be, at least not yet). It is this subtlety in the data and the information it contains that makes asking questions even more important and useful in every data science project.

Formulating a Hypothesis

Once you have your question down, you are ready to turn it into something your data is compatible with, namely a hypothesis. A hypothesis is something you can test in a methodical and objective manner. In fact, most scientific research is done through the use of different kinds of hypotheses that are then tested against measurements (experimental data) and in some cases theories (refined information and knowledge). In data science, we usually focus on the experimental evidence.

Formulating a hypothesis is fairly simple as long as the question is quantifiable. You must make a statement that summarizes the question in a very conservative way (essentially the negation of the claim implied by your question).

The statement that takes the form of the hypothesis always corresponds to a yes-or-no question, and it’s referred to as the Null Hypothesis (usually denoted as H0). This is what you attempt to disprove later on by gathering enough evidence against it.

Apart from the Null Hypothesis, you also need to formulate another hypothesis, which is what would be a potential answer to the question at hand. This is called the Alternative Hypothesis (symbolized as Ha), and although you can never prove it 100%, if you gather enough evidence to disprove the Null Hypothesis, the chances of the Alternative Hypothesis being valid are better. It is often the case that there are many possibilities beyond that of the Null Hypothesis, so this whole process needs to be repeated several times in order to obtain an answer with a reasonable level of confidence. We’ll examine this dynamic in more detail in the following chapter. For now, let’s look at the most common questions you can ask and how you can formulate hypotheses based on them.

Questions Related to Most Common Use Cases

Naturally, not all questions are suitable for data science projects. Also, certain questions lend themselves more to the discovery of insights, as they are more easily quantifiable and closer to the essence of the data at hand, as opposed to other questions that aim to mainly facilitate our understanding of the problem.

In general, the more specific a question is and the closer it is related to the available data, the more valuable it tends to be. Specifically, we can ask questions related to:

  • the relationship between two features
  • the difference between two subsets of a variable
  • how well two variables in a feature set collaborate with each other for predicting another variable
  • whether a variable ought to be removed from the set
  • how similar two variables are to each other
  • whether variable X causes the phenomenon mirrored in variable Y to occur

Let’s now look at each one of these types of question in more detail.

Is Feature X Related to Feature Y?

This is one of the simplest questions to ask and can yield very useful information about your feature set and the problem in general. Naturally, you can ask the same question with other variables in the dataset, such as the target ones. However, since usually you cannot do much about the target variables, more often than not you would ask questions like this by focusing on features. This way, if you find that feature X is very closely related to feature Y, you may decide to remove X or Y from your dataset, since keeping both doesn’t add a great deal of information. You can think of it as having two people in a meeting who always agree. However, before taking any action based on the answer you obtain about the relationship between these two features, it is best to examine other features as well, especially if the features themselves are fairly rich in terms of the information they contain.

The hypothesis you can formulate based on this kind of question is also fairly simple. You can hypothesize that the two features are unrelated (i.e. H0: the similarity between the values of X and Y is zero).

If the features are continuous, it is important to normalize them first, as well as remove any outliers they may have (especially if you are using a basic metric to measure their relationship). Otherwise, depending on how different their scales are or how far the outliers lie from the other values, you may find the two features different when they are not. Also, the similarity is usually measured by a specific metric designed for this task. The alternative hypothesis would be that the two features are related (i.e. Ha: there is a measurable similarity between the values of X and Y).
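As a rough sketch of this preprocessing step, the snippet below (using NumPy, with entirely hypothetical data values and an illustrative z-score threshold) normalizes two features and drops any pair of observations where either value looks like an outlier:

```python
import numpy as np

def zscore_normalize(x):
    """Scale values to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def drop_outliers(x, y, z_thresh=3.0):
    """Keep only the pairs where both values lie within
    z_thresh standard deviations of their feature's mean."""
    zx, zy = zscore_normalize(x), zscore_normalize(y)
    mask = (np.abs(zx) < z_thresh) & (np.abs(zy) < z_thresh)
    return x[mask], y[mask]

# Hypothetical features with one obvious outlier in x
x = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 200.0])
y = np.array([2.0, 6.0, 10.0, 15.0, 20.0, 3.0])

# A tighter threshold is used here because a single extreme value
# inflates the standard deviation in such a tiny sample
x_clean, y_clean = drop_outliers(x, y, z_thresh=2.0)
```

Note that the threshold is a judgment call; with small samples, a single extreme value inflates the standard deviation and can hide itself from a naive z-score filter.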

If this whole process is new to you, it helps to write your hypotheses down so that you can refer to them easily afterwards. However, as you get more used to them, you can just make a mental note about the hypotheses you formulate as you ask your questions.

An example of this type of question is as follows: we have two features in a dataset, a person’s age (X1) and that person’s work experience (X2). Although they correspond to two different things, they may be quite related. The question therefore would be, “Is there a relationship between a person’s age and their work experience?” Here are the hypotheses we can formulate to answer this question accurately:

H0: there is no measurable relationship between X1 and X2

Ha: X1 is related to X2

Although the answer may be intuitive to us, we cannot be sure unless we test these hypotheses, since the data at hand may have a different story to tell about how these two variables relate to each other.
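One way to test this pair of hypotheses is with a correlation coefficient and a permutation test, which makes no assumptions about the variables’ distributions. The sketch below uses NumPy and made-up age/experience values; it estimates how often a correlation as strong as the observed one would arise if H0 were true:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: age (X1) and years of work experience (X2)
age = np.array([22, 25, 28, 31, 35, 40, 44, 50, 55, 60], dtype=float)
exp_ = np.array([1, 2, 4, 7, 10, 14, 18, 25, 30, 35], dtype=float)

# Observed Pearson correlation between the two features
r_obs = np.corrcoef(age, exp_)[0, 1]

# Permutation test: under H0 (no relationship), shuffling one feature
# should produce correlations as extreme as r_obs reasonably often
perms = [abs(np.corrcoef(age, rng.permutation(exp_))[0, 1])
         for _ in range(2000)]
p_value = np.mean([r >= abs(r_obs) for r in perms])

print(f"r = {r_obs:.2f}, p = {p_value:.3f}")
```

A small p-value lets us reject H0 in favor of Ha; a large one means the data do not provide enough evidence of a relationship.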

Is Subset X Significantly Different from Subset Y?

This is a very useful question to ask when you are examining the values of a variable in more depth. In fact, you can’t do any serious data analysis without asking and answering this question. The subsets X and Y are usually derived from the same variable, but they can come from any variable of your dataset, as long as both of them are of the same type (e.g. both are integers).

However, usually X and Y are parts of a continuous variable. As such, they are both on the same scale, so no normalization is required. Also, it is usually the case that any outliers that may exist in that variable have been either removed or adjusted to fit the variable’s distribution. So, X and Y are in essence two sets of floating-point numbers that may or may not be different enough to imply that they really come from two entirely different populations. Whether they do or not will depend on how different their values are. In other words, say we have the following hypothesis that we want to check:

H0: the difference between X and Y is insubstantial (more or less zero)

The alternative hypothesis in this case would be:

Ha: the difference between X and Y is substantial (greater than zero in absolute value)

Note that it doesn’t matter if X is greater than Y or if Y is greater than X. All we want to find out is if one of them is substantially larger than the other, since regardless of which one is larger, the two subsets will be different enough. This “enough” part is something measurable, usually through a statistic, and if this statistic exceeds a certain threshold, the difference is referred to as “significant” in scientific terms.

If X and Y stem from a discrete variable, it requires a different approach to answer this question, but the hypothesis formulated is the same. We’ll look into the underlying differences between these two cases in the next chapter.
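For the continuous case, a distribution-free way to test this pair of hypotheses is a permutation test on the difference of means. The following sketch (NumPy, with synthetic subsets generated for illustration) shows the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical subsets of a continuous variable (e.g. response times)
x = rng.normal(loc=10.0, scale=2.0, size=50)
y = rng.normal(loc=12.0, scale=2.0, size=50)

obs_diff = abs(x.mean() - y.mean())

# Permutation test: under H0 the labels are exchangeable, so reshuffling
# them shows how large a difference could arise by chance alone
pooled = np.concatenate([x, y])
diffs = []
for _ in range(2000):
    rng.shuffle(pooled)
    diffs.append(abs(pooled[:50].mean() - pooled[50:].mean()))
p_value = np.mean([d >= obs_diff for d in diffs])

print(f"observed difference = {obs_diff:.2f}, p = {p_value:.3f}")
```

The absolute value in the statistic reflects the point made above: we only care whether one subset is substantially larger than the other, not which one.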

Do Features X and Y Collaborate Well with Each Other for Predicting Variable Z?

This is a very useful question to ask. Many people don’t realize they could ask it, while others have no idea how they could answer it properly. Whatever the case, it’s something worth keeping in mind, especially if you are dealing with a predictive analytics problem with lots of features. It doesn’t matter if it’s a classification, a regression, or even a time-series problem; when you have several features, chances are that some of them don’t help much in the prediction, even if they are good predictors on their own.

Of course, how well two features collaborate depends on the problem they are applied to. So, it is important to first decide on the problem and on the metric you’ll rely on primarily for the performance of your model. For classification, you’ll probably go for the F1 score or the Area Under the ROC Curve (AUC). In this sense, the collaboration question can be viewed from the perspective of the evaluation metric’s value. Therefore, the question can take the form of the following hypothesis (which, like before, we need to see if we can disprove):

H0: the addition of feature Y does not affect the value of evaluation metric M when using just X in the predictive model

The alternative hypothesis would be:

Ha: adding Y as a feature to a model consisting only of X will considerably improve the performance of the model, as measured by evaluation metric M

The following set of hypotheses would also be worth using to formalize the same question:

H0: the removal of feature Y does not affect the value of evaluation metric M when using both X and Y in the predictive model

Ha: removing Y from a model consisting of X and Y will cause considerable degradation in its performance, as measured by evaluation metric M

Note that in both of these approaches to creating a hypothesis, we took into account the direction of the change in the performance metric’s value. This is because the underlying assumption of features X and Y collaborating is that having them work in tandem is better than either one of them working on its own, as measured by our evaluation metric M.
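As a rough illustration of the first pair of hypotheses, the sketch below uses a simple least-squares model with R-squared as the evaluation metric M (a stand-in for F1 or AUC, since the synthetic target here is continuous); all data and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: Z depends on both X and Y plus noise
n = 200
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = 2.0 * X + 1.5 * Y + rng.normal(scale=0.5, size=n)

def r_squared(features, target):
    """Fit ordinary least squares and return R^2 on the same data."""
    A = np.column_stack([features, np.ones(len(target))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1 - resid.var() / target.var()

m_x = r_squared(X.reshape(-1, 1), Z)           # model with X alone
m_xy = r_squared(np.column_stack([X, Y]), Z)   # model with X and Y

print(f"R^2 with X only: {m_x:.3f}, with X and Y: {m_xy:.3f}")
```

A substantial gap between the two scores is evidence against H0, suggesting that Y does collaborate with X in predicting Z.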

Should We Remove X from the Feature Set?

After pondering the potential positive effect of a feature on a model, the question that comes to mind naturally is the reciprocal of that: would removing a feature, say X, from the feature set be good for the model (i.e. improve its performance)? Or in other words, should we remove X from the feature set for this particular problem? If you have understood the dynamics of feature collaboration outlined in the previous section, this question and its hypothesis should be fairly obvious. Yet, most people don’t pay enough attention to it, opting for more automated ways to reduce the number of features, oftentimes without realizing what is happening in the process.

If you would rather not take shortcuts and you would prefer to have a more intimate understanding of the dynamics in play, you may want to explore this question more. This will not only help you explain why you let go of feature X, but also help you gain some insight into the dynamics of features in a predictive analytics model in general. So, when should you take X out of the feature set? Well, there are two distinct possibilities:

  1. X degrades the performance of the model (as measured by evaluation metric M)
  2. The model’s performance remains the same whether X is present or not (based on the same metric)

One way of encapsulating this in a hypothesis setting is the following:

H0: having X in the model makes its performance, based on evaluation metric M, notably higher than omitting it from the model

Ha: removing X from the model either improves or maintains the same performance, as measured by metric M

Like in the previous question type, it is important to remember that the usefulness of a feature greatly depends on the problem at hand. If you find that removing feature X is the wisest choice, it’s best to still keep it around (i.e. don’t delete it altogether), since it may be valuable as a feature in another problem, or perhaps with some mathematical tinkering.
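To make this comparison fair, the performance with and without X should be measured on held-out data, since adding a feature can never hurt the fit on the training data itself. A minimal sketch (NumPy, with synthetic data in which X is deliberately generated as pure noise):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the target depends on A and B; X is pure noise
n = 400
A, B, X = rng.normal(size=(3, n))
y = 3.0 * A - 2.0 * B + rng.normal(scale=0.5, size=n)

def holdout_r2(features, y, n_train=300):
    """Fit OLS on a training split, score R^2 on a held-out split."""
    F = np.column_stack(features + [np.ones(n)])  # add intercept column
    coef, *_ = np.linalg.lstsq(F[:n_train], y[:n_train], rcond=None)
    resid = y[n_train:] - F[n_train:] @ coef
    return 1 - resid.var() / y[n_train:].var()

with_x = holdout_r2([A, B, X], y)
without_x = holdout_r2([A, B], y)

print(f"held-out R^2 with X: {with_x:.3f}, without X: {without_x:.3f}")
```

If the held-out score without X is the same or better, that is evidence against H0, supporting the removal of X for this problem.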

How Similar are Variables X and Y?

Another question worth asking is related to the first one we covered in the chapter, namely the measure of the similarity of two variables X and Y. While answering whether the variables are related may be easy, we may be interested in finding out exactly how much they are related. This is not based on some arbitrary mathematical sense of curiosity. It has a lot of hands-on applications in different data science scenarios, such as predictive analytics. For example, in a regression problem, finding that a feature X is very similar to the target variable Y is a good sign that X ought to be included in the model. Also, in any problem that involves continuous variables as features, finding that two such variables are very similar to each other may lead us to omit one of them, even without having to go through the process of the previous section, thus saving time.

In order to answer this question, we tend to rely on similarity metrics, so it’s usually not the case that we formulate hypotheses for this sort of question. Besides, most statistical similarity metrics come with a set of statistics that help clarify the significance of the result. However, even though this is a possibility, the way statistics has modeled the whole similarity matter is both arbitrary and weak, at least for real-world situations, so we’ll refrain from examining this approach. Besides, twisting the data into a preconceived idea of how it should be (i.e. a statistical distribution) may be convenient, but data science opts to deal with the data as-is rather than how we’d like it to be. Therefore, it is best to measure similarity with various metrics (particularly ones that don’t make any assumptions about the distributions of the variables involved) rather than rely on some statistical method only.

Similarity metrics are a kind of heuristic designed to depict how closely related two features are on a scale of 0 to 1. You can think of them as the opposite of distances. We’ll look at these along with other interesting metrics in detail in the heuristics chapter toward the last part of this book.
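A minimal example of such a heuristic, built directly from a distance as described above (the exact formula is an illustrative choice for this sketch, not a standard metric):

```python
import numpy as np

def similarity(x, y):
    """A simple distribution-free similarity heuristic on a 0-1 scale:
    1 / (1 + d), where d is a scale-free Euclidean distance between
    the normalized versions of the two variables."""
    xn = (x - x.mean()) / x.std()
    yn = (y - y.mean()) / y.std()
    d = np.linalg.norm(xn - yn) / np.sqrt(len(x))
    return 1.0 / (1.0 + d)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(similarity(x, 2 * x + 1))   # identical after normalization -> 1.0
print(similarity(x, x[::-1]))     # reversed ordering -> much lower
```

Because the distance is zero for identical (normalized) variables and grows as they diverge, the heuristic behaves as the “opposite of a distance”: 1 means maximally similar, and values near 0 mean very dissimilar.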

Does Variable X Cause Variable Y?

Whether a phenomenon expressed through variable X is the root-cause of a phenomenon denoted by variable Y is a tough problem to solve, and it is definitely beyond the scope of this book, yet a question related to this is quite valid and worth asking (this kind of problem is usually referred to as root-cause analysis). Nevertheless, unless you have some control over the whole data acquisition pipeline linked to the data science one, you may find it an insurmountable task. The reason is that in order to prove or disprove causality, you need to carry out a series of experiments designed for this particular purpose, collect the data from them, and then do your analytics work. This is why more often than not, when opting to investigate this kind of question, we go for a simpler set-up known as A-B testing. This is not as robust, and it merely provides evidence of a contribution of the phenomenon of variable X in that of variable Y, which is quite different from saying that X is the root-cause of Y. Nevertheless, it is still a valuable method as it provides us with useful insights about the relationship between the two variables in a way that correlation metrics cannot.

A-B testing is the investigation of what happens when a control variable has a certain value in one case and a different value in another. The difference in the target variable Y between these two cases can show whether X influences Y in some measurable way. This is quite a different question from the original one of this section. Still, it is worth looking into it, as it is common in practice.

The hypothesis that corresponds to this question is fairly simple. One way of formulating the null hypothesis is as follows:

H0: X does not influence Y in any substantial way

The alternative hypothesis in this case would be something like:

Ha: X contributes to Y in a substantial way

So, finding out if X is the root-cause of Y involves first checking to see if it influences Y, and then eliminating all other potential causes of Y one by one. To illustrate how complex this kind of analysis can be, consider that determining that smoking cigarettes is beyond a doubt a root cause of cancer (something that seems obvious to us now) took several years of research. Note that in the majority of cases of this kind of analysis, at least one of the variables (X, Y) is discrete.
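A minimal sketch of an A-B test along these lines, using synthetic binary outcomes and a permutation test (the group sizes and conversion rates are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical A-B test: X is the group (control vs. variant)
# and Y is a binary outcome (e.g. conversion)
control = rng.binomial(1, 0.10, size=1000)   # 10% baseline rate
variant = rng.binomial(1, 0.18, size=1000)   # 18% with the change

obs_diff = variant.mean() - control.mean()

# Permutation test of H0: group membership does not influence the outcome
pooled = np.concatenate([control, variant])
diffs = []
for _ in range(2000):
    rng.shuffle(pooled)
    diffs.append(pooled[1000:].mean() - pooled[:1000].mean())
p_value = np.mean([abs(d) >= abs(obs_diff) for d in diffs])

print(f"observed lift = {obs_diff:.3f}, p = {p_value:.3f}")
```

A small p-value here is evidence that X influences Y in a substantial way; as noted above, it does not on its own establish that X is the root-cause of Y.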

Other Question Types

Apart from these question categories, there are several others that are more niche and therefore beyond the scope of this chapter (e.g. questions relevant to graph analysis, data mining, and other methodologies discussed previously). Nevertheless, you should explore other possibilities of questions and hypotheses so that at the very least, you develop a habit of doing that in your data science projects. This is bound to help you cultivate a more inquisitive approach to data analysis, something that is in and of itself an important aspect of the data science mindset.

Questions Not to Ask

Questions that are very generic or too open-ended should not be asked about the data. For example, questions like “What’s the best methodology to use?” or “What’s the best feature in the feature set?” or “Should I use more data for the model?” are bound to merely waste your time if you try to answer them using the data as is. The reason is that these questions are quite generic or their answers tend to be devoid of valuable information. For example, even if you find out that the best feature in the feature set is feature X, what good will that do? Besides, a feature’s value often depends on the model you use, as well as how it collaborates with other features.

You may still want to answer these questions, but you will need to consult external sources of information (e.g. the client), or make them more specific. For example, the second question can be transformed into these questions, which make more sense: “What’s the best feature in the feature set for predicting target variable X, if all other features are ignored?” or “What’s the feature that adds the most value in the prediction of target variable X, given the existing model?”

Also, questions that have a sense of bias in them are best avoided, as they may distort your understanding of the problem. For example, “How much better is model A than model B in this problem?” assumes that model A is indeed better than model B, so it may not allow you to be open to the possibility that it’s not.

Also, very complex questions with many conditions in them are not too helpful either. Although they are usually quite specific, they may be complicated when it comes to testing them. So, unless you are adept in logic, you are better off tackling simpler questions that are easier to work with and answer.

Summary

Asking questions is essential since the questions given by the project managers are usually not enough or too general to guide you through the data science process of a given problem most effectively.

Formulating a hypothesis is an essential part of answering the questions you come up with, as it allows for rigorous testing that can provide a more robust answer that is as unbiased as possible.

Hypotheses are generally yes or no questions that are subject to statistical testing so that they can be answered in a clear-cut way, and the result is accompanied by a confidence measure.

There are various kinds of questions you can ask. Some of the most common ones are:

  • Is feature X related to feature Y?
  • Is subset X significantly different from subset Y?
  • Do features X and Y collaborate well with each other for predicting variable Z?
  • Should we remove X from the feature set?
  • How similar are variables X and Y?
  • Does variable X cause variable Y?