Chapter 9
Mistakes Through the Data Science Process

Mistakes are inevitable, not just in coding as we saw in the previous chapter, but also through the whole data science pipeline. This is more or less obvious. What is not obvious is that these mistakes are also learning opportunities for you, and that no matter how experienced you are, they are bound to creep up on you. Naturally, the better your grasp of data science, the lower your chances of making these mistakes, but as the field constantly changes, it is likely that you will not be on top of every aspect of it.

In this chapter, we will examine the difference between mistakes and bugs, the most common types of mistakes in data science, how the selection of a model can be erroneous (even if it does not create apparent issues that will make you identify it as a mistake), the value of a mentor in discovering these mistakes, and additional considerations about mistakes in the data science process.

How Mistakes Differ From Bugs

Although bugs are mistakes of sorts, the mistakes in the data science process are higher-level and more challenging to deal with overall. In particular, they stem either from a misunderstanding of the pipeline or from general issues with how it is applied to a particular problem. The main issue with these mistakes is that, because they are high-level, they do not typically yield errors or exceptions, so they are easy to overlook, quietly degrading the end result of your project, whether that is a data product or a set of insights.

As I discussed in the first part of the book, the data science pipeline can be time-consuming, as new ideas often come up as you go through it. These ideas urge you to go back to previous steps and revisit your feature set, making a certain amount of back-and-forth inevitable. However, a simple mistake can force you into even more of it, causing additional delays. Limiting these mistakes can significantly improve your efficiency and the quality of your insights, as you will have more time to spend on meaningful tasks rather than troubleshooting issues you could have avoided.

Still, as you progress in your career as a data scientist, the mistakes you make are bound to become more and more subtle (and as a result, more interesting). Some bystanders may not even recognize them as mistakes, which is why those mistakes tend to require more effort and commitment to excellence to remedy. One thing is for certain: mistakes will never vanish completely. The sooner you realize that they are part of the daily life of a data scientist, the more realistic your expectations about your work and the field in general will be. With the right attitude, you can turn these mistakes into opportunities for growth that will make you a better data scientist, transforming the otherwise vexing process of discovering and amending them into priceless lessons.

Most Common Types of Mistakes

Let us now take a look at the most common types of mistakes in the data science process that people in the field tend to make, particularly in the early part of their careers. Even if your scripts are entirely bug-free, you may still fall victim to subtle errors that can often be more challenging than the simpler issues in the code you write.

The majority of data science process mistakes involve the data engineering part of the pipeline – data cleaning in particular. Many data science practitioners nowadays are fond of data modeling and tend to forget that for the models to function well, the data fed into them has to be prepared properly. This preparation takes place in the data engineering stage, an essential and time-consuming part of the pipeline. However, data cleaning involves more than merely getting rid of corrupt data and formatting what remains of the raw data into a dataset. If there are a lot of missing values, we may have to examine those data points and see how they relate to the rest of the dataset, especially the target variable when we are dealing with a predictive analytics problem. Moreover, sometimes the arithmetic mean is not the right statistic for replacing those missing values. Also, when it comes to classification problems, we need to take the class structure of the dataset into account before replacing them. If you neglect any one of these steps, you are bound to distort the signal of the dataset, which is a costly mistake in this part of the pipeline.
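
To make this concrete, here is a minimal sketch of class-aware imputation using pandas; the dataset and column names are hypothetical. It first checks whether missingness relates to the class structure, then fills gaps with the per-class median rather than a global mean, since an outlier would distort the latter.

    import numpy as np
    import pandas as pd

    # Hypothetical dataset: a skewed numeric feature with missing
    # values and a class label (all names are illustrative).
    df = pd.DataFrame({
        "income": [32e3, 45e3, np.nan, 1.2e6, 38e3, np.nan, 52e3, 41e3],
        "label": ["A", "A", "A", "B", "B", "B", "A", "B"],
    })

    # First, check whether missingness is related to the class structure.
    print(df.groupby("label")["income"].apply(lambda s: s.isna().mean()))

    # The outlier distorts the mean, so fill gaps with the per-class
    # median instead of a single global mean.
    df["income"] = df.groupby("label")["income"].transform(
        lambda s: s.fillna(s.median())
    )

The per-class median is just one option; the right replacement always depends on the distribution of the variable and on how its missing values relate to the target.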

The process of feature creation is an integral part of the data science pipeline, especially if the data at hand does not lend itself to advanced data analytics methods. Feature creation is not an easy task; the majority of beginners in the field find it very challenging and tend to neglect it. On the bright side, the programming part of it is fairly straightforward, so if you are confident in your programming skills, it is unlikely to yield any bugs. Also, if you pay attention to the feature creation stage, you will likely save a lot of time later on, while also getting better results from your models. The mistakes data scientists make in this stage are usually not related to feature creation per se, but rather to dedicating insufficient time to the process. Coming up with new features is not the same as feature extraction, an often automated process for condensing the feature set into a set of meta-features (also known as super-features). Creating new features, even if many of them end up unused, allows you to get to know the data on a deeper level through a creative process. Moreover, the new features you select from the ones you have come up with are bound to be useful additions to the existing features, making the parts of your project that follow considerably easier.
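
As a simple illustration, here is a sketch of the kind of features you might craft by hand with pandas; the table and column names are hypothetical, and the point is the creative step of combining existing columns into something more informative.

    import pandas as pd

    # Hypothetical transactions table (column names are illustrative).
    df = pd.DataFrame({
        "amount": [120.0, 40.0, 310.0, 75.0],
        "num_items": [3, 1, 10, 5],
        "signup_date": pd.to_datetime(
            ["2019-01-10", "2020-06-01", "2018-11-23", "2021-03-15"]
        ),
    })

    # Two hand-crafted features: a ratio and an account-age variable.
    df["avg_item_price"] = df["amount"] / df["num_items"]
    df["account_age_days"] = (
        pd.Timestamp("2021-12-31") - df["signup_date"]
    ).dt.days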

Issues related to sampling are not as common, but they are still a type of mistake you can encounter in your data science endeavors. Naturally, sampling a dataset properly (i.e., through random sampling that does not introduce any measurable biases) is essential for training and testing your models. This is why we often need to use several samples to ensure our models are stable, as we briefly saw in Chapter 7. Using a single sample, or only a small number of them, will usually bring about models that do not generalize adequately. Therefore, not paying attention to this process when building your models is a serious mistake that can throw off even the most promising models you create.
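
One way to check this kind of stability, sketched below under the assumption of a scikit-learn workflow with synthetic data, is to train and test on several different random splits and look at the spread of the scores: a stable model should score similarly across all of them.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    # Train and test on several random, stratified splits.
    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))

    # A large standard deviation suggests the model is sensitive
    # to the particular sample it was trained on.
    print(f"mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")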

As we saw in a previous chapter, model evaluation is an important part of the data science pipeline, tied to the model development stage. Nevertheless, it often does not get the necessary attention, and many people rush to use their models without spending enough time evaluating them. Model evaluation is essential for making sure that no biases are present, a process often handled through K-fold cross-validation, as we have seen. Yet running this method only once is rarely enough. The conclusions drawn from an insufficient analysis of a model can easily become a liability, particularly if that model is then chosen to go into production. All of this constitutes a serious mistake that is unfortunately all too common among those unaware of the value of sampling.
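
A minimal sketch of going beyond a single K-fold run, again assuming scikit-learn and synthetic data, is to repeat the cross-validation with different shuffles and report the variability of the scores rather than a single number.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Repeat 5-fold cross-validation ten times with different shuffles,
    # rather than relying on a single run.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(f"{scores.mean():.3f} +/- {scores.std():.3f}")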

Over-fitting a model is another issue that comes about from a superficial approach to the data modeling stage, and it constitutes another important mistake. It is closely linked to the previous issues, as it involves the performance of models. Specifically, it has to do with a model performing well on some data but terribly on most other data. In other words, the model is too specialized, and its generalization is insufficient for it to be broadly useful. Allowing over-fitting in a model is a serious mistake that can put the whole project at risk if it is not handled properly.
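
A quick way to spot over-fitting, sketched here with a decision tree on synthetic data, is to compare training and test scores: a large gap between the two signals a model that has memorized rather than generalized.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # An unconstrained tree memorizes the training data; limiting its
    # depth trades some training accuracy for better generalization.
    for depth in (None, 3):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_tr, y_tr)
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))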

Another mistake deals with the assumptions behind the tests we perform or the metrics we calculate. Sometimes, these assumptions do not apply to the data at hand, making the conclusions that stem from them less robust than we may think and subject to change if the underlying discrepancies grow larger. For most tests performed in everyday data science, this is a common but not a crucial problem (e.g., the t-test can handle many cases where its assumptions are not met without yielding misleading results). Since some cases are more sensitive than others when it comes to their assumptions, it is best to be aware of this issue and avoid it whenever possible.
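
For example, assuming SciPy is available, you can probe the normality assumption before trusting a standard t-test, and fall back on alternatives that relax the assumptions; a rough sketch on synthetic, skewed data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.exponential(scale=1.0, size=40)  # heavily skewed samples,
    b = rng.exponential(scale=1.5, size=40)  # far from normal

    # Check the normality assumption before trusting the t-test.
    print(stats.shapiro(a).pvalue, stats.shapiro(b).pvalue)

    # Welch's t-test (equal_var=False) drops the equal-variance
    # assumption; Mann-Whitney U drops the normality assumption too.
    print(stats.ttest_ind(a, b, equal_var=False).pvalue)
    print(stats.mannwhitneyu(a, b).pvalue)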

Finally, there are other types of mistakes not related to the above, as they are more application-specific (e.g. mistakes related to the architectural design of an AI system or to the modeling of an NLP problem). I will not go into any detail about them here, but I recommend that you be aware of them, since all the mistakes mentioned here are merely the tip of the iceberg. The data science process may be simple enough to understand and apply effectively, but it entails many complications, and every aspect of it requires considerable attention. The more mindful you are about the data pipeline, the less likely you are to make mistakes that cause delays or other issues in your data science projects.

Choosing the Right Model

Choosing the right model for a given problem is often the root of a fundamental mistake in data science that tends to go unnoticed, which is why I dedicate a section to it. This mistake is linked to understanding the problem you are trying to solve and figuring out the best strategy to move forward. There are a variety of models out there that can work with your data, but that does not mean they are all suitable solutions for the data modeling part of the pipeline.

What is the right model anyway? More often than not, there is no single model that is optimum for your data, which is why you need to try several models before you make a choice. Take into account a variety of factors that are relevant to your project. This is how you decide on the model you plan to utilize and eventually put it into production. All of this can be challenging, especially if you are new to the organization. The most common related scenarios which can constitute different manifestations of the model-selection mistake are the following:

  • Choosing a model just because it is supposed to be good or popular in the data science community. This is the most common mistake related to model selection. Although there is nothing wrong with the models presented in articles or the models used by the “experts,” it is often the case that they are not the best ones to choose for every situation. The right model for a particular problem depends on various factors, and it is virtually impossible to know which one it would be beforehand. So, going for an expert’s a priori view on what should work would be unscientific and imprudent.
  • Selecting a model because it has a very high accuracy rate. This kind of model-related mistake is more subtle. Although accuracy is important in predictive analytics problems, it is not always the best evaluation metric to rely on. Sometimes the nature of the problem calls for a very specific performance metric, such as the F1 score, or for a model that is easy to interpret or implement. It may also be that speed is of utmost importance, so a faster model is preferred over a super-accurate one.
  • Going for a model because it is easy to understand and work with. This is a typical rookie mistake, but it can happen to data science pros as well. Some data scientists gravitate toward models that are overly simple and backed by a ton of literature, because they lack the discernment to make a more informed decision. Models like that may also be easy to defend, since everyone in a data analytics division has heard of a statistical inference system, for example.
  • Deciding on a model because it is faster than every other option you have. Having a model that is fast to train and use is great, but in many cases this is not enough to make it the optimum choice. It is worth considering a more time-consuming model that delivers higher accuracy or a better F1 score. So, choosing a model for its speed alone may not be a great option, unless speed is one of the key requirements of the system you plan to deploy (the sketch after this list shows one way to weigh these factors together).
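
Below is a rough sketch of what weighing several of these factors at once might look like, assuming scikit-learn and synthetic, imbalanced data; the two candidate models are arbitrary examples, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    # Imbalanced synthetic data, where accuracy alone is misleading.
    X, y = make_classification(
        n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=0
    )

    # Compare candidates on more than one axis: accuracy, F1, and speed.
    candidates = [
        ("logistic", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(random_state=0)),
    ]
    for name, model in candidates:
        res = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
        print(name,
              round(res["test_accuracy"].mean(), 3),
              round(res["test_f1"].mean(), 3),
              round(res["fit_time"].mean(), 3))

How you trade these numbers off against each other still depends on the requirements of your project; the point is to look at all of them before committing to a model.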

Dealing with each one of these possibilities will not only help you avoid a mistake related to the data modeling stage, but also increase your confidence in the model you end up selecting.

Value of a Mentor

Having a mentor in your life, especially in the beginning of your career, is a priceless resource. Not only will they be able to help you with general career advice, but they can answer questions you may have about specific technical topics, such as methodological matters in the data science pipeline. In fact, a mentor is probably your best source of information on these matters, especially if they are active in the field and have hands-on knowledge of the craft.

It is important to have specific topics to discuss with your mentor in order to make the most of your time with them. What's more, they may be able to help you develop a positive approach to dealing with the mistakes you make and enable you to go deeper into the ideas behind each one of them. This will also help you develop a holistic understanding of data science and its pipeline.

Some Useful Considerations on Mistakes

Mistakes in the data science process are not something to be ashamed of or to conceal. In fact, they are worth discussing with other people in the field, especially more knowledgeable ones, since they point to exactly what you need to learn the most. No one is perfect, and even practices considered good in data science today may prove sub-optimal or even bad in the future. So, if you start feeling complacent about your work, take it as a sign that you are not putting in enough effort. Data science is not an exact science, and solutions that are acceptable now may not be good enough later. Keeping that in mind will help you avoid all kinds of issues throughout your data science career.

Allowing your thinking about the data science process to become stagnant is possibly the worst kind of mistake a person can make in this field, as data science depends greatly on a flow of ideas, driven by creativity and evaluated through experimentation.

Summary

Mistakes in the data science process are inevitable, but they can be educational, particularly for newcomers in the field. Mistakes are different from programming bugs, as they tend to be more high-level, methodological, and harder to pinpoint.

The most common types of mistakes in the data science process are related to one or more of the following areas:

  • Data engineering, particularly data cleansing
  • Feature creation
  • Sampling
  • Model evaluation
  • Over-fitting
  • Not adhering to the assumptions behind a test or a process

Choosing the right model for a data science project is a process that requires special attention. The main mistakes related to this process are:

  • Selecting a model merely because it is supposed to be good or it is popular among data scientists
  • Using accuracy as the sole determinant of model selection
  • Deciding on a model because it is easier to understand than the other options
  • Using speed as the only criterion for model selection

A mentor can be a great asset in figuring out potential mistakes in the data science process and resolving them effectively, while at the same time learning about the ideas behind them.
