Chapter 8
Programming Bugs

Programming bugs are every coder’s nightmare, and since data science is intimately connected to coding, they are also every data scientist’s nightmare. This is because real data science (i.e. the data science done in the real world and in research centers of most universities) involves a lot of coding. Perhaps some data scientists can get by using a few off-the-shelf methods, but with automation becoming more and more widespread, it is doubtful these people will be adding real value for much longer. In other words, if you want to remain relevant in the data science field, getting your hands dirty with coding is a requirement, not a matter of choice. And wherever there is coding, there are programming bugs. You may be able to resolve some of them through some searching on the web, but unfortunately, many of these coding issues require additional work.

In this chapter, we will explore the most common areas where bugs creep up and the different types of bugs that you are bound to encounter. We will examine how you can tackle these issues and what we can learn from them to improve our craft.

The Importance of Understanding and Dealing with Programming Bugs

With all the talk about AI these days and how it is revamping the data science field, you may be tempted to think it is going to solve all of your problems. However, even if AI makes things easier and more efficient, it cannot change the fact that bugs and human errors are still going to be around (unless of course the whole process is outsourced to AI, something we will discuss in the last part of this book). If you believe that AI is going to do away with all your programming problems, you may want to reconsider your viewpoint!

It is good to keep in mind that even the most adept data scientists have bugs in their code from time to time. Being more experienced may allow you to come up with solutions faster, and the quality of these solutions is bound to be higher than that of a newcomer. However, experience will not get rid of all your mistakes when writing code, since many of these mistakes are due to factors beyond your control (such as fatigue and having too many things on your mind). Therefore, coming up with a robust strategy for dealing with these problems is going to be useful, if not essential, for many years to come.

Bugs are not always bad. If you look past the frustration they cause, they can be the source of useful lessons, especially in the early stages of your career. Examining them closely and tackling them with the right attitude can be a great way to learn more about the programming language you are using, the algorithms implemented, and how all of this fits into the data science pipeline. Let’s look into this in more detail, starting with the places where bugs tend to appear.

Places You Usually Find Bugs

Looking at the various places where you are more likely to encounter bugs in your code is an important endeavor. It helps you classify these bugs and gradually gain a better understanding of your strengths and weaknesses in the coding aspect of your work as a data scientist. This is especially useful when you are new to coding and wish to improve your skills quickly.

A place where bugs often flock is variables, particularly when you are new to the programming language. Fortunately, most high-level languages such as Julia and Python are able to adapt the variables’ types so that they are best suited for the values assigned to them in your code. Still, it is not uncommon, even when using such languages, to make mistakes with how you use these variables, leading to exceptions and errors in your scripts. You will always need to be conscious of how you handle variables when you are programming.

Coding structures, such as conditional statements and loops, can also be a nest for bugs. These bugs may be subtle, yet they are just as vexing and can sometimes be seriously problematic, since they do not always throw errors when you run the scripts that contain them.
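As a minimal sketch of a loop bug that raises no error, consider the Python snippet below. The function names and the sample list are hypothetical and serve only as an illustration. The first version edits a list while looping over it, so it quietly skips an element; the second builds a new list instead.

```python
def drop_negatives_buggy(values):
    """Intended to remove negative numbers, but edits the list while looping over it."""
    for v in values:
        if v < 0:
            values.remove(v)  # later items shift left, so the loop skips the next one
    return values


def drop_negatives(values):
    """Builds a new list instead of editing the one being iterated."""
    return [v for v in values if v >= 0]


# Hypothetical sample data: two consecutive negative values.
buggy_result = drop_negatives_buggy([1, -2, -3, 4])  # -3 survives, no error raised
fixed_result = drop_negatives([1, -2, -3, 4])
```

Both calls finish without complaint, which is exactly why this kind of bug is easy to miss unless you inspect the output.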

Functions are complex structures, and as such, they deserve lots of attention. Particularly in modern programming languages such as Scala and Julia, where they play a more important role, functions tend to be a place where bugs creep in. This holds even if these functions are tested individually and work fine in the majority of the cases they are tested on.
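One way a function can pass an individual test and still misbehave is sketched below in Python. The example relies on the language's well-known handling of mutable default arguments; the function names and the feature names are made up for illustration.

```python
def collect_buggy(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def collect(item, bucket=None):
    # Using None as a sentinel gives each call its own fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


# Copies are taken right after each call so the shared list can be inspected.
first_buggy = list(collect_buggy("age"))       # looks correct in isolation
second_buggy = list(collect_buggy("income"))   # also contains "age"

first_ok = collect("age")
second_ok = collect("income")
```

A single test of collect_buggy would pass; the problem only shows up on the second call, which is the kind of situation the paragraph above describes.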

An even more subtle kind of buggy situation appears whenever there are issues with your code’s logic. This is the most difficult situation you will encounter, as bugs in the area of a code’s logic are far more elusive and tend to remain unnoticed until they create issues.

Finally, a bug may come out in a combination of the aforementioned places. Bugs like that are even more challenging to resolve, but may give you valuable insight about the inner workings of the code you use. Now let’s look at the different types of bugs more closely.

Types of Bugs Commonly Encountered

Much like the bugs found in nature, coding bugs vary greatly, with some of them being more frustrating than others. Yet all of them can be dealt with if you understand them properly and learn to identify them when they creep up in your scripts.

First, we have the bugs which are fairly simple and relatively easy to tackle. These bugs have to do with the type of a variable. Since most modern languages are forgiving when it comes to variable types, it is easy to fall into the habit of not paying attention to them. Most data scientists do not care about programming to the extent that trained programmers do, so it is often the case that types are not set properly, resulting in various issues with the corresponding variables. Best case scenario, you lose some of the accuracy of the variables that should have been defined as Floats but were defined as Integers. Worst case scenario, the problem with the variable types goes unnoticed and creates issues later on. If the compiler or interpreter of the language identifies such a problem and throws an exception or an error, it should be a cause for celebration, since at least in that case you become instantly aware of the issue and can remedy it before it causes other, subtler bugs later on in the program.
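The two scenarios above can be sketched in Python as follows. The readings are hypothetical values, assumed to have been read from a text file and therefore still stored as strings.

```python
# Hypothetical values as they might arrive from a text file: still strings.
raw_readings = ["10", "9", "2.5"]

# Unnoticed type problem: strings are compared character by character,
# so "9" counts as larger than "10" and no error is raised.
largest_wrong = max(raw_readings)

# Converting to float first gives the numeric answer.
readings = [float(x) for x in raw_readings]
largest_right = max(readings)

# Lost accuracy: storing the same values as integers drops the fractional part.
as_ints = [int(x) for x in readings]
total_as_ints = sum(as_ints)
total_as_floats = sum(readings)
```

Neither mistake stops the script, so the only way to catch them is to check the types of your variables before relying on the results.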

Indexing bugs are also fairly common, especially if you are uncertain about the dimensionality of the arrays you are trying to access. Sometimes the language you use may not accept Boolean arrays as indexes, resulting in errors. Other times, you may be using a different indexing paradigm than the one the language is designed with. For example, indexes in Julia and in R start with 1 by default, unlike languages such as Python and C, which start with 0. As a result, the index of the last element of an array may also differ from what you expect. For example, a 20 x 1 array in Python (let’s call it A) has indexes ranging from 0 to 19, so trying to access element A[20] will yield an error. To access the last element, you will need to refer to it as A[19] or A[-1]. Moreover, even though negative indexes are acceptable in Python, other languages may not understand them and will throw an error when you attempt to use them.
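The Python behavior described above can be checked with a short sketch. It assumes a plain Python list of 20 items stands in for the array A; the values themselves are arbitrary.

```python
A = list(range(100, 120))  # 20 items, so the valid indexes are 0 to 19

last_by_position = A[19]
last_by_negative = A[-1]        # negative indexes count from the end
last_by_length = A[len(A) - 1]  # portable form that does not need negative indexes

try:
    A[20]                       # one past the end
    out_of_range_caught = False
except IndexError:
    out_of_range_caught = True

# A plain list does not accept a list of Booleans as an index.
mask = [x % 2 == 0 for x in A]
try:
    A[mask]
    mask_rejected = False
except TypeError:
    mask_rejected = True

# One workaround: pair each item with its mask value and filter.
evens = [x for x, keep in zip(A, mask) if keep]
```

Array libraries may treat Boolean masks differently from plain lists, so it is worth checking what the container you are using supports.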

Parameter value issues are another type of bug closer to home when it comes to data science applications. Sometimes these values are not set right, resulting in issues with the functions they correspond to. These issues are not always easy to detect since they do not always translate to errors, so it is best to make sure that whenever you are setting a value for a parameter, you know what values you should choose from for the function to work in a meaningful way. Otherwise, you may end up with results that don’t make much sense or compromise the effectiveness of your models.
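A minimal Python sketch of this kind of bug follows. The functions and the test_fraction parameter are hypothetical, chosen only to show how an out-of-range value can pass silently and how a simple check makes the problem visible.

```python
def split_sizes_unchecked(n_rows, test_fraction):
    """Hypothetical helper: returns (train size, test size) without any checks."""
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


def split_sizes(n_rows, test_fraction):
    """Same helper, but it rejects fractions outside the open interval (0, 1)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1, got %r" % test_fraction)
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


# Passing 20 (meaning "20 percent") instead of 0.2 raises no error here,
# but the resulting training size is negative.
silent_result = split_sizes_unchecked(1000, 20)

try:
    split_sizes(1000, 20)
    bad_value_rejected = False
except ValueError:
    bad_value_rejected = True

valid_result = split_sizes(1000, 0.2)
```

Validating parameters at the top of a function turns a silent, meaningless result into an immediate and readable error.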

Another type of bug has to do with code that never runs (or runs so infrequently that you never get to test it properly under normal circumstances). This is due to the existence of conditionals that are peculiar in the sense that one or more of the conditions present may never (or very rarely) hold true, resulting in whole branches of your code remaining dormant. The code may look fine (i.e. be devoid of obvious bugs), but may not always yield the results you expect. Unfortunately, compilers are generally not sophisticated enough to detect this kind of bug, and the issues it may cause are bound to surface after a long time, possibly after you are done with that part of the project. This can result in delays in your project (if you are fortunate), though it is quite possible that the issues may be much worse (e.g. problematic situations arising when the script is already in production).
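A dormant branch can be as simple as the one in this Python sketch. The labels and thresholds are hypothetical; the point is the order in which the conditions are checked.

```python
def label_buggy(score):
    if score >= 0.5:
        return "high"
    elif score >= 0.8:   # never reached: any score >= 0.8 already satisfied >= 0.5
        return "very high"
    else:
        return "low"


def label(score):
    # Checking the stricter condition first makes every branch reachable.
    if score >= 0.8:
        return "very high"
    elif score >= 0.5:
        return "high"
    else:
        return "low"


buggy_labels = [label_buggy(s) for s in (0.2, 0.6, 0.9)]
fixed_labels = [label(s) for s in (0.2, 0.6, 0.9)]
```

Running the function on at least one input per branch, as done in the last two lines, is a cheap way to find out whether a branch is ever exercised.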

Sometimes, conditionals may result in infinite loops, which are yet another type of bug. These bugs are generally easier to pinpoint, though not any less vexing than the other bugs mentioned in this chapter. Note that infinite loops can be very expensive when it comes to the computing resources they consume, so you need to be careful with this kind of bug. Also, since many scripts take a while to run, especially when you are testing them on a single computer, infinite loops may not be apparent and you may waste not just your computing resources, but a lot of time too (waiting for the script to finish running).
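One common source of infinite loops is a stopping condition that is never met exactly. The Python sketch below assumes a loop that adds 0.1 until it reaches 1.0; the function names, the max_iter cap, and the tolerance are illustrative choices rather than a prescribed method.

```python
def steps_exact(target=1.0, step=0.1, max_iter=1000):
    """Count steps until x equals target exactly; max_iter is a safety cap."""
    x, n = 0.0, 0
    while x != target:  # risky: rounding means x may never equal target exactly
        x += step
        n += 1
        if n >= max_iter:
            raise RuntimeError("no convergence after %d iterations" % max_iter)
    return n


def steps_tolerant(target=1.0, step=0.1, tol=1e-9):
    """Count steps until x is within tol of target, or past it."""
    x, n = 0.0, 0
    while x < target - tol:
        x += step
        n += 1
    return n


try:
    steps_exact()          # without the cap, this call would never return
    cap_triggered = False
except RuntimeError:
    cap_triggered = True

tolerant_steps = steps_tolerant()
```

An iteration cap does not fix the underlying condition, but it converts a silent waste of time and computing resources into an error you can act on.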

It is good to keep in mind that the output of a function often varies depending on its inputs or on other factors (a common situation in languages that support multiple dispatch). Even though a function yields a certain kind of output in the vast majority of cases, it may occasionally yield a completely different one, which can mess up your code if you have not accounted for that possibility. This type of bug is also subtle, so it will not be identified by the compiler ahead of time. Instead, you typically become aware of it only when the unexpected output reaches code that cannot handle it and the corresponding error is raised.
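As a small Python sketch of this situation, the hypothetical function below usually returns an integer but returns None when it finds nothing. The function name and the sample records are assumptions made for the example.

```python
import re


def extract_year(text):
    """Return the first run of four digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d{4}", text)  # re.search returns None when nothing matches
    if match is None:
        return None
    return int(match.group())


# Hypothetical records: the second one contains no year.
records = ["Published 2017", "Date unknown"]
years = [extract_year(r) for r in records]

# Adding 1 to every entry would raise a TypeError on the None value,
# so the caller handles that case explicitly.
next_years = [y + 1 if y is not None else None for y in years]
```

Documenting every kind of value a function can return, and handling each of them at the call site, keeps the rare case from surfacing as an error much later.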

There are also other types of bugs beyond these ones. These other types are more application-specific, and because of this, it is hard to talk about them in any meaningful way. However, they exist, so it is best to be mindful of your code. Things can get very complex very fast as you build more and more scripts that rely on other scripts. Even if the individual pieces of code appear to be simple enough to be bug-free, sometimes just the sheer amount of code you need to run will generate problems you had not anticipated. Therefore, it is advisable to expect this and always budget time for it. This way, when the time comes to debug these scripts, you can do so without getting stressed out.

Some Useful Considerations on Programming Bugs

Although programming bugs are generally a cause for delays and vexing situations, they are part of the package and are what make the scripts valuable, in a way. If everyone could write a program easily and without any issues, no one would want to pay someone to do it and do it well. Avoiding programming is not a solution though, since it is programming that most empowers data science. It is unlikely that you will be able to do much in the field of data science without writing some code. Also, data science algorithms are always evolving. Even if you can perform some processes without having to write your own scripts, chances are that sooner or later you will need to do some coding if you want to remain relevant as a data scientist.

Handling bugs is a skill that you gradually develop. Although it is unlikely that your code will ever be completely devoid of bugs, if you pay close attention to your programming mistakes, you will be able to limit them. As a bonus, this kind of experience can enable you to be a good communicator of the programming mindset and a great troubleshooter, both essential skills in all mentoring endeavors. Hopefully the information in this chapter will enable you to pinpoint and understand the bugs in your programs and gradually come to accept them as issues that need to be tackled, just like problematic data or obscure requirements.

Summary

Programming bugs are a frustrating, yet inevitable situation a data scientist encounters. However, with the right attitude, they can also be useful lessons in terms of the programming language used, the algorithms involved, and the data science pipeline overall.

AI cannot rid the data science domain of bugs altogether.

Programming bugs are typically encountered in the following areas:

  • Variable types. These involve using the wrong type in a variable
  • Coding structures. These are subtle issues that involve more elaborate aspects of coding, such as loops
  • Functions. This is particularly important in cases where the same function is used in different programs, as is often the case in modern programming languages
  • Logic of the code. Bugs in this area are harder to pinpoint, as they involve issues with the algorithm behind the code

The types of bugs that are most often encountered in data science scripts are the following:

  • Simple ones, such as those related to variable types
  • Indexing bugs, involving access to arrays, be it vectors, matrices, or multi-dimensional data structures
  • Bugs related to parameter values. More subtle bugs involving the input of a function, such as inputting a value that is out of range or problematic when in combination with other parameter values
  • Code that never runs, or runs very rarely. A side-effect of conditionals where there is insufficient forethought about the various possibilities they cover
  • Infinite loops. Bugs that have to do with loops that never terminate, wasting computer resources and your time
  • Bugs having to do with the output of a function. When the output of a function is not what you expect and you feed it to some other function or to a model
  • Other bug types. Bugs that are specific to a particular application or framework, beyond the cases covered in this list