Chapter 13
Data Science Ethics

We saw in the previous chapter that AI can facilitate the data science process by automating much of it. However, even if some parts of the pipeline become automated, certain aspects of data science will remain untouched. They cannot be fully automated due to their non-mechanical nature. Ethics is one part of the process that is currently beyond automation.

In this chapter, we will examine various aspects of data science ethics, such as why it is important, the role of confidentiality (mainly privacy, data anonymization, and data security), as well as licensing matters. Using ethics in our practices elevates the role of the data scientist and enables us to offer something more than interesting insights and pretty products.

The Importance of Ethics in Data Science

Ethics is not something that is just nice to have, as some people think, especially those in the technical professions. In fact, it can be more significant than the actual analytics work that we are requested to undertake, especially when it comes to matters of privacy, security, and other potential liabilities that often outweigh the potential benefit from harvesting the data at hand. One of the key aspects of ethics is that it enables constructive and mutually beneficial relationships to come about in every organization. In addition, data science can become dangerous without some ethical foundation behind it. What’s worse, in some cases it does. This is especially true when there is sensitive data involved, such as financial, medical, or other kinds of personal data. Ethics is like a fail-safe, keeping data science in check when it comes to these kinds of situations.

These days, anyone can take a course in data science, read a book or two, watch a few videos, play around with a few datasets, and get the basics of the data science craft down. Although this is great, it does not make someone a data science professional. With ethics, however, all this skill can be put to good use, making the difference between a professional data scientist and one who just possesses the relevant know-how.

Confidentiality Matters

Confidentiality means keeping information accessible only to the people that really need to know it. Although this is often associated with encryption, a process for turning comprehensible information into gibberish in order to keep the information inaccessible, confidentiality involves more than just that. In the world of digital information, confidentiality is a very valuable asset, which unfortunately does not get the attention it needs in the context of data science. As data scientists, we need to take an ethical approach to confidentiality much the same as a doctor or a lawyer, especially when dealing with sensitive data. Doing otherwise is without a doubt unethical.

The parts of confidentiality that are most relevant to the data science field are privacy, data anonymization, and data security. In this section, we will look at each one of these concepts as they correspond to data science, and learn about how we can take them into account so that our work remains ethical.

Privacy

Protecting data from outsiders involves many different processes. One of the most important is privacy. Privacy is key in data science, especially in projects where sensitive data is involved, since a privacy failure can easily “break” an organization. Even companies that have a good reputation and have gained their clients’ trust can lose everything if there is a privacy issue in their data. Take for example the case of Yahoo. Management blunders aside, Yahoo’s data privacy was severely compromised, which led to the loss of trust and respect from clients and society at large. Data exposed included names, email addresses, phone numbers, hashed passwords, and more, for over 500 million user accounts.

Ensuring privacy in the data handled in a data science project should always be kept in mind. If the data is being processed inside a company, this should not be an issue, as there are usually specialized professionals ensuring that everything inside the office space is private and secure. In the cases when you wish to work from home or have to be on the road for a business trip, the best and most secure option would be to use a virtual private network (VPN) or a TCP tunneling technique for connecting to the servers where the data is.

This is because sensitive data tends to be stored on private servers. Once data leaves those servers, its privacy could be compromised. The worst part is that if this happens, you will probably not be aware of it at the time. Unlike in the movies, hackers in the real world do not leave witty messages on the computers they gain access to, even though some of the more amateur ones may accidentally leave some kind of trail. Whatever the case, it would be best to make it as hard as possible for them to access your data. If it takes too much effort to compromise your data’s privacy, they will probably move on to their next target.

One thing to keep in mind as far as privacy is concerned is that you ought to think of the worst-case scenario in advance. This can be a great motivator and guide for the lengths you will need to go to in order to ensure that all data you use remains private throughout the duration of your project. Also, this can help you anticipate the vulnerabilities of your process and ensure that no private data is compromised.

Finally, it is important to remember that it is not just data that needs to be kept private, but metadata (data about the data) too. Also, someone’s privacy can be compromised not only with a single piece of data (e.g. their social security number), but also with a combination of things, such as a medical condition, a location, and their demographic makeup.
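To make the point about combinations concrete, here is a minimal sketch in Python that counts how many records share each combination of attributes. The field names and records are made up for illustration. A combination that occurs only once can single out a person even when no individual field does.

```python
from collections import Counter

# Hypothetical records: no names or ID numbers, only seemingly harmless fields.
records = [
    {"condition": "asthma", "zip": "98101", "age_group": "30-39"},
    {"condition": "asthma", "zip": "98101", "age_group": "30-39"},
    {"condition": "diabetes", "zip": "98101", "age_group": "60-69"},
    {"condition": "diabetes", "zip": "98052", "age_group": "30-39"},
    {"condition": "asthma", "zip": "98052", "age_group": "60-69"},
]

def unique_combinations(rows, fields):
    """Return the combinations of `fields` that occur in exactly one row."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Every value of every single field is shared by at least two people...
single_field = unique_combinations(records, ["zip"])
# ...yet some combinations of the three fields belong to one person only.
combined = unique_combinations(records, ["condition", "zip", "age_group"])
```

In this toy dataset no single field identifies anyone, but three of the four distinct combinations are unique, so those three people could be re-identified by anyone who knows the three facts about them.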

Data Anonymization

Good confidentiality also means making sure your data is anonymous. In other words, all personally identifiable information (PII) needs to be removed or hidden so that it is not possible for anyone to find the people behind the data points analyzed. Data anonymization not only helps mitigate the risk of the data being abused by third parties, but also removes any temptation you may have to abuse it yourself.

Data anonymization makes data useless to anyone who gains access to it and hopes to exploit the people behind it. This way, the data is useful only for your data analytics projects, through the patterns it has as a whole. Each data point on its own is practically useless. This kind of confidentiality is essential in the finance industry, where payment data is common. It matters just as much outside finance: if you are working for a company that deals in online transactions and your projects involve credit card data, you have to pay attention to data anonymization.

If you have to use the variables containing sensitive information in your models, you can try replacing their values with hashes. This way, the uniqueness of their values will be maintained, and the actual hashes will be meaningless to everyone accessing them. You can think of hashing as a transformation that is easy to do in one direction but extremely time-consuming, if not impossible, to reverse. Reversing a well-designed hash is about as hard as breaking an encryption code.

Since you do not want to take any risks when anonymizing these variables, it is a good practice to apply some “salt” in the hashing process, to ensure that it is even harder to break. The salt is usually a few random characters added to every data point, and it ensures a much stronger level of anonymization.
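The salted hashing described above can be sketched with Python's standard library as follows. This is an illustration under stated assumptions, not a vetted anonymization scheme: the function name `pseudonymize`, the single project-wide `SALT`, and the sample card numbers (well-known test values) are all choices made for this example.

```python
import hashlib
import secrets

# One secret salt for the whole project (an assumption of this sketch).
# Reusing it keeps equal inputs mapped to equal hashes, so the column can
# still be used for grouping or joining. Store it apart from the data.
SALT = secrets.token_bytes(16)

def pseudonymize(value: str, salt: bytes = SALT) -> str:
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

# Made-up test values standing in for a sensitive column.
card_numbers = ["4111111111111111", "5500000000000004", "4111111111111111"]
hashed = [pseudonymize(n) for n in card_numbers]
```

Equal card numbers receive equal hashes and different ones receive different hashes, so the variable remains usable in a model. Without the salt, anyone could hash every plausible card number and compare the results, which is why the salt needs to stay secret. The standard library's `hmac` module offers a keyed variant of the same idea.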

Similar to the privacy aspect of confidentiality, when dealing with data anonymization, you ought to consider the worst thing that could happen if the data you anonymize is leaked. This way you will have an accurate estimate of how much time you should dedicate to the whole process and ensure that you take the right steps to keep all sensitive data anonymous.

Data Security

Data security is another part of confidentiality, and it is probably the one most widely used, even outside the data science field. If you have bought something from an online store, or have accessed your bank account through the web or an app, you have used a form of data security, even if you were unaware of it. Without data security, all online transfers of information would be extremely risky and impractical.

The main methods used for securing data stored on your own computer are encryption and steganography. The former has to do with turning the data into gibberish, as we mentioned previously, while the latter is all about hiding it in plain sight by inserting it into some usually large data file, such as an image, an audio clip, or even a video. You can use both of these methods in conjunction for extra security (i.e. encrypt the data and then apply steganography to it).
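As an illustration of the steganography half of this, the sketch below hides a short message in the least significant bit of each byte of a "cover" buffer, a common textbook technique that is assumed here rather than prescribed by the chapter. The cover is a plain run of bytes standing in for raw pixel values, and the function names `hide` and `reveal` are made up for this example. Encryption is not shown, since Python's standard library has no general-purpose cipher; in practice the payload would be encrypted with a vetted library before being hidden.

```python
def hide(cover: bytes, payload: bytes) -> bytearray:
    """Write each payload bit into the lowest bit of successive cover bytes."""
    # Payload bits, most significant bit of each byte first.
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this payload")
    stego = bytearray(cover)
    for pos, bit in enumerate(bits):
        stego[pos] = (stego[pos] & 0xFE) | bit  # clear the low bit, then set it
    return stego

def reveal(stego: bytes, n_bytes: int) -> bytes:
    """Read back n_bytes of payload from the lowest bits of the stego data."""
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for bit_pos in range(8):
            byte = (byte << 1) | (stego[i * 8 + bit_pos] & 1)
        out.append(byte)
    return bytes(out)

cover = bytes(range(256))   # stand-in for raw pixel values
secret = b"key=42"          # would be ciphertext in a real setting
stego = hide(cover, secret)
recovered = reveal(stego, len(secret))
```

Each cover byte changes by at most one, which in an image would be invisible to the eye. That is the sense in which the data is hidden in plain sight. Note that the receiver needs to know the payload length, which this sketch passes explicitly.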

When it comes to security beyond your computer’s hard drive, you have to take additional precautions. This is because in most cases your computer can be accessed through the Internet if certain ports are left open. Keeping ports open can be useful at times (e.g. for software updates), but it is a common liability that is favored by black-hat hackers. So, keeping vital ports in your computer closed when you don’t use them is a good way to keep hackers at bay. Usually a good firewall program can help you manage that easily.
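A quick way to see which ports on your own machine accept connections is to try connecting to them. The following is a minimal sketch using Python's `socket` module; the helper names `is_port_open` and `open_ports` and the list of ports are assumptions made for this example, and it is no substitute for a firewall or a proper audit. Only run checks like this against machines you are responsible for.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0

def open_ports(host: str, ports, timeout: float = 1.0):
    """Return the subset of `ports` that accept TCP connections on `host`."""
    return [p for p in ports if is_port_open(host, p, timeout)]

if __name__ == "__main__":
    # Check a few commonly used ports on the local machine only.
    print(open_ports("127.0.0.1", [22, 80, 443, 8888]))
```

Any port that shows up in the result and that you do not recognize is worth investigating and, if it is not needed, closing.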

Naturally, it is also important to have secure software on your computer and especially a secure operating system (OS). This is particularly important for whatever programs you have set up to run on the cloud (e.g. APIs). Although certain OSes are more secure than others, how secure your computer is depends on how well you secure it, regardless of the OS you have. Even the most secure OS is vulnerable to hackers if it is not set up properly. For this kind of security, it would be best to consult a network engineer or a white-hat hacker.

Finally, storing important data is something that every data scientist has to deal with in their day-to-day work, so it is important that it is done properly. Whether it is passwords, data, or code, everything needs to be stored in a secure location, preferably in an encrypted format. Remember that any programming code you produce is part of your organization’s intellectual property, so it should be treated as an asset. The passwords are best kept in a password database, such as KeePass (KeePassX for Linux systems) or LastPass. Also, all important files are better off backed up in a remote location. Backing things up is something that needs to take place on a regular basis, which is why many back-up programs offer an automation mechanism for this.

If you apply these security pointers, your data is far more likely to remain safe. In case this seems like overkill, remember that it only takes one security breach to jeopardize a company’s assets and potentially its reputation. Security matters are not only part of data science ethics, but also of your organization’s integrity.

Licensing Matters

Let us now examine licensing a bit, a topic that usually doesn’t get any attention in data science. Even though we often do not pay much attention to copyright when using programs and content we encounter on the web when it comes to personal use, infringement of copyright is a serious issue, especially when the copyrighted material is used commercially. Therefore, the ethical approach to this matter is to pay close attention whenever handling any material with the © symbol.

Keep in mind that even data can be under copyright if it is proprietary, so using it for a data science project may require a certain kind of licensing. This is why you must be extra careful when scraping data from the web. The data in that case may be there for viewing, but not for any other use.

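When it comes to open-source software, copyright still applies, but the software is released under a license that grants permission to use it, usually free of charge. Some of these licenses are copyleft licenses, such as the GNU General Public License (GPL), which require derivative works to be distributed under the same terms; Creative Commons (CC) licenses play a similar role for content other than software. Some freely available software is not licensed for commercial purposes, so read the license before relying on it. Also, just because something is free now does not mean that it is going to be free in the future.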

In addition, if you make an innovation, it is a good idea to check for existing patents to minimize the risk of getting sued by some other inventor. This is particularly important if you plan to use that innovation commercially, which is what patents are for.

Finally, if you make use of images in your projects (e.g. as part of a presentation or a GUI for a data product) make sure that they are under a license that permits your intended use, such as a suitable CC license. If no licensing information is available for a given image, always assume that you will need to get permission before using it. Even if the owner of the graphic has no issue with you using it, the ethical way to approach it is to ask for permission and document their response.

Other Ethical Matters

Beyond these basic aspects of data science ethics, there are other things that are also important. These are not specific to data science, as they have to do with professional ethics in general. For example, being able to meet deadlines is an important ethical matter, especially when dealing with time-sensitive projects, as is often the case in data science. Also, making sure that everything is documented and passed on to other members of the team is essential in order to perform data science properly. Maintaining an objective stance regarding experiments is another issue of ethics that is paramount when it comes to testing hypotheses. After all, the excessive pressure of publishing papers that characterizes academia is non-existent in data science.

Some Final Considerations on Ethics

Ethics is often confused with morality, and although related, they are not the same. For starters, morality is internal and relates to a set of principles or values as well as a sense of right and wrong, while ethics is external and has to do with a set of behaviors and attitudes. Also, even if morality may take many years to develop, ethics is always within reach. This is because ethics is external: even though it often stems from morality, it can exist independently.

Beyond the duality of ethics and morality, there are several other things related to ethics that are worth mentioning. For example, ethics is a matter of personal priorities. As such, it may not be asked of you directly or checked afterward. However, it is still expected of you, especially if you are in a responsible position in an organization, or you are branding yourself as a stand-alone data science consultant.

Summary

Ethics is a part of the data science profession that cannot be automated and which adds a lot of value to the process, even if it is not usually perceived immediately. Ethics in data science involves the following:

  • Confidentiality – making sure the data is accessed only by the people who are supposed to access it. It involves privacy, data anonymization, and data security.
  • Licensing – handling copyright matters and ensuring that no one is sued for using external material and data in your projects.
  • Privacy – an essential part of confidentiality, related to keeping data accessible only to those who need to access it. This involves not just data but also metadata and anything that can reveal a person’s identity through a piece of data or a combination of things.
  • Data anonymization – changing data so that the people behind it cannot be identified.
  • Data security – keeping data safe from external hazards, such as hackers and unpredictable catastrophes.

Ethics is different from morality, although they are interlinked. Morality is an internal matter related to one’s values, while ethics is an external matter, related to one’s attitude and the manifestation of certain moral principles.

Ethics is one of the key differentiators between a professional and an amateur, especially in the data science field.
