Chapter 9. Fighting Bias

We like to think that machines are more rational than us: heartless silicon applying cold logic. Thus, when computer science introduced automated decision making into the economy, many hoped that computers would reduce prejudice and discrimination. Yet, as we mentioned earlier when looking at mortgage applications and ethnicity, computers are made and trained by humans, and the data that those machines use stems from an unjust world. Simply put, if we are not careful, our programs will amplify human biases.

In the financial industry, anti-discrimination is not only a matter of morality. Take, for instance, the Equal Credit Opportunity Act (ECOA), which came into force in 1974 in the United States. This law explicitly forbids creditors from discriminating against applicants on the basis of race, sex, marital status, and several other attributes. It also requires creditors to inform applicants about the reasons for denial.

The algorithms discussed in this book are discrimination machines. Given an objective, they will find the features on which it is best to discriminate. Yet, as we've discussed, discrimination is not always okay.

While it's okay to target ads for books from a certain country to people who are also from that country, it's usually not okay, and thanks to the ECOA often illegal, to deny a loan to people from a certain country. The financial domain has much stricter rules around discrimination than book sales because financial decisions have a far more severe impact on people's lives.

Equally, discrimination in this context is feature-specific. For example, while it's okay to discriminate against loan applicants based on their history of repaying loans, it's not okay to do so based on their country of origin, unless there are sanctions against that country or similar overarching laws in place.

Throughout this chapter, we'll discuss the following:

  • Where bias in machines comes from
  • The legal implications of biased machine learning (ML) models
  • How observed unfairness can be reduced
  • How models can be inspected for bias and unfairness
  • How causal modeling can reduce bias
  • How unfairness is a complex systems failure that needs to be addressed in non-technical ways

The algorithms discussed in this book are feature extraction algorithms. Even if regulated features are omitted, an algorithm might infer them from proxy features and then discriminate based on them anyway. As an example of this, ZIP codes can be used to predict race reasonably well in many cities in the United States. Therefore, omitting regulated features is not enough when it comes to combating bias.
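To make the proxy problem concrete, the sketch below checks how well a protected attribute can be predicted from the features a model actually sees. The file name and column names (applications.csv, race, loan_approved) are hypothetical placeholders; the idea is that if a classifier recovers the protected attribute far above the majority-class baseline, dropping the column has not removed the information.

```python
# A minimal proxy check: can the protected attribute be predicted from
# the features we actually feed the model? Column and file names here
# are hypothetical; adapt them to your own dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("applications.csv")  # hypothetical loan application data

protected = df["race"]                                   # attribute we must not use
features = df.drop(columns=["race", "loan_approved"])    # the model's inputs
features = pd.get_dummies(features)                      # one-hot encode categoricals such as ZIP code

X_train, X_test, y_train, y_test = train_test_split(
    features, protected, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Compare against always predicting the most common group. If the proxy
# accuracy is far above this baseline, the remaining features leak the
# protected attribute.
preds = clf.predict(X_test)
baseline = y_test.value_counts(normalize=True).max()
print(f"Proxy accuracy: {accuracy_score(y_test, preds):.2f} "
      f"(majority-class baseline: {baseline:.2f})")
```

A check like this does not tell you how to fix the leakage, but it is a quick way to find out whether features such as ZIP code are standing in for a regulated attribute.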

Sources of unfairness in machine learning

As we have discussed many times throughout this book, models are a function of the data that they are trained on. Generally speaking, more data leads to smaller errors. By definition, there is less data on minority groups, simply because there are fewer people in those groups.

This disparate sample size can lead to worse model performance for the minority group; this increased error is often known as a systematic error. The model may overfit to the majority group, so that the relationships it finds do not apply to the minority group. Since there is little minority group data, these mistakes are barely penalized during training.

Imagine you are training a credit scoring model, and the clear majority of your data comes from people living in lower Manhattan, while a small minority comes from people living in rural areas. Manhattan housing is much more expensive, so the model might learn that you need a very high income to buy an apartment. Rural housing is much cheaper by comparison, but because the model is largely trained on Manhattan data, it might deny loan applications from rural applicants simply because they tend to have lower incomes than their Manhattan peers, even though those incomes are perfectly adequate for rural housing costs.
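The following small simulation, with entirely invented numbers and group labels, shows how this plays out: a single model fit mostly on the "urban" group draws its decision boundary where the urban data lives, and scores far worse on the underrepresented "rural" group.

```python
# Synthetic illustration of sample-size bias: approval depends on income
# relative to a group-specific cost-of-living threshold, but one group
# dominates the training data. All numbers are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, income_mean, threshold):
    """Simulate n applicants whose approval depends on income
    exceeding the group's cost-of-living threshold."""
    income = rng.normal(income_mean, 15, size=n)
    approved = (income > threshold).astype(int)
    return income.reshape(-1, 1), approved

# 10,000 "urban" applicants vs. only 300 "rural" applicants
X_urban, y_urban = make_group(10_000, income_mean=120, threshold=100)
X_rural, y_rural = make_group(300, income_mean=60, threshold=45)

X = np.vstack([X_urban, X_rural])
y = np.concatenate([y_urban, y_rural])

# One model for everyone: the decision boundary settles near the urban
# threshold, because that is where almost all of the training loss comes from.
model = LogisticRegression(max_iter=1000).fit(X, y)

print("Urban accuracy:", model.score(X_urban, y_urban))
print("Rural accuracy:", model.score(X_rural, y_rural))
```

Running a sketch like this, the urban accuracy stays high while the rural accuracy collapses, because most rural applicants sit below the income boundary the model learned from Manhattan-style data.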

Aside from sample size issues, the data itself can be biased. "Raw data," in a sense, does not exist: data does not appear naturally; it is measured by humans using human-made measurement protocols, which can themselves be biased in many different ways.

Such biases include sampling bias, as in the Manhattan housing example, and measurement bias, where the measurement does not capture what it is intended to capture, or even systematically disadvantages one group.

Another source is pre-existing social bias. It is visible in word vectors such as Word2Vec, where the vector that maps father to doctor in latent space also maps mother to nurse, and the vector from man to computer programmer maps woman to homemaker. This happens because sexism is encoded within the written language of our sexist society: historically, doctors have mostly been men and nurses mostly women, and tech companies' diversity statistics show that far more computer programmers are men than women. These biases get encoded into the models trained on that text.
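You can probe such analogies yourself with gensim's pre-trained Word2Vec vectors. The snippet below is only an illustration: the word2vec-google-news-300 model is a large download, phrase tokens such as computer_programmer use underscores in that vocabulary, and the exact neighbours returned may differ from the published examples.

```python
# Probing social bias in pre-trained word vectors with gensim's
# analogy arithmetic. Results depend on the model and may vary.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # large download (~1.6 GB)

# father : doctor :: mother : ?
print(vectors.most_similar(positive=["doctor", "mother"],
                           negative=["father"], topn=3))

# man : computer_programmer :: woman : ?
print(vectors.most_similar(positive=["computer_programmer", "woman"],
                           negative=["man"], topn=3))
```

The point of such a probe is not the specific neighbours it returns, but that gendered associations present in the training text show up directly in the geometry of the embedding space.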
