D.1. Data selection and avoiding bias

Data selection and feature engineering are fraught with the hazards of bias (in human terms). Once you’ve baked your own biases into your algorithm by choosing a particular set of features, the model will fit to those biases and produce biased results. If you’re lucky enough to discover this bias before going to production, it can require a significant amount of effort to undo. If you change your tokenizer’s vocabulary, for example, your entire pipeline must be rebuilt and retrained to take advantage of the new vocabulary. You have to start over.

One example is the data and feature selection for the famous Word2vec model. Word2vec was trained on a vast corpus of news articles, and roughly one million N-grams from that corpus were chosen as the vocabulary (features) for the model. This produced a model that excited data scientists and linguists with the possibility of doing math on word vectors, such as “king - man + woman = queen.” But as researchers dug deeper, more problematic relationships revealed themselves in the model.

For example, for the expression “doctor - father + mother = nurse,” the answer “nurse” wasn’t the unbiased and logical result that they’d hoped for. A gender bias was inadvertently trained into the model. Similar racial, religious, and even geographic regional biases are prevalent in the original Word2vec model. The Google researchers didn’t create these biases intentionally. The bias is inherent in the data, the statistics of word usage in the Google News corpus they trained Word2vec on.
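
If you want to reproduce these analogies yourself, the sketch below shows how they are typically computed: vector arithmetic followed by a nearest-neighbor search. It assumes the gensim library and its downloadable copy of the Google News Word2vec vectors, neither of which is part of the original example, and the download is large (on the order of 1.6 GB).

    # A minimal sketch of word-vector analogy arithmetic, assuming gensim
    # and the pretrained Google News vectors distributed via gensim-data.
    import gensim.downloader as api

    wv = api.load('word2vec-google-news-300')  # returns a KeyedVectors object

    # "king - man + woman": add and subtract vectors, then find the nearest word
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

    # The biased analogy discussed above: "doctor - father + mother"
    print(wv.most_similar(positive=['doctor', 'mother'], negative=['father'], topn=1))

The most_similar call excludes the query words themselves from the candidates, which is why the result comes back as a single nearby word such as “nurse” rather than “doctor” again.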

Many of the news articles simply had cultural biases because they were written by journalists motivated to keep their readers happy. And those journalists were writing about a world with institutional biases, about real-world events and people who are themselves biased. The word usage statistics in Google News merely reflect that many more mothers are nurses than doctors, and many more fathers are doctors than nurses. The Word2vec model is just giving us a window into the world we have created.

Fortunately, models like Word2vec don’t require labeled training data. So you have the freedom to choose any text you like to train your model. You can choose a dataset that is more balanced, more representative of the beliefs and inferences that you would like your model to make. And when others hide behind the algorithms to say that they’re only doing what the model tells them, you can share with them your datasets that more fairly represent a society where we aspire to provide everyone with equal opportunity.
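
Because no labels are needed, retraining on a corpus of your own choosing is straightforward. The sketch below assumes gensim 4.x and a hypothetical plain-text file, my_balanced_corpus.txt, standing in for whatever text you decide fairly represents your users.

    # A sketch of training Word2vec on a corpus you choose yourself (gensim 4.x).
    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # One sentence per line in the corpus file; tokenize each line
    with open('my_balanced_corpus.txt') as f:
        sentences = [simple_preprocess(line) for line in f]

    model = Word2Vec(
        sentences,
        vector_size=300,  # dimensionality of the word vectors
        window=5,         # context window size
        min_count=5,      # ignore rare tokens
        workers=4,        # parallel training threads
    )
    model.wv.save('my_word_vectors.kv')  # reusable KeyedVectors for your pipeline

The resulting vectors can then be swapped into the rest of your pipeline in place of the pretrained Google News vectors.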

As you’re training and testing your models, you can rely on your innate sense of fairness to help you decide when a model is ready to make predictions that affect the lives of your customers. If your model treats all of your users the way you would like to be treated, you can sleep well at night. It can also help to pay particularly close attention to the needs of users who are unlike you, especially those who are typically disadvantaged by society. And if you need more formal justification for your actions, you can learn more about statistics, philosophy, ethics, psychology, behavioral economics, and anthropology to augment the computer science skills you’ve learned in this book.

As a natural language processing practitioner and machine learning engineer, you have an opportunity to train machines to do better than many humans do. Your bosses and colleagues aren’t going to tell you which documents to add or remove from your training set. You have the power to influence the behavior of machines that shape communities and society as a whole.

We’ve given you some ideas about how to assemble a dataset that’s less biased and more fair. Now we’ll show you how to fit your models to that unbiased data so that they’re also accurate and useful in the real world.
