Text-specific feature construction

Until this point, we have been working with categorical and numerical data. Although our categorical data has come in the form of strings, each string has represented a single category. We will now dive deeper into longer-form text data. This form of text data is much more complex than single-category text, because each piece of text is now a series of categories, or tokens.

Before we go any further, let's make sure we have a good understanding of what we mean by text data. Consider a service like Yelp, where users write reviews of restaurants and businesses to share their thoughts on their experience. These reviews, all written in text format, contain a wealth of information that would be useful for machine learning; for example, in predicting the best restaurant to visit.

In general, a large part of how we communicate in today's world is through written text, whether in messaging services, social media, or email. As a result, a great deal can be learned by modeling this text. For example, we can conduct sentiment analysis on Twitter data.

This type of work is referred to as natural language processing (NLP), a field primarily concerned with interactions between computers and humans, specifically with how computers can be programmed to process natural language.

As we've mentioned before, machine learning models require numerical inputs, so we have to be creative and think strategically when we convert text into numerical features. There are several options for doing so; let's get started.
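To make the idea of turning tokens into numbers concrete, here is a minimal sketch of one possible option: counting how often each token appears in a piece of text. It uses only the Python standard library. The helper names (tokenize, build_vocabulary, to_count_vector) and the two sample reviews are invented for illustration and are not part of any library or dataset; the tokenization rule (lowercase, letters and apostrophes only) is also an assumption, and real text usually needs more care.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token across all documents to a column index."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    """Return one count per vocabulary token, in column order."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


# Hypothetical sample reviews, made up for this sketch.
reviews = [
    "The food was great",
    "The service was slow, but the food was great",
]

vocabulary = build_vocabulary(reviews)
vectors = [to_count_vector(review, vocabulary) for review in reviews]

print(list(vocabulary))  # column order of the features
for review, vector in zip(reviews, vectors):
    print(vector, "<-", review)
```

Each review becomes a fixed-length list of integers, one per token in the vocabulary, which is the kind of numerical input a model can accept. Note that this representation discards word order, which is one reason other options are worth considering.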
