Predicting Movie Reviews Using NLP and Spark Streaming

In this chapter, we will take an in-depth look at the field of Natural Language Processing (NLP), not to be confused with Neuro-Linguistic Programming! NLP helps analyze raw textual data and extract useful information such as sentence structure, the sentiment of a text, or even a translation of the text into another language. Since many sources of data contain raw text (for example, reviews, news articles, and medical records), NLP is becoming more and more popular: it provides insight into the text and makes automated decisions easier.

Under the hood, NLP often uses machine learning algorithms to extract and model the structure of text. The power of NLP is much more visible when it is applied in the context of another machine learning method, where, for example, text can represent one of the input features.

In this chapter, we will apply NLP to analyze the sentiment of movie reviews. Based on annotated training data, we will build a classification model that distinguishes between positive and negative movie reviews. It is important to mention that we do not extract sentiment directly from the text (based on words such as love, hate, and so on), but instead use the binary classification approach that we explored in the previous chapter.

In order to accomplish this, we will take raw movie reviews that have been manually scored beforehand and train an ensemble (a set of models) in the following steps:

  1. Process the movie reviews to synthesize the features for our model.

Here, we will look at the various features we can create from text data, ranging from a bag-of-words approach to a weighted bag-of-words (for example, TF-IDF), and then briefly introduce the word2vec algorithm, which we will cover in detail in Chapter 5, Word2vec for Prediction and Clustering.

Alongside this, we will look at some basic ways of feature selection/omission, such as removing stopwords and punctuation, and stemming.
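To make these feature ideas concrete before we move on, here is a minimal sketch in plain Python (not Spark). It removes punctuation and stopwords, builds a bag-of-words, and then weights it with TF-IDF. The toy reviews, the stopword list, and the function names are assumptions made for illustration, and the smoothed IDF formula used here is only one common variant. Stemming is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
REVIEWS = [
    "I loved this movie, the acting was great!",
    "The plot was dull and the acting was worse.",
    "Great movie, great cast.",
]
STOPWORDS = {"i", "the", "this", "was", "and", "a", "an"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def bag_of_words(tokens):
    """Map each term to its raw count within one document."""
    return Counter(tokens)


def tf_idf(corpus_tokens):
    """Weight raw term counts by a smoothed inverse document frequency.

    Assumed variant: idf(t) = log((N + 1) / (df(t) + 1)), where N is the
    number of documents and df(t) the number of documents containing t.
    """
    n_docs = len(corpus_tokens)
    doc_freq = Counter(
        term for tokens in corpus_tokens for term in set(tokens)
    )
    weighted = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        weighted.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return weighted


tokens_per_review = [tokenize(r) for r in REVIEWS]
weights = tf_idf(tokens_per_review)
```

Notice that a term appearing in most reviews (such as "great" in this toy corpus) receives a lower weight per occurrence than a term appearing in only one review (such as "loved"), which is exactly the effect the weighted bag-of-words is meant to achieve.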

  2. Using the generated features, we will train several supervised binary classification algorithms to classify reviews as positive or negative, which include the following:
    • Classification decision tree
    • Naive Bayes
    • Random forest
    • Gradient boosted trees
  3. Leveraging the combined predictive power of the four different learning algorithms, we will create a super-learner model, which takes the four "guesses" of the models as meta-features to train a deep neural network that outputs a final prediction.
  4. Finally, we will create a Spark machine learning pipeline for this process, which does the following:
    • Extracts features from new movie reviews
    • Comes up with a prediction
    • Outputs this prediction inside of a Spark streaming application (yes, this is your first machine learning application, and you will build one in every chapter for the remainder of this book!)
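The super-learner idea from the steps above can be illustrated with a small sketch in plain Python. It shows only how the four base-model "guesses" become one row of meta-features. The four functions below are hypothetical hand-written stand-ins, not trained models, and the names `meta_features` and `build_meta_dataset` are assumptions made for this illustration. Training the final deep neural network on these rows is not shown.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
BaseModel = Callable[[Features], float]


# Hypothetical stand-ins for the four trained base models. Each maps a
# bag-of-words feature dict to a score in [0, 1] for "positive review".
def tree_stub(f: Features) -> float:
    return 1.0 if f.get("great", 0.0) > 0 else 0.0


def bayes_stub(f: Features) -> float:
    pos = f.get("great", 0.0) + f.get("loved", 0.0)
    neg = f.get("dull", 0.0) + f.get("worse", 0.0)
    return (pos + 1.0) / (pos + neg + 2.0)


def forest_stub(f: Features) -> float:
    return 0.0 if f.get("dull", 0.0) > 0 else 1.0


def gbt_stub(f: Features) -> float:
    return 1.0 if f.get("loved", 0.0) >= f.get("worse", 0.0) else 0.0


BASE_MODELS: List[BaseModel] = [tree_stub, bayes_stub, forest_stub, gbt_stub]


def meta_features(models: List[BaseModel], features: Features) -> List[float]:
    """Collect one prediction per base model into a meta-feature row."""
    return [model(features) for model in models]


def build_meta_dataset(
    models: List[BaseModel],
    rows: List[Features],
    labels: List[int],
) -> List[Tuple[List[float], int]]:
    """Pair each meta-feature row with its known label for the meta-learner."""
    return [(meta_features(models, row), label)
            for row, label in zip(rows, labels)]


rows = [{"great": 2.0}, {"dull": 1.0, "worse": 1.0}]
labels = [1, 0]
meta_dataset = build_meta_dataset(BASE_MODELS, rows, labels)
```

Each row in `meta_dataset` has exactly four inputs, one per base model, regardless of how many text features the base models consumed. In practice, the meta-features are usually computed on data the base models did not see during training, so that the final model does not simply learn their overfitting.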

If this sounds a tad ambitious, take heart! We will step through each one of these tasks in a manner that is both methodical and purposeful so that you can have the confidence to build your own NLP application; but first, a little background history and some theory behind this exciting field.
