Chapter 8. Text Mining and Social Network Analysis

In this chapter, we will cover the following recipes:

  • Creating a categorized corpus
  • Tokenizing news articles in sentences and words
  • Stemming, lemmatizing, filtering, and TF-IDF scores
  • Recognizing named entities
  • Extracting topics with non-negative matrix factorization
  • Implementing a basic terms database
  • Computing social network density
  • Calculating social network closeness centrality
  • Determining the betweenness centrality
  • Estimating the average clustering coefficient
  • Calculating the assortativity coefficient of a graph
  • Getting the clique number of a graph
  • Creating a document graph with cosine similarity

Introduction

Humans have communicated through language for thousands of years. Handwritten texts have been around for ages, and the Gutenberg press was a huge development, but now that we have computers, the Internet, and social media, the volume of text has spiraled out of control.

This chapter will help you cope with the flood of textual and social media information. The main Python libraries we will use are NLTK and NetworkX, both of which offer a large number of features. Install NLTK with either pip or conda as follows:

$ pip install nltk    # or: conda install nltk

The code was tested with NLTK 3.0.2. If you need to download corpora, follow the instructions given at http://www.nltk.org/data.html (retrieved November 2015).

Install NetworkX with either pip or conda, as follows:

$ pip install networkx    # or: conda install networkx

The code was tested with NetworkX 1.9.1.
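
As a quick sanity check after installation, the following minimal sketch reports which versions of NLTK and NetworkX are present, so you can compare them with the versions the code was tested with. The helper name installed_version is hypothetical, and the sketch assumes Python 3.8 or later, because it relies on importlib.metadata from the standard library rather than on anything in NLTK or NetworkX:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as used on the pip/conda command line.
    for name in ("nltk", "networkx"):
        found = installed_version(name)
        print(name, found if found is not None else "not installed")
```

Newer releases of either library may behave differently from the tested versions, so a mismatch is worth noting if a recipe does not run as described.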
