Visualization using matplotlib

matplotlib is one of the most popular plotting packages available today, among a handful of alternatives, and its capabilities are now widely appreciated by the Python community. John Hunter, the creator and original project leader of the package, summed it up well: matplotlib tries to make easy things easy and hard things possible. You can generate publication-quality graphs with very little effort. In this section, we will pick a few interesting examples to illustrate the power of matplotlib.

Word clouds

Word clouds (also called tag clouds or weighted word lists) give greater prominence to words that appear more frequently in a given text. You can tweak word clouds with different fonts, layouts, and color schemes. A word's frequency maps visually to the size at which it is drawn: the word rendered largest in the visualization is the one that occurs most often in the text.
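Under the hood, this is just frequency counting. A minimal sketch with Python's collections.Counter (the sample sentence is made up for illustration) shows the mapping:

from collections import Counter

text = "data beats opinions and data beats intuition because data scales"
counts = Counter(text.split())

# The most frequent word would be drawn largest in a word cloud
for word, n in counts.most_common(2):
    print word, n
# data 3
# beats 2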

Beyond this straightforward mapping of size to frequency, word clouds have several useful applications in social media and marketing. Some of the applications are as follows:

  • Businesses can get to know their customers and how customers view their products. Some organizations have used a very creative method of asking their fans or followers to post words about what they think of the brand, then gathering all these words into a word cloud to better understand the most common impressions of the product brand.
  • Businesses can learn about their competitors by identifying a brand with a popular online presence and creating a word cloud from its content, to better understand which words and themes hook that product's target market.

In order to create a word cloud, you can write the Python code yourself or use something that already exists. Andreas Mueller from the NYU Center for Data Science created a simple and easy-to-use word cloud generator in Python. It can be installed with the instructions given in the next section.

Installing word clouds

For faster installation, you can just use pip with sudo access, as shown in the following code:

sudo pip install git+git://github.com/amueller/word_cloud.git

Alternatively, you can obtain the package via wget on Linux or curl on Mac OS with the following code:

wget https://github.com/amueller/word_cloud/archive/master.zip
unzip master.zip
rm master.zip 
cd word_cloud-master 
sudo pip install -r requirements.txt

For the Anaconda distribution, you will have to install it using conda, as the following three steps show:

#step-1 command
conda install wordcloud

Fetching package metadata: ....
Error: No packages found in current osx-64 channels matching: wordcloud

You can search for this package on Binstar with
# This only means one has to search the source location
binstar search -t conda wordcloud

Run 'binstar show <USER/PACKAGE>' to get more details:
Packages:
                          Name | Access       | Package Types   | 
     ------------------------- | ------------ | --------------- |
             derickl/wordcloud | public       | conda           |
Found 1 packages

# step-2 command
binstar show derickl/wordcloud

Using binstar api site https://api.binstar.org
Name:    wordcloud
Summary:
Access:  public
Package Types:  conda
Versions:
   + 1.0

To install this package with conda run:
conda install --channel https://conda.binstar.org/derickl wordcloud

# step-3 command
conda install --channel https://conda.binstar.org/derickl wordcloud

Fetching package metadata: ......
Solving package specifications: .
Package plan for installation in environment /Users/MacBook/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cython-0.22                |           py27_0         2.2 MB
    django-1.8                 |           py27_0         3.2 MB
    pillow-2.8.1               |           py27_1         454 KB
    image-1.3.4                |           py27_0          24 KB
    setuptools-15.1            |           py27_1         435 KB
    wordcloud-1.0              |       np19py27_1          58 KB
    conda-3.11.0               |           py27_0         167 KB
    ------------------------------------------------------------
                                           Total:         6.5 MB

The following NEW packages will be INSTALLED:
    django:     1.8-py27_0
    image:      1.3.4-py27_0
    pillow:     2.8.1-py27_1
    wordcloud:  1.0-np19py27_1

The following packages will be UPDATED:
    conda:      3.10.1-py27_0 --> 3.11.0-py27_0
    cython:     0.21-py27_0   --> 0.22-py27_0
    setuptools: 15.0-py27_0   --> 15.1-py27_1



The following packages will be DOWNGRADED:

    libtiff:    4.0.3-0       --> 4.0.2-1

Proceed ([y]/n)? y

Input for word clouds

In this section, there will be two sources from which you can extract words to construct word clouds. The first example shows how to extract text from the web feeds of some well-known websites, specifically from the descriptions of their feed entries. The second example shows how to extract text from tweets with the help of search keywords. The two examples need the feedparser and tweepy packages respectively; by following steps similar to those mentioned for the other packages previously, you can easily install them.

Our approach will be to collect words from both these examples and use them as the input for a common word cloud program.

Web feeds

Most news and technology websites today offer well-grouped and structured RSS or Atom feeds. Although our aim is to restrict the context to technology alone, we can determine a handful of feed lists, as shown in the following code. In order to parse these feeds, feedparser's parse() method comes in handy. The wordcloud package has its own stopword list, but in addition to this, we can also apply our own while collecting the data, as shown here (mystopwords here is far from complete, but you can gather more from any known resource on the Internet):

import feedparser
from os import path
import re

d = path.dirname(__file__)
mystopwords = [  'test', 'quot', 'nbsp']

feedlist = ['http://www.techcrunch.com/rssfeeds/',
            'http://www.computerweekly.com/rss',
            'http://feeds.twit.tv/tnt.xml',
            'https://www.apple.com/pr/feeds/pr.rss',
            'https://news.google.com/?output=rss',
            'http://www.forbes.com/technology/feed/',
            'http://rss.nytimes.com/services/xml/rss/nyt/Technology.xml',
            'http://www.nytimes.com/roomfordebate/topics/technology.rss',
            'http://feeds.webservice.techradar.com/us/rss/reviews',
            'http://feeds.webservice.techradar.com/us/rss/news/software',
            'http://feeds.webservice.techradar.com/us/rss',
            'http://www.cnet.com/rss/',
            'http://feeds.feedburner.com/ibm-big-data-hub?format=xml',
            'http://feeds.feedburner.com/ResearchDiscussions-DataScienceCentral?format=xml',
            'http://feeds.feedburner.com/BdnDailyPressReleasesDiscussions-BigDataNews?format=xml',
            'http://feeds.feedburner.com/ibm-big-data-hub-galleries?format=xml',
            'http://feeds.feedburner.com/PlanetBigData?format=xml',
            'http://rss.cnn.com/rss/cnn_tech.rss',
            'http://news.yahoo.com/rss/tech',
            'http://slashdot.org/slashdot.rdf',
            'http://bbc.com/news/technology/']

def extractPlainText(ht):
    # Strip HTML tags with a simple state machine: s == 1 while we
    # are inside a tag, s == 0 while we are in plain text
    plaintxt = ''
    s = 0
    for char in ht:
        if char == '<':
            s = 1
        elif char == '>':
            s = 0
            plaintxt += ' '
        elif s == 0:
            plaintxt += char
    return plaintxt
    
def separatewords(text):
    # Split on runs of non-word characters and keep lowercased
    # words longer than three characters
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if len(s) > 3]
    
def combineWordsFromFeed(filename):
    with open(filename, 'w') as wfile:
        for feed in feedlist:
            print "Parsing " + feed
            fp = feedparser.parse(feed)
            for e in fp.entries:
                txt = e.title.encode('utf8') + \
                      extractPlainText(e.description.encode('utf8'))
                words = separatewords(txt)
                for word in words:
                    # skip pure numbers and our extra stopwords
                    if word.isdigit() == False and word not in mystopwords:
                        wfile.write(word)
                        wfile.write(" ")
                wfile.write("\n")
    return

combineWordsFromFeed("wordcloudInput_FromFeeds.txt")
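As a quick sanity check of the two helper functions, you can run them on a small handwritten HTML snippet (the sample string here is made up):

sample = "<p>Big <b>data</b> tools and data pipelines</p>"
print extractPlainText(sample)
# ' Big  data  tools and data pipelines '
print separatewords(extractPlainText(sample))
# ['data', 'tools', 'data', 'pipelines']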

The Twitter text

In order to access the Twitter API, you will need the access token and consumer credentials, which consist of four parameters: access_token, access_token_secret, consumer_key, and consumer_secret. In order to obtain these keys, you will need a Twitter account. The steps involved are documented on the Twitter website and summarized here:

  1. Log in to the Twitter account.
  2. Navigate to developer.twitter.com and use Manage My Apps to follow through and obtain the parameters mentioned before.

Assuming that these parameters are ready, you can access tweets via Python with the tweepy package. The following code shows a simple custom stream listener. Here, as the tweets are streamed, the listener receives each status and writes its text to a file; this file can be used later to create word clouds.

The stream uses a filter to narrow the Twitter text to topics focused on Python programming, data visualization, big data, machine learning, and statistics. The tweepy stream provides the tweets as they are extracted. This could run forever because there is unlimited data available out there, so how do we make it stop? The accessing speed may be slower than you would expect, and for the purposes of creating a word cloud, extracting a certain number of tweets is sufficient. We therefore set a limit, MAX_TWEETS, to 500 tweets, as shown in the following code:

import tweepy
import json
import sys
import codecs

counter = 0
MAX_TWEETS = 500

#Variables that contains the user credentials to access Twitter API 
access_token = "Access Token"
access_token_secret = "Access Secret"
consumer_key = "Consumer Key"
consumer_secret = "Consumer Secret"

fp = codecs.open("filtered_tweets.txt", "w", "utf-8")

class CustomStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        global counter
        fp.write(status.text)
        print "Tweet-count:" + str(counter)
        counter += 1
        if counter >= MAX_TWEETS:
            sys.exit()

    def on_error(self, status):
        print status

if __name__ == '__main__':

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    streaming_api = tweepy.streaming.Stream(auth,
               CustomStreamListener(), timeout=60)

    streaming_api.filter(track=['python program', 'statistics', 
             'data visualization', 'big data', 'machine learning'])
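Note that calling sys.exit() from inside the callback is abrupt. As an alternative sketch, tweepy also stops the stream when a callback returns False, so the same limit can be enforced more gracefully (CountingStreamListener is our own illustrative name, not part of tweepy):

class CountingStreamListener(tweepy.StreamListener):
    """A variant that stops cleanly after MAX_TWEETS tweets."""
    def __init__(self):
        super(CountingStreamListener, self).__init__()
        self.counter = 0

    def on_status(self, status):
        fp.write(status.text)
        self.counter += 1
        # Returning False from a tweepy callback disconnects the stream
        return self.counter < MAX_TWEETS

    def on_error(self, status):
        print status
        return False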

Using any bag of words, you can write fewer than twenty lines of Python code to generate a word cloud. A word cloud generates an image, and using matplotlib.pyplot's imshow(), you can display it. The following word cloud code can be used with any input file of words:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from os import path

d = path.dirname(__file__)
text = open(path.join(d, 'filtered_tweets.txt')).read()

wordcloud = WordCloud(
    font_path='/Users/MacBook/kirthi/RemachineScript.ttf',
    stopwords=STOPWORDS,
    background_color='#222222',
    width=1000,
    height=800).generate(text)

# Open a plot of the generated image.
plt.figure(figsize=(13,13))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
(Screenshot: the word cloud generated from the tweet text.)
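If you want the image as a file rather than an on-screen figure, the WordCloud object can also write it out directly with its to_file() method (the output filename here is our own choice):

# Save the rendered word cloud as a PNG next to the input file
wordcloud.to_file(path.join(d, 'wordcloud_tweets.png'))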

The required font file can be downloaded from any of a number of sites (one specific resource for this font is available at http://www.dafont.com/remachine-script.font). Wherever the font file is located, you will have to set font_path to that exact path. To use the data from feeds instead, only one line changes, as shown in the following code:

text = open(path.join(d, 'wordcloudInput_fromFeeds.txt')).read()
(Screenshot: the word cloud generated from the web feeds data.)

Using a similar idea of extracting text from tweets to create word clouds, you could extract text within the context of mobile phone vendors, with keywords such as iPhone, Samsung Galaxy, Amazon Fire, LG Optimus, Nokia Lumia, and so on, to determine consumer sentiment. In this case, you will need an additional set of information, namely the positive and negative sentiment values associated with words.
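Reusing the streaming code from the previous section, only the filter keywords change; for example (the keyword list is illustrative):

streaming_api.filter(track=['iPhone', 'Samsung Galaxy', 'Amazon Fire',
                            'LG Optimus', 'Nokia Lumia'])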

There are a few approaches you can follow for sentiment analysis on tweets in a restricted context. First, a very naïve approach is to associate a positive-sentiment weight wp and a negative-sentiment weight wn with each word, and to compute the probability of a positive sentiment, p(+), and of a negative sentiment, p(-), as follows:

p(+) = Σwp / (Σwp + Σwn)    and    p(-) = Σwn / (Σwp + Σwn)
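A minimal sketch of this naïve weighting, with a made-up four-word lexicon (both the words and the weights here are purely illustrative):

# Hypothetical lexicon: word -> (positive weight wp, negative weight wn)
weights = {'love': (1.0, 0.0), 'great': (0.8, 0.0),
           'slow': (0.0, 0.7), 'broken': (0.0, 1.0)}

def naive_sentiment(words):
    wp = sum(weights.get(w, (0, 0))[0] for w in words)
    wn = sum(weights.get(w, (0, 0))[1] for w in words)
    if wp + wn == 0:
        return 0.5, 0.5   # no sentiment-bearing words found
    return wp/(wp + wn), wn/(wp + wn)   # p(+), p(-)

print naive_sentiment(['great', 'phone', 'but', 'slow'])
# (0.533..., 0.466...)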

The second approach would be to use a natural language processing tool and apply trained classifiers to obtain better results. TextBlob is a text processing package that also has sentiment analysis (http://textblob.readthedocs.org/en/dev).

TextBlob lets you build a text classification system: you create a training set in the JSON format and then, using this training data and the Naïve Bayes classifier, it performs the sentiment analysis. We will attempt to use this tool in later chapters to demonstrate our working examples.
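As a quick taste of the TextBlob route (assuming the package has been installed, for example, with pip install textblob), its default analyzer returns a polarity score between -1 and 1 and a subjectivity score between 0 and 1:

from textblob import TextBlob

blob = TextBlob("The camera is great but the battery life is terrible")
# sentiment is a namedtuple: Sentiment(polarity, subjectivity)
print blob.sentiment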

Plotting the stock price chart

The two biggest stock exchanges in the U.S. are the New York Stock Exchange (NYSE), founded in 1792, and the NASDAQ, founded in 1971. Today, most stock market trades are executed electronically, and even the stocks themselves are almost always held in electronic form rather than as physical certificates. Apart from NASDAQ and NYSE, numerous other websites also provide real-time stock price data.

Obtaining data

One of the websites from which you can obtain data is Yahoo, which provides it via an API. For example, to obtain the stock price (low, high, open, close, and volume) of Amazon, the URL is http://chartapi.finance.yahoo.com/instrument/1.0/amzn/chartdata;type=quote;range=3y/csv. Depending on the plotting method you select, some data conversion is required; for instance, the data obtained from this resource stores dates as plain yyyymmdd numbers with no delimiters, as shown in the following output:

uri:/instrument/1.0/amzn/chartdata;type=quote;range=3y/csv
ticker:amzn
Company-Name:Amazon.com, Inc.
Exchange-Name:NMS
unit:DAY
timestamp:
first-trade:19970516
last-trade:20150430
currency:USD
previous_close_price:231.9000
Date:20120501,20150430
labels:20120501,20120702,20121001,20130102,20130401,20130701,20131001,20140102,20140401,20140701,20141001,20150102,20150401
values:Date,close,high,low,open,volume
close:208.2200,445.1000
high:211.2300,452.6500
low:206.3700,439.0000
open:207.4000,443.8600
volume:984400,23856100
20120501,230.0400,232.9700,228.4000,229.4000,6754900
20120502,230.2500,231.4400,227.4000,227.8200,4593400
20120503,229.4500,232.5300,228.0300,229.7400,4055500
...
...
20150429,429.3700,434.2400,426.0300,426.7500,3613300
20150430,421.7800,431.7500,419.2400,427.1100,3609700
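A minimal sketch of fetching this CSV in Python and keeping only the data rows (assuming the Yahoo endpoint above is still reachable; header lines are recognized by not starting with a yyyymmdd date):

import urllib2

url = ('http://chartapi.finance.yahoo.com/instrument/1.0/'
       'amzn/chartdata;type=quote;range=3y/csv')
response = urllib2.urlopen(url).read()

rows = []
for line in response.strip().split('\n'):
    fields = line.split(',')
    # Data rows have six comma-separated fields starting with a date
    if len(fields) == 6 and fields[0].isdigit():
        rows.append(fields)

print rows[0]   # ['20120501', '230.0400', '232.9700', ...]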

We will discuss three approaches in creating the plots. Each one has its own advantages and limitations.

In the first approach, with the matplotlib.cbook package and the pylab package, you can create a plot with the following lines of code:

from pylab import plotfile, show, gca
import matplotlib.cbook as cbook

fname = cbook.get_sample_data('/Users/MacBook/stocks/amzn.csv',
                              asfileobj=False)

# plot the date against the high, low, and close columns on one axes
plotfile(fname, ('date', 'high', 'low', 'close'), subplots=False)
show()

This will create a plot similar to the one shown in the following screenshot:

(Screenshot: the resulting plot of Amazon's high, low, and close prices.)

There is one additional programming effort required before attempting to plot using this approach: the date values have to be reformatted, so that 20150430 becomes a %d-%b-%Y style string such as 30-Apr-2015 (a small conversion helper is sketched after the next code block). With this approach, the plot can also be split into two, one showing the stock price and the other showing the volume, as shown in the following code:

from pylab import plotfile, show, gca
import matplotlib.cbook as cbook

fname = cbook.get_sample_data('/Users/MacBook/stocks/amzn.csv',
                              asfileobj=False)

# plot columns 1 (price) and 5 (volume) against column 0 (date) in
# separate subplots, drawing the volume as a bar chart
plotfile(fname, (0, 1, 5), plotfuncs={5: 'bar'})
show()
(Screenshot: the price and volume drawn as two subplots.)
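The date-conversion helper mentioned above is a one-liner with Python's datetime; a sketch (apply it to the date column before writing the CSV that plotfile() reads):

import datetime

def formatDate(yyyymmdd):
    # '20150430' -> '30-Apr-2015'
    d = datetime.datetime.strptime(yyyymmdd, '%Y%m%d')
    return d.strftime('%d-%b-%Y')

print formatDate('20150430')   # 30-Apr-2015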

The second approach is to use the matplotlib.mlab and matplotlib.finance subpackages. matplotlib.finance has convenient methods to fetch the stock data from http://ichart.finance.yahoo.com/table.csv?s=GOOG&a=04&b=12&c=2014&d=06&e=20&f=2015&g=d; to show just a sample, here is a code snippet:

ticker='GOOG'

import matplotlib.finance as finance
import matplotlib.mlab as mlab
import datetime

startdate = datetime.date(2014,4,12)
today = enddate = datetime.date.today()

fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)   
r = mlab.csv2rec(fh); fh.close()
r.sort()
print r[:2]

[ (datetime.date(2014, 4, 14), 538.25, 544.09998, 529.56, 532.52002, 2568000, 532.52002)  (datetime.date(2014, 4, 15), 536.82001, 538.45001, 518.46002, 536.44, 3844500, 536.44)]

When you plot a stock price comparison, it does not make sense to display the volume information, because the volumes differ from one stock ticker to another and the chart becomes too cluttered to read.
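One way to keep a multi-ticker price comparison readable is to normalize each series by its first value, so that every line starts at 1.0 and shows relative performance; a sketch using the record array fetched above:

import matplotlib.pyplot as plt

# Normalizing by the first adjusted close puts tickers with very
# different price levels on the same relative scale
normalized = r.adj_close / r.adj_close[0]
plt.plot(r.date, normalized, label=ticker)
plt.legend(loc='upper left')
plt.show()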

matplotlib already has a working example to plot the stock chart, which is elaborate enough and includes Relative Strength Indicator (RSI) and Moving Average Convergence/Divergence (MACD), and is available at http://matplotlib.org/examples/pylab_examples/finance_work2.html. For details on RSI and MACD, you can find many resources online, but there is one interesting explanation at http://easyforextrading.co/how-to-trade/indicators/.

In an attempt to reuse the existing code, modify it, and make it work for multiple charts, we created a function called plotTicker(). It plots each ticker within the same axes, as shown in the following code:

import datetime
import numpy as np

import matplotlib.finance as finance
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

startdate = datetime.date(2014,4,12)
today = enddate = datetime.date.today()

plt.rc('axes', grid=True)
plt.rc('grid', color='0.75', linestyle='-', linewidth=0.5)
rect = [0.4, 0.5, 0.8, 0.5]

fig = plt.figure(facecolor='white', figsize=(12,11))

axescolor = '#f6f6f6' # the axes background color

ax = fig.add_axes(rect, axisbg=axescolor)
ax.set_ylim(10,800)

def plotTicker(ticker, startdate, enddate, fillcolor):
  """
     matplotlib.finance has fetch_historical_yahoo() which fetches 
     stock price data the url where it gets the data from is 
     http://ichart.yahoo.com/table.csv stores in a numpy record 
     array with fields: 
      date, open, high, low, close, volume, adj_close
  """

  fh = finance.fetch_historical_yahoo(ticker, startdate, enddate) 
  r = mlab.csv2rec(fh); 
  fh.close()
  r.sort()

  ### use the adjusted close, which removes the impact of splits
  ### and dividends
  prices = r.adj_close

  ### plot the price and volume data
  
  ax.plot(r.date, prices, color=fillcolor, lw=2, label=ticker)
  ax.legend(loc='upper right', shadow=True, fancybox=True)

  # set the labels rotation and alignment 
  for label in ax.get_xticklabels():
    # To display date label slanting at 30 degrees
    label.set_rotation(30)
    label.set_horizontalalignment('right')

  ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')

#plot the tickers now
plotTicker('BIDU', startdate, enddate, 'red')
plotTicker('GOOG', startdate, enddate, '#1066ee')
plotTicker('AMZN', startdate, enddate, '#506612')

plt.show()

When you use this to compare the stock prices of Baidu, Google, and Amazon, the plot will look similar to the following screenshot:

(Screenshot: the stock price comparison of Baidu, Google, and Amazon.)

Use the following code to compare the stock prices of Twitter, Facebook, and LinkedIn:

plotTicker('TWTR', startdate, enddate, '#c72020')
plotTicker('LNKD', startdate, enddate, '#103474')
plotTicker('FB', startdate, enddate, '#506612')
(Screenshot: the stock price comparison of Twitter, LinkedIn, and Facebook.)

Now, you can add the volume plot as well. For a single ticker plot with volume, use the following code:

import datetime

import matplotlib.finance as finance
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

startdate = datetime.date(2013,3,1)
today = enddate = datetime.date.today()

rect = [0.1, 0.3, 0.8, 0.4]   

fig = plt.figure(facecolor='white', figsize=(10,9))  
ax = fig.add_axes(rect, axisbg='#f6f6f6')

def plotSingleTickerWithVolume(ticker, startdate, enddate):
    
    global ax

    fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)
    
    # a numpy record array with fields: 
    #     date, open, high, low, close, volume, adj_close
    r = mlab.csv2rec(fh); 
    fh.close()
    r.sort()
    
    plt.rc('axes', grid=True)
    plt.rc('grid', color='0.78', linestyle='-', linewidth=0.5)
    
    axt = ax.twinx()
    prices = r.adj_close

    fcolor = 'darkgoldenrod'

    ax.plot(r.date, prices, color='#1066ee', lw=2, label=ticker)
    ax.fill_between(r.date, prices, 0, facecolor='#BBD7E5')
    # start the price axis at half the maximum price
    ax.set_ylim(0.5*prices.max())

    ax.legend(loc='upper right', shadow=True, fancybox=True)
    
    volume = (r.close*r.volume)/1e6  # dollar volume in millions
    vmax = volume.max()
   
    axt.fill_between(r.date, volume, 0, label='Volume', 
                 facecolor=fcolor, edgecolor=fcolor)

    axt.set_ylim(0, 5*vmax)
    axt.set_yticks([])
    
    for axis in ax, axt:  
        for label in axis.get_xticklabels():
            label.set_rotation(30)
            label.set_horizontalalignment('right')
    
        axis.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')

plotSingleTickerWithVolume('MSFT', startdate, enddate)
plt.show()

With a single ticker plotted along with its volume using the preceding code, the plot will look similar to the following screenshot:

(Screenshot: the MSFT price plot with the dollar volume shaded underneath.)

The third approach is to use the blockspring package. In order to install blockspring, use the following pip command:

pip install blockspring

Blockspring's approach is to generate HTML: it autogenerates the data for the plots in JavaScript, and when this is integrated with D3.js, it produces a very nice interactive plot. Amazingly, it takes only a couple of lines of code:

import blockspring 
import json  

print blockspring.runParsed("stock-price-comparison", 
   { "tickers": "FB, LNKD, TWTR", 
   "start_date": "2014-01-01", "end_date": "2015-01-01" }).params

Depending on the operating system, when this code is run, it writes the generated HTML code to a default location, which you can then open in a browser.

(Screenshot: the interactive stock price comparison rendered with D3.js.)