matplotlib has been a popular plotting package alongside a few others available today, and its capability is now being realized by the Python community. John Hunter, the creator and project leader of the package, summed it up like this: matplotlib tries to make easy things easy and hard things possible. You can generate very high-quality, publication-ready graphs with very little effort. In this section, we will pick a few interesting examples to illustrate the power of matplotlib.
Word clouds give greater prominence to words that appear more frequently in a given text. They are also called tag clouds or weighted words. You can tweak word clouds with different fonts, layouts, and color schemes. The number of occurrences of a word maps visually to the size at which it is displayed: the word that appears largest in the visualization is the one that occurs most often in the text.
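The frequency-to-size mapping can be sketched in a few lines of Python. The sample sentence and the linear scaling rule below are illustrative assumptions, not part of any word cloud library:

```python
from collections import Counter

def font_sizes(text, min_size=10, max_size=60):
    """Scale each word's font size linearly with its frequency."""
    words = text.lower().split()
    counts = Counter(words)
    top = counts.most_common(1)[0][1]  # highest frequency seen
    return {w: min_size + (max_size - min_size) * c // top
            for w, c in counts.items()}

sizes = font_sizes("data data data plot plot cloud")
# 'data' occurs most often, so it gets the largest size (60)
```

A real word cloud layout engine also packs the words spatially, but the size rule is essentially this proportional mapping.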
Beyond this obvious mapping to word frequency, word clouds have several useful applications in social media and marketing.
In order to create a word cloud, you can write the Python code yourself or use something that already exists. Andreas Mueller from the NYU Center for Data Science created a simple and easy-to-use word cloud package in Python. It can be installed with the instructions given in the next section.
For faster installation, you can just use pip with sudo access, as shown in the following code:
sudo pip install git+git://github.com/amueller/word_cloud.git
Alternatively, you can obtain the package via wget on Linux or curl on Mac OS with the following code:
wget https://github.com/amueller/word_cloud/archive/master.zip
unzip master.zip
rm master.zip
cd word_cloud-master
sudo pip install -r requirements.txt
For the Anaconda IDE, you will have to install it using conda with the following three steps:
# step-1 command
conda install wordcloud

Fetching package metadata: ....
Error: No packages found in current osx-64 channels matching: wordcloud

You can search for this package on Binstar with
# This only means one has to search the source location
binstar search -t conda wordcloud

Run 'binstar show <USER/PACKAGE>' to get more details:
Packages:
     Name                  | Access       | Package Types   |
     ------------------------- | ------------ | --------------- |
     derickl/wordcloud     | public       | conda           |
Found 1 packages

# step-2 command
binstar show derickl/wordcloud

Using binstar api site https://api.binstar.org
Name:    wordcloud
Summary:
Access:  public
Package Types:  conda
Versions:
   + 1.0

To install this package with conda run:
conda install --channel https://conda.binstar.org/derickl wordcloud

# step-3 command
conda install --channel https://conda.binstar.org/derickl wordcloud

Fetching package metadata: ......
Solving package specifications: .
Package plan for installation in environment /Users/MacBook/anaconda:

The following packages will be downloaded:
    package               | build
    ----------------------|-----------------
    cython-0.22           | py27_0       2.2 MB
    django-1.8            | py27_0       3.2 MB
    pillow-2.8.1          | py27_1       454 KB
    image-1.3.4           | py27_0        24 KB
    setuptools-15.1       | py27_1       435 KB
    wordcloud-1.0         | np19py27_1    58 KB
    conda-3.11.0          | py27_0       167 KB
    ------------------------------------------------------------
                                    Total: 6.5 MB

The following NEW packages will be INSTALLED:
    django:     1.8-py27_0
    image:      1.3.4-py27_0
    pillow:     2.8.1-py27_1
    wordcloud:  1.0-np19py27_1

The following packages will be UPDATED:
    conda:      3.10.1-py27_0 --> 3.11.0-py27_0
    cython:     0.21-py27_0   --> 0.22-py27_0
    setuptools: 15.0-py27_0   --> 15.1-py27_1

The following packages will be DOWNGRADED:
    libtiff:    4.0.3-0 --> 4.0.2-1

Proceed ([y]/n)? y
In this section, there will be two sources from which you can extract words to construct word clouds. The first example shows how to extract text from the web feeds of some known websites and how to extract the words from their descriptions. The second example shows how to extract text from tweets with the help of search keywords. The two examples will need the feedparser package and the tweepy package, which you can install by following steps similar to those shown for other packages previously.
Our approach will be to collect words from both these examples and use them as the input for a common word cloud program.
Most news and technology service websites today provide well-grouped and structured RSS or Atom feeds. Although our aim is to restrict the context to technology alone, we can determine a handful of feed lists, as shown in the following code. In order to parse these feeds, the parse() method of feedparser comes in handy. Word cloud has its own stopwords list, but in addition to this, we can also use one while collecting the data, as shown here (the stopwords list here is not complete, but you can gather more from any known resource on the Internet):
import feedparser
from os import path
import re

d = path.dirname(__file__)

mystopwords = ['test', 'quot', 'nbsp']

feedlist = ['http://www.techcrunch.com/rssfeeds/',
    'http://www.computerweekly.com/rss',
    'http://feeds.twit.tv/tnt.xml',
    'https://www.apple.com/pr/feeds/pr.rss',
    'https://news.google.com/?output=rss',
    'http://www.forbes.com/technology/feed/',
    'http://rss.nytimes.com/services/xml/rss/nyt/Technology.xml',
    'http://www.nytimes.com/roomfordebate/topics/technology.rss',
    'http://feeds.webservice.techradar.com/us/rss/reviews',
    'http://feeds.webservice.techradar.com/us/rss/news/software',
    'http://feeds.webservice.techradar.com/us/rss',
    'http://www.cnet.com/rss/',
    'http://feeds.feedburner.com/ibm-big-data-hub?format=xml',
    'http://feeds.feedburner.com/ResearchDiscussions-DataScienceCentral?format=xml',
    'http://feeds.feedburner.com/BdnDailyPressReleasesDiscussions-BigDataNews?format=xml',
    'http://feeds.feedburner.com/ibm-big-data-hub-galleries?format=xml',
    'http://feeds.feedburner.com/PlanetBigData?format=xml',
    'http://rss.cnn.com/rss/cnn_tech.rss',
    'http://news.yahoo.com/rss/tech',
    'http://slashdot.org/slashdot.rdf',
    'http://bbc.com/news/technology/']

def extractPlainText(ht):
    # strip HTML tags, keeping only the text between them
    plaintxt = ''
    s = 0
    for char in ht:
        if char == '<':
            s = 1
        elif char == '>':
            s = 0
            plaintxt += ' '
        elif s == 0:
            plaintxt += char
    return plaintxt

def separatewords(text):
    # split on non-word characters and keep words longer than three letters
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if len(s) > 3]

def combineWordsFromFeed(filename):
    with open(filename, 'w') as wfile:
        for feed in feedlist:
            print "Parsing " + feed
            fp = feedparser.parse(feed)
            for e in fp.entries:
                txt = e.title.encode('utf8') + \
                      extractPlainText(e.description.encode('utf8'))
                words = separatewords(txt)
                for word in words:
                    if word.isdigit() == False and word not in mystopwords:
                        wfile.write(word)
                        wfile.write(" ")
                wfile.write(" ")

combineWordsFromFeed("wordcloudInput_FromFeeds.txt")
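The tag-stripping and word-splitting logic above can be checked on a small sample before pointing it at live feeds. A minimal sketch of the same logic (the HTML snippet is made up for illustration):

```python
import re

def extract_plain_text(ht):
    """Drop everything between '<' and '>' and keep the rest."""
    plaintxt, inside_tag = '', False
    for char in ht:
        if char == '<':
            inside_tag = True
        elif char == '>':
            inside_tag = False
            plaintxt += ' '
        elif not inside_tag:
            plaintxt += char
    return plaintxt

def separate_words(text, min_len=3):
    """Lowercase words longer than min_len characters."""
    return [s.lower() for s in re.split(r'\W+', text) if len(s) > min_len]

sample = '<p>Big <b>Data</b> keeps growing</p>'
words = separate_words(extract_plain_text(sample))
# → ['data', 'keeps', 'growing']  ('Big' is dropped: only 3 letters)
```

Note how the length filter discards short, mostly uninteresting words before they ever reach the word cloud.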
In order to access the Twitter API, you will need an access token and consumer credentials that consist of four parameters: access_token, access_token_secret, consumer_key, and consumer_secret. In order to obtain these keys, you will have to use a Twitter account; the steps involved in obtaining them are described on the Twitter website.
Assuming that these parameters are ready, with the tweepy package, you can access tweets via Python. The following code displays a simple custom stream listener. Here, as the tweets are streamed, the listener receives each status and writes its text to a file, which can be used later to create word clouds.

The stream uses a filter to narrow down the Twitter text to topics such as the Python program, data visualization, big data, machine learning, and statistics. The tweepy stream provides the tweets as they are extracted. This could run forever because there is unlimited data available out there, so how do we make it stop? The accessing speed may be slower than you would expect, and for the purpose of creating a word cloud, extracting a certain number of tweets is probably sufficient. We therefore set a limit, MAX_TWEETS, to 500, as shown in the following code:
import tweepy
import sys
import codecs

counter = 0
MAX_TWEETS = 500

# Variables that contain the user credentials to access the Twitter API
access_token = "Access Token"
access_token_secret = "Access Secret"
consumer_key = "Consumer Key"
consumer_secret = "Consumer Secret"

fp = codecs.open("filtered_tweets.txt", "w", "utf-8")

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        global counter
        fp.write(status.text)
        print "Tweet-count:" + str(counter)
        counter += 1
        if counter >= MAX_TWEETS:
            sys.exit()

    def on_error(self, status):
        print status

if __name__ == '__main__':
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    streaming_api = tweepy.streaming.Stream(auth, CustomStreamListener(),
                                            timeout=60)
    streaming_api.filter(track=['python program', 'statistics',
        'data visualization', 'big data', 'machine learning'])
Using any bag of words, you can write fewer than 20 lines of Python code to generate a word cloud. A word cloud generates an image, and using matplotlib.pyplot, you can call imshow() to display the word cloud image. The following word cloud program can be used with any input file of words:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from os import path

d = path.dirname(__file__)
text = open(path.join(d, 'filtered_tweets.txt')).read()

wordcloud = WordCloud(
    font_path='/Users/MacBook/kirthi/RemachineScript.ttf',
    stopwords=STOPWORDS,
    background_color='#222222',
    width=1000, height=800).generate(text)

# Open a plot of the generated image.
plt.figure(figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
The required font file can be downloaded from any of a number of sites (one specific resource for this font is available at http://www.dafont.com/remachine-script.font). Wherever the font file is located, you will have to set font_path to that exact path. To use the data from feeds instead, only one line changes, as shown in the following code:
text = open(path.join(d, 'wordcloudInput_FromFeeds.txt')).read()
Using a similar idea of extracting text from tweets to create word clouds, you could extract text within the context of mobile phone vendors with keywords such as iPhone, Samsung Galaxy, Amazon Fire, LG Optimus, Nokia Lumia, and so on, to determine the sentiments of consumers. In this case, you will need an additional set of information, namely the positive and negative sentiment values associated with words.
There are a few approaches that you can follow for sentiment analysis on tweets in a restricted context. First, a very naïve approach is to associate a positive weight wp and a negative weight wn with each word, and to denote the probability of a positive sentiment as p(+) and of a negative sentiment as p(-).
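A sketch of this naïve weighting follows; the tiny lexicon (the words and their wp/wn weights) is invented purely for illustration, and p(+) is computed as the share of positive weight in the total weight seen:

```python
# invented toy lexicon: word -> (positive weight wp, negative weight wn)
LEXICON = {
    'love':  (1.0, 0.0),
    'great': (0.8, 0.0),
    'slow':  (0.0, 0.6),
    'awful': (0.0, 1.0),
}

def naive_sentiment(tweet):
    """Return p(+), the share of positive weight in the tweet's words."""
    wp = wn = 0.0
    for word in tweet.lower().split():
        pos, neg = LEXICON.get(word, (0.0, 0.0))
        wp += pos
        wn += neg
    total = wp + wn
    return wp / total if total else 0.5  # no known words: call it neutral

p_plus = naive_sentiment("love the camera but awful battery")
# wp = 1.0, wn = 1.0, so p(+) = 0.5
```

The obvious weakness is that this ignores negation and context ("not great" scores as positive), which is exactly why the trained-classifier approach described next gives better results.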
The second approach is to use a natural language processing tool and apply trained classifiers to obtain better results. TextBlob is a text processing package that also offers sentiment analysis (http://textblob.readthedocs.org/en/dev).
TextBlob builds a text classification system and creates a training set in the JSON format. Later, using this training set and the Naïve Bayes classifier, it performs sentiment analysis. We will attempt to use this tool in later chapters to demonstrate our working examples.
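The idea underlying such a classifier — count word frequencies per class and score new text with Bayes' rule — can be sketched in plain Python. This is not TextBlob's implementation; the tiny training set is invented, and only add-one smoothing is shown:

```python
import math
from collections import defaultdict

class TinyNaiveBayes(object):
    def __init__(self):
        self.word_counts = {'pos': defaultdict(int), 'neg': defaultdict(int)}
        self.totals = {'pos': 0, 'neg': 0}

    def train(self, text, label):
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.totals[label] += 1

    def classify(self, text):
        scores = {}
        for label in ('pos', 'neg'):
            # sum log-probabilities with add-one smoothing
            score = 0.0
            for word in text.lower().split():
                count = self.word_counts[label][word] + 1
                score += math.log(float(count) / (self.totals[label] + 1))
            scores[label] = score
        return max(scores, key=scores.get)

nb = TinyNaiveBayes()
nb.train("i love this phone", 'pos')
nb.train("what a great display", 'pos')
nb.train("battery life is awful", 'neg')
nb.train("this phone is terrible", 'neg')

label = nb.classify("great phone")
# → 'pos'
```

A real classifier adds prior probabilities, better tokenization, and a much larger training set, but the scoring step is essentially this sum of log-likelihoods.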
The two biggest stock exchanges in the U.S. are the New York Stock Exchange (NYSE), founded in 1792, and the NASDAQ, founded in 1971. Today, most stock market trades are executed electronically; even the stocks themselves are almost always held in electronic form rather than as physical certificates. Apart from NASDAQ and NYSE, numerous other websites also provide real-time stock price data.
One of the websites from which you can obtain data is Yahoo, which provides it via an API; for example, to obtain the stock price (low, high, open, close, and volume) of Amazon, the URL is http://chartapi.finance.yahoo.com/instrument/1.0/amzn/chartdata;type=quote;range=3y/csv. Depending on the plotting method you select, some data conversion is required. For instance, the data obtained from this resource includes dates in a plain yyyymmdd form that plotting routines do not recognize directly, as shown in the following listing:
uri:/instrument/1.0/amzn/chartdata;type=quote;range=3y/csv
ticker:amzn
Company-Name:Amazon.com, Inc.
Exchange-Name:NMS
unit:DAY
timestamp:
first-trade:19970516
last-trade:20150430
currency:USD
previous_close_price:231.9000
Date:20120501,20150430
labels:20120501,20120702,20121001,20130102,20130401,20130701,20131001,20140102,20140401,20140701,20141001,20150102,20150401
values:Date,close,high,low,open,volume
close:208.2200,445.1000
high:211.2300,452.6500
low:206.3700,439.0000
open:207.4000,443.8600
volume:984400,23856100
20120501,230.0400,232.9700,228.4000,229.4000,6754900
20120502,230.2500,231.4400,227.4000,227.8200,4593400
20120503,229.4500,232.5300,228.0300,229.7400,4055500
...
...
20150429,429.3700,434.2400,426.0300,426.7500,3613300
20150430,421.7800,431.7500,419.2400,427.1100,3609700
We will discuss three approaches to creating the plots. Each one has its own advantages and limitations.

In the first approach, with the matplotlib.cbook package and the pylab package, you can create a plot with the following lines of code:
from pylab import plotfile, show, gca
import matplotlib.cbook as cbook

fname = cbook.get_sample_data('/Users/MacBook/stocks/amzn.csv',
                              asfileobj=False)
plotfile(fname, ('date', 'high', 'low', 'close'), subplots=False)
show()
This will create a plot similar to the one shown in the following screenshot:
There is one additional programming effort required before attempting to plot using this approach: the date values have to be reformatted so that 20150430 is represented in the %d-%b-%Y form. With this approach, the plot can also be split into two, one showing the stock price and the other showing the volume, as shown in the following code:
from pylab import plotfile, show, gca
import matplotlib.cbook as cbook

fname = cbook.get_sample_data('/Users/MacBook/stocks/amzn.csv',
                              asfileobj=False)

# columns 0, 1, and 5 are date, close, and volume; draw volume as bars
plotfile(fname, (0, 1, 5), plotfuncs={5: 'bar'})
show()
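The date reformatting mentioned earlier — turning 20150430 into the %d-%b-%Y form — can be done with strptime/strftime in a one-off pass over the file, independent of the plotting code:

```python
from datetime import datetime

def reformat_date(yyyymmdd):
    """Convert a '20150430'-style string into '30-Apr-2015'."""
    return datetime.strptime(yyyymmdd, '%Y%m%d').strftime('%d-%b-%Y')

formatted = reformat_date('20150430')
# → '30-Apr-2015'
```

Applying this to the first field of every data row produces a CSV that plotfile() can parse as dates directly.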
The second approach is to use the matplotlib.mlab and matplotlib.finance subpackages. matplotlib.finance has convenient methods to fetch the stock data from http://ichart.finance.yahoo.com/table.csv?s=GOOG&a=04&b=12&c=2014&d=06&e=20&f=2015&g=d, and to just show a sample, here is a code snippet:
ticker = 'GOOG'

import matplotlib.finance as finance
import matplotlib.mlab as mlab
import datetime

startdate = datetime.date(2014, 4, 12)
today = enddate = datetime.date.today()

fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)
r = mlab.csv2rec(fh)
fh.close()
r.sort()

print r[:2]
[ (datetime.date(2014, 4, 14), 538.25, 544.09998, 529.56, 532.52002, 2568000, 532.52002)
 (datetime.date(2014, 4, 15), 536.82001, 538.45001, 518.46002, 536.44, 3844500, 536.44)]
When you attempt to plot a stock price comparison, it does not make sense to display the volume information, because the volumes differ for each stock ticker and the chart becomes too cluttered to read.
matplotlib already has a working example to plot the stock chart, which is elaborate enough and includes Relative Strength Indicator (RSI) and Moving Average Convergence/Divergence (MACD), and is available at http://matplotlib.org/examples/pylab_examples/finance_work2.html. For details on RSI and MACD, you can find many resources online, but there is one interesting explanation at http://easyforextrading.co/how-to-trade/indicators/.
In an attempt to use the existing code, modify it, and make it work for multiple charts, a function called plotTicker() was created. This helps in plotting each ticker within the same axis, as shown in the following code:
import datetime
import numpy as np
import matplotlib.finance as finance
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

startdate = datetime.date(2014, 4, 12)
today = enddate = datetime.date.today()

plt.rc('axes', grid=True)
plt.rc('grid', color='0.75', linestyle='-', linewidth=0.5)

rect = [0.4, 0.5, 0.8, 0.5]

fig = plt.figure(facecolor='white', figsize=(12, 11))
axescolor = '#f6f6f6'  # the axes background color

ax = fig.add_axes(rect, axisbg=axescolor)
ax.set_ylim(10, 800)

def plotTicker(ticker, startdate, enddate, fillcolor):
    """
    matplotlib.finance has fetch_historical_yahoo(), which fetches
    stock price data from http://ichart.yahoo.com/table.csv
    and stores it in a numpy record array with fields:
    date, open, high, low, close, volume, adj_close
    """
    fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)
    r = mlab.csv2rec(fh)
    fh.close()
    r.sort()

    # adjusted close removes the impact of splits and dividends
    prices = r.adj_close

    # plot the price data
    ax.plot(r.date, prices, color=fillcolor, lw=2, label=ticker)
    ax.legend(loc='upper right', shadow=True, fancybox=True)

    # set the labels rotation and alignment
    for label in ax.get_xticklabels():
        # display date labels slanted at 30 degrees
        label.set_rotation(30)
        label.set_horizontalalignment('right')

    ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')

# plot the tickers now
plotTicker('BIDU', startdate, enddate, 'red')
plotTicker('GOOG', startdate, enddate, '#1066ee')
plotTicker('AMZN', startdate, enddate, '#506612')

plt.show()
When you use this to compare the stock prices of Baidu, Google, and Amazon, the plot would look similar to the following screenshot:
Use the following code to compare the stock prices of Twitter, Facebook, and LinkedIn:
plotTicker('TWTR', startdate, enddate, '#c72020')
plotTicker('LNKD', startdate, enddate, '#103474')
plotTicker('FB', startdate, enddate, '#506612')
Now, you can add the volume plot as well. For a single ticker plot with volume, use the following code:
import datetime
import matplotlib.finance as finance
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

startdate = datetime.date(2013, 3, 1)
today = enddate = datetime.date.today()

rect = [0.1, 0.3, 0.8, 0.4]

fig = plt.figure(facecolor='white', figsize=(10, 9))
ax = fig.add_axes(rect, axisbg='#f6f6f6')

def plotSingleTickerWithVolume(ticker, startdate, enddate):
    global ax

    fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)

    # a numpy record array with fields:
    # date, open, high, low, close, volume, adj_close
    r = mlab.csv2rec(fh)
    fh.close()
    r.sort()

    plt.rc('axes', grid=True)
    plt.rc('grid', color='0.78', linestyle='-', linewidth=0.5)

    axt = ax.twinx()
    prices = r.adj_close
    fcolor = 'darkgoldenrod'

    ax.plot(r.date, prices, color='#1066ee', lw=2, label=ticker)
    ax.fill_between(r.date, prices, 0, facecolor='#BBD7E5')
    ax.set_ylim(0, 1.5 * prices.max())  # leave headroom above the price line
    ax.legend(loc='upper right', shadow=True, fancybox=True)

    volume = (r.close * r.volume) / 1e6  # dollar volume in millions
    vmax = volume.max()

    axt.fill_between(r.date, volume, 0, label='Volume',
                     facecolor=fcolor, edgecolor=fcolor)
    axt.set_ylim(0, 5 * vmax)
    axt.set_yticks([])

    for axis in ax, axt:
        for label in axis.get_xticklabels():
            label.set_rotation(30)
            label.set_horizontalalignment('right')
        axis.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')

plotSingleTickerWithVolume('MSFT', startdate, enddate)
plt.show()
With the single ticker plot along with volume and the preceding changes in the earlier code, the plot will look similar to the following screenshot:
You may also have the option of a third approach: using the blockspring package. In order to install blockspring, you have to use the following pip command:
pip install blockspring
Blockspring's approach is to generate HTML code. It autogenerates data for the plots in the JavaScript format; when this is integrated with D3.js, it provides a very nice interactive plot. Amazingly, there are only two lines of code:
import blockspring
import json

print blockspring.runParsed("stock-price-comparison", {
    "tickers": "FB, LNKD, TWTR",
    "start_date": "2014-01-01",
    "end_date": "2015-01-01"}).params
Depending on the operating system, when this code is run, it generates the HTML code in a default area.