Chapter 1. Tools of the Trade

This chapter gives you an overview of the tools available for data analysis in Python, with details of the Python packages and libraries that will be used in this book. It includes a few installation tips and concludes with a brief example. We will concentrate on how to read data files, select data, and produce simple plots, rather than delving into numerical data analysis.

Before you start

We assume that you are familiar with Python and have already developed and run some scripts or used Python interactively, either in the shell or through another interface, such as the Jupyter Notebook (formerly known as the IPython notebook). Hence, we also assume that you have a working installation of Python, version 3.4 or later.
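If you are unsure which interpreter version you are running, a short check along the following lines can help. This is only a minimal sketch: the helper name `python_is_recent_enough` and the `REQUIRED` constant are our own illustrative choices, not part of any library.

```python
import sys

# Minimum interpreter version assumed by the examples in this book.
REQUIRED = (3, 4)


def python_is_recent_enough(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(required)


if python_is_recent_enough():
    print("Python {}.{} is recent enough".format(*sys.version_info[:2]))
else:
    print("Python {}.{} or later is required".format(*REQUIRED))
```

Comparing tuples rather than version strings avoids the pitfall of "3.10" sorting before "3.4" in a plain string comparison.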

We also assume that you have developed your own workflow with Python, based on your needs and available environment. There are two alternatives to get started, as outlined in the following list:

Tip

Even if you have a working Python installation, you might want to try one of the prepackaged distributions. They contain a well-rounded collection of packages and modules suitable for data analysis and scientific computing. If you choose this path, all the libraries in the next list are included by default.

We also assume that you have the libraries in the following list:

  • numpy and scipy: These are available at http://www.scipy.org. They are the essential Python libraries for computational work. NumPy defines a fast and flexible array data structure, and SciPy has a large collection of functions for numerical computing. They are required by some of the other libraries in this list.
  • matplotlib: This is available at http://matplotlib.org. It is a library for interactive graphics built on top of NumPy. We recommend version 1.5 or later, which is what Anaconda Python includes by default.
  • pandas: This is available at http://pandas.pydata.org. It is a Python data analysis library. It will be used extensively throughout the book.
  • pymc: This is available at http://pymc-devs.github.io/pymc/. It is a library that makes Bayesian modeling and fitting in Python accessible and straightforward. This package is mainly used in Chapter 6, Bayesian Methods.
  • scikit-learn: This is available at http://scikit-learn.org. It is a library for machine learning in Python. This package is used in Chapter 7, Supervised and Unsupervised Learning.
  • IPython: This is available at http://ipython.org. It is a library providing enhanced tools for interactive computations in Python from the command line.
  • Jupyter: This is available at https://jupyter.org/. It is the notebook interface that works on top of IPython and also supports other programming languages. Originally part of the IPython project, the notebook interface is a web-based platform for computational and data science that allows easy integration of the tools that are used in this book.

Notice that each of the libraries in the preceding list may have several dependencies, which must also be separately installed. To test the availability of any of the packages, start a Python shell and run the corresponding import statement. For example, to test the availability of NumPy, run the following command:

import numpy

If NumPy is not installed on your system, this will produce an ImportError message. An alternative approach that does not require starting a Python shell is to run the following from the command line:

python -c 'import numpy'
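To test all of the libraries from the preceding list in one pass, the same idea can be wrapped in a short script. The following is a minimal sketch, not a definitive tool: the helper name `missing_packages` is our own, and the import names in `PACKAGES` are assumptions about how each library is imported (scikit-learn is imported as `sklearn`, and the `pymc` name may differ depending on the version you install).

```python
import importlib.util

# Assumed import names for the libraries listed in this chapter.
# Note that scikit-learn is imported as "sklearn".
PACKAGES = [
    "numpy", "scipy", "matplotlib", "pandas",
    "pymc", "sklearn", "IPython", "jupyter",
]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not actually import the module.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(PACKAGES)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All packages are available")
```

Because `importlib.util.find_spec` only locates a module without importing it, this check is quick, but it will not reveal a package that is installed yet broken; running the actual import statement remains the definitive test.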

We also assume that you have either a programmer's editor or Python IDE. There are several options, but at the basic level, any editor capable of working with unformatted text files will do.
