Chapter 3

Introduction to NumPy, Pandas, and Matplotlib

Learning Objectives

By the end of the chapter, you will be able to:

  • Create and manipulate one-dimensional and multi-dimensional arrays
  • Create and manipulate pandas DataFrames and series objects
  • Plot and visualize numerical data using the Matplotlib library
  • Apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame/matrix

In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries.

Introduction

In the preceding chapters, we covered some advanced data structures, such as stacks, queues, and iterators, as well as file operations in Python. In this chapter, we will cover three essential libraries: NumPy, pandas, and matplotlib.

NumPy Arrays

In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list or a multi-dimensional table or a matrix full of numbers.

The array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant.

Some example scenarios where you will need to handle numeric arrays are as follows:

  • To read a list of phone numbers and postal codes and extract a certain pattern
  • To create a matrix with random numbers to run a Monte Carlo simulation on some statistical process
  • To scale and normalize a sales figure table, with lots of financial and transactional data
  • To create a smaller table of key descriptive statistics (for example, mean, median, min/max range, variance, inter-quartile ranges) from a large raw data table
  • To read in and analyze time series data in a one-dimensional array, such as the daily stock price of an organization over a year or daily temperature data from a weather station

In short, arrays and numeric data tables are everywhere. As a data wrangling professional, you cannot overstate the importance of being able to read and process numeric arrays. In this regard, NumPy arrays will be the most important object in Python that you need to know about.

NumPy Array and Features

NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These have grown into highly mature libraries that provide functionality that meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.

One of the main advantages of the NumPy module is its ability to create and handle one-dimensional or multi-dimensional arrays. This advanced data structure/class is at the heart of the NumPy package, and it serves as the fundamental building block of more advanced classes such as the pandas Series and DataFrame, which we will cover shortly in this chapter.

NumPy arrays are different from common Python lists, although a Python list can be thought of as a simple array. NumPy arrays are built for vectorized operations that process a lot of numerical data with just a single line of code. Many built-in mathematical functions for NumPy arrays are written in low-level languages such as C or Fortran and are pre-compiled for really fast execution.

Note

NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.
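
As a minimal sketch of what "vectorized" means here, the following compares an explicit Python loop with a single NumPy expression. The data and the variable names (values, squares_list, squares_arr) are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical data for illustration only
values = [1.0, 2.0, 3.0, 4.0]

# Plain Python: an explicit loop (here, a list comprehension)
squares_list = [v ** 2 for v in values]

# NumPy: one vectorized expression, no explicit loop
values_arr = np.array(values)
squares_arr = values_arr ** 2

print(squares_list)          # [1.0, 4.0, 9.0, 16.0]
print(squares_arr.tolist())  # [1.0, 4.0, 9.0, 16.0]
```

Both approaches give the same numbers; the NumPy version pushes the loop down into pre-compiled code, which is where the speed advantage on large arrays comes from.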

Exercise 26: Creating a NumPy Array (from a List)

In this exercise, we will create a NumPy array from a list:

  1. To work with NumPy, we must import it. By convention, we give it a short name, np, while importing:

    import numpy as np

  2. Create a list with three elements, 1, 2, and 3:

    list_1 = [1,2,3]

  3. Use the array function to convert it into an array:

    array_1 = np.array(list_1)

    We just created a NumPy array object called array_1 from the regular Python list object, list_1.

  4. For comparison, create an array of the floating-point elements 1.2, 3.4, and 5.6 with Python's built-in array module (note that this is not a NumPy array):

    import array as arr

    a = arr.array('d', [1.2, 3.4, 5.6])

    print(a)

    The output is as follows:

    array('d', [1.2, 3.4, 5.6])

  5. Let's check the type of the newly created object by using the type function:

    type(array_1)

    The output is as follows:

    numpy.ndarray

  6. Use type on list_1:

    type(list_1)

    The output is as follows:

    list

So, this is indeed different from the regular list object.

Exercise 27: Adding Two NumPy Arrays

This simple exercise will demonstrate the addition of two NumPy arrays, and thereby show the key difference between a regular Python list/array and a NumPy array:

  1. Consider list_1 and array_1 from the preceding exercise. If you have changed the Jupyter notebook, you will have to declare them again.
  2. Use the + notation to add two list_1 objects and save the result in list_2:

    list_2 = list_1 + list_1

    print(list_2)

    The output is as follows:

    [1, 2, 3, 1, 2, 3]

  3. Use the same + notation to add two array_1 objects and save the result in array_2:

    array_2 = array_1 + array_1

    print(array_2)

    The output is as follows:

    [2 4 6]

Did you notice the difference? The first print shows a list with 6 elements [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2, 4, 6], which are just the sum of the individual elements of array_1.

NumPy arrays are like mathematical objects – vectors. They are built for element-wise operations, that is, when we add two NumPy arrays, we add the first element of the first array to the first element of the second array – there is an element-to-element correspondence in this operation. This is in contrast to Python lists, where the elements are simply appended and there is no element-to-element relation. This is the real power of a NumPy array: they can be treated just like mathematical vectors.

A vector is a collection of numbers that can represent, for example, the coordinates of a point in three-dimensional space or the color values (RGB) of a pixel in a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array maintains such order relationships. That's why NumPy arrays are perfectly suited to numerical computations.
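
To make the vector analogy concrete, here is a small sketch that treats two arrays as points in three-dimensional space and computes the distance between them. The points p and q are hypothetical values chosen for illustration:

```python
import numpy as np

# Two hypothetical points in three-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Element-wise difference, then the Euclidean distance
diff = q - p
distance = np.sqrt(np.sum(diff ** 2))

print(diff.tolist())  # [3.0, 4.0, 0.0]
print(distance)       # 5.0
```

The subtraction pairs up the x, y, and z coordinates by position, which is exactly the element-to-element correspondence described above.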

Exercise 28: Mathematical Operations on NumPy Arrays

Now that you know that these arrays are like vectors, we will try some mathematical operations on arrays.

NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:

  1. Multiply two arrays using the following command:

    print("array_1 multiplied by array_1: ",array_1*array_1)

    The output is as follows:

    array_1 multiplied by array_1: [1 4 9]

  2. Divide two arrays using the following command:

    print("array_1 divided by array_1: ",array_1/array_1)

    The output is as follows:

    array_1 divided by array_1: [1. 1. 1.]

  3. Raise one array to the power of the second array using the following command:

    print("array_1 raised to the power of array_1: ",array_1**array_1)

    The output is as follows:

    array_1 raised to the power of array_1: [ 1 4 27]

Exercise 29: Advanced Mathematical Operations on NumPy Arrays

NumPy has all the built-in mathematical functions that you can think of. Here, we are going to create a list and convert it into a NumPy array. Then, we will perform some advanced mathematical operations on that array:

  1. Create a list with five elements:

    list_5=[i for i in range(1,6)]

    print(list_5)

    The output is as follows:

    [1, 2, 3, 4, 5]

  2. Convert the list into a NumPy array by using the following command:

    array_5=np.array(list_5)

    array_5

    The output is as follows:

    array([1, 2, 3, 4, 5])

  3. Find the sine value of the array by using the following command:

    # sine function

    print("Sine: ",np.sin(array_5))

    The output is as follows:

    Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]

  4. Find the logarithmic value of the array by using the following command:

    # logarithm

    print("Natural logarithm: ",np.log(array_5))

    print("Base-10 logarithm: ",np.log10(array_5))

    print("Base-2 logarithm: ",np.log2(array_5))

    The output is as follows:

    Natural logarithm: [0. 0.69314718 1.09861229 1.38629436 1.60943791]

    Base-10 logarithm: [0. 0.30103 0.47712125 0.60205999 0.69897 ]

    Base-2 logarithm: [0. 1. 1.5849625 2. 2.32192809]

  5. Find the exponential value of the array by using the following command:

    # Exponential

    print("Exponential: ",np.exp(array_5))

    The output is as follows:

    Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]
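
As a quick sanity check on these functions, the following sketch verifies two familiar identities element by element. It assumes only the array_5 values from this exercise; np.cos and np.allclose are standard NumPy functions that have not been introduced in the text so far:

```python
import numpy as np

array_5 = np.arange(1, 6)  # same values as array_5 above: 1, 2, 3, 4, 5

# sin^2(x) + cos^2(x) should equal 1 for every element
identity = np.sin(array_5) ** 2 + np.cos(array_5) ** 2
print(np.allclose(identity, 1.0))        # True

# exp and log are inverses of each other
round_trip = np.log(np.exp(array_5))
print(np.allclose(round_trip, array_5))  # True
```

np.allclose is used instead of == because floating-point results are rarely exact to the last digit.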

Exercise 30: Generating Arrays Using arange and linspace

Generation of numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods.

The arange function creates a series of numbers from the start value, the stop value, and the step size you specify; the start value is included, but the stop value is not. Another function, linspace, creates a fixed number of evenly spaced points between two endpoints, both of which are included by default:

  1. Create a series of numbers using the arange method, by using the following command:

    print("A series of numbers:",np.arange(5,16))

    The output is as follows:

    A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]

  2. Print numbers using the arange function by using the following command:

    print("Numbers spaced apart by 2: ",np.arange(0,11,2))

    print("Numbers spaced apart by a floating point number: ",np.arange(0,11,2.5))

    print("Every 5th number from 30 in reverse order ",np.arange(30,-1,-5))

    The output is as follows:

    Numbers spaced apart by 2: [ 0 2 4 6 8 10]

    Numbers spaced apart by a floating point number: [ 0. 2.5 5. 7.5 10. ]

    Every 5th number from 30 in reverse order

    [30 25 20 15 10 5 0]

  3. For linearly spaced numbers, we can use the linspace method, as follows:

    print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))

    The output is as follows:

    11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]
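
The difference in how the two functions treat the upper bound is easy to miss, so here is a minimal side-by-side sketch using the same bounds for both (the variable names a and b are arbitrary):

```python
import numpy as np

# arange: the stop value (10) is excluded
a = np.arange(0, 10, 2)
# linspace: both endpoints are included by default
b = np.linspace(0, 10, 6)

print(a.tolist())  # [0, 2, 4, 6, 8]
print(b.tolist())  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

This is why the earlier examples pass 16 and 11 as the stop value to arange when the last number wanted is 15 or 10.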

Exercise 31: Creating Multi-Dimensional Arrays

So far, we have created only one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra). Just like we created the one-dimensional array from a simple flat list, we can create a two-dimensional array from a list of lists:

  1. Create a list of lists and convert it into a two-dimensional NumPy array by using the following command:

    list_2D = [[1,2,3],[4,5,6],[7,8,9]]

    mat1 = np.array(list_2D)

    print("Type/Class of this object:",type(mat1))

    print("Here is the matrix\n----------\n", mat1, "\n----------", sep='')

    The output is as follows:

    Type/Class of this object: <class 'numpy.ndarray'>

    Here is the matrix

    ----------

    [[1 2 3]

    [4 5 6]

    [7 8 9]]

    ----------

  2. A list of tuples can be converted into a multi-dimensional array by using the following code:

    tuple_2D = [(1.5,2,3), (4,5,6)]

    mat_tuple = np.array(tuple_2D)

    print(mat_tuple)

    The output is as follows:

    [[1.5 2. 3. ]

    [4. 5. 6. ]]

Thus, we have created multi-dimensional arrays using Python lists and tuples.

Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array

The following attributes let you check the dimension, shape, size, and data type of an array. Note that a 3x2 matrix, that is, one with 3 rows and 2 columns, has the shape (3,2), but its size is 6, as 6 = 3x2:

  1. Print the dimension of the matrix using ndim by using the following command:

    print("Dimension of this matrix: ",mat1.ndim,sep='')

    The output is as follows:

    Dimension of this matrix: 2

  2. Print the size using size:

    print("Size of this matrix: ", mat1.size,sep='')

    The output is as follows:

    Size of this matrix: 9

  3. Print the shape of the matrix using shape:

    print("Shape of this matrix: ", mat1.shape,sep='')

    The output is as follows:

    Shape of this matrix: (3, 3)

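  4. Print the data type using dtype (the integer width reported depends on your platform and NumPy version, so you may see int64 instead of int32):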

    print("Data type of this matrix: ", mat1.dtype,sep='')

    The output is as follows:

    Data type of this matrix: int32
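
The 3x2 case mentioned at the start of this exercise can be checked directly. This is a small sketch with a hypothetical matrix, mat, which also shows one way to request a specific integer width instead of relying on the platform default:

```python
import numpy as np

# A hypothetical 3x2 matrix: 3 rows, 2 columns
mat = np.array([[1, 2], [3, 4], [5, 6]])

print(mat.ndim)   # 2
print(mat.shape)  # (3, 2)
print(mat.size)   # 6

# Request the integer width explicitly if it matters for your task
mat64 = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)
print(mat64.dtype)  # int64
```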

Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors

Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will take a look at how to create special matrices easily. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones in the diagonal:

  1. Print the vector of zeros by using the following command:

    print("Vector of zeros: ",np.zeros(5))

    The output is as follows:

    Vector of zeros: [0. 0. 0. 0. 0.]

  2. Print the matrix of zeros by using the following command:

    print("Matrix of zeros: ",np.zeros((3,4)))

    The output is as follows:

    Matrix of zeros: [[0. 0. 0. 0.]

    [0. 0. 0. 0.]

    [0. 0. 0. 0.]]

  3. Print the matrix of fives by using the following command:

    print("Matrix of 5's: ",5*np.ones((3,3)))

    The output is as follows:

    Matrix of 5's: [[5. 5. 5.]

    [5. 5. 5.]

    [5. 5. 5.]]

  4. Print an identity matrix by using the following command:

    print("Identity matrix of dimension 2:",np.eye(2))

    The output is as follows:

    Identity matrix of dimension 2: [[1. 0.]

    [0. 1.]]

  5. Print an identity matrix with a dimension of 4x4 by using the following command:

    print("Identity matrix of dimension 4:",np.eye(4))

    The output is as follows:

    Identity matrix of dimension 4: [[1. 0. 0. 0.]

    [0. 1. 0. 0.]

    [0. 0. 1. 0.]

    [0. 0. 0. 1.]]

  6. Print a matrix of random integers with the shape (4,3) using the randint function (the low bound is included, the high bound is excluded):

    print("Random matrix of shape (4,3): ",np.random.randint(low=1,high=10,size=(4,3)))

    The sample output is as follows:

    Random matrix of shape (4,3):

    [[6 7 6]

    [5 6 7]

    [5 3 6]

    [2 9 4]]

    Note

    When creating matrices with functions such as zeros and ones, you need to pass the shape as a tuple of integers.

Random number generation is a very useful utility and needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and see how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
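
Because random output changes on every run, it is often useful to make it repeatable. The following is a minimal sketch, assuming the legacy np.random.seed function and an arbitrary seed value of 42; neither appears in the exercise above:

```python
import numpy as np

# Fixing the seed makes the pseudo-random sequence repeatable
np.random.seed(42)
first = np.random.randint(low=1, high=10, size=(4, 3))

np.random.seed(42)
second = np.random.randint(low=1, high=10, size=(4, 3))

print(np.array_equal(first, second))  # True

# high is exclusive, so every value lies between 1 and 9
print(bool(first.min() >= 1 and first.max() <= 9))  # True
```

Seeding is a convenience for tutorials and tests; it does not change the statistical properties of the numbers drawn.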

Exercise 34: Reshaping, Ravel, Min, Max, and Sorting

Reshaping an array is a very useful operation, as machine learning algorithms may demand input vectors in various formats for mathematical manipulation. In this section, we will look at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful action in many machine learning and data analytics tasks.

The following steps demonstrate reshaping. We will first generate a random one-dimensional vector of one- and two-digit numbers and then reshape it into multi-dimensional arrays:

  1. Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:

    a = np.random.randint(1,100,30)

    b = a.reshape(2,3,5)

    c = a.reshape(6,5)

  2. Print the shape using the shape function by using the following code:

    print ("Shape of a:", a.shape)

    print ("Shape of b:", b.shape)

    print ("Shape of c:", c.shape)

    The output is as follows:

    Shape of a: (30,)

    Shape of b: (2, 3, 5)

    Shape of c: (6, 5)

  3. Print the arrays a, b, and c using the following code:

    print(" a looks like ",a)

    print(" b looks like ",b)

    print(" c looks like ",c)

    The sample output is as follows:

    a looks like

    [ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57

    32 69 34 59 98 48]

    b looks like

    [[[ 7 82 9 29 50]

    [50 71 65 33 84]

    [55 78 40 68 50]]

    [[15 65 55 98 38]

    [23 75 50 57 32]

    [69 34 59 98 48]]]

    c looks like

    [[ 7 82 9 29 50]

    [50 71 65 33 84]

    [55 78 40 68 50]

    [15 65 55 98 38]

    [23 75 50 57 32]

    [69 34 59 98 48]]

    Note

    "b" is a three-dimensional array – a kind of list of a list of a list.

  4. Flatten the array b with ravel using the following code:

    b_flat = b.ravel()

    print(b_flat)

    The sample output is as follows:

    [ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57

    32 69 34 59 98 48]
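
The title of this exercise also mentions min, max, and sorting, which the steps above do not demonstrate. The following is a minimal sketch of those operations on a small, fixed array; the data is hypothetical so that the output is repeatable:

```python
import numpy as np

# Hypothetical fixed data, so the output is repeatable
a = np.array([7, 82, 9, 29, 50, 15])
m = a.reshape(2, 3)

print(a.min(), a.max())        # 7 82
print(a.argmin(), a.argmax())  # 0 1 (positions of the min and max)
print(np.sort(a).tolist())     # [7, 9, 15, 29, 50, 82]
print(m.max(axis=0).tolist())  # column-wise max: [29, 82, 15]
print(m.ravel().tolist())      # [7, 82, 9, 29, 50, 15]
```

np.sort returns a sorted copy and leaves a unchanged, whereas calling a.sort() would sort the array in place.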

Exercise 35: Indexing and Slicing

Indexing and slicing of NumPy arrays is very similar to regular list indexing. We can even step through a vector of elements with a definite step size by providing it as an additional argument in the format start:stop:step, where the stop index is excluded. Furthermore, we can pass a list of indices as the argument to select specific elements.

In this exercise, we will learn about indexing and slicing on one-dimensional and multi-dimensional arrays:

Note

In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).

  1. Create an array of 11 elements and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:

    array_1 = np.arange(0,11)

    print("Array:",array_1)

    The output is as follows:

    Array: [ 0 1 2 3 4 5 6 7 8 9 10]

  2. Print the element at index 7 by using the following command:

    print("Element at 7th index is:", array_1[7])

    The output is as follows:

    Element at 7th index is: 7

  3. Print the elements from index 3 to index 5 by using the following command:

    print("Elements from 3rd to 5th index are:", array_1[3:6])

    The output is as follows:

    Elements from 3rd to 5th index are: [3 4 5]

  4. Print the elements up to, but not including, index 4 by using the following command:

    print("Elements up to 4th index are:", array_1[:4])

    The output is as follows:

    Elements up to 4th index are: [0 1 2 3]

  5. Print the elements backwards by using the following command:

    print("Elements from last backwards are:", array_1[-1::-1])

    The output is as follows:

    Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]

  6. Print three elements from the end, moving backwards with a step size of 2, by using the following command:

    print("3 Elements from last backwards are:", array_1[-1:-6:-2])

    The output is as follows:

    3 Elements from last backwards are: [10 8 6]

  7. Create a new array called array_2 by using the following command:

    array_2 = np.arange(0,21,2)

    print("New array:",array_2)

    The output is as follows:

    New array: [ 0 2 4 6 8 10 12 14 16 18 20]

  8. Print the elements at indices 2, 4, and 9 of the array:

    print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])

    The output is as follows:

    Elements at 2nd, 4th, and 9th index are: [ 4 8 18]

  9. Create a multi-dimensional array by using the following command:

    matrix_1 = np.random.randint(10,100,15).reshape(3,5)

    print("Matrix of random 2-digit numbers ",matrix_1)

    The sample output is as follows:

    Matrix of random 2-digit numbers

    [[21 57 60 24 15]

    [53 20 44 72 68]

    [39 12 99 99 33]]

  10. Access the values using double bracket indexing by using the following command:

    print(" Double bracket indexing ")

    print("Element in row index 1 and column index 2:", matrix_1[1][2])

    The sample output is as follows:

    Double bracket indexing

    Element in row index 1 and column index 2: 44

  11. Access the values using single bracket indexing by using the following command:

    print(" Single bracket with comma indexing ")

    print("Element in row index 1 and column index 2:", matrix_1[1,2])

    The sample output is as follows:

    Single bracket with comma indexing

    Element in row index 1 and column index 2: 44

  12. Access the values in a multi-dimensional array using a row or column by using the following command:

    print(" Row or column extract ")

    print("Entire row at index 2:", matrix_1[2])

    print("Entire column at index 3:", matrix_1[:,3])

    The sample output is as follows:

    Row or column extract

    Entire row at index 2: [39 12 99 99 33]

    Entire column at index 3: [24 72 99]

  13. Print the matrix with the specified row and column indices by using the following command:

    print(" Subsetting sub-matrices ")

    print("Matrix with row indices 1 and 2 and column indices 3 and 4 ", matrix_1[1:3,3:5])

    The sample output is as follows:

    Subsetting sub-matrices

    Matrix with row indices 1 and 2 and column indices 3 and 4

    [[72 68]

    [99 33]]

  14. Print the sub-matrix formed by a slice of rows and an explicit list of column indices by using the following command:

    print("Matrix with row indices 0 and 1 and column indices 1 and 3 ", matrix_1[0:2,[1,3]])

    The sample output is as follows:

    Matrix with row indices 0 and 1 and column indices 1 and 3

    [[57 24]

    [20 72]]
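
Since the sample outputs in this exercise depend on random numbers, here is a compact recap on fixed data that you can check by hand. It is a sketch only; arr and mat are hypothetical arrays built with arange so that each element's value makes its position obvious:

```python
import numpy as np

arr = np.arange(0, 11)            # 0, 1, ..., 10

# Slice syntax is start:stop:step; the stop index is excluded
print(arr[1:8:3].tolist())        # [1, 4, 7]
print(arr[::-1][:3].tolist())     # [10, 9, 8]

# Two-dimensional case with a hypothetical 3x4 matrix
mat = np.arange(12).reshape(3, 4)
print(mat[1, 2])                  # 6
print(mat[:, 1].tolist())         # [1, 5, 9]
print(mat[0:2, [0, 3]].tolist())  # [[0, 3], [4, 7]]
```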

Conditional Subsetting

Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:

matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)

print("Matrix of random 2-digit numbers ",matrix_1)

print (" Elements greater than 50 ", matrix_1[matrix_1>50])

The sample output is as follows (note that the exact output will be different for you as it is random):

Matrix of random 2-digit numbers

[[71 89 66 99 54]

[28 17 66 35 85]

[82 35 38 15 47]]

Elements greater than 50

[71 89 66 99 54 66 85 82]
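
Under the hood, the expression matrix_1>50 produces a Boolean array of the same shape, and indexing with it keeps only the elements where the value is True. The following sketch shows that intermediate step on a small, hypothetical matrix m, and also how two conditions can be combined:

```python
import numpy as np

# Hypothetical fixed matrix
m = np.array([[71, 28, 82],
              [17, 66, 35]])

mask = m > 50             # Boolean array of the same shape as m
print(mask.tolist())      # [[True, False, True], [False, True, False]]
print(m[mask].tolist())   # [71, 82, 66]

# Combine conditions with & (and) or | (or); the parentheses are required
print(m[(m > 20) & (m < 70)].tolist())  # [28, 66, 35]
```

Note that the result is always a flat, one-dimensional array, because the selected elements generally do not form a rectangle.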

Exercise 36: Array Operations (array-array, array-scalar, and universal functions)

NumPy arrays can be used like mathematical matrices, but note that the arithmetic operators, including *, work element-wise rather than as matrix operations.

Create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, subtraction, multiplication, and division. Show the exponentiation (raising a number to a certain power) operation, as follows:

Note

Due to random number generation, your specific output could be different to what is shown here.

  1. Create two matrices:

    matrix_1 = np.random.randint(1,10,9).reshape(3,3)

    matrix_2 = np.random.randint(1,10,9).reshape(3,3)

    print(" 1st Matrix of random single-digit numbers ",matrix_1)

    print(" 2nd Matrix of random single-digit numbers ",matrix_2)

    The sample output is as follows (note that the exact output will be different for you as it is random):

    1st Matrix of random single-digit numbers

    [[6 5 9]

    [4 7 1]

    [3 2 7]]

    2nd Matrix of random single-digit numbers

    [[2 3 1]

    [9 9 9]

    [9 9 6]]

  2. Perform addition, multiplication, division, and a linear combination on the matrices:

    print(" Addition ", matrix_1+matrix_2)

    print(" Multiplication ", matrix_1*matrix_2)

    print(" Division ", matrix_1/matrix_2)

    print(" Linear combination: 3*A - 2*B ", 3*matrix_1-2*matrix_2)

    The sample output is as follows (note that the exact output will be different for you as it is random):

    Addition

    [[ 8 8 10]

    [13 16 10]

    [12 11 13]]

    Multiplication

    [[12 15 9]

    [36 63 9]

    [27 18 42]]

    Division

    [[3. 1.66666667 9. ]

    [0.44444444 0.77777778 0.11111111]

    [0.33333333 0.22222222 1.16666667]]

    Linear combination: 3*A - 2*B

    [[ 14 9 25]

    [ -6 3 -15]

    [ -9 -12 9]]

  3. Add a scalar to the matrix, cube the matrix element-wise, and take the element-wise square root:

    print(" Addition of a scalar (100) ", 100+matrix_1)

    print(" Exponentiation, matrix cubed here ", matrix_1**3)

    print(" Exponentiation, square root using 'pow' function ",pow(matrix_1,0.5))

    The sample output is as follows (note that the exact output will be different for you as it is random):

    Addition of a scalar (100)

    [[106 105 109]

    [104 107 101]

    [103 102 107]]

    Exponentiation, matrix cubed here

    [[216 125 729]

    [ 64 343 1]

    [ 27 8 343]]

    Exponentiation, square root using 'pow' function

    [[2.44948974 2.23606798 3. ]

    [2. 2.64575131 1. ]

    [1.73205081 1.41421356 2.64575131]]
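
Because every operator in this exercise works element by element, it is worth seeing how that differs from the matrix product of linear algebra. The sketch below uses two small, hypothetical matrices, A and B; the @ operator and the np.sqrt universal function are standard NumPy features that the steps above do not use:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print((A * B).tolist())  # element-wise: [[5, 12], [21, 32]]
print((A @ B).tolist())  # matrix product: [[19, 22], [43, 50]]

# np.sqrt is a universal function: it is applied to every element
print(np.sqrt(np.array([1, 4, 9])).tolist())  # [1.0, 2.0, 3.0]
```

np.sqrt(matrix_1) gives the same result as pow(matrix_1, 0.5) in step 3 and is usually easier to read.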

Stacking Arrays

Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:

a = np.array([[1,2],[3,4]])

b = np.array([[5,6],[7,8]])

print("Matrix a ",a)

print("Matrix b ",b)

print("Vertical stacking ",np.vstack((a,b)))

print("Horizontal stacking ",np.hstack((a,b)))

The output is as follows:

Matrix a

[[1 2]

[3 4]]

Matrix b

[[5 6]

[7 8]]

Vertical stacking

[[1 2]

[3 4]

[5 6]

[7 8]]

Horizontal stacking

[[1 2 5 6]

[3 4 7 8]]
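
Both stacking functions can also be expressed with the more general np.concatenate function by choosing an axis. This is a sketch of that equivalence for two-dimensional arrays, reusing the a and b matrices from above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

v = np.concatenate((a, b), axis=0)  # rows appended: same as np.vstack((a, b))
h = np.concatenate((a, b), axis=1)  # columns appended: same as np.hstack((a, b))

print(v.shape, h.shape)  # (4, 2) (2, 4)
print(np.array_equal(v, np.vstack((a, b))))  # True
print(np.array_equal(h, np.hstack((a, b))))  # True
```

Vertical stacking requires the arrays to have the same number of columns, and horizontal stacking requires the same number of rows.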

NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of that is directly useful for beginner level data wrangling, so we won't cover it here.

Pandas DataFrames

The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.

The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

Exercise 37: Creating a Pandas Series

In this exercise, we will learn about how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series:

  1. Initialize labels, lists, and a dictionary:

    labels = ['a','b','c']

    my_data = [10,20,30]

    array_1 = np.array(my_data)

    d = {'a':10,'b':20,'c':30}

    print ("Labels:", labels)

    print("My data:", my_data)

    print("Dictionary:", d)

    The output is as follows:

    Labels: ['a', 'b', 'c']

    My data: [10, 20, 30]

    Dictionary: {'a': 10, 'b': 20, 'c': 30}

  2. Import pandas as pd by using the following command:

    import pandas as pd

  3. Create a series from the my_data list by using the following command:

    series_1=pd.Series(data=my_data)

    print(series_1)

    The output is as follows:

    0 10

    1 20

    2 30

    dtype: int64

  4. Create a series from the my_data list along with the labels as follows:

    series_2=pd.Series(data=my_data, index = labels)

    print(series_2)

    The output is as follows:

    a 10

    b 20

    c 30

    dtype: int64

  5. Then, create a series from the NumPy array, as follows:

    series_3=pd.Series(array_1,labels)

    print(series_3)

    The output is as follows:

    a 10

    b 20

    c 30

    dtype: int32

  6. Create a series from the dictionary, as follows:

    series_4=pd.Series(d)

    print(series_4)

    The output is as follows:

    a 10

    b 20

    c 30

    dtype: int64
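
The index labels are more than decoration: they are used to look up values and to align data during arithmetic. The following is a minimal sketch of both behaviors; the second series, t, is hypothetical and deliberately lists the same labels in a different order:

```python
import pandas as pd

s = pd.Series(data=[10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])            # 20 (lookup by label)
print(s.iloc[0])         # 10 (lookup by position)
print(s.index.tolist())  # ['a', 'b', 'c']

# Arithmetic aligns on the index labels, not on position
t = pd.Series(data=[1, 2, 3], index=['c', 'b', 'a'])
total = s + t
print(total['a'], total['b'], total['c'])  # 13 22 31
```

With a NumPy array, the same addition would pair the elements by position and give 11, 22, and 33 instead.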

Exercise 38: Pandas Series and Data Handling

The pandas series object can hold many types of data. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity:

  1. Create a pandas series with numerical data by using the following command:

    print (" Holding numerical data ",'-'*25, sep='')

    print(pd.Series(array_1))

    The output is as follows:

    Holding numerical data

    -------------------------

    0 10

    1 20

    2 30

    dtype: int32

  2. Create a pandas series with labels by using the following command:

    print (" Holding text labels ",'-'*20, sep='')

    print(pd.Series(labels))

    The output is as follows:

    Holding text labels

    --------------------

    0 a

    1 b

    2 c

    dtype: object

  3. Create a pandas series with functions by using the following command:

    print (" Holding functions ",'-'*20, sep='')

    print(pd.Series(data=[sum,print,len]))

    The output is as follows:

    Holding functions

    --------------------

    0 <built-in function sum>

    1 <built-in function print>

    2 <built-in function len>

    dtype: object

  4. Create a pandas series holding the methods of a dictionary by using the following command:

    print (" Holding objects from a dictionary ",'-'*40, sep='')

    print(pd.Series(data=[d.keys, d.items, d.values]))

    The output is as follows:

    Holding objects from a dictionary

    ----------------------------------------

    0 <built-in method keys of dict object at 0x0000...

    1 <built-in method items of dict object at 0x000...

    2 <built-in method values of dict object at 0x00...

    dtype: object

Exercise 39: Creating Pandas DataFrames

The pandas DataFrame is similar to an Excel table or relational database (SQL) table that consists of three main components: the data, the index (or rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here:

  1. Create a simple DataFrame from a two-dimensional matrix of numbers. First, the code draws 20 random integers from the uniform distribution. Then, we need to reshape it into a (5,4) NumPy array – 5 rows and 4 columns:

    matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

  2. Define the row labels as ('A','B','C','D','E') and the column labels as ('W','X','Y','Z'), and then pass them, together with the data, to the pd.DataFrame function, which creates the DataFrame:

    row_labels = ['A','B','C','D','E']

    column_headings = ['W','X','Y','Z']

    df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)

  3. Print the DataFrame by using the following command:

    print(" The data frame looks like ",'-'*45, sep='')

    print(df)

    The sample output is as follows:

    The data frame looks like

    ---------------------------------------------

    W X Y Z

    A 6 3 3 3

    B 1 9 9 4

    C 4 3 6 9

    D 4 8 6 7

    E 6 6 9 1

  4. Create a DataFrame from a Python dictionary of some lists of integers by using the following command:

    d={'a':[10,20],'b':[30,40],'c':[50,60]}

  5. Pass this dictionary as the data argument to the pd.DataFrame function, along with a list of row labels as the index argument. Notice how the dictionary keys became the column names and that the values were distributed among multiple rows:

    df2=pd.DataFrame(data=d,index=['X','Y'])

    print(df2)

    The output is as follows:

    a b c

    X 10 30 50

    Y 20 40 60

    Note

    The most common way that you will encounter to create a pandas DataFrame will be to read tabular data from a file on your local disk or over the internet – CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.
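
To see the "stack of series on top of NumPy arrays" structure for yourself, you can inspect a small DataFrame built from fixed data. This is a sketch only; df_demo and its contents are hypothetical and are used instead of random numbers so that the output is repeatable:

```python
import numpy as np
import pandas as pd

# Fixed hypothetical data instead of random numbers
data = np.arange(1, 7).reshape(3, 2)
df_demo = pd.DataFrame(data=data, index=['A', 'B', 'C'], columns=['W', 'X'])

print(df_demo.shape)                # (3, 2)
print(type(df_demo['W']).__name__)  # Series
print(df_demo['W'].tolist())        # [1, 3, 5]
print(df_demo.values.tolist())      # [[1, 2], [3, 4], [5, 6]]
```

Each column comes back as a Series, and the values attribute exposes the underlying data as a NumPy array.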

Exercise 40: Viewing a DataFrame Partially

In the previous section, we used print(df) to print the whole DataFrame. For a large dataset, we would like to print only sections of data. In this exercise, we will read a part of the DataFrame:

  1. Execute the following code to create a DataFrame with 25 rows and fill it with random numbers:

    # 25 rows and 4 columns

    matrix_data = np.random.randint(1,100,100).reshape(25,4)

    column_headings = ['W','X','Y','Z']

    df = pd.DataFrame(data=matrix_data,columns=column_headings)

  2. Run the following code to view only the first five rows of the DataFrame:

    df.head()

    The sample output is as follows (note that your output could be different due to randomness):

    Figure 3.1: First five rows of the DataFrame

    By default, head shows only the first five rows. If you want to see a specific number of rows, just pass that number as an argument.

  3. Print the first eight rows by using the following command:

    df.head(8)

    The sample output is as follows:

    Figure 3.2: First eight rows of the DataFrame

    Just like head shows the first few rows, tail shows the last few rows.

  4. Print the DataFrame using the tail command, as follows:

    df.tail(10)

    The sample output is as follows:

Figure 3.3: Last ten rows of the DataFrame
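Because the exercise uses random numbers, it can be hard to tell which rows head and tail actually return. The following minimal sketch uses sequential values instead (an assumption made purely for illustration), so that the returned row positions are easy to verify.

```python
import numpy as np
import pandas as pd

# Sequential values make it easy to see which rows come back
df = pd.DataFrame(np.arange(100).reshape(25, 4), columns=['W', 'X', 'Y', 'Z'])

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 23, 24

print(first_three)
print(last_two)
print(len(df.head()))      # 5, the default number of rows
```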

Indexing and Slicing Columns

There are two methods for indexing and slicing columns from a DataFrame. They are as follows:

  • DOT method
  • Bracket method

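The DOT method (for example, df.X) is a convenient shorthand for picking out a specific column whose name is a valid Python identifier. The bracket method (for example, df['X']) is intuitive and easy to follow, and it works for any column name. In this method, you access the data by the name (header) of the column.

The following code illustrates the bracket method. Execute it in your Jupyter notebook: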

print("\nThe 'X' column\n",'-'*25, sep='')

print(df['X'])

print("\nType of the column: ", type(df['X']), sep='')

print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')

print(df[['X','Z']])

print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')

The output is as follows (a screenshot is shown here because the actual column is long):

Figure 3.4: Rows of the 'X' columns

This is the output showing the type of column:

Figure 3.5: Type of 'X' column

This is the output showing the X and Z columns indexed by passing a list:

Figure 3.6: Rows of the 'X' and 'Z' columns

This is the output showing the type of the pair of columns:

Figure 3.7: Type of the 'X' and 'Z' columns

Note

For more than one column, the object turns into a DataFrame. But for a single column, it is a pandas series object.
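The difference between the two column-access methods can be sketched as follows. The small DataFrame and the column name 'New (Sum)' are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'New (Sum)': [4, 6]})

# DOT method: works when the column name is a valid Python identifier
col_dot = df.X

# Bracket method: works for any column name, including names with spaces
col_bracket = df['X']
col_spaces = df['New (Sum)']

print(col_dot.equals(col_bracket))               # True
print(isinstance(col_dot, pd.Series))            # True: a single column is a Series
print(isinstance(df[['W', 'X']], pd.DataFrame))  # True: a list of columns gives a DataFrame
```

A column such as 'New (Sum)' cannot be reached with the DOT method at all, which is one reason the bracket method is the safer default.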

Indexing and Slicing Rows

Indexing and slicing rows in a DataFrame can be done using the following methods:

  • Label-based 'loc' method
  • Index-based 'iloc' method

The loc method is intuitive and easy to follow. In this method, you access the data by the label of the row. On the other hand, the iloc method allows you to access the rows by their numerical position. It can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates both loc and iloc:

matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

row_labels = ['A','B','C','D','E']

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels,

columns=column_headings)

print("\nLabel-based 'loc' method for selecting row(s)\n",'-'*60, sep='')

print("\nSingle row\n")

print(df.loc['C'])

print("\nMultiple rows\n")

print(df.loc[['B','C']])

print("\nIndex position based 'iloc' method for selecting row(s)\n",'-'*70, sep='')

print("\nSingle row\n")

print(df.iloc[2])

print("\nMultiple rows\n")

print(df.iloc[[1,2]])

The sample output is as follows:

Figure 3.8: Output of the loc and iloc methods
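The loop-with-a-counter use of iloc mentioned earlier can be sketched like this. Sequential values are used instead of random ones (an assumption for illustration) so that the results are predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

# Iterate over the rows by position with a numerical counter
row_sums = []
for i in range(len(df)):
    row = df.iloc[i]               # i-th row as a Series
    row_sums.append(int(row.sum()))

print(row_sums)                    # [6, 22, 38, 54, 70]

# The same row reached by label and by position
print(df.loc['C'].equals(df.iloc[2]))   # True
```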

Exercise 41: Creating and Deleting a New Column or Row

One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We show some of these concepts in the following code blocks:

  1. Create a new column using the following snippet:

    print("\nA column is created by assigning it in relation\n",'-'*75, sep='')

    df['New'] = df['X']+df['Z']

    df['New (Sum of X and Z)'] = df['X']+df['Z']

    print(df)

    The sample output is as follows:

    Figure 3.9: Output after adding a new column
  2. Drop a column using the df.drop method:

    print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')

    df = df.drop('New', axis=1) # Notice the axis=1 option: axis=0 (rows) is the default, so it has to be changed to 1 to drop a column

    print(df)

    The sample output is as follows:

    Figure 3.10: Output after dropping a column
  3. Drop a specific row using the df.drop method:

    df1=df.drop('A')

    print("\nA row is dropped by using df.drop method and axis=0\n",'-'*65, sep='')

    print(df1)

    The sample output is as follows:

    Figure 3.11: Output after dropping a row

    The drop method returns a modified copy of the DataFrame and does not change the original DataFrame.

  4. Change the original DataFrame by setting the inplace argument to True:

    print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')

    df.drop('New (Sum of X and Z)', axis=1, inplace=True)

    print(df)

    A sample output is as follows:

Figure 3.12: Output after using the inplace argument

Note

By default, methods such as drop are not in-place; that is, they do not modify the original DataFrame object but return a copy of it with the addition (or deletion) applied. The last bit of code shows how to make the change in the existing DataFrame with the inplace=True argument. Please note that an in-place change cannot be undone unless you have kept a copy of the original, so it should be used with caution.
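The difference between the default behavior and inplace=True can be checked with a minimal sketch. The two-row DataFrame here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2], 'Z': [3, 4]}, index=['A', 'B'])
df['New'] = df['X'] + df['Z']

# Default behavior: drop returns a new DataFrame and df is unchanged
smaller = df.drop('New', axis=1)
print(list(df.columns))        # ['X', 'Z', 'New']
print(list(smaller.columns))   # ['X', 'Z']

# inplace=True modifies df itself and returns None
result = df.drop('New', axis=1, inplace=True)
print(result)                  # None
print(list(df.columns))        # ['X', 'Z']
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would replace the DataFrame with None, which is a common mistake worth watching for.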

Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, so we don't have to search for or write new code. Furthermore, most of these subroutines are written in C or Fortran (and pre-compiled), making them extremely fast to execute.

Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)

For any data wrangling task, it is quite useful to extract basic descriptive statistics from the data and create some simple visualizations/plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated.

As the inferential statistics are intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.

There are two broad approaches for descriptive statistical analysis:

  • Graphical techniques: Bar plots, scatter plots, line charts, box plots, histograms, and so on
  • Calculation of central tendency and spread: Mean, median, mode, variance, standard deviation, range, and so on

In this topic, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package, matplotlib, which is one of the most widely used and versatile visualization libraries in Python.

Exercise 42: Introduction to Matplotlib Through a Scatter Plot

In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from some data about the age, weight, and height of a few people:

  1. First, we define simple lists of names, age, weight (in kgs), and height (in centimeters):

    people = ['Ann','Brandon','Chen','David','Emily','Farook',

    'Gagan','Hamish','Imran','Joseph','Katherine','Lily']

    age = [21,12,32,45,37,18,28,52,5,40,48,15]

    weight = [55,35,77,68,70,60,72,69,18,65,82,48]

    height = [160,135,170,165,173,168,175,159,105,171,155,158]

  2. Import the most important module from matplotlib, called pyplot:

    import matplotlib.pyplot as plt

  3. Create simple scatter plots of age versus weight:

    plt.scatter(age,weight)

    plt.show()

    The output is as follows:

    Figure 3.13: A screenshot of a scatter plot containing age and weight

    The plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding X-axis and Y-axis labels with a customized font size, adding grid lines, changing the Y-axis limit to be between 0 and 100, adding X and Y-tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.

  4. The code for the improved plot is as follows:

    plt.figure(figsize=(8,6))

    plt.title("Plot of Age vs. Weight (in kgs)",fontsize=20)

    plt.xlabel("Age (years)",fontsize=16)

    plt.ylabel("Weight (kgs)",fontsize=16)

    plt.grid(True)

    plt.ylim(0,100)

    plt.xticks([i*5 for i in range(12)],fontsize=15)

    plt.yticks(fontsize=15)

    plt.scatter(x=age,y=weight,c='orange',s=150,edgecolors='k')

    plt.text(x=20,y=85,s="Weights after 18-20 years of age",fontsize=15)

    plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',color='blue',lw=3)

    plt.legend(['Weight in kgs'],loc=2,fontsize=12)

    plt.show()

    The output is as follows:

Figure 3.14: A screenshot of a scatter plot showing age versus weight

Observe the following:

  • A tuple (8,6) is passed as an argument for the figure size.
  • A list comprehension is used inside plt.xticks to create a customized list of tick positions: 0, 5, 10, …, 55.
  • A newline (\n) character can be used inside the string passed to the plt.text() function to break up and distribute the text over two lines.
  • The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function. The plot will not be displayed without this last function call.
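The observations above can be tried out in isolation with a small sketch. The three data points are a subset of the exercise data, and the position of the line break in the annotation is chosen here only for illustration.

```python
import matplotlib.pyplot as plt

ticks = [i * 5 for i in range(12)]   # 0, 5, 10, ..., 55
print(ticks)

plt.figure(figsize=(8, 6))           # width and height in inches, passed as a tuple
plt.scatter([21, 12, 32], [55, 35, 77], c='orange', s=150, edgecolors='k')
plt.xticks(ticks, fontsize=15)
# "\n" inside the string splits the annotation over two lines
plt.text(x=20, y=85, s="Weights after\n18-20 years of age", fontsize=15)
plt.show()
```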

Definition of Statistical Measures – Central Tendency and Spread

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. Such measures are also classed as summary statistics:

  • Mean: The mean is the sum of all values divided by the total number of values.
  • Median: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it. With an even number of values, the median is the average of the two middle values.
  • Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data and median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode:

Figure 3.15: A screenshot of a curve showing the mean, median, and mode

The spread of the data is a measure of how much the values in the dataset are likely to differ from the mean of the values. If all the values are close together, then the spread is low; on the other hand, if some or all of the values differ by a large amount from the mean (and each other), then there is a large spread in the data:

  • Variance: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.
  • Standard Deviation: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data (for example, kgs squared instead of kgs). Taking the positive square root of the variance gives the standard deviation, which is expressed in the original unit again.
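The definitions above translate almost word for word into NumPy. The following sketch uses the weight list from Exercise 42; the shoe_sizes series is made-up data used only to show the mode. The variance is computed as the plain average of squared deviations, exactly as defined above.

```python
import numpy as np
import pandas as pd

weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]  # kgs, from Exercise 42
shoe_sizes = pd.Series([7, 8, 8, 9, 8, 10])                # made-up data for the mode

mean = np.mean(weight)
median = np.median(weight)
mode = shoe_sizes.mode()[0]       # most frequent value

# Variance as defined above: average of squared deviations from the mean
deviations = np.array(weight) - mean
variance = np.mean(deviations ** 2)
std_dev = np.sqrt(variance)       # back in the original unit (kgs)

print(round(mean, 2), median, mode)           # 59.92 66.5 8
print(np.isclose(variance, np.var(weight)))   # True
print(np.isclose(std_dev, np.std(weight)))    # True
```

Note how the single low value (18) pulls the mean below the median, which is the kind of skew for which the median is the better measure.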

Random Variables and Probability Distribution

A random variable is a variable whose value represents the outcome of a statistical experiment or process.

Although it sounds very formal, pretty much everything around us that we can measure can be thought of as a random variable.

The reason behind this is that almost all natural, social, biological, and physical processes are the final outcome of a large number of complex processes, and we cannot know the details of those fundamental processes. All we can do is observe and measure the final outcome.

Typical examples of random variables that are around us are as follows:

  • The economic output of a nation
  • The blood pressure of a patient
  • The temperature of a chemical process in a factory
  • The number of friends a person has on Facebook
  • The stock market price of a company

These values can take any discrete or continuous value and they follow a particular pattern (although the pattern may vary over time). Therefore, they can all be classified as random variables.

What Is a Probability Distribution?

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of a variable vary based on the underlying probability distribution.

Suppose you go to a school and measure the heights of students who have been selected randomly. Height is an example of a random variable here. As you measure height, you can create a distribution of height. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

The concepts of central tendency and spread are applicable to a distribution and are used to describe the properties and behavior of a distribution.

Statisticians generally divide all distributions into two broad categories:

  • Discrete distributions
  • Continuous distributions

Discrete Distributions

Discrete probability functions are also known as probability mass functions. They describe variables that can assume only a countable number of distinct values. For example, coin tosses and counts of events are discrete. You can have only heads or tails in a coin toss. Similarly, if you're counting the number of trains that arrive at a station per hour, you can count 11 or 12 trains, but nothing in-between.

Some prominent discrete distributions are as follows:

  • Binomial distribution to model binary data, such as coin tosses
  • Poisson distribution to model count data, such as the count of library book checkouts per hour
  • Uniform distribution to model multiple events with the same probability, such as rolling a die
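Each of the three discrete distributions listed above has a corresponding NumPy generator. The following is a minimal sketch; the seed and the Poisson rate of 4 checkouts per hour are arbitrary assumptions made for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed (an assumption) for repeatable output

coin = np.random.binomial(n=1, p=0.5, size=10)   # 1 = heads, 0 = tails
checkouts = np.random.poisson(lam=4, size=10)    # counts per hour, assumed mean rate of 4
die = np.random.randint(1, 7, size=10)           # faces 1 to 6, upper bound is exclusive

print(coin)
print(checkouts)
print(die)
```

All three arrays contain whole numbers only, which is what makes these distributions discrete.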

Continuous Distributions

Continuous probability functions are also known as probability density functions. You have a continuous distribution if the variable can assume an infinite number of values between any two values. Continuous variables are often measurements on a real number scale, such as height, weight, and temperature.

The most well-known continuous distribution is the normal distribution, which is also known as the Gaussian distribution or the bell curve. This symmetric distribution fits a wide variety of phenomena, such as human height and IQ scores.

The normal distribution is linked to the famous 68-95-99.7 rule, which describes the percentage of data that falls within 1, 2, or 3 standard deviations of the mean if the data follows a normal distribution. This means that you can quickly look at some sample data, calculate the mean and standard deviation, and have reasonable confidence that, as long as the underlying process does not change, future incoming data will fall within those boundaries about 68%, 95%, and 99.7% of the time. This rule is widely used in industry, medicine, economics, and social science:

Figure 3.16: Curve showing the normal distribution of the famous 68-95-99.7 rule
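The rule can be checked empirically by drawing a large sample from a normal distribution and counting how many values fall within each band. This is a rough sketch; the seed and the sample size are arbitrary choices, and the printed fractions will only be approximately 0.68, 0.95, and 0.997.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed (an assumption) for repeatable output
samples = np.random.normal(loc=0.0, scale=1.0, size=100_000)

mu, sigma = samples.mean(), samples.std()

fractions = []
for k in (1, 2, 3):
    # Fraction of samples within k standard deviations of the mean
    within = np.mean(np.abs(samples - mu) <= k * sigma)
    fractions.append(within)
    print(f"within {k} standard deviation(s): {within:.3f}")
```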

Data Wrangling in Statistics and Visualization

A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, due to a multitude of complex sub-processes and mutual interactions that give rise to such data, they all fall into the category of discrete or continuous random variables.

It will be extremely difficult and confusing to the data wrangler or data science team if all of this data continues to be treated as completely random and without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.

Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of the pattern of the data.

Therefore, as the front line of a data science pipeline, data wrangling must deal with measuring and quantifying such descriptive statistics of the incoming data. Along with the formatted and cleaned-up data, the primary job of a data wrangler is to hand over these measures (and sometimes accompanying plots) to the next team member of analytics.

Plotting and visualization also help a data wrangling team identify potential outliers and misfits in the incoming data stream and help them to take appropriate action. We will see some examples of such tasks in the next chapter, where we will identify odd data points by creating scatter plots or histograms and either impute or omit the data point.

Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame

Now that we have some basic knowledge of NumPy, pandas, and matplotlib under our belt, we can explore a few additional topics related to these libraries, such as how we can bring them together for advanced data generation, analysis, and visualization.

Random Number Generation Using NumPy

NumPy offers a dizzying array of random number generation utility functions, which correspond to various statistical distributions, such as uniform, binomial, Gaussian (normal), Beta/Gamma, and chi-square. Most of these functions are extremely useful and appear countless times in advanced statistical data mining and machine learning tasks. Having a solid knowledge of them is strongly encouraged for all readers of this book.

Here, we will discuss three of the most important distributions that may come in handy for data wrangling tasks: uniform, binomial, and Gaussian (normal). The goal here is to show examples of simple function calls that can generate one or more random numbers/arrays whenever the user needs them.

Note

The results will be different for each student when they use these functions as they are supposed to be random.
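If you need the same "random" numbers on every run, for example to compare your output with someone else's, you can fix the seed of the generator first. The following is a minimal sketch; the seed value 101 is an arbitrary choice.

```python
import numpy as np

np.random.seed(101)                 # 101 is an arbitrary choice
first = np.random.randint(1, 10, size=5)

np.random.seed(101)                 # same seed, same sequence
second = np.random.randint(1, 10, size=5)

print(first)
print(second)
print(np.array_equal(first, second))   # True
```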

Exercise 43: Generating Random Numbers from a Uniform Distribution

In this exercise, we will be generating random numbers from a uniform distribution:

  1. Generate a random integer between 1 and 9 (randint excludes the upper bound, 10):

    x = np.random.randint(1,10)

    print(x)

    The sample output is as follows (your output could be different):

    1

  2. Generate a random integer in the same range (1 to 9), but with size=1 as an argument. It generates a NumPy array of size 1:

    x = np.random.randint(1,10,size=1)

    print(x)

    The sample output is as follows (your output could be different due to random draw):

    [8]

    In the same way, we can easily write the code to generate the outcomes of a die being thrown (a normal 6-sided die) for 10 trials.

    How about moving away from integers and generating some real numbers? Let's say that we want to generate artificial data for the weights (in kgs) of 15 adults, measured accurately to two decimal places.

  3. Generate decimal data using the following command:

    x = 50+50*np.random.random(size=15)

    x = x.round(decimals=2)

    print(x)

    The sample output is as follows:

    [56.24 94.67 50.66 94.36 77.37 53.81 61.47 71.13 59.3 65.3 63.02 65.

    58.21 81.21 91.62]

    We are not restricted to one-dimensional arrays.

  4. Generate and show a 3x3 matrix with random numbers between 0 and 1:

    x = np.random.rand(3,3)

    print(x)

    The sample output is as follows (note that your specific output could be different due to randomness):

    [[0.99240105 0.9149215 0.04853315]

    [0.8425871 0.11617792 0.77983995]

    [0.82769081 0.57579771 0.11358125]]
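The die-throwing code mentioned in this exercise can be sketched as follows. Because randint excludes its upper bound, the call uses 7 so that a 6 can come up. The second part checks the range of the scaled uniform sample used for the weights.

```python
import numpy as np

# randint excludes the upper bound, so use 7 to allow a 6
rolls = np.random.randint(1, 7, size=10)
print(rolls)

# A scaled uniform sample: 50 + 50*u with u in [0, 1) lies in [50, 100)
weights = (50 + 50 * np.random.random(size=15)).round(decimals=2)
print(weights.min() >= 50, weights.max() <= 100)   # True True
```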

Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot

A binomial distribution is the probability distribution of the number of successes in a fixed number of independent trials, where each trial has the same pre-determined probability of success.

The most obvious example of this is a coin toss. A fair coin has an equal chance of heads or tails, but an unfair coin may have a higher chance of heads coming up, or vice versa. We can simulate a coin toss in NumPy in the following manner.

Suppose we have a biased coin where the probability of heads is 0.6. We toss this coin ten times and note down the number of heads that turn up. That is one trial or experiment. Now, we can repeat this experiment (10 coin tosses) any number of times, say 8 times. Each time, we record the number of heads:

  1. The experiment can be simulated using the following code:

    x = np.random.binomial(10,0.6,size=8)

    print(x)

    The sample output is as follows (note your specific output could be different due to randomness):

    [6 6 5 6 5 8 4 5]

  2. Plot the result using a bar chart:

    plt.figure(figsize=(7,4))

    plt.title("Number of successes in coin toss",fontsize=16)

    plt.bar(x=np.arange(1,9),height=x)

    plt.xlabel("Experiment number",fontsize=15)

    plt.ylabel("Number of successes",fontsize=15)

    plt.show()

    The sample output is as follows:

Figure 3.17: A screenshot of a graph showing the binomial distribution and the bar plot
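With only 8 experiments, the bar plot looks quite irregular. Repeating the experiment many more times shows the sample settling around the theoretical values for a binomial distribution, which are a mean of n*p and a variance of n*p*(1-p). The seed and the number of repetitions in this sketch are arbitrary assumptions.

```python
import numpy as np

np.random.seed(7)  # arbitrary seed (an assumption) for repeatable output
n, p = 10, 0.6     # 10 tosses per experiment, probability of heads 0.6
heads = np.random.binomial(n, p, size=10_000)

print(heads.mean())   # close to n * p = 6.0
print(heads.var())    # close to n * p * (1 - p) = 2.4
```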

Exercise 45: Generating Random Numbers from Normal Distribution and Histograms

We discussed the normal distribution in the last topic and mentioned that it is the most important probability distribution because many pieces of natural, social, and biological data follow this pattern closely when the number of samples is large. NumPy provides an easy way to generate random numbers corresponding to this distribution:

  1. Draw a single sample from a normal distribution by using the following command:

    x = np.random.normal()

    print(x)

    The sample output is as follows (note that your specific output could be different due to randomness):

    -1.2423774071573694

    We know that normal distribution is characterized by two parameters – mean (µ) and standard deviation (σ). In fact, the default values for this particular function are µ = 0.0 and σ = 1.0.

    Suppose we know that the heights of the teenage (12-16 years) students in a particular school are distributed normally with a mean height of 155 cm and a standard deviation of 10 cm.

  2. Generate a histogram of 100 students by using the following command:

    # Code to generate the 100 samples (heights)

    heights = np.random.normal(loc=155,scale=10,size=100)

    # Plotting code

    #-----------------------

    plt.figure(figsize=(7,5))

    plt.hist(heights,color='orange',edgecolor='k')

    plt.title("Histogram of teenage students' heights",fontsize=18)

    plt.xlabel("Height in cm",fontsize=15)

    plt.xticks(fontsize=15)

    plt.yticks(fontsize=15)

    plt.show()

    The sample output is as follows:

Figure 3.18: Histogram of teenage student's height

Note the use of the loc parameter for the mean (=155) and the scale parameter for the standard deviation (=10). The size parameter is set to 100 to generate that many samples.
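If you want the numbers behind a histogram rather than the picture, NumPy can compute the bin counts directly. The following sketch assumes 10 bins, which is also the default used by plt.hist; the seed is an arbitrary choice.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed (an assumption) for repeatable output
heights = np.random.normal(loc=155, scale=10, size=100)

# np.histogram returns the bin counts and bin edges without drawing anything
counts, edges = np.histogram(heights, bins=10)
print(counts)
print(edges.round(1))
print(counts.sum())     # 100, every sample falls in some bin
```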

Exercise 46: Calculation of Descriptive Statistics from a DataFrame

Recollect the age, weight, and height parameters that we defined for the plotting exercise. Let's put that data in a DataFrame to calculate various descriptive statistics about them.

The best part of working with a pandas DataFrame is that it has a built-in utility function to show all of these descriptive statistics with a single line of code. It does this by using the describe method:

  1. Construct a dictionary from the lists defined earlier and create a DataFrame from it by using the following command:

    people_dict={'People':people,'Age':age,'Weight':weight,'Height':height}

    people_df=pd.DataFrame(data=people_dict)

    people_df

    The output is as follows:

    Figure 3.19: Output of the created dictionary
  2. Find the number of rows and columns of the DataFrame by executing the following command:

    print(people_df.shape)

    The output is as follows:

    (12, 4)

  3. Obtain a simple count (any column can be used for this purpose) by executing the following command:

    print(people_df['Age'].count())

    The output is as follows:

    12

  4. Calculate the sum total of age by using the following command:

    print(people_df['Age'].sum())

    The output is as follows:

    353

  5. Calculate the mean age by using the following command:

    print(people_df['Age'].mean())

    The output is as follows:

    29.416666666666668

  6. Calculate the median weight by using the following command:

    print(people_df['Weight'].median())

    The output is as follows:

    66.5

  7. Calculate the maximum height by using the following command:

    print(people_df['Height'].max())

    The output is as follows:

    175

  8. Calculate the standard deviation of the weights by using the following command:

    print(people_df['Weight'].std())

    The output is as follows:

    18.45120510148239

    Note how we are calling the statistical functions directly on the columns of a DataFrame object.

  9. To calculate percentile, we can call a function from NumPy and pass on the particular column (a pandas series). For example, to calculate the 75th and 25th percentiles of age distribution and their difference (called the inter-quartile range), use the following code:

    pcnt_75 = np.percentile(people_df['Age'],75)

    pcnt_25 = np.percentile(people_df['Age'],25)

    print("Inter-quartile range: ",pcnt_75-pcnt_25)

    The output is as follows:

    Inter-quartile range: 24.0

  10. Use the describe command to find a detailed description of the DataFrame:

    print(people_df.describe())

    The output is as follows:

Figure 3.20: Output of the DataFrame using the describe method

Note

By default, the describe method summarizes only the columns that contain numeric data. Non-numeric columns, for example, People in this DataFrame, are left out of the output.
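One detail worth knowing when mixing the two libraries is that the pandas std method computes the sample standard deviation (dividing by n-1), whereas np.std computes the population standard deviation (dividing by n) unless told otherwise. The following sketch, which reuses the weight and age lists from Exercise 42, shows the difference.

```python
import numpy as np
import pandas as pd

weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]
age = [21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15]
s = pd.Series(weight)

print(s.std())                  # sample standard deviation (ddof=1), about 18.45
print(np.std(weight))           # population standard deviation (ddof=0), smaller
print(np.std(weight, ddof=1))   # matches the pandas result

# Series.quantile gives the same quartiles as np.percentile here
ages = pd.Series(age)
iqr = ages.quantile(0.75) - ages.quantile(0.25)
print(iqr)                      # 24.0
```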

Exercise 47: Built-in Plotting Utilities

DataFrame also has built-in plotting utilities that wrap around matplotlib functions and create basic plots of numeric data:

  1. Find the histogram of the weights by using the hist function:

    people_df['Weight'].hist()

    plt.show()

    The output is as follows:

    Figure 3.21: Histogram of the weights
  2. Create a simple scatter plot directly from the DataFrame to plot the relationship between weight and heights by using the following command:

    people_df.plot.scatter('Weight','Height',s=150,

    c='orange',edgecolor='k')

    plt.grid(True)

    plt.title("Weight vs. Height scatter plot",fontsize=18)

    plt.xlabel("Weight (in kg)",fontsize=15)

    plt.ylabel("Height (in cm)",fontsize=15)

    plt.show()

    The output is as follows:

Figure 3.22: Weight versus Height scatter plot

Note

You can try regular matplotlib methods around this function call to make your plot pretty.
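One way to do this is to keep the matplotlib Axes object that the pandas plotting call returns and customize the plot through it. The following is a minimal sketch with a four-row subset of the data.

```python
import pandas as pd
import matplotlib.pyplot as plt

people_df = pd.DataFrame({'Weight': [55, 35, 77, 68],
                          'Height': [160, 135, 170, 165]})

# The pandas plotting call returns a matplotlib Axes object
ax = people_df.plot.scatter('Weight', 'Height', s=150, c='orange', edgecolor='k')
ax.set_title("Weight vs. Height scatter plot", fontsize=18)
ax.set_xlabel("Weight (in kg)", fontsize=15)
ax.set_ylabel("Height (in cm)", fontsize=15)
ax.grid(True)
plt.show()
```

Working through the returned Axes is equivalent to the plt.title and plt.xlabel calls used in the exercise, but it stays unambiguous when several figures are open at once.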

Activity 5: Generating Statistics from a CSV File

Suppose you are working with the famous Boston housing price (from 1960) dataset. This dataset is famous in the machine learning community. Many regression problems can be formulated, and machine learning algorithms can be run on this dataset. You will perform a basic data wrangling activity (including plotting some trends) on this dataset by reading it as a pandas DataFrame.

Note

The pandas function for reading a CSV file is read_csv.
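As a quick illustration of read_csv, the following sketch reads a tiny, made-up CSV from an in-memory string. The column names rooms and price and all the values are hypothetical; they are not taken from the Boston housing file.

```python
import io
import pandas as pd

# A tiny, made-up CSV standing in for a file on disk
csv_text = "rooms,price\n6.5,24.0\n5.9,18.5\n7.1,33.2\n"

df = pd.read_csv(io.StringIO(csv_text))   # a file path works the same way
print(df.shape)                           # (3, 2)
print(df['rooms'].mean())

# Percentage of rows with a price below 20
pct_low = (df['price'] < 20).mean() * 100
print(pct_low)
```

Taking the mean of a Boolean column, as in the last step, is a compact way to compute the share of rows that satisfy a condition.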

These steps will help you complete this activity:

  1. Load the necessary libraries.
  2. Read in the Boston housing dataset (given as a .csv file) from the local directory.
  3. Check the first 10 records. Find the total number of records.
  4. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT.
  5. Check the last seven records of the new DataFrame you just created.
  6. Plot the histograms of all the variables (columns) in the new DataFrame.
  7. Plot them all at once using a for loop. Try to add a unique title to a plot.
  8. Create a scatter plot of crime rate versus price.
  9. Plot using log10(crime) versus price.
  10. Calculate some useful statistics, such as mean rooms per dwelling, median age, mean distances to five Boston employment centers, and the percentage of houses with a low price (< $20,000).

    Note

    The solution for this activity can be found on page 292.

Summary

In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.

In the second major topic of this chapter, we started with pandas series objects and quickly moved on to a critically important object: the pandas DataFrame. It is analogous to an Excel spreadsheet, a MATLAB matrix, or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, subsetting, row and column addition, and deletion.

Next, we covered the basics of plotting with matplotlib, the most widely used and popular Python library for visualization. Along with plotting exercises, we touched upon refresher concepts of descriptive statistics (such as central tendency and measure of spread) and probability distributions (such as uniform, binomial, and normal).

In the next chapter, we will cover more advanced operations with pandas DataFrames that will come in very handy in day-to-day data wrangling work.
