In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries. By the end of the chapter, you will be able to create and manipulate NumPy arrays, work with pandas Series and DataFrame objects, and build basic plots with matplotlib.
In the preceding chapters, we covered some advanced data structures and operations, such as stacks, queues, iterators, and file operations in Python. In this section, we will cover three essential libraries: NumPy, pandas, and matplotlib.
In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays can be one-dimensional lists, multi-dimensional tables, or matrices full of numbers.
The array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant.
Some example scenarios where you will need to handle numeric arrays are as follows:
In short, arrays and numeric data tables are everywhere. As a data wrangling professional, the importance of the ability to read and process numeric arrays cannot be overstated. In this regard, NumPy arrays will be the most important object in Python that you need to know about.
NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These have grown into highly mature libraries that provide functionality that meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.
One of the main advantages of the NumPy module is its ability to create and handle one-dimensional or multi-dimensional arrays. This data structure is at the heart of the NumPy package, and it serves as the fundamental building block of more advanced classes, such as the pandas DataFrame, which we will cover shortly in this chapter.
NumPy arrays are different from common Python lists, which can be thought of as simple arrays. NumPy arrays are built for vectorized operations that process a lot of numerical data with just a single line of code. Many of the built-in mathematical functions for NumPy arrays are written in low-level languages such as C or Fortran and pre-compiled for fast execution.
NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.
In this exercise, we will create a NumPy array from a list:
import numpy as np
list_1 = [1,2,3]
array_1 = np.array(list_1)
We just created a NumPy array object called array_1 from the regular Python list object, list_1.
import array as arr
a = arr.array('d', [1.2, 3.4, 5.6])
print(a)
The output is as follows:
array('d', [1.2, 3.4, 5.6])
type(array_1)
The output is as follows:
numpy.ndarray
type(list_1)
list
So, this is indeed different from the regular list object.
This simple exercise will demonstrate the addition of two NumPy arrays, and thereby show the key difference between a regular Python list/array and a NumPy array:
list_2 = list_1 + list_1
print(list_2)
The output is as follows:
[1, 2, 3, 1, 2, 3]
array_2 = array_1 + array_1
print(array_2)
The output is as follows:
[2 4 6]
Did you notice the difference? The first print shows a list with 6 elements, [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2 4 6], which are the element-wise sums of the elements of array_1 with themselves.
NumPy arrays are like mathematical objects – vectors. They are built for element-wise operations, that is, when we add two NumPy arrays, we add the first element of the first array to the first element of the second array – there is an element-to-element correspondence in this operation. This is in contrast to Python lists, where the elements are simply appended and there is no element-to-element relation. This is the real power of a NumPy array: they can be treated just like mathematical vectors.
A vector is a collection of numbers that can represent, for example, the coordinates of points in a three-dimensional space or the color components (RGB) of a pixel in a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array can maintain such order relationships. That's why they are perfectly suitable for use in numerical computations.
Now that you know that these arrays are like vectors, we will try some mathematical operations on arrays.
NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:
print("array_1 multiplied by array_1: ",array_1*array_1)
The output is as follows:
array_1 multiplied by array_1: [1 4 9]
print("array_1 divided by array_1: ",array_1/array_1)
The output is as follows:
array_1 divided by array_1: [1. 1. 1.]
print("array_1 raised to the power of array_1: ",array_1**array_1)
The output is as follows:
array_1 raised to the power of array_1: [ 1 4 27]
NumPy has all the built-in mathematical functions that you can think of. Here, we will create a list, convert it into a NumPy array, and then perform some advanced mathematical operations on that array:
list_5=[i for i in range(1,6)]
print(list_5)
The output is as follows:
[1, 2, 3, 4, 5]
array_5=np.array(list_5)
array_5
The output is as follows:
array([1, 2, 3, 4, 5])
# sine function
print("Sine: ",np.sin(array_5))
The output is as follows:
Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]
# logarithm
print("Natural logarithm: ",np.log(array_5))
print("Base-10 logarithm: ",np.log10(array_5))
print("Base-2 logarithm: ",np.log2(array_5))
The output is as follows:
Natural logarithm: [0. 0.69314718 1.09861229 1.38629436 1.60943791]
Base-10 logarithm: [0. 0.30103 0.47712125 0.60205999 0.69897 ]
Base-2 logarithm: [0. 1. 1.5849625 2. 2.32192809]
# Exponential
print("Exponential: ",np.exp(array_5))
The output is as follows:
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]
Generation of numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods.
The arange function creates a series of numbers based on the minimum and maximum bounds you give and the step size you specify. Another function, linspace, creates a fixed number of evenly spaced points between two extremes:
print("A series of numbers:",np.arange(5,16))
The output is as follows:
A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]
print("Numbers spaced apart by 2: ",np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ",np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order ",np.arange(30,-1,-5))
The output is as follows:
Numbers spaced apart by 2: [ 0 2 4 6 8 10]
Numbers spaced apart by a floating point number: [ 0. 2.5 5. 7.5 10. ]
Every 5th number from 30 in reverse order
[30 25 20 15 10 5 0]
print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))
The output is as follows:
11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]
So far, we have created only one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra). Just like we created the one-dimensional array from a simple flat list, we can create a two-dimensional array from a list of lists:
list_2D = [[1,2,3],[4,5,6],[7,8,9]]
mat1 = np.array(list_2D)
print("Type/Class of this object:",type(mat1))
print("Here is the matrix ---------- ",mat1," ----------")
The output is as follows:
Type/Class of this object: <class 'numpy.ndarray'>
Here is the matrix
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------
tuple_2D = np.array([(1.5,2,3), (4,5,6)])
mat_tuple = np.array(tuple_2D)
print (mat_tuple)
The output is as follows:
[[1.5 2. 3. ]
[4. 5. 6. ]]
Thus, we have created multi-dimensional arrays using Python lists and tuples.
The following methods let you check the dimension, shape, and size of the array. Note that if it's a 3x2 matrix, that is, it has 3 rows and 2 columns, then the shape will be (3,2), but the size will be 6, as 6 = 3x2:
print("Dimension of this matrix: ",mat1.ndim,sep='')
The output is as follows:
Dimension of this matrix: 2
print("Size of this matrix: ", mat1.size,sep='')
The output is as follows:
Size of this matrix: 9
print("Shape of this matrix: ", mat1.shape,sep='')
The output is as follows:
Shape of this matrix: (3, 3)
print("Data type of this matrix: ", mat1.dtype,sep='')
The output is as follows:
Data type of this matrix: int32
Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will take a look at how to create special matrices easily. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones on the diagonal:
print("Vector of zeros: ",np.zeros(5))
The output is as follows:
Vector of zeros: [0. 0. 0. 0. 0.]
print("Matrix of zeros: ",np.zeros((3,4)))
The output is as follows:
Matrix of zeros: [[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
print("Matrix of 5's: ",5*np.ones((3,3)))
The output is as follows:
Matrix of 5's: [[5. 5. 5.]
[5. 5. 5.]
[5. 5. 5.]]
print("Identity matrix of dimension 2:",np.eye(2))
The output is as follows:
Identity matrix of dimension 2: [[1. 0.]
[0. 1.]]
print("Identity matrix of dimension 4:",np.eye(4))
The output is as follows:
Identity matrix of dimension 4: [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
print("Random matrix of shape (4,3): ",np.random.randint(low=1,high=10,size=(4,3)))
The sample output is as follows:
Random matrix of shape (4,3):
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]
When creating matrices, you need to pass tuples of integers as the shape argument.
Random number generation is a very useful utility and needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and see how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
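As a preview of that built-in random number generation, here is a minimal sketch drawing from a few common distributions; the particular functions and parameter values chosen here are illustrative:

```python
import numpy as np

# Seed the legacy global state for reproducibility
# (the same style as the np.random.randint calls in this chapter)
np.random.seed(42)

uniform_draws = np.random.uniform(low=0.0, high=1.0, size=5)  # floats in [0, 1)
normal_draws = np.random.normal(loc=0.0, scale=1.0, size=5)   # mean 0, std 1
integer_draws = np.random.randint(low=1, high=7, size=10)     # simulated die rolls

print("Uniform draws:", uniform_draws)
print("Normal draws:", normal_draws)
print("Die rolls:", integer_draws)
```

Each call returns a NumPy array, so everything you learn about arrays applies directly to randomly generated data as well.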
Reshaping an array is a very useful operation, as machine learning algorithms may demand input vectors in various shapes for mathematical manipulation. In this section, we will look at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful operation in many machine learning and data analytics tasks.
The following code demonstrates reshape. We will first generate a random one-dimensional vector of numbers between 1 and 99 and then reshape it into multi-dimensional arrays:
a = np.random.randint(1,100,30)
b = a.reshape(2,3,5)
c = a.reshape(6,5)
print ("Shape of a:", a.shape)
print ("Shape of b:", b.shape)
print ("Shape of c:", c.shape)
The output is as follows:
Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)
print(" a looks like ",a)
print(" b looks like ",b)
print(" c looks like ",c)
The sample output is as follows:
a looks like
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]
b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]
[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]
c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]
"b" is a three-dimensional array – a kind of list of a list of a list.
b_flat = b.ravel()
print(b_flat)
The sample output is as follows:
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]
Indexing and slicing of NumPy arrays is very similar to regular list indexing. We can even step through a vector of elements with a definite step size by providing it as an additional argument in the start:stop:step format. Furthermore, we can pass a list as the argument to select specific elements.
In this exercise, we will learn about indexing and slicing on one-dimensional and multi-dimensional arrays:
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).
array_1 = np.arange(0,11)
print("Array:",array_1)
The output is as follows:
Array: [ 0 1 2 3 4 5 6 7 8 9 10]
print("Element at 7th index is:", array_1[7])
The output is as follows:
Element at 7th index is: 7
print("Elements from 3rd to 5th index are:", array_1[3:6])
The output is as follows:
Elements from 3rd to 5th index are: [3 4 5]
print("Elements up to 4th index are:", array_1[:4])
The output is as follows:
Elements up to 4th index are: [0 1 2 3]
print("Elements from last backwards are:", array_1[-1::-1])
The output is as follows:
Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]
print("3 Elements from last backwards are:", array_1[-1:-6:-2])
The output is as follows:
3 Elements from last backwards are: [10 8 6]
array_2 = np.arange(0,21,2)
print("New array:",array_2)
The output is as follows:
New array: [ 0 2 4 6 8 10 12 14 16 18 20]
print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])
The output is as follows:
Elements at 2nd, 4th, and 9th index are: [ 4 8 18]
matrix_1 = np.random.randint(10,100,15).reshape(3,5)
print("Matrix of random 2-digit numbers ",matrix_1)
The sample output is as follows:
Matrix of random 2-digit numbers
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]
print(" Double bracket indexing ")
print("Element in row index 1 and column index 2:", matrix_1[1][2])
The sample output is as follows:
Double bracket indexing
Element in row index 1 and column index 2: 44
print(" Single bracket with comma indexing ")
print("Element in row index 1 and column index 2:", matrix_1[1,2])
The sample output is as follows:
Single bracket with comma indexing
Element in row index 1 and column index 2: 44
print(" Row or column extract ")
print("Entire row at index 2:", matrix_1[2])
print("Entire column at index 3:", matrix_1[:,3])
The sample output is as follows:
Row or column extract
Entire row at index 2: [39 12 99 99 33]
Entire column at index 3: [24 72 99]
print(" Subsetting sub-matrices ")
print("Matrix with row indices 1 and 2 and column indices 3 and 4 ", matrix_1[1:3,3:5])
The sample output is as follows:
Subsetting sub-matrices
Matrix with row indices 1 and 2 and column indices 3 and 4
[[72 68]
[99 33]]
print("Matrix with row indices 0 and 1 and column indices 1 and 3 ", matrix_1[0:2,[1,3]])
The sample output is as follows:
Matrix with row indices 0 and 1 and column indices 1 and 3
[[57 24]
[20 72]]
Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:
matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)
print("Matrix of random 2-digit numbers ",matrix_1)
print (" Elements greater than 50 ", matrix_1[matrix_1>50])
The sample output is as follows (note that the exact output will be different for you as it is random):
Matrix of random 2-digit numbers
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]
Elements greater than 50
[71 89 66 99 54 66 85 82]
NumPy arrays operate just like mathematical matrices, and the operations are performed element-wise.
Create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, subtraction, multiplication, and division. Show the exponentiation (raising a number to a certain power) operation, as follows:
Due to random number generation, your specific output could be different to what is shown here.
matrix_1 = np.random.randint(1,10,9).reshape(3,3)
matrix_2 = np.random.randint(1,10,9).reshape(3,3)
print(" 1st Matrix of random single-digit numbers ",matrix_1)
print(" 2nd Matrix of random single-digit numbers ",matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
1st Matrix of random single-digit numbers
[[6 5 9]
[4 7 1]
[3 2 7]]
2nd Matrix of random single-digit numbers
[[2 3 1]
[9 9 9]
[9 9 6]]
print(" Addition ", matrix_1+matrix_2)
print(" Multiplication ", matrix_1*matrix_2)
print(" Division ", matrix_1/matrix_2)
print(" Linear combination: 3*A - 2*B ", 3*matrix_1-2*matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]
Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]
Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778 0.11111111]
[0.33333333 0.22222222 1.16666667]]
Linear combination: 3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]
print(" Addition of a scalar (100) ", 100+matrix_1)
print(" Exponentiation, matrix cubed here ", matrix_1**3)
print(" Exponentiation, square root using 'pow' function ",pow(matrix_1,0.5))
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition of a scalar (100)
[[106 105 109]
[104 107 101]
[103 102 107]]
Exponentiation, matrix cubed here
[[216 125 729]
[ 64 343 1]
[ 27 8 343]]
Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
[2. 2.64575131 1. ]
[1.73205081 1.41421356 2.64575131]]
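Note that the * operator used above is always element-wise. If you need the true linear-algebra matrix product instead, NumPy provides np.matmul, also available as the @ operator. A quick sketch of the difference:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B      # element-by-element product
matrix_product = A @ B   # true matrix product (same as np.matmul(A, B))

print("Element-wise:\n", elementwise)      # [[ 5 12], [21 32]]
print("Matrix product:\n", matrix_product) # [[19 22], [43 50]]
```

Keeping these two operations distinct avoids a very common source of bugs in numerical code.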
Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print("Matrix a ",a)
print("Matrix b ",b)
print("Vertical stacking ",np.vstack((a,b)))
print("Horizontal stacking ",np.hstack((a,b)))
The output is as follows:
Matrix a
[[1 2]
[3 4]]
Matrix b
[[5 6]
[7 8]]
Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]
Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of it is directly useful for beginner-level data wrangling, so we won't cover it here.
The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.
The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.
In this exercise, we will learn about how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series:
labels = ['a','b','c']
my_data = [10,20,30]
array_1 = np.array(my_data)
d = {'a':10,'b':20,'c':30}
print ("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)
The output is as follows:
Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}
import pandas as pd
series_1=pd.Series(data=my_data)
print(series_1)
The output is as follows:
0 10
1 20
2 30
dtype: int64
series_2=pd.Series(data=my_data, index = labels)
print(series_2)
The output is as follows:
a 10
b 20
c 30
dtype: int64
series_3=pd.Series(array_1,labels)
print(series_3)
The output is as follows:
a 10
b 20
c 30
dtype: int32
series_4=pd.Series(d)
print(series_4)
The output is as follows:
a 10
b 20
c 30
dtype: int64
The pandas series object can hold many types of data. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity:
print (" Holding numerical data ",'-'*25, sep='')
print(pd.Series(array_1))
The output is as follows:
Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int32
print (" Holding text labels ",'-'*20, sep='')
print(pd.Series(labels))
The output is as follows:
Holding text labels
--------------------
0 a
1 b
2 c
dtype: object
print (" Holding functions ",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
The output is as follows:
Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
print (" Holding objects from a dictionary ",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))
The output is as follows:
Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x0000...
1 <built-in method items of dict object at 0x000...
2 <built-in method values of dict object at 0x00...
dtype: object
The pandas DataFrame is similar to an Excel spreadsheet or a relational database (SQL) table. It consists of three main components: the data, the index (the rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels,
columns=column_headings)
print(" The data frame looks like ",'-'*45, sep='')
print(df)
The sample output is as follows:
The data frame looks like
---------------------------------------------
W X Y Z
A 6 3 3 3
B 1 9 9 4
C 4 3 6 9
D 4 8 6 7
E 6 6 9 1
d={'a':[10,20],'b':[30,40],'c':[50,60]}
df2=pd.DataFrame(data=d,index=['X','Y'])
print(df2)
The output is as follows:
a b c
X 10 30 50
Y 20 40 60
The most common way that you will encounter to create a pandas DataFrame will be to read tabular data from a file on your local disk or over the internet – CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.
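As a small preview, here is a minimal sketch of pd.read_csv; an in-memory io.StringIO buffer stands in for a file path, and the column names and values are invented purely for illustration:

```python
import io
import pandas as pd

# A small in-memory stand-in for a CSV file on disk
csv_text = """name,age,score
Ann,21,88
Brandon,12,75
Chen,32,93
"""

# pd.read_csv accepts a file path, a URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print("Shape:", df.shape)  # (3, 3)
```

In real use, you would pass a path such as pd.read_csv('data.csv') instead of the StringIO buffer.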
In the previous section, we used print(df) to print the whole DataFrame. For a large dataset, we would like to print only sections of data. In this exercise, we will read a part of the DataFrame:
# 25 rows and 4 columns
matrix_data = np.random.randint(1,100,100).reshape(25,4)
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data,columns=column_headings)
df.head()
The sample output is as follows (note that your output could be different due to randomness):
By default, head shows only the first five rows. If you want to see a specific number of rows, pass that number as an argument.
df.head(8)
The sample output is as follows:
Just like head shows the first few rows, tail shows the last few rows.
df.tail(10)
The sample output is as follows:
There are two methods for indexing and slicing columns from a DataFrame: the dot notation (for example, df.X) and the bracket notation (for example, df['X']).
The dot method is convenient for quickly accessing a specific column, while the bracket method is intuitive and easy to follow. With the bracket method, you can access the data by the generic name/header of the column.
The following code illustrates these concepts. Execute them in your Jupyter notebook:
print(" The 'X' column ",'-'*25, sep='')
print(df['X'])
print(" Type of the column: ", type(df['X']), sep='')
print(" The 'X' and 'Z' columns indexed by passing a list ",'-'*55, sep='')
print(df[['X','Z']])
print(" Type of the pair of columns: ", type(df[['X','Z']]), sep='')
The output is as follows (a screenshot is shown here because the actual column is long):
This is the output showing the type of column:
This is the output showing the X and Z column indexed by passing a list:
This is the output showing the type of the pair of column:
For more than one column, the object turns into a DataFrame. But for a single column, it is a pandas series object.
Indexing and slicing rows in a DataFrame can also be done using the label-based loc method and the integer-position-based iloc method.
The loc method is intuitive and easy to follow. In this method, you can access the data by the generic name of the row. On the other hand, the iloc method allows you to access the rows by their numerical index. It can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates the concepts of loc and iloc:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels,
columns=column_headings)
print(" Label-based 'loc' method for selecting row(s) ",'-'*60, sep='')
print(" Single row ")
print(df.loc['C'])
print(" Multiple rows ")
print(df.loc[['B','C']])
print(" Index position based 'iloc' method for selecting row(s) ",'-'*70, sep='')
print(" Single row ")
print(df.iloc[2])
print(" Multiple rows ")
print(df.iloc[[1,2]])
The sample output is as follows:
One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We show some of these concepts in the following code blocks:
print(" A column is created by assigning it in relation ",'-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
The sample output is as follows:
print(" A column is dropped by using df.drop() method ",'-'*55, sep='')
df = df.drop('New', axis=1) # Notice the axis=1 option; axis=0 is the default, so it must be changed to 1
print(df)
The sample output is as follows:
df1=df.drop('A')
print(" A row is dropped by using df.drop method and axis=0 ",'-'*65, sep='')
print(df1)
The sample output is as follows:
The drop method creates a copy of the DataFrame and does not change the original DataFrame.
print(" An in-place change can be done by making inplace=True in the drop method ",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)
A sample output is as follows:
By default, these operations are not in-place, that is, they do not impact the original DataFrame object but instead return a copy of the original with the addition (or deletion). The last bit of code shows how to make a change in the existing DataFrame with the inplace=True argument. Please note that this change is irreversible and should be used with caution.
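To make the copy-versus-in-place distinction concrete, here is a small sketch (the column names and values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['W', 'X', 'Y', 'Z'])

# drop() returns a modified copy; the original DataFrame is untouched
df_copy = df.drop('W', axis=1)
print("Original columns:", list(df.columns))   # still includes 'W'
print("Copy columns:", list(df_copy.columns))  # 'W' removed

# With inplace=True, the original itself is changed (irreversibly)
df.drop('W', axis=1, inplace=True)
print("Original columns after in-place drop:", list(df.columns))
```

Checking list(df.columns) before and after is a quick way to verify which variant of the operation you performed.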
One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, for which we don't have to search for and write new code. Furthermore, most of these subroutines are written using C or Fortran code (and pre-compiled), making them extremely fast to execute.
For any data wrangling task, it is quite useful to extract basic descriptive statistics from the data and create some simple visualizations/plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated.
As the inferential statistics are intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.
There are two broad approaches for descriptive statistical analysis: computing numerical summary statistics (such as measures of central tendency and spread) and creating simple visualizations/plots of the data.
In this topic, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package – matplotlib – which is the most powerful and versatile visualization library in Python.
In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from some data about the age, weight, and height of a few people:
people = ['Ann','Brandon','Chen','David','Emily','Farook',
'Gagan','Hamish','Imran','Joseph','Katherine','Lily']
age = [21,12,32,45,37,18,28,52,5,40,48,15]
weight = [55,35,77,68,70,60,72,69,18,65,82,48]
height = [160,135,170,165,173,168,175,159,105,171,155,158]
import matplotlib.pyplot as plt
plt.scatter(age,weight)
plt.show()
The output is as follows:
The plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding X-axis and Y-axis labels with a customized font size, adding grid lines, changing the Y-axis limit to be between 0 and 100, adding X and Y-tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.
plt.figure(figsize=(8,6))
plt.title("Plot of Age vs. Weight (in kgs)",fontsize=20)
plt.xlabel("Age (years)",fontsize=16)
plt.ylabel("Weight (kgs)",fontsize=16)
plt.grid (True)
plt.ylim(0,100)
plt.xticks([i*5 for i in range(12)],fontsize=15)
plt.yticks(fontsize=15)
plt.scatter(x=age,y=weight,c='orange',s=150,edgecolors='k')
plt.text(x=20,y=85,s="Weights after 18-20 years of age",fontsize=15)
plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',color='blue',lw=3)
plt.legend(['Weight in kgs'],loc=2,fontsize=12)
plt.show()
The output is as follows:
Observe the following:
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. Measures of central tendency are also categorized as summary statistics:
Generally, the mean is a better measure to use for symmetric data and median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode:
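pandas exposes these measures directly as Series methods. A short sketch on a made-up, right-skewed sample shows why the median can be the better summary:

```python
import pandas as pd

data = pd.Series([2, 3, 3, 5, 7, 9, 48])  # a small, right-skewed sample

print("Mean:", data.mean())      # 11.0 -- pulled upward by the outlier 48
print("Median:", data.median())  # 5.0  -- robust to the outlier
print("Mode:", data.mode()[0])   # 3    -- the most frequent value
```

Here, the single large value of 48 drags the mean well above most of the data, while the median still sits near the bulk of the values.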
The spread of the data is a measure of how much the values in the dataset are likely to differ from the mean of the values. If all the values are close together, then the spread is low; on the other hand, if some or all of the values differ by a large amount from the mean (and from each other), then there is a large spread in the data:
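NumPy's std and var methods quantify this spread; a sketch with two made-up samples that share the same mean but differ greatly in spread:

```python
import numpy as np

tight = np.array([48, 49, 50, 51, 52])  # values clustered near the mean
wide = np.array([10, 30, 50, 70, 90])   # same mean of 50, widely scattered

print("Tight data - std:", tight.std(), " variance:", tight.var())  # 2.0 variance
print("Wide data  - std:", wide.std(), " variance:", wide.var())    # 800.0 variance
```

Both arrays have a mean of exactly 50, yet their variances differ by a factor of 400, which is precisely what a measure of spread is designed to capture.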
A random variable is a variable whose value represents the outcome of a statistical experiment or process.
Although it sounds very formal, pretty much everything around us that we can measure can be thought of as a random variable.
The reason behind this is that almost all natural, social, biological, and physical processes are the final outcome of a large number of complex processes, and we cannot know the details of those fundamental processes. All we can do is observe and measure the final outcome.
Typical examples of random variables that are around us are as follows:
These values can take any discrete or continuous value and they follow a particular pattern (although the pattern may vary over time). Therefore, they can all be classified as random variables.
A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of a variable vary based on the underlying probability distribution.
Suppose you go to a school and measure the heights of students who have been selected randomly. Height is an example of a random variable here. As you measure height, you can create a distribution of height. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.
The concepts of central tendency and spread are applicable to a distribution and are used to describe the properties and behavior of a distribution.
Statisticians generally divide all distributions into two broad categories:
Discrete probability functions are also known as probability mass functions and can assume a discrete number of values. For example, coin tosses and counts of events are discrete functions. You can have only heads or tails in a coin toss. Similarly, if you're counting the number of trains that arrive at a station per hour, you can count 11 or 12 trains, but nothing in-between.
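Both examples from this paragraph are easy to simulate with NumPy; the seed and parameter choices in this sketch are arbitrary:

```python
import numpy as np

np.random.seed(0)

# Simulate 1,000 fair coin tosses: each toss is 0 (tails) or 1 (heads)
tosses = np.random.randint(0, 2, size=1000)
print("Fraction of heads:", tosses.mean())  # should be close to 0.5

# Counts of events per hour (such as arriving trains) are also discrete;
# they are commonly modeled with the Poisson distribution
train_counts = np.random.poisson(lam=12, size=5)
print("Trains per hour over 5 hours:", train_counts)
```

Note that every simulated value is a whole number; there is no such thing as half a head or 11.5 trains.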
Some prominent discrete distributions are as follows:
Continuous probability functions are also known as probability density functions. You have a continuous distribution if the variable can assume an infinite number of values between any two values. Continuous variables are often measurements on a real number scale, such as height, weight, and temperature.
The most well-known continuous distribution is the normal distribution, which is also known as the Gaussian distribution or the bell curve. This symmetric distribution fits a wide variety of phenomena, such as human height and IQ scores.
The normal distribution is linked to the famous 68-95-99.7 rule, which describes the percentage of data that falls within 1, 2, or 3 standard deviations away from the mean if the data follows a normal distribution. This means that you can quickly look at some sample data, calculate the mean and standard deviation, and can have a confidence (a statistical measure of uncertainty) that any future incoming data will fall within those 68%-95%-99.7% boundaries. This rule is widely used in industries, medicine, economics, and social science:
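We can verify the 68-95-99.7 rule empirically with a quick NumPy sketch (the exact fractions will vary slightly from run to run, since the samples are random):

```python
import numpy as np

# Draw a large sample from the standard normal distribution
data = np.random.normal(loc=0.0, scale=1.0, size=100000)

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(data - data.mean()) < k * data.std())
    print("Within", k, "standard deviation(s):", round(frac, 3))
```

With 100,000 samples, the printed fractions land very close to 0.68, 0.95, and 0.997.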
A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, because a multitude of complex sub-processes and mutual interactions give rise to such data, it all falls into the category of discrete or continuous random variables.
It will be extremely difficult and confusing to the data wrangler or data science team if all of this data continues to be treated as completely random and without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.
Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of the pattern of the data.
Therefore, as the front line of a data science pipeline, data wrangling must deal with measuring and quantifying the descriptive statistics of incoming data. Along with the formatted and cleaned-up data, the primary job of a data wrangler is to hand over these measures (and sometimes accompanying plots) to the next member of the analytics team.
Plotting and visualization also help a data wrangling team identify potential outliers and misfits in the incoming data stream and help them to take appropriate action. We will see some examples of such tasks in the next chapter, where we will identify odd data points by creating scatter plots or histograms and either impute or omit the data point.
Now that we have some basic knowledge of NumPy, pandas, and matplotlib under our belt, we can explore a few additional topics related to these libraries, such as how we can bring them together for advanced data generation, analysis, and visualization.
NumPy offers a wide range of random number generation utility functions, covering various statistical distributions such as uniform, binomial, Gaussian normal, Beta/Gamma, and chi-square. These functions are extremely useful and appear countless times in advanced statistical data mining and machine learning tasks, so a solid working knowledge of them is strongly encouraged for all readers of this book.
Here, we will discuss three of the distributions that come in handy most often in data wrangling tasks – uniform, binomial, and Gaussian normal. The goal is to show simple function calls that can generate one or more random numbers/arrays whenever the user needs them.
The results will differ for each reader, since these functions are designed to produce random draws.
In this exercise, we will be generating random numbers from a uniform distribution:
x = np.random.randint(1,10)
print(x)
The sample output is as follows (your output could be different):
1
x = np.random.randint(1,10,size=1)
print(x)
The sample output is as follows (your output could be different due to random draw):
[8]
Therefore, we can easily write the code to simulate a normal six-sided die being thrown for 10 trials.
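A minimal sketch of such a simulation follows. Note that randint excludes the upper bound, so we must pass 7 to get faces 1 through 6:

```python
import numpy as np

# randint's upper bound is exclusive, so (1, 7) yields faces 1 through 6
dice_rolls = np.random.randint(1, 7, size=10)
print(dice_rolls)
```

Each run produces a different sequence of ten faces.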
How about moving away from integers and generating some real numbers? Let's say that we want to generate artificial data for the weights (in kg) of 15 adults, measured accurately to two decimal places.
x = 50+50*np.random.random(size=15)
x= x.round(decimals=2)
print(x)
The sample output is as follows:
[56.24 94.67 50.66 94.36 77.37 53.81 61.47 71.13 59.3 65.3 63.02 65.
58.21 81.21 91.62]
We are not only restricted to one-dimensional arrays.
x = np.random.rand(3,3)
print(x)
The sample output is as follows (note that your specific output could be different due to randomness):
[[0.99240105 0.9149215 0.04853315]
[0.8425871 0.11617792 0.77983995]
[0.82769081 0.57579771 0.11358125]]
A binomial distribution is the probability distribution of the number of successes in a fixed number of trials of an event with a pre-determined probability of success.
The most obvious example is a coin toss. A fair coin has an equal chance of landing heads or tails, but a biased coin may favor one side over the other. We can simulate coin tosses in NumPy in the following manner.
Suppose we have a biased coin where the probability of heads is 0.6. We toss this coin ten times and note down the number of heads turning up each time. That is one trial or experiment. Now, we can repeat this experiment (10 coin tosses) any number of times, say 8 times. Each time, we record the number of heads:
x = np.random.binomial(10,0.6,size=8)
print(x)
The sample output is as follows (note your specific output could be different due to randomness):
[6 6 5 6 5 8 4 5]
plt.figure(figsize=(7,4))
plt.title("Number of successes in coin toss",fontsize=16)
plt.bar(np.arange(1,9),height=x)
plt.xlabel("Experiment number",fontsize=15)
plt.ylabel("Number of successes",fontsize=15)
plt.show()
The sample output is as follows:
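As a sanity check on the simulation, the average number of successes should approach n × p = 10 × 0.6 = 6 as the experiment is repeated many times (a sketch; your exact average will vary slightly):

```python
import numpy as np

# Many repetitions of the 10-toss experiment with a biased coin (p = 0.6)
successes = np.random.binomial(n=10, p=0.6, size=100000)
print(round(successes.mean(), 2))   # close to 6.0
```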
We discussed the normal distribution in the last topic and mentioned that it is the most important probability distribution because many pieces of natural, social, and biological data follow this pattern closely when the number of samples is large. NumPy provides an easy way to generate random numbers corresponding to this distribution:
x = np.random.normal()
print(x)
The sample output is as follows (note that your specific output could be different due to randomness):
-1.2423774071573694
We know that normal distribution is characterized by two parameters – mean (µ) and standard deviation (σ). In fact, the default values for this particular function are µ = 0.0 and σ = 1.0.
Suppose we know that the heights of the teenage (12-16 years) students in a particular school are normally distributed, with a mean height of 155 cm and a standard deviation of 10 cm.
# Code to generate the 100 samples (heights)
heights = np.random.normal(loc=155,scale=10,size=100)
# Plotting code
#-----------------------
plt.figure(figsize=(7,5))
plt.hist(heights,color='orange',edgecolor='k')
plt.title("Histogram of teenaged students' heights",fontsize=18)
plt.xlabel("Height in cm",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
The sample output is as follows:
Note the use of the loc parameter for the mean (=155) and the scale parameter for the standard deviation (=10). The size parameter is set to 100 to generate that many samples.
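We can confirm that the generated sample roughly matches the parameters we asked for (a sketch; with only 100 samples, the sample mean and standard deviation will be close to, but not exactly, 155 and 10):

```python
import numpy as np

heights = np.random.normal(loc=155, scale=10, size=100)

# Sample statistics approximate, but do not equal, loc and scale
print(round(heights.mean(), 2))   # near 155
print(round(heights.std(), 2))    # near 10
```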
Recollect the age, weight, and height parameters that we defined for the plotting exercise. Let's put that data in a DataFrame to calculate various descriptive statistics about them.
The best part of working with a pandas DataFrame is that it has a built-in utility function to show all of these descriptive statistics with a single line of code. It does this by using the describe method:
people_dict={'People':people,'Age':age,'Weight':weight,'Height':height}
people_df=pd.DataFrame(data=people_dict)
people_df
The output is as follows:
print(people_df.shape)
The output is as follows:
(12, 4)
print(people_df['Age'].count())
The output is as follows:
12
print(people_df['Age'].sum())
The output is as follows:
353
print(people_df['Age'].mean())
The output is as follows:
29.416666666666668
print(people_df['Weight'].median())
The output is as follows:
66.5
print(people_df['Height'].max())
The output is as follows:
175
print(people_df['Weight'].std())
The output is as follows:
18.45120510148239
Note how we are calling the statistical functions directly from a DataFrame object.
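One subtlety worth knowing: pandas computes the sample standard deviation (ddof=1 in NumPy terms), whereas NumPy's np.std defaults to the population version (ddof=0), so the two can disagree slightly on the same data. A sketch with a hypothetical weight column:

```python
import numpy as np
import pandas as pd

weights = pd.Series([61, 62, 90, 43, 89, 12, 77, 55, 83, 48, 60, 44])  # hypothetical data

print(weights.std())                    # sample std (ddof=1) -- pandas default
print(np.std(weights.values))           # population std (ddof=0) -- NumPy default
print(np.std(weights.values, ddof=1))   # matches the pandas result
```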
pcnt_75 = np.percentile(people_df['Age'],75)
pcnt_25 = np.percentile(people_df['Age'],25)
print("Inter-quartile range: ",pcnt_75-pcnt_25)
The output is as follows:
Inter-quartile range: 24.0
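The same inter-quartile range can also be computed directly from a Series using pandas' own quantile method, without going through NumPy (shown here on a small hypothetical column, since the result depends on your data):

```python
import pandas as pd

age = pd.Series([22, 25, 28, 31, 35, 40, 45, 50])  # hypothetical ages

# quantile(0.75) and quantile(0.25) give the 75th and 25th percentiles
iqr = age.quantile(0.75) - age.quantile(0.25)
print("Inter-quartile range:", iqr)
```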
print(people_df.describe())
The output is as follows:
This function works only on the columns where numeric data is present. It has no impact on the non-numeric columns, for example, People in this DataFrame.
DataFrame also has built-in plotting utilities that wrap around matplotlib functions and create basic plots of numeric data:
people_df['Weight'].hist()
plt.show()
The output is as follows:
people_df.plot.scatter('Weight','Height',s=150,
c='orange',edgecolor='k')
plt.grid(True)
plt.title("Weight vs. Height scatter plot",fontsize=18)
plt.xlabel("Weight (in kg)",fontsize=15)
plt.ylabel("Height (in cm)",fontsize=15)
plt.show()
The output is as follows:
You can try regular matplotlib methods around this function call to make your plot pretty.
Suppose you are working with the famous Boston housing price dataset (from 1960). This dataset is well known in the machine learning community: many regression problems can be formulated, and machine learning algorithms can be run, on it. You will perform a basic data wrangling activity (including plotting some trends) on this dataset by reading it in as a pandas DataFrame.
The pandas function for reading a CSV file is read_csv.
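A minimal sketch of how read_csv works is shown below. The column names here are hypothetical stand-ins, and a StringIO buffer replaces the real file so that the example is self-contained; for the actual activity you would pass the path to your copy of the dataset instead:

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for a real CSV file (hypothetical columns and values)
csv_text = "CRIM,RM,PRICE\n0.02,6.5,24.0\n0.03,7.1,34.7\n"

# For a real file, you would call pd.read_csv("path/to/your_file.csv")
df = pd.read_csv(StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows
```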
These steps will help you complete this activity:
The solution for this activity can be found on page 292.
In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.
In the second major topic of this chapter, we started with the pandas series object and quickly moved on to a critically important object – the pandas DataFrame. It is analogous to an Excel spreadsheet or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, subsetting, and row and column addition and deletion.
Next, we covered the basics of plotting with matplotlib, the most widely used and popular Python library for visualization. Along with plotting exercises, we touched upon refresher concepts of descriptive statistics (such as central tendency and measure of spread) and probability distributions (such as uniform, binomial, and normal).
In the next chapter, we will cover more advanced operations with pandas DataFrames that will come in very handy for day-to-day work in a data wrangling job.