Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

15. Data Acquisition

Ashwin Pajankar¹

(1)

Nashik, Maharashtra, India

In the previous chapter, you learned the basics of using two Pandas data structures, namely, the series and the dataframe.

This chapter focuses on acquiring data with Python using all the libraries you have studied up to now (NumPy, Matplotlib, and Pandas). The following are the topics you will learn about in this chapter:

Handling plain-text files
Handling CSV files with Python
Using Python and Excel
Writing and reading files with NumPy
Reading data from a CSV file with NumPy
Using a Matplotlib CBook
Reading data from a CSV file
Reading data from an Excel file
Reading data from a JSON file
Reading data from a Pickle file
Reading data from the Web
Reading data from a relational database
Reading data from the clipboard

After reading this chapter, you will be comfortable reading data from various file formats and saving it.

Plain-Text File Handling

Let’s learn how to read data from and write data to a plain-text file. Python comes with the functionality to read and write plain-text files. We have four modes for opening a file, as listed here:

w: Write
r: Read
a: Append
r+: Read and write mode

You can use them (one at a time) as follows:

f = open('testfile.txt', 'w')

print(f)

This code opens the testfile.txt file in write mode. If the file does not exist, then Python creates this file in the current location on disk. If the file already exists, it overwrites the contents of the file. The previous code prints the file object as follows:

<_io.TextIOWrapper name='testfile.txt' mode='w' encoding='cp1252'>

Let’s write some data to the file. In this case, the data consists of multicharacter strings.

f.write('This is a test string. ')

f.write('This is the middle line. ')

f.write('This is the last line.')

You can close the file object (also called the file handle) as follows:

f.close()

You know that opening a file again in write mode will overwrite its data. So, this time, let’s open the same file in append mode as follows:

f = open('testfile.txt', 'a')

f.write(' This is the appended line.')

f.close()

We are writing one line into the file and then closing the file. Let’s read the data and print it as follows:

f = open('testfile.txt', 'r')

print(f.read())

f.close()

The output is as follows:

This is a test string.

This is the middle line.

This is the last line.

This is the appended line

You can retrieve the lines in a list (with every line in the file corresponding to an element in the list), as follows:

f = open('testfile.txt', 'r')

print(f.readlines())

f.close()

The output is as follows:

['This is a test string. ', 'This is the middle line. ', 'This is the last line. ', 'This is the appended line.']

You can also retrieve the data in the file line by line as follows:

f = open('testfile.txt', 'r')

for line in f:

print(line)

f.close()

The output is as follows:

This is a test string.

This is the middle line.

This is the last line.

This is the appended line.

Handling CSV Files with Python

Let’s learn a few things about the comma-separated file (CSV) format. CSV files store data in plain-text format, and the data items are either a fixed length or separated by a delimiter such as a comma (,), a pipe (|), or a colon(:). The most common CSV format uses a comma as the delimiter, and many times the first line is used to store the names of the columns.

In this section, you will learn how to handle a CSV file with Python 3. Python 3 comes with a built-in library to handle CSV files. You do not have to install anything. You can import the library as follows:

import csv

You can open the file as a plain-text file in read mode as follows:

file = open('test.csv', 'r')

print(file)

Once you open the file, you can pass the file handle to the routine csv.reader() as follows:

csvfile = csv.reader(file, delimiter=',')

print(csvfile)

This prints the value of the object as follows:

<_csv.reader object at 0x0590AC68>

You can retrieve the data line by line as follows:

for row in csvfile:

print(row)

This produces the following output:

['Banana', 'Yellow', '250']

['Orange', 'Orange', '200']

['Grapes', 'Green', '400']

['Tomato', 'Red', '100']

['Spinach', 'Green', '40']

['Potatoes', 'Gray', '400']

['Rice', 'White', '300']

['Rice', 'Brown', '400']

['Wheat', 'Brown', '500']

['Barley', 'Yellow', '500']

You can display the elements individually as follows:

for row in csvfile:

for element in row:

print(element)

The output is as follows:

Banana

Yellow

250

Orange

200

Grapes

Green

400

Tomato

Red

100

Spinach

Green

Potatoes

Gray

400

Rice

White

300

Rice

Brown

400

Wheat

Brown

500

Barley

Yellow

500

Let’s close the file handle as follows:

file.close()

Python and Excel

Let’s see how to read the data from Excel. You need an external library for that. The following code installs the library that we will use in this section:

!pip3 install openpyxl

You can import it as follows:

import openpyxl

You can open an Excel file as follows:

wb = openpyxl.load_workbook('test.xlsx')

print(wb)

print(type(wb))

The output is as follows:

<openpyxl.workbook.workbook.Workbook object at 0x0E87F7D8>

You can retrieve the names of all the sheets as follows:

print(wb.sheetnames)

The output is as follows:

['Sheet1', 'Sheet2', 'Sheet3']

You can select a sheet as follows:

currSheet = wb['Sheet1']

print(currSheet)

print(type(currSheet))

The output is as follows:

Similarly, the following code has the same effect:

currSheet = wb[wb.sheetnames[0]]

print(currSheet)

print(type(currSheet))

You can print the name of the current sheet as follows:

print(currSheet.title)

The output is as follows:

Sheet1

You can print the value of a cell as follows:

var1 = currSheet['A1']

print(var1.value)

The output is as follows:

Food Item

The other way to do the same activity is as follows:

print(currSheet['B1'].value)

You can do this another way as follows:

var2 = currSheet.cell(row=2, column=2)

print(var2.value)

The number of rows and columns can be obtained as follows:

print(currSheet.max_row)

print(currSheet.max_column)

The output is as follows:

Let’s print all the data in the spreadsheet as follows:

for i in range(currSheet.max_row):

print('---Beginning of Row---')

for j in range(currSheet.max_column):

var = currSheet.cell(row=i+1, column=j+1)

print(var.value)

print('---End of Row---')

The output is very long, so I’ve truncated it here. Please run the code to see it for yourself .

Writing and Reading Files with NumPy

Let’s see how to read and write files with NumPy. Let’s create a dataset with NumPy as follows:

import numpy as np

x = np.arange(100)

print(x)

The output is as follows:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

96 97 98 99]

You can save it to a file (in NumPy data format) as follows:

np.save('test.npy', x)

You can load the data from a file into a variable as follows:

data = np.load('test.npy')

print(data)

The output is as follows:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

96 97 98 99]

Reading the Data from a CSV File with NumPy

The CSV file can be read with NumPy too as follows:

import numpy as np

# Reads only numeric data

data = np.loadtxt('data.csv', delimiter=',')

print(data)

The output is as follows:

[[ 0. 1. 18. 2.]

[ 1. 6. 1. 3.]

[ 2. 3. 154. 0.]

[ 4. 978. 3. 6.]

[ 5. 2. 41. 45.]

[ 6. 67. 2. 3.]

[ 7. 5. 67. 2.]]

You can also skip rows and columns as follows:

data = np.loadtxt('data.csv', delimiter=',',

skiprows=3, usecols=[1, 3])

print(data)

The output is as follows:

[[978. 6.]

[ 2. 45.]

[ 67. 3.]

[ 5. 2.]]

Matplotlib CBook

You can read data that is stored in Matplotlib’s CBook format . Matplotlib comes with a few sample files in that format. Let’s see how to read the data:

import matplotlib.cbook as cbook

datafile = cbook.get_sample_data('aapl.npz')

r = np.load(datafile)

print(r.files)

This will print the names of the data files, as shown here:

['price_data']

Let’s read the data from that data file:

print(r['price_data'])

This shows the Apple share price data as follows:

[('1984-09-07', 26.5 , 26.87, 26.25, 26.5 , 2981600, 3.02)

('1984-09-10', 26.5 , 26.62, 25.87, 26.37, 2346400, 3.01)

('1984-09-11', 26.62, 27.37, 26.62, 26.87, 5444000, 3.07) ...

('2008-10-10', 85.7 , 100. , 85. , 96.8 , 79260700, 96.8 )

('2008-10-13', 104.55, 110.53, 101.02, 110.26, 54967000, 110.26)

('2008-10-14', 116.26, 116.4 , 103.14, 104.08, 70749800, 104.08)]

Reading Data from a CSV

As mentioned earlier, a CSV file contains values separated by commas. You can use the versatile function read_csv() in Pandas to read a CSV file on the Web or on the local/networked disk. The following are the contents of a CSV file that we will use in this demonstration:

rank,discipline,phd,service,sex,salary

Prof,B,56,49,Male,186960

Prof,A,12,6,Male,93000

Prof,A,23,20,Male,110515

Prof,A,40,31,Male,131205

Prof,B,20,18,Male,104800

Prof,A,20,20,Male,122400

AssocProf,A,20,17,Male,81285

The first row is the header row. Most CSV files will have a header row, although it is not required. As you can see, the values are separated by commas. This is a common format of CSV files. Depending on the system and application, you can use a variety of separators like a space, a semicolon (;), or a pipe (|). Also, CSV files can use a fixed number of characters for storing data in columns. In this example, as discussed, we are using one of the most common CSV formats for storing data.

Let’s learn how to read data from such files with Pandas. Create a new notebook for this chapter.

Import the Pandas library as follows:

import pandas as pd

Let’s read a CSV file located on the Web as follows:

df1 = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")

print(df1)

You can also read a CSV stored on the local disk as follows:

df2 = pd.read_csv("Salaries.csv")

print(df2)

You can also dump the data of a dataframe to a CSV file at a disk location as follows:

df2.to_csv('output.csv', index=True, header=False)

The code will create a CSV file on the disk in the current directory.

Reading Data from an Excel File

To read data from an Excel file into a Pandas dataframe, you need the support of an external package. Let’s install a package as follows:

!pip3 install xlrd

Now let’s read the Excel file stored on the disk, as follows:

excel_file = 'test.xlsx'

df1 = pd.read_excel(excel_file)

Here (and in the earlier example too), the file is stored in the same directory as the notebook file. If you need to read the file in any other location, you must specify the full path of that file. The previous code, when executed, will load the contents of an Excel file into a Pandas dataframe. You can see the contents using the following line of code:

print(df1)

Figure 15-1 shows the output.

../images/515442_1_En_15_Chapter/515442_1_En_15_Fig1_HTML.jpg — Figure 15-1
The data from an Excel sheet

Reading Data from JSON

You can read the data of a JSON string into a dataframe as follows. Create a JSON string first.

obj = """

{"name": "Ashwin",

"places_lived": ["Nashik", "Hyderabad", "Bangalore"],

"pet": null,

"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},

{"name": "Katie", "age": 38,

"pets": ["Sixes", "Stache", "Cisco"]}]

}

"""

You can print the string as follows:

print(obj)

You can also check the type of the variable (it is a string in JSON format), as shown here:

print(type(obj))

You can convert this JSON-formatted string to a dictionary as follows:

import json

result = json.loads(obj)

print(result)

Let’s check the data type of the newly created variable, as shown here:

print(type(result))

This will produce the following result:

Let’s load the data into a dataframe as follows:

df1 = pd.DataFrame(result['siblings'], columns=['name', 'age'])

print(df1)

The output is as follows :

name age

0 Scott 30

1 Katie 38

You can also read the data from a JSON file as follows:

df2 = pd.read_json('example_2.json')

print(df2)

This is how you can read the JSON data into dataframes.

Reading Data from a Pickle File

In Python programming, Pickle is used in serializing and deserializing Python Objects. You can store a Pandas dataframe to a Pickle file on the disk as follows:

data = [1, 2, 3, 4, 5]

df1 = pd.DataFrame(data)

print(df1)

df1.to_pickle('mypickle')

You can read the data from a Pickle file stored on the disk as follows:

df2 = pd.read_pickle('mypickle')

print(df2)

Reading Data from the Web

Let’s read the data from the Web. For that, you will need a few libraries. You can install them as follows:

!pip3 install lxml html5lib BeautifulSoup4

You can read an HTML file located on the Web as follows:

df1 = pd.read_html('https://www.google.com/')

Let’s get the details of the object and the data as follows:

print(df1)

len(df1)

type(df1)

df1[0].head()

You can also parse this retrieved HTML text and fetch important information from the tags as follows:

from lxml import objectify

from io import StringIO

The following is an HTML tag string and a way to parse it, as shown here:

tag = '<a href="http://www.google.com/">Google</a>'

root = objectify.parse(StringIO(tag)).getroot()

You retrieve the root and the text of this object as follows:

print(root)

root.get('href')

print(root.text)

This will produce the following output:

Google

Interacting with the Web API

Let’s learn to interact with the web API to retrieve and store the data into a Pandas dataframe. Install the necessary library as follows:

!pip3 install requests

Let’s import the library as follows:

import requests

Let’s create a URL string as follows:

url='https://api.github.com/repos/pandas-dev/pandas/issues'

You can fetch the data from the URL with the HTTP GET request issued programmatically as follows:

resp = requests.get(url)

You can check the response code and its datatype as follows:

print(resp)

print(type(resp))

The output is as follows:

The HTTP response code 200 stands for success in retrieving the information. You can retrieve the actual information as follows:

data = resp.json()

print(type(data))

It will be a list, as shown here:

You can convert it into a dataframe as follows:

output = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])

print(output)

Figure 15-2 shows the output.

../images/515442_1_En_15_Chapter/515442_1_En_15_Fig2_HTML.jpg — Figure 15-2
The data from an HTTPS GET response

This is how you can work with data available on the Web.

Reading Data from a Relational Database Table

You can read the data stored in a table in a relational database like MySQL or MariaDB. You can read more about the installation and usage at the following URLs:

https://www.mysql.com/

https://mariadb.org/

You have to install an external library as follows:

!pip3 install pymysql

Then you need to import the library to the notebook as follows:

import pymysql

You can connect to a MySQL or MariaDB database instance as follows:

db = pymysql.connect(host="localhost", user="root",

password="test123", database="world")

Then you can read the output of a SELECT query into a dataframe as follows:

df1 = pd.read_sql('select * from country', db)

print(df1)

This produces the output shown in Figure 15-3.

../images/515442_1_En_15_Chapter/515442_1_En_15_Fig3_HTML.jpg — Figure 15-3
The data from a MySQL/MariaDB table

Reading Data from the Clipboard

You can read the data stored on the clipboard . The clipboard is a temporary and unnamed buffer in the computer’s main memory (RAM) that a few operating systems provide for the short-term storage and transfer of data within and between programs. For example, whenever you copy text data from a file, it is stored on the clipboard of the operating system.

Copy the following data into your computer’s clipboard by selecting it and pressing the Ctrl+C buttons on the keyboard.

A B C

x 1 2 a

y 2 3 b

z 3 4 c

You can load it into a Pandas dataframe using the following code:

df = pd.read_clipboard()

You can also copy data onto the clipboard programmatically as follows:

import numpy as np

df = pd.DataFrame(np.random.randn(5, 3))

df.to_clipboard()

You can see this data either by reading the clipboard programmatically into a dataframe as explained earlier or by pasting it with the Ctrl+V command into a text editor like Notepad (on Windows) or Leafpad or gedit (on Linux).

Summary

In this chapter, you learned how to read data from multiple file formats and how to load that data into Python variables.

In the next chapter, you will study how to visualize Pandas data using Matplotlib.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 15. Data Acquisition

Create new playlist

Sign In

Sign Up

15. Data Acquisition

Plain-Text File Handling

Handling CSV Files with Python

Python and Excel

Writing and Reading Files with NumPy

Reading the Data from a CSV File with NumPy

Matplotlib CBook

Reading Data from a CSV

Reading Data from an Excel File

Reading Data from JSON

Reading Data from a Pickle File

Reading Data from the Web

Interacting with the Web API

Reading Data from a Relational Database Table

Reading Data from the Clipboard

Summary

Table of Contents for
15. Data Acquisition