Python networking libraries for acquiring data

The vast majority of geospatial data sharing is accomplished via the Internet, and Python is well equipped with networking libraries for almost any protocol. Automated data downloads are often an important step in automating a geospatial process. Data is typically retrieved from a website Uniform Resource Locator (URL) or a File Transfer Protocol (FTP) server. And because geospatial datasets often contain multiple files, they are frequently distributed as ZIP archives.

A nice feature of Python is its concept of a file-like object. Most Python libraries that read and write data use a standard set of methods that let you access data from all different types of resources as if you were working with a simple file on disk. The networking modules in the Python standard library follow this convention as well. The benefit of this approach is that you can pass file-like objects to other libraries and methods that recognize the convention, without a lot of setup for data distributed in different ways.
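As a quick illustration of this convention, an in-memory io.BytesIO buffer (with invented contents) can stand in anywhere a file on disk is expected:

```python
import io

# An in-memory buffer that implements the file-like protocol
buf = io.BytesIO(b"point,x,y\n1,-89.69,30.17\n")

header = buf.readline()   # read one line, just like a file on disk
buf.seek(0)               # rewind to the beginning
everything = buf.read()   # read the entire buffer

print(header)
print(everything)
```

Any library that expects "something with a read() method" will accept this buffer just as readily as an open file handle, which is exactly the property the networking modules exploit.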

The Python urllib module

The Python urllib package is designed for simple access to any file with a URL address. In Python 3, the urllib package consists of several modules that handle different parts of managing web requests and responses. These modules implement some of Python's file-like object conventions, starting with the urllib.request.urlopen() method. When you call urlopen(), it prepares a connection to the resource but does not read any data until you ask for it. Sometimes, you just want to grab a file and save it to disk instead of accessing it in memory. That function is available through the urllib.request.urlretrieve() method.

The following example uses the urllib.request.urlretrieve() method to download the zipped shapefile named hancock.zip, which is used in other examples. We define the URL and the local file name as variables. The URL is passed as the first argument, and the second argument is the filename we want to use to save the file on our local machine, which in this case is just hancock.zip:

>>> import urllib.request
>>> import urllib.parse
>>> import urllib.error
>>> url = "https://github.com/GeospatialPython/Learn/raw/master/hancock.zip"
>>> fileName = "hancock.zip"
>>> urllib.request.urlretrieve(url, fileName)
('hancock.zip', <http.client.HTTPMessage object at 0x00CAD378>)

The returned tuple contains the local filename and an http.client.HTTPMessage object holding the response headers, confirming that the file was downloaded to the current directory. The URL and filename could also have been passed to the urlretrieve() method directly as strings. If you specify just the filename, the download saves to the current working directory; you can also specify a fully qualified pathname to save it somewhere else. Finally, you can pass a callback function as a third argument, which receives download status information so that you can build a simple progress indicator or perform some other action.
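As a minimal sketch of such a callback (the function names here are our own, not part of urllib), urlretrieve() calls the hook with the block count, block size, and total file size after each chunk arrives:

```python
import urllib.request

def percent_complete(block_count, block_size, total_size):
    """Return the percent downloaded, or None when the server
    does not report a total size (total_size <= 0)."""
    if total_size <= 0:
        return None
    return min(100, (block_count * block_size * 100) // total_size)

def show_progress(block_count, block_size, total_size):
    # urlretrieve invokes this hook after each block is received
    percent = percent_complete(block_count, block_size, total_size)
    if percent is not None:
        print("Downloaded {}%".format(percent))

# Usage (requires network access):
# urllib.request.urlretrieve(url, "hancock.zip", show_progress)
```

The percentage is capped at 100 because the last block is usually only partially filled, which would otherwise push the computed value past the total.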

The urllib.request.urlopen() method allows you to access an online resource with more precision and control. As mentioned previously, it implements most of the Python file-like object methods with the exception of the seek() method, which allows you to jump to arbitrary locations within a file. You can read a file online one line at a time, read all lines as a list, read a specified number of bytes, or iterate through each line of the file. All of these functions are performed in memory, so you don't have to store the data on disk. This ability is useful for accessing frequently updated data online, which you may want to process without saving to disk.

In the following example, we demonstrate this concept by accessing the United States Geological Survey (USGS) earthquake feed to view all of the earthquakes in the world that have occurred within the last hour. This data is distributed as a Comma-Separated Values (CSV) file, which we can read line by line like a text file. CSV files are similar to spreadsheets and can be opened in a text editor or spreadsheet program. Note that in Python 3, urlopen() returns bytes rather than strings. First, you'll open the URL and read the header with the column names in the file, and then you'll read the first line containing a record of a recent earthquake, as shown in the following lines of code:

>>> url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv"
>>> earthquakes = urllib.request.urlopen(url)
>>> earthquakes.readline()
b'time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place\n'
>>> earthquakes.readline()
b'2013-06-14T14:37:57.000Z,64.8405,-147.6478,13.1,0.6,Ml,6,180,0.09701805,0.2,ak,ak10739050,2013-06-14T14:39:09.442Z,"3km E of Fairbanks, Alaska"\n'

We can also iterate over this file, which is a memory-efficient way to read through large files. If you are running this example in the Python interpreter, as shown next, you will need to press the Enter or Return key twice to execute the loop. This action is necessary because it signals to the interpreter that you are done building the loop. We decode each record from bytes to a string before printing it. In the following example, we abbreviate the output:

>>> for record in earthquakes: print(record.decode())
2013-06-14T14:30:40.000Z,62.0828,-145.2995,22.5,1.6,Ml,8,108,0.08174669,0.86,ak,ak10739046,2013-06-14T14:37:02.318Z,"13km ESE of Glennallen, Alaska"
...
2013-06-14T13:42:46.300Z,38.8162,-122.8148,3.5,0.6,Md,,126,0.00898315,0.07,nc,nc72008115,2013-06-14T13:53:11.592Z,"6km NW of The Geysers, California"

FTP

FTP allows you to browse an online directory and download data using FTP client software. Until around 2004, when geospatial web services became common, FTP was one of the most widespread ways to distribute geospatial data. FTP is less common now, but you occasionally encounter it when you're searching for data. Once again, Python's batteries-included standard library has a capable FTP module called ftplib, whose main class is FTP.

In the following example, we will access an FTP server hosted by the U.S. National Oceanic and Atmospheric Administration (NOAA) to retrieve a text file containing data from the Deep-ocean Assessment and Reporting of Tsunamis (DART) buoy network, which is used to watch for tsunamis around the world. The directory name refers to the Peru earthquake of August 15, 2007, although the buoy itself, as the coordinates in the output show, is stationed in the North Pacific. We'll define the server and the directory path and then access the server. All FTP servers require a username and password. Most public servers accept a user called anonymous with the password anonymous, as this one does. Using Python's ftplib, you can call the login() method without any arguments to log in as the default anonymous user; otherwise, you can pass the username and password as string arguments. Once we're logged in, we'll change to the directory containing the DART data file. To download the file, we open a local file called out and pass its write() method as a callback function to the FTP.retrbinary() method, which downloads the file and writes it to our local file as the data arrives.

Once the file is downloaded, we can close it to save it. Then we'll read the file and look for the line containing the latitude and longitude of the buoy to make sure that the data was downloaded successfully, as shown in the following lines of code:

>>> import ftplib
>>> server = "ftp.ngdc.noaa.gov"
>>> dir = "hazards/DART/20070815_peru"
>>> fileName = "21415_from_20070727_08_55_15_tides.txt"
>>> ftp = ftplib.FTP(server)
>>> ftp.login()
'230 Login successful.'
>>> ftp.cwd(dir)
'250 Directory successfully changed.'
>>> out = open(fileName, "wb")
>>> ftp.retrbinary("RETR " + fileName, out.write)
'226 Transfer complete.'
>>> out.close()
>>> dart = open(fileName)
>>> for line in dart:
...     if "LAT," in line:
...             print(line)
...             break
...
LAT,   LON      50.1663    171.8360

In this example, we opened the local file in binary write mode and used the retrbinary() ftplib method as opposed to retrlines(), which uses ASCII mode. Binary mode works for both ASCII and binary files, so it's always the safer bet. Keep in mind that in Python 3, the mode also determines the type of data you work with: a file opened in binary mode reads and writes bytes objects, while one opened in text mode works with str objects.
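To make that Python 3 distinction concrete, here is a minimal sketch using a throwaway temporary file (the file contents are invented for illustration):

```python
import os
import tempfile

# Create a temporary file containing one line of text
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("LAT,   LON      50.1663    171.8360\n")

with open(path, "rb") as f:   # binary mode yields bytes
    raw = f.read()
with open(path) as f:         # text mode yields str
    text = f.read()

os.remove(path)
print(type(raw), type(text))
```

The same bytes on disk come back as two different Python types depending solely on the mode string, which is why passing a bytes-writing callback such as out.write to retrbinary() requires the file to be opened with "wb".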

If you are just downloading a simple file from an FTP server, many FTP servers have a web interface as well. In that case, you could use urllib to read the file. FTP URLs use the following format to access data:

ftp://username:password@server/directory/file

This format is insecure for password-protected directories because you are transmitting your login information over the Internet. But for anonymous FTP servers, there is no additional security risk. To demonstrate this, the following example accesses the same file that we just saw but by using urllib instead of ftplib:

>>> dart = urllib.request.urlopen("ftp://" + server + "/" + dir + "/" + fileName)
>>> for line in dart:
...     line = str(line, encoding="utf8")
...     if "LAT," in line:
...             print(line)
...             break
...
LAT,   LON      50.1663    171.8360

ZIP and TAR files

Geospatial datasets often consist of multiple files. For this reason, they are often distributed as ZIP or TAR archives. These formats can also compress data, but their ability to bundle multiple files is the primary reason they are used for geospatial data. The TAR format doesn't include a compression algorithm of its own, but it is commonly combined with gzip compression, which Python's tarfile module supports directly. Python has standard modules for reading and writing both ZIP and TAR archives; these modules are called zipfile and tarfile, respectively.

The following example extracts the hancock.shp, hancock.shx, and hancock.dbf files contained in the hancock.zip file we downloaded using urllib for use in the previous examples. This example assumes that the ZIP file is in the current directory:

>>> import zipfile
>>> zip = open("hancock.zip", "rb")
>>> zipShape = zipfile.ZipFile(zip)
>>> shpName, shxName, dbfName = zipShape.namelist()
>>> shpFile = open(shpName, "wb")
>>> shxFile = open(shxName, "wb")
>>> dbfFile = open(dbfName, "wb")
>>> shpFile.write(zipShape.read(shpName))
>>> shxFile.write(zipShape.read(shxName))
>>> dbfFile.write(zipShape.read(dbfName))
>>> shpFile.close()
>>> shxFile.close()
>>> dbfFile.close()

This example is more verbose than necessary for the sake of clarity. We can shorten it and make it more robust by using a for loop over the ZipFile.namelist() method without explicitly assigning the individual files to variables. This approach is more flexible and Pythonic and can be used on ZIP archives with unknown contents, as shown in the following lines of code:

>>> import zipfile
>>> zip = open("hancock.zip", "rb")
>>> zipShape = zipfile.ZipFile(zip)
>>> for fileName in zipShape.namelist():
...     out = open(fileName, "wb")
...     out.write(zipShape.read(fileName))
...     out.close()
>>>  

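Because ZipFile accepts any file-like object, you can exercise the same namelist() and read() pattern without touching the disk at all. The following self-contained sketch (the archive name and contents are invented) builds a small ZIP archive in memory and reads it back:

```python
import io
import zipfile

# Build a small ZIP archive in an in-memory buffer
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("sample.txt", "hello from the archive")

# Rewind the buffer and read every member back out,
# just like the loop over the downloaded hancock.zip
buf.seek(0)
with zipfile.ZipFile(buf) as archive:
    names = archive.namelist()
    contents = archive.read("sample.txt")

print(names, contents)
```

This in-memory pattern is handy for tests and for processing downloaded archives without writing intermediate files, a technique we return to at the end of this section.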
Now that you understand the basics of the zipfile module, let's take the files we just unzipped and create a TAR archive with them. In this example, when we open the TAR archive for writing, we specify the write mode as w:gz for gzipped compression. We also specify the file extension as tar.gz to reflect this mode, as shown in the following lines of code:

>>> import tarfile
>>> tar = tarfile.open("hancock.tar.gz", "w:gz")
>>> tar.add("hancock.shp")
>>> tar.add("hancock.shx")
>>> tar.add("hancock.dbf")
>>> tar.close()

We can extract the files using the simple tarfile.extractall() method. First, we open the file using the tarfile.open() method, and then extract it, as shown in the following lines of code:

>>> tar = tarfile.open("hancock.tar.gz", "r:gz")
>>> tar.extractall()
>>> tar.close()

We'll work through one more example that combines elements from this chapter with the Vector data section of Chapter 2, Geospatial Data. We'll read the bounding box coordinates from the hancock.zip file without ever saving it to disk, using the power of Python's file-like object convention to pass the data around. Then, we'll use Python's struct module to read the bounding box as we did in Chapter 2. In this case, we read the unzipped .shp file into a variable and access the header using slicing, specifying the starting and ending indexes of the data separated by a colon (:). Slicing works here because Python lets you treat bytes objects as sequences. In this example, we also use the io module's BytesIO class to temporarily store the downloaded data in memory in a file-like object that implements all of the file methods, including the seek() method, which is absent from most of Python's networking objects, as shown in the following lines of code:

>>> import urllib.request
>>> import urllib.parse
>>> import urllib.error
>>> import zipfile
>>> import io
>>> import struct
>>> url = "https://github.com/GeospatialPython/Learn/raw/master/hancock.zip"
>>> cloudshape = urllib.request.urlopen(url)
>>> memoryshape = io.BytesIO(cloudshape.read())
>>> zipshape = zipfile.ZipFile(memoryshape)
>>> cloudshp = zipshape.read("hancock.shp")
# Slice the bytes object to access the bounding box
>>> struct.unpack("<dddd", cloudshp[36:68])
(-89.6904544701547, 30.173943486533133, -89.32227546981174, 30.6483914869749)
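The format string "<dddd" tells struct to read four little-endian doubles, which is exactly how a shapefile header stores its bounding box in bytes 36 through 68. A quick round-trip with invented coordinates shows the mechanics without any download:

```python
import struct

# Pack four hypothetical bounding-box values as little-endian doubles,
# mimicking bytes 36-68 of a shapefile header
bbox = (-89.69, 30.17, -89.32, 30.65)
packed = struct.pack("<dddd", *bbox)

# 4 doubles x 8 bytes each = 32 bytes, matching the 36:68 slice
xmin, ymin, xmax, ymax = struct.unpack("<dddd", packed)
print(len(packed), (xmin, ymin, xmax, ymax))
```

Because doubles survive a pack/unpack round trip exactly, the values come back unchanged, and the 32-byte length explains why the slice in the example above spans indexes 36 to 68.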

As you can see from the examples so far, Python's standard library packs a lot of punch. Most of the time, you don't have to download a third-party library just to access a file online.
