The previous subsetting example is one way to select data. There are many other ways to subset data for further analysis. In this section, we'll examine some of them.
We briefly discussed the point-in-polygon formula in Chapter 1, Learning Geospatial Analysis with Python, as a common type of geospatial operation. It is one of the most useful formulas out there, and it is relatively straightforward. The following function performs this check using the ray casting method. This method draws a line from the test point through the polygon and counts the number of times the line crosses the polygon boundary. If the count is even, the point is outside the polygon. If it is odd, the point is inside. This particular implementation also checks whether the point is a vertex of the polygon or lies on one of its horizontal edges, as shown here:
def point_in_poly(x,y,poly):
    # check if point is a vertex
    if (x,y) in poly: return True
    # check if point is on a boundary
    for i in range(len(poly)):
        p1 = None
        p2 = None
        if i==0:
            p1 = poly[0]
            p2 = poly[1]
        else:
            p1 = poly[i-1]
            p2 = poly[i]
        if p1[1] == p2[1] and p1[1] == y and x > min(p1[0], p2[0]) and x < max(p1[0], p2[0]):
            return True
    n = len(poly)
    inside = False
    p1x,p1y = poly[0]
    for i in range(n+1):
        p2x,p2y = poly[i % n]
        if y > min(p1y,p2y):
            if y <= max(p1y,p2y):
                if x <= max(p1x,p2x):
                    if p1y != p2y:
                        xints = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
                    if p1x == p2x or x <= xints:
                        inside = not inside
        p1x,p1y = p2x,p2y
    if inside: return True
    return False
Now, let's use the point_in_poly() function to test a point in Chile:
>>> # Test a point for inclusion
>>> myPolygon = [(-70.593016,-33.416032), (-70.589604,-33.415370),
... (-70.589046,-33.417340), (-70.592351,-33.417949),
... (-70.593016,-33.416032)]
>>> # Point to test
>>> lon = -70.592000
>>> lat = -33.416000
>>> print(point_in_poly(lon, lat, myPolygon))
True
The point is inside. Let's also verify that a point on the boundary is detected, using the first vertex of the polygon:
>>> # test an edge point
>>> lon = -70.593016
>>> lat = -33.416032
>>> print(point_in_poly(lon, lat, myPolygon))
True
You'll find new uses for this function all the time. It's definitely one to keep in your toolbox.
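The boundary test inside point_in_poly() compares y values for equality, so it only catches vertices and horizontal edges. If you also need to detect points on sloped or vertical edges, one possible approach is a collinearity test based on the cross product. The following is a minimal sketch, not part of the original function: the names point_on_segment and point_on_boundary and the tol parameter are hypothetical, and the code assumes the polygon is a closed ring whose last vertex repeats the first, as myPolygon does:

```python
def point_on_segment(x, y, p1, p2, tol=1e-12):
    """Return True if (x, y) lies on the segment p1-p2 (hypothetical helper)."""
    (x1, y1), (x2, y2) = p1, p2
    # The cross product is zero when the three points are collinear
    cross = (x - x1) * (y2 - y1) - (y - y1) * (x2 - x1)
    if abs(cross) > tol:
        return False
    # Collinear: make sure the point falls within the segment's extent
    return (min(x1, x2) - tol <= x <= max(x1, x2) + tol and
            min(y1, y2) - tol <= y <= max(y1, y2) + tol)


def point_on_boundary(x, y, poly):
    """Check every edge of a closed ring, including sloped and vertical ones."""
    return any(point_on_segment(x, y, poly[i], poly[i + 1])
               for i in range(len(poly) - 1))


# A simple closed square used only for illustration
square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
on_vertical_edge = point_on_boundary(4, 2, square)
in_interior = point_on_boundary(2, 2, square)
```

The tolerance is an assumption that you would tune to the scale of your coordinates, since exact equality rarely holds for computed floating-point values.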
We'll go through one more example using a simple bounding box to isolate a complex set of features and save it in a new shapefile. In this example, we'll subset the roads on the island of Puerto Rico from the mainland U.S. Major Roads shapefile. You can download the shapefile from the following link:
https://github.com/GeospatialPython/Learn/raw/master/roads.zip
Floating-point coordinate comparisons can be expensive, but we are using a box and not an irregular polygon, and so this code is efficient enough for most operations:
>>> import shapefile
>>> r = shapefile.Reader("roadtrl020")
>>> w = shapefile.Writer(r.shapeType)
>>> w.fields = list(r.fields)
>>> xmin = -67.5
>>> xmax = -65.0
>>> ymin = 17.8
>>> ymax = 18.6
>>> for road in r.iterShapeRecords():
...     geom = road.shape
...     rec = road.record
...     sxmin, symin, sxmax, symax = geom.bbox
...     if sxmin < xmin: continue
...     elif sxmax > xmax: continue
...     elif symin < ymin: continue
...     elif symax > ymax: continue
...     w._shapes.append(geom)
...     w.records.append(rec)
>>> w.save("Puerto_Rico_Roads")
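The chain of comparisons in the loop keeps a road only when its bounding box falls entirely inside the selection box. As a sketch of that logic in isolation, the following helper expresses the same test as a function, alongside a looser variant that also accepts features that merely overlap the box. The function names and the sample road boxes are hypothetical. The (xmin, ymin, xmax, ymax) ordering follows the way geom.bbox is unpacked in the preceding code:

```python
def bbox_within(inner, outer):
    """True if bbox `inner` lies entirely inside bbox `outer`.

    Both boxes are (xmin, ymin, xmax, ymax) sequences.
    """
    ixmin, iymin, ixmax, iymax = inner
    oxmin, oymin, oxmax, oymax = outer
    return (ixmin >= oxmin and iymin >= oymin and
            ixmax <= oxmax and iymax <= oymax)


def bbox_intersects(a, b):
    """True if the two boxes overlap or touch at all."""
    axmin, aymin, axmax, aymax = a
    bxmin, bymin, bxmax, bymax = b
    return not (axmax < bxmin or axmin > bxmax or
                aymax < bymin or aymin > bymax)


# Selection box from the example above
puerto_rico = (-67.5, 17.8, -65.0, 18.6)

# Hypothetical road bounding boxes
inside_road = (-66.2, 18.1, -66.0, 18.3)
straddling_road = (-67.6, 18.0, -67.4, 18.2)
mainland_road = (-80.0, 25.0, -79.0, 26.0)
```

Here, bbox_within() is True only for inside_road, while bbox_intersects() is also True for straddling_road. Which test you want depends on whether features that cross the edge of the box should be kept.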
We've now seen two ways to subset a larger dataset into a smaller one based on spatial relationships. Now, let's examine a quick way to subset vector data using the attribute table. In this example, we'll use a polygon shapefile that has densely populated urban areas within Mississippi. You can download this zipped shapefile, which is available at the following link:
This script is really quite simple. It creates the Reader and Writer objects and copies the dbf fields; then it loops through the records for matching attributes and adds them to Writer. We'll select urban areas with a population of less than 5,000, as shown here:
>>> import shapefile
>>> # Create a reader instance
>>> r = shapefile.Reader("MS_UrbanAnC10")
>>> # Create a writer instance
>>> w = shapefile.Writer(r.shapeType)
>>> # Copy the fields to the writer
>>> w.fields = list(r.fields)
>>> # Grab the geometry and records from all features
>>> # with the correct population
>>> selection = []
>>> for rec in enumerate(r.records()):
...     if rec[1][14] < 5000:
...         selection.append(rec)
>>> # Add the geometry and records to the writer
>>> for rec in selection:
...     w._shapes.append(r.shape(rec[0]))
...     w.records.append(rec[1])
>>> # Save the new shapefile
>>> w.save("MS_Urban_Subset")
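The script refers to the population column by its position, 14, which is easy to get wrong if the table layout changes. One possible alternative is to look the position up by field name. The following is a minimal sketch that runs on plain lists rather than a real shapefile: the field_index() helper, the NAME10 field, and the sample records are hypothetical. The field list imitates the layout of the Reader fields attribute, whose first entry is a DeletionFlag marker that the records themselves do not contain:

```python
def field_index(fields, name):
    """Return the position of `name` within each record (hypothetical helper).

    `fields` imitates pyshp's Reader.fields, whose first entry is the
    DeletionFlag marker; that entry has no counterpart in the records.
    """
    names = [f[0] for f in fields if f[0] != "DeletionFlag"]
    return names.index(name)


# Hypothetical field descriptors and records, for illustration only
fields = [("DeletionFlag", "C", 1, 0),
          ["NAME10", "C", 100, 0],
          ["POP", "N", 10, 0]]
records = [["Town A", 4200], ["Town B", 61000], ["Town C", 950]]

pop = field_index(fields, "POP")
# Keep (index, record) pairs, as the selection list in the script does
selection = [(i, rec) for i, rec in enumerate(records) if rec[pop] < 5000]
```

With a real reader, you would pass r.fields and r.records() in place of the sample lists.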
Attribute selections are typically fast. Spatial selections are computationally expensive because of floating-point calculations. Whenever possible, use attribute selection to subset the data first. The following figure shows the starting shapefile containing all urban areas on the left with a state boundary, and the urban areas with fewer than 5,000 people on the right after the previous attribute selection:
Let's see what that same example looks like using fiona, which takes advantage of the OGR library. We'll use nested with statements to reduce the amount of code needed to properly open and close the files, as shown here:
>>> import fiona
>>> with fiona.open("MS_UrbanAnC10.shp") as sf:
...     filtered = filter(lambda f: f['properties']['POP'] < 5000, sf)
...     # Shapefile file format driver
...     drv = sf.driver
...     # Coordinate Reference System
...     crs = sf.crs
...     # Dbf schema
...     schm = sf.schema
...     subset = "MS_Urban_Fiona_Subset.shp"
...     with fiona.open(subset, "w",
...                     driver=drv,
...                     crs=crs,
...                     schema=schm) as w:
...         for rec in filtered:
...             w.write(rec)
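One detail worth noting in the preceding listing is that filter() is lazy in Python 3: no record is read until the write loop consumes it, which is why the loop sits inside the outer with block while the source file is still open. The following sketch illustrates this behavior with plain dictionaries shaped like GeoJSON-style features instead of a real file. The source() generator, the small_places() helper, and the sample features are all hypothetical:

```python
# Hypothetical GeoJSON-like features standing in for records read from a file
features = [
    {"type": "Feature", "properties": {"NAME": "Town A", "POP": 4200}, "geometry": None},
    {"type": "Feature", "properties": {"NAME": "Town B", "POP": 61000}, "geometry": None},
    {"type": "Feature", "properties": {"NAME": "Town C", "POP": 950}, "geometry": None},
]

reads = []  # Log of the features pulled from the source


def source():
    """Yield features one at a time, recording each read."""
    for f in features:
        reads.append(f["properties"]["NAME"])
        yield f


def small_places(feats, limit=5000):
    """Lazily select features whose POP property is below `limit`."""
    return filter(lambda f: f["properties"]["POP"] < limit, feats)


filtered = small_places(source())
reads_before = list(reads)  # Still empty: nothing has been read yet

# Consuming the iterator is what triggers the reads
names = [f["properties"]["NAME"] for f in filtered]
```

If you need the selection after the source has been closed, one option is to wrap the filter() call in list() so that the matching records are read immediately.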