Vector data

Vector data is by far the most common geospatial format because it is an efficient way to store spatial information and generally requires fewer computer resources to store and process than raster data. The OGC has over 16 formats directly related to vector data. Vector data stores only geometric primitives, including points, lines, and polygons. However, only the points are stored for each type of shape. For example, for a simple straight line, only the two end points need to be stored, along with a shape type that defines them as a line. Software displaying this data reads the shape type and then connects the end points with a line dynamically.
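
The idea that only vertices are stored, and that software connects them on demand, can be sketched in a few lines of Python. This is an illustration only: the dictionary layout, the function names, and the coordinate values are assumptions made for the example and do not correspond to any particular file format.

```python
import math

# A line feature as a program might hold it in memory: a shape type plus
# its vertices. The (longitude, latitude) values are made up.
line = {"type": "line", "points": [(-89.33, 30.30), (-89.36, 30.28)]}


def segments(points):
    """Pair up consecutive vertices, as drawing software would."""
    return list(zip(points[:-1], points[1:]))


def planar_length(points):
    """Sum the straight-line segment lengths in coordinate units."""
    return sum(math.dist(a, b) for a, b in segments(points))


print(segments(line["points"]))
print(planar_length(line["points"]))
```

Note that the length here is in degrees because the coordinates are treated as a flat plane; real distance measurement on the Earth needs more care.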

Geospatial vector data is similar to the concept of vector computer graphics with some notable exceptions. Geospatial vector data contains positive and negative Earth-based coordinates, while vector graphics typically store computer screen coordinates. Geospatial vector data is also usually linked to other information about the object represented by the geometry. This information may be as simple as a timestamp in the case of GPS data or an entire database table for larger geographic information systems. Vector graphics often store styling information describing colors, shadows, and other display-related instructions, while geospatial vector data typically does not. Another important difference is shapes. Geospatial vectors typically only include very primitive geometries based on points, straight lines, and straight-line polygons, while many computer graphic vector formats have concepts of curves and circles. However, geospatial vectors can model these shapes using more points.
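
As a rough sketch of how a curved shape can be modeled with more points, the following Python approximates a circle with a closed ring of straight segments and shows the area converging toward that of a true circle as vertices are added. The function names and the default vertex count are assumptions chosen for the example.

```python
import math


def circle_as_polygon(cx, cy, radius, n_points=32):
    """Approximate a circle with a closed ring of n_points vertices."""
    ring = [
        (cx + radius * math.cos(2 * math.pi * i / n_points),
         cy + radius * math.sin(2 * math.pi * i / n_points))
        for i in range(n_points)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring


def ring_area(ring):
    """Planar area of a closed ring using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring[:-1], ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# More vertices bring the polygon's area closer to pi (a unit circle).
for n in (8, 32, 128):
    approx = ring_area(circle_as_polygon(0.0, 0.0, 1.0, n))
    print(n, approx, math.pi - approx)
```

The trade-off is the one the text describes: a smoother shape costs more stored points.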

Human-readable formats such as Comma-Separated Values (CSV), simple text strings, GeoJSON, and XML-based formats are technically vector data because they store geometry, as opposed to rasters, which represent all the data within the bounding box of the dataset. Until the explosion of XML in the late 1990s, vector data formats were nearly all binary. XML provided a hybrid approach that was both computer and human readable. The compromise is that text formats such as GeoJSON and XML greatly increase the file size over binary formats. These formats are discussed later in this section.
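
To make the size trade-off concrete, here is a small sketch that encodes a single coordinate pair three ways and compares the byte counts. The binary layout (two little-endian 8-byte doubles) is an assumption chosen for illustration and is not the layout of any specific format; the sample coordinates are likewise arbitrary.

```python
import json
import struct

point = (-89.329160, 30.310964)  # sample longitude, latitude

# The same coordinate pair as GeoJSON text, as a CSV row, and as raw binary.
as_geojson = json.dumps({"type": "Point", "coordinates": list(point)}).encode("utf-8")
as_csv = "{},{}\n".format(*point).encode("utf-8")
as_binary = struct.pack("<2d", *point)  # two little-endian doubles

for label, payload in (("GeoJSON", as_geojson), ("CSV", as_csv), ("binary", as_binary)):
    print(label, len(payload), "bytes")
```

The binary encoding is always 16 bytes per point regardless of how many digits the coordinates have, while the text encodings grow with the precision written out and with the surrounding syntax.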

The number of vector formats to choose from is staggering. The open source vector library, OGR (http://www.gdal.org/ogr_formats.html), lists over 86 supported vector formats. Its commercial counterpart, Safe Software's Feature Manipulation Engine (FME), lists over 188 supported vector formats (http://www.safe.com/fme/format-search/#filters%5B%5D=VECTOR). These lists include a few vector graphics formats as well as human readable geospatial formats. There are still dozens of formats out there to at least be aware of, in case you come across them.

Shapefiles

The most ubiquitous geospatial format is the Esri shapefile. Geospatial software company Esri released the shapefile format specification as an open format in 1998 (http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf). Esri developed it as a format for their ArcView software, designed as a lower-end GIS option to complement their high-end professional package, ArcInfo, formerly called ARC/INFO. However, the open specification, efficiency, and simplicity of the format turned it into an unofficial GIS standard, still extremely popular over 15 years later. Virtually every piece of software labeled as geospatial software supports shapefiles because the format is so common. For this reason, you can almost get by as an analyst by being intimately familiar with shapefiles and mostly ignoring other formats. You can convert almost any other format to shapefiles through the source format's native software or a third-party converter, such as the OGR library, for which there is a Python module. Other Python modules you will encounter alongside shapefiles are Fiona, which is based on OGR and reads and writes shapefiles, and Shapely, which is based on the GEOS library and works with the geometry itself rather than with files.
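
To give a sense of how simple the published format is, the following sketch decodes the fixed 100-byte header that begins every .shp file using only Python's struct module. The byte offsets follow the Esri specification; the ShpHeader tuple, the function name, and the synthetic test header are assumptions introduced for this example, and the Z and M ranges at the end of the header are skipped.

```python
import struct
from collections import namedtuple

ShpHeader = namedtuple("ShpHeader", "file_code length_bytes version shape_type bbox")


def parse_shp_header(data):
    """Decode the fixed 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # File code and file length are big-endian; bytes 4-23 are unused.
    file_code, = struct.unpack(">i", data[0:4])
    length_words, = struct.unpack(">i", data[24:28])
    # Version, shape type, and the bounding box are little-endian.
    version, shape_type = struct.unpack("<2i", data[28:36])
    bbox = struct.unpack("<4d", data[36:68])  # xmin, ymin, xmax, ymax
    if file_code != 9994:
        raise ValueError("not a shapefile: unexpected file code")
    # The length is stored in 16-bit words, so double it to get bytes.
    return ShpHeader(file_code, length_words * 2, version, shape_type, bbox)


# Build a synthetic header so the parser can be tried without a real file.
fake = (struct.pack(">i", 9994) + bytes(20) + struct.pack(">i", 50)
        + struct.pack("<2i", 1000, 1)
        + struct.pack("<4d", -90.0, 30.0, -89.0, 31.0)
        + bytes(32))
print(parse_shp_header(fake))
```

With a real shapefile, the same function could be fed the first 100 bytes read from the file opened in binary mode. For anything beyond inspection, a maintained library is the safer choice.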

One of the most striking features of a shapefile is that the format consists of multiple files. At a minimum, there are three, and there can even be as many as 15 different files! The following table describes the file formats. The .shp, .shx, and .dbf files are required for a valid shapefile.

| Shapefile supporting file extension | Supporting file purpose | Notes |
|---|---|---|
| .shp | It is the shapefile. It contains the geometry. | Required file. Some software needing only geometry will accept the .shp files without the .shx or .dbf file. |
| .shx | It is the shape index file. It is a fixed-sized record index referencing geometry for faster access. | Required file. This file is meaningless without the .shp file. |
| .dbf | It is the database file. It contains the geometry attributes. | Required file. Some software will access this format without the .shp file present as the specification predates shapefiles. Based on the very old FoxPro and Dbase formats. An open specification exists called Xbase. The .dbf files can be opened by most types of spreadsheet software. |
| .sbn | It is the spatial bin file, the shapefile spatial index. | Contains bounding boxes of features mapped to a 256-by-256 integer grid. Frequently seen. |
| .sbx | A fixed-sized record index for the .sbn file. | A traditional ordered record index of a spatial index. Frequently seen. |
| .prj | Map projection information stored in well-known text format. | Very common file and required for on-the-fly projection by the GIS software. This same format can also accompany raster data. |
| .fbn | A spatial index of read only features. | Very rarely seen. |
| .fbx | A fixed-sized record index of the .fbn spatial index. | Very rarely seen. |
| .ixs | A geocoding index. | Common in geocoding applications including driving-direction type applications. |
| .mxs | Another type of geocoding index. | Less common than the .ixs format. |
| .ain | Attribute index. | Mostly legacy format and rarely used in modern software. |
| .aih | Attribute index. | Accompanies the .ain files. |
| .qix | Quadtree index. | A spatial index format created by the open source community because the Esri .sbn and .sbx files were undocumented until recently. |
| .atx | Attribute index. | A more recent Esri software-specific attribute index to speed up attribute queries. |
| .shp.xml | Metadata. | Geospatial metadata .xml container. Can be any of the multiple XML standards including FGDC and ISO. |
| .cpg | Code page file for .dbf. | Used for the internationalization of the .dbf files. |

You will probably never encounter all of these formats at once. However, any shapefile that you use will have multiple files. You will commonly see .shp, .shx, .dbf, .prj, .sbn, .sbx, and, occasionally, .shp.xml files. If you want to rename a shapefile, you must rename all of the associated files with the same name; however, in Esri software and other GIS packages, these datasets will appear as a single file.
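
Renaming every companion file by hand is error-prone, so a short script can help. The following is a minimal sketch using Python's pathlib; the function name is hypothetical, and it assumes that no unrelated file in the directory begins with the same base name followed by a dot and that the base name contains no glob wildcard characters.

```python
from pathlib import Path


def rename_shapefile(directory, old_name, new_name):
    """Rename every file that shares the shapefile's base name.

    Matches old_name.shp, old_name.dbf, old_name.shp.xml, and so on,
    and returns the list of new paths.
    """
    directory = Path(directory)
    renamed = []
    for path in sorted(directory.glob(old_name + ".*")):
        # Slice off the base name so multi-part suffixes such as
        # ".shp.xml" are carried over intact.
        suffix = path.name[len(old_name):]
        target = directory / (new_name + suffix)
        if target.exists():
            raise FileExistsError(target)
        path.rename(target)
        renamed.append(target)
    return renamed
```

For example, calling rename_shapefile("data", "roads", "streets") would turn roads.shp, roads.shx, roads.dbf, and roads.shp.xml in a hypothetical data directory into their streets counterparts while leaving other datasets alone.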

Another important feature of shapefiles is that the records are not numbered. Records include the geometry, the .shx index record, and the .dbf record. These records are stored in a fixed order. When you examine shapefile records using software, they appear to be numbered. People are often confused when they delete a shapefile record, save the file, and reopen it; the number of the record deleted still appears. The reason is that the shapefile records are numbered dynamically on loading but not saved. So, if you delete record number 23 and save the shapefile, record number 24 will become 23 the next time you read the shapefile. Many people expect to open the shapefile and see the records jump from 22 to 24. The only way to track shapefile records in this way is to create a new attribute called ID or similar in the .dbf file and assign each record a permanent, unique identifier.
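
The difference between a position-based record number and a stored identifier can be shown with a plain Python list standing in for the loaded records. The record contents and the ID field name below are assumptions for the example.

```python
# Records as a program might hold them after loading; the list position
# plays the role of the record number assigned on the fly.
records = [{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]


def add_permanent_ids(recs, field="ID"):
    """Store a unique identifier in each record as an attribute."""
    for i, rec in enumerate(recs, start=1):
        rec[field] = i
    return recs


add_permanent_ids(records)
del records[1]  # delete the second record (the one with ID 2)

# Positions are renumbered 1, 2, 3, but the stored IDs keep the gap.
for position, rec in enumerate(records, start=1):
    print(position, rec["ID"], rec["name"])
```

After the deletion, the positions run 1, 2, 3 while the stored IDs read 1, 3, 4, which is the behavior many people expect from record numbers but only get from an explicit attribute.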

Just like renaming shapefiles, care must be taken when editing shapefiles. It's best to use software that treats the shapefiles as a single dataset. If you edit any of the files individually and add or delete a record without editing the accompanying files, the shapefile will be seen as corrupt by most geospatial software.
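
One quick way to spot files that have drifted out of step is to compare record counts. The sketch below relies on two facts about the formats: the .shx file is a 100-byte header followed by one 8-byte entry per record, and the dBASE header stores its record count as a little-endian unsigned integer starting at byte 4. The function names are hypothetical, and the check is deliberately shallow: matching counts do not prove a dataset is healthy.

```python
import os
import struct


def shx_record_count(shx_path):
    """Count index entries: (file size - 100-byte header) / 8 bytes each."""
    size = os.path.getsize(shx_path)
    return (size - 100) // 8


def dbf_record_count(dbf_path):
    """Read the record count stored at bytes 4-7 of the dBASE header."""
    with open(dbf_path, "rb") as f:
        header = f.read(8)
    return struct.unpack("<I", header[4:8])[0]


def counts_match(base_path):
    """Return True when base_path.shx and base_path.dbf agree."""
    return shx_record_count(base_path + ".shx") == dbf_record_count(base_path + ".dbf")
```

Here base_path is the path to the dataset without an extension, for example a hypothetical "data/roads".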

CAD files

CAD stands for computer-aided design. The primary formats for CAD data were created by Autodesk for their leading AutoCAD package. The two formats commonly seen are Drawing Exchange Format (DXF) and AutoCAD native Drawing (DWG) format. DWG was traditionally a closed format but it has become more open.

CAD software is used for everything that is engineering-related, from designing bicycles to cars, parks, and city sewer systems. As a geospatial analyst, you don't have to worry about mechanical engineering designs; however, civil engineering designs become quite an issue. Most engineering firms use geospatial analysis to a very limited degree but store nearly all of their data in the CAD format. The DWG and DXF formats can represent objects using features not found in geospatial software or weakly supported by geospatial systems. Some examples of these features include the following:

  • Curves
  • Surfaces (for objects that are different from geospatial elevation surfaces)
  • 3D solids
  • Text (rendered as an object)
  • Text styling
  • Viewport configuration

These CAD and engineering-specific features make it difficult to cleanly convert CAD data into geospatial formats. If you encounter CAD data, the easiest option is to ask the data provider if they have shapefiles or some other geospatial-centric format.

Tag-based and markup-based formats

Tag-based markup formats are typically XML formats. This category also includes other structured text formats, such as the well-known text (WKT) format, which is used for projection information files as well as different types of data exchange. XML formats include Keyhole Markup Language (KML), the OpenStreetMap (OSM) format, and the GPS Exchange Format (GPX), which has become a popular exchange format for GPS data. The Open Geospatial Consortium's Geography Markup Language (GML) standard is one of the oldest and most widely used XML-based geographic formats. It is also the basis for the OGC Web Feature Service (WFS) standard for web applications. However, GML has been largely superseded by KML and the GeoJSON format.

XML formats often contain more than just geometry. They also contain attributes and rendering instructions such as color, styling, and symbology. Google's KML format has become a fully supported OGC standard. The following is a sample of KML showing a simple place mark:

<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
    <Point>
      <coordinates>-89.329160,30.310964</coordinates>
    </Point>
  </Placemark>
</kml>

The XML format is attractive to geospatial analysts for the following several reasons:

  • It is a human readable format
  • It can be edited in a text editor
  • It is well-supported by programming languages (especially Python!)
  • It is, by definition, easily extensible
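
As an illustration of that language support, the following sketch reads a KML place mark with the ElementTree module from Python's standard library. The read_placemarks function is a hypothetical helper written for this example, and it only handles simple point place marks.

```python
import xml.etree.ElementTree as ET

KML = b"""<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
    <Point>
      <coordinates>-89.329160,30.310964</coordinates>
    </Point>
  </Placemark>
</kml>"""

# KML elements live in a namespace, so lookups need a prefix mapping.
NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_placemarks(kml_bytes):
    """Return (name, longitude, latitude) for each point place mark."""
    root = ET.fromstring(kml_bytes)
    results = []
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = pm.findtext("kml:name", default="", namespaces=NS)
        coords = pm.findtext("kml:Point/kml:coordinates", namespaces=NS)
        if coords is None:
            continue  # skip place marks without a point geometry
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        results.append((name, lon, lat))
    return results


print(read_placemarks(KML))
```

KML lists longitude before latitude, which is why the values are unpacked in that order.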

XML is not perfect though. It is an inefficient storage mechanism for very large datasets and can quickly become cumbersome to edit. Errors in datasets are common, and most parsers do not handle errors robustly. Despite the downsides, XML is widely used in geospatial analysis.

Scalable Vector Graphics (SVG) is a widely supported XML format for computer graphics. It is supported well by browsers and is often used for geospatial rendering. However, SVG was not designed as a geographic format.
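
Because SVG uses screen coordinates with the y axis pointing down, map coordinates have to be scaled and flipped before they can be drawn. The following is a minimal sketch of that step; the function name and the default canvas size are assumptions, and the scaling stretches the data to fill the canvas rather than preserving its aspect ratio.

```python
def to_svg_polyline(points, width=200, height=100):
    """Scale (x, y) map coordinates into an SVG <polyline> element."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid dividing by zero
    span_y = (max(ys) - min_y) or 1.0
    pixels = []
    for x, y in points:
        px = (x - min_x) / span_x * width
        py = height - (y - min_y) / span_y * height  # flip the y axis
        pixels.append("{:.1f},{:.1f}".format(px, py))
    return '<polyline fill="none" stroke="black" points="{}"/>'.format(" ".join(pixels))


print(to_svg_polyline([(-89.36, 30.28), (-89.33, 30.30)]))
```

The returned string would still need to be wrapped in an svg element before a browser could display it.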

The WKT format is also an older OGC standard. The most common use for it is to define projection information usually stored in the .prj projection files along with a shapefile or raster. The WKT string for the WGS84 coordinate system is as follows:

GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.01745329251994328,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]

The parameters defining a projection can be quite long. The European Petroleum Survey Group (EPSG) created a numerical coding system to reference projections. These codes, such as EPSG:4326, are used as shorthand for strings like the preceding code. There are also short names for commonly used projections such as Mercator, which can be used in different software packages to reference a projection. More information on these reference systems can be found at the spatial reference website at http://spatialreference.org/ref/.
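
When a WKT string carries AUTHORITY entries, the shorthand code can often be pulled out with a regular expression. The sketch below is a hypothetical helper, not a WKT parser: it only looks for an EPSG authority sitting directly before the final closing bracket, and it returns None for strings that lack one.

```python
import re

WGS84_WKT = (
    'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,'
    'AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],'
    'PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],'
    'UNIT["degree",0.01745329251994328,AUTHORITY["EPSG","9122"]],'
    'AUTHORITY["EPSG","4326"]]'
)

# The outermost object's AUTHORITY is the last item before the final "]".
_OUTER_EPSG = re.compile(r'AUTHORITY\[\s*"EPSG"\s*,\s*"(\d+)"\s*\]\s*\]\s*$')


def epsg_code(wkt):
    """Return the EPSG code of the outermost WKT object, or None."""
    match = _OUTER_EPSG.search(wkt)
    return int(match.group(1)) if match else None


print(epsg_code(WGS84_WKT))
```

The nested codes in the string, such as 7030 for the spheroid, identify the components; only the last one identifies the coordinate system as a whole. Many .prj files omit AUTHORITY entries altogether, in which case a lookup library is needed instead.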

GeoJSON

GeoJSON is a relatively new and brilliant text format based on the JavaScript Object Notation (JSON) format, which has been a commonly used data exchange format for years. Despite its short history, GeoJSON can be found embedded in all major geospatial software systems and most websites that distribute data because JavaScript is the language of the dynamic web and GeoJSON can be directly fed into JavaScript.

GeoJSON is a completely backwards-compatible extension for the popular JSON format. The structure of JSON is very similar and in some cases identical to existing data structures of common programming languages. JSON is almost identical to Python's dictionary and list data types. Due to this similarity, parsing JSON in a script is simple to do from scratch but there are also many libraries to make it even easier. Python contains a built-in library aptly named json.

GeoJSON provides you with a standard way to define geometry, attributes, bounding boxes, and projection information. GeoJSON has all of the advantages of XML including human readable syntax, excellent software support, and wide use in the industry. It also surpasses XML. GeoJSON is far more compact than XML largely because it uses simple symbols to define objects rather than opening and closing text-laden tags. The compactness also helps with the readability and manageability of larger datasets. However, it is still inferior to binary formats from a data volume standpoint. The following is a sample of the GeoJSON syntax, defining a geometry collection with both a point and line:

{ "type": "GeometryCollection",
  "geometries": [
    { "type": "Point",
      "coordinates": [-89.33, 30.0]
    },
    { "type": "LineString",
      "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
    }
  ]
}

The preceding code is valid GeoJSON, but it is also a valid Python data structure. You can copy the preceding code sample directly to the Python interpreter as a variable definition and it will evaluate without error as follows:

>>> gc = { "type": "GeometryCollection",
...   "geometries": [
...     { "type": "Point",
...       "coordinates": [-89.33, 30.0]
...     },
...     { "type": "LineString",
...       "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
...     }
...   ]
... }
>>> gc
{'type': 'GeometryCollection', 'geometries': [{'type': 'Point', 'coordinates': [-89.33, 30.0]}, {'type': 'LineString', 'coordinates': [[-89.33, 30.3], [-89.36, 30.28]]}]}
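
In practice, GeoJSON usually arrives as text from a file or a web service rather than being pasted into the interpreter, and the built-in json module handles the conversion in both directions. The following sketch parses a geometry collection from a string and writes it back out; the variable names are arbitrary.

```python
import json

text = """
{ "type": "GeometryCollection",
  "geometries": [
    { "type": "Point",
      "coordinates": [-89.33, 30.0]
    },
    { "type": "LineString",
      "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
    }
  ]
}
"""

# Parse the text into ordinary Python dictionaries and lists.
collection = json.loads(text)
for geometry in collection["geometries"]:
    print(geometry["type"], geometry["coordinates"])

# Serialize back to a compact GeoJSON string.
compact = json.dumps(collection, separators=(",", ":"))
print(len(text), "characters in,", len(compact), "characters out")
```

Using json.loads is also safer than evaluating the text as Python, because JSON values such as true, false, and null are not valid Python literals and are translated to True, False, and None by the parser.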

Due to its compact size, Internet-friendly syntax by virtue of being based on JavaScript's object notation, and support from major programming languages, GeoJSON is a key component of leading REST geospatial web APIs, which will be covered later in this chapter. It currently offers the best compromise among the computer resource efficiency of binary formats, the human-readability of text formats, and programmatic utility.
