Reading, Exploring, and Modifying Data - Part I

Even if you are new to programming, opening, modifying, and saving files within programs should be a relatively familiar process. You have likely opened and edited a document with a word processor or entered data into an Excel spreadsheet.

Like a computer program, a dataset can be represented using a text file with a specific syntactical structure. The text in a data file specifies both the information contained in the data and the structure in which that information is placed. In a sense, writing a program to process data files is similar to editing a document or a spreadsheet. First, the content of a file is opened, observed, and modified, and then the result is saved.
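
The open, observe, modify, and save steps can be sketched in a few lines of Python. This is a minimal illustration rather than the workflow developed later in the chapter: the file names, the sample records, and the `age_next_year` variable are all hypothetical, and the example writes its own small JSON input file to a temporary directory so that it can run on its own.

```python
import json
import os
import tempfile

# Hypothetical sample data, used only for illustration.
records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Work in a temporary directory so no existing files are touched.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "people.json")
out_path = os.path.join(workdir, "people_modified.json")

# Create a small input file so the example is self-contained.
with open(in_path, "w") as f:
    json.dump(records, f)

# 1. Open the file and read its contents into Python objects.
with open(in_path, "r") as f:
    data = json.load(f)

# 2. Observe: look at the size and shape of the data.
print(len(data), "entries; first entry:", data[0])

# 3. Modify: add a new variable to every data entry.
for entry in data:
    entry["age_next_year"] = entry["age"] + 1

# 4. Save the result to a new file, leaving the original unchanged.
with open(out_path, "w") as f:
    json.dump(data, f, indent=2)
```

Writing the result to a separate output file is a deliberate choice here: the original data stays intact if a modification step turns out to be wrong.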

Another general strategy to store and retrieve digital data is to use a database. Databases are organized collections of data that allow for efficient storage, retrieval, and modification. I will revisit databases in Chapter 9, Working with Large Datasets.

The major difference between processing a data file in Python and editing a spreadsheet in Excel is that programming tools give you much more fine-tuned control over the way in which the information is processed. In this way, programming gives you the flexibility to build custom tools for specific tasks as you need them.

Approaches to programming range from highly specific to highly expressive. In the context of data wrangling, this means that data processing tasks that are relatively unusual or complicated may require a more customized set of programming instructions (highly specific). Here are just a few examples of tasks that may require a more specific approach:

  • Processing very large amounts of data
  • Processing data in a hierarchical format
  • Processing obscure data formats
  • Restructuring data or converting between formats
  • Extracting information from a body of text or data without a well-defined structure
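
As one illustration of a more specific approach, the sketch below walks through data in a hierarchical format and lists every value together with the path that leads to it. The `iter_leaves` function and the `nested` sample data are hypothetical names introduced only for this example, and the code assumes the hierarchy is made up of Python dictionaries and lists.

```python
def iter_leaves(node, path=()):
    """Yield (path, value) pairs for every non-container value in node."""
    if isinstance(node, dict):
        # Descend into each key of a dictionary.
        for key, value in node.items():
            yield from iter_leaves(value, path + (key,))
    elif isinstance(node, list):
        # Descend into each position of a list.
        for index, value in enumerate(node):
            yield from iter_leaves(value, path + (index,))
    else:
        # Anything else is treated as a leaf value.
        yield path, node


# Hypothetical hierarchical data, used only for illustration.
nested = {
    "region": "north",
    "stations": [
        {"id": 1, "readings": [3, 5]},
        {"id": 2, "readings": []},
    ],
}

leaves = list(iter_leaves(nested))
for path, value in leaves:
    print(path, value)
```

Custom instructions like these take more effort to write than a single call to an existing tool, but they can be adapted to whatever structure the data happens to have.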

By contrast, data processing tasks that are relatively routine or simple may best be approached with an existing set of programming tools that can express a sequence of actions more concisely (highly expressive). Here are some examples of tasks that may benefit from a more expressive approach:

  • Filtering out data entries based on the values that they contain
  • Selecting and extracting certain variables
  • Aggregating the values of particular variables
  • Creating new variables based on existing ones

A data entry is an individual observation in a dataset, also called a record, document, or row. A variable, also called an attribute or column, is a property that is recorded for each data entry.
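
To make the four routine tasks listed above concrete, here is a minimal sketch using only built-in Python features. It is meant to show what each operation does to data entries and variables, not to stand in for the more expressive tools themselves; the `entries` data and every variable name in it are hypothetical.

```python
# Hypothetical dataset: each dictionary is a data entry (row),
# and each key is a variable (column).
entries = [
    {"name": "A", "height_cm": 150, "weight_kg": 50},
    {"name": "B", "height_cm": 180, "weight_kg": 81},
    {"name": "C", "height_cm": 200, "weight_kg": 100},
]

# Filtering: keep only the entries whose values meet a condition.
tall = [e for e in entries if e["height_cm"] >= 180]

# Selecting: extract a single variable from every entry.
names = [e["name"] for e in entries]

# Aggregating: combine the values of one variable into a summary.
total_weight = sum(e["weight_kg"] for e in entries)

# Creating a new variable from an existing one, without
# changing the original entries.
with_height_m = [dict(e, height_m=e["height_cm"] / 100) for e in entries]
```

Each operation fits on a single line, which is the sense in which routine tasks lend themselves to a concise, expressive style.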

Because data challenges come in all shapes and sizes, it helps to have a sense of both ends of the spectrum so that you can choose the best tool for the job. I will start in this chapter with a rather low-level (specific) overview of the steps involved in processing a data file programmatically. This chapter will include the following sections:

  • External resources
  • Logistical overview
  • Introducing a basic data wrangling workflow
  • Introducing the JSON file format
  • Opening and closing a file in Python using file I/O
  • Reading the contents of a file