Reading, Exploring, and Modifying Data

Even if you are new to programming, opening, modifying, and saving files within programs should be a relatively familiar process. You have likely opened and edited a document with a word processor or entered data into an Excel spreadsheet.

Like a computer program, a dataset can be represented using a text file with a specific syntactical structure. The text in a data file specifies both the information contained in the data and the structure in which that information is placed. In a sense, writing a program to process data files is similar to editing a document or a spreadsheet: first, the content of a file is opened, observed, and modified, and then the result is saved.
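The open, observe, modify, and save sequence can be sketched in a few lines of Python. This is only a minimal illustration: the file names, the sample entries, and the passing threshold of 75 are all made up for the example, and the input file is created in a temporary directory so that the sketch runs on its own.

```python
import json
import os
import tempfile

# Hypothetical sample data, written to a temporary directory so that
# the sketch is self-contained. The file names are assumptions.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "scores.json")
out_path = os.path.join(workdir, "scores_modified.json")

with open(in_path, "w") as f:
    json.dump([{"name": "Ada", "score": 90}, {"name": "Bob", "score": 72}], f)

# 1. Open the file and read its content.
with open(in_path) as f:
    entries = json.load(f)

# 2. Observe: look at what was read.
print(entries)

# 3. Modify: add a new field to every entry (75 is an arbitrary threshold).
for entry in entries:
    entry["passed"] = entry["score"] >= 75

# 4. Save the result to a new file, leaving the original untouched.
with open(out_path, "w") as f:
    json.dump(entries, f, indent=2)
```

Writing the result to a separate output file, rather than overwriting the input, keeps the original data available if a modification step turns out to be wrong.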

Another general strategy to store and retrieve digital data is to use a database. Databases are organized collections of data that allow for efficient storage, retrieval, and modification. I will revisit databases in Chapter 9, Working with Large Datasets.

The major difference between processing a data file in Python and editing a spreadsheet in Excel is that programming tools give you much more fine-tuned control over the way in which the information is processed. In this way, programming gives you the flexibility to build custom tools for specific tasks as you need them.

Approaches to programming range from highly specific to highly expressive. In the context of data wrangling, this means that data processing tasks that are relatively unusual or complicated may require a more customized set of programming instructions (highly specific). Here are just a few examples of tasks that may require a more specific approach:

Processing very large amounts of data
Processing data in a hierarchical format
Processing obscure data formats
Restructuring data or converting between formats
Extracting information from a body of text or data without a well-defined structure
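As an illustration of the more specific end of the spectrum, here is a minimal sketch of one such task: walking a hierarchical record step by step and flattening it into a single level. The function name flatten, the dotted-path naming scheme, and the sample record are assumptions made for this example, not a standard API.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths.

    Note: empty dicts and empty lists contribute no keys in this sketch.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # A leaf value: drop the trailing dot from the accumulated path.
        flat[prefix.rstrip(".")] = value
    return flat


# Hypothetical hierarchical record.
record = {
    "id": 7,
    "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.7}},
    "tags": ["a", "b"],
}

flat_record = flatten(record)
print(flat_record)
```

Every decision here (how to name nested keys, how to treat lists, what to do with empty containers) is spelled out explicitly, which is what makes the approach specific rather than expressive.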

By contrast, data processing tasks that are relatively routine or simple may be best approached with an existing set of programming tools that can express a sequence of actions more concisely (highly expressive). Here are some examples of tasks that may benefit from a more expressive approach:

Filtering out data entries based on the values that they contain
Selecting and extracting certain variables
Aggregating the values of particular variables
Creating new variables based on existing ones
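Each of these four tasks can be written as a single short expression. The following sketch uses plain Python comprehensions on a small, made-up list of entries; the field names and values are assumptions chosen only for illustration.

```python
# Hypothetical dataset: each dict is one data entry.
entries = [
    {"city": "Oslo", "year": 2020, "sales": 120, "cost": 80},
    {"city": "Oslo", "year": 2021, "sales": 150, "cost": 90},
    {"city": "Lima", "year": 2021, "sales": 60, "cost": 70},
]

# Filtering: keep only the entries from 2021.
recent = [e for e in entries if e["year"] == 2021]

# Selecting: extract just the city and sales variables.
selected = [{"city": e["city"], "sales": e["sales"]} for e in entries]

# Aggregating: total the sales variable across all entries.
total_sales = sum(e["sales"] for e in entries)

# Creating a new variable: profit, derived from sales and cost.
with_profit = [{**e, "profit": e["sales"] - e["cost"]} for e in entries]

print(total_sales)
```

None of these expressions modifies the original list; each one builds a new result, so the steps can be combined or reordered freely.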

A data entry is an individual observation in a dataset, also called a record, document, or row. A variable, also called an attribute or column, is a characteristic that is recorded for each data entry.

Because data challenges come in all shapes and sizes, it helps to have a sense of both ends of the spectrum so that you can choose the best tool for the job. I will start in this chapter with a rather low-level (specific) overview of the steps involved in processing a data file programmatically. This chapter will include the following sections:

External resources
Logistical overview
Introducing a basic data wrangling workflow
Introducing the JSON file format
Opening and closing a file in Python using file I/O
Reading the contents of a file
