Serializing and Saving - JSON, YAML, Pickle, CSV, and XML

To make a Python object persistent, we must convert it into bytes and write the bytes to a file. We'll call this transformation serialization; it is also called marshaling, deflating, or encoding. We'll look at several ways to serialize a Python object to a stream of bytes. It's important to note that we're focused on representing the state of an object, separate from the full definition of the class and its methods and superclasses.
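
As a minimal sketch of the idea that we serialize an object's state rather than its class definition, consider the following. The Card class and its attributes are hypothetical, invented only for this illustration; JSON is used here simply because it's in the standard library.

```python
import json


class Card:
    """Hypothetical example class; not taken from the text."""

    def __init__(self, rank, suit):
        self.rank = rank
        self.suit = suit

    def is_ace(self):
        return self.rank == "A"


card = Card("A", "spades")

# vars() returns the instance's attribute dictionary: the state only,
# with no methods and no superclass information.
state = vars(card)

# Encode the state as text, then as bytes suitable for writing to a file.
payload = json.dumps(state).encode("utf-8")

# Recovering a usable object requires the class definition to be
# available separately; the bytes carry only the attribute values.
restored = Card(**json.loads(payload.decode("utf-8")))
```

The bytes in payload contain the rank and suit values and nothing about is_ace(); the method comes back only because the Card class is already defined in the program that loads the data.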

A serialization scheme includes a physical data format. Each format offers some advantages and disadvantages. There's no best format to represent the state of objects. It helps to distinguish the format from the logical data layout, which may be a simple reordering or a change in the use of whitespace; such layout changes don't change the value of the object, but they do change the sequence of bytes in an irrelevant way. For example, the CSV physical format can have a variety of logical layouts and still represent the same essential data. If we provide unique column titles, the order of the columns doesn't matter.
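
The CSV point can be sketched with the standard csv module. The column names and sample rows below are made up for illustration; the two strings differ in byte sequence but, read through csv.DictReader, they yield the same data.

```python
import csv
import io

# Two logical layouts of the same data: only the column order differs.
# The column titles and values are hypothetical sample data.
layout_a = "name,qty\nwidget,3\ngadget,5\n"
layout_b = "qty,name\n3,widget\n5,gadget\n"


def load_rows(text):
    """Read CSV text into a list of dictionaries keyed by column title."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows_a = load_rows(layout_a)
rows_b = load_rows(layout_b)

# Dictionary comparison ignores key order, so the two layouts match.
same_data = rows_a == rows_b
```

Because each row is keyed by its column title, the reader doesn't depend on column position. Note that the csv module returns every value as a string; converting "3" to an integer is a separate step.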

Some serialized representations are biased toward representing a single Python object, while others can save collections of individual objects. Even when the single object is a list of items, it's still a single Python object. In order to update or replace one of the items within the list, the entire list must be de-serialized and re-serialized. When it becomes necessary to work with multiple objects flexibly, there are better approaches described in Chapter 11, Storing and Retrieving Objects via Shelve, Chapter 12, Storing and Retrieving Objects via SQLite, and Chapter 13, Transmitting and Sharing Objects.
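
The cost of treating a list as a single serialized object can be sketched as follows. The replace_item() function and the items.json filename are hypothetical names used only for this illustration; the point is that changing one item means loading and rewriting the whole list.

```python
import json
import os
import tempfile


def replace_item(path, index, new_value):
    """Replace one item in a JSON-encoded list stored at path."""
    # De-serialize the entire list, even though only one item changes.
    with open(path, encoding="utf-8") as source:
        items = json.load(source)
    items[index] = new_value
    # Re-serialize the entire list back to the file.
    with open(path, "w", encoding="utf-8") as target:
        json.dump(items, target)
    return items


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "items.json")
    with open(path, "w", encoding="utf-8") as target:
        json.dump(["a", "b", "c"], target)

    replace_item(path, 1, "B")

    with open(path, encoding="utf-8") as source:
        final_items = json.load(source)
```

For a three-item list this is trivial. For a large collection, the full read and rewrite on every change is what motivates the approaches in the later chapters.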

For the most part, we're limited to objects that fit in working memory. We'll look at the following serialization representations:

  • JavaScript Object Notation (JSON): This is a widely used representation. For more information, visit http://www.json.org. The json module provides the classes and functions necessary to load and dump data in this format. In the Python Standard Library, look at section 19, Internet Data Handling, not section 12, Persistence. The json module is focused narrowly on JSON serialization. The more general problem of serializing arbitrary Python objects isn't handled well.
  • YAML Ain't Markup Language (YAML): This is an extension to JSON and can lead to some simplification of the serialized output. For more information, check out http://yaml.org. This is not a standard part of the Python library; we must add a module to handle this. The PyYAML package, specifically, has numerous Python persistence features.
  • pickle: The pickle module has its own unique representation for data. As this is a first-class part of the Python library, we'll look closely at how to serialize an object in this way. This has the disadvantage of being a poor format for the interchange of data with non-Python programs. It's the basis for the shelve module in Chapter 11, Storing and Retrieving Objects via Shelve, as well as message queues in Chapter 13, Transmitting and Sharing Objects.
  • Comma-Separated Values (CSV): This can be inconvenient for representing complex Python objects. As it's so widely used, we'll need to work out ways to serialize Python objects in the CSV notation. For references, look at section 14, File Formats, of the Python Standard Library, not section 12, Persistence, because it's simply a file format and little more. CSV allows us to process, one row at a time, collections of Python objects that cannot fit into memory.
  • XML: In spite of some disadvantages, this is very widely used, so it's important to be able to convert objects into an XML notation and recover objects from an XML document. XML parsing is a huge subject. The reference material is in section 20, Structured Markup Processing Tools, of the Python Standard Library. There are many modules to parse XML, each with different advantages and disadvantages. We'll focus on ElementTree.
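
To give a rough feel for how the standard-library formats in the list above differ, here is a minimal sketch that round-trips the same small state through json, pickle, and ElementTree. The dictionary contents, the Point class, and the XML element names are hypothetical; YAML is left out because it requires a separately installed package.

```python
import json
import pickle
import xml.etree.ElementTree as ET

# Hypothetical sample state built from simple built-in types.
state = {"title": "Travel", "tags": ["sailing", "weather"]}

# JSON: a text representation, readable by non-Python programs.
json_text = json.dumps(state)
json_copy = json.loads(json_text)

# pickle: a Python-specific representation, produced as bytes.
pickle_bytes = pickle.dumps(state)
pickle_copy = pickle.loads(pickle_bytes)

# XML via ElementTree: we must choose the element layout ourselves.
root = ET.Element("post", title=state["title"])
for tag in state["tags"]:
    ET.SubElement(root, "tag").text = tag
xml_text = ET.tostring(root, encoding="unicode")

parsed = ET.fromstring(xml_text)
xml_copy = {
    "title": parsed.get("title"),
    "tags": [element.text for element in parsed.findall("tag")],
}


class Point:
    """Hypothetical class used to show json's default behavior."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Without extra help, json refuses an instance of an arbitrary class.
try:
    json.dumps(Point(1, 2))
    json_rejected_object = False
except TypeError:
    json_rejected_object = True
```

All three copies compare equal to the original dictionary. The TypeError at the end illustrates the remark that json, used as is, doesn't handle arbitrary Python objects.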

Beyond these simple serialization formats, we can also have hybrid problems. One example of a hybrid problem is a spreadsheet encoded in XML. This means that we have a row-and-column data representation problem wrapped in the XML parsing problem. This leads to more complex software for disentangling the various kinds of data that were flattened to CSV-like rows so that we can recover useful Python objects. 
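
A hybrid problem of this kind might look like the following sketch. The sheet, row, and cell element names are an invented layout, not any real spreadsheet file format; the example only shows the two layers involved, XML parsing first and row-and-column interpretation second.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-encoded spreadsheet; real formats are more elaborate.
document = """<sheet>
  <row><cell>name</cell><cell>qty</cell></row>
  <row><cell>widget</cell><cell>3</cell></row>
  <row><cell>gadget</cell><cell>5</cell></row>
</sheet>"""

# Layer 1: the XML parsing problem.
root = ET.fromstring(document)
rows = [
    [cell.text for cell in row.findall("cell")]
    for row in root.findall("row")
]

# Layer 2: the row-and-column problem, treating the first row as titles.
header, *body = rows
records = [dict(zip(header, values)) for values in body]
```

Each entry in records is a dictionary keyed by column title, which is the same shape csv.DictReader would produce from an equivalent CSV file.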

In this chapter, we will cover the following topics:

  • Understanding persistence, class, state, and representation
  • Filesystem and network considerations
  • Defining classes to support persistence
  • Dumping and loading with JSON
  • Dumping and loading with YAML
  • Dumping and loading with pickle
  • Dumping and loading with CSV
  • Dumping and loading with XML