Chapter 11. Sheets: A CSV Framework

Of course, the most important thing in programming is the program. Tools, techniques, philosophy and advice don't offer much at all if they're never applied to solve a real-world problem. Sometimes that problem is very specific, but other times it's merely a specific example of a more general problem. These general problems are typically the subject of libraries and frameworks, which can provide the base for a more specific application.

This puts frameworks in an interesting position because they're focused more on serving the needs of developers, rather than ordinary users. The goal is to provide a foundation and a set of tools to aid someone else in the development of a more specific application. Supporting a wider array of uses requires more advanced techniques than would ordinarily be used to solve the problem directly.

In order to be useful to other developers, though, the ideal goal is to provide a sort of translation service, so that the advanced techniques used by the framework allow other developers to use simpler techniques to perform those more advanced tasks. In this respect, framework design is very similar to other forms of design, but rather than focusing primarily on a visual user interface, the focus is on the application's programming interface, the API.

It's important to look at frameworks like this, because if you're writing a framework, your audience is looking for a tool to save them time and energy, so they can focus on their unique needs. The framework should provide a set of features in a way that encourages integration with other types of applications, so it's necessary to think in terms of how those other applications should work.

There are countless examples of frameworks in use already, serving a wide variety of needs. They all address a general class of problem, such as Django[20] for Web development, SQLAlchemy[21] for database interaction and Twisted[22] for working with network protocols. These each take different approaches with the style and form of the interfaces they expose to developers, highlighting the various ways a framework can operate.

This chapter will show a framework that uses a declarative syntax similar to the ones used in Django and Elixir.[23] The choice of this approach is based largely on style, and even though there are other approaches you could use, investigating one in detail will highlight many of the decisions that must be made when writing a framework. You'll see all the techniques shown in this book combine to form a single, cohesive whole, exposing a public API that provides a number of useful features.

The particular problem that this chapter will address is the need to work with files that store information as rows of comma-separated values, more commonly referred to as CSV files. It may seem like a simple format on the surface, but with multiple options available for tasks like separating values on a line, separating lines themselves and encoding individual values within each line, it becomes a very complex topic.

Python already does a lot to help with CSV files by providing a csv module.[24] Rather than attempting to duplicate its functionality, we can use csv to do most of the heavy lifting behind the scenes. What we're going to do instead is build a layer on top of csv to make it easier to work with and integrate with other applications. Essentially, we're just providing a new API on top of an existing one, in hopes that we can make it a bit friendlier.

Building a Declarative Framework

There are several steps involved in building a framework using a declarative syntax similar to that of Django or Elixir, but the process itself really isn't all that difficult. Making decisions along the way, however, is where things get tricky. In this chapter, I'll outline the various steps required to build such a framework, as well as many of the decisions you'll have to make, but I can't make those decisions for you, because each one depends on the specifics of your own project.

You won't be all on your own, though. Each decision point along the way will outline the pros and cons of various options, so you can be confident about making an informed choice. Making the right decisions at the outset will help ensure that your framework will withstand future upgrades as well as criticisms from those who may not agree with you. Just make sure you have valid, real-world reasoning behind your decisions and you'll be fine.

Rather than leaving you with nothing but theory, this chapter will step through the creation of a framework that's simple enough to introduce the essential concepts, without having to dwell too long on matters specific to its purpose. It also needs to be a good example of when a declarative framework should be used, which first requires us to understand what it is we're really looking at.

Introducing Declarative Programming

At its core, a declarative framework is a helper to make declarative programming easier—or in some cases, possible. Of course, that definition is useless without defining what makes it declarative, but thankfully very little introduction is necessary. After all, you've already seen declarative programming in action, and have probably been using it for quite some time, perhaps without even realizing it.

Declarative programming is the practice of telling a program what you want (declaring), rather than telling it what to do (instructing). This distinction is really more about the programmer than the program, in that there are often no special syntax, parsing or processing rules and no single way to define what does and doesn't qualify. It's most often defined as the opposite of imperative programming, where the programmer is expected to outline every step the computer needs to perform.

With this in mind, it's easy to note that higher-level interpreted languages, like Python, are much better suited for declarative programming than their lower-level cousins, like C. In fact, many forms of it are built right in. Rather than having to declare a memory location, specify its type, then store a value in memory at that location, you can simply assign a variable and Python does the rest.

>>> foo = 'bar'

That's just one form of declarative programming, using one syntax. When we talk about declarative frameworks in Python, though, it usually refers to using a class declaration to configure the framework, rather than a long and complicated set of configuration directives. Whether or not that's the right approach for your needs requires a bit more discussion of the pros and cons.

To Build or Not to Build?

Declarative frameworks have been a rising trend in the Python world in the past few years, but it's important to understand that they're not always the best approach to a given problem. Like anything else, deciding whether to use a declarative framework requires understanding what it really is, what it does and what it means for your needs.

Declarative frameworks do a great job of wrapping a lot of complex behavior into a simple class declaration. This can be a great time-saver, but it can also seem a lot like magic, something the Python community is constantly battling against. Whether this is good or bad depends entirely on how closely your API matches what users will expect from a class declaration and how well you document the areas where those expectations may fail.

By having a class as the primary method of conveying your intentions to the framework, it's reasonable to expect that instances will have meaning. Most often, an instance refers to a specific set of data that conforms to a format defined by the class declaration. If your application acts on just a single set of well-defined data, there's little use in having individual instances.

Declarative frameworks are designed to support many different configurations using the same machinery, each tailored to a particular arrangement of data. If you only have one data format to work with—even if you've got loads of data—it just doesn't make sense to write a framework built for configurability. Just write a solution for your type of data and use it.

In other cases, you may not be able to describe the structure of a data set in advance, but instead have to adjust the structure based on the data provided. In these cases, there's little value in offering a class declaration, since no single declaration would suffice for the needs of the data you're working with.

A primary value of objects is the ability to perform actions on their contents, by way of instance methods. Since a declarative framework results in customized classes that produce individual instances, it stands to reason that these instances should be able to perform useful tasks that would be more difficult without the framework's assistance. This not only increases their usefulness but it also helps ensure that the resulting instances match with users' expectations.

To review, a declarative framework is a valuable approach if you have:

  • Many potential configurations

  • Each configuration known in advance

  • Many instances of any given configuration

  • Actions that can be performed on a given instance

The CSV framework described in this chapter needs to deal with a vast array of possible configurations of columns and structure, with many example files of each type. Actions such as loading and saving data are common, while others are unique to specific configurations.

Once completed, this framework will allow applications to specify CSV configurations as classes like the following, and interact with them using methods automatically attached to the class.

import sheets

class EmployeeSheet(sheets.Row):
    first_name = sheets.StringColumn()
    last_name = sheets.StringColumn()
    hire_date = sheets.DateColumn()
    salary = sheets.CurrencyColumn(decimal_places=2)

So let's get started.

Building the Framework

There are three primary components of any declarative framework, though one of them may come in different forms, or possibly not at all.

  • A base class—Since declarative frameworks are all about declaring classes, having a common base class to inherit from gives the framework a place to hook in and process declarations as they're encountered by Python. A metaclass attached to this base class provides the necessary machinery to inspect the declaration at runtime and make the appropriate adjustments. The base class is also responsible for representing instances of whatever structure your framework encapsulates, often with various methods attached to simplify common procedures.

  • Various field types—Inside the class declaration are a number of attributes, typically called fields. For some applications, it may make more sense to call them something more specific, but for this discussion, fields will suffice. These fields are used to manage individual data attributes in the structures represented by your framework, and often come in different flavors, each tailored to a different general type of data, such as strings, numbers and dates. Another important aspect of fields is that they must record the order in which they're instantiated, so that the ordering specified in the declaration is the same ordering used later on.

  • An options container—Not strictly a necessary component, most frameworks have use for some type of class-wide options, which shouldn't be specified on every individual field, as that wouldn't be very DRY. Since subclassing doesn't provide any options except the choice of base classes, some other structure must be used to manage these options. How these options are declared and processed can vary greatly from one framework to another; there's no syntactic or semantic standard whatsoever. As a matter of convenience, this container often also manages the fields attached to the class.

As a syntactic aid, most declarative frameworks also make sure that all three of these components can be imported from one single location. This allows end-user code to have a much simpler import block, while also containing all the necessary components on a single, identifiable namespace. The name of this namespace should be something meaningful, so it's easy to read in end-user code. The name of the framework itself is often an ideal choice, but it's important to be descriptive, so make sure it all makes sense when reading it over.

Although deciding what to call the framework can be deferred to later in the process, it helps to have a name in mind early on, if only to name the package that will contain the modules described below. Using a placeholder like csv would work fine for now, but since Python has its own csv module—which we'll be relying on as well—reusing this name would cause a great deal of problems. Since CSV files are commonly used to exchange data among spreadsheet applications, we'll call our little framework sheets.
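To make the pieces concrete, the modules introduced throughout the rest of this chapter could be organized in a package like the following. The exact layout is only a suggestion; what matters is that everything lives under the sheets package.

```
sheets/
    __init__.py
    base.py
    columns.py
    options.py
```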

It would seem that our journey should start with the base class, but really any of the three components can be a reasonable place to start. It often depends on which piece requires the most thought, does the most work or needs to be tested first. For this discussion, we'll start with the options container, as it can be created without relying on the implementation details of the other components. This avoids leaving too many vague references to functionality that hasn't been described yet.

Managing Options

The primary purpose of an options component is to store and manage options for a given class declaration. These options are not specific to any one field, but rather apply to the entire class or are used as default values that individual fields can optionally override. For now, we'll set aside the question of how these options will be declared and simply focus on the container itself and its associated needs.

On the surface, options are simply a map of names to values, so we could use a simple dictionary. After all, Python has a fantastic dictionary implementation and simple is most certainly better than complex. However, writing our own class affords us a few extra features that will be very handy.

For starters, we can validate the options that are defined for a given class. They can be validated based on their individual values, their combination with other options, their appropriateness for the given execution environment and whether or not they're known options at all. With a dictionary, we're stuck simply allowing any type of value for any option, even if it makes no sense.

Mistakes in options would then only be known when code that relies on them chokes because they're incorrect or missing, and those types of errors typically aren't very descriptive. Validating on a custom object means we can provide much more useful messages to users who try to use incorrect or invalid options.

Using a custom class also means we add our own custom methods to perform tasks that, while useful, are either repetitive or don't really belong anywhere else. A validation method can verify that all the included options are appropriate, displaying useful messages if not. Remember also that the options container often manages fields, so there are some methods that can be added for that purpose; these are described later in this section.

In fact, by combining those two features, the options class can even validate field declarations in the context of provided options. Try doing that with an ordinary dictionary.

Because it may end up encapsulating quite a bit of functionality, we'll set up a new module for our options container, unambiguously named options.py. Like most classes, the bulk of the work will be done in the __init__() method. For our purposes, this will accept all known options, store them away as attributes and set up some other attributes that will be used by other methods later on. Validation is generally only useful when actively defining options, so that belongs in its own method so as not to bog down this one.

And so we come to the next decision in your framework: what options should you accept? Different frameworks will obviously have different requirements, and it's important to lay them out as completely as you can at the outset. Don't worry, you can always add more later, but it's better to get them in place earlier than later.

One useful rule of thumb is that options should always have defaults. Asking your users to not only write a class and provide fields but also provide options every time will get frustrating, especially if the required option often has the same value. In general, if something is truly required and doesn't have some reasonable default, it should be supplied as an argument to the methods that require it, rather than defined as an option on the class.

We're building a framework to interface with CSV files, so there are a number of options available. Perhaps the most obvious is the character encoding of the file, but Python already converts file content to Unicode when the file is opened in text mode. The open() function accepts an encoding argument that allows all the same encodings available with a string's encode() method. It defaults to UTF-8, which should suffice for most common needs.

Note

The encoding used when reading the file seems like a perfect candidate for an option, so you can override the default UTF-8 behavior. Unfortunately, the standard CSV interface requires that the file be already open when it's passed in, so if our framework follows the same interface, we have no control over the encoding. The only way to control it would be to change the interface to accept a filename rather than an open file object.

One common variation in CSV files is whether they contain a header row, containing titles for the various columns. Since we'll be defining columns as fields later on in the framework, we don't really need that header row, so we can skip it. But only if we know it's there. A simple Boolean, defaulting to False for the more common case, will do the trick nicely.

class Options:
    """
    A container for options that control how a CSV file should be handled when
    converting it to a set of objects.

    has_header_row
        A Boolean indicating whether the file has a row containing header
        values. If True, that row will be skipped when looking for data.
        Defaults to False.
    """

    def __init__(self, has_header_row=False):
        self.has_header_row = has_header_row

There we have a simple, but useful, options container. At this point the only benefit it has over a dictionary is that it automatically prohibits any options other than the ones we've specified. We'll come back and add a more rigorous validation method later.
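To give a sense of what that later validation might look like, here's a rough sketch. The validate() name and the specific check shown are illustrative assumptions, not the framework's final implementation.

```python
class Options:
    """
    A container for options that control how a CSV file should be handled.
    """

    def __init__(self, has_header_row=False):
        self.has_header_row = has_header_row

    def validate(self):
        # Illustrative check only; a fuller version could also verify
        # combinations of options and their fit for the environment
        if not isinstance(self.has_header_row, bool):
            raise TypeError('has_header_row must be a Boolean')
```

A plain dictionary could never complain this early; the custom class can report a bad option as soon as validate() is called, with a message naming the problem.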

If you're familiar with Python's csv module, you may already know that it contains a variety of options as part of its support for different dialects. Since sheets will actually defer to that module for much of its functionality, it makes sense to support all of the same options, in addition to our own. In fact, it even makes sense to rename our Options class Dialect instead, to better reflect the vocabulary already in use.

Rather than listing all of the options supported by csv separately, though, let's take a more forward-thinking approach. We're relying on code outside our control, and it's a maintenance hassle to try to keep up with any changes that code might introduce in the future. Instead, we can support any existing options, as well as any future ones, by simply passing any additional options straight to csv itself.

In order to accept options without naming them, we turn to Python's support for extra keyword arguments using the double-asterisk syntax. These extra options can be stored away as a dictionary, which will be passed into the csv functions later on. Accepting them as a group of keyword arguments, rather than a single dictionary, helps unify all of the options, which will be important once we actually parse options out of the class declaration.

class Dialect:
    """
    A container for dialect options that control how a CSV file should be
    handled when converting it to a set of objects.

    has_header_row
        A Boolean indicating whether the file has a row containing header
        values. If True, that row will be skipped when looking for data.
        Defaults to False.

    For a list of additional options that can be passed in, see documentation
    for the dialects and formatting parameters of Python's csv module at
    http://docs.python.org/library/csv.html#dialects-and-formatting-parameters
    """

    def __init__(self, has_header_row=False, **kwargs):
        self.has_header_row = has_header_row
        self.csv_dialect = kwargs

This class will grow some more features later on, but that's enough to get things started. We'll come back to it a few more times before we're done, but for now, let's move on to what may well be the meatiest part of our little framework: fields.
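Before moving on, here's a sketch of how the stored options might eventually be handed off to the csv module. The Dialect class is reproduced from above; the sample data and the semicolon delimiter are assumptions for the sake of the demonstration.

```python
import csv
import io

class Dialect:
    """Reproduced from above: stores framework options and csv options."""

    def __init__(self, has_header_row=False, **kwargs):
        self.has_header_row = has_header_row
        self.csv_dialect = kwargs

dialect = Dialect(has_header_row=True, delimiter=';')

# The extra keyword arguments pass straight through to csv.reader()
data = io.StringIO('first_name;salary\nAlice;50000\n')
reader = csv.reader(data, **dialect.csv_dialect)
rows = list(reader)
if dialect.has_header_row:
    rows = rows[1:]  # skip the header row
```

Because csv_dialect is just a dictionary of whatever was passed in, any formatting parameter csv supports now or in the future works without changes to sheets.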

Defining Fields

Fields are generally just containers for specific pieces of data. Because it's such a generic term, different disciplines may use something more specific to refer to the same concept. In databases, they're called columns. In forms, they're often called inputs. When executing a function or a program, they're called arguments. To maintain some perspective beyond this one framework, this chapter will refer to all such data containers as fields, even though for sheets itself, the term "column" will make more sense when naming the individual classes.

The first thing to define is a base field class, which will describe what it means to be a field. Without any details of any particular data type, this base class manages how fields fit in with the rest of the system, what API they'll have and how subclasses are expected to behave. Since our framework is calling them columns, we'll start a new module called columns.py and get to work.

Fields are Python objects that are instantiated as part of the class declaration and assigned as attributes of the class. Therefore, the __init__() method is the first entry point into field functionality and the only point where the field can be configured as part of the declaration. Arguments to __init__() may vary depending on a field's type, but there are often at least a few arguments that are applicable to all fields, and can thus be processed by the base class.

First, each field can have a title. This allows for more readable and understandable code, but also provides a way for other tools to automatically document the fields with more useful information than just the field's attribute name. Planning for validation wouldn't hurt, so we'll also add a way to indicate whether the field is required.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

Notice that the title is optional. If no title is provided, a simple one can be gleaned from the attribute name the field is assigned to. Unfortunately, the field doesn't know what that name is yet, so we'll have to come back for that functionality later. We also assume that most fields will be required, so that's the default, to be overridden on a per-field basis.

Tip

Required fields may not immediately seem to have much value for a CSV framework, since the data comes from files rather than directly from users, but they can be useful. For a framework like sheets, required fields can eventually be used to validate incoming files or the data that's about to be saved to an outgoing file. It's generally a good feature to include at the outset for any framework, to support features that can be added later.

You may already have other arguments in mind for your framework's fields. If so, feel free to add them in now, following the same basic pattern. Don't worry about planning for everything at the outset, though; there will be plenty of opportunity to add more later on. Next on the agenda is to get the fields properly connected to their associated classes.

Attaching a Field to a Class

Next, we need to set up the hook for getting additional data from the class the field is assigned to, including the field's name. This new attach_to_class() method is—as its name suggests—responsible for attaching the field to the class where it was assigned. Even though Python automatically adds the attributes to the class where they're assigned, that assignment doesn't convey anything to the attribute itself, so we'll have to pass that information along from the metaclass.

First, we need to decide what information the attribute needs to know about how it was assigned. After preparing for a title in the previous section, it's clear that the attribute will need to know what name it was given when assigned. By obtaining that name directly in code, we can avoid the trouble of having to write the name out separately as an argument to the attribute instantiation.

The long-term flexibility of the framework will also depend on providing as much information as possible to attributes, so that they can easily provide advanced functionality by introspecting the classes they're attached to. Unfortunately, the name alone doesn't say anything about the class where the attribute now resides, so we'll have to provide that in the metaclass as well.

Lastly, the options that were defined earlier, such as has_header_row, will have some bearing on the attribute's behavior. Rather than expecting the attribute to retrieve those options based on the class that was passed in, it's easier to simply accept the options as another argument. This leaves us with an attach_to_class() that looks something like this.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, options):
        self.cls = cls
        self.name = name
        self.options = options

This alone will allow other methods of the attribute object to access a wealth of information, such as the name of the class, what other attributes and methods were declared on it, what module it was defined in and more. The first task we'll need to perform with that information is somewhat more mundane, though, as we still need to deal with the title. If no title was specified when the attribute was created, this method can use the name to define one.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, options):
        self.cls = cls
        self.name = name
        self.options = options
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')

This addition takes an attribute name with underscores and converts it to a title using multiple words. We could impose other conventions, but this is simple enough to work with, accurate for most situations and fits in with common naming conventions. This simple approach will cover most use cases without being difficult to understand or maintain.

As the comment indicates, the if test for this new feature goes against standard idioms by explicitly checking for None rather than simply letting an unspecified title evaluate to False. Doing things "the right way" here would remove the ability to specify an empty string as a title, which can explicitly indicate that no title is necessary.

Checking for None allows empty strings to still retain that string as the title, rather than having it replaced by the attribute name. One example of the usefulness of an empty title would be as a way to indicate that the column doesn't need to be presented in a display of the file's data. It's also a good example of where comments can be crucial to understanding the intent of a piece of code.
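To see the distinction in action, here's a quick demonstration using the Column class defined above; object is used as a stand-in for a real row class, and None stands in for the options.

```python
class Column:
    """Reproduced from above to demonstrate the None check."""

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, options):
        self.cls = cls
        self.name = name
        self.options = options
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')

# No title: the attribute name supplies one
untitled = Column()
untitled.attach_to_class(object, 'first_name', None)

# Empty title: preserved, signaling that no title should be displayed
blank = Column(title='')
blank.attach_to_class(object, 'first_name', None)
```

Testing `title is None` rather than `not title` is what keeps these two cases distinct.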

Tip

Even though this attach_to_class() method doesn't use the options that were provided, it's generally a good idea to include the options argument in the protocol. The next section will show that the options will be available as an attribute of the class, but it's a bit clearer to pass them in as their own argument. If your framework needs to apply these class-level options to individual fields, it'll be easier to accept them as an argument than to extract them from the class.

Adding a Metaclass

With the attach_to_class() method in place, we must now move on to the other side of the equation. After all, attach_to_class() can only receive information; the metaclass is responsible for providing that information. Until now, we haven't even started looking at the metaclass for this framework, so we need to start with the basics.

All metaclasses start out the same, by subclassing type. In this case, we'll also add an __init__() method because all we need is to process the contents of the class definition after Python has finished with them. First up, the metaclass needs to identify any options that were defined in the class and create a new Dialect object to hold them. There are a few ways to go about this.

The most obvious option would be to simply define options as class-level attributes. That would make defining the individual classes easy later on, but it would impose some problems that may not be as obvious. For one, it would clutter up the main class namespace. If a column's name happened to match a class option (imagine a file cataloging other spreadsheets, with a column named has_header_row), you'd have to name the column something else in order to avoid one of them overwriting the other and causing problems.

On a more practical note, it's easier to pick out options if they're contained in their own namespace. By being able to easily identify which attributes are options, we can pass them all in as arguments to Dialect and immediately know if any were missing or if invalid names were specified. So the task now is to determine how to provide a new namespace for options, while still declaring them as part of the main class.

The simplest solution is to use an inner class. Alongside any other attributes and methods, we can add a new class, named Dialect, to contain the various option assignments. This way, we can let Python create and manage the extra namespace for us, so that all we have to do is look for the name Dialect in the attribute list and pull it out.

Tip

Even though the inner Dialect class inhabits the main namespace alongside other attributes and methods, there's much less chance of a clash because it's only one name instead of several. Further, because we use a name that starts with a capital letter, a style discouraged for attribute and method names, there's even less chance of collision. Because Python names are case-sensitive, you're free to define an attribute called dialect (note the small "d") on the class without fear of bumping into this Dialect class.
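As a standalone illustration of what the metaclass will need to do, here's how the option assignments can be pulled out of an inner Dialect class. The EmployeeSheet name and the specific options shown are hypothetical examples.

```python
class EmployeeSheet:
    # Inner class provides a separate namespace for options
    class Dialect:
        has_header_row = True
        delimiter = ';'

# Pull out only the names the user declared, skipping Python's own
# additions to the namespace, such as __module__ and __qualname__
items = {k: v for k, v in EmployeeSheet.Dialect.__dict__.items()
         if not k.startswith('__')}
```

The resulting dictionary is exactly what can be passed along as keyword arguments to the options container.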

To extract this new Dialect class, we'll turn to the first implementation of a metaclass in this framework. Because this will help form the base class for future inheritance, we'll put the code into a new module, named base.py.

from sheets import options

class RowMeta(type):
    def __init__(cls, name, bases, attrs):
        if 'Dialect' in attrs:
            # Filter out Python's own additions to the namespace
            items = attrs['Dialect'].__dict__.items()
            items = dict((k, v) for (k, v) in items if not k.startswith('__'))
        else:
            # No dialect options were explicitly defined
            items = {}
        dialect = options.Dialect(**items)

Now that the options have been pulled out of the class definition and have populated a Dialect object, we'll need to do something with that new object. We know from the definition of attach_to_class() in the previous section that it gets passed into that method for each field attribute that was defined, but what else?

In the spirit of retaining as much information as possible for later, we'll keep it assigned to the class itself. But since the capitalized name doesn't work as well as an attribute name, it's best to rename it to something more suitable. Since it also forms a private interface to the inner workings of the framework, we can prefix the new name with an underscore to further prevent any accidental name clashes.

from sheets import options

class RowMeta(type):
    def __init__(cls, name, bases, attrs):
        if 'Dialect' in attrs:
            # Filter out Python's own additions to the namespace
            items = attrs.pop('Dialect').__dict__.items()
            items = {k: v for k, v in items if not k.startswith('__')}
        else:
            # No dialect options were explicitly defined
            items = {}
        cls._dialect = options.Dialect(**items)

This simple change removes it from the class namespace where it was given the original name and instead inserts it under a new name, _dialect. Both names avoid clashes with common attribute names, but this change makes it use a more standard private attribute name. Previously, it used the standard style for naming a class because that's how it's defined.

With that, we finally have all the pieces in place to continue working with the field attributes. The first task is to locate them in the class definition and call attach_to_class() on any that are found. This is easily accomplished with a simple loop through the attributes.

from sheets import options

class RowMeta(type):
    def __init__(cls, name, bases, attrs):
        if 'Dialect' in attrs:
            # Filter out Python's own additions to the namespace
            items = attrs.pop('Dialect').__dict__.items()
            items = {k: v for k, v in items if not k.startswith('__')}
        else:
            # No dialect options were explicitly defined
            items = {}
        cls._dialect = options.Dialect(**items)

        for key, attr in attrs.items():
            if hasattr(attr, 'attach_to_class'):
                attr.attach_to_class(cls, key, cls._dialect)

This simple metaclass contains a loop that just checks each attribute to see if it has an attach_to_class() method. If it does, the method is called, passing in the class object, the name of the attribute and the dialect. This way, all the columns can get the information they need very early on in the process.

Now all that's necessary to fill out the rest of base.py is to include a true base class that individual CSV definitions can subclass. Since each subclass is a single row in a spreadsheet, we can name the base class Row to indicate its purpose. All it needs to do at the moment is include RowMeta as its metaclass, and it'll automatically get the necessary behavior.

class Row(metaclass=RowMeta):
    pass

Bringing It Together

Technically, all the pieces are now in place to demonstrate at least the basics of a working system, but there's still one important piece to take care of. Currently, we have three different modules, each with some of the parts that need to be exposed in a public API. Ideally, all of the important bits should be available from one central import instead of three or potentially even more.

If you haven't already, create an __init__.py module in the same directory as the other scripts mentioned so far. That file can be empty, and you'd still be able to import each of the modules individually, but with a little effort, it can be put to better use. Because this is the file imported when simply importing the package name directly, we can use that as a trigger to pull in the useful bits from all the other files. Open up __init__.py and put this code in it.

from sheets.base import *
from sheets.options import *
from sheets.columns import *

Note

Ordinarily, using an asterisk to import everything is a bad idea because it can make it more difficult to identify what came from where. Since this module is only importing code and not doing anything with it, that problem doesn't really apply. As long as the package is imported on its own, such as import sheets, there won't be any confusion as to where the objects come from. And since we don't have to mention any of the objects by name, this will hold for anything we may add to those modules as well.

Now we have enough working parts to show that the framework can function, at least at a very basic level. If we create an example.py one directory up from the framework code itself, so that sheets is on the PYTHONPATH, we can now create a class that does some very simple work to show that it's starting to come together.

import sheets

class Example(sheets.Row):
    title = sheets.Column()
    description = sheets.Column()


if __name__ == '__main__':
    print(Example._dialect)
    print(Example.title)

All this really does so far is allow us to name the columns, though. In order to line them up with data in CSV files, we need to know the order in which the fields were defined in the class.

Ordering Fields

As it stands, the fields are all available as attributes of the class itself. This allows you to get some information about individual fields but only if you know the name of the field. Without a name, you'd have to inspect all the attributes on the class and check which of them are instances of Column or its subclasses. Even if you do that, though, you still don't know the order in which they were defined, so it's impossible to line them up with data from a CSV file.

In order to address both of those issues, we need to set up a list of columns, where each of the columns can be stored in the order in which it was defined. But first, we need to be able to identify that order at runtime, without the benefit of being able to ask the developer. There are at least three different ways to do this, each with its own benefits.

DeclarativeMeta.__prepare__()

Chapter 4 showed that metaclasses can control the behavior of the class namespace while Python is processing the block of code that makes up the class definition. By including a __prepare__() method on the declarative metaclass—in this case, RowMeta—we can provide an ordered dictionary, which can then keep the order of attribute assignments itself. It's as simple as importing an ordered dictionary implementation and returning it from a custom __prepare__() method.

from collections import OrderedDict

from sheets import options

class RowMeta(type):
    def __init__(cls, name, bases, attrs):
        if 'Dialect' in attrs:
            # Filter out Python's own additions to the namespace
            items = attrs.pop('Dialect').__dict__.items()
            items = {k: v for k, v in items if not k.startswith('__')}
        else:
            # No dialect options were explicitly defined
            items = {}
        cls._dialect = options.Dialect(**items)

        for key, attr in attrs.items():
            if hasattr(attr, 'attach_to_class'):
                attr.attach_to_class(cls, key, cls._dialect)

    @classmethod
    def __prepare__(cls, name, bases):
        return OrderedDict()

That only gets us part of the way, though. Now the namespace dictionary contains all the class attributes, and it knows the order in which they were defined, but it doesn't address the issue of having a simple list of just the CSV columns. The namespace dictionary will also hold all the methods and other miscellaneous attributes that were defined, so we'll still need to grab the columns out of it and put them into another list.
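To see the ordering behavior in isolation, here's a minimal metaclass, separate from the framework, with hypothetical attribute names; it simply records the order in which class attributes were assigned:

```python
from collections import OrderedDict

# Minimal demonstration that __prepare__() lets a metaclass see
# attribute assignments in their original order.
class OrderedMeta(type):
    @classmethod
    def __prepare__(cls, name, bases):
        return OrderedDict()

    def __init__(cls, name, bases, attrs):
        # Skip names like __module__ that Python adds automatically
        cls.declared = [k for k in attrs if not k.startswith('__')]

class Example(metaclass=OrderedMeta):
    first = 1
    second = 2
    third = 3

print(Example.declared)  # ['first', 'second', 'third']
```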

One obvious way to do that would be to look at each attribute in the dictionary and check to see whether it's a column or not. That's the same process mentioned earlier in this section, but the difference with considering it now is that you can hide the complexity inside the metaclass.

Since __init__() gets run after the entire body has been processed, its attrs argument will be an ordered dictionary containing all the attributes. All that's left is to loop over them and pull out any columns that were found. Again, in the spirit of duck typing, we'll use the presence of attach_to_class() to determine which attributes are columns. In fact, we can use the existing loop and just inject the new code into the inner if block.

In order to use it in the real world, it'll need to be placed somewhere more useful, such as the Dialect object stored in the _dialect attribute of the class. Rather than simply assigning a list externally, it makes more sense to have Dialect manage that itself by giving it an add_column() method that we can call from the metaclass instead.

class Dialect:
    """
    A container for dialect options that control how a CSV file should be
    handled when converting it to a set of objects.

    has_header_row
        A Boolean indicating whether the file has a row containing header
        values. If True, that row will be skipped when looking for data.
        Defaults to False.

    For a list of additional options that can be passed in, see documentation
    for the dialects and formatting parameters of Python's csv module at
    http://docs.python.org/library/csv.html#dialects-and-formatting-parameters
    """

    def __init__(self, has_header_row=False, **kwargs):
        self.has_header_row = has_header_row
        self.csv_dialect = kwargs
        self.columns = []

    def add_column(self, column):
        self.columns.append(column)

Now that Dialect knows how to keep a record of fields, it's only a small matter to change RowMeta to add the columns to the dialect as they're found. Because the namespace is already sorted according to when the attributes were assigned, we can be sure that they'll be attached to the class in the right order. Thus, we can simply add a quick call to the dialect's add_column() in the column's attach_to_class() method.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, dialect):
        self.cls = cls
        self.name = name
        self.dialect = dialect
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')
        dialect.add_column(self)

Note

This example also changes the name of the options attribute to dialect instead, to be consistent with the rest of the framework.

Now our code has an easy way to get at the columns that were provided to the class, in their original order. There's one fairly significant flaw with it, though: the __prepare__() technique is only available in Python starting with version 3.0. Because there was no equivalent functionality before then, any older versions will need to use a completely different approach to the problem.

We can make use of a basic principle of Python's class processing: the body of a class is executed as a block of code. That means that each of the column attributes is instantiated in the order it was written in the class definition. The Column class already has a block of code that runs when the attribute is instantiated, which can be extended a bit to keep track of each instantiation.

Column.__init__()

The most obvious choice is where we already have code: the __init__() method. It gets called for each Column object as it's instantiated, so it makes a convenient place to keep track of the order those objects are encountered. The actual process is fairly simple. All it takes is a counter that can be maintained in one place, regardless of which column is being processed, and a small bit of code to increment that counter every time a new column is found.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    # This will be updated for each column that's instantiated.
    counter = 0

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required
        self.counter = Column.counter
        Column.counter += 1

    def attach_to_class(self, cls, name, dialect):
        self.cls = cls
        self.name = name
        self.dialect = dialect
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')
        dialect.add_column(self)

This code handles part of the problem. Now, each column has a counter attribute that indicates its position among the rest.
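The counter technique can be seen in isolation with a trivial class; this is a stripped-down sketch, not the framework's actual Column:

```python
# Each instance copies the current class-level counter, then advances it,
# so every instance records its own position in the instantiation order.
class Tracked:
    counter = 0

    def __init__(self):
        self.counter = Tracked.counter
        Tracked.counter += 1

a, b, c = Tracked(), Tracked(), Tracked()
print(a.counter, b.counter, c.counter)  # 0 1 2
```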

The next step is to use that counter to force the ordering of the columns as they're stored on the Dialect object. In the __prepare__() approach, the namespace handled the ordering on its own, so there wasn't anything else to do. Here, we need to sort the list of fields explicitly, using the counter attribute to determine the order.

We can't do it right away in __init__() because that gets a dictionary of all the attributes, not just the columns. It doesn't know which attributes are columns until they're processed using their attach_to_class() methods. Sorting the list after processing all the columns with attach_to_class() instead would provide a complete list of just columns in the correct order. Here's what you'll need to add to the RowMeta class.

from sheets import options

class RowMeta(type):
    def __init__(cls, name, bases, attrs):
        if 'Dialect' in attrs:
            # Filter out Python's own additions to the namespace
            items = attrs.pop('Dialect').__dict__.items()
            items = {k: v for k, v in items if not k.startswith('__')}
        else:
            # No dialect options were explicitly defined
            items = {}
        cls._dialect = options.Dialect(**items)

        for key, attr in attrs.items():
            if hasattr(attr, 'attach_to_class'):
                attr.attach_to_class(cls, key, cls._dialect)

        # Sort the columns according to their order of instantiation
        cls._dialect.columns.sort(key=lambda column: column.counter)

This function call may look a little more complicated than it really is. It's just invoking a standard sort() operation but with a function that will be called to determine what value to use when sorting items. We could add a method to Column that just returns the counter and use that, but since it's only used here, a lambda function will do the same job inline.
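On its own, the pattern is just a standard key-based sort; the values here are hypothetical:

```python
# Sorting arbitrary objects by an attribute, using a key function
# to tell sort() which value to compare.
class Item:
    def __init__(self, counter):
        self.counter = counter

items = [Item(2), Item(0), Item(1)]
items.sort(key=lambda item: item.counter)
print([item.counter for item in items])  # [0, 1, 2]
```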

Most of the code that was added in this approach is necessary whenever __prepare__() isn't available, regardless of any other preferences. The only area where we really have any room to use a different approach is where the counter value is updated. There are a few different ways to go about managing that value.

So far we've used the __init__() method of the Column class because that's always called during instantiation and it already had a basic implementation anyway. The trouble is that many __init__() methods are only used to save argument values as attributes on the object, so programmers have come to expect similar behavior. Aside from managing the counter, our own __init__() method matches that expectation perfectly.

So if a programmer wants to write a new column that doesn't use any of the same arguments as the base Column class, it's easy to write an __init__() method that simply doesn't call super(). Without using super() to fire the original __init__() method, that new column won't be ordered properly. Its counter attribute will always be the same as whatever was processed right before it, so sort() won't be able to reliably determine where it belongs.
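Here's a sketch of that failure mode, using a stripped-down Column and a hypothetical subclass that skips super():

```python
class Column:
    counter = 0

    def __init__(self):
        self.counter = Column.counter
        Column.counter += 1

class CustomColumn(Column):
    def __init__(self, choices):
        # No super().__init__(), so no instance-level counter is ever set
        self.choices = choices

a = Column()          # gets counter 0
b = CustomColumn([])  # gets no counter of its own
c = Column()          # gets counter 1

# b falls back to the class attribute, which keeps moving as more
# columns are created, so its position is unreliable.
print(a.counter, b.counter, c.counter)  # 0 2 1
```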

You could argue that the problem here is with the programmers' assumption that __init__() doesn't do anything of value, but that's not a very productive approach to the problem. There are still a couple of ways we can make things easier for users of the framework and help avoid problems if someone neglects to use super().

Column.__new__()

Thinking about instantiation without __init__(), the next clear choice is __new__(), which is called earlier in the process. Using __new__() provides a chance to do the same work without competing with __init__(), so they can be independent of each other. The initialization of the object can still take place in __init__(), leaving __new__() to manage the counter value.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    # This will be updated for each column that's instantiated.
    counter = 0

    def __new__(cls, *args, **kwargs):
        # Keep track of the order each column is instantiated. Note that
        # object.__new__() won't accept the extra arguments, so they're
        # left for __init__() to process instead.
        obj = super(Column, cls).__new__(cls)
        obj.counter = Column.counter
        Column.counter += 1
        return obj

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, dialect):
        self.cls = cls
        self.name = name
        self.dialect = dialect
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')
        dialect.add_column(self)

The code in __new__() grows a bit from what was used in __init__() previously because __new__() is responsible for creating and returning the new object. Therefore, we need to create the object explicitly before assigning the counter to it. Then, the method needs to explicitly return the new object in order for it to be accessible by anything else.

Using __new__() instead of __init__() is merely a way to reduce the odds of colliding with a custom implementation. It may be less likely, but it's still possible for a subclass to provide __new__() on its own, and doing so without using super() would still cause problems. There's still one other option that separates the counting behavior even further.

CounterMeta.__call__()

There's still one other method that gets called when instantiating a class that you may not have considered. Technically, the class object itself is being called as a function, which means that there's a __call__() method somewhere that would be called. Since __call__() is only executed as an instance method, but instantiation takes place when calling a class, we need to look at the class as an instance of something else: a metaclass.

That means that we can create a metaclass to support the counter functionality entirely outside the Column class. A simple CounterMeta class with a __call__() method can keep track of the counter on its own, and Column can then use that as its metaclass. The body of this method looks much like __new__() because it's called at essentially the same point in the process. It needs to create the object by using super() and return it explicitly.

class CounterMeta(type):
    """
    A simple metaclass that keeps track of the order that each instance
    of a given class was instantiated.
    """

    counter = 0

    def __call__(cls, *args, **kwargs):
        obj = super(CounterMeta, cls).__call__(*args, **kwargs)
        obj.counter = CounterMeta.counter
        CounterMeta.counter += 1
        return obj

Now that all of this functionality is isolated to a metaclass, the Column class gets a bit simpler. It can get rid of all the counter handling code, including the entire __new__() method. All it needs now to maintain the counting behavior is to use CounterMeta as its metaclass.

class Column(metaclass=CounterMeta):
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, dialect):
        self.cls = cls
        self.name = name
        self.dialect = dialect
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')
        dialect.add_column(self)

In fact, this CounterMeta is now capable of providing this counting behavior for any class that needs it. By simply applying the metaclass, every instance of the given class will have a counter attribute attached to it. Then you can use that counter to sort the instances according to when they were instantiated, just like the columns in the sheets framework.
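For example, the same metaclass could order instances of any unrelated class; Widget here is hypothetical, and the metaclass is repeated so the example runs on its own:

```python
class CounterMeta(type):
    counter = 0

    def __call__(cls, *args, **kwargs):
        obj = super(CounterMeta, cls).__call__(*args, **kwargs)
        obj.counter = CounterMeta.counter
        CounterMeta.counter += 1
        return obj

# Any class can gain the counting behavior just by using the metaclass.
class Widget(metaclass=CounterMeta):
    pass

widgets = [Widget() for _ in range(3)]
print([w.counter for w in widgets])  # [0, 1, 2]
```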

Choosing an Option

Of the options presented here, it's not always easy to determine which to choose. With each layer of added flexibility comes added complexity, and it's always best to keep things as simple as possible. If you're working in a Python 3 environment, __prepare__() is definitely the way to go. It doesn't require any additional classes to support it, it doesn't need to sort the list of columns after the fact and it works without touching the Column class at all.

The options for earlier versions of Python are more subjective. Which one you choose depends largely on how much you expect of your target audience and how much complexity you're willing to allow into your code. The simpler solutions require more vigilance on the part of your users, so you'll need to decide what's most important.

Since this chapter is designed for use with Python 3.1, the remaining examples of the code will use __prepare__(). Of course, the ability to order a set of fields is only useful once you have a collection of fields to work with.

Building a Field Library

In most declarative frameworks, sheets included, a primary function of fields is to convert data between native Python objects and some other data format. In our case, the other format is a string contained in the CSV file, so we need a way to convert between those strings and the objects the fields represent. Before we get into the details of specific field types, we need to set up a couple methods for managing data conversion.

The first method, to_python(), takes a string from the file and converts that string into a native Python value. This step is performed for each column, every time a row is read in from the file, to ensure that you can work with the correct type of value in Python. Because that behavior will be different for various types, delegating to a method like to_python() allows you to change this specific behavior on individual classes without having to do so all on one Column class.

The second method is to_string(), which works as an inverse to to_python() and will be called when saving a CSV file with values assigned in Python. Because the csv module works with strings by default, this method is used to provide any special formatting required by a particular CSV format. Delegating to this method means that each column can have its own options to suit the data that belongs in that field.

Even though each type of data behaves differently, the base Column class can support a simple use case by default. The csv module only works with files that are opened in text mode, so Python's own file access manages the conversion to Unicode while reading data. That means the value that comes from csv is already a string and can be used easily.

class Column:
    """
    An individual column within a CSV file. This serves as a base for attributes
    and methods that are common to all types of columns. Subclasses of Column
    will define behavior for more specific data types.
    """

    def __init__(self, title=None, required=True):
        self.title = title
        self.required = required

    def attach_to_class(self, cls, name, dialect):
        self.cls = cls
        self.name = name
        self.dialect = dialect
        if self.title is None:
            # Check for None so that an empty string will skip this behavior
            self.title = name.replace('_', ' ')
        dialect.add_column(self)

    def to_python(self, value):
        """
        Convert the given string to a native Python object.
        """
        return value

    def to_string(self, value):
        """
        Convert the given Python object to a string.
        """
        return value

Now we can start implementing them for individual data types.

StringField

The most obvious field to start with is a string because it can encompass any number of more specific forms of data. Titles, names, places, descriptions and comments are just some examples of the more specific values you might find in these fields, but from a technical standpoint they all work the same way. The sheets framework doesn't have to care what form of strings you'll be dealing with, only that they are in fact all strings.

The csv module provides strings on its own, so this class doesn't really have to do much. In fact, to_python() and to_string() don't need any custom implementation at all because they only need to return what they're given. The most important thing that's offered by a StringColumn is actually the name itself.

By having an attribute that's named according to the type of data it interacts with, the attribute becomes somewhat self-documenting. Rather than just using a generic Column to describe how strings are passed back and forth, you can use a StringColumn to be clear about how it works.

class StringColumn(Column):
    """
    A column that contains data formatted as generic strings.
    """
    pass

In fact, you could even call the base class StringColumn instead of just Column because it does the job on its own. Unfortunately, that would cause its own confusion when subclassing it, by requiring something like an IntegerColumn to subclass StringColumn. To keep things clearer, the base class will remain Column and each subclass will add only the necessary features on top of it, even though there's nothing useful to add beyond the name.

IntegerColumn

The next field type to add manages integers. Numbers are used quite a bit in spreadsheets, storing everything from ages to sales figures to inventory counts. Much of the time, those numbers will be plain integers that can be converted easily using the built-in int() function.

class IntegerColumn(Column):
    """
    A column that contains data in the form of numeric integers.
    """
    def to_python(self, value):
        return int(value)

IntegerColumn doesn't actually need to implement a to_string() method because the csv module automatically calls str() on whatever value is given to it. Since that's all we'd do in a to_string() method anyway, we can just leave it out and let the framework handle that task. As you'll see with other columns, to_string() is most useful when the column can specify a more explicit format to use. Simply writing out a number doesn't require that much flexibility.
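A stripped-down sketch of the column (simplified, not the framework's actual base class) shows both the conversion and the ValueError that later validation will rely on:

```python
class IntegerColumn:
    # Simplified stand-in for demonstration purposes only.
    def to_python(self, value):
        return int(value)

column = IntegerColumn()
print(column.to_python('42'))  # 42

try:
    column.to_python('not a number')
except ValueError:
    print('invalid value')  # int() rejects malformed input
```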

FloatColumn

Many numbers in spreadsheets have finer granularity than integers, requiring additional information to convey the value beyond the decimal point. Floating point numbers are a decent way to handle those values, and supporting them as a column is just as easy as it was with IntegerColumn. We can simply replace all the instances of int with float and be done.

class FloatColumn(Column):
    """
    A column that contains data in the form of floating point numbers.
    """
    def to_python(self, value):
        return float(value)

Of course, floating point numbers have their share of problems when it comes to viewing them or adding them together in many cases. This is caused by a lack of defined precision in the decimal point: it floats around according to how well a given value can be represented in code. To be more explicit and avoid things like rounding errors, we turn to DecimalColumn.

DecimalColumn

Like FloatColumn, this can work with numbers beyond just the integers. Instead of working with floating point numbers, though, DecimalColumn will rely on the functionality of the decimal module provided with Python. Decimal values preserve as much detail in the original number as possible, which helps prevent rounding errors. This makes decimals much more suitable for use with monetary spreadsheets.

In Python, decimals are provided using the decimal module, which provides a Decimal class to manage individual numbers. Therefore, DecimalColumn needs to convert numbers from text in CSV files to Decimal objects in Python and back again. Like floats, Decimal already converts to strings well enough on its own, so the only conversion DecimalColumn really needs to do is from strings to Decimal when reading values. Since Decimal is designed to work with strings, it's just as easy as the other columns shown so far.

import decimal

class DecimalColumn(Column):
    """
    A column that contains data in the form of decimal values,
    represented in Python by decimal.Decimal.
    """

    def to_python(self, value):
        return decimal.Decimal(value)

There's one difference between this method and those in the other classes, though. Each of the others has the added side effect of raising a ValueError if the value can't be properly converted, which we can use later to support validation. Decimal does validate during instantiation, but it raises an exception from the decimal module, InvalidOperation. In order to match the behavior of the others, we'll need to catch that and re-raise it as a ValueError.

import decimal

class DecimalColumn(Column):
    """
    A column that contains data in the form of decimal values,
    represented in Python by decimal.Decimal.
    """

    def to_python(self, value):
        try:
            return decimal.Decimal(value)
        except decimal.InvalidOperation as e:
            raise ValueError(str(e))
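As a quick standalone check (using a simplified copy of the class, outside the framework), the conversion now behaves consistently with the other columns:

```python
import decimal

class DecimalColumn:
    # Simplified stand-in for demonstration purposes only.
    def to_python(self, value):
        try:
            return decimal.Decimal(value)
        except decimal.InvalidOperation as e:
            raise ValueError(str(e))

column = DecimalColumn()
print(column.to_python('1.23'))  # 1.23

try:
    column.to_python('bogus')
except ValueError:
    print('invalid value')  # now a ValueError, like the other columns
```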

Even though DecimalColumn supports a more specialized data type, the code behind it is still fairly simple. Supporting dates, on the other hand, requires some added complexity.

DateColumn

Dates are also extremely common in spreadsheet documents, storing everything from employee paydays and holidays to meeting agendas and attendance. Like decimal values, dates require the use of a separate class to provide a native Python data type, but there's one significant difference: dates don't have a universally accepted string representation. There are some standards that are fairly well established, but there are still plenty of variations, from the placement of the date components to the punctuation used to separate them.

In order to support the necessary flexibility, a new DateColumn would need to accept a format string during instantiation, which can be used to parse values from the file as well as construct strings to store in the file. Python dates already use a flexible format string syntax,[25] so there's no need to invent a new one just for sheets. In order to specify the format during instantiation, though, we'll need to override __init__().

class DateColumn(Column):
    """
    A column that contains data in the form of dates,
    represented in Python by datetime.date.

    format
        A strptime()-style format string.
        See http://docs.python.org/library/datetime.html for details
    """

    def __init__(self, *args, format='%Y-%m-%d', **kwargs):
        super(DateColumn, self).__init__(*args, **kwargs)
        self.format = format

Notice that the format argument has a default value, which makes it optional. It's usually best to provide defaults like this for field attributes, so users can get up and running quickly. The default value used here was chosen because it's fairly common and it places the values in order from the least specific to the most specific—from year to day, respectively. That helps reduce the ambiguity we might otherwise encounter across cultures that format dates differently. Since the goal is to work with existing data, though, it's always possible for a specific Row class to override this behavior with whatever format is used by a given file.

With the format available on the DateColumn object, the next step, as it was for the others, is to create a to_python() method. Python's datetime object accepts each component of the date as a separate argument, but since to_python() only gets a string, we'll need another way to do it. The alternative comes in the form of a datetime class method called strptime().

The strptime() method accepts a string value as its first argument and a format string as its second. The value is then parsed according to the format string and a datetime object is returned. We don't actually need a full datetime, though, so we can also use that object's date() method to return just the date portion of the value as a date object.
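Here's a quick look at both steps in isolation, using the default format from the previous section:

```python
import datetime

# strptime() parses the string according to the format...
parsed = datetime.datetime.strptime('1981-12-17', '%Y-%m-%d')

# ...and date() discards the time portion of the result.
print(repr(parsed.date()))  # datetime.date(1981, 12, 17)
```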

import datetime

class DateColumn(Column):
    """
    A column that contains data in the form of dates,
    represented in Python by datetime.date.

    format
        A strptime()-style format string.
        See http://docs.python.org/library/datetime.html for details
    """

    def __init__(self, *args, format='%Y-%m-%d', **kwargs):
        super(DateColumn, self).__init__(*args, **kwargs)
        self.format = format

    def to_python(self, value):
        """
        Parse a string value according to self.format
        and return only the date portion.
        """
        return datetime.datetime.strptime(value, self.format).date()

Note

datetime is the name of the module as well as the name of the class, which is why it's written twice.

There's a subtle problem with to_python() as it's written here, though. All the other column types so far can accept both a string and a native object as values in to_python(), but strptime() will fail with a TypeError if you pass in a date object instead of a string. In order to construct a row in Python and save it in a file, we'll need to be able to accept a datetime object here, which will be converted to a string later, when saving.

Since to_python() is supposed to return a native object, this is a very simple task. All it takes is checking whether the value passed in is already a date object. If it is, to_python() can simply return that without doing any more work. Otherwise, it can continue on with the conversion.

class DateColumn(Column):
    """
    A column that contains data in the form of dates,
    represented in Python by datetime.date.

    format
        A strptime()-style format string.
        See http://docs.python.org/library/datetime.html for details
    """

    def __init__(self, *args, format='%Y-%m-%d', **kwargs):
        super(DateColumn, self).__init__(*args, **kwargs)
        self.format = format
    def to_python(self, value):
        """
        Parse a string value according to self.format
        and return only the date portion.
        """
        if isinstance(value, datetime.date):
            return value
        return datetime.datetime.strptime(value, self.format).date()

Writing the to_python() method was actually the most troublesome part of the DateColumn class. Converting an existing date value to a string is even simpler because there's an instance method, strftime(), available to do the job. It just accepts a format and returns a string containing the formatted value.

import datetime

class DateColumn(Column):
    """
    A column that contains data in the form of dates,
    represented in Python by datetime.date.

    format
        A strptime()-style format string.
        See http://docs.python.org/library/datetime.html for details
    """
    def __init__(self, *args, format='%Y-%m-%d', **kwargs):
        super(DateColumn, self).__init__(*args, **kwargs)
        self.format = format

    def to_python(self, value):
        """
        Parse a string value according to self.format
        and return only the date portion.
        """
        if isinstance(value, datetime.date):
            return value
        return datetime.datetime.strptime(value, self.format).date()

    def to_string(self, value):
        """
        Format a date according to self.format and return that as a string.
        """
        return value.strftime(self.format)

Tip

A useful way to remember the difference between the two method names is that p stands for "parse" and f stands for "format."
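Because the two methods are inverses of each other, a quick round trip through a non-default format (a US-style format, chosen purely for illustration) shows both in action:

```python
import datetime

fmt = '%m/%d/%Y'  # a US-style format, for illustration only

# parse: string -> date
value = datetime.datetime.strptime('12/17/1981', fmt).date()

# format: date -> string
print(value.strftime(fmt))  # 12/17/1981
```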

We could go on adding more and more fields, but the ones shown here cover the basic forms of data found in most CSV files, as well as most of the techniques necessary to build your own field attributes in a declarative framework. Next, we'll need to set up the CSV functionality in order to bring these data types to life.

Getting Back to CSV

So far, this chapter has been fairly generic, showing tools and techniques that can be applied to any variety of declarative class frameworks. In order to put them to real-world use, we need to get back to the problem of parsing CSV files. Much of the work done in this section will also be applicable to other frameworks, but will be presented in a way specific to CSV.

The first thing to do is take a look at how Python's own csv module works. There's no sense completely reinventing the wheel. It's important to understand the existing interface so that we can match it as closely as possible. The csv module's functionality is provided in two basic object types: readers and writers.

Readers and writers are configured in similar ways. They both accept a file argument, an optional dialect and any number of keyword arguments that specify individual dialect parameters to override the main dialect. The main difference between readers and writers is that readers require a file to be opened for read access and writers require write access.

For readers, the file argument is typically a file object but may in fact be any iterable object that yields a single string for each iteration. Because the csv module also handles more complex newline usage, such as newlines encoded within a value, you should always open the file with the argument newline='' to make sure Python's own newline handling doesn't get in the way.

>>> import csv
>>> reader = csv.reader(open('example.csv', newline=''))

Once instantiated for use with a particular file and dialect, a CSV reader object has an extremely simple interface: it's an iterable object. Iterating over a reader will yield each row in the CSV file as a data structure that's usable outside the csv module. The standard csv.reader yields a list of values for each row, because the only thing it knows about is the position of each value in the row.
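For example, given a simple two-column file, a standard reader yields nothing but lists of strings. An in-memory file is used here so the example is self-contained:

```python
import csv
import io

# io.StringIO stands in for a real file opened with newline=''.
file = io.StringIO('1,Principles and Philosophy\r\n2,Advanced Basics\r\n')

for row in csv.reader(file):
    print(row)
# ['1', 'Principles and Philosophy']
# ['2', 'Advanced Basics']
```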

A more advanced option is csv.DictReader, which also accepts a sequence of column names during instantiation, so that each row can be produced as a dictionary. Our framework goes even further, yielding an object with each value from the file converted to a native Python data type and made available as an attribute.
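Here's the same kind of in-memory file read with csv.DictReader, with the column names supplied up front:

```python
import csv
import io

file = io.StringIO('1,Principles and Philosophy\r\n')

# Each row maps the given field names to the values in that position.
reader = csv.DictReader(file, fieldnames=['chapter', 'title'])
row = next(reader)

print(row['chapter'], row['title'])  # 1 Principles and Philosophy
```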

Writer objects, on the other hand, are slightly more complex. Because simple iteration only allows reading values, rather than writing them, writers rely on a couple of methods to do the necessary work. The first, writerow(), writes out a single row to the file, as its name suggests. Its companion, writerows(), accepts a sequence of rows, which will be written to the file in the order they're found in the sequence.

Exactly what constitutes a row will differ based on what type of writer is used. As with readers, the csv module provides some different options. The standard csv.writer accepts a simple sequence of values for each row, placing each value on the row in the position it's found in the list. The more complex DictWriter accepts a dictionary, which uses the sequence of column names passed in during instantiation to determine where in the row each value should be written.
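A short example shows both writers at work against in-memory files:

```python
import csv
import io

plain = io.StringIO()
csv.writer(plain).writerow([1, 'Principles and Philosophy'])

keyed = io.StringIO()
writer = csv.DictWriter(keyed, fieldnames=['chapter', 'title'])
writer.writerow({'title': 'Principles and Philosophy', 'chapter': 1})

# Both produce the same row; DictWriter just looks up positions by name.
print(plain.getvalue() == keyed.getvalue())  # True
```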

The interface for working with our framework should look as much as possible like the interfaces of these standard readers and writers. A sheets reader should be an iterable object that yields instances of the custom class where all the column attributes were defined. Likewise, the writer should accept instances of that same class. In both cases, the order of the column attributes in the class definition will be used to determine where the values go.

One key factor of both the reader and the writer, though, is the notion of a row object. So far, we don't have any such object for the sheets framework, so we need to create one. As a class-based framework, sheets is already well-equipped to build an object that can represent a row. The columns and dialect are already defined on a class, so the ideal way to create an object would be to simply instantiate that class with a set of values. This will bring in aspects of the dialect and column classes described in earlier sections in order to produce a usable object.

The obvious place to implement this behavior is __init__(), but from there things get a little tricky. The first question is how to accept the values that will populate the attributes. Because we don't yet know anything about the layout of any particular Row subclass, we'll have to accept all arguments and deal with the requirements in the __init__() method itself.

Checking Arguments

As with any function, arguments to __init__() can be passed positionally or by keyword, but that decision has particular impact here because the object can be instantiated in one of two ways. When instantiating from a CSV file, as the next section will show, it's easiest to pass the values in positionally. When building an instance manually, though, it's highly convenient to be able to pass values in by keyword as well. Therefore, it's best to accept all positional and keyword arguments and manage them internally.

Two cases of invalid arguments are clear at the outset: too many positional arguments and keyword arguments that don't match any column names. Each of these cases requires a separate bit of code to support it, but they're both fairly easy to work with. For the positional case, we can simply check the number of arguments against the number of columns.

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        if len(args) > len(self._dialect.columns):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(self._dialect.columns), len(args)))

That takes care of the case where too many positional arguments are passed in, using the same error message Python would issue when the arguments are defined explicitly. The next step is to make sure that all of the provided keyword arguments match up with existing column names. This is easy to test by cycling through the keyword argument names and checking to see if each is also present in the list of column names.

Because the dialect only stores a list of columns, and not the list of column names, it's easiest to make a new list of column names here before testing them. Additional code to be added to __init__() later will also make use of this new list, so it's best to create it now.

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        column_names = [column.name for column in self._dialect.columns]

        if len(args) > len(column_names):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(column_names), len(args)))

        for name in kwargs:
            if name not in column_names:
                msg = "__init__() got an unexpected keyword argument '%s'"
                raise TypeError(msg % name)

That takes care of the obvious cases, but there's still one situation not yet covered: keyword arguments that target columns that also have positional arguments. To address this concern, we'll look at the behavior of Python itself. When confronted with an argument passed positionally and by keyword, Python raises a TypeError, rather than being forced to decide which of the two values to use.

>>> def example(x):
...     return x
...
>>> example(1)
1
>>> example(x=1)
1
>>> example(1, x=1)
Traceback (most recent call last):
  ...
TypeError: example() got multiple values for keyword argument 'x'

Providing that same behavior in our own __init__() is a bit more complex than the previous examples, but it's still fairly straightforward. We just need to look at each of the positional arguments and check whether there's a keyword argument matching the corresponding column name.

A useful shortcut for situations like this is to use a slice on the column name list to get only as many names as there are positional arguments. This way, we don't have to look through more names than necessary, and it eliminates the separate step of having to look up the column name by index inside the loop.
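The slice is easier to see with concrete values:

```python
column_names = ['name', 'birthdate', 'age']
args = ('Marty Alchin', '1981-12-17')

# Slicing keeps only as many names as there are positional arguments.
print(column_names[:len(args)])  # ['name', 'birthdate']
```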

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        column_names = [column.name for column in self._dialect.columns]

        if len(args) > len(column_names):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(column_names), len(args)))

        for name in kwargs:
            if name not in column_names:
                msg = "__init__() got an unexpected keyword argument '%s'"
                raise TypeError(msg % name)

        for name in column_names[:len(args)]:
            if name in kwargs:
                msg = "__init__() got multiple values for keyword argument '%s'"
                raise TypeError(msg % name)

With all the argument checking out of the way, __init__() can continue on with certainty that no invalid arguments were provided. From here, we can use those arguments to populate the values on the object itself.

Populating Values

There are actually two steps involved in populating the values on the object. The first is due to __init__() accepting both positional and keyword arguments. By offering both options, we now have arguments in two separate locations: args and kwargs. In order to set the values in one pass, we'll need to combine them into a single structure.

Ideally, that structure would be a dictionary because it combines the names and values, so we'll need to move positional arguments into the dictionary already provided by kwargs. For that, we'll need an index for each of the values passed in positionally and a reference to the corresponding column name, so the value can be assigned to the right name.

The last check from the previous section already provides that loop, so we can reuse that block to assign the value to kwargs. The only change we need to make to the loop is to use enumerate() to get the index of each column as well as its name. That index can then be used to get the value from args.

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        column_names = [column.name for column in self._dialect.columns]

        if len(args) > len(column_names):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(column_names), len(args)))

        for name in kwargs:
            if name not in column_names:
                msg = "__init__() got an unexpected keyword argument '%s'"
                raise TypeError(msg % name)

        for i, name in enumerate(column_names[:len(args)]):
            if name in kwargs:
                msg = "__init__() got multiple values for keyword argument '%s'"
                raise TypeError(msg % name)
            kwargs[name] = args[i]

Now kwargs has all the values passed into the constructor, each mapped to the appropriate column name. Next, we'll need to convert those values to the appropriate Python values before assigning them to the object. To do that, we'll need the actual column objects, rather than just the list of names we've been working with so far.

There's still one minor issue to consider. Looping through the columns gets us all the columns that were defined for the class, but kwargs only contains the values that were passed into the object. We'll need to decide what to do for columns that don't have a value available. When pulling in data from a CSV file, this won't usually be a problem because every row in the file should have an entry for each column. But when populating an object in Python, to be saved in a file later, it's often useful to assign the attributes after instantiating the object.

Therefore, the most flexible approach here is to simply assign None to any of the columns that don't have a value. Checking for required fields can be performed as a separate step later, when we get to validating fields for other things as well. For now, assigning None will work just fine.

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        column_names = [column.name for column in self._dialect.columns]

        if len(args) > len(column_names):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(column_names), len(args)))

        for name in kwargs:
            if name not in column_names:
                msg = "__init__() got an unexpected keyword argument '%s'"
                raise TypeError(msg % name)

        for i, name in enumerate(column_names[:len(args)]):
            if name in kwargs:
                msg = "__init__() got multiple values for keyword argument '%s'"
                raise TypeError(msg % name)
            kwargs[name] = args[i]

        # Now populate the actual values on the object
        for column in self._dialect.columns:
            try:
                value = column.to_python(kwargs[column.name])
            except KeyError:
                # No value was provided
                value = None
            setattr(self, column.name, value)

With this functionality finally in place, you can see the Row class in action on its own. It's now capable of managing a set of columns, accepting values as inputs, converting them to Python objects while loading and assigning those values to the appropriate attributes.

>>> import sheets
>>> class Author(sheets.Row):
...     name = sheets.StringColumn()
...     birthdate = sheets.DateColumn()
...     age = sheets.IntegerColumn()
...
>>> ex = Author('Marty Alchin', birthdate='1981-12-17', age='28')
>>> ex.name
'Marty Alchin'
>>> ex.birthdate
datetime.date(1981, 12, 17)
>>> ex.age
28

Now we can finally implement the code to actually interact with CSV files.

The Reader

Using the csv module directly, you obtain a reader by instantiating a class and passing in a file and the necessary configuration options. The sheets framework allows each custom Row class to specify all the columns and dialect parameters directly on the class, so the class now contains everything we need. The direct analogy with csv would be to pass a file and a Row class into a function that then returns a reader object capable of reading the file.

The trouble with that approach is that it requires any code that wants to use the reader to import the sheets module in order to get the function that creates the reader object. Instead, we can get by with just the Row class itself by providing a class method that can do the necessary work. Then, the only argument that method needs to accept is the file to read. To match the existing csv naming conventions, we'll call this new method reader().

In order to work like the standard readers, our own reader() will need to return an iterable object that yields a row for each iteration. That's a simple requirement to fulfill, and it can be done without even involving any new objects. Remember that generator functions actually return an iterable object when they're first called. The body of a generator is then executed on each iteration of a loop, so that makes an ideal way to support a CSV reader.

In order to get the values from a CSV file, reader() can rely on the existing csv module's own reader functionality. The standard csv.reader returns a list for each row in the file, regardless of what the actual values mean or what their names should be. Since a row class can already process arguments that are stored in sequences such as lists, it's very simple to bind the two together.

import csv

class Row(metaclass=RowMeta):
    def __init__(self, *args, **kwargs):
        # First, make sure the arguments make sense
        column_names = [column.name for column in self._dialect.columns]

        if len(args) > len(column_names):
            msg = "__init__() takes at most %d arguments (%d given)"
            raise TypeError(msg % (len(column_names), len(args)))

        for name in kwargs:
            if name not in column_names:
                msg = "__init__() got an unexpected keyword argument '%s'"
                raise TypeError(msg % name)

        for i, name in enumerate(column_names[:len(args)]):
            if name in kwargs:
                msg = "__init__() got multiple values for keyword argument '%s'"
                raise TypeError(msg % name)
            kwargs[name] = args[i]
        # Now populate the actual values on the object
        for column in self._dialect.columns:
            try:
                value = column.to_python(kwargs[column.name])
            except KeyError:
                # No value was provided
                value = None
            setattr(self, column.name, value)

    @classmethod
    def reader(cls, file):
        for values in csv.reader(file):
            yield cls(*values)

This neglects one important aspect of reading from CSV files, though. There are enough variations in how values are stored within a file that you may need to specify some options to control how the file is processed. Earlier, the Dialect class provided a way to specify those options on the Row class, so now we need to pass some of those options along in the call to csv.reader(). In particular, these are the options stored in the dialect's csv_dialect attribute.

    @classmethod
    def reader(cls, file):
        for values in csv.reader(file, **cls._dialect.csv_dialect):
            yield cls(*values)

That covers the options that the csv module already knows about, but remember that our own Dialect class allows for another option to indicate whether the file has a header row. In order to support that feature in the reader, we'll need to add some extra code that skips the first row if the dialect indicates that row would be a header.

    @classmethod
    def reader(cls, file):
        csv_reader = csv.reader(file, **cls._dialect.csv_dialect)

        # Skip the first row if it's a header
        if cls._dialect.has_header_row:
            csv_reader.__next__()

        for values in csv_reader:
            yield cls(*values)

Because all the reader needs to provide is an iterable that yields a row for each object, this method now does everything it needs to. It's not very forward-thinking, though. Since we're building a framework that may need to be improved later, it's always a good idea to at least consider future expansion.

Rather than relying solely on a generator function, a more flexible approach would be to create a new iterable class that will do the same job. As we'll see in the next section, the writer will need a separate class as well, so building this new iterable will create a pair of classes that will be easier to understand. First, the reader() method gets a whole lot simpler.

    @classmethod
    def reader(cls, file):
        return Reader(cls, file)

That delegates all the real work to a new Reader class, which must implement __iter__() and __next__() in order to function as an iterator. There are a few things that need to be stored away in __init__() first, though, including the row class that can create each instance and a csv.reader object to actually read the file.

class Reader:
    def __init__(self, row_cls, file):
        self.row_cls = row_cls
        self.csv_reader = csv.reader(file, **row_cls._dialect.csv_dialect)

The __iter__() method is easy to support because the Reader itself will be the iterator. Therefore, all that's necessary is to return self.

class Reader:
    def __init__(self, row_cls, file):
        self.row_cls = row_cls
        self.csv_reader = csv.reader(file, **row_cls._dialect.csv_dialect)

    def __iter__(self):
        return self

Because __next__() will get called for each iteration, it takes on the obvious task of returning individual row objects. All it needs to do is call __next__() on the csv.reader's iterator, passing the values into the row class that was stored in __init__().

class Reader:
    def __init__(self, row_cls, file):
        self.row_cls = row_cls
        self.csv_reader = csv.reader(file, **row_cls._dialect.csv_dialect)

    def __iter__(self):
        return self

    def __next__(self):
        return self.row_cls(*self.csv_reader.__next__())

You'll remember from Chapter 5 that when manually building an iterator, you have to be careful to raise a StopIteration exception in order to avoid an infinite loop. In this case, we don't have to do that directly because the csv.reader will do that on its own. Once it runs out of records, our own __next__() method just needs to let StopIteration go by without being caught.
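The same behavior is easy to demonstrate with a bare-bones iterator that simply delegates to an inner one; as long as __next__() doesn't catch the inner StopIteration, a for loop over the wrapper ends on its own:

```python
class Wrapper:
    """A minimal iterator that delegates to an inner iterator."""
    def __init__(self, iterable):
        self.inner = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        # When self.inner runs out, its StopIteration propagates
        # up and cleanly ends any loop over the Wrapper as well.
        return self.inner.__next__()

print(list(Wrapper(['a', 'b', 'c'])))  # ['a', 'b', 'c']
```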

The last feature to implement is the header row, which gets slightly more complex. In the generator function shown earlier, it's easy to just deal with the header row before getting into the real loop. As a manual iterator, we have to manage it separately because __next__() will get called from the beginning for each record.

To do so, we'll need to keep a Boolean attribute that indicates whether we still need to skip the header row. At the beginning, that attribute will be the same as the dialect's has_header_row attribute, but once the header row has been skipped, it needs to be reset so that __next__() can yield a valid record every time after that.

class Reader:
    def __init__(self, row_cls, file):
        self.row_cls = row_cls
        self.csv_reader = csv.reader(file, **row_cls._dialect.csv_dialect)
        self.skip_header_row = row_cls._dialect.has_header_row

    def __iter__(self):
        return self

    def __next__(self):
        # Skip the first row if it's a header
        if self.skip_header_row:
            self.csv_reader.__next__()
            self.skip_header_row = False

        return self.row_cls(*self.csv_reader.__next__())

You can test it by supplying a simple CSV file and reading it in. Consider a file containing a rough table of contents, with a column for the chapter number and another for the chapter title. Here's how you could write a Row to represent that file and parse the contents.

>>> import sheets
>>> class Content(sheets.Row):
...     chapter = sheets.IntegerColumn()
...     title = sheets.StringColumn()
...
>>> file = open('contents.csv', newline='')
>>> for entry in Content.reader(file):
...     print('%s: %s' % (entry.chapter, entry.title))
...
1: Principles and Philosophy
2: Advanced Basics
3: Functions
4: Classes
5: Protocols
6: Object Management
7: Strings
8: Documentation
9: Testing
10: Distribution
11: Sheets: A CSV Framework

This completes the transition from rows in a CSV file to individual Python objects. Because each of the rows is an instance of the Content class, you can also define whatever other methods you like and have those available when processing entries from the file. For the other side of the framework, we need a writer to move those objects back into a CSV file.

The Writer

Unlike the reader, the interface for a CSV writer requires some instance methods, so the implementation is a bit more complex. A generator method won't cut it this time around, so we'll need to add a new class to the mix in order to manage the file writing behavior. We can still rely on the csv module's own behavior to do most of the heavy lifting, so this new class only has to manage the additional features of the sheets framework.

The first part of the interface is simple. To mirror the availability of the reader, the writer should be accessible from a method on the Row subclass. This method will also take a file object, but this time it must return a new object rather than doing anything with that file right away. That makes the implementation of this writer() method simple on its own.

    @classmethod
    def writer(cls, file):
        return Writer(file, cls._dialect)

Note

The Writer class can't get by with just the file because it's separate from Row and wouldn't otherwise have access to any of the dialect options.

This obviously doesn't do anything useful yet, though, so the main task is to create and fill out the Writer class. There are two necessary methods to satisfy the writer interface, writerow() and writerows(). The former is responsible for taking a single object and writing out a row to the file, while the latter accepts a sequence of objects, writing them each out as a separate row in the file.

Before starting on either of those methods, Writer needs some basic initialization. The first obvious piece of information it will need access to is the list of columns for the class. Beyond that, it'll also need the CSV options, but those are only necessary to create a writer using the csv module itself, just like the reader did. Lastly, it needs access to the one option that csv doesn't know about on its own: has_header_row.

class Writer:
    def __init__(self, file, dialect):
        self.columns = dialect.columns
        self._writer = csv.writer(file, **dialect.csv_dialect)
        self.needs_header_row = dialect.has_header_row

Before moving on to the all-important writerow() method, notice the header row option is actually named needs_header_row when assigned to the object. This allows writerow() to use that attribute as a flag to indicate whether the header row still needs to be written. If no row is needed in the first place, it starts as False, but if it comes in as True, it can be flipped to False once the header has actually been written to the file.

To write the header row itself, we can defer to the same csv.writer that will be used later to write the value rows. The csv module doesn't care about the overall structure of the file, so we can pass in a row of header values and it'll be processed the same way as all the other rows. Those header values come from the title attribute of each column on the class, but we can use the string's title() method to make them a bit friendlier.
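The title() method simply capitalizes the first letter of each word, which is what makes the header labels friendlier:

```python
# str.title() capitalizes each word, turning a lowercase column
# title into a friendlier header label.
print('chapter number'.title())  # Chapter Number
```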

class Writer:
    def __init__(self, file, dialect):
        self.columns = dialect.columns
        self._writer = csv.writer(file, **dialect.csv_dialect)
        self.needs_header_row = dialect.has_header_row

    def writerow(self, row):
        if self.needs_header_row:
            values = [column.title.title() for column in self.columns]
            self._writer.writerow(values)
            self.needs_header_row = False

With the header out of the way, writerow() can move on to write the actual row that was passed into the method. The code to support the header already lays out most of what needs to be done. The only difference is that, rather than getting the title of each column, the list comprehension needs to get the corresponding value from the row object that was passed in.

class Writer:
    def __init__(self, file, dialect):
        self.columns = dialect.columns
        self._writer = csv.writer(file, dialect.csv_dialect)
        self.needs_header_row = dialect.has_header_row

    def writerow(self, row):
        if self.needs_header_row:
            values = [column.title.title() for column in self.columns]
            self._writer.writerow(values)
            self.needs_header_row = False
        values = [getattr(row, column.name) for column in self.columns]
        self._writer.writerow(values)

Lastly, the writer also needs a writerows() method that can take a sequence of objects and write them out as individual rows. The hard work is already done, so all writerows() needs to do is call writerow() for each object in the sequence that was passed in.

class Writer:
    def __init__(self, file, dialect):
        self.columns = dialect.columns
        self._writer = csv.writer(file, dialect.csv_dialect)
        self.needs_header_row = dialect.has_header_row

    def writerow(self, row):
        if self.needs_header_row:
            values = [column.title.title() for column in self.columns]
            self._writer.writerow(values)
            self.needs_header_row = False
        values = [getattr(row, column.name) for column in self.columns]
        self._writer.writerow(values)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
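To see the finished class in action without the rest of the framework, here's a minimal, self-contained sketch. It restates the Writer class exactly as above, but stands in for the framework's column and dialect objects with SimpleNamespace test doubles (those doubles are purely illustrative; the real framework supplies proper Column and Dialect classes), and writes to an in-memory buffer rather than a real file:

```python
import csv
import io
from types import SimpleNamespace

class Writer:
    def __init__(self, file, dialect):
        self.columns = dialect.columns
        self._writer = csv.writer(file, dialect.csv_dialect)
        self.needs_header_row = dialect.has_header_row

    def writerow(self, row):
        if self.needs_header_row:
            # Write the header once, then flip the flag so it's skipped later
            values = [column.title.title() for column in self.columns]
            self._writer.writerow(values)
            self.needs_header_row = False
        values = [getattr(row, column.name) for column in self.columns]
        self._writer.writerow(values)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

# Stand-ins for the framework's column and dialect objects
columns = [
    SimpleNamespace(name='chapter', title='chapter'),
    SimpleNamespace(name='title', title='title'),
]
dialect = SimpleNamespace(columns=columns, csv_dialect='excel',
                          has_header_row=True)

output = io.StringIO()
writer = Writer(output, dialect)
writer.writerows([
    SimpleNamespace(chapter=1, title='Principles'),
    SimpleNamespace(chapter=2, title='Advanced Basics'),
])
print(output.getvalue())
# Chapter,Title
# 1,Principles
# 2,Advanced Basics
```

Note that the header row appears exactly once, even though writerow() was called twice, because needs_header_row is flipped to False on the first call.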

With a CSV reader and writer, the sheets framework is complete. You can add more column classes to support additional data types or add more dialect options based on more specific needs you may have, but the framework on the whole is intact. You can verify the full functionality by reading an existing file and writing it back out to a new file. As long as all the dialect parameters match the file's structure, the contents of the two files will be identical.

>>> import sheets
>>> class Content(sheets.Row):
...     chapter = sheets.IntegerColumn()
...     title = sheets.StringColumn()
...
>>> input = open('contents.csv', newline='')
>>> reader = Content.reader(input)
>>> output = open('compare.csv', 'w', newline='')
>>> writer = Content.writer(output)
>>> writer.writerows(reader)
>>> input.close()
>>> output.close()
>>> open('contents.csv').read() == open('compare.csv').read()
True

Taking It With You

In this chapter, you've seen how to plan, build and customize a framework using many of the tools Python makes available. What was a complicated task that would have had to be repeated multiple times has been reduced to a reusable and extendable tool. This is just one example of how the techniques in this book can combine for such a complex task, though. The rest is up to you.



[20] http://propython.com/django/

[21] http://propython.com/sqlalchemy/

[22] http://propython.com/twisted/

[23] http://propython.com/elixir/

[24] http://propython.com/csv-module/

[25] http://propython.com/datetime-formatting/
