© J. Burton Browning and Marty Alchin 2019
J. Burton Browning and Marty Alchin, Pro Python 3, https://doi.org/10.1007/978-1-4842-4385-5_6

6. Object Management

J. Burton Browning1  and Marty Alchin2
(1)
Oak Island, NC, USA
(2)
Agoura Hills, CA, USA
 

Creating an instance of a class is only the beginning; once you have an object, there are a number of things you can do with it. This is obvious, of course, because objects have methods and attributes that are intended to control their behavior, but those are defined by each class. Objects, as a whole, have an additional set of features that allow you to manage them in a number of different ways.

In order to understand these features, it’s first necessary to understand what actually constitutes an object. At a high level, an object is simply the product of data and behavior, but internally, Python considers an object to be a combination of three specific things (five if you add base classes and attributes):
  • Identity : Each object is unique, with an identity that can be used to compare objects to each other without having to look at any other details. This comparison, using the is operator, is very strict, however, without access to any of the subtleties outlined in Chapter 5. In actual implementations, an object’s identity is simply its address in memory, so no two objects can ever have the same identity.

  • Type : The subject of the previous two chapters, an object’s type is defined by its class and any base classes that support it. Unlike identity, a type is shared among all of its instances; each object simply contains a reference to its class.

  • Value : With a shared type to provide behavior, each object also has a value that makes it distinct among its peers. This value is provided by a namespace dictionary that is specific to a given object, where any aspect of its individuality can be stored and retrieved. This is different from the identity, however, because the value is designed to work with the type to do useful things; identity is unrelated to the type at all, so it doesn’t have anything to do with the behaviors specified for the class.

These three things can be referenced and, in some cases, changed to suit the needs of an application. An object’s identity can’t be modified at any time, so it remains constant for the life of the object. But once the object is destroyed, its identity can—and often will—be reused for a future object, which then retains that identity until it, too, is destroyed.

If you want to retrieve an object’s identity at any time, pass the object into the built-in id() function; the object itself doesn’t know anything about its identity. In fact, the identity isn’t related to anything specific to the object: none of its attributes have any bearing on it. Therefore, you won’t get the same identity if you instantiate what would otherwise be an identical object. The identity also depends on where the object happens to land in memory, so the integer returned in one session will most likely differ in another session. Types have been covered thoroughly in the previous two chapters, so the next obvious component is the value, which is implemented by way of a namespace dictionary.
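To see the distinction between value and identity concretely, here is a minimal sketch: two otherwise identical lists are still two distinct objects, while two names bound to the same object share one identity (the exact integers returned by id() will vary by session):

```python
# Identity vs. equality: two equal lists are still two distinct objects.
a = [1, 2, 3]
b = [1, 2, 3]

print(a == b)          # True: equal values
print(a is b)          # False: distinct identities
print(id(a) == id(b))  # False: 'is' is just an identity comparison

c = a                  # A second name bound to the same object...
print(c is a)          # True: one object, one identity
```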

Namespace Dictionary

As hinted at previously, an object’s namespace is implemented as a dictionary that is created for each new object as it’s being instantiated. This is then used to store values for all the attributes on the object, thus comprising the value for the object as a whole.

Unlike the identity, however, this namespace dictionary can be accessed and modified at runtime, as it’s available as the __dict__ attribute on an object. In fact, because it’s an attribute, it can even be replaced with a new dictionary altogether. This is the basis of what’s commonly referred to as the Borg pattern, named after the collective consciousness from the Star Trek universe.
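A quick demonstration of both behaviors, using an illustrative class: attribute assignment is really a dictionary update, and the dictionary itself can be swapped out wholesale:

```python
class Coordinates:
    def __init__(self, x, y):
        self.x = x
        self.y = y

point = Coordinates(3, 4)
print(point.__dict__)            # {'x': 3, 'y': 4}

# Assigning an attribute is really just a dictionary update...
point.label = 'home'
print(point.__dict__['label'])   # 'home'

# ...and the whole dictionary can be replaced at runtime.
point.__dict__ = {'x': 0, 'y': 0}
print(point.x)                   # 0
```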

Example: Borg Pattern

Like its namesake, the Borg pattern allows a large number of instances to share a single namespace. In this way the identity for each object remains distinct, but its attributes—and thus its behaviors—are always the same as all of its peers. This primarily allows a class to be used in applications in which it could be instantiated several times, with potential modifications made to it each time. By using the Borg pattern these changes can be accumulated in a single namespace, so each instance reflects all the changes that have been made to each object.

This is achieved by attaching a dictionary to the class and then assigning that dictionary to the namespace of each object as it is being instantiated. As Chapter 4 demonstrated, there are two methods that could do this: __init__() and __new__(). Because both methods execute during instantiation of the object, they seem to be equally viable options. However, let’s take a look at how they would each work individually.

The __init__() method is the usual place to start because it’s much better understood and more widely adopted. This method typically initializes instance attributes, so the dictionary assignment would need to take place prior to any other initialization. That’s easy enough to do, however, by simply placing it at the beginning of the method. Here’s how this would work:

>>> class Borg:
...     _namespace = {}
...     def __init__(self):
...         self.__dict__ = Borg._namespace
...         # Do more interesting stuff here.
...
>>> a = Borg()
>>> b = Borg()
>>> hasattr(a, 'attribute')
False
>>> b.attribute = 'value'
>>> hasattr(a, 'attribute')
True
>>> a.attribute
'value'
>>> Borg._namespace
{'attribute': 'value'}

This certainly does the job, but there are a few pitfalls with the approach, particularly when you start working with inheritance. All subclasses would need to make sure they use super() in order to call the initialization procedures from the Borg class. If any subclass fails to do so, it won’t use the shared namespace; nor will any of its subclasses, even if they do use super(). Furthermore, subclasses should use super() before doing any attribute assignments of their own. Otherwise, those assignments will get overwritten by the shared namespace.
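A minimal sketch of the pitfall, using the Borg class shown above with illustrative subclass names: a subclass that calls super() first joins the shared namespace, while one that skips super() quietly keeps its own private namespace.

```python
class Borg:
    _namespace = {}
    def __init__(self):
        self.__dict__ = Borg._namespace

class Shared(Borg):
    def __init__(self):
        super().__init__()       # Joins the shared namespace first...
        self.flavor = 'shared'   # ...so this lands in Borg._namespace

class Isolated(Borg):
    def __init__(self):
        # No super() call: this instance keeps its own private dictionary
        self.flavor = 'isolated'

a = Shared()
b = Borg()
print(b.flavor)                  # 'shared': visible via the shared namespace

c = Isolated()
print(c.flavor)                  # 'isolated': only on this instance
print(Borg._namespace['flavor']) # 'shared': Isolated never touched it
```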

That only applies when Borg is applied to other classes that know about it, however. The problem is even more pronounced when working with Borg as a mixin, because it would get applied alongside classes that don’t know about it—and they shouldn’t have to. But because they can get combined anyway, it’s worth examining what would happen:

>>> class Base:
...     def __init__(self):
...         print('Base')
...
>>> class Borg:
...     _namespace = {}
...     def __init__(self, *args, **kwargs):
...         self.__dict__ = Borg._namespace
...         print('Borg')
...
>>> class Testing(Borg, Base):
...     pass
...
>>> Testing()
Borg
<__main__.Testing object at 0x...>
>>> class Testing(Base, Borg):
...     pass
...
>>> Testing()
Base
<__main__.Testing object at 0x...>

As you can see, this exhibits the typical problem when not using super(), where the order of base classes can completely exclude the behaviors of one or more of them. The solution, of course, is to just use super(), but in the case of mixins, you typically don’t have control over both the classes involved. Adding super() would suffice in the case of Borg coming before its peer, but mixins are usually applied after their peers, so it doesn’t really help much.

With all this in mind, it’s worth considering the alternative: the __new__() method. Both methods are vulnerable to the same types of problems that were shown for __init__(), but at least we can reduce the chance of collisions that would cause those problems. Because the __new__() method is less commonly implemented, the odds of running into conflicting implementations are much smaller.

When implementing the Borg pattern with __new__() , the object must be created along the way, usually by calling __new__() on the base object. In order to play nicely with other classes as a mixin, however, it’s still better to use super() here as well. Once the object is created, we can replace its namespace dictionary with one for the entire class:

>>> class Base:
...     def __init__(self):
...         print('Base')
...
>>> class Borg:
...     _namespace = {}
...     def __new__(cls, *args, **kwargs):
...         print('Borg')
...         obj = super(Borg, cls).__new__(cls, *args, **kwargs)
...         obj.__dict__ = cls._namespace
...         return obj
...
>>> class Testing(Borg, Base):
...     pass
...
>>> Testing()
Borg
Base
<__main__.Testing object at 0x...>
>>> class Testing(Base, Borg):
...     pass
...
>>> Testing()
Borg
Base
<__main__.Testing object at 0x...>
>>> a = Testing()
Borg
Base
>>> b = Testing()
Borg
Base
>>> a.attribute = 'value'
>>> b.attribute
'value'

Now, Borg comes first in the most common situations, without any unusual requirements on any classes that operate alongside them. There’s still one problem with this implementation, however, and it’s not very obvious from this example. As a mixin, Borg could be applied in any class definition, and you might expect that its namespace behavior would be limited to that defined class and its subclasses.

Unfortunately, that’s not what would happen. Because the _namespace dictionary is on Borg itself, it’ll be shared among all the classes that inherit from Borg at all. In order to break that out and apply it only to those classes where Borg is applied, a slightly different technique is necessary.

Because the __new__() method receives the class as its first positional argument, the Borg mixin can use that object as a namespace on its own, thereby splitting up the managed dictionary into individual namespaces, with one for each class that is used. In a nutshell, Borg.__new__() must create a new dictionary for each new class it encounters, assigning it to a value in the existing _namespace dictionary, using the class object as its key:

>>> class Borg:
...     _namespace = {}
...     def __new__(cls, *args, **kwargs):
...         obj = super(Borg, cls).__new__(cls, *args, **kwargs)
...         obj.__dict__ = cls._namespace.setdefault(cls, {})
...         return obj
...
>>> class TestOne(Borg):
...     pass
...
>>> class TestTwo(Borg):
...     pass
...
>>> a = TestOne()
>>> b = TestOne()
>>> a.spam = 'eggs'
>>> b.spam
'eggs'
>>> c = TestTwo()
>>> c.spam
Traceback (most recent call last):
  ...
AttributeError: 'TestTwo' object has no attribute 'spam'
>>> c.spam = 'burger'
>>> d = TestTwo()
>>> d.spam
'burger'
>>> a.spam
'eggs'

As you can see, by using cls as a kind of namespace of its own, we can compartmentalize the managed values on a per-class basis. All instances of TestOne share the same namespace, whereas all instances of TestTwo share a separate namespace, so there’s never any overlap between the two.

Example: Self-Caching Properties

Even though attributes are the primary means of accessing an object’s namespace dictionary, remember from Chapter 4 that attribute access can be customized using special methods, such as __getattr__() and __setattr__() . Those methods are what Python actually uses when accessing an attribute, and it’s up to those methods to look things up in the namespace dictionary internally. If you were to define them in pure Python, they’d look a lot like this:

class object:
    def __getattr__(self, name):
        try:
            return self.__dict__[name]
        except KeyError:
            raise AttributeError('%s object has no attribute named %s'
                % (self.__class__.__name__, name))
    def __setattr__(self, name, value):
        self.__dict__[name] = value
    def __delattr__(self, name):
        try:
            del self.__dict__[name]
        except KeyError:
            raise AttributeError('%s object has no attribute named %s'
                % (self.__class__.__name__, name))

As you can see, every access to the attribute performs a lookup in the namespace, raising an error if it wasn’t there. This means that in order to retrieve an attribute, its value must have been created and stored previously. For most cases this behavior is appropriate, but in some cases the attribute’s value can be a complex object that’s expensive to create, and it might not get used very often, so it’s not very advantageous to create it along with its host object.

One common example of this situation is an Object-Relational Mapping (ORM) sitting between application code and a relational database. When retrieving information about a person, for instance, you’d get a Person object in Python. That person might also have a spouse, children, a house, an employer, or even a wardrobe filled with clothing, all of which could also be represented in the database as related to the person you’ve retrieved.

If we were to access all of that information as attributes, the simple approach described previously would require all of that data to be pulled out of the database every time a person is retrieved. Then, all of that data must be collected into separate objects for each of the types of data: Person, House, Company, Clothing, and probably a host of others. Worse yet, each of those related objects has other relationships that would be accessible as attributes, which can quickly make it seem like you need to load up the entire database every time a query is made.

Instead, the obvious solution is to load that information only when requested. By keeping track of a unique identifier for the person, along with a set of queries that know how to retrieve the related information, methods can be added that will be able to retrieve that information when necessary.

Unfortunately, methods are expected to perform their task every time they’re called. If you need the person’s employer, for example, you’d have to call a Person.get_employer() method, which would make a query in the database and return the result. If you call the method again another query is made, even though it’s often unnecessary. This could be avoided by storing the employer as a separate variable, which could be reused instead of calling the method again, but that doesn’t hold up once you start passing the Person object around to different functions that might have different needs.

Instead, a preferable solution is to make an attribute that starts out with as little information as possible—perhaps even none at all. Then, when that attribute is accessed, the database query is made, returning the appropriate object. This related object can then be stored in the main object’s namespace dictionary, where it can be accessed directly later on, without having to hit the database again.

Querying a database when accessing an attribute is a fairly easy task, actually. Applying the @property decorator to a method will produce the desired effect, calling the function whenever the attribute is accessed. Caching its return value requires a bit more finesse, but it’s really fairly simple: return the existing value if there’s already one in the object’s namespace, and compute and store a new one otherwise.

This could be simply added into the behavior of an existing property, as it only requires a few extra lines of code to support. Here’s all it would take:

class Example:
    @property
    def attribute(self):
        if 'attribute' not in self.__dict__:
            # Do the real work of retrieving the value
            self.__dict__['attribute'] = value
        return self.__dict__['attribute']
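Filling in the placeholder with a concrete (purely illustrative) computation shows the effect: the expensive work runs only on first access, because the getter finds the cached value in the namespace dictionary on every access after that.

```python
class Example:
    @property
    def attribute(self):
        if 'attribute' not in self.__dict__:
            print('Computing the value...')    # Stands in for expensive work
            self.__dict__['attribute'] = 42
        return self.__dict__['attribute']

e = Example()
print(e.attribute)   # Prints the message, then 42
print(e.attribute)   # Cached: just 42, no recomputation
```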

Caution

When caching property values like this, be careful to check that the computed value shouldn’t change based on the values of other attributes. Computing a full name based on first and last names, for example, is a poor candidate for caching because changing the first name or last name should change the value of the full name as well; caching would cause incorrect behavior.

Notice, however, that this really just performs a little work before the real code and a little bit afterward, making it an ideal task for a decorator. Here’s what that decorator could look like:

import functools
def cachedproperty(name):
    def decorator(func):
        @property
        @functools.wraps(func)
        def wrapper(self):
            if name not in self.__dict__:
                self.__dict__[name] = func(self)
            return self.__dict__[name]
        return wrapper
    return decorator

Once applied to a function, cachedproperty() will work like a standard property, but with the caching behavior applied automatically. The one difference you’ll notice, however, is that you must supply the name of the attribute as an argument to cachedproperty() in addition to naming the function that you’re decorating. Assuming you typed in the previous function, here’s how it would look:

>>> class Example:
...     @cachedproperty('attr')
...     def attr(self):
...         print('Getting the value!')
...         return 42
...
>>> e = Example()
>>> e.attr
Getting the value!
42
>>> e.attr
42

Why must the name be supplied twice? The problem, as mentioned in previous chapters, is that descriptors, including properties, don’t get access to the names they’re given. Because the cached value is stored in the object namespace according to the name of the attribute, we need a way to pass that name into the property itself. This is a clear violation of DRY, however, so let’s see what other techniques are available and what their pitfalls would be.
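It’s worth noting that since Python 3.6, descriptors can be told their own name through the __set_name__() hook, which sidesteps the duplication entirely. Here’s a sketch of a cached property built on that hook (a variation on the decorator above, not the original code):

```python
class cachedproperty:
    """Sketch of a name-aware cached property (requires Python 3.6+)."""
    def __init__(self, func):
        self.func = func
        self.name = None

    def __set_name__(self, owner, name):
        # Python calls this while building the class, passing the
        # attribute name this descriptor was assigned to.
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.name not in obj.__dict__:
            obj.__dict__[self.name] = self.func(obj)
        return obj.__dict__[self.name]

class Example:
    @cachedproperty
    def attr(self):
        print('Getting the value!')
        return 42

e = Example()
print(e.attr)   # Prints the message, then 42
print(e.attr)   # Cached: just 42
```

Because this descriptor defines no __set__(), it’s a non-data descriptor: once the value lands in the instance dictionary, ordinary attribute lookup finds it there first and never calls __get__() again. This is essentially the strategy functools.cached_property (added in Python 3.8) uses.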

One option would be to store a dictionary on the cached property descriptor directly, using object instances as keys. Each descriptor would get a unique dictionary, and each key would be a unique object, so you’d be able to store as many values as you have objects that have the attribute attached:

def cachedproperty(func):
    values = {}
    @property
    @functools.wraps(func)
    def wrapper(self):
        if self not in values:
            values[self] = func(self)
        return values[self]
    return wrapper

This new decorator allows you to cache an attribute without having to specify the name. If you’re skeptical about it, however, you might wonder about storing those values in a single dictionary for all objects, without referencing the name of the attribute. After all, that would seem to mean that if you had more than one cached property on a single object, their values would overwrite each other and you’d have all sorts of confusion.

That’s not a problem in this situation, however, because the dictionary is created inside of the cachedproperty() function, which means each property gets its own dictionary name values. This way there’s no chance of collision, no matter how many cached properties you place on an object. The dictionary will be shared only if you assign an existing property to a new name without redefining it. In that case, the second name should always behave exactly like the first, and the cache described here will still maintain that behavior.

However, there is one other problem with this property that may not be so obvious. Believe it or not, this contains a memory leak, which could be severely harmful if it gets used in a large part of an application without being fixed (this will be discussed shortly in more detail).

In some cases the best fix will be to simply go back to the first form described in this chapter, where the attribute’s name is provided explicitly. Because the name isn’t provided to a descriptor, this approach would require the use of a metaclass. Of course, metaclasses are overkill for simple situations like this, but in cases in which a metaclass is used for other reasons anyway, having the name available can be quite useful. Chapter 11 showcases a framework that uses the metaclass approach to great effect.
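A minimal sketch of that metaclass idea, with hypothetical names: the metaclass walks the class namespace as the class is built and hands each descriptor the name it was assigned to.

```python
class NamedDescriptor:
    """Hypothetical descriptor base that wants to know its attribute name."""
    name = None

class NamingMeta(type):
    def __new__(meta, classname, bases, namespace):
        cls = super().__new__(meta, classname, bases, namespace)
        for attr_name, value in namespace.items():
            if isinstance(value, NamedDescriptor):
                value.name = attr_name   # Tell the descriptor its own name
        return cls

class Example(metaclass=NamingMeta):
    field = NamedDescriptor()

print(Example.field.name)   # 'field'
```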

In order to avoid using a metaclass, it’s first necessary to understand what the memory leak is, why it’s happening, and how we can avoid it. It all has to do with how Python removes objects from memory when they’re no longer in use, a process called garbage collection.

Garbage Collection

Unlike lower-level languages like C, Python doesn’t require you to manage your own memory usage. You don’t have to allocate a certain amount of memory for an object or remove your claim on that memory when the object is no longer needed. In fact, you often don’t even need to worry about how much memory an object will take up or how to determine when it’s no longer needed. Python handles those gritty details behind the scenes.

Garbage collection is easy to understand: Python deletes any objects that are identified as garbage, clearing whatever memory they were using so that memory is available for other objects. Without this process every object created would stay in memory forever, and you’d slowly—or quickly—run out of memory, at which point everything comes to a grinding halt.

As you probably noticed, effective garbage collection first requires the ability to reliably identify an object as garbage. Even with the ability to remove garbage from memory, failing to recognize garbage will cause memory leaks to creep into an application. The last example in the previous section contains a simple situation that can cause Python to not notice when an object becomes garbage, so we need to examine how that gets determined. It is important to note that because Python is dynamically typed (you do not explicitly declare a variable’s type), a name can be rebound to objects of different types during a session. CPython also caches small integers, so rebinding a name to a previously used small value yields the very same object, and therefore the same identity. The next terminal session shows this by printing the variable’s location in memory; note that it returns to the original location when the original value is restored:

>>> x=10
>>> type(x)
<class 'int'>
>>> id(x)  #location of x
1368047320
>>> x="foobar"
>>> type(x)
<class 'str'>
>>> id(x)  #location of x as a string instead of int
62523328
>>> x=10
>>> id(x)  #back to the original location of x as an int at 10
1368047320

Reference Counting

At a high level, an object is considered garbage when it’s no longer accessible by any code. In order to determine whether an object is accessible, Python counts how many data structures refer to the object at any given time.

The most obvious way to reference an object is to assign it in any namespace, including modules, classes, objects, and even dictionaries. Other types of references include any kind of container object, such as a list, tuple, or set. Even less obvious is that every function has its own namespace, which can contain references to objects, even in the case of closures. Essentially, anything that provides access to an object increases its reference count. In turn, removing an object from such a container decreases its reference count.

To illustrate, here are a few examples of situations that would create new references:

>>> a = [1, 2, 3]
>>> b = {'example': a}
>>> c = a

After executing these three lines of code, there are now three references to the list [1, 2, 3]. Two of them are fairly obvious, when it was assigned to a and later reassigned to c. The dictionary at b also has a reference to that list, however, as the value of its 'example' key. That dictionary, in turn, has just one reference, having been assigned as the value of b.

The del statement is perhaps the most obvious way to remove a reference to an object, but it’s not the only option. If you replace a reference to one object with a reference to another (rebind it), you’ll also implicitly remove the reference to the first object. For example, if we were to run these two lines of code, we’d end up with just one reference to the list that was originally assigned to a:

>>> del c
>>> a = None

Even though it’s no longer available in the root namespace, that list is still available as part of the dictionary, which itself is still accessible as b. Therefore, they each have just one reference, and neither will be garbage collected. If you were to del b right now, the reference count for the dictionary becomes zero and will be eligible for garbage collection. Once that’s been collected, the reference count for the list is reduced to zero and is collected as garbage.

Tip

By default, Python simply clears out the memory that was occupied by the object. You don’t need to do anything in order to support that behavior, and it works just fine for most cases. In the rare event that an object has some special needs to address when it’s deleted, the __del__() method can provide this customization.

Instead of deleting objects, there are a number of other things you can do with them as well. Here’s a look at a very different situation that can alter the way reference counting works.

Cyclical References

Consider the scenario in which you have a dictionary that refers to a list as one of its values. Because lists are containers as well, you could actually append the dictionary as a value to the list. What you end up with is a cyclical reference, where each object refers to the other. To extend the previous examples, let’s examine what would happen with this line of code:

>>> b['example'].append(b)

Prior to this the dictionary and the list had one reference each, but now the dictionary gains another reference by being included as a member of the inner list. This situation will work just fine in normal operation, but it does present an interesting problem when it comes to garbage collection.

Remember that using del b would decrease the reference count of the dictionary by one, but now that the list also contains a reference to that same dictionary, its reference count goes from two to one, rather than dropping to zero. With a reference count above zero, the dictionary wouldn’t be considered garbage and it would stay in memory, along with its reference to the list. Therefore, the list also has a reference count of one, keeping it in memory.

What’s the problem here? Well, after you delete the reference at the variable b, the references those two objects have to each other are now the only references they have in the entire Python interpreter. They’re completely cut off from any code that will continue executing, but because garbage collection uses reference counts, they’ll stay in memory forever unless something else is done.

To address this, Python’s garbage collection comes with code designed to spot these structures when they occur, so they can be removed from memory as well. Any time a set of objects is referenced only by other objects in that set—and not from anywhere else in memory—it's flagged as a reference cycle. This allows the garbage collection system to reclaim the memory it was using.

Things start to get really tricky when you implement __del__(), however. Ordinarily, __del__() works just fine because Python can intelligently figure out when to delete the object. Therefore, __del__() can be executed in a predictable manner, even when multiple objects are deleted within a short span.

When Python encounters a reference cycle that’s inaccessible from any other code, it doesn’t know the order to delete the objects in that cycle. This becomes a problem with the custom __del__() method, because it could act on related objects as well. If one object is part of an orphaned reference cycle, any related objects are all also scheduled for deletion, so which one should fire first?

After all, each object in the cycle could reference one or more of the other objects in that same cycle. Without an object to be considered first, Python would have to simply guess which one it should be. Unfortunately, that leads to behavior that is not only unpredictable but also unreliable across the many times it could occur.

Therefore, Python has to take one of only two predictable, reliable courses of action. One option would be to simply ignore the __del__() method and delete the object just as it would if the __del__() method wasn’t found. Unfortunately, that changes the behavior of the object based on things outside that object’s control.

The other option, which Python does take, is to leave the object in memory. This avoids the problem of trying to order a variety of __del__() methods while maintaining the behavior of the object itself. The problem, however, is that this is in fact a memory leak, and it’s only there because Python can’t make a reliable assumption about your intentions.

In the Face of Ambiguity, Refuse the Temptation to Guess

This situation with __del__() in a cyclical reference is a perfect example of ambiguity because there’s no clear way to handle the situation. Rather than guess, Python sidesteps it by simply leaving the objects in memory. It’s not the most memory-efficient way to address the problem, but consistency is far more important in situations like this. Even though it potentially means more work for the programmer, that extra work results in much more explicit, reliable behavior.

There are three ways you can avoid this problem. First, you can avoid having any objects with __del__() methods involved in any cyclical references. The easiest way to accomplish that is to avoid the __del__() method entirely. Most of the common reasons to customize an object’s teardown are more appropriately handled using a context manager.
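For example, a class that wraps a resource can expose its teardown through the context management protocol rather than __del__(), so cleanup happens at a well-defined point. A minimal sketch, with purely illustrative names:

```python
class Connection:
    """Hypothetical resource wrapper: cleanup is explicit, not left to __del__()."""
    def __init__(self, address):
        self.address = address
        self.open = True      # Stands in for acquiring a real resource

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False          # Don't suppress exceptions

    def close(self):
        self.open = False     # Stands in for releasing the resource

with Connection('example') as conn:
    print(conn.open)          # True: resource is live inside the block
print(conn.open)              # False: closed deterministically on exit
```

Because the cleanup runs when the with block exits, it never depends on the garbage collector’s timing or ordering decisions.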

In those rare cases in which __del__() proves necessary, the second option is to simply avoid having the objects appear in reference cycles. That’s not always easy to do, however, because it requires you to have complete control over all the ways the object might be used. That might work for some highly internalized implementation details, but if it’s part of a public interface, it’s probably not an option.

Finally, if you can’t prevent the cycles from being orphaned, Python does provide a way that you can still detect them and have a chance to clean them up on a regular basis. Once all other references are removed and the garbage collection cycle runs, Python keeps the entire cycle alive by placing each object involved into a special list, available in the gc module.

The gc module provides a few options that are useful for getting into the guts of the garbage collection system, but the factor at hand here is the garbage attribute. This attribute contains objects that are otherwise unreachable but are part of a cycle that includes __del__() somewhere along the line. Accessing them as part of gc.garbage allows you to try to break the cycle after the fact, which will allow their memory to be relinquished. Note that since Python 3.4 (PEP 442), the interpreter can finalize objects with __del__() methods even inside reference cycles, so gc.garbage typically stays empty on modern interpreters; the behavior shown next reflects earlier versions.

Consider the following example, which also shows the usage of gc.collect() , a module-level function that manually runs the garbage collector so that cyclical references are detected and placed in gc.garbage accordingly:

>>> import gc
>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __repr__(self):
...         return 'Example %s' % self.value
...     def __del__(self):
...         print('Deleting %r' % self)
...
>>> e = Example(1)
>>> e
Example 1
>>> del e
>>> gc.collect()
Deleting Example 1
0
# Now let's try it with a cyclical reference
>>> e = Example(2)
>>> e.attr = e
>>> del e
>>> gc.collect()
2
>>> gc.garbage
[Example 2]
# From here, we can break the cycle and remove it from memory
>>> e = gc.garbage[0]
>>> del e.attr
>>> del e
>>> gc.collect()
0
>>> gc.garbage
[Example 2]
# Don't forget to clear out gc.garbage as well
>>> gc.garbage[:] = []
Deleting Example 2
>>> gc.garbage
[]

In the real world, however, __del__() is rarely needed, and it’s even more rare to run into very severe problems with cyclical references. Far more common, however, is the need to adjust how references themselves are created and what to do when you don’t really need a reference all your own.

Weak References

As we’ve seen, assigning an object creates a reference to it, and those references keep that object alive in memory. But what happens when you need to access an object but you don’t care to keep it alive? For this, Python provides the concept of a weak reference: you get a reference to the object without increasing its reference count.

By getting a reference without increasing the object’s reference count, you can perform operations on that object without getting in the way of how it would ordinarily be deleted. This can be very important for applications that register objects for use later. The registry itself keeps references to all the registered objects, which ordinarily wouldn’t get deleted, because the application that knows about the object typically doesn’t know anything about the registration system.

Creating a weak reference is fairly simple, thanks to the weakref module in the standard library. The ref() class within that module creates a weak reference to whatever object is passed into it, allowing that reference to be used later. To provide access to the original object, a weak reference is a callable object that takes no arguments and returns the object.

In order to see how the weak reference behaves, we first have to store a regular reference to the object outside the weak reference. That way we can not only create a weak reference that has access to the object, but also delete the additional reference afterward to see how the weak reference responds:

>>> import weakref
>>> class Example:
...     pass
...
>>> e = Example()
>>> e
<__main__.Example object at 0x...>
>>> ref = weakref.ref(e)
>>> ref
<weakref at ...; to 'Example' at ...>
>>> ref()
<__main__.Example object at 0x...>
>>> del e
>>> ref
<weakref at ...; dead>
>>> ref()
>>>

As you can see, as long as there’s at least one other reference keeping the object alive, the weak reference has easy access to it. Once the object is deleted elsewhere, the weak reference object itself is still available, but it simply returns None when called. We could make the example even simpler as well, by passing a new object directly into the weak reference:

>>> ref = weakref.ref(Example())
>>> ref
<weakref at ...; dead>
>>> ref()
>>>

Wait, what just happened? Where did the Example object go? This simple example illustrates one of the most common problems you’re likely to encounter with weak references. Because you’re instantiating the object as part of the call to ref(), the only reference that gets created for that object is inside of ref().

Ordinarily that would be fine, but that particular reference doesn’t help keep the object alive, so the object is immediately marked for garbage collection. The weak reference provides access to the object only if there’s something else to keep it alive, so in this case, the reference simply returns None when called. That situation may seem obvious, but there are a few others that may come up when you least expect them.

One such situation that can come up involves creating a weak reference inside of a function:

>>> def example():
...     e = Example()
...     ref = weakref.ref(e)
...     return ref
...
>>> e = example()
>>> e
<weakref at ...; dead>
>>> e()
>>>

As you can see, even though the example() function stores a strong reference inside itself, the weak reference goes dead immediately. The problem here is that every function gets a brand-new namespace every time it executes, and it’s deleted when the function finishes, because execution is the only thing keeping it alive.

By default, all assignments in the function take place in that namespace, so once it’s destroyed any objects assigned are destroyed as well unless they have references stored elsewhere. In this case the only other reference to the Example object is weak, so the object gets destroyed once the example() function returns.

The recurring theme here is that weak references can cause problems when used along with any kind of implicit reference removal. We’ve discussed two already, but there are other similar situations as well. For example, a for loop automatically assigns at least one variable each time the loop begins, overwriting any values that were previously assigned to the same name. Because that also destroys the reference to whatever object was used in the previous iteration, a weak reference created inside the loop isn’t enough to keep that object alive.
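The registry scenario mentioned earlier can be sketched with weakref.WeakValueDictionary, which stores only weak references to its values and drops each entry as soon as its object dies. This is a minimal illustration; the Plugin class and registry name are hypothetical, and the immediate cleanup after del relies on CPython's reference counting:

```python
import weakref

# A registry that holds its entries weakly, so registration alone
# never keeps an object alive.
registry = weakref.WeakValueDictionary()

class Plugin:
    pass

p = Plugin()
registry['example'] = p
print('example' in registry)   # True: the object is still alive

del p                          # remove the only strong reference
print('example' in registry)   # False in CPython: the entry is gone
```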

Pickling

So far we’ve only discussed how objects are handled inside of Python, but it’s often necessary to exchange data with external processes such as files, databases, and network protocols. Most of the time the structure of that data outside of Python is already established, so your application will need to adhere to that structure. Other times, however, the only reason to send the data somewhere else is to store it for a while and read it back into Python later. The pickle module converts a Python object such as a list or dictionary into a persistent byte stream that can be reloaded later to recreate the object in a different Python process; it is used for serializing and deserializing Python objects to and from files.

In this case, the external system really doesn’t care what your data is or how it’s structured; as long as it’s a data type the system can understand, it should be usable. Note that functions and classes are pickled by reference to their qualified names rather than by value, so lambdas and locally defined functions can’t be pickled. Because the most flexible and widely supported data type is a byte string, Python exports its data structures to byte strings, and for this it provides the pickle module. PEP 3137, by Guido van Rossum, has some very interesting details on byte types and strings.

In the real world, pickling is a way of preserving food so it can be stored for long periods of time and consumed much later. Without preservation techniques like pickling, food would have to be consumed almost immediately after it’s produced. The same is true for data: it’s easy to consume shortly after it’s produced, but saving it for later requires some extra work.

The action of pickling is performed using the pickle module’s dump() or dumps() functions . Both take the object to pickle as the first argument, but they differ in where they put the byte string representing that object. In the case of dump(), a second required argument specifies a writable, binary-mode, file-like object that the function will use as the destination for the pickled value. The dumps() function, by contrast, simply returns the byte string directly, allowing the code that called the function to decide where to put it. Beyond that, the two functions are identical, and the examples throughout the rest of this section will use dumps() , as it shows the output much more easily:

>>> import pickle
>>> pickle.dumps(1)
b'\x80\x03K\x01.'
>>> pickle.dumps(42)
b'\x80\x03K*.'
>>> pickle.dumps('42')
b'\x80\x03X\x02\x00\x00\x0042q\x00.'

As you can see, the pickled output can contain more information than the original object’s value because it also needs to store the type, so the object can be reconstituted later.

Once a value has been pickled, the resulting byte string can be stored or passed around however your application requires. Once it’s time to retrieve the object back into Python, the pickle module provides two additional functions, load() and loads(). The difference between the two is similar to the dump functions: load() accepts a readable file-like object, while loads() accepts a byte string:

>>> pickled = pickle.dumps(42)
>>> pickled
b'\x80\x03K*.'
>>> pickle.loads(pickled)
42
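The file-based dump() and load() work the same way with any binary file-like object; as a quick sketch, an in-memory io.BytesIO buffer can stand in for a real file opened in binary mode:

```python
import io
import pickle

buffer = io.BytesIO()           # any writable binary file-like object
pickle.dump([1, 2, 3], buffer)  # dump() writes the pickled bytes to it

buffer.seek(0)                  # rewind so load() reads from the start
restored = pickle.load(buffer)
print(restored)                 # [1, 2, 3]
```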

Dumping objects into pickled byte strings and loading them back again are just the external tasks, however. As with the many protocols described previously, Python allows individual objects to control how they’re pickled and restored. Because pickling represents a sort of snapshot of the object at the time it was pickled, these methods are named to refer to the state of the object at a given time.

The first method to consider is __getstate__() , which controls what gets included in the pickled value. It doesn’t take any additional arguments and returns whatever value Python should include in the pickled output. For complex objects the value will typically be a dictionary or perhaps a tuple, but it’s completely up to each class to define what values are pertinent to the object.

For example, a currency conversion class might contain a number to use as the current amount as well as a string to indicate the currency being represented. In addition, it would likely have access to a dictionary of current exchange rates, so that it can convert the amount to a different currency. If a reference to that dictionary were placed on the object itself, Python would pickle it all together :

>>> class Money:
...     def __init__(self, amount, currency):
...         self.amount = amount
...         self.currency = currency
...         self.conversion = {'USD': 1, 'CAD': .95}
...     def __str__(self):
...         return '%.2f %s' % (self.amount, self.currency)
...     def __repr__(self):
...         return 'Money(%r, %r)' % (self.amount, self.currency)
...     def in_currency(self, currency):
...         ratio = self.conversion[currency] / self.conversion[self.currency]
...         return Money(self.amount * ratio, currency)
...
>>> us_dollar = Money(250, 'USD')
>>> us_dollar
Money(250, 'USD')
>>> us_dollar.in_currency('CAD')
Money(237.5, 'CAD')
>>> pickled = pickle.dumps(us_dollar)
>>> pickled
b'\x80\x03c__main__\nMoney\nq\x00)\x81q\x01}q\x02(X\x08\x00\x00\x00currencyq\x03
X\x03\x00\x00\x00USDq\x04X\x06\x00\x00\x00amountq\x05K\xfaX\n\x00\x00\x00convers
ionq\x06}q\x07(h\x04K\x01X\x03\x00\x00\x00CADq\x08G?\xeeffffffuub.'

As you can see, this is already quite an expansive pickled value, and that’s with just having two currencies stored in the dictionary. Because the currency conversion values aren’t specific to the instance at hand—and they’ll change over time anyway—there’s no reason to store them in the pickled string, so we can use __getstate__() to provide just those values that are actually important.

If you look closely at the pickled output of the existing Money object, you’ll notice that the attribute names are also included because Python doesn’t know if they’re important. In lieu of any explicit instructions from __getstate__() , it includes as much information as possible, to be sure the object can be recreated later. Because we already know that there are just two values that are necessary, we can return just those two values as a tuple:

>>> class Money:
...     def __init__(self, amount, currency):
...         self.amount = amount
...         self.currency = currency
...         self.conversion = {'USD': 1, 'CAD': .95}
...     def __str__(self):
...         return '%.2f %s' % (self.amount, self.currency)
...     def __repr__(self):
...         return 'Money(%r, %r)' % (self.amount, self.currency)
...     def __getstate__(self):
...         return self.amount, self.currency
...     def in_currency(self, currency):
...         ratio = self.conversion[currency] / self.conversion[self.currency]
...         return Money(self.amount * ratio, currency)
...
>>> us_dollar = Money(250, 'USD')
>>> us_dollar
Money(250, 'USD')
>>> us_dollar.in_currency('CAD')
Money(237.5, 'CAD')
>>> pickled = pickle.dumps(us_dollar)
>>> pickled
b'\x80\x03c__main__\nMoney\nq\x00)\x81q\x01K\xfaX\x03\x00\x00\x00USDq\x02\x86q
\x03b.'

As you can see, this cuts the size of the pickled output to just over a third of what it was before. In addition to being more efficient, it’s more practical because it doesn’t contain unnecessary information. Other values to avoid pickling include initialization values, system-specific details, and other transient information that is merely related to the object’s value rather than being part of that value directly.

That’s only half of the equation, however. Once you have customized the pickled output of an object, it can’t be retrieved back into a Python object without also customizing that side of things. After all, by storing the value as a tuple, we’ve removed some of the hints Python used to rebuild the object, so we have to provide an alternative.

As you might have guessed, the complement to __getstate__() is __setstate__() . The __setstate__() method accepts just one additional argument: the state of the object to restore. Because __getstate__() can return any object to represent state, there’s no specific type that will also be passed into __setstate__(). It’s not at all random, however; the value passed into __setstate__() will be exactly the same value that was returned from __getstate__().

In the case of our currency converter, the state is represented by a 2-tuple containing the amount and currency:

>>> class Money:
...     def __init__(self, amount, currency):
...         self.amount = amount
...         self.currency = currency
...         self.conversion = {'USD': 1, 'CAD': .95}
...     def __str__(self):
...         return '%.2f %s' % (self.amount, self.currency)
...     def __repr__(self):
...         return 'Money(%r, %r)' % (self.amount, self.currency)
...     def __getstate__(self):
...         return self.amount, self.currency
...     def __setstate__(self, state):
...         self.amount = state[0]
...         self.currency = state[1]
...     def in_currency(self, currency):
...         ratio = self.conversion[currency] / self.conversion[self.currency]
...         return Money(self.amount * ratio, currency)
...
>>> us_dollar = Money(250, 'USD')
>>> pickled = pickle.dumps(us_dollar)
>>> pickle.loads(pickled)
Money(250, 'USD')

And with that, the Money class now fully controls how its value gets pickled and unpickled. That should be the end of it, right? Well, just to be sure, let’s test that in_currency() method again, because that’s an important aspect of its behavior:

>>> us_dollar = pickle.loads(pickled)
>>> us_dollar
Money(250, 'USD')
>>> us_dollar.in_currency('CAD')
Traceback (most recent call last):
  ...
AttributeError: 'Money' object has no attribute 'conversion'

So why didn’t this work? When unpickling an object, Python doesn’t call __init__() along the way because that step is only supposed to take place when setting up new objects. Because the pickled object was already initialized once before the state was saved, it would usually be wrong to try to initialize it again. Instead, you can include initialization behaviors like that inside of __setstate__() to ensure that everything is still properly in place:

>>> class Money:
...     def __init__(self, amount, currency):
...         self.amount = amount
...         self.currency = currency
...         self.conversion = self.get_conversions()
...     def __str__(self):
...         return '%.2f %s' % (self.amount, self.currency)
...     def __repr__(self):
...         return 'Money(%r, %r)' % (self.amount, self.currency)
...     def __getstate__(self):
...         return self.amount, self.currency
...     def __setstate__(self, state):
...         self.amount = state[0]
...         self.currency = state[1]
...         self.conversion = self.get_conversions()
...     def get_conversions(self):
...         return {'USD': 1, 'CAD': .95}
...     def in_currency(self, currency):
...         ratio = self.conversion[currency] / self.conversion[self.currency]
...         return Money(self.amount * ratio, currency)
...
>>> us_dollar = Money(250, 'USD')
>>> pickled = pickle.dumps(us_dollar)
>>> pickle.loads(pickled)
Money(250, 'USD')
>>> us_dollar.in_currency('CAD')
Money(237.5, 'CAD')

Of course, all of this is only useful if you’re sending an object outside of the running program, to be stored and loaded back later. If all you need to do is work with it inside of Python itself, you can simply copy the object internally.

Copying

Mutable objects come with one potentially prominent drawback: changes made to an object are visible from every reference to that object. All mutable objects work this way because of how Python references objects, but that behavior isn’t always the most useful. In particular, when working with objects passed in as arguments to a function, the code that called the function will often expect the object to be left unchanged. If the function needs to make modifications in the course of its work, you’ll need to take some extra care.

In order to make changes to an object without those changes showing up elsewhere, you’ll need to copy the object first. Some objects provide a mechanism for this right out of the box. Lists, for instance, support slicing to retrieve items from the list into a new list. That behavior can be used to get all the items at once, creating a new list with those same items. Simply leave out the start and end values, and the slice will copy the list automatically:

>>> a = [1, 2, 3]
>>> b = a[:]
>>> b
[1, 2, 3]
>>> b.append(4)
>>> b
[1, 2, 3, 4]
>>> a
[1, 2, 3]

Similarly, dictionaries have their own way to copy their contents, although not using a syntax like lists use. Instead, dictionaries provide a copy() method , which returns a new dictionary with all the same keys and values:

>>> a = {1: 2, 3: 4}
>>> b = a.copy()
>>> b[5] = 6
>>> b
{1: 2, 3: 4, 5: 6}
>>> a
{1: 2, 3: 4}

Not all objects include this type of copying behavior internally, but Python allows you to copy any object, even if it doesn’t have its own copying mechanism .

Shallow Copies

To get a copy of any arbitrary object, Python provides a copy module. The simplest function available in that module is also named copy(), and it provides the same basic behavior as the techniques shown in the previous section. The difference is that rather than being a method on the object you want to copy, copy.copy() allows you to pass in any object and get a shallow copy of it. Not only can you copy a wider variety of objects, you can do so without needing to know anything about the objects themselves:

>>> import copy
>>> class Example:
...     def __init__(self, value):
...         self.value = value
...
>>> a = Example('spam')
>>> b = copy.copy(a)
>>> b.value = 'eggs'
>>> a.value
'spam'
>>> b.value
'eggs'

Of course, this is just a shallow copy. Remember from the beginning of this chapter that an object is really the combination of three components: an identity, a type, and a value. When you make a copy of an object, what you’re really doing is creating a new object with the same type, but with a new identity and a new—but identical—value.

For mutable objects, that value typically contains references to other objects, such as the items in a list or the keys and values in a dictionary. The value for the copied object may have a new namespace, but it contains all the same references. Therefore, when you make changes to a member of the copied object, those changes get reflected in all other references to that same object, just like any other namespace. To illustrate, consider a dictionary that contains lists as its values:

>>> a = {'a': [1, 2, 3], 'b': [4, 5, 6]}
>>> b = a.copy()
>>> a['a'].append(4)  # modifies the list shared by a and b
>>> b['b'].append(7)  # modifies the list shared by a and b
>>> a
{'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7]}
>>> b
{'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7]}

As you can see, the copy only goes one level deep, so it’s considered to be “shallow.” Beyond the object’s own namespace only references get copied, not the objects themselves. This is true for all types of objects, not just the lists and dictionaries shown here. In fact, custom objects can even customize this behavior by providing a __copy__() method. The copy() function will call __copy__() with no arguments if it exists, so that method can determine which values get copied and how they’re handled.
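As a sketch of that hook, a class can define __copy__() to decide what a shallow copy shares and what it gets fresh; the TaggedList class here is hypothetical:

```python
import copy

class TaggedList:
    def __init__(self, items, tags=None):
        self.items = items
        self.tags = tags or []

    def __copy__(self):
        # Share the item list, but give the copy its own tag list.
        new = TaggedList(self.items)
        new.tags = list(self.tags)
        return new

a = TaggedList([1, 2, 3], tags=['original'])
b = copy.copy(a)           # calls TaggedList.__copy__()
b.tags.append('copy')
print(a.tags)              # ['original']: the tag lists are separate
print(a.items is b.items)  # True: the item list is still shared
```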

Typically, shallow copies are useful when the first layer is the only part of a value you need to change, particularly when it makes more sense to specifically keep the rest of the objects intact. The basic example case for this is sorting a list, where a new list must be created in order to sort the items, but those items themselves should remain as they were.

To illustrate, consider a custom implementation of Python’s built-in sorted() function , which sorts the items into a new list while keeping the original unchanged:

>>> def sorted(original_list, key=None):
...     copied_list = copy.copy(original_list)
...     copied_list.sort(key=key)
...     return copied_list
...
>>> a = [3, 2, 1]
>>> b = sorted(a)
>>> a
[3, 2, 1]
>>> b
[1, 2, 3]

Of course, this still relies on the object passed in being a list, but it illustrates how shallow copies can be useful. In other situations you may need to modify the whole structure as deep as you can get.

Deep Copies

Algorithms often need to reorganize data in large structures in order to solve a particular problem. Sorting, indexing, aggregating, and rearranging data are all common tasks in these more complex operations. Because the goal is simply to return some analysis of that data, the original structure needs to remain intact. We need a deeper copy than what we’ve examined so far.

For these situations, Python’s copy module also contains a deepcopy() function, which copies not only the original structure but also the objects that are referenced by it. In fact, it looks recursively through all those objects for any other objects, copying each in turn. This way you’re free to modify the copy however you like, without fear of modifying the original or of modifications to the original being reflected in the copy:

>>> original = [[1, 2, 3], [1, 2, 3]]
>>> shallow_copy = copy.copy(original)
>>> deep_copy = copy.deepcopy(original)
>>> original[0].append(4)
>>> shallow_copy
[[1, 2, 3, 4], [1, 2, 3]]
>>> deep_copy
[[1, 2, 3], [1, 2, 3]]

It’s not truly recursive, however, because full recursion could produce infinite loops whenever a data structure contains a reference to itself. Once a particular object has been copied, deepcopy() makes a note of it, so that any future references to that same object can simply be pointed at the new copy rather than creating a brand-new one every time.

Not only does that avoid recursively copying the same object if it’s somehow a member of itself; it also means that any time the same object is found more than once in the structure, it will only be copied once and referenced as many times as necessary. That means the copied structure will have the same behavior as the original with regard to how changes are reflected in referenced objects:

>>> a = [1, 2, 3]
>>> b = [a, a]
>>> b
[[1, 2, 3], [1, 2, 3]]
>>> b[0].append(4)
>>> b
[[1, 2, 3, 4], [1, 2, 3, 4]]
>>> c = copy.deepcopy(b)
>>> c
[[1, 2, 3, 4], [1, 2, 3, 4]]
>>> c[0].append(5)
>>> c
[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]

This is a must for algorithms that rely on objects being present in multiple places of a structure. Each copy will behave the same as the original in that regard, so there’s no worry about how many times it gets copied before an algorithm starts working with it.

One other problem that can come up with deep copies is that Python doesn’t know what might or might not be important, so it copies everything, which might end up being far more than you need. In order to control that behavior, custom objects can specify the deep copying behavior separately from shallow copies.

By supplying a __deepcopy__() method, an object can specify which values are pertinent to the copy , much like how __getstate__() works for pickling. The biggest difference from __getstate__(), and from __copy__() as well, is that __deepcopy__() also accepts a second argument, which will be a dictionary used to manage the identity of objects during copies. Because the deep copy should only copy each object once and use references any other time that object is used, this identity namespace provides a way to keep track of which objects are indeed the same because it maps their identities to the objects themselves .
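A minimal sketch of that protocol follows; the Snapshot class is hypothetical. The memo dictionary maps id(obj) to the copy already made, so it’s important to record the new object in memo before copying any members, and to pass memo along to copy.deepcopy(), so that shared and self-references resolve correctly:

```python
import copy

class Snapshot:
    def __init__(self, data, cache=None):
        self.data = data
        self.cache = cache or {}  # transient; not worth deep-copying

    def __deepcopy__(self, memo):
        new = Snapshot.__new__(Snapshot)
        memo[id(self)] = new      # register first, so cycles resolve
        new.data = copy.deepcopy(self.data, memo)  # copy what matters
        new.cache = {}                             # rebuild transient state
        return new

a = Snapshot([[1, 2], [3, 4]], cache={'expensive': True})
b = copy.deepcopy(a)
a.data[0].append(5)
print(b.data)   # [[1, 2], [3, 4]]: fully independent
print(b.cache)  # {}: transient state was not copied
```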

Exciting Python Extensions: Beautiful Soup

Beautiful Soup is a de facto standard library for working with HTML and XML documents. It is a file parser or screen-scraper that gives you great control in shaping files to meet your data extraction needs. In Chapter 5 you used Scrapy for web scraping. The documents you obtained can be easily cleaned to remove markup language with Beautiful Soup. This is a great library to use in conjunction with other Python extensions such as Scrapy. Consider that you would obtain the data with a tool like Scrapy and then clean it with Beautiful Soup. Beautiful Soup has some powerful searching capabilities as well, but let’s just focus on the parsing ability.

Installing Beautiful Soup

Documentation for the extension is available at https://www.crummy.com/software/BeautifulSoup . Install it from a command prompt with pip:
pip install beautifulsoup4 (Enter)

Of course, with other operating systems you would use the appropriate install tool; with Elementary or Ubuntu, for example, it would be sudo apt-get install name-of-package.

Using Beautiful Soup

Make sure your install is working first by running from a Python interactive prompt:
from bs4 import BeautifulSoup (Enter)

If no errors result, then your libraries are installed. If you receive errors check that you do not have another Python installation, such as Anaconda, or path issues.

As an example of the power of Beautiful Soup, we will take the HTML file harvested in Chapter 5 with Scrapy and clean it up so that it is a text file only, with the markup tags removed. This will create a file that is much better suited to data analysis, such as searching for keywords or occurrences. Key in and run the following code, with the quotes-1.html file we created in the previous chapter in the same folder, and you will see raw HTML output and prettified Beautiful Soup output:
from bs4 import BeautifulSoup

path = 'quotes-1.html'
filedata = open(path, 'r', errors='ignore')
page = filedata.read()
# the 'lxml' parser requires the lxml package; the built-in
# 'html.parser' also works here
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())  # show the full HTML markup, indented
print(' And a cleaner version: ')
print(soup.get_text())  # return plain text only
What you should see is the raw HTML text, then the cleaned-up version via Beautiful Soup. Note that a little extraneous data is left behind (but not much). Next, let’s search only for items that have an HTML ‘span’ tag, count the occurrences, and print a cleaner output of only those selected items:
from bs4 import BeautifulSoup
path='quotes-1.html'
filedata=open(path,'r',errors='ignore')
page=filedata.read()
soup = BeautifulSoup(page, 'lxml')
print(' We found this many span tags:  ',len(soup.find_all('span')))
print(' Show only span tag items ')
print(soup.find_all('span'))
print('------------------')
print(' Now clean up the span tags ')
for item in soup.find_all('span'):
    print(item.text)

In this last example we searched for a tag, then used a for loop to print the individual items, with the tags removed via item.text. Is there more you could do with Beautiful Soup? Certainly, but this should serve as a good stepping-off point for further experimenting.
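For instance, find_all() can filter on attributes as well as tag names. Here is a small self-contained sketch using an inline HTML string and the built-in 'html.parser' backend, so no input file or lxml install is needed; the markup is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <span class="text">First quote</span>
  <span class="author">An Author</span>
  <span class="text">Second quote</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) avoids clashing with the
# Python keyword while filtering on the HTML class attribute
quotes = soup.find_all('span', class_='text')
print([item.text for item in quotes])  # ['First quote', 'Second quote']
```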

Taking It With You

Every application knows how to deal with objects at a basic level, but with the techniques shown in this chapter you’ll be able to move on to managing large collections of objects, spanning a wide variety of different types. In the next chapter, we’ll shift from a macro-level view of objects to a micro-level examination of one specific type of object: the humble string.
