© J. Burton Browning and Marty Alchin 2019
J. Burton Browning and Marty Alchin, Pro Python 3, https://doi.org/10.1007/978-1-4842-4385-5_5

5. Common Protocols

J. Burton Browning, Oak Island, NC, USA
Marty Alchin, Agoura Hills, CA, USA

Most of the time, you’ll want to define objects that are highly customized to the needs of your application. This often means coming up with your own interfaces and APIs that are unique to your own code. The flexibility to do this is essential to the expansion capabilities of any system, but there is a price. Everything new that you invent must be documented and understood by those who need to use it.

Understanding how to use the various classes made available by a framework can be quite a chore for users of that framework, even with proper documentation. A good way to ease the burden on users is to mimic interfaces they’re already familiar with. There are many existing types that are standard issue in Python programming, and most of them have interfaces that can be implemented in custom classes.

Methods are the most obvious way to implement an existing interface, but with many of the built-in types, most of the operations are performed with native Python syntax rather than explicit method calls. Naturally, these syntactic features are backed by actual methods behind the scenes, so they can be overridden to provide custom behaviors.

The following sections show how the interfaces for some of the most common types used in Python can be imitated in custom code. This is by no means an exhaustive list of all the types that ship with Python, nor is every method represented. Instead, this chapter is a reference for those methods that aren’t so obvious because they’re masked by syntactic sugar.

Basic Operations

Even though there are a wide variety of object types available in Python, most of them share a common set of operations. These are considered to be something of a core feature set, representing some of the most common high-level aspects of object manipulation, many of which are just as applicable to simple numbers as they are to many other objects.

One of the simplest and most common needs in all of programming, Python included, is to evaluate an expression to a Boolean value so that it can be used to make simple decisions. Typically this is used in if blocks, but these decisions also come into play when using while, and Boolean operations such as and and or. When Python encounters one of these situations, it relies on the behavior of the __bool__() method to determine the Boolean equivalent of an object.

The __bool__() method, if implemented, accepts just the usual self and must return either True or False. This allows any object to determine whether it should be considered true or false in a given expression, using whatever methods or attributes are appropriate:

>>> bool(0)
False
>>> bool(1)
True
>>> bool(5)
True

As another example, consider that a class representing a rectangle might use its area to determine whether the rectangle is considered true or false. Therefore, __bool__() only has to check for a nonzero width and a nonzero height, since zero is false and any nonzero value is true. Here we use the built-in bool(), which uses __bool__() to convert the value to a Boolean:

>>> class Rectangle:
...     def __init__(self, width, height):
...         self.width = width
...         self.height = height
...     def __bool__(self):
...         if self.width and self.height:
...             return True
...         return False
...
>>> bool(Rectangle(10, 15))
True
>>> bool(Rectangle(0, 0))
False
>>> bool(Rectangle(0, 15))
False

Tip

The __bool__() method isn’t the only way to customize Python’s Boolean behavior. If, instead, an object provides a __len__() method, which is described in the section on sequences later in this chapter, Python will fall back to that and consider any nonzero lengths to be true, while lengths of zero are false.
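As a quick sketch of that fallback, consider a hypothetical container class that defines __len__() but no __bool__() at all; Python derives its truthfulness from the length:

```python
# Hypothetical Container class: it defines __len__() but not __bool__(),
# so Python falls back to the length to determine truthfulness.
class Container:
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

empty = bool(Container([]))     # length 0, so False
full = bool(Container([1, 2]))  # nonzero length, so True
```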

With the truthfulness of objects taken into account, you automatically get control over the behavior of such operators as and, or, and not. Therefore, there are no separate methods to override in order to customize those operators.

In addition to being able to determine the truthfulness of an object, Python offers a great deal of flexibility in other operations as well. In particular, the standard mathematical operations can be overridden because many of them can apply to a variety of objects beyond just numbers.

Mathematical Operations

Some of the earliest forms of math stemmed from observations about the world around us. Therefore, most of the math we learned in elementary school applies just as easily to other types of objects as it does to numbers. For example, addition could be seen as simply putting two things together (concatenation), such as tying two strings together to make a single longer string.

If you only look at it mathematically, you could say that you’re really just adding two lengths together, resulting in a single, greater length. But when you look at what really just happened, you now have a brand-new string, which is different from the two strings that went into it originally.

This analogy extends easily into Python strings as well, which can be concatenated using standard addition, rather than requiring a separate, named method. Similarly, if you need to write the same string out multiple times, you can simply multiply it the same way you would a regular number. These types of operations are very common in Python because they can be a simple way to implement common tasks:

>>> 2 + 2
4
>>> 'two' + 'two'
'twotwo'
>>> 2 * 2
4
>>> 'two' * 2
'twotwo'

Like __bool__(), these behaviors are controlled by special methods of their own. Most of them are fairly straightforward, accepting the usual self as well as an other argument. These methods are bound to the object on the left side of the operator, with the additional other being the object on the right side.

The four basic arithmetic operations (addition, subtraction, multiplication, and division) are represented in Python using the standard operators +, -, *, and /. Behind the scenes, the first three are powered by implementations of the __add__(), __sub__(), and __mul__() methods. Division is a bit more complicated, and we'll get to that shortly, but for now, let's take a look at how this operator overloading works.

Consider a class that acts as a simple proxy around a value. There’s not much use for something like this in the real world, but it’s a good starting point to explain a few things:

>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __add__(self, other):
...         return self.value + other
...
>>> Example(10) + 20
30
This is just one example of a few basic arithmetic operations that are available for your code to customize. You’ll find more advanced operations detailed throughout the remainder of this chapter; Table 5-1 lists these basic arithmetic operators.
Table 5-1. Basic Arithmetic Operators

Operation        Operator   Custom Method
Addition         +          __add__()
Subtraction      -          __sub__()
Multiplication   *          __mul__()
Division         /          __truediv__()

Here's where things get interesting, because you'll notice that the method for division isn't __div__(), as you might expect. The reason for this is that division comes in two different flavors. The kind of division you get when you use a calculator is called true division in Python; it uses the __truediv__() method and works as you'd expect.

However, true division is the only arithmetic operation that can take two integers and return a noninteger. In some applications, it’s useful to always get an integer back instead. If you’re displaying an application’s progress as a percentage, for instance, you don’t really need to display the full floating point number.

Instead, an alternative operation called floor division is available; you may also have heard it referred to as integer division. If the result of true division would land between two integers, floor division simply returns the lower of the two, so it always returns an integer. Floor division, as you might expect, is implemented with a separate __floordiv__() method and is accessed using the // operator:

>>> 5 / 4
1.25
>>> 5 // 4
1

There's also a modulo operation, which is related to division: it returns only the remainder of a division operation. This uses the % operator, implemented using __mod__(). Strings also use it to perform standard variable interpolation, even though that has nothing to do with division:

>>> 20 // 6
3
>>> 20 % 6
2
>>> 'test%s' % 'ing'
'testing'

In effect, you can use floor division and a modulo operation to obtain the integer result of a division operation as well as its remainder, which retains all the information about the result. This is sometimes preferable to true division, which would simply produce a floating point number. For example, consider a function that takes a number of minutes and has to return a string containing the number of hours and minutes:

>>> def hours_and_minutes(minutes):
...     return minutes // 60, minutes % 60
...
>>> hours_and_minutes(60)
(1, 0)
>>> hours_and_minutes(137)
(2, 17)
>>> hours_and_minutes(42)
(0, 42)

In fact, this basic task is common enough that Python has its own function for it: divmod(). By passing in a base value and a value to divide it by, you can get the results of floor division and a modulo operation at the same time. Rather than simply delegating to those two methods independently, however, Python will try to call a __divmod__() method, which allows a custom implementation to be more efficient.

In lieu of a more efficient implementation, the __divmod__() method can be illustrated using the same technique as the hours_and_minutes() function. All we have to do is accept a second argument in order to take the hard-coded 60 out of the method:

>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __divmod__(self, divisor):
...         return self.value // divisor, self.value % divisor
...
>>> divmod(Example(20), 6)
(3, 2)

There's also an extension of multiplication called exponentiation, where a value is multiplied by itself a number of times. Given its relationship to multiplication, Python uses a double-asterisk ** notation to perform the operation. It's implemented using a __pow__() method, because real-world math typically calls it raising a value to a power of some other value:

>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __pow__(self, power):
...         val = 1
...         for x in range(power):
...             val *= self.value
...         return val
...
>>> Example(5) ** 3
125

Unlike the other operations, exponentiation can be performed in one other way as well, by way of the built-in pow() function. The reason there's a separate function is that it allows for an extra argument to be passed in. This extra argument is a value that should be used to perform a modulo operation after the exponentiation has been performed. This extra behavior allows for a more efficient way to perform such tasks as primality testing, which is commonly used in cryptography:

>>> 5 ** 3
125
>>> 125 % 50
25
>>> 5 ** 3 % 50
25
>>> pow(5, 3, 50)
25

In order to support this behavior with the __pow__() method, you can optionally accept an extra argument, which will be used to perform the modulo operation. This new argument must be optional in order to support the normal ** operator. There's no reasonable default value that can be used blindly without causing problems with standard exponentiation, so it should default to None to determine whether the modulo operation should be performed:

>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __pow__(self, power, modulo=None):
...         val = 1
...         for x in range(power):
...             val *= self.value
...         if modulo is not None:
...             val %= modulo
...         return val
...
>>> Example(5) ** 3
125
>>> Example(5) ** 3 % 50
25
>>> pow(Example(5), 3, 50)
25

Caution

As with the __divmod__() implementation shown previously, this example is not a very efficient approach to solving the problem. It does produce the correct values, but it should be used only for illustration.

Bitwise Operations

Bitwise operations are used when working with binary files, cryptography, encodings, hardware drivers, and networking protocols. As such, they are often associated with low-level programming; however, they are certainly not exclusively reserved for that domain. Bitwise operations act on values not as numbers directly, but rather as a sequence of individual bits. At that level, there are a few different ways of manipulating values that are applicable not only to numbers but to some other types of sequences as well. The simplest bitwise manipulation is a shift, where the bits within a value are moved to the right or to the left, resulting in a new value.

In binary arithmetic, shifting bits one place to the left multiplies the value by two. This is just like in decimal math: if you move all the digits in a number one place to the left and fill in the gap on the right with a zero, you’ve essentially multiplied the value by ten. This behavior exists for any numbered base, but computers work in binary, so the shifting operations do as well.

Shifting is achieved using the << and >> operators for left and right, respectively. The right-hand side of the operator indicates how many positions the bits should be shifted. Internally, these operations are supported by the __lshift__() and __rshift__() methods, each of which accepts the number of positions to shift as its only additional argument:

>>> 10 << 1
20
>>> 10 >> 1
5
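Supporting these operators in a custom class is straightforward; here's a sketch reusing the hypothetical Example wrapper from earlier in the chapter:

```python
# Sketch: the hypothetical Example wrapper, extended with shift operators.
class Example:
    def __init__(self, value):
        self.value = value

    def __lshift__(self, places):
        return self.value << places

    def __rshift__(self, places):
        return self.value >> places

doubled = Example(10) << 1  # shifting left multiplies by two: 20
halved = Example(10) >> 1   # shifting right divides by two: 5
```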

In addition to shuffling the bits around, there are a few operations that compare the bits in each value to each other, resulting in a new value that represents some combination of the two individual values. The four bitwise comparison operations are &, |, ^, and ~, referred to as AND, OR, XOR (exclusive OR), and inversion, respectively.

An AND comparison returns 1 only if both of the individual bits being compared are 1. If it’s any other combination, the result is 0. This behavior is often used to create a bitmask, where you can reset all irrelevant values to 0 by applying AND to a value that has 1 for each of the useful bits and 0 for the rest. This will clear out any bits you aren’t interested in, allowing for easy comparisons with sets of binary flags. Supporting this behavior in your code requires the presence of an __and__() method.

OR comparisons return 1 if either of the individual bits being compared is 1. It doesn’t matter if both of them are 1; as long as at least one of them is 1, the result will be 1. This is often used to join sets of binary flags together, so that all the flags from both sides of the operator are set in the result. The method required to support this functionality is __or__().

The standard OR operator is sometimes called an inclusive OR, to contrast it with its cousin, the exclusive OR, which is typically abbreviated as XOR. In an XOR operation, the result is 1 only if one of the individual bits was 1 but not the other. If both bits are 1 or both bits are 0, the result will be 0. XOR is supported by the __xor__() method.
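The flag-handling behaviors described above can be sketched with a few binary flags; the flag names here are hypothetical, not from the text:

```python
# Sketch of binary flags (the names are hypothetical).
READ, WRITE, EXECUTE = 0b100, 0b010, 0b001

mine = READ | WRITE            # OR joins flags together: 0b110
writable = bool(mine & WRITE)  # AND as a bitmask test: True
toggled = mine ^ WRITE         # XOR flips the write bit: 0b100
```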

Finally, Python also offers bitwise inversion, where each of the bits gets flipped to the opposite value from what it is currently; 1 becomes 0, and vice versa. Numerically, this swaps between negative and positive values, but it doesn’t simply change the sign. Here’s an example of how numbers react when inverted using the ~ operator:

>>> ~42
-43
>>> ~-256
255

This behavior is based on the way computers work with signed values. The most significant bit is used to determine whether the value is positive or negative, so flipping that bit changes the sign. The change in the absolute value after inversion is due to a lack of –0. When 0 is inverted it becomes –1 rather than –0, so all other values follow suit after that.

In custom code, inversion is typically most useful when you have a known set of all possible values, along with individual subsets of those values. Inverting a subset removes its existing values and replaces them with the values from the master set that weren't previously in the subset.

This behavior can be provided by supplying an __invert__() method on your object. Unlike the other bitwise methods, however, __invert__() is unary, so it doesn’t accept any additional arguments beyond the standard self.
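A minimal sketch of that idea, using a hypothetical Flags class whose master set of possible bits is a class-level ALL mask:

```python
# Sketch: inversion limited to a known master set of bits.
# The Flags class and its ALL mask are hypothetical.
class Flags:
    ALL = 0b111  # the master set of all possible bits

    def __init__(self, value):
        self.value = value

    def __invert__(self):
        # Mask with ALL so only bits from the master set remain
        return Flags(~self.value & self.ALL)

inverted = ~Flags(0b101)  # bits flip within the master set: 0b010
```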

Note

The inversion behavior described here is valid for numbers that are encoded using the two's-complement method for working with signed numbers. There are other options available that can behave differently than what's shown here if a custom number class provides an __invert__() method to do so. By default, Python works only with two's-complement encoding.

Variations

In addition to the normal behavior of operations, there are a couple different ways they can also be accessed. The most obvious issue is that the methods are typically bound to the value on the left-hand side of the operator. If your custom object gets placed on the right-hand side instead, there’s a good chance that the value on the left won’t know how to work with it, so you’ll end up with a TypeError instead of a usable value.

This behavior is understandable but unfortunate, because if the custom object knows how to interact with the other value, it should be able to do so regardless of their positions. To allow for this, Python gives the value on the right-hand side of the operator a chance to return a valid value.

When the left-hand side of the expression fails to yield a value, Python then checks to see if the value on the right is of the same type. If it is, there’s no reason to expect that it would be able to do any better than the first time around, so Python simply raises the TypeError. If it’s a different type, however, Python will call a method on the right-hand value, passing in the left-hand value as its argument.

This process swaps the arguments around, binding the method to the value on the right-hand side. For some operations, such as subtraction and division, the order of the values is important, so Python uses a different method to indicate the change in ordering. The names of these separate methods are mostly the same as the left-hand methods, but with an r added after the first two underscores:

>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __add__(self, other):
...         return self.value + other
...
>>> Example(20) + 10
30
>>> 10 + Example(20)
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'int' and 'Example'
>>> class Example:
...     def __init__(self, value):
...         self.value = value
...     def __add__(self, other):
...         return self.value + other
...     def __radd__(self, other):
...         return self.value + other
...
>>> Example(20) + 10
30
>>> 10 + Example(20)
30

Tip

In cases like this in which the order of the values doesn’t affect the result, you can actually just assign the left-hand method to the name of the right-hand method. Just remember that not all operations work that way, so you can’t blindly copy the method to both sides without ensuring that it makes sense.
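That shortcut looks like this in practice; since addition in this hypothetical wrapper is commutative, the left-hand method is simply assigned to the right-hand name:

```python
# Sketch: aliasing the left-hand method to the right-hand name,
# which is safe here because this addition is commutative.
class Example:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return self.value + other

    __radd__ = __add__  # same behavior regardless of operand order

left = Example(20) + 10   # uses __add__(): 30
right = 10 + Example(20)  # falls back to __radd__(): 30
```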

Another common way to use these operators is to modify an existing value and assign the result right back to the original value. As demonstrated without explanation earlier in this chapter, an alternative form of assignment caters to these modifications. By simply appending = to the operator you need, you can assign the result of the operation to the value on the left-hand side:

>>> value = 5
>>> value *= 3
>>> value
15

By default, this form of augmented assignment uses the standard operator methods in the same way as was described previously in this chapter. However, that requires creating a new value after the operation, which is then used to rebind an existing value. Instead, it can sometimes be advantageous to modify the value in place, as long as you can identify when this assignment is taking place.

Like the right-hand side methods, in-place operators use essentially the same method names as the standard operators, but this time with an i after the underscores. There’s no right-hand side equivalent of this operation, however, because the assignment is always done with the variable on the left-hand side. With everything taken into account, Table 5-2 lists the available operators, along with the methods required to customize their behavior.
Table 5-2. Available Operators

Operation            Operator    Left-hand        Right-hand        In-place
Addition             +           __add__()        __radd__()        __iadd__()
Subtraction          -           __sub__()        __rsub__()        __isub__()
Multiplication       *           __mul__()        __rmul__()        __imul__()
True division        /           __truediv__()    __rtruediv__()    __itruediv__()
Floor division       //          __floordiv__()   __rfloordiv__()   __ifloordiv__()
Modulo               %           __mod__()        __rmod__()        __imod__()
Division & modulo    divmod()    __divmod__()     __rdivmod__()     N/A
Exponentiation       **          __pow__()        __rpow__()        __ipow__()
Left binary shift    <<          __lshift__()     __rlshift__()     __ilshift__()
Right binary shift   >>          __rshift__()     __rrshift__()     __irshift__()
Bitwise AND          &           __and__()        __rand__()        __iand__()
Bitwise OR           |           __or__()         __ror__()         __ior__()
Bitwise XOR          ^           __xor__()        __rxor__()        __ixor__()
Bitwise inversion    ~           __invert__()     N/A               N/A

Note

There's no in-place method for the combined division and modulo operation because it's not available as an operator that supports assignment; it's only accessible through the divmod() function, which has no in-place equivalent. Also, bitwise inversion is a unary operation, so there's no right-side or in-place method available.
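To round out the in-place variants, here's a sketch of a hypothetical Tally accumulator whose __iadd__() modifies the object itself rather than creating a new one:

```python
# Sketch of an in-place method: a hypothetical Tally accumulator
# that modifies itself rather than building a new object.
class Tally:
    def __init__(self, value=0):
        self.value = value

    def __iadd__(self, other):
        self.value += other
        return self  # must return the object so it can be rebound

t = Tally(5)
original_id = id(t)
t += 3  # calls __iadd__(); t is still the very same object
```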

Even though these operations are primarily focused on numbers, many of them also make sense for other types of objects. There is another set of behaviors, however, that really only makes sense for numbers and objects that can act like numbers.

Numbers

Underneath it all, computers are all about numbers, so it's only natural that they play an important role in most applications. Beyond the operations outlined in the previous section, numbers exhibit various behaviors that may not be as obvious.

The most basic behavior a custom number can have is to convince Python that it is in fact a number. This is necessary when trying to use an object as an index in a sequence. Python requires that all indexes be integers, so there needs to be a way to coerce an object into an integer for the sake of being used as an index. For this, Python uses an __index__() method, raising a TypeError if it doesn't exist or if it returns something other than an integer:

>>> sequence = [1, 2, 3, 4, 5]
>>> sequence[3.14]
Traceback (most recent call last):
  ...
TypeError: list indices must be integers or slices, not float
>>> class FloatIndex(float):
...     def __index__(self):
...         # For the sake of example, return just the integer portion
...         return int(self)
...
>>> sequence[FloatIndex(3.14)]
4
>>> sequence[3]
4

In addition to simple index access, __index__() is used to coerce an integer for the sake of slicing and to generate a starting value for conversion using the built-in bin(), hex(), and oct() functions. When looking to explicitly force an integer in other situations, you can use the __int__() method, which is used by the built-in int() function. Other type conversions can be performed using __float__() to support float() and __complex__() for complex().
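Those conversion hooks can be sketched on the hypothetical Example wrapper:

```python
# Sketch: explicit type conversions via __int__() and __float__().
class Example:
    def __init__(self, value):
        self.value = value

    def __int__(self):
        return int(self.value)

    def __float__(self):
        return float(self.value)

as_int = int(Example(3.7))    # 3 (truncated, not rounded)
as_float = float(Example(3))  # 3.0
```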

One of the most commonly required operations when converting one number to another is rounding. Unlike int(), which blindly truncates any part of the value that’s not an integer, rounding affords more control over what type of value you end up with and how much precision is retained.

When you pass a decimal or a floating point number into int(), the effect is essentially just a floor operation. Like floor division mentioned previously, a floor operation takes a number between two integers and returns the lower of the two. The math module contains a floor() function to perform this operation.

As you might expect, this relies on a __floor__() method on a custom object to perform the floor operation. It doesn’t require any arguments beyond the usual self and should always return an integer. Python doesn’t actually enforce any requirements on the return value, however, so if you’re working with some subclass of integers, you can return one of those instead.

By contrast, you may need to go with the higher of the two, which would be a ceiling operation. This is done using math.ceil() and implemented with the __ceil__() method. Like __floor__(), it doesn’t take any additional arguments and returns an integer.

More likely, you’ll need to round a value to a specific number of digits. This is achieved using the round() function, which is a built-in function, rather than being located in the math module. It takes up to two arguments and is implemented using the __round__() method on a custom object.

The first argument to round() is the object that __round__() will be bound to, so it comes through as the standard self. The second argument is a bit more nuanced, however. It’s the number of digits to the right of the decimal point that should be considered significant, and thus retained in the result. If it’s not provided, round() should assume that none of those digits are significant and return an integer:

>>> round(3.14, 1)
3.1
>>> round(3.14)
3
>>> round(3.14, 0)
3.0
>>> import decimal
>>> round(decimal.Decimal('3.14'), 1)
Decimal('3.1')
>>> round(decimal.Decimal('3.14'))
3

As you can see, there’s actually a difference between passing a second argument of 0 and not passing one at all. The return value is essentially the same, but when not passing it in, you should always get an integer. When passing in a 0 instead, you’ll get whatever type you pass in, but with only the significant digits included.

In addition to rounding digits to the right of the decimal point, round() can act on the other side as well. By passing in a negative number, you can specify the number of digits to the left of the decimal point that should be rounded away, leaving the other digits remaining:

>>> round(256, -1)
260
>>> round(512, -2)
500
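Putting this together, a custom __round__() method only needs to honor the optional second argument, defaulting to None so that the no-argument case returns an integer. A sketch, delegating to the built-in round() on a wrapped value:

```python
# Sketch: supporting round() with an optional digits argument.
class Example:
    def __init__(self, value):
        self.value = value

    def __round__(self, digits=None):
        if digits is None:
            return round(self.value)      # no argument: return an integer
        return round(self.value, digits)  # otherwise keep the given precision

two_digits = round(Example(3.14159), 2)  # 3.14
integer = round(Example(3.7))            # 4, as an int
```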

Sign Operations

There is also a selection of unary operations that can be used to adjust the sign of a value. The first, -, negates the sign, swapping between positive and negative values. Customization of this behavior is made available by providing a __neg__() method, which accepts no extra arguments beyond self.

To complement the negative sign, Python also supports a positive sign, using +. Because numbers are ordinarily assumed to be positive, this operator actually doesn’t do anything on its own; it simply returns the number unchanged. In the event that a custom object needs an actual behavior attached to this, however, a __pos__() method can provide it.

Finally, a number can also have an absolute value, which is generally defined as its distance from zero. The sign is irrelevant, and all values become positive. Therefore, applying abs() to a number removes the negative sign if present but leaves positive values unchanged. This behavior is modified by an __abs__() method.
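All three sign methods can be sketched on the hypothetical Example wrapper:

```python
# Sketch: the three unary sign methods.
class Example:
    def __init__(self, value):
        self.value = value

    def __neg__(self):
        return Example(-self.value)

    def __pos__(self):
        return Example(self.value)  # ordinarily a no-op

    def __abs__(self):
        return Example(abs(self.value))

x = Example(-5)
negated = (-x).value    # 5
positive = (+x).value   # -5, unchanged
distance = abs(x).value  # 5
```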

Comparison Operations

The operations shown thus far have been concerned with returning a modified value, based at least in part on one or more existing values. Comparison operators, by contrast, return either True or False, based on the relationship between two values.

The most basic comparison operators, is and is not, operate directly on the internal identity of each object. Because the identity is typically implemented as the object’s address in memory, which can’t be changed by Python code, there’s no way to override this behavior. Its use is generally reserved for comparison with known constants, such as None.

The operators that are available represent the standard numerical comparisons, which detect if one value is higher, lower, or exactly equal to another. The most versatile is testing for equality, using ==. Its versatility comes from the fact that it’s not limited to numerical values because many other types can have objects that are considered equal to each other. This behavior is controlled by an __eq__() method.

Inequality is represented in Python by the != operator, which behaves just as you would expect. In Python 3, if you implement __eq__() without a corresponding __ne__() method, Python falls back to calling __eq__() and inverting its result (unless it returns NotImplemented). You can still supply a separate __ne__() method whenever that simple inversion isn't the behavior you want.

In addition, you can compare one value as less than or greater than another, using < and >, which are implemented using __lt__() and __gt__(), respectively. Equality can also be combined with these, so that one value can be greater than or equal to another, for instance. These operations use <= and >= and are supported by __le__() and __ge__().

These comparisons are often used for objects that are predominantly represented by a number, even if the object itself is much more than that. Dates and times are notable examples of objects that are easily comparable because they’re each essentially a series of numbers that can each be compared individually if needed:

>>> import datetime
>>> a = datetime.date(2019, 10, 31)
>>> b = datetime.date(2017, 1, 1)
>>> a == b
False
>>> a < b
False

Strings are an interesting case with regard to comparisons. Even though a string isn’t numeric in an obvious sense, each character in a string is simply another representation of a number, so string comparisons also work. These comparisons drive the sorting features of strings.
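The text doesn't mention it, but the standard library can cut down on comparison boilerplate: functools.total_ordering derives the remaining comparison methods from __eq__() and a single ordering method. A sketch, using a hypothetical Money class:

```python
import functools

# Hypothetical Money class; total_ordering derives <=, >, and >=
# from __eq__() and __lt__().
@functools.total_ordering
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __eq__(self, other):
        return self.amount == other.amount

    def __lt__(self, other):
        return self.amount < other.amount

cheap = Money(5) < Money(10)      # True, via __lt__()
at_least = Money(10) >= Money(5)  # True, derived by total_ordering
```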

Iterables

It may seem like sequences are the obvious next choice, but there’s a more generic form to consider first. An object is considered iterable if it can yield objects one at a time, typically within a for loop. This definition is intentionally simple, because at a high level, iterables really don’t go beyond that. Python does have a more specific definition of iterables, however.

In particular, an object is iterable if passing it into the built-in iter() function returns an iterator. Internally, iter() inspects the object passed in, looking first for an __iter__() method. If such a method is found, it's called without any arguments and is expected to return an iterator. There's another step that will take place if __iter__() wasn't available, but for now, let's focus on iterators.

Even though the object is considered iterable, it’s the iterator that does all the real work, but there’s really not that much to it. There’s no requirement for what the __init__() method should look like, because it gets instantiated within the __iter__() method of its master object. The required interface consists of just two methods.

The first method, perhaps surprisingly, is __iter__(). Iterators should always be iterable on their own as well, so they must provide an __iter__() method. There's usually no reason to do anything special in this method, though, so it's typically implemented to just return self. If you don't supply __iter__() on the iterator, the main object will still be iterable in most cases, but some code will expect its iterator to be usable on its own as well.

More importantly, an iterator must always provide a __next__() method, where all the real work happens. Python will call __next__() to retrieve the next value from the iterator, with that value being used in the body of whatever code called the iterator. When that code needs a new value, typically for the next pass in a loop, it calls __next__() again to get a new value. This process continues until one of a few things happens.

If Python encounters anything that causes the loop to complete while the iterator still has items it could produce, the iterator just stands by, waiting for some other code to ask for another item. If that never happens, eventually there will be no more code that knows about the iterator at all, so Python will remove it from memory. Chapter 6 covers this garbage collection process in greater detail.

There are a few different cases where an iterator might not be given a chance to finish. The most obvious is a break statement , which would stop the loop and continue on afterward. Additionally, a return or a raise statement would implicitly break out of any loop it’s part of, so the iterator is left in the same state as when a break occurs.

More commonly, however, the loop will just let the iterator run until it doesn’t have any more items to produce. When using a generator, this case is handled automatically when the function returns without yielding a new value. With an iterator, this behavior must be provided explicitly.

Because None is a perfectly valid object that could reasonably be yielded from an iterator, Python can’t just react to __next__() failing to return a value. Instead, the StopIteration exception provides a way for __next__() to indicate that there are no more items. When this is raised the loop is considered complete, and execution resumes on the next line after the end of the loop.
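The built-in next() function, which calls __next__() internally, makes this protocol easy to observe by hand:

```python
# Driving an iterator manually; a for loop does exactly this behind
# the scenes, treating StopIteration as the end of the loop.
it = iter([1, 2])
print(next(it))  # 1
print(next(it))  # 2
try:
    next(it)
except StopIteration:
    print('exhausted')
```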

To illustrate how all of this fits together, let's look at the behavior of the built-in range() function. It doesn't return a generator, because the object it returns can be iterated multiple times. To provide similar functionality, we need to return an iterable object instead, which can then be iterated as many times as necessary:

class Range:
    def __init__(self, count):
        self.count = count
    def __iter__(self):
        return RangeIter(self.count)
class RangeIter:
    def __init__(self, count):
        self.count = count
        self.current = 0
    def __iter__(self):
        return self
    def __next__(self):
        if self.current >= self.count:
            raise StopIteration
        value = self.current
        self.current += 1
        return value
>>> def range_gen(count):
...     for x in range(count):
...         yield x
...
>>> r = range_gen(5)
>>> list(r)
[0, 1, 2, 3, 4]
>>> list(r)
[]
>>> r = Range(5)
>>> list(r)
[0, 1, 2, 3, 4]
>>> list(r)
[0, 1, 2, 3, 4]

Iterators are the most powerful and flexible way to implement an iterable, so they’re generally preferred, but there’s also another way to achieve a similar effect. What makes an object iterable is the fact that iter() returns an iterator, so it’s worth noting that iter() supports a certain kind of special case.

If an object doesn’t have an __iter__() method, but contains a __getitem__() method instead, Python can use that in a special iterator that exists just to handle that case. We’ll get to more details in the next section on sequences, but the basic idea is that __getitem__() accepts an index and is expected to return the item in that position.

If Python finds __getitem__() instead of __iter__(), it will automatically create an iterator designed to work with it. This new iterator calls __getitem__() several times, each with a value from a series of numbers, beginning with zero, until __getitem__() raises an IndexError. Therefore, our custom Range iterable can be rewritten quite simply:

class Range:
    def __init__(self, count):
        self.count = count
    def __getitem__(self, index):
        if index < self.count:
            return index
        raise IndexError
>>> r = Range(5)
>>> list(r)
[0, 1, 2, 3, 4]
>>> list(r)
[0, 1, 2, 3, 4]

Note

Python will only use this __getitem__() behavior if __iter__() is not present. If both are provided on a class, the __iter__() method will be used to control the iteration behavior.
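A quick check confirms this precedence; the Both class here is made up purely for illustration:

```python
# When both methods are present, iteration goes through __iter__();
# __getitem__() is never consulted.
class Both:
    def __iter__(self):
        return iter(['from __iter__'])
    def __getitem__(self, index):
        raise RuntimeError('__getitem__ is not used for iteration here')

print(list(Both()))  # ['from __iter__']
```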

Example: Repeatable Generators

The ability to iterate over an object multiple times is very common among explicitly iterable object types, but generators are often more convenient to work with. If you need a generator that restarts itself each time a new iterator is requested, it may seem like you're stuck either losing that functionality or adding a bunch of otherwise unnecessary code that exists solely to allow for proper iteration.

Instead, like many other behaviors, we can rely on Python’s standard way to augment a function and factor it out into a decorator. When applied to a generator function, this new decorator can handle everything necessary to create an iterable that triggers the generator from the beginning each time a new iterator is requested:

def repeatable(generator):
    """
    A decorator to turn a generator into an object that can be
    iterated multiple times, restarting the generator each time.
    """
    class RepeatableGenerator:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
        def __iter__(self):
            return iter(generator(*self.args, **self.kwargs))
    return RepeatableGenerator
>>> @repeatable
... def generator(max):
...     for x in range(max):
...         yield x
...
>>> g = generator(5)
>>> list(g)
[0, 1, 2, 3, 4]
>>> list(g)
[0, 1, 2, 3, 4]

By creating a new class that can be instantiated when the generator function is called, its __iter__() method will get called instead of the generator’s. This way, the generator can be called from scratch each time a new loop begins, yielding a new sequence rather than trying to pick up where it left off, which would often mean returning an empty sequence.

Caution

Even though most generators return a similar sequence each time through and can be restarted without worry, not all of them behave that way. If a generator changes its output based on when it's called, picks up where it left off on subsequent calls, or produces side effects, this decorator is not recommended. By changing the behavior to explicitly restart the generator each time, the new generator could yield unpredictable results.

There’s one problem with the code as it stands, however. The @repeatable decorator receives a function but returns a class, which works fine in the example provided but has some very troubling implications. To start, remember from Chapter 3 that wrappers take on their own name and docstring, masking those of the original function, a problem that can be fixed using the @functools.wraps decorator.

Before we can even consider using another decorator, however, we have to solve the bigger problem: we’re returning a completely different type than the original function. By returning a class instead of a function, we’ll cause problems with any code that expects it to be a function, including other decorators. Worse yet, the class returned can’t be used as a method because it doesn’t have a __get__() method to bind it to its owner class or an instance of it.

To solve these issues, we have to introduce a wrapper function around the class, which will instantiate the object and return it. This way, we can use @functools.wraps to retain as much of the original function's metadata as possible. Better yet, we now return a function, which can be bound to classes and instances without any trouble:

import functools
def repeatable(generator):
    """
    A decorator to turn a generator into an object that can be
    iterated multiple times, restarting the generator each time.
    """
    class RepeatableGenerator:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
        def __iter__(self):
            return iter(generator(*self.args, **self.kwargs))
    @functools.wraps(generator)
    def wrapper(*args, **kwargs):
        return RepeatableGenerator(*args, **kwargs)
    return wrapper
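Restating that final form of the decorator, a quick check confirms that the wrapped function keeps its original name and, because it's a plain function, can also be used as a method. The countdown and Timer names here are made up for illustration:

```python
import functools

def repeatable(generator):
    """Restart the underlying generator each time iteration begins."""
    class RepeatableGenerator:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
        def __iter__(self):
            return iter(generator(*self.args, **self.kwargs))
    @functools.wraps(generator)
    def wrapper(*args, **kwargs):
        return RepeatableGenerator(*args, **kwargs)
    return wrapper

@repeatable
def countdown(n):
    for x in range(n, 0, -1):
        yield x

print(countdown.__name__)  # 'countdown', preserved by functools.wraps
print(list(countdown(3)))  # [3, 2, 1]

class Timer:
    @repeatable
    def ticks(self, n):
        for x in range(n):
            yield x

t = Timer()
print(list(t.ticks(3)))  # [0, 1, 2]
print(list(t.ticks(3)))  # [0, 1, 2] again; the generator restarts
```

Because wrapper is an ordinary function, Python binds it as a method automatically, with self simply being captured in args and passed through to the generator.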

Sequences

After numbers, sequences are perhaps some of the most commonly used data structures in all of programming, including Python. Lists, tuples, and even strings are sequences that share a common set of features, which make them a specialized type of iterable. In addition to being able to yield a series of items individually, sequences have additional attributes and behaviors supporting the fact that they know about the entire set of items all at once.

These extra behaviors don’t necessarily require that all the items be loaded into memory at the same time. The efficiency gains achieved through iteration are just as valid with sequences as with any other iterable, so that behavior doesn’t change. Instead, the added options simply refer to the collection as a whole, including its length and the ability to get a subset of it, as well as accessing individual items without getting the whole sequence.

The most obvious feature of a sequence is the ability to determine its length. For objects that can contain any arbitrary items, this requires knowing—or perhaps counting—all those items. For others, the object can use some other information to reach the same result. Customization of this behavior is achieved by providing a __len__() method, which is called internally when the object is passed into the built-in len() function.

To continue along the same lines as previous examples, here’s how a simple replacement Range class could use knowledge of its configuration to return the length without having to yield a single value:

class Range:
    def __init__(self, max):
        self.max = max
    def __iter__(self):
        for x in range(self.max):
            yield x
    def __len__(self):
        return self.max

Because sequences contain a fixed collection of items, they can be iterated not only from start to finish but also in reverse. Python provides the reversed() function, which takes a sequence as its only argument and returns an iterable that yields items from the sequence in reverse. There may be particular efficiency gains to be had, so a custom sequence object can provide a __reversed__() method to customize the internal behavior of reversed().

Taking this notion to the Range class again, it’s possible to provide a reversed range using an alternative form of the built-in range():

class Range:
    def __init__(self, max):
        self.max = max
    def __iter__(self):
        for x in range(self.max):
            yield x
    def __reversed__(self):
        for x in range(self.max - 1, -1, -1):
            yield x

Now that we have the ability to iterate over a sequence both forward and backward as well as report its length, the next step is to provide access to individual items. In a plain iterable, you can only access items by retrieving them one at a time as part of a loop. With all the values in the sequence known in advance, a custom class can provide access to any item at any time.

The most obvious task is to retrieve an item given an index that’s known in advance. For example, if a custom object contained the arguments passed in on the command line, the application would know the specific meaning of each argument and would typically access them by index rather than simply iterating over the whole sequence. This uses the standard sequence[index] syntax, with its behavior controlled by the __getitem__() method.

With __getitem__(), individual items can be picked out of the sequence or retrieved from some other data structure if necessary. Continuing on the Range theme again, __getitem__() can calculate what the appropriate value should be without cycling through the sequence. In fact, it can even support the full range of arguments that are available to the built-in range():

class Range:
    def __init__(self, a, b=None, step=1):
        """
        Define a range according to a starting value, an end value and a step.
        If only one argument is provided, it's taken to be the end value. If
        two arguments are passed in, the first becomes a start value, while the
        second is the end value. An optional step can be provided to control
        how far apart each value is from the next.
        """
        if b is not None:
            self.start = a
            self.end = b
        else:
            self.start = 0
            self.end = a
        self.step = step
    def __getitem__(self, key):
        value = self.step * key + self.start
        if value < self.end:
            return value
        else:
            raise IndexError("key outside of the given range")
>>> r = Range(5)
>>> list(r)
[0, 1, 2, 3, 4]
>>> r[3]
3
>>> r = Range(3, 17, step=4)
>>> list(r)
[3, 7, 11, 15]
>>> r[2]
11
>>> r[4]
Traceback (most recent call last):
  ...
IndexError: key outside of the given range

In the event that the index passed in is beyond the range of available items, __getitem__() should raise an IndexError. Highly specialized applications could define a more specific subclass and raise that instead, but most use cases will simply catch IndexError on its own.

In addition to matching the expectations of most Python programmers, properly raising IndexError is essential to allow a sequence to be used as an iterable without implementing __iter__(). Python will simply pass in integer indexes until the __getitem__() method raises an IndexError, at which point it will stop iterating over the sequence.

Beyond just accessing a single item at a time, a sequence can provide access to subsets of its contents by way of slicing. When using the slicing syntax, __getitem__() receives a special slice object instead of an integer index. A slice object has dedicated attributes for the start, stop, and step portions of the slice, which can be used to determine which items to return. Here’s how this affects the Range object we’ve been examining:

class Range:
    def __init__(self, a, b=None, step=1):
        """
        Define a range according to a starting value, an end value and a step.
        If only one argument is provided, it's taken to be the end value. If
        two arguments are passed in, the first becomes a start value, while the
        second is the end value. An optional step can be provided to control
        how far apart each value is from the next.
        """
        if b is not None:
            self.start = a
            self.end = b
        else:
            self.start = 0
            self.end = a
        self.step = step
    def __getitem__(self, key):
        if isinstance(key, slice):
            # Note: this simple version assumes the slice supplies a stop value
            r = range(key.start or 0, key.stop, key.step or 1)
            return [self.step * val + self.start for val in r]
        value = self.step * key + self.start
        if value < self.end:
            return value
        else:
            raise IndexError("key outside of the given range")

The next logical step is to allow an individual item in the sequence to be set according to its index. This in-place assignment uses essentially the same sequence[index] syntax but as the target of an assignment operation. It’s supported by a custom object in its __setitem__() method, which accepts both the index to access and the value to store at that index.

Like __getitem__(), however, __setitem__() can also accept a slice object as its index, rather than an integer. Because a slice defines a subset of the sequence, however, the value that’s passed is expected to be another sequence. The values in this new sequence will then take the place of those in the subset referenced by the slice.

Things aren’t exactly as they seem, however, because the sequence being assigned to the slice doesn’t actually need to have the same number of items as the slice itself. In fact, it can be of any size, whether larger or smaller than the slice it’s being assigned to. The expected behavior of __setitem__() is simply to remove the items referenced by the slice, then place the new items in that gap, expanding or contracting the size of the total list as necessary to accommodate the new values.
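The built-in list already exhibits this behavior, so it serves as a reference for what a custom __setitem__() should do with slices:

```python
# Slice assignment on a built-in list: the replacement sequence may be
# any size, and the list grows or shrinks to accommodate it.
items = list(range(5))        # [0, 1, 2, 3, 4]
items[1:3] = ['a', 'b', 'c']  # two items replaced by three
print(items)                  # [0, 'a', 'b', 'c', 3, 4]
items[1:4] = ['x']            # three items replaced by one
print(items)                  # [0, 'x', 3, 4]
```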

Note

The __setitem__() method is only intended for replacing existing values in the sequence, not for strictly adding new items. To do that you’ll need to also implement append() and insert(), using the same interfaces as standard lists.

Removing an item from a list can be achieved in one of two different ways. The explicit method for this is remove() (e.g., list(range(10, 20)).remove(15)), which takes the value of the item that should be removed and deletes its first occurrence. The remaining items that were positioned after the removed item are then shifted to the left to fill in the gap. The same shifting behavior is also available by index, using a del sequence[index] statement.

Implementing remove() is straightforward enough, given that it’s an explicit method call that searches by value. The del statement instead works by index, using a __delitem__() method behind the scenes. If deleting a single item were all that mattered, __delitem__() would look much like __getitem__(), simply removing the item at the given index. Unfortunately, slicing complicates matters slightly.

Deleting items from a slice works just like the first portion of the slicing behavior of __setitem__(). Instead of replacing the items in the slice with a new sequence, however, the sequence should simply shift its items to close up the gap.
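Again, the built-in list demonstrates the expected behavior that __delitem__() should provide for both a single index and a slice:

```python
# Deleting by index or by slice closes the gap, shifting the
# remaining items to the left.
items = list(range(6))  # [0, 1, 2, 3, 4, 5]
del items[2]            # single index: one item removed
print(items)            # [0, 1, 3, 4, 5]
del items[1:3]          # slice: the whole range removed at once
print(items)            # [0, 4, 5]
```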

With all the different ways to make changes to the contents of a sequence, the last—but not least—important feature is to test whether an item is a part of the given sequence. By default, Python will simply iterate over the sequence—using the techniques listed previously in the section on iterables—until it either finds the item being tested or exhausts all the values provided by the iterator. This allows a membership test to be performed on iterables of any type, without being limited to full sequences.

In order to be more efficient, sequences can override this behavior as well, by providing a __contains__() method. Its signature looks like __getitem__(), but rather than accepting an index, it accepts an object and returns True if the given object is present in the sequence or False otherwise. In the Range example examined previously, the result of __contains__() can be calculated on the fly, based on the configuration of the object:

class Range:
    def __init__(self, a, b=None, step=1):
        """
        Define a range according to a starting value, an end value and a step.
        If only one argument is provided, it's taken to be the end value. If
        two arguments are passed in, the first becomes a start value, while the
        second is the end value. An optional step can be provided to control
        how far apart each value is from the next.
        """
        if b is not None:
            self.start = a
            self.end = b
        else:
            self.start = 0
            self.end = a
        self.step = step
    def __contains__(self, num):
        return (self.start <= num < self.end and
                not (num - self.start) % self.step)
>>> list(range(5, 30, 7))
[5, 12, 19, 26]
>>> 5 in Range(5, 30, 7)
True
>>> 10 in Range(5, 30, 7)
False
>>> 33 in Range(5, 30, 7)
False

Many of the methods presented here for sequences are also valid for the next container type, which maps a collection of keys to associated values.

Mappings

Whereas sequences are contiguous collections of objects, mappings work a bit differently. In a mapping, the individual items are actually pairs, each consisting of a key and a value. Keys don’t have to be ordered, because iterating over them isn’t generally the point. Instead, the goal is to provide fast access to the value referenced by a given key, which is typically known in advance.

Accessing a value by its key uses the same syntax as using indexes in sequences. In fact, Python doesn’t know or care if you’re implementing a sequence, a mapping or something completely different. The same methods, __getitem__(), __setitem__(), and __delitem__(), are reused to support the obj[key] syntax regardless of which type of object is used. That doesn’t mean the implementations of these methods can be identical, however.

For a mapping, a key is used as the index. Even though there’s no difference in syntax between the two, keys support a wider range of allowed objects. In addition to plain integers, a key may be any hashable Python object such as dates, times, or strings; of these, strings are by far the most common. It’s up to your application, however, to decide whether there should be any limitations on what keys to accept.

Python supports so much flexibility, in fact, that you can even use the standard slicing syntax without regard to what values are involved in the slice. Python simply passes along whatever objects were referenced in the slice, so it’s up to the mapping to decide how to deal with them. By default, lists handle slices by explicitly looking for integers, using __index__() if necessary to coerce objects into integers. For dictionaries, by contrast, slice objects aren’t hashable, so dictionaries don’t allow them to be used as keys.
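A quick demonstration with the built-in dict shows the hashability rule in action:

```python
import datetime

# Any hashable object can serve as a dictionary key; mutable objects
# such as lists are rejected with a TypeError.
d = {}
d['name'] = 1                     # strings are the most common keys
d[(2, 5)] = 2                     # tuples of hashables work too
d[datetime.date(2019, 1, 1)] = 3  # so do dates
try:
    d[[1, 2]] = 4                 # lists are mutable, hence unhashable
except TypeError as e:
    print(e)                      # unhashable type: 'list'
```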

Tip

For the most part you can accept anything in a custom dictionary, even if you intend to use only a specific type, such as strings, as your keys. As long as it only gets used in your own code, it won’t make any difference because you’re in control of all its uses. If you make modifications that prove to be useful outside of your application, other developers will make use of it for their own needs. Therefore, you should restrict the available keys and values only if you really need to; otherwise, it’s best to leave options open, even for yourself.

Even though this chapter hasn’t generally covered any methods that are called directly as part of the public interface, mappings have three methods that provide particularly useful access to internal components, which should always be implemented. These methods are necessary because mappings essentially contain two separate collections—keys and values—which are then joined together by association, whereas sequences only contain a single collection.

The first of these extra methods, keys(), iterates over all the keys in the mapping without regard to their values. By default, the keys can be returned in any order, but some more specialized classes could choose to provide an explicit order for these keys. This same behavior is provided by iteration over the mapping object itself, so be sure to always supply an __iter__() method that does the same thing as keys().

The next method, values(), is complementary, iterating over the values side of the mapping instead. Like the keys, these values generally aren’t assumed to be in any sort of order. In practice, the C implementation of Python uses the same order as it does for the keys, but order is never guaranteed, even between the keys and values of the same object.

In order to reliably get all the keys and values in their associated pairs, mappings provide an items() method. This iterates over the entire collection, yielding each pair as a tuple in the form of (key, value). Because this is often more efficient than iterating over the keys and using mapping[key] to get the associated value, all mappings should provide an items() method and make it as efficient as possible.
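Putting the pieces together, here's a minimal sketch of the mapping protocol. The class name is hypothetical, and for simplicity it delegates storage to a plain dict; a production implementation would more likely extend collections.abc.MutableMapping, which derives several of these methods automatically:

```python
# A hypothetical, minimal mapping that shows which methods back each
# piece of syntax: obj[key] access, iteration, and the keys/values/items
# trio described above.
class SimpleMapping:
    def __init__(self):
        self._data = {}
    def __getitem__(self, key):
        return self._data[key]         # obj[key]
    def __setitem__(self, key, value):
        self._data[key] = value        # obj[key] = value
    def __delitem__(self, key):
        del self._data[key]            # del obj[key]
    def __iter__(self):
        return iter(self._data)        # iteration yields keys, like keys()
    def keys(self):
        return self._data.keys()
    def values(self):
        return self._data.values()
    def items(self):
        return self._data.items()      # yields (key, value) pairs

m = SimpleMapping()
m['a'] = 1
m['b'] = 2
print(list(m))          # ['a', 'b']
print(list(m.items()))  # [('a', 1), ('b', 2)]
```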

Callables

In Python, both functions and classes can be called to execute code at any time, but those aren’t the only objects that can do so. In fact, any Python class can be made callable by simply attaching a single extra method to the class definition. This method, appropriately named __call__(), accepts the usual self along with any arguments that should be passed along in the method call.

There are no special requirements for what arguments __call__() can accept, because it works like any other method. The only difference from a plain function is that, like any method, it also receives the object it’s attached to as the first argument:

>>> class CallCounter:
...     def __init__(self):
...         self.count = 0
...     def __call__(self, *args, **kwargs):
...         self.count += 1
...         return 'Number of calls so far: %s' % self.count
...     def reset(self):
...         self.count = 0
...
>>> counter = CallCounter()
>>> counter()
'Number of calls so far: 1'
>>> counter()
'Number of calls so far: 2'
>>> counter()
'Number of calls so far: 3'
>>> counter.reset()
>>> counter()
'Number of calls so far: 1'

Caution

As a method itself, __call__() can also be decorated any number of times, but remember that it’s still a method, even though it is invoked by calling the object directly. As a method, any decorators applied to it must be able to deal with the first argument being an instance of the object.

As for what __call__() can do, the sky is the limit. Its purpose is solely to allow an object to be callable; what happens during that call depends entirely on the needs at hand. This example shows that it can also take any additional arguments you may need, like any other method or function. Its greatest strength, however, is that it allows you to essentially provide a function that can be customized on its own, without the need for any decorators.

Context Managers

As mentioned briefly in Chapter 2, objects can also be used as context managers for use in a with statement. This allows an object to define what it means to work within the context of that object, setting things up prior to executing the contained code and cleaning up after execution has finished.

One common example is file handling, because a file must be opened for a specific type of access before it can be used. Then it also needs to be closed when it’s no longer in use, to flush any pending changes to disk. This makes sure other code can open the same file later on, without conflicting with any open references. What happens between those two operations is said to be executed within the context of the open file.

As mentioned, there are two distinct steps to be performed by a context manager. First, the context needs to be initialized, so that the code that executes inside the with block can make use of the features provided by the context. Just prior to execution of the interior code block, Python will call the __enter__() method on the object. This method doesn’t receive any additional arguments, just the instance object itself. Its responsibility is then to provide the necessary initialization for the code block, whether that means modifying the object itself or making global changes.

If the with statement includes an as clause, the return value of the __enter__() method will be used to populate the variable referenced in that clause. It’s important to realize that the object itself won’t necessarily be that value, even though it may seem that way looking at the syntax for the with statement. Using the return value of __enter__() allows the context object to be more flexible, although that behavior can be achieved by simply returning self.
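A small sketch makes the relationship between __enter__() and the as clause explicit; the Tag class here is hypothetical, returning something other than self to show that the two need not be the same object:

```python
# The value returned by __enter__() populates the `as` target;
# __exit__() then runs once the block finishes.
class Tag:
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        print('<%s>' % self.name)
        return self.name              # becomes the `as` variable
    def __exit__(self, exc_class, exc_instance, traceback):
        print('</%s>' % self.name)

with Tag('section') as name:
    print(name)                       # section, not the Tag instance
```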

Once the code inside the with block finishes executing, Python will call the __exit__() method on the object. This method is then responsible for cleaning up any changes that were made during __enter__(), returning the context to whatever it was prior to processing the with statement. In the case of files, this would mean closing the file, but it could be virtually anything.

Of course, there are a few ways that execution within the with block can complete. The most obvious is if the code simply finishes on its own, without any problems or other flow control. Statements such as return, yield, continue, and break can also stop execution of the code block, in which case __exit__() will still be called because the cleanup is still necessary. In fact, even if an exception is raised, __exit__() is still given a chance to reverse any changes that were applied during __enter__().

In order to identify whether the code finished normally or stopped early by way of an exception, the __exit__() method will be given three additional arguments. The first is the class object for the exception that was raised, followed by the instance of that class, which is what was actually raised in the code. Finally, __exit__() will also receive a traceback object, representing the state of execution as of when the exception was raised.

All three of those arguments are always passed in, so any implementations of __exit__() must accept them all. If execution completed without raising any exceptions, the arguments will still be provided, but their values will simply be None. Having access to both the exception and a traceback allows your implementation of __exit__() to intelligently react to whatever went wrong and what led to the problem.

Tip

The __exit__() method doesn’t suppress any exceptions on its own. If __exit__() completes without a return value, the original exception, if any, will be reraised automatically. If you need to explicitly catch any errors that occur within the with block, simply return True from __exit__() instead of letting it fall off the end, which would return an implicit None.

To show one simple example, consider a class that uses the context management protocol to silence any exceptions that are raised within the with block. In this case, __enter__() doesn’t need to do anything because the exception handling will be done in __exit__():

>>> class SuppressErrors:
...     def __init__(self, *exceptions):
...         if not exceptions:
...             exceptions = (Exception,)
...         self.exceptions = exceptions
...     def __enter__(self):
...         pass
...     def __exit__(self, exc_class, exc_instance, traceback):
...         if isinstance(exc_instance, self.exceptions):
...             return True
...         return False
...
>>> with SuppressErrors():
...     1 / 0  # Raises a ZeroDivisionError
...
>>> with SuppressErrors(IndexError):
...     a = [1, 2, 3]
...     print(a[4])
...
>>> with SuppressErrors(KeyError):
...     a = [1, 2, 3]
...     print(a[4])
...
Traceback (most recent call last):
  ...
IndexError: list index out of range

Exciting Python Extensions: Scrapy

If you ever have the need to extract data from the Internet, most specifically making sense of data on websites, then a web-scraping tool will be of great benefit. Scrapy is an open source, full-featured tool for web scraping. If you have heard of “spiders” or “web crawling,” then you are already familiar with other terms for the same idea. In the big scope of things, a web-scraping tool is one part of working with big data: web scraping lets you mine raw information from the Internet, while other tools let you clean it up and categorize what you obtained. Python makes it easy to build a scraper. Read on to see how to get your raw data with Scrapy.

Installation

First you will need to install the libraries for the web-scraping tool Scrapy. To do this, get to an escalated command prompt (Windows) and type:
pip install scrapy (Enter)

macOS and Linux will be similar; just check the scrapy.org site for details.

Running Scrapy

You can run a spider directly via the runspider command, or you can create a project directory that holds one or more spiders. For quick work, such as running a single spider, one simple command suffices: scrapy runspider my_spider.py. Sometimes, however, you may want a project directory so that you can store configuration information and multiple spiders in an orderly manner. For our purposes, one spider will more than suffice.

Project Setup

The initial process will be to find and download web pages and then extract information from them based on given criteria. To do this you will want your spider in a folder of your choice, so that everything is organized in one area. Make a folder on your system that you can easily navigate to from a command prompt. For example, on MS Windows, from the root of your C: drive:
md firstspider (Enter)
It really does not matter where you make the folder, but do make sure you are able to navigate to it. Next, using your Python IDLE IDE, write the following very basic spider code and save the file as scraper.py to the folder you just created:
import scrapy
# filename scraper.py

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print('URL we went to:', response)
Now, running this code on its own is not very exciting. For our purposes, Scrapy runs best via its command-line interface, much as you would run a Python script from the command line. With Python it would be python name_of_file.py; with Scrapy it is similar. From within the folder you just created and saved your file to, type: scrapy runspider scraper.py (Enter). If everything runs properly, you should see something similar to the following:
Figure 5-1

Screen capture running the sample scraper via the terminal

If you received any errors, it could be that the path used to find Scrapy is not set. If you are on Windows and receive a win32api error, you will most likely need to install pypiwin32. If needed, complete this by typing from an escalated command prompt:
pip install pypiwin32 (Enter)

By itself, this was only exciting in that there were (hopefully) no errors and we printed the URL we visited. That said, let’s now perform a bit more productive work.

Retrieve Web Data with Scrapy

Scrapy has a command-line interface that is quite handy. Of course you will write your spider(s) in Python, but the Scrapy shell can help you work out what to write in your spider code. Consider first how to view a web page with Scrapy.

View a Web Page via Scrapy

From an escalated command prompt, scrapy view http://quotes.toscrape.com/page/1/ will cause Scrapy to load the URL you specify in a browser. This is handy because you may want to check a site before having Scrapy extract data from it. Note the title of the page; we will extract only that next.

Shell Options

You will, of course, want to know what Scrapy shell options are available. To see them, start the interactive shell by entering scrapy shell http://quotes.toscrape.com/page/1/ from the command line. You now see the options. Try, from within the shell: response.css('title'). Note that you are still in the Scrapy interactive shell, and that the title, with its HTML markup tags, is returned. Use CTRL+Z to exit the shell.

To perform the same thing programmatically with Python, consider the following:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print()
        print("Title will follow: ")
        print(response.css("title"))
        print()
This will give us the extracted title from the page, with markup tags.
Figure 5-2

CLI output of title

Now, to clean it up a bit, change the line:
print(response.css("title"))
to:
print(response.css("title").extract_first())

Then save and rerun the spider, and you will note much cleaner and more usable output of the HTML tags and the title. The extract_first() method returns a string of the first match found.
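If you want to see roughly what this kind of extraction involves without installing Scrapy, Python’s standard library can do a crude equivalent. This sketch uses html.parser on a hard-coded page string (the network fetch is omitted so the example stays self-contained; the page content here is invented for illustration):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text content of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Flag that subsequent character data belongs to <title>
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data
            self.in_title = False

# A stand-in for the HTML a real request would return
page = '<html><head><title>Quotes to Scrape</title></head><body></body></html>'
parser = TitleParser()
parser.feed(page)
print(parser.title)  # Quotes to Scrape
```

Unlike response.css("title").extract_first(), which hands you the element with its tags, this parser pulls out just the text. Scrapy’s selectors are far more capable; this is only to demystify the idea.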

Of course, this is just a bit to get you started with Scrapy. You can do much more with it; use what you have learned to expand your web-scraping skills. The best place to find more information on the methods and features of Scrapy is the docs.scrapy.org site. In fact, the Quotes URL used in this example is the same one used in the Scrapy site’s tutorials.

Taking It With You

There is perhaps one thing that is most important to understand about all the protocols listed in this chapter: they aren’t mutually exclusive. It is possible—and sometimes very advantageous—to implement multiple protocols on a single object. For example, a sequence can also be used as a callable and a context manager if both of those behaviors make sense for a given class.

This chapter has dealt primarily with the behaviors of objects, as provided by their classes; the next chapter will cover how you can manage those objects and their data once they have been instantiated in working code.
