Creating indexes to improve efficiency

One of the rules of efficiency is to avoid search. Our previous example of using an iterator over the keys in a shelf is inefficient. To state that more strongly, use of search defines an inefficient application. We'll emphasize this.

Brute-force search is perhaps the worst possible way to work with data. Try to design indexes based on subsets or key mappings to improve performance.

To avoid searching, we need to create indexes that list the items users are most likely to want. This saves you reading through the entire shelf to find an item or subset of items. A shelf index can't reference Python objects, as that would change the granularity at which the objects are stored. An index will only list key values, a separate retrieval is done to get the object in question. This makes navigation among objects indirect but still much faster than a brute-force search of all items in the shelf.

As an example of an index, we can keep a list of the Post keys associated with each Blog in the shelf. We can easily change the add_blog(), add_post(), and delete_post() methods to update the associated Blog entry too. Here are the revised versions of these blog update methods:

class Access2(Access):

    def create_post(self, blog: Blog, post: Post) -> Post:
        super().create_post(blog, post)
        # Update the index; append doesn't work.
        blog_index = f"_Index:{blog._id}"
        self.database.setdefault(blog_index, [])
        self.database[blog_index] = self.database[blog_index] + [post._id]
        return post

    def delete_post(self, post: Post) -> None:
        super().delete_post(post)
        # Update the index.
        blog_index = f"_Index:{post._blog_id}"
        index_list = self.database[post._blog_id]
        index_list.remove(post._id)
        self.database[post._blog_id] = index_list

    def post_iter(self, blog: Blog) -> Iterator[Post]:
        blog_index = f"_Index:{blog._id}"
        for k in self.database[blog_index]:
            yield self.database[k]

Most of the methods are inherited without change from the Access class. We've extended three methods to create a useful index of children for a given blog:

create_post()
delete_post()
post_iter()

The create_post() method uses the create_post() superclass to save the Post object to the shelf. Then it makes sure the "_Index:{blog}" object is in the shelf, by using setdefault(). This object will be a list with the keys for related posts. The list of keys is updated using the following statement:

self.database[blog_index] = self.database[blog_index] + [post._id]

This is required to update the shelf. We cannot simply use self.database[blog_index].append(post._id). These kind of update-in-place methods of a dictionary don't work as expected with a shelf object. Instead, we must retrieve the object out of the shelf with self.database[blog_index]. Update the retrieved object, and then replace the object in the shelf using a simple assignment statement.

Similarly, the delete_post() method keeps the index up to date by removing unused posts from _post_list of the owning blog. As with create_post(), two updates are done to the shelf: a del statement removes Post, and then the Blog object is updated to remove the key from the associated index.

This change alters our queries for a Post object in a profound way. We're able to replace a scan of all the items in post_iter() with a much more efficient operation. This loop will rapidly yield the Post objects based on the keys saved in the _post_list attribute of Blog. An alternative body is this generator expression:

return (self.database[k] for k in blog._post_list)

The point of this optimization to the post_iter() method is to eliminate the search of all the keys for the matching keys. We've replaced searching all keys with a simple iteration over an appropriate sequence of relevant keys. A simple timing test, which alternates between updating Blog and Post and rendering Blog to RST, shows us the following results:

Access Layer Access: 33.5 seconds 
Access Layer Access2: 4.0 seconds

As expected, eliminating the search reduced the time required to process Blog and its individual Posts. The change is profound: almost 86% of the processing time was wasted in the search for relevant posts.

Let's see how to create a cache.

Table of Contents for Creating indexes to improve efficiency

Create new playlist

Sign In

Sign Up

Table of Contents for
Creating indexes to improve efficiency