Persisting information with database systems

Our prediction service will use data in a number of ways. When we start the service, we have standard configurations we would like to retrieve (for example, the model parameters), and we might also like to log records of the requests that the application responds to for debugging purposes. As we score data or prepare trained models, we would ideally like to store these somewhere in case the prediction service needs to be restarted. Finally, as we will discuss in more detail, a database can allow us to keep track of application state (such as which tasks are in progress). For all these uses, a number of database systems can be applied.

Databases are generally categorized into two groups: relational and non-relational. Relational databases are probably familiar to you, as they are used in most business data warehouses. Data is stored in the form of tables, often with fact tables (recording events such as purchases or searches) whose columns (such as a user account ID or an item identifier) may be joined to dimension tables (containing information on an item or user) or to relational information (such as a hierarchy of item IDs that defines the contents of an online store). In a web application, a relational system can be used behind the scenes to retrieve information (for example, in response to a GET request for user information), to insert new information, or to delete rows from the database. Because the data in a relational system is stored in tables, each row needs to follow a common set of columns, and these sorts of systems are not designed with nested structures such as JSON in mind. If we know there are columns we will frequently query (such as an item ID), we can design indices on the tables in these systems that speed up retrieval. Some popular open source relational systems are MySQL, PostgreSQL, and SQLite.
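To make the relational operations described above concrete, the following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table names (purchases, items), the column names, and the sample rows are hypothetical and chosen only for illustration; a production service would more likely connect to a server-based system such as MySQL or PostgreSQL.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained (nothing is written to disk).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical dimension table (items) and fact table (purchases).
cur.execute("CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE purchases ("
    "purchase_id INTEGER PRIMARY KEY, user_id INTEGER, item_id INTEGER)"
)

# An index on a column we expect to query frequently.
cur.execute("CREATE INDEX idx_purchases_item ON purchases (item_id)")

# Insert new information.
cur.executemany(
    "INSERT INTO items (item_id, name) VALUES (?, ?)",
    [(1, "book"), (2, "lamp")],
)
cur.executemany(
    "INSERT INTO purchases (user_id, item_id) VALUES (?, ?)",
    [(10, 1), (11, 1), (10, 2)],
)

# Retrieve information by joining the fact table to the dimension table.
cur.execute(
    "SELECT p.user_id, i.name "
    "FROM purchases p JOIN items i ON p.item_id = i.item_id "
    "WHERE p.item_id = ? ORDER BY p.user_id",
    (1,),
)
book_buyers = cur.fetchall()

# Delete rows, then count what is left.
cur.execute("DELETE FROM purchases WHERE user_id = ?", (11,))
conn.commit()
cur.execute("SELECT COUNT(*) FROM purchases")
remaining = cur.fetchone()[0]

conn.close()
```

Note that every row inserted into purchases must supply the same columns, which is the fixed-schema constraint discussed above.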

Non-relational databases, also known as 'NoSQL', follow a very different data model. Instead of being formed of tables with multiple columns, these systems are designed with alternative layouts such as key-value stores, where a row of information (such as a customer account) has a key (such as an account ID) and an arbitrary amount of information in the value field. For example, the value could be a single item or a nested series of other key-values. This flexibility means that NoSQL databases can store information with diverse schemas even in the same table, since the fields in the value do not need to be specifically defined. Some of these systems allow us to create indices on particular fields within the value, just as for relational systems. In addition to key-value databases (such as Redis) and document stores (such as MongoDB), NoSQL systems also include columnar stores, where data are co-located in files based primarily on column chunks rather than rows (examples include Cassandra and Druid), and graph databases such as Neo4j, which are optimized for data composed of nodes and edges (such as those we studied in the context of spectral clustering in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning). We will use MongoDB and Redis in our example in this chapter.
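The key-value model can be illustrated without a running database. In the sketch below, a plain Python dictionary stands in for the store, and values are serialized to JSON strings much as they might be before being written to a system such as Redis. The helper names put and get, the key format account:N, and the sample records are all assumptions made for illustration, not the API of any particular database.

```python
import json

# A plain dict stands in for a key-value store (assumption for illustration).
store = {}


def put(key, value):
    """Serialize the value to a JSON string and store it under the key."""
    store[key] = json.dumps(value)


def get(key):
    """Return the deserialized value, or None if the key is absent."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


# Two records with different fields coexist; no schema is declared up front.
put("account:1", {"name": "Ana", "plan": "basic"})
put(
    "account:2",
    {
        "name": "Raj",
        "orders": [{"item_id": 7, "qty": 2}],  # nested list of key-values
        "address": {"city": "Pune"},           # nested key-values
    },
)

# A hand-built secondary index on one field inside the value,
# mimicking what some NoSQL systems maintain for us.
plan_index = {}
for key in store:
    plan = get(key).get("plan")
    if plan is not None:
        plan_index.setdefault(plan, []).append(key)
```

Here account:2 carries nested fields that account:1 lacks, which would require schema changes or extra tables in a relational system.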

In addition to storing data with flexible schemas, such as the nested JSON strings we might encounter in REST API calls, key-value stores can serve another function in a web application by allowing us to persist the state of a task. For quickly answered requests, such as a GET call for information, this is not necessary. However, prediction services might frequently have long-running tasks that are launched by a POST request and take time to compute a response. Even if the task is not complete, we want to return an immediate response to the client that initiated it. Otherwise, the client will stall waiting for the server to finish, which can degrade the performance of the client and runs against the philosophy of decoupling the components of the system described previously. Instead, we want to return a task identifier to the client immediately, which allows the client to poll the service to check on the progress of the task and retrieve the result when it is available. We can store the state of a task in a key-value database and provide both update methods, which record intermediate progress by editing the task record, and GET methods, which allow clients to retrieve the current status of the task. In our example, we will be using Redis as the backend to store task results for long-running applications, and also as the message queue by which tasks can communicate, a role known as a "broker".
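The submit-then-poll pattern described above can be sketched in a few lines. This is not the implementation used later in the chapter: a dictionary guarded by a lock stands in for the key-value database, a background thread stands in for the worker, and the function names (submit_task, update_task, get_task, poll_until_done) and the toy summation task are hypothetical.

```python
import threading
import time
import uuid

# Stand-in for a key-value database holding task records (assumption).
task_store = {}
store_lock = threading.Lock()


def update_task(task_id, **fields):
    """Update method: create or edit the record for a task."""
    with store_lock:
        task_store.setdefault(task_id, {}).update(fields)


def get_task(task_id):
    """GET method: return a copy of the task record, or None if unknown."""
    with store_lock:
        record = task_store.get(task_id)
        return dict(record) if record is not None else None


def run_task(task_id, values):
    """Toy long-running task: sum the values, reporting progress as it goes."""
    update_task(task_id, status="running")
    total = 0
    for n, value in enumerate(values, start=1):
        total += value
        update_task(task_id, progress=n / len(values))
    update_task(task_id, status="done", result=total)


def submit_task(values):
    """Start the task in the background and return its ID immediately."""
    task_id = str(uuid.uuid4())
    update_task(task_id, status="pending", progress=0.0, result=None)
    worker = threading.Thread(target=run_task, args=(task_id, values), daemon=True)
    worker.start()
    return task_id


def poll_until_done(task_id, interval=0.01, timeout=5.0):
    """Client side: poll the task record until it reports completion."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        record = get_task(task_id)
        if record is not None and record["status"] == "done":
            return record
        time.sleep(interval)
    raise TimeoutError("task %s did not finish in time" % task_id)


# Usage: the caller gets an identifier right away and polls for the result.
task_id = submit_task([1, 2, 3])
final = poll_until_done(task_id)
```

The key design point is that submit_task never waits on the computation: the only shared contact between client and worker is the task record, which is what lets the two sides remain decoupled.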

Now that we have covered the basic structure of our prediction service, let us examine a concrete example that ties together many of the patterns we have developed in predictive modeling tasks over the previous sections.
