Scaling a traditional, relational database is difficult because transactional guarantees (atomicity, consistency, isolation, and durability, also known as ACID) require communication among all nodes of the database. The more nodes you add, the slower your database becomes, because more nodes must coordinate transactions between each other. The way to tackle this has been to use databases that don’t adhere to these guarantees. They’re called NoSQL databases.
There are four types of NoSQL databases—document, graph, columnar, and key-value store—each with its own uses and applications. Amazon provides a NoSQL database service called DynamoDB. Unlike RDS, which effectively provides several common RDBMS engines like MySQL, Oracle Database, Microsoft SQL Server, and PostgreSQL, DynamoDB is a fully managed, proprietary, closed source key-value store. If you want to use a different type of NoSQL database—a document database like MongoDB, for example—you’ll need to spin up an EC2 instance and install MongoDB directly on that. Use the instructions in chapters 3 and 4 to do so. DynamoDB is highly available and highly durable. You can scale from one item to billions and from one request per second to tens of thousands of requests per second.
This chapter looks in detail at how to use DynamoDB: both how to administer it like any other service and how to program your applications to use it. Administering DynamoDB is simple. You can create tables and secondary indexes, and there’s only one option to tweak: its read and write capacity, which directly affects its cost and performance.
We’ll look at the basics of DynamoDB and demonstrate them by walking through a simple to-do application called nodetodo, the Hello World of modern applications. Figure 10.1 shows the to-do application nodetodo in action.
The examples in this chapter are totally covered by the Free Tier. As long as you don’t run the examples longer than a few days, you won’t pay anything for them. Keep in mind that this applies only if you created a fresh AWS account for this book and nothing else is running in that account. Try to complete the chapter within a few days, because you’ll clean up your account at the end of the chapter.
Before you get started with nodetodo, you need to know about DynamoDB 101.
DynamoDB doesn’t require administration like a traditional relational database; instead, you have other tasks to take care of. Pricing depends mostly on your storage usage and performance requirements. This section also compares DynamoDB to RDS.
With DynamoDB, you don’t need to worry about installation, updates, servers, storage, or backups:
Now you know some administrative tasks that are no longer necessary if you use DynamoDB. But you still have things to consider when using DynamoDB in production: creating tables (see section 10.4), creating secondary indexes (section 10.6), monitoring capacity usage, and provisioning read and write capacity (section 10.9).
If you use DynamoDB, you pay the following monthly:
These prices are valid for the N. Virginia (us-east-1) region. No additional traffic charges apply if you use AWS resources like EC2 servers to access DynamoDB in the same region.
Table 10.1 compares DynamoDB and RDS. Keep in mind that this is like comparing apples and oranges; the only thing DynamoDB and RDS have in common is that both are called databases.
Task | DynamoDB | RDS |
---|---|---|
Creating a table | Management Console, SDK, or CLI aws dynamodb create-table | SQL CREATE TABLE statement |
Inserting, updating, or deleting data | SDK | SQL INSERT, UPDATE, or DELETE statement, respectively |
Querying data | If you query the primary key: SDK. Querying non-key attributes isn’t possible, but you can add a secondary index or scan the entire table. | SQL SELECT statement |
Increasing storage | No action needed: DynamoDB grows with your items. | Provision more storage. |
Increasing performance | Horizontal, by increasing capacity. DynamoDB will add more servers under the hood. | Vertical, by increasing instance size; or horizontal, by adding read replicas. There is an upper limit. |
Installing the database on your machine | DynamoDB isn’t available for download. You can only use it as a service. | Download MySQL, Oracle Database, Microsoft SQL Server, or PostgreSQL, and install it on your machine. |
Hiring an expert | Search for special DynamoDB skills. | Search for general SQL skills or special skills, depending on the database engine. |
DynamoDB is a key-value store that organizes your data in tables. Each table contains items (values) that are identified by keys. A table can also maintain secondary indexes for data look-up in addition to the primary key. In this section, you’ll look at these basic building blocks of DynamoDB, ending with a brief comparison of NoSQL databases.
A DynamoDB table has a name and organizes a collection of items. An item is a collection of attributes. An attribute is a name-value pair. The attribute value can be scalar (number, string, binary, boolean), multivalued (number set, string set, binary set), or a JSON document (object, array). Items in a table aren’t required to have the same attributes; there is no enforced schema.
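As a rough illustration of this data model, the following sketch shows two items of the same table in the notation used by the low-level DynamoDB API, where every attribute value is wrapped in a type descriptor (S for string, N for number, SS for string set, BOOL for boolean, M for map). The attribute names and values are made up for this example, and the helper attributeNames is hypothetical.

```javascript
// Two hypothetical items of the same table. Attribute values use the
// low-level DynamoDB notation: {TYPE: value}.
const itemA = {
  uid: {S: 'michael'},                     // string
  logins: {N: '42'},                       // numbers are sent as strings
  tags: {SS: ['admin', 'author']},         // string set
  active: {BOOL: true},                    // boolean
  address: {M: {city: {S: 'Berlin'}}}      // JSON document (map)
};
const itemB = {
  uid: {S: 'andreas'},
  phone: {S: '0123456789'}                 // itemA has no phone attribute
};

// Returns the attribute names of an item in sorted order.
function attributeNames(item) {
  return Object.keys(item).sort();
}

console.log(attributeNames(itemA));
console.log(attributeNames(itemB));
```

Both items can live in the same table because only the key attribute (uid in this sketch) has to be present on every item.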
You can create a table with the Management Console, CloudFormation, SDKs, or the CLI. The following example shows how you create a table with the CLI (don’t try to run this command now—you’ll create a table later in the chapter):
If you plan to run multiple applications that use DynamoDB, it’s good practice to prefix your tables with the name of your application. You can also add tables via the Management Console. Keep in mind that you can’t change the name of a table and the key schema. But you can add attribute definitions and change the provisioned throughput.
A primary key is unique within a table and identifies an item. You need the primary key to look up an item. The primary key is either a hash or a hash and a range.
Hash keys
A hash key uses a single attribute of an item to create a hash index. If you want to look up an item based on its hash key, you need to know the exact hash key. A user table could use the user’s email as a hash primary key. A user then can be retrieved if you know the hash key (email, in this case).
A hash and range key uses two attributes of an item to create a more powerful index. The first attribute is the hash part of the key, and the second part is the range. To look up an item, you need to know the exact hash part of the key, but you don’t need to know the range part. The range part is sorted within the hash. This allows you to query the range part of the key from a certain starting point. A message table can use a hash and range as its primary key; the hash is the email of the user, and the range is a timestamp. You can now look up all messages of a user that are newer than a specific timestamp.
Table 10.2 compares DynamoDB to several NoSQL databases. Keep in mind that all of these databases have pros and cons, and the table shows only a high-level comparison of how they can be used on top of AWS.
Task | DynamoDB (key-value store) | MongoDB (document store) | Neo4j (graph store) | Cassandra (columnar store) | Riak KV (key-value store) |
---|---|---|---|---|---|
Run the database on AWS in production. | One click: it’s a managed service. | Cluster of EC2 instances, self-maintained. | Cluster of EC2 instances, self-maintained. | Cluster of EC2 instances, self-maintained. | Cluster of EC2 instances, self-maintained. |
Increase available storage while running. | Not necessary. The database grows automatically. | Add more EC2 instances (replica set). | Not possible (the increasing size of EBS volumes requires downtime). | Add more EC2 instances. | Add more EC2 instances. |
Imagine a team of developers working on a new app using DynamoDB. During development, each developer needs an isolated database so as not to corrupt the other team members’ data. They also want to write unit tests to make sure their app is working. You could create a unique set of DynamoDB tables with a CloudFormation stack per developer to separate them, or you could use a local DynamoDB. AWS provides a Java mockup of DynamoDB, which is available for download at http://mng.bz/27h5. Don’t run it in production! It’s only made for development purposes and provides the same functionality as DynamoDB, but it uses a different implementation: only the API is the same.
To minimize the overhead of a programming language, you’ll use Node.js/JavaScript to create a small to-do application that can be used via the terminal on your local machine. Let’s call the application nodetodo. nodetodo will use DynamoDB as a database. With nodetodo, you can do the following:
nodetodo supports multiple users and can track tasks with or without a due date. To help users deal with many tasks, a task can be assigned to a category. nodetodo is accessed via the terminal. Here’s how you would use nodetodo via the terminal to add a user (don’t try to run this command now—it’s not yet implemented):
To add a new task, you would do the following (don’t try to run this command now—it’s not yet implemented):
You would mark a task as finished as follows (don’t try to run this command now—it’s not yet implemented):
# node index.js task-done <uid> <tid>
$ node index.js task-done michael 1432187491647
=> task completed with tid 1432187491647
You should also be able to list tasks. Here’s how you would use nodetodo to do that (don’t try to run this command now—it’s not yet implemented):
# node index.js task-ls <uid> [<category>] [--overdue|--due|...]
$ node index.js task-ls michael
=> tasks [...]
To implement an intuitive CLI, nodetodo uses docopt, a command-line interface description language, to describe the CLI interface. The supported commands are as follows:
In the rest of the chapter, you’ll implement those commands. The following listing shows the full CLI description of all the commands, including parameters.
DynamoDB isn’t comparable to a traditional relational database in which you create, read, update, or delete data with SQL. You’ll access DynamoDB with an SDK to call the HTTP REST API. You must integrate DynamoDB into your application; you can’t take an existing application that uses a SQL database and run it on DynamoDB. To use DynamoDB, you need to write code!
A table in DynamoDB organizes your data. You aren’t required to define all the attributes that table items will have. DynamoDB doesn’t need a static schema like a relational database, but you must define the attributes that are used as the primary key in your table. In other words, you must define the table’s primary key. To do so, you’ll use the AWS CLI. The aws dynamodb create-table command has four mandatory options:
You’ll now create a table for the users of the nodetodo application and a table that will contain all the tasks.
Before you create a table for nodetodo users, you must think carefully about the table’s name and primary key. We suggest that you prefix all your tables with the name of your application. In this case, the table name is todo-user. To choose a primary key, you have to think about the queries you’ll make in the future and whether there is something unique about your data items. Users will have a unique ID, called uid, so it makes sense to choose the uid attribute as the primary key. You must also be able to look up users based on the uid to implement the user command. If you want a single attribute to be your primary key, you can always create a hash index: an unordered index based on the hash key. The following example shows a user table where uid is used as the primary hash key:
Because users will only be looked up based on the known uid, it’s fine to use a hash key. Next you’ll create the user table, structured like the previous example, with the help of the AWS CLI:
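For comparison, the same table definition can be expressed with the createTable operation of the Node.js SDK. This is only a sketch and not the CLI command itself: db is assumed to be an AWS.DynamoDB client from the aws-sdk (version 2) package, the throughput values are assumptions, and the function names are hypothetical.

```javascript
// Builds createTable parameters for a table with a single hash key.
function userTableParams() {
  return {
    TableName: 'todo-user',
    AttributeDefinitions: [
      {AttributeName: 'uid', AttributeType: 'S'}   // S = string
    ],
    KeySchema: [
      {AttributeName: 'uid', KeyType: 'HASH'}      // hash key only
    ],
    ProvisionedThroughput: {
      ReadCapacityUnits: 5,    // assumed values, adjust to your workload
      WriteCapacityUnits: 5
    }
  };
}

// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
// Calls back with the table status reported by DynamoDB.
function createUserTable(db, callback) {
  db.createTable(userTableParams(), function(err, data) {
    if (err) { return callback(err); }
    callback(null, data.TableDescription.TableStatus);
  });
}
```

With the SDK installed, a call could look like createUserTable(new AWS.DynamoDB({region: 'us-east-1'}), console.log).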
Creating a table takes some time. Wait until the status changes to ACTIVE. You can check the status of a table as follows:
Tasks always belong to a user, and all commands that are related to tasks include the user’s ID. To implement the task-ls command, you need a way to query the tasks based on the user’s ID. In addition to the hash key, you can use a hash and range key. Because all interactions with tasks require the user’s ID, you can choose uid as the hash part and a task ID (tid), the timestamp of creation, as the range part of the key. Now you can make queries that include the user’s ID and, if needed, the task’s ID.
This solution has one limitation: users can add only one task per timestamp. Our timestamp comes with millisecond resolution, so it should be fine. But you should still guard against the case where a user adds two tasks at exactly the same time; otherwise the second task would silently overwrite the first.
A hash and range key uses two of your table attributes. For the hash part of the key, an unordered hash index is maintained; the range part is kept in a sorted range index. The combination of the hash and the range uniquely identifies the item. The following data set shows the combination of unsorted hash parts and sorted range parts:
nodetodo offers the ability to get all tasks for a user. If the tasks have only a primary hash key, this will be difficult, because you need to know the key to extract them from DynamoDB. Luckily, the hash and range key makes things easier, because you only need to know the hash portion of the key to extract the items. For the tasks, you’ll use uid as the known hash portion. The range part is tid. The task ID is defined as the timestamp of task creation. You’ll now create the task table, using two attributes to create a hash and range index:
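The key schema described here can be sketched as parameters for the SDK operation createTable. The throughput values and the function name taskTableParams are assumptions for illustration; the point is the second KeySchema entry of type RANGE.

```javascript
// Builds createTable parameters for a table with a hash and range key.
function taskTableParams() {
  return {
    TableName: 'todo-task',
    AttributeDefinitions: [
      {AttributeName: 'uid', AttributeType: 'S'},  // user ID, a string
      {AttributeName: 'tid', AttributeType: 'N'}   // task ID, a timestamp
    ],
    KeySchema: [
      {AttributeName: 'uid', KeyType: 'HASH'},     // hash part
      {AttributeName: 'tid', KeyType: 'RANGE'}     // range part, kept sorted
    ],
    // Assumed values, adjust to your workload.
    ProvisionedThroughput: {ReadCapacityUnits: 5, WriteCapacityUnits: 5}
  };
}

console.log(JSON.stringify(taskTableParams(), null, 2));
```

Only the attributes used in the key appear in AttributeDefinitions; all other task attributes remain schemaless.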
Wait until the table status changes to ACTIVE when you run aws dynamodb describe-table --table-name todo-task. When both tables are ready, you’ll add some data.
You have two tables up and running. To use them, you need to add some data. You’ll access DynamoDB via the Node.js SDK, so it’s time to set up the SDK and some boilerplate code before you implement adding users and tasks.
Node.js is a platform to execute JavaScript in an event-driven environment so you can easily build network applications. To install Node.js, visit https://nodejs.org and download the package that fits your OS.
After Node.js is installed, you can verify that everything works by typing node --version into your terminal. Your terminal should respond with something similar to v0.12.*. Now you’re ready to run JavaScript examples like nodetodo for AWS.
To get started with Node.js and docopt, you need some magic lines to load all the dependencies and do some configuration work. Listing 10.2 shows how this can be done.
As usual, you’ll find the code in the book’s code repository on GitHub: https://github.com/AWSinAction/code. nodetodo is located in /chapter10/.
Docopt is responsible for reading all the arguments passed to the process. It returns a JavaScript object, where the arguments are mapped to the described parameters in the CLI description.
Next you’ll implement the features of nodetodo. You can use the putItem SDK operation to add data to DynamoDB like this:
The first step is to add data to nodetodo.
You can add a user to nodetodo by calling nodetodo user-add <uid> <email> <phone>. In Node.js, you do this using the code in the following listing.
When you make a call to the AWS API, you always do the following:
1. Create a JavaScript object (map) filled with the needed parameters (the params variable).
2. Invoke the function on the AWS SDK.
3. Check whether the response contains an error, or process the returned data.
Therefore you only need to change the content of params if you want to add a task instead of a user.
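The three steps above can be sketched for the user-add command as follows. This is a minimal sketch rather than the original listing: db is assumed to be an AWS.DynamoDB client (aws-sdk version 2), the function name addUser is hypothetical, and the ConditionExpression that rejects duplicate user IDs is an assumption about the desired behavior.

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
function addUser(db, uid, email, phone, callback) {
  // 1. Create a JavaScript object filled with the needed parameters.
  const params = {
    TableName: 'todo-user',
    Item: {
      uid: {S: uid},
      email: {S: email},
      phone: {S: phone}
    },
    // Assumption: fail instead of overwriting an existing user.
    ConditionExpression: 'attribute_not_exists(uid)'
  };
  // 2. Invoke the function on the AWS SDK.
  db.putItem(params, function(err) {
    // 3. Check whether the response contains an error.
    if (err) { return callback(err); }
    callback(null, 'user added with uid ' + uid);
  });
}
```

Passing the client as an argument keeps the function easy to test with a stub that records the params object.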
You can add a task to nodetodo by calling nodetodo task-add <uid> <description> [<category>] [--dueat=<yyyymmdd>]. In Node.js, you do this with the code shown in the following listing.
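A sketch of how adding a task could look, under the same assumptions as before: db is an AWS.DynamoDB client (aws-sdk version 2), and the function name addTask, the options object, and the attribute names description, category, and due are chosen for illustration. The condition expression makes the write fail if an item with the same uid and tid already exists, instead of overwriting it.

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
// options may contain category (string) and dueat (yyyymmdd string).
function addTask(db, uid, description, options, callback) {
  const tid = Date.now();  // task ID = creation timestamp in milliseconds
  const params = {
    TableName: 'todo-task',
    Item: {
      uid: {S: uid},
      tid: {N: tid.toString()},
      description: {S: description}
    },
    // Reject the write if this user already has a task with this tid.
    ConditionExpression:
      'attribute_not_exists(uid) AND attribute_not_exists(tid)'
  };
  // Optional attributes are only added when they were provided.
  if (options.category) {
    params.Item.category = {S: options.category};
  }
  if (options.dueat) {
    params.Item.due = {N: String(options.dueat)};
  }
  db.putItem(params, function(err) {
    if (err) { return callback(err); }
    callback(null, tid);
  });
}
```

Because there is no enforced schema, a task without a category or due date simply has fewer attributes.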
Now you can add users and tasks to nodetodo. Wouldn’t it be nice if you could retrieve all this data?
DynamoDB is a key-value store. The key is usually the only way to retrieve data from such a store. When designing a data model for DynamoDB, you must be aware of that limitation when you create tables (you did so in section 10.4). If you can use only one key to look up data, you’ll sooner or later experience difficulties. Luckily, DynamoDB provides two other ways to look up items: a secondary index key lookup and the scan operation. You’ll start by retrieving data with its primary key and continue with more sophisticated methods of data retrieval.
DynamoDB lets you retrieve changes to a table as soon as they’re made. A stream provides all write (create, update, delete) operations to your table items. The order is consistent within a hash key:
The simplest form of data retrieval is looking up a single item by its primary key. The getItem SDK operation to get a single item from DynamoDB can be used like this:
The command nodetodo user <uid> must retrieve a user by the user’s ID (uid). Translated to the Node.js AWS SDK, this looks like the following listing.
You can also use the getItem operation to retrieve data by primary hash and range key. The only change is that Key has two entries instead of one. getItem returns one item or no items; if you want to get multiple items, you need to query DynamoDB.
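The difference between the two kinds of keys can be sketched as follows. The function names getUser and getTask are hypothetical, and db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).

// Looks up a user by its hash key: Key has a single entry.
function getUser(db, uid, callback) {
  const params = {
    TableName: 'todo-user',
    Key: {uid: {S: uid}}
  };
  db.getItem(params, function(err, data) {
    if (err) { return callback(err); }
    // data.Item is missing if no item matches the key.
    callback(null, data.Item || null);
  });
}

// Looks up a task by its hash and range key: Key has two entries.
function getTask(db, uid, tid, callback) {
  const params = {
    TableName: 'todo-task',
    Key: {
      uid: {S: uid},
      tid: {N: String(tid)}
    }
  };
  db.getItem(params, function(err, data) {
    if (err) { return callback(err); }
    callback(null, data.Item || null);
  });
}
```

Both functions call back with null when no item exists for the given key, which the caller can report as "not found".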
If you want to retrieve not a single item but a collection of items, you must query DynamoDB. Retrieving multiple items by primary key only works if your table has a hash and range key. Otherwise, the hash will only identify a single item. The query SDK operation to get a collection of items from DynamoDB can be used like this:
The query operation also lets you specify an optional FilterExpression. The syntax of FilterExpression works like KeyConditionExpression, but no index is used for filters. Filters are applied to all matches that KeyConditionExpression returns.
To list all tasks for a certain user, you must query DynamoDB. The primary key of a task is the combination of the uid hash part and the tid range part. To get all tasks for a user, KeyConditionExpression only requires the equality of the hash part of the primary key. The implementation of nodetodo task-ls <uid> [<category>] [--overdue|--due|--withoutdue|--futuredue] is shown next.
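A sketch of how the query parameters could be assembled, assuming the task attributes category and due (a yyyymmdd number) and covering only the category and overdue filters. The function names and the options object are hypothetical, and db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// Builds query parameters for all tasks of one user.
// options may contain category (string), overdue (boolean), and
// today (yyyymmdd string, needed when overdue is set).
function listTasksParams(uid, options) {
  const params = {
    TableName: 'todo-task',
    // Only the hash part of the primary key is required.
    KeyConditionExpression: 'uid = :uid',
    ExpressionAttributeValues: {':uid': {S: uid}}
  };
  // Filters are applied after the key condition; no index is used.
  const filters = [];
  if (options.category) {
    filters.push('category = :category');
    params.ExpressionAttributeValues[':category'] = {S: options.category};
  }
  if (options.overdue) {
    filters.push('due < :today');
    params.ExpressionAttributeValues[':today'] = {N: String(options.today)};
  }
  if (filters.length > 0) {
    params.FilterExpression = filters.join(' AND ');
  }
  return params;
}

// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
function listTasks(db, uid, options, callback) {
  db.query(listTasksParams(uid, options), function(err, data) {
    if (err) { return callback(err); }
    callback(null, data.Items);
  });
}
```

Note that filtered-out items are still read from the table, so a filter reduces the size of the response but not the work DynamoDB has to do.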
Two problems arise with the query approach:
You can solve those problems with secondary indexes. Let’s look at how they work.
A secondary index is a projection of your original table that’s automatically maintained by DynamoDB. You can query a secondary index like you query the index containing all the primary keys of a table. You can imagine a global secondary index as a read-only DynamoDB table that’s automatically updated by DynamoDB: whenever you change the parent table, all indexes are asynchronously (eventually consistent!) updated as well. Figure 10.2 shows how a secondary index works.
A secondary index comes at a price: the index requires storage (the same cost as for the original table). You must provision additional write-capacity units for the index as well because a write to your table will cause a write to the secondary index.
A huge benefit of DynamoDB is that you can provision capacity based on your workload. If one of your table indexes gets tons of read traffic, you can increase the read capacity of that index. You can fine-tune your database performance by provisioning sufficient capacity for your tables and indexes. You’ll learn more about that in section 10.9.
Back to nodetodo. To implement the retrieval of tasks by category, you’ll add a secondary index to the todo-task table. This will allow you to make queries by category. A hash and range key is used: the hash is the category attribute, and the range is the tid attribute. The index also needs a name: category-index. You can find the following CLI command in the README.md file in nodetodo’s code folder:
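The index described above can also be expressed as parameters for the SDK operation updateTable. This sketch is not the CLI command from the README; the throughput values, the projection of all attributes, and the function name categoryIndexParams are assumptions.

```javascript
// Builds updateTable parameters that add a global secondary index
// named category-index to the todo-task table.
function categoryIndexParams() {
  return {
    TableName: 'todo-task',
    // The new key attribute category must be defined here as well.
    AttributeDefinitions: [
      {AttributeName: 'uid', AttributeType: 'S'},
      {AttributeName: 'tid', AttributeType: 'N'},
      {AttributeName: 'category', AttributeType: 'S'}
    ],
    GlobalSecondaryIndexUpdates: [{
      Create: {
        IndexName: 'category-index',
        KeySchema: [
          {AttributeName: 'category', KeyType: 'HASH'},  // hash part
          {AttributeName: 'tid', KeyType: 'RANGE'}       // range part
        ],
        // Assumption: copy all attributes into the index.
        Projection: {ProjectionType: 'ALL'},
        // Assumed values; the index has its own capacity.
        ProvisionedThroughput: {ReadCapacityUnits: 5, WriteCapacityUnits: 5}
      }
    }]
  };
}

console.log(JSON.stringify(categoryIndexParams(), null, 2));
```

With an AWS.DynamoDB client db, the index would be requested with db.updateTable(categoryIndexParams(), callback).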
A global secondary index takes some time to be created. You can use the CLI to find out if the index is active:
$ aws dynamodb describe-table --table-name=todo-task --query "Table.GlobalSecondaryIndexes"
The following listing shows how the implementation of nodetodo task-la <category> [--overdue|...] uses the query operation.
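A minimal sketch of such a query, leaving out the optional due-date filters: the only difference to a query on the table itself is the IndexName parameter and a key condition on the hash part of the index. The function name tasksByCategory is hypothetical, and db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
function tasksByCategory(db, category, callback) {
  const params = {
    TableName: 'todo-task',
    IndexName: 'category-index',   // query the index, not the table
    KeyConditionExpression: 'category = :category',
    ExpressionAttributeValues: {':category': {S: category}}
  };
  db.query(params, function(err, data) {
    if (err) { return callback(err); }
    callback(null, data.Items);
  });
}
```

The result contains tasks of all users in that category, because uid is no longer part of the key condition.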
But there are still situations where a query doesn’t work: you can’t retrieve all users. Let’s look at what a table scan can do for you.
Sometimes you can’t work with keys; instead, you need to go through all the items in the table. That’s not efficient, but in some situations, it’s okay. DynamoDB provides the scan operation to scan all items in a table:
The next listing shows the implementation of nodetodo user-ls [--limit=<limit>] [--next=<id>]. A paging mechanism is used to prevent too many items from being returned.
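The paging mechanism can be sketched with the scan parameters Limit and ExclusiveStartKey and the LastEvaluatedKey field of the response. The function name listUsers and the shape of the result object are assumptions; db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
// limit: maximum number of items to read in this page.
// next: uid returned by the previous page, or null for the first page.
function listUsers(db, limit, next, callback) {
  const params = {
    TableName: 'todo-user',
    Limit: limit
  };
  if (next) {
    // Continue scanning after the last key of the previous page.
    params.ExclusiveStartKey = {uid: {S: next}};
  }
  db.scan(params, function(err, data) {
    if (err) { return callback(err); }
    callback(null, {
      users: data.Items.map(function(item) { return item.uid.S; }),
      // LastEvaluatedKey is only present if more items may follow.
      next: data.LastEvaluatedKey ? data.LastEvaluatedKey.uid.S : null
    });
  });
}
```

A caller repeats the call with the returned next value until next is null.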
The scan operation reads all items in the table. This example didn’t filter any data, but you can use FilterExpression as well. Note that you shouldn’t use the scan operation too often—it’s flexible but not efficient.
DynamoDB doesn’t support transactions the same way a traditional database does. You can’t modify (create, update, delete) multiple items in a single transaction; the atomic unit in DynamoDB is a single item.
In addition, DynamoDB is eventually consistent. That means it’s possible that if you create an item (version 1), update that item to version 2, and then get that item, you may see the old version 1; if you wait and get the item again, you’ll see version 2. Figure 10.3 shows this process. The reason for this behavior is that the item is persisted on multiple servers in the background. Depending on which server answers your request, the server may not have the latest version of the item.
You can prevent eventually consistent reads by adding "ConsistentRead": true to the DynamoDB request to get strongly consistent reads. Strongly consistent reads are supported by the getItem, query, and scan operations. But a strongly consistent read takes longer and consumes more read capacity than an eventually consistent read. Reads from a global secondary index are always eventually consistent because the index itself is eventually consistent.
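As a small sketch, a strongly consistent lookup of a user differs from the default lookup only in one parameter. The function name getUserConsistent is hypothetical, and db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
function getUserConsistent(db, uid, callback) {
  const params = {
    TableName: 'todo-user',
    Key: {uid: {S: uid}},
    ConsistentRead: true   // request a strongly consistent read
  };
  db.getItem(params, function(err, data) {
    if (err) { return callback(err); }
    callback(null, data.Item || null);
  });
}
```

Leaving out ConsistentRead gives the cheaper, eventually consistent read.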
Like the getItem operation, the deleteItem operation requires that you specify the primary key you want to delete. Depending on whether your table uses a hash or a hash and range key, you must specify one or two attributes.
You can remove a user with nodetodo by calling nodetodo user-rm <uid>. In Node.js, this is as shown in the following listing.
Removing a task is similar: nodetodo task-rm <uid> <tid>. The only changes are the table name and the key, which now consists of a hash and a range part, as shown in the next listing.
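Both removals can be sketched side by side to show that only the table name and the number of key attributes differ. The function names removeUser and removeTask are hypothetical, and db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).

// Deletes a user: the key consists of the hash key only.
function removeUser(db, uid, callback) {
  const params = {
    TableName: 'todo-user',
    Key: {uid: {S: uid}}
  };
  db.deleteItem(params, function(err) {
    if (err) { return callback(err); }
    callback(null, 'user removed with uid ' + uid);
  });
}

// Deletes a task: the key consists of the hash and the range part.
function removeTask(db, uid, tid, callback) {
  const params = {
    TableName: 'todo-task',
    Key: {
      uid: {S: uid},
      tid: {N: String(tid)}
    }
  };
  db.deleteItem(params, function(err) {
    if (err) { return callback(err); }
    callback(null, 'task removed with tid ' + tid);
  });
}
```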
You’re now able to create, read, and delete items in DynamoDB. The only operation missing is updating.
You can update an item with the updateItem operation. You must identify the item you want to update by its key; you can also provide an UpdateExpression to specify the updates you want to perform. You can use one or a combination of the following update actions:
In nodetodo, you can mark a task as done by calling nodetodo task-done <uid> <tid>. To implement this feature, you need to update the task item, as shown in Node.js in the following listing.
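One way to sketch this update is a SET action that stores the completion date on the task item. The attribute name completed, its yyyymmdd number format, and the function name markTaskDone are assumptions; db is assumed to be an AWS.DynamoDB client (aws-sdk version 2).

```javascript
// db is assumed to be an AWS.DynamoDB client (aws-sdk v2).
// completedOn: completion date as a yyyymmdd string (assumed format).
function markTaskDone(db, uid, tid, completedOn, callback) {
  const params = {
    TableName: 'todo-task',
    // The item to update is identified by its hash and range key.
    Key: {
      uid: {S: uid},
      tid: {N: String(tid)}
    },
    // SET creates the attribute or overwrites an existing value.
    UpdateExpression: 'SET completed = :yyyymmdd',
    ExpressionAttributeValues: {
      ':yyyymmdd': {N: String(completedOn)}
    }
  };
  db.updateItem(params, function(err) {
    if (err) { return callback(err); }
    callback(null, 'task completed with tid ' + tid);
  });
}
```

All other attributes of the task stay untouched; updateItem changes only what the UpdateExpression names.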
That’s it! You’ve implemented all of nodetodo’s features.
When you create a DynamoDB table or a global secondary index, you must provision throughput. Throughput is divided into read and write capacity. DynamoDB uses ReadCapacityUnits and WriteCapacityUnits to specify the throughput of a table or global secondary index. But how is a capacity unit defined? Let’s start by doing some experimentation with the command-line interface:
More abstract rules for throughput consumption are as follows:
If capacity units aren’t your favorite unit, you can use the AWS Simple Monthly Calculator at http://aws.amazon.com/calculator to calculate your capacity needs by providing details of your read and write workload.
The provisioned throughput of a table or a global secondary index is defined per second. If you provision five read capacity units with ReadCapacityUnits=5, you can make five strongly consistent getItem requests per second for that table, provided that no item is larger than 4 KB. If you make more requests than are provisioned, DynamoDB will first throttle your requests. If you make many more requests than are provisioned, DynamoDB will reject your requests.
It’s important to monitor how many read and write capacity units you require. Fortunately, DynamoDB sends some useful metrics to CloudWatch every minute. To see the metrics, open the AWS Management Console, navigate to the DynamoDB service, and select one of the tables. Figure 10.4 shows the CloudWatch metrics for the todo-user table.
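The arithmetic behind capacity units can be sketched with two small functions. They assume the rules for provisioned tables: a strongly consistent read consumes one unit per started 4 KB of item size, an eventually consistent read half of that, and a write one unit per started 1 KB. The function names are hypothetical, and 1 KB is taken as 1,024 bytes.

```javascript
// Read capacity units consumed by reading a single item.
function readCapacityUnits(itemSizeBytes, stronglyConsistent) {
  // Item size is rounded up to the next multiple of 4 KB.
  const units = Math.max(1, Math.ceil(itemSizeBytes / 4096));
  // Eventually consistent reads cost half as much.
  return stronglyConsistent ? units : units / 2;
}

// Write capacity units consumed by writing a single item.
function writeCapacityUnits(itemSizeBytes) {
  // Item size is rounded up to the next multiple of 1 KB.
  return Math.max(1, Math.ceil(itemSizeBytes / 1024));
}

console.log(readCapacityUnits(4096, true));   // 1
console.log(readCapacityUnits(4097, true));   // 2
console.log(readCapacityUnits(4096, false));  // 0.5
console.log(writeCapacityUnits(1500));        // 2
```

With ReadCapacityUnits=5, for example, items of up to 4 KB can be read five times per second with strong consistency or ten times per second with eventual consistency.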
You can modify the provisioned throughput whenever you like, but you can only decrease the throughput capacity of a single table four times a day.
Don’t forget to delete your DynamoDB tables after you finish this section. Use the Management Console to do so.