Chapter 6
The Datomic Database

Datomic is a new database that offers developers new ways to think about building traditional database-backed applications, and exciting features that make it possible to build entirely new types of applications. It has a modern, cloud-focused design that is easy and flexible to both develop on and deploy. To top it off, it's built by the same people that brought you Clojure.

Just as working with Clojure changes the way you think about programming, Datomic will change the way you think about databases. It is definitely not just another variation on SQL or another narrowly focused NoSQL system. Datomic embodies new ideas about how you think about data and what it is like to work with a general-purpose transactional database.

In this chapter, you'll explore how Datomic works, starting from fundamentals like its data model and working up to its high-level APIs. Using examples, you'll look at how to build an application backed by Datomic, ways to think about modeling domain data in Datomic, and how to tap into the powerful Clojure API.

DATOMIC BASICS

The idea for Datomic was hatched when Rich Hickey started thinking about what it would look like when some of the big ideas that went into the design of Clojure were applied to a database system. With concepts like immutable data, design focused on decomposition, and the importance of value-oriented semantics, the idea grew into a collaboration between Rich Hickey, Stuart Halloway (cofounder of Cognitect and Clojure core committer), and a small group of developers at Cognitect. It was first released in 2012.

Datomic doesn't fit cleanly into a single category. Multiple labels fit it, including NoSQL, graph database, distributed database, transactional store, analytics store, and logic database. It took inspiration from multiple sources, but its basic goal is to be a drop-in replacement for the main cases where you would otherwise use a relational database as a transactional store. Stuart Halloway says that it's targeted at the ninety-six percent use cases of relational databases, leaving off the top four percent of high write-volume users like Netflix, Facebook, etc.

This raises the question: why use Datomic instead of SQL? This section examines some of the ways that Datomic sets itself apart from this traditional and venerable technology. You'll learn about Datomic's data model, query language, data-focused transaction syntax, and its design and deployment architecture.

Why Datomic?

Datomic is a powerful alternative to relational SQL-based databases that's suitable for the vast majority of systems currently built on these more traditional databases. It represents a re-thinking of the role, model, structure, and nature of a database system, incorporating modern concepts of cloud deployment, cheap storage, distributed systems, and persistent data structures/immutable data. It offers a number of key advantages over relational databases.

  • Deployment flexibility. In contrast to monolithic relational databases, Datomic has a pluggable storage backend, which can use many different backing databases to store its data. This means that, when deploying applications, you can use whatever system is most convenient for the deployment environment and your operations infrastructure. For instance, for AWS deployments it's simple and natural to use the DynamoDB backend. If your organization already runs a Cassandra, Riak, Couchbase, Infinispan, or even a SQL database, you can use any of those as the storage backend.
  • Data-oriented design. When learning functional programming there's often a series of “a-ha!” moments as the elegance and simplicity of the data-oriented approach click into place, and the advantages over models that emphasize mutation and data hiding become clear. Datomic is animated by the same kind of mathematical beauty. Once you grasp the data model and how to work with it, using Datomic has a liberating freedom and power. It's a stark contrast to relational databases, with their mutate-in-place semantics and the feeling that you are working with data while wearing bulky and unwieldy gloves.
  • Entity orientation. The central idea of relational databases is tables, but outside of very simple toy applications it's very rare that a table has a one-to-one correspondence with the entity it represents. Typically it takes multiple tables to represent a single entity in realistic relational data models. This represents a real impedance mismatch with the way developers think of entities, and has cost untold numbers of man-years in developing ultimately awkward and unsatisfying solutions like ORMs and query generators. By contrast, the structure of data in Datomic's entity model more closely matches the structure of data as a developer would model it and work with it in an application. Datomic's query and pull APIs also provide powerful tools for elegantly pulling out complex, nested entities in a simple and declarative way.
  • Read scalability. In Datomic, reads are separated from writes and decentralized. A single Datomic database can power mixed workloads—transactional, analytical, and even batch—while retaining low latency for transactional performance, and high throughput for analytical and batch performance. This is made possible by Datomic's unique architecture, where writes don't block reads, reads don't block one another, and you can horizontally scale your read capability by adding more “peer” systems. Relational databases, by contrast, can only scale vertically; scaling them horizontally requires sharding, and often fundamental changes to the architecture of your application. They also often require you to deploy multiple databases to power different workloads, with the added complexity cost of synchronizing data between them.
  • Flexibility. The schema, data model, and query system of Datomic support an incredible range of flexibility. This flexibility means that in practice your queries are rarely tied down to a schema designed to make that specific query performant. It's almost always possible to make changes to queries or add entirely new read patterns while maintaining performance, without changing your application data model. This typically translates to simpler application data models and simpler application architecture. Best practices in relational databases, on the other hand, typically call for entirely different schema designs to power, for instance, transactional and analytical workloads, because the schema and read patterns are much more closely tied together.
  • Treat the database as a value. This is one of the big ideas of rethinking the database that motivated Datomic. Relational databases give you a handle that you treat as an external resource that you can send questions to and get answers. The database and the data it contains are treated as a separate system, and from the perspective of the application, every interaction with the databases is a side effect. Datomic's immutable data model and APIs allow you to treat the database itself and the data it contains as a value for the purposes of reads as well as “speculative” writes. This opens up exciting possibilities for how you think about and design your database-driven applications.
  • Point-in-time access. Since the data in a Datomic database is immutable, you can perform read operations and queries on historical versions of your data: you can get a view of your data as it existed at any point in time.
  • Upsert. Admittedly, this might seem like a lower-level detail, particularly compared to the previous points. Upsert is a write semantic that allows you to treat the addition of new entities and changes to existing entities the same way, like a combination of a relational INSERT and UPDATE query. If you've never worked with a database that supported this feature, it's going to forever change your outlook. Upserts elegantly solve a whole class of concurrency and state-related issues with adding data, and allow you to use a single code path for your writes.
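
As a preview of how this works in practice (the schema machinery involved is covered later in this chapter): if an attribute such as :email is declared :db.unique/identity, transacting a map that matches an existing entity on that attribute updates that entity instead of creating a duplicate. A minimal, hedged sketch, assuming the peer API is aliased as d and that Jane's entity already exists with this email:

[{:db/id (d/tempid :db.part/user)  ; placeholder id
  :email "[email protected]"         ; matches the existing entity...
  :first-name "Janet"}]            ; ...so her :first-name is updated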

It's not all roses, of course. Everything in software is about trade-offs. The first major disadvantage is that it's proprietary software. While there are free-as-in-beer ways to get Datomic, there is no open source version, and for any kind of real production deployment, it's likely that you're going to have to pay for it.

You also need to consider the adoption effort. Datomic is very different from relational databases. Although some of what you know about working with SQL and relational data will translate over to working with Datomic, there is definitely a learning curve. Datomic is going to be unfamiliar territory in many ways, and will involve the same kind of transformation in the way you think about databases that functional programming brings to the way you think about writing programs. Building applications with Datomic requires that you and your team all get on board and really dig in to understand how the system works.

The best way to get there is from the ground up, starting with Datomic's data model.

The Datomic Data Model

One of the most important ingredients in the secret sauce that makes Datomic such an outstanding database is its simple, flexible, yet powerful data model. It's the reason that Datomic is often called a “database of facts,” and it closely parallels the way people normally think and talk about information when outside the confines of traditional relational databases. In this section you'll get a complete look at that data model with a focus on the basic building blocks that are used and combined in different ways to power the higher-level APIs covered later in this chapter.

A Relational Example

You're probably already familiar with the data model of relational databases. They represent data in tables, also called “relations” in formal terminology, with a row for each “thing” in the table and a column for each piece of information about the thing. For example, a table of contacts might look like what you see in Table 6.1.

Table 6.1 Table of contacts

ID   FIRST_NAME  LAST_NAME  EMAIL              EMERGENCY_CONTACT_ID
101  Jane        Doe        [email protected]  115

As Rich Hickey described in his talk “The Value of Values,” this is a model of information that's all about “place.” If you want to know Jane Doe's email, you go to the location of Jane Doe's row and retrieve it from the location of the email column. If Jane Doe changes her email, you go back to that location, delete what's there, and replace it with the new one; the old email is lost forever. This model quickly starts to look a lot less like what we as Clojure developers think of as data and more like a pile of mutable references.

Inflexibility is another drawback of the relational model. You must declare the structure of all your data in very strict terms ahead of time, and you must shoehorn the dynamic heterogeneity of the real world into fixed and structured tables. Relationships between things must be carefully modeled in advance, typically requiring multiple extra tables whose sole purpose is to describe relationships between things in other tables, and every query that uses them must be written with explicit joins, while paying careful attention to the specifics of the schema.

Datomic Datoms

With that in mind, let's compare Datomic's data model. Datomic stores a collection of 5-tuples called Datoms, which are its fundamental unit of data. The Datomic documentation describes it as a database of immutable facts, and each one of these facts is expressed as a Datom, which looks like this:

[e a v t added?]
  • e means “entity.” It is roughly analogous to the “id” field in the previous relational table. It contains a Datomic entity id, which is a Long. As with primary keys in relational databases, you don't set entity ids directly when adding data; they are internal to Datomic and are generated automatically by the transactor, Datomic's writer.
  • a means “attribute.” It's a bit like a column in a relational table, or in more familiar Clojure terms it's like the key in a hash map. The attribute part of a Datom contains the Datomic entity id of an attribute entity. You'll see what this means later, in the section on schema and modeling data.
  • v means “value.” It's similar to a cell or field in a relational table. In Datomic, the value part of a Datom contains either an entity id, if the Datom is describing a relationship between the e entity and the v entity, in which case it's called a “reference,” or some data value like a string, number, date, and so forth.

Together, entity-attribute-value gives you the basic facts. Before talking about what t and added? mean, let's take a look at what the data from the previous relational example might look like as Datoms with just the [e a v] part. You'll notice that keywords are being used in the a position rather than entity ids (Longs). In the schema section you'll see why this is the case, and how Datomic resolves them to the entities they refer to.

[[101 :first-name "Jane"]
 [101 :last-name "Doe"]
 [101 :email "[email protected]"]
 [101 :emergency-contact 115]]

It's very important to keep this [e a v] pattern in mind when thinking about data in Datomic because Datalog—Datomic's query language—is all about matching these patterns with variables and values. This is covered later in the section on querying.

The t and added? parts of the Datom together are what give Datomic its immutable characteristics, point-in-time query features, and database-as-a-value abstraction. This is also where Datomic departs almost completely from the paradigm of relational databases, because there is no real relational analog for this part of the data model.

t means “transaction.” Each write to Datomic contains one or more Datoms bound in a transaction. As part of the write process, the transaction itself is created as an entity, which has an attribute with a timestamp that corresponds to the time of the transaction. Every Datom in the transaction has the transaction's entity id in the t position.

Storing transaction entities in this way as part of the data model means that every fact in the database can be traced back to the exact point in time that it entered the system.

added? is a Boolean flag that is true if the Datom is an assertion, and false if it's a retraction. An assertion means the fact the Datom conveys is true as of the transaction time; a retraction means it's no longer true. Let's look at an example using Jane Doe and a new attribute, :favorite-color. As of the time of transaction t1, Jane Doe's favorite color is red. The state of the database looks like this.

[[101 :favorite-color "red" t1 true]]

We have a Datom for Jane Doe (entity id 101), which was added as part of transaction t1, and it's an assertion because the added? flag is true. Now let's assume Jane Doe is going through a rebellious teenage phase, and as of the time of transaction t2 she decides that her favorite color is no longer red, but it's now black. Here's the new state of the database after t2.

[[101 :favorite-color "red" t1 true]
 [101 :favorite-color "red" t2 false]
 [101 :favorite-color "black" t2 true]]

You see what happened! Rather than an update-in-place, where you would have completely deleted the idea that Jane Doe's favorite color was ever red, you have added a retraction Datom for that fact since it is no longer true, and added an assertion Datom that tells you Jane's new favorite color.

In typical usage of Datomic, you don't deal directly with the complete Datom. Most of the APIs work in terms of [e a v]—entity, attribute, and value. The transaction and added flag sit underneath an abstraction where, given the point in time of a particular transaction, you only see the entity, attribute, and value parts of Datoms that are true at that time. The transaction and added flag of the Datoms are used to build a projection of the current state of the data as of that time. For example, as of transaction t1 the state looked like this:

[[101 :favorite-color "red"]]

Whereas, as of transaction t2 the state looked like this:

[[101 :favorite-color "black"]]

This is a very important abstraction to keep in mind, because most of the time you're working in Datomic the t and added? parts of Datoms are abstracted away from you. There are also numerous ways to powerfully leverage the full Datoms, and understanding the complete Datomic data model is very important in designing systems to use it properly, so you need to be able to think about your data in both ways.
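
As a preview of how this projection is exposed (the full API is covered later in the chapter), here's a minimal sketch using the peer API's as-of function, assuming conn is an open connection and t1 is a transaction id or instant:

(require '[datomic.api :as d])

(let [db-then (d/as-of (d/db conn) t1)] ; database value as of t1
  (d/q '[:find ?color .
         :where [101 :favorite-color ?color]]
       db-then))
;; => "red"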

It's an abstraction that fits in powerfully with Clojure's broader view of data and state. As a Clojure programmer, you usually think of state as something to be carefully managed and minimized in your systems, because you realize that much of the most tedious, error-prone, and dangerous aspects of programming revolve around state management. The opinionated design of Clojure works very hard to keep state to a minimum, and places more restrictive semantics around mutable state so that developers know to handle it with care. Whenever possible, data is preferable to state.

Datomic's data model works along very similar lines. It looks at the database as a data structure to which you can only add additional facts, and treats state as a projection of that data structure.

Entity Maps

When using Datomic, one of the common ways that you'll interact with data is as entity maps. There is an API for this, which is covered in more detail in the Clojure API section, but in the meantime let's examine the basic concept. You've already seen how to describe all of the facts about an entity in terms of Datoms. Another way to look at that same data is in an entity map format.

Entity maps are conceptually quite simple. You start with the projection of the state of an entity as of a given transaction time—in other words, only the [e a v] parts of Datoms representing facts that are true as of that time. From there, it's a fairly straightforward matter to represent the Datoms about a particular entity as a map. The one catch is where the entity id goes. In Datomic entity maps, the entity id is in a special key called :db/id. Here's what Jane Doe would look like as an entity map.

{:db/id 101
 :first-name "Jane"
 :last-name "Doe"
 :email "[email protected]"
 :emergency-contact 115}
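
The entity API that produces these maps is covered later, but as a hedged preview, here's a minimal sketch using the peer API's entity function, assuming conn is an open connection and 101 is a real entity id:

(require '[datomic.api :as d])

(let [db   (d/db conn)        ; grab the current database value
      jane (d/entity db 101)] ; lazy, map-like view of entity 101
  [(:first-name jane) (:email jane)])
;; => ["Jane" "[email protected]"]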

Querying

Datalog is Datomic's query language. It's related to logic programming and also shares many similarities with SPARQL, the semantic query language for RDF, which was one of Datomic's influences. It can look strange to those more familiar with SQL, but it offers several key advantages.

  • Datalog queries are data structures—in fact, they're Clojure collections. This makes Datalog queries far more composable, and allows you to take advantage of all of Clojure's powerful features for working with collections to create and modify them. Manipulating SQL queries, on the other hand, requires awkward and error-prone string concatenation and interpolation, heavyweight parsing and rendering code, or idiosyncratic DSLs or ORMs that limit the expressive power of the queries you can write.
  • Joins are implicit. To write queries that deal with relationships between different entities, you only need to know how those relationships are expressed. Expressing these relationships is simple, transparent, and follows the natural flow of the relationships in the data. You don't need detailed knowledge of the table schema, and the specific way each table is joined, as you do in SQL. In Datalog, you're working with data directly and at a higher level, rather than with the details of how it's expressed in tables.
  • Advanced query logic is possible through rules. Datalog includes a declarative language for expressing rules, which allows for arbitrary recursion and highly customized logic inside of queries. Datalog's rules let you do things in your queries that would either be completely impossible in SQL, or would require SQL with such daunting complexity that you'd likely never want to use it in a production system.

Datomic arrives at answers to Datalog queries through unification—a concept from logic programming. For those unfamiliar with this style of programming, it may be helpful to do a bit of background reading so that the concepts are not completely foreign. A fantastic resource for this is The Reasoned Schemer by Friedman, Byrd, and Kiselyov. It uses the same question-and-response dialectic approach as their prominent book for Lisp beginners, The Little Schemer, and provides a fun, easy-to-understand, and approachable introduction to logic programming in a Lisp-like syntax.

Syntax

There are two ways to write Datalog queries: the list form and the map form. They are semantically equivalent, but are typically used in different scenarios. The list form is idiomatically most often what's used for queries written by humans, because many people find it somewhat easier to read and it requires fewer inner lists. Here's a simple query in the list form, which will return the first name and last name of the entity whose email address is [email protected].

'[:find ?first-name ?last-name
  :where [?e :email "[email protected]"]
         [?e :first-name ?first-name]
         [?e :last-name ?last-name]]
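
Running queries for real is covered in the Clojure API section, but as a hedged preview: you pass the query and a database value to the peer API's q function. A minimal sketch, assuming conn is an open connection:

(require '[datomic.api :as d])

(d/q '[:find ?first-name ?last-name
       :where [?e :email "[email protected]"]
              [?e :first-name ?first-name]
              [?e :last-name ?last-name]]
     (d/db conn))
;; => #{["Jane" "Doe"]}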

The map form is more often used when building queries programmatically, because it's much easier to destructure and manipulate the various parts of the query when they're in map keys. This is the same query in the map form.

'{:find [?first-name ?last-name]
  :where [[?e :email "[email protected]"]
          [?e :first-name ?first-name]
          [?e :last-name ?last-name]]}
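
Because the map form is plain Clojure data, manipulating it programmatically requires nothing more than the core collection functions. For example, here's a small sketch that narrows the query by conj-ing an extra clause onto its :where key:

(def base-query
  '{:find [?first-name ?last-name]
    :where [[?e :email "[email protected]"]
            [?e :first-name ?first-name]
            [?e :last-name ?last-name]]})

;; add a constraint: only match contacts whose last name is "Doe"
(update base-query :where conj '[?e :last-name "Doe"])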

The two forms are quite similar, but the map form requires that all of the query components be wrapped in lists. You'll notice that both forms have been quoted—the preceding '. This is required when used in Clojure code, because queries are written using symbols, like ?first-name and ?last-name, which throw exceptions when evaluated.

Since it's more idiomatic for queries written by humans, the list form will be used for the rest of this chapter, except in places discussing how to build and manipulate queries programmatically.

Here is what the equivalent SQL might look like. Notice that, in this simple example, the SQL version has at least a vague resemblance to the Datalog.

"SELECT first_name, last_name FROM contacts
 WHERE email='[email protected]'"

Let's take a closer look at that simple query example.

Find… What Am I Looking For?

The “find” part of the query—the first line starting with :find—is roughly analogous to the SELECT part of a SQL SELECT query. It says what you want to return, and in what format you want it. In this example, here is the find specification:

:find ?first-name ?last-name

In Datalog, variables are symbols prefixed with ?, so this find specification is returning the values for the two variables ?first-name and ?last-name, as defined later in the where section of the query. This type of find specification—where the variables are simply positionally listed after the :find keyword—returns a “relation.” This is the most common type of find specification in most queries you'll see and write because it's the most general form, and its behavior is the most similar to that of SQL SELECT.

The result, when run against the earlier sample data, is a set of tuples:

#{["Jane" "Doe"]}
Returning Collections

The second type of find specification returns a collection of all the matches for a single variable. Collection find specifications are written like this:

:find [?first-name ...]

When run against the example data again, it returns this result:

["Jane"]

The equivalent SQL looks much the same as the original, only containing one SELECT column rather than two.

Returning Only One Result

You can also write a find specification that has the same general behavior as the relation semantic, but only returns the first match. This is the “single tuple” specification, and it is written like this:

:find [?first-name ?last-name]

With this find specification, the query returns the first result tuple it finds, like this:

["Jane" "Doe"]

The SQL version adds a LIMIT clause to achieve the same behavior.

"SELECT first_name, last_name FROM contacts
 WHERE email='[email protected]'
 LIMIT 1"
Returning a Single Value

In some cases, it's useful for queries to return one scalar value. For that, use the scalar find specification, written like this:

:find ?first-name .

This find specification simply returns “Jane.” The SQL version retains the LIMIT clause and only the first_name SELECT column.

:where Is Where the Magic Happens

The body of the query, where almost all of the query logic and important structure live, is the where clauses. This part of the query does almost everything you'd do in a SQL query: conditions, joins, inner queries, filters. As such, there are probably more features dealing with what you can do in the where part of a query than with any other aspect of Datalog.

There's a great breadth of topics to cover to get a full idea of what is possible in where clauses. Truly, the level of power and flexibility that Datalog gives you in expressing query logic here is amazing, and throughout the rest of the chapter you'll see a more comprehensive survey of this functionality.

You can think of the where clauses as patterns that describe what the data in the database needs to look like to satisfy the pattern. These patterns match against the entity, attribute, and value parts of Datoms as of a given transaction-time. Let's first get a basic grasp of what's going on in our simple query. The where clauses look like this:

:where [?e :email "[email protected]"] ;; 1
       [?e :first-name ?first-name]  ;; 2
       [?e :last-name ?last-name]    ;; 3

In order to return a solution, these patterns require that three Datoms exist in the database:

  1. A Datom with any entity id (which will be bound to the variable ?e), the attribute :email, and the value [email protected].
  2. A Datom with the same entity id as the previous Datom—in other words, a Datom that's about the entity that has the :email of [email protected]—has the attribute :first-name, and has any value. The value gets bound to the variable ?first-name.
  3. A Datom with the same entity id as the previous two, the attribute :last-name, and any value. The value is, as before, bound to a variable: ?last-name.

There are a few things to note here. When values are “bound” to Datalog variables, that same value is used for the rest of the solution. In this case that means several things.

  • If there was no Datom with attribute :email and value [email protected], then these where clauses would return no solutions. The result would be an empty set.
  • The entity id bound to ?e in the first pattern must also have Datoms that match the second and third patterns. In other words, that entity-id must also have Datoms with :first-name and :last-name; otherwise the where clauses would again return no solutions.
  • If there were multiple entities that matched all of the patterns, then multiple solutions are returned.

It's important to understand how this variable binding and solution finding process works. It's possible for patterns to match multiple Datoms and have multiple possible value bindings for Datalog variables, and each one of these bindings is tested against the rest of the patterns in turn. If Datoms exist in the database that satisfy all of the patterns, then that solution is considered “unified” and is returned as part of the results. This process can be recursive, since variable bindings from earlier patterns can be linked to additional variable bindings in later patterns, which must in turn each be tested against subsequent patterns.

Let's go through a somewhat more involved example to see how this process works.

Variables and Joins

The initial example was fairly simple. Let's look at one that involves data with more relationships and a query with more variables. First, we'll extend the data to include more than one contact, and the concept that contacts can have friends. For simplicity, let's keep using only the [e a v] part of the Datoms.

[[101 :first-name "Jane"]
 [101 :last-name "Doe"]
 [101 :email "[email protected]"]
 [101 :friend 102]
 [101 :friend 103]
 [102 :first-name "Ada"]
 [102 :last-name "Lovelace"]
 [102 :email "[email protected]"]
 [102 :friend 104]
 [103 :first-name "Robert"]
 [103 :last-name "Heinlein"]
 [103 :email "[email protected]"]
 [104 :first-name "Jane"]
 [104 :last-name "Smith"]]

You still have your old friend Jane Doe, but she has two new friends. Using this data, let's look at a slightly more complex query.

'[:find ?first-name ?last-name
  :where [?e :email "[email protected]"]
         [?e :first-name ?first-name]
         [?e :last-name ?last-name]
         [?e :friend ?f]
         [?f :first-name "Robert"]
         [?f :last-name "Heinlein"]]

In this query, you are still looking for the first name and last name of an entity whose email is [email protected], but you have added some more clauses. You now have a join! The entity must also have a friend—another entity, ?f—whose first name is “Robert” and last name is “Heinlein.” As it happens, “Jane Doe” is friends with “Robert Heinlein,” so the result is the same as before: #{["Jane" "Doe"]}.

Let's walk through the execution of this query to help clarify how Datalog works with multiple variables and relationships.

As before, the first three patterns will match Datoms where there is an entity with the specified value for :email, which also has :first-name and :last-name, and as before you bind those to the variables ?first-name and ?last-name. The fourth pattern binds any value from a Datom with that same entity-id and the attribute :friend to the variable ?f. This means that ?f will contain two possible values: 102 and 103.

To satisfy the next pattern, the query will try to find matches where ?f is 102 or 103. Trying 102 will not return a solution, because there is no Datom where 102 is the entity, and the :first-name is “Robert.” But the second value, 103, will return a solution because it matches both of the last two patterns.

This query can still be written in SQL, although it begins to look more unwieldy.

"SELECT e.first_name, e.last_name FROM contacts AS e
 INNER JOIN contacts AS f
 ON e.friend_id = f.id
 AND e.email = '[email protected]'
 AND f.first_name = 'Robert'
 AND f.last_name = 'Heinlein'"

Going Deeper

To help you understand this more fully, let's look at an even more complex query that shows some more of the power of logic programming, and the use of variables across joins.

'[:find ?first-name ?last-name
  :where [?e :first-name ?first-name]  ;; 1
         [?e :last-name ?last-name]    ;; 1
         [?e :friend ?f]               ;; 2
         [?f :friend ?g]               ;; 3
         [?g :first-name ?first-name]] ;; 4

This query returns the same solution as the previous queries, #{["Jane" "Doe"]}, but why? In words, what this query is looking for is:

  1. The first name and last name of anyone
  2. Who is friends with someone
  3. Who is friends with another person
  4. Who has the same first name as the first person

In the data, “Jane Doe” is friends with “Ada Lovelace,” who is friends with “Jane Smith,” so “Jane Doe” is returned as the solution. Here you start to see the magic of logic programming. You did not have to delve into any of the mechanics of how to traverse the data, or the structure and storage of intermediate results, or explicitly call out any joins between different entities. In this query you declaratively describe the data pattern at a high level: the constraints and relationships you are interested in. The Datalog engine then finds all of the answers that match.

The SQL equivalent for this query is left as an exercise to the reader.

However, with great power comes great responsibility. This particular query does almost nothing to narrow down the possible set of matches, so as the number of contacts and the size of the friends graph grow, this query will have to churn through more and more possible solutions to find the one that matches. In production systems this can become prohibitively expensive, so exercise caution and try to write queries that are as selective as possible, as in the earlier queries that required a matching email.

Kaboom! A Combinatorial Explosion

Just as in SQL, it's possible to write queries that will result in combinatorial explosions. Here's one such query.

'[:find ?first-name ?last-name
  :where [?a :first-name ?first-name]
         [?b :last-name ?last-name]]

Notice that the two patterns in the where clause are completely disjoint: they share no variables, so nothing ties them to the same entity. This query returns every combination of first name and last name in the database.
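
Run against the earlier sample contact data, the result is the full cross product of first and last names. (Because query results are a set, the two entities named “Jane” collapse into a single binding.)

#{["Jane" "Doe"]   ["Jane" "Lovelace"]   ["Jane" "Heinlein"]   ["Jane" "Smith"]
  ["Ada" "Doe"]    ["Ada" "Lovelace"]    ["Ada" "Heinlein"]    ["Ada" "Smith"]
  ["Robert" "Doe"] ["Robert" "Lovelace"] ["Robert" "Heinlein"] ["Robert" "Smith"]}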

Transactions

Now that the basics of queries in Datomic have been covered, let's look at how you write data. Writes are sent to Datomic in the form of “transactions,” which are data structures following certain rules. All transactions are expressed as lists containing data to be added or retracted.

There are two basic syntax options for how to write transactions: a lower-level list-based form that hews closely to the format of Datoms themselves, and a higher-level map-based form. The map-based form is essentially syntactic sugar that translates into the lower-level list-based form, so let's first look at the list syntax.

Low-level List Syntax

The list form follows this format:

[command e a v]
  • command is either :db/add or :db/retract—the first means that this fact should be added to the database, the second means it should be retracted. This will map directly to the added? part of the Datom.
  • e is an entity reference, which is an extended version of the e part of the Datom. This is covered in greater detail in the section on using Datomic in Clojure. For now, let's simplify this and say that it is either the entity id of an existing entity, or it's a temporary id structure that Datomic translates into a permanent entity id as part of the transaction.
  • a and v are the same as in the previous examples.

Perhaps you're wondering, “I see e a v and added?, but where's t?” The answer is that it's generated for you. Datomic automatically creates a transaction entity and attaches it to all of the Datoms in the transaction as part of the transaction process.

For example, a transaction in list form to add the data about “Jane Doe” to the system for the first time might look like:

[[:db/add jane-temp-id :first-name "Jane"]
 [:db/add jane-temp-id :last-name "Doe"]
 [:db/add jane-temp-id :email "[email protected]"]]

This adds an entity for Jane with her first and last name, and email, assuming you have generated a temporary id for her with the Datomic API and bound it to jane-temp-id. Now suppose you want to add some friends for Jane. First, you need to find Jane's actual entity id with this query.

'[:find ?e .
  :where [?e :email "[email protected]"]]

Then you can add her friends with a transaction like this:

[[:db/add heinlein-temp-id :first-name "Robert"]
 [:db/add heinlein-temp-id :last-name "Heinlein"]
 [:db/add heinlein-temp-id :email "[email protected]"]
 [:db/add jane-id :friend heinlein-temp-id]
 [:db/add lovelace-temp-id :first-name "Ada"]
 [:db/add lovelace-temp-id :last-name "Lovelace"]
 [:db/add lovelace-temp-id :email "[email protected]"]
 [:db/add jane-id :friend lovelace-temp-id]]

Again, let's assume you have generated temporary ids for the two new entities being added, and you've bound Jane's permanent id from the previous query to jane-id.
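
A hedged sketch of that setup with the classic peer API: d/tempid mints the placeholder ids, and the scalar query shown above supplies Jane's permanent id.

(require '[datomic.api :as d])

(def heinlein-temp-id (d/tempid :db.part/user))
(def lovelace-temp-id (d/tempid :db.part/user))

(def jane-id
  (d/q '[:find ?e .
         :where [?e :email "[email protected]"]]
       (d/db conn))) ; assumes conn is an open connection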

All of these transactions are assertions, so let's take a quick look at what a retraction looks like. Let's assume that Jane Doe, having read some of Heinlein's later works, is shocked by his views on morality, and decides that she no longer wants to be his friend. First, you need to look up Heinlein's permanent entity id using a query—left as an exercise to the reader—quite similar to the above query to find Jane's permanent id. The transaction to retract the friendship looks like this.

[[:db/retract jane-id :friend heinlein-id]]

You can see that the list form is somewhat verbose and unwieldy. It does offer the advantage of having direct, low-level control over the transaction and, unlike the map form, allows you to retract individual Datoms. However, you will almost always end up using the higher-level map form for all your transactions, with higher-level APIs to handle things like retraction. So let's take a look at the map form.

Map Syntax

The map syntax is, as mentioned before, higher-level than the list syntax and it's also generally considered more idiomatic. It represents transaction data as maps, where attributes are the keys, and entity ids are represented by a special :db/id key, just like the format of entity maps. Let's take a look at the Jane transaction from before expressed in this syntax.

[{:db/id jane-temp-id
  :first-name "Jane"
  :last-name "Doe"
  :email "[email protected]"}]

In this form, as with entity maps, the entity id is represented by the :db/id key. Under the hood, Datomic translates this into the list form, exactly the way it was in the previous section, before transacting. But you can see that this is a much more concise, convenient, easy-to-read way of expressing that transaction data.

You can also do something really nifty that would take quite a bit more work to do using the other syntax: add Jane and her friends at the same time. Watch this:

[{:db/id jane-temp-id
  :first-name "Jane"
  :last-name "Doe"
  :email "[email protected]"
  :friend #{{:first-name "Robert"
             :last-name "Heinlein"
             :email "[email protected]"}
            {:first-name "Ada"
             :last-name "Lovelace"
             :email "[email protected]"}}}]

That's right, the map syntax supports nesting for related entities! One thing you'll notice is that Mr. Heinlein and Ms. Lovelace don't have temporary ids. That's because Datomic will generate them for you when building a transaction this way.

There are a few caveats to this. Don't worry, you won't understand them until you read about Datomic schema, but they're included here in case you're reviewing this section later and want this information all in one place. When you use the nested map syntax this way, if you don't provide temporary ids, then the nested entities must either be related to the containing entity via an isComponent attribute, or they must have an identity attribute. Also, their generated ids will be in the same partition as the containing entity's id.

Set Semantics

One thing to keep in mind: Datoms have set semantics. In other words, asserting the same [e a v] more than once is a no-op; the Datom will only exist in the database once. As a quick example, if you transact this data:

[{:db/id jane-id
  :first-name "Jane"
  :last-name "Doe"
  :email "[email protected]"}]

And then transact this data:

[{:db/id jane-id
  :first-name "Jane"}]

The second transaction will succeed and it will add a transaction entity, but it will not add another Datom about Jane. This is covered in greater detail in the section about the Clojure API.

Indexes Really Tie Your Data Together

So far the discussion has been about the data in Datomic in terms of collections of Datoms, which is true but incomplete. To understand the way Datomic really stores data, you must go deeper. You also may need to let go of some preconceptions that are often carried over from experiences with relational databases and their storage model.

Generally, you don't have to worry too much about Datomic's indexes or how it stores data. In most cases, straightforward use of Datomic's high-level API without any consideration of these lower-level details will have acceptable results. However, when doing detailed data modeling or thinking about performance, it's very important to be familiar enough with Datomic's indexes that you have a good general idea of how they will be used and how they impact what you're doing.

In a relational database, data is represented in tables and these tables generally map closely to the way data is stored on disk. The rows are generally stored in order, with a struct-like representation of their contents. Indexing is typically managed separately, where each index is created manually over one or more specific columns, with some sort of tree structure managing the indexed data with pointers to the corresponding rows. In other words, indexes are effectively separate from the record-level data.

In Datomic, the indexes are the only representation of the data in the system. The term for that is “covering index,” which means that the index contains all of the data, rather than a limited subset as in relational databases. These indexes allow Datomic to quickly access data via different lookup patterns, and enable the powerful and flexible query system. Datomic stores data in four indexes, plus the transaction log.

Each index is sorted and stored in a tree representation. The key distinction between the different indexes is their sort order.

eavt Index

The eavt index represents data in a way that is conceptually similar to the way relational databases store rows. Datoms in this index are sorted first by entity id, so Datoms about each entity are grouped together, then by attribute, then by value, then by transaction id. Datoms in this index would be stored quite similarly to the way we've been representing them so far, using a dummy transaction id:

[[101 :email "[email protected]" 'some-t]
 [101 :first-name "Jane" 'some-t]
 [101 :friend 102 'some-t]
 [101 :friend 103 'some-t]
 [101 :last-name "Doe" 'some-t]
 [102 :email "[email protected]" 'some-t]
 [102 :first-name "Ada" 'some-t]
 [102 :last-name "Lovelace" 'some-t]
 [103 :email "[email protected]" 'some-t]
 [103 :first-name "Robert" 'some-t]
 [103 :last-name "Heinlein" 'some-t]]

Since facts about each entity are grouped together, it's easy to retrieve all of the data about a given entity quickly. For obvious reasons, the entity API that's used to retrieve entity maps uses this index. This API is covered in more detail in a later section.

All Datoms are stored in this index.
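
You can also walk any of the indexes directly with the peer API's datoms function. A minimal sketch against the eavt index, assuming db is a database value:

(require '[datomic.api :as d])

;; every Datom about entity 101, sorted by attribute, value, transaction
(seq (d/datoms db :eavt 101))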

aevt Index

The aevt index groups all Datoms about an attribute together, allowing for the quick determination of which entities have values for that attribute and, secondarily, what those values are. As the Datomic docs point out, this index is somewhat similar to a columnar store, and is one of the ways that Datomic enables flexible querying that supports mixed workloads. For this example data, the index looks something like this:

[[:email 101 "[email protected]" 'some-t]
 [:email 102 "[email protected]" 'some-t]
 [:email 103 "[email protected]" 'some-t]
 [:first-name 101 "Jane" 'some-t]
 [:first-name 102 "Ada" 'some-t]
 [:first-name 103 "Robert" 'some-t]
 [:friend 101 102 'some-t]
 [:friend 101 103 'some-t]
 [:last-name 101 "Doe" 'some-t]
 [:last-name 102 "Lovelace" 'some-t]
 [:last-name 103 "Heinlein" 'some-t]]

One thing to note here is that we are making a slight oversimplification. Attributes are not stored and sorted based on their keyword name, but rather on their underlying attribute entity id. This doesn't make a huge amount of difference in practice, but it's something to keep in mind.

All Datoms are stored in this index.

avet Index

The avet index is subtly different from the aevt index, in that it's designed to very quickly retrieve attributes with specific values. It's also, as the Datomic docs point out, the most expensive index to build and maintain. For this reason it's optional: whether Datoms are stored in this index is controlled on a per-attribute basis. Assuming indexing is enabled on all of the example attributes, here is what the data looks like in this index:

[[:email "[email protected]" 103 'some-t]
 [:email "[email protected]" 101 'some-t]
 [:email "[email protected]" 102 'some-t]
 [:first-name "Ada" 102 'some-t]
 [:first-name "Jane" 101 'some-t]
 [:first-name "Robert" 103 'some-t]
 [:friend 102 101 'some-t]
 [:friend 103 101 'some-t]
 [:last-name "Doe" 101 'some-t]
 [:last-name "Heinlein" 103 'some-t]
 [:last-name "Lovelace" 102 'some-t]]

The decision about whether to enable the avet index for a given attribute involves a similar tradeoff to the decision about what columns to index in a relational database. It adds cost to writes, but depending on the kinds of queries you're doing, it may be essential. Any query that relies on the specific value of an attribute, without knowing anything about which entity it belongs to, will be enormously sped up by this index. Best practices here are similar to those in the relational world: if you think you'll query against the attribute's value, then enable indexing.

Fortunately, if you miss out on adding an attribute to this index that you end up needing later, you can always enable indexing on that attribute with a run-time schema update.
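
With the classic peer API, that schema change is itself just a transaction. A hedged sketch, assuming :email was originally installed without avet indexing:

[{:db/id :email                      ; the ident identifies the attribute
  :db/index true                     ; enable the avet index
  :db.alter/_attribute :db.part/db}] ; alteration, analogous to installation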

vaet Index

The vaet index is for representing reverse references. That is, when you have a reference-type attribute, which means it represents a relationship between two entities, this index stores that relationship in the opposite direction. Naturally, this index is only enabled for reference type attributes. For this example data, the vaet index might look like the following:

[[102 :friend 101 'some-t]
 [103 :friend 101 'some-t]]

You can grasp the general idea. The other indexes can be used in various ways to easily find outgoing entity references, but the vaet index is required to efficiently determine incoming references.

The log

The log is structured somewhat differently from the indexes. The Datomic docs put it quite simply: the log is an ordered collection of transactions, each of which contains an unordered set of Datoms. Queries do not directly use the log; rather it's accessed via several specialized APIs that can be used as part of queries or to examine the history directly in your application.

Index structure

The above means that every Datom is stored in at least 3 and as many as 5 copies. In the past, this kind of data replication may have been cause for concern or perhaps it was even infeasible. Indeed, the idea that storage is scarce motivated many of the original design decisions that are now deeply embedded in relational database designs. Today this is a rather antiquated assumption: storage is cheap and fast. Datomic's design makes time vs. space trade-offs in the direction of optimizing for speed, because modern hardware and system architectures make data replication and storage size a much lesser concern.

The indexes themselves are stored as shallow (3-level) trees, with an extremely high branching factor. The structure looks like what is shown in Figure 6.1.

Figure 6.1 Schematic for indexes stored as shallow (3-level) trees, with an extremely high branching factor.

It's structured as a root node, directory nodes, and segments. Datomic optimizes read performance by keeping the root node and as many of the directory nodes as possible in the memory of the peers and the transactor, to minimize the number of trips to storage.

Each of the segments contains up to approximately 50Kb of Datoms, which have been serialized to binary using the Fressian format and then gzipped. Depending on the size of the values, each segment can contain many thousands of Datoms. This is the most important thing to grasp about how Datomic stores data: to get at a single Datom in a given segment, the entire segment must be transferred, unzipped, and deserialized into memory. Thus, when thinking about performance, the biggest factor is going to be the number of segment accesses required to complete any operation.

Datomic's Unique Architecture

One of Datomic's main advantages is its de-constructed architecture. Traditional databases are monolithic, performing all of their functions as part of one system running on one computer. Datomic splits these functions up into their own components, which can be run on different systems and even swapped out with different options.

  1. Reads. Your application's data reads, including query logic, are performed by a Datomic “peer,” which runs inside of your application's process. There can be multiple peers per Datomic database to support different components of your application.
  2. Writes. Datomic has a single writer, called the transactor. It serializes writes, providing ACID transaction guarantees, and also supports arbitrary transaction logic to enforce data integrity, as you'll see in a later section. The transactor runs as its own process, often on a dedicated system.
  3. Storage. The data and indexes that comprise a Datomic database are actually stored in another database. This part of the system is pluggable: supported storage systems include Amazon's DynamoDB, Cassandra, Riak, a SQL database, and others.
  4. Optional cache. With the Datomic Pro edition, you can incorporate memcached as a cache layer. Datomic's immutable data structures are a natural fit for caching since they never have to be invalidated.

After starting up the transactor, peers connect to it over a message queue and also establish a connection to the storage system and memcached if present. To do a read, the peers access storage directly—no coordination with the transactor is required. Peers send data to be written to the transactor, and as writes are completed the transactor pushes a stream of the newly written data to all peers.

The peers and the transactor have two pools of memory devoted to Datomic data. The first is called the “Object Cache,” which is a hot cache of recently and frequently accessed database data. You can independently set the size of the Object Cache on each peer and the transactor. This setting is important when configuring Datomic for production and involves some tradeoffs, so it's a good idea to take a look at the Datomic documentation section on capacity planning for more information.

The second is called the “memory index.” This is where newly transacted data is stored after it has been written, but not yet added to the indexes. The transactor is responsible for rebuilding the indexes, and it only does so after a configurable threshold of new data has been added. Newly added data is always immediately persisted to the transaction log, but the indexes are only updated periodically, since this is an expensive operation. This data exists in the memory index of the peers and the transactor until it's been added to the indexes. Reads therefore go against a merger of what is in the indexes and what is in the memory index.

The Life and Times of a Transaction

To get a clearer understanding of the process of how data gets into Datomic and how the different parts of the system are involved, let's walk through the life and times of an example transaction.

  1. A peer submits a transaction to the transactor.
  2. The transactor processes transactions one at a time, in order of receipt, and the transaction waits in line until its turn.
  3. The transaction is processed by the transactor, transaction functions run, temporary ids resolved, etc.
  4. The transactor persists the transaction to the transaction log in the storage layer.
  5. The transactor adds the new Datoms to its memory index, and sends the new Datoms to every connected peer.
  6. Some time later, the transactor begins a re-indexing job, incorporating all new Datoms into the indexes.

MODELING APPLICATION DATA

Datomic's schema provides a flexible, descriptive, and easy-to-use system for describing your application's data. In contrast to the schema of relational databases, in which you build fixed tables containing column definitions that can be used only in that table, Datomic schema is primarily concerned with defining attributes that can be used with any entity, although best practices indicate that you should group your attributes based on the kind of entity they are used with in the application.

Like in many other databases, schema elements in Datomic need to be loaded before they are used by the application. In order to do this, you need to build transactions containing the required parts of each attribute entity, and then write those transactions.
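
With the peer API, this is an ordinary call to transact. A minimal sketch, assuming conn is an open connection and schema-tx stands for a vector of schema maps like the ones shown throughout the rest of this section:

(require '[datomic.api :as d])

@(d/transact conn schema-tx) ; deref the returned future to await completion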

Example Schema for Task Tracker App

In previous chapters, you've looked at various aspects of building a task tracker system. Let's look at how to build a schema for that in Datomic. The system will have several different types of data. It will have the tasks themselves and users, and, as in every other task tracker system, you'll want to be able to charge money for it, so it will have accounts with charges and payments.

There are some details that aren't covered here in the same depth as they are in Datomic's documentation, so you should review the Schema section of that documentation before moving past this section. There is a wealth of information there.

Let's look at the schema section by section.

Schema Basics

Let's take a look at what a basic schema transaction contains. First, we'll create the schema attributes for the tasks. The first thing a task needs is a description.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/description
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "description of the task"
  :db.install/_attribute :db.part/db}]

The first key assigns the attribute a :db/id, which is the entity id that's required for every entity in the system, including attributes. The value is produced by a reader macro implemented as part of the Datomic API, which creates a temporary entity id in the :db.part/db partition. All attribute entities are required to be in the :db.part/db partition.

A :db/ident keyword is required for all attribute entities. It provides a universal identifier that can be used to refer to the attribute in queries and in transactions. In general, an ident can be used anywhere an entity id is accepted. You can also give :db/ident values to data entities, in addition to attribute entities. Be careful not to abuse this feature, though, because all the :db/ident values in the database are loaded on startup into every Datomic peer.

This attribute has the ident of :task/description. The best practices laid out in the Datomic documentation recommend that you use namespaced keywords for all your attribute idents, to organize them into categories, and to prevent name collisions. If you haven't read through the best practices section of the Datomic documentation, you should probably take a few minutes and do so now. It's an immensely useful document that's been distilled from the experience of nearly four years that Datomic has been in production.

Attribute cardinality is specified using the required :db/cardinality attribute. There are two possible values for this attribute: :db.cardinality/one or :db.cardinality/many. With :db.cardinality/one, an entity can have only one value for this attribute at any given time: if you try to assert a different value, the previous value will be retracted and replaced with the new one. With :db.cardinality/many there is no such restriction. Since tasks should only have one description, :db.cardinality/one is the right choice.

Every attribute needs to have a :db/valueType, which specifies the type that its v value must have. There are two broad categories of valueTypes: reference, which means that the value must be an entity id, or some data type, such as string, float, integer, URI, and so forth. You want descriptions to be strings, and :db.type/string is Datomic's string type.

One optional attribute that you can use with both attribute and data entities is :db/doc. This is broadly equivalent to a Clojure docstring or simple Javadoc, but there are no hard-coded ways in which it is used, so you can adapt it to whatever purposes suit the needs of your application. Here, we'll treat this attribute as a docstring-like annotation, which is only meant to be used by developers and other internal users.

Finally, there's the somewhat confusing :db.install/_attribute key. Whenever you see a Datomic attribute where the name part of the keyword begins with an underscore, you know you're seeing a reverse reference. Reverse references are where the entity and value parts have been swapped.

[attribute-temp-id :db.install/_attribute :db.part/db]

What this really means is this:

[:db.part/db :db.install/attribute attribute-temp-id]

The :db.install/attribute attribute is a special Datomic attribute used to add attributes to Datomic's schema, and attributes always need to be installed in :db.part/db.

Partitions

Datomic best practices call for creating some partitions specific to your app, rather than using the built-in :db.part/user that's intended more for development and experimentation. Creating partitions is simple: just create an entity with a :db/ident and the partition install command. Let's create two partitions: one for the tasks, and one for accounts that will contain data about users, their accounts, and their payments and charges.

[{:db/id #db/id[:db.part/db]
  :db/ident :db.part/task
  :db.install/_partition :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :db.part/account
  :db.install/_partition :db.part/db}]

Fulltext Indexing

Datomic supports fulltext indexing on specific attributes using a Lucene index, and supports fulltext search on those attributes as part of queries. This can be very useful if you have one-off needs for fulltext searching that are limited enough to not call for a dedicated search database. Let's add a task title attribute that is fulltext-indexed.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/title
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "title of the task"
  :db/fulltext true
  :db.install/_attribute :db.part/db}]

Use the :db/fulltext attribute to enable fulltext indexing for the title attribute. By default, it's disabled, but there are a few things to bear in mind about fulltext indexing.

  • The underlying Lucene indexes are not ACID, but they are eventually consistent and will be updated during transactor re-indexing. If your app requirements include fulltext search that's immediately available after data is added, then you need to use an external system for search and your app needs to pipe searchable data into it, perhaps using Datomic's transaction report queue.
  • The decision about whether an attribute is fulltext indexed is final. It is not currently possible to enable or disable fulltext indexing on an attribute after it's been installed.
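
As a quick taste of what fulltext search looks like in a query (queries and the built-in fulltext function are covered properly later in this chapter), a search against :task/title looks something like this, where db stands for a database value:

;; find tasks whose title matches the search string "hello"
(d/q '[:find ?task ?title
       :in $ ?search
       :where [(fulltext $ :task/title ?search) [[?task ?title]]]]
     db "hello")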

Enums in Datomic

Many relational databases support an enum datatype for columns that must take one of a specific set of enumerated values. Datomic provides the same functionality using reference-type attributes and enumeration entities. Let's take a look at an example schema, and add a task status attribute that takes an enum.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/status
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/ref
  :db/doc "task status - an enum"
  :db.install/_attribute :db.part/db}]

There is nothing particularly special required to create an enum attribute. Here the valueType is set to :db.type/ref, because the enumerated values are entities, and a note was added in the attribute's doc to make it clear that enum entities are specifically what you want as the refs for this attribute.

For task status values, it would be nice if they had an attribute that provided some kind of display label that could be used when showing the status. Let's add an attribute that can be used to add a label to any entity.

[{:db/id #db/id[:db.part/db]
  :db/ident :label
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "display label of an entity"
  :db.install/_attribute :db.part/db}]

Now, let's create the enumeration values for the task status. It's quite simple to do: just create an entity with a :db/ident. You need to add task statuses for “To Do,” “In Progress,” and “Done.”

[{:db/id #db/id[:db.part/task]
  :db/ident :task.status/todo
  :label "To Do"}
 {:db/id #db/id[:db.part/task]
  :db/ident :task.status/in-progress
  :label "In Progress"}
 {:db/id #db/id[:db.part/task]
  :db/ident :task.status/done
  :label "Done"}]

You'll notice a few things here. First, since these are not schema attributes, you can't put them in the :db.part/db partition; put them instead in the :db.part/task partition that was created for tasks earlier. Next, the namespace of their ident keywords is "task.status," matching the name of the attribute they're used with, which is the best practice recommended for naming enumerated values in Datomic. Finally, you can give them display-friendly labels for the app to use, avoiding any hard-coding in the app about what specific status values should be called, or about how to translate the ident keywords into something for display.
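
With the statuses in place, setting a task's status is just a matter of asserting the ident as the ref value. For example (task-id here stands in for a real entity id or lookup ref):

;; move a task to Done; because :task/status is cardinality-one, any
;; previously asserted status is retracted automatically
[[:db/add task-id :task/status :task.status/done]]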

Identity Attributes

Next, let's look at identity attributes. These are attributes whose values are meant to be external keys, and can be used to refer to the entity they belong to. In this way, they're somewhat analogous to primary keys in the relational world, although an entity can have more than one of them. It's very important to include identity attributes in the data model for your key domain entities.

It's considered an anti-pattern to use Datomic's entity id to refer to an entity, except in cases where you have just queried for it, and it's a particularly bad practice to store entity ids outside of Datomic. Entity ids are meant to be internal references inside the database, and they are not guaranteed to remain the same across a backup and restore.

Let's give the tasks in the app an identity attribute that represents their “issue id,” which is sort of like the ticket numbers you often see in other task and bug tracking systems.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/issue-id
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "task's issue ID, for external reference"
  :db/unique :db.unique/identity
  :db.install/_attribute :db.part/db}]

This is marked as an identity attribute by setting the :db/unique attribute to :db.unique/identity. Setting something as an identity attribute has several effects.

  • The attribute must be cardinality-one.
  • Values of an identity attribute have enforced uniqueness—two entities can't have the same value for an identity attribute.
  • The value of this attribute can be used to transparently look up the entity to which it belongs, and using an identity attribute as part of a transaction makes that an upsert transaction.
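
For example, once a task with the issue id "HOME-11" exists, a transaction like the following is an upsert. Rather than creating a new entity, it updates the status of that existing task:

;; upserts on :task/issue-id: resolves to the existing "HOME-11" task
;; and asserts the new status on it
[{:db/id #db/id[:db.part/task]
  :task/issue-id "HOME-11"
  :task/status :task.status/done}]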

Identity attributes are covered in more detail later in the Datomic's Clojure API section.

Many-to-Many and Hierarchical Relationships

One of the biggest benefits of Datomic's data model is the flexibility and simplicity of expressing relationships between entities. Datomic's model of many-to-many relationships, in particular, is a welcome relief to those used to the relational way of modeling them. Here are four little words that are music to the ears of every developer who has struggled with many-to-many in SQL: “no more join tables.”

To show how many-to-many is done in Datomic, let's create some attributes for adding tags to tasks. This will allow users to tag tasks with things like "Home," "Work," "Shopping," "Family Vacation," and so forth. Let's add the task-to-tag relationship, and then add a tag attribute that allows us to give tags names.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/tag
  :db/cardinality :db.cardinality/many
  :db/valueType :db.type/ref
  :db/doc "task tags"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :tag/name
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "tag's name, used as identity"
  :db/unique :db.unique/identity
  :db.install/_attribute :db.part/db}]

You see how doing many-to-many in Datomic is as simple as adding a reference-type attribute with cardinality-many? Since you're not constrained by the fixed number of columns in relational tables, it's simple to model the kind of data that typically requires specialized extra tables in SQL databases.

Also, notice that you've made the name of tags an identity attribute. This makes them upsertable, and will make it easier to reuse tags in multiple tasks.

Let's next look at how to create an entity relationship that represents a hierarchy. In the task app you want to create subtasks, which are children of the parent tasks. And perhaps those subtasks can in turn have subtasks of their own, and so on… Some people have very complicated lives, after all, with activities that have deep dependencies on other things being completed. Since subtasks are often attached to parent tasks, moved around, and reorganized, let's model the relationship on the child side.

[{:db/id #db/id[:db.part/db]
  :db/ident :task/parent
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/ref
  :db/doc "parent of the task, establishing arbitrary hierarchy"
  :db.install/_attribute :db.part/db}]

This is a fairly straightforward attribute. It says that a task can have at most one parent, and the parent is a reference. Thus, tasks can have parent tasks, which can in turn have parents of their own, allowing for hierarchies of arbitrary depth.

Datomic allows you to model relationships in either direction, so this could instead have been modeled as a :task/subtask attribute on the parent, rather than on the subtask. The decision largely depends on how you prefer to think about the relationship, and how your application will use it. In this case, the deciding factor was that one of the key operations to support is assigning an existing task to be a subtask of some other task, possibly removing it as a subtask of another. Modeling it this way allows that operation to be done very easily: asserting a new parent for the subtask will automatically remove any existing parent relationship, since this is a cardinality-one attribute.

Let's take a moment to consider something interesting about Datomic that particularly distinguishes it from SQL. The parent relationship attribute has been named here so that it appears to be task-specific, but there's no reason that it has to be—Datomic won't enforce that constraint. With a single attribute, you could say that tasks can have "idea" entities as parents, and subtasks or comments or any number of differently structured entities as children. The flexibility of Datomic makes these kinds of heterogeneous relationships much simpler to model.

Identity versus Unique

Let's create a simple model for users. It's quite similar to what you might see in a relational schema, but it has a few important differences.

[{:db/id #db/id[:db.part/db]
  :db/ident :user/login
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "user's login name and display name"
  :db/unique :db.unique/identity
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :user/password
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "crypted pasword"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :user/email
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/string
  :db/doc "User's email address"
  :db/unique :db.unique/value
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :user/account
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/ref
  :db/isComponent true
  :db/doc "The account linked to the user"
  :db.install/_attribute :db.part/db}]

This model adds four attributes: the user's login, an encrypted password, an email address, and a link to an account that will be modeled in a moment. For this system, separate logins are used instead of email addresses to allow people to choose their own clever nicknames like “weavejester” and “hyPiRion,” or more pedestrian ones like “rhickey.” We do, however, want to store people's email addresses and only allow one user per email address.

Notice that login is an identity attribute, and email address is modeled as a :db.unique/value attribute. They have slightly different semantics. Identity attributes, as discussed before, enforce uniqueness, but also allow for upserts: if you assert new data that contains an identity value that already exists in the database, then that data updates the entity to which that identity belongs. That means you can refer to your users by their login everywhere, and Datomic will resolve that to the correct entity.

The other type of uniqueness is set with :db.unique/value: if you assert new data that duplicates an existing :db.unique/value value, Datomic will throw an exception and the entire transaction will fail.

It's generally bad practice to have more than one identity attribute for a single type of entity, unless they correspond to external keys for different systems. If you need additional unique attributes for a type of entity, you should usually use :db.unique/value attributes.

Component Relationships

An attribute can also define a “component” relationship. This describes a relationship where the component entity is owned by the parent entity, and it isn't meant to exist independently. Datomic retracts component entities when you retract the parent entity, and will also return data about component entities when you fetch data about the parent in several key parts of the API.

Let's set up an account system for users that will handle paid accounts and keep track of charges and payments. First let's add an account type attribute, modeled as an enum.

[{:db/id #db/id[:db.part/db]
  :db/ident :account/type
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/ref
  :db/doc "account type (eg paid, free), an enum"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/account]
  :db/ident :account.type/free
  :label "Free account"}
 {:db/id #db/id[:db.part/account]
  :db/ident :account.type/paid
  :label "Paid account"}]

This is just like the enum covered earlier in this chapter: a cardinality-one reference-type attribute, with appropriately named ident entities as the enumeration values. Now let's add in the charges and payments, called "transactions" here, against the account.

[{:db/id #db/id[:db.part/db]
  :db/ident :account/transaction
  :db/cardinality :db.cardinality/many
  :db/valueType :db.type/ref
  :db/isComponent true
  :db/doc "transactions (payments and charges) against the account"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :transaction/type
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/ref
  :db/doc "transaction type (eg charge, payment, adjustment) - an enum"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :transaction/amount
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/bigdec
  :db/doc "amount of the transaction"
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/account]
  :db/ident :transaction.type/charge
  :label "Charge"}
 {:db/id #db/id[:db.part/account]
  :db/ident :transaction.type/payment
  :label "Payment"}
 {:db/id #db/id[:db.part/account]
  :db/ident :transaction.type/adjustment
  :label "Adjustment"}]

You can see the addition of three more attributes; the last two should be familiar. :transaction/type is an enum for the transaction type—either charge, payment, or adjustment for those times when you need to correct mistakes. :transaction/amount holds the transaction amount. Datomic includes support for a decimal type corresponding to java.math.BigDecimal objects, used here to model the transaction amounts so there's no need to worry about stray precision errors from floats.

The first attribute, :account/transaction, models transactions as components of accounts. You do this by setting :db/isComponent to true. It's used here because these transactions depend entirely on the account for their meaning, and if the account entity is retracted it makes no sense for them to remain.
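
One practical consequence: retracting an account with the built-in :db.fn/retractEntity transaction function also retracts all of its component transactions (account-id here stands in for a real entity id):

;; retracts the account entity and, because :account/transaction is a
;; component attribute, every transaction entity belonging to it
[[:db.fn/retractEntity account-id]]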

Disabling History

While Datomic is designed around the idea that data is immutable and the complete history of your data should be available, there are some cases where you don't want that. For example, for attributes that have a high churn rate, or that represent calculations materialized for the sake of convenience, it's sometimes just not worth keeping a history of every single value the attribute has ever had.

Fortunately, Datomic allows you to disable history on a per-attribute basis. Let's add an attribute that will represent the current balance of the account. This is a value that will change every time there's any transaction, and furthermore it's something that you can calculate at any point by summing up all of the transactions. It can be convenient to have this value materialized, because then it's easier to do queries about, for example, accounts that have more than a certain balance. Here's how that is modeled in a Datomic schema:

[{:db/id #db/id[:db.part/db]
  :db/ident :account/current-balance
  :db/cardinality :db.cardinality/one
  :db/valueType :db.type/bigdec
  :db/doc "current balance of the account"
  :db/index true
  :db/noHistory true
  :db.install/_attribute :db.part/db}]

By setting :db/noHistory to true, you have disabled history. What this means practically is that during every indexing job, the transactor will drop both retraction Datoms and assertions that have been retracted for this attribute, keeping only the most recently asserted value. You can see then that it's possible for some history data about this attribute to be present between indexing jobs, so you shouldn't write your application logic to depend on there never being any history present for the attribute.

Entity ids and Partitions

The creation of Datomic partitions for your application data has already been discussed. Those are in addition to Datomic's built-in partitions.

  • :db.part/db is the partition for attributes and other schema-related data.
  • :db.part/tx is the partition for transaction entities.
  • :db.part/user is a partition for your use in development.

In addition to providing logical groupings, partitions also have a useful purpose in providing physical groupings. Remember that Datoms are ordered in the indexes, and reducing the number of index segments that are retrieved as part of an operation is key to improving performance. Therefore it's beneficial to have entities that are related and often retrieved together stored in the same segment if possible, which means their entity ids need to be close together. Partitions give you a way to help make that happen.

As discussed earlier, entity ids are Longs. The higher-order bits of the entity id come from the partition, which means that entities inside a partition are grouped together in the indexes as well. If your application is going to create a large number of entities, particularly in one-to-many relationships, it can be beneficial to create multiple partitions and split these entities up logically.

To take an example from the task tracker schema, let's say that later on you extend the system to include enterprise support with various types of metered charges for accounts. This could result in a large number of charge and payment transactions for the account entities. If all of the account and transaction entities remain in a single partition, as they are in the current schema, then the performance might suffer. Since charges and payments are added across time, you could have a situation where it requires many segment fetches to get all of the transactions for an account.

You can solve this problem by creating multiple partitions for your accounts and distributing account entities across these partitions. When adding charge or payment transactions, make sure to put them in the same partition as their account. This will greatly improve the data locality when accessing accounts and their related transactions, because it will both reduce the maximum possible number of required segment fetches, and increase the chance of related transactions being collocated in a single segment.
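
Here's a minimal sketch of what that might look like, assuming you've installed partitions named :db.part/account-0 through :db.part/account-7; the helper and the naming scheme are hypothetical:

(defn account-partition
  "Deterministically picks one of n account partitions from a stable
   external key, so an account and its transactions land in the same
   partition."
  [account-key n]
  (keyword "db.part" (str "account-" (mod (hash account-key) n))))

;; creating a tempid for a charge in the same partition as its account:
;; (d/tempid (account-partition "acct-42" 8))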

DATOMIC'S CLOJURE API

Datomic includes a first-class Clojure API that makes it natural and easy to work with when building applications. In this section, you'll learn about how to set up a project and a Datomic database, how to use the API, and how you can write the data access layer for the example task system.

Basic Setup

You can bring in the free-as-in-beer version of the Datomic peer library, which includes an in-memory database you can use for development, as a normal dependency in your project.clj file. Here's the basic project.clj file, including a dependency on the current version of datomic-free.

(defproject chapter-6 "0.1.0-SNAPSHOT"
  :description "Code for Professional Clojure, Chapter 6: Datomic"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [com.datomic/datomic-free "0.9.5344"]
                 [crypto-password "0.1.3"]]
  :profiles {:dev {:source-paths ["dev"]}})

You'll notice a few extra things here, in addition to the basic project.clj file from lein, that come with the Datomic dependency. The crypto-password library is included so user passwords can be encrypted. There is also an extra property defined for the dev profile: the dev folder has been added to the source paths, so you can add some code to support your development process that won't conflict with the production code.

Schema and Example Data

First, let's put the schema that was created in the previous sections in a file in the resources folder of the project, called schema.edn. This way you can easily work with it in the project.

Next, add some example data in resources/example-data.edn. You can use this data to help you while exploring the Datomic API, developing the data access code, and later it can even form the basis of test cases for testing.

;; A test user, with account and charges
[{:db/id #db/id[:db.part/account]
  :user/login "janed"
  ;; bcrypted "totalanon"
  :user/password "$2a$11$W2juQqRpaxVqXt4u..4qz.asyhbfR53K1a3stjQ3wpUYOCcagH8VK"
  :user/email "[email protected]"
  :user/account
  {:account/type :account.type/paid
   :account/transaction
   #{{:transaction/type :transaction.type/charge
      :transaction/amount 7.99M}
     {:transaction/type :transaction.type/payment
      :transaction/amount -7.99M}
     {:transaction/type :transaction.type/charge
      :transaction/amount 2.55M}}
   :account/current-balance 2.55M}}]
;; A couple of tasks
[{:db/id #db/id[:db.part/task]
  :task/title "Write to Robert about Number of the Beast"
  :task/description "He should know the first part was meandering and slow, and the
                     ending was just self indulgent. \"Meet Lazarus, live happily
                     ever after.\" Please!"
  :task/status :task.status/todo
  :task/issue-id "HOME-11"
  :task/tag #{{:tag/name "Home"}
              {:tag/name "Writing"}}
  :task/user {:user/login "janed"}}
 {:db/id #db/id[:db.part/task]
  :task/title "Disappear into anonymity"
  :task/description "Jane Doe can't be a public figure."
  :task/status :task.status/in-progress
  :task/issue-id "WORK-1"
  :task/tag #{{:tag/name "Work"}
              {:tag/name "Important"}}
  :task/user {:user/login "janed"}}
 ;; Add a subtask
 {:db/id #db/id[:db.part/task]
  :task/title "Pack clothes"
  :task/description "Focus on neutral colors, hoodies."
  :task/status :task.status/todo
  :task/issue-id "WORK-2"
  :task/user {:user/login "janed"}
  :task/parent {:task/issue-id "WORK-1"}}]

A test user, “janed,” has been created, so you can continue the saga of Jane Doe. Her password is “totalanon,” but it's been encrypted with bcrypt. She has an account with some transactions, and a few tasks including one that's a subtask.

Setting Up for Development

While developing you often want to quickly iterate and experiment with an in-memory database. Let's create a dev namespace that lets you do just that. Create a dev folder in the project root (remember dev was added to the source-paths) and add a file called dev.clj to that folder, which looks like this:

(ns dev
  (:require [clojure.java.io :as io]
            [clojure.pprint :refer (pprint)]
            [datomic.api :as d])
  (:import datomic.Util))
(def dev-db-uri
  "datomic:mem://dev-db")
(def schema
  (io/resource "schema.edn"))
(def example-data
  (io/resource "example-data.edn"))
(defn read-txs
  [tx-resource]
  (with-open [tf (io/reader tx-resource)]
    (Util/readAll tf)))
(defn transact-all
  ([conn txs]
   (transact-all conn txs nil))
  ([conn txs res]
   (if (seq txs)
     (transact-all conn (rest txs) @(d/transact conn (first txs)))
     res)))
(defn initialize-db
  "Creates db, connects, transacts schema and example data, returns conn."
  []
  (d/create-database dev-db-uri)
  (let [conn (d/connect dev-db-uri)]
    (transact-all conn (read-txs schema))
    (transact-all conn (read-txs example-data))
    conn))
(defonce conn nil)
(defn go
  []
  (alter-var-root #'conn (constantly (initialize-db))))
(defn stop
  []
  (alter-var-root #'conn
                  (fn [c] (when c (d/release c)))))

You can see the usage of the Datomic API, required here from the datomic.api namespace under the alias d. The entire Clojure API for Datomic lives in this namespace, so you will be seeing this require quite a bit. The other import is datomic.Util, which includes a handy static method for reading schema and other Datomic data from files.

Next, configure the URI for the in-memory Datomic db, datomic:mem://dev-db. The format of URIs for a Datomic DB varies based on the type of storage medium in use, but the general format is datomic:<type>://<connection params specific to the storage>/<db name>.

We have two utility functions, read-txs and transact-all, that handle reading transaction data from a file and transacting a collection of transactions. We also have a function, initialize-db, that sets up the database for development, adding both the schema and the example data.

The two main functions used in our example are go and stop. The go function initializes the in-memory database and connects to it, binding the Datomic connection object to the conn var. The stop function releases the connection.

This provides a nice development workflow: set up a database, work with it in the REPL, and reset it to the starting state.
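
For a full reset back to the starting state, you might also add a small convenience function that deletes the in-memory database before rebuilding it. A sketch:

(defn reset
  "Stops, deletes the dev database, and rebuilds it from scratch."
  []
  (stop)
  (d/delete-database dev-db-uri)
  (go))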

Experimenting in the REPL

Start up a REPL in the project, switch to the dev namespace, and initialize the development db.

user> (require 'dev)
nil
user> (in-ns 'dev)
#namespace[dev]
dev> (go)
#object[datomic.peer.LocalConnection 0x2117de53
        "datomic.peer.LocalConnection@2117de53"]

Connections and dbs

The Datomic connection object represents Datomic's connection to both the transactor and the storage layer. Since you're using an in-memory database, both of these are in-process. You can get a db value from a connection with the d/db function of the Datomic API.

dev> (def db (d/db conn))

So, what's the difference between a connection and a db? The connection is a handle, and it's used when you are sending writes to the transactor. A db, however, has value semantics. It represents the state of the database at a particular time. When you call d/db you get the latest state of the database that the Datomic peer in your application knows about. You can use the db to do all kinds of different reads.

Highlight Tour of the Read API

You can examine attributes with attribute:

dev> (d/attribute db :task/title)
#AttrInfo{:id 67 :ident :task/title :value-type :db.type/string
          :cardinality :db.cardinality/one :indexed false
          :has-avet false :unique nil :is-component false
          :no-history false :fulltext true}

You can see the transaction number, or t value as the Datomic docs describe it, of the most recent transaction in a db with basis-t:

dev> (d/basis-t db)
1021

You can retrieve the transaction entity id for a given t value with t->tx.

dev> (d/t->tx 1021)
13194139534333

You can get low-level access to the Datoms in a specific index with datoms.

dev> (first (d/datoms db :aevt))
#datom[0 10 :db.part/db 13194139533312 true]
dev> ;; or seek to specific parts of the index
dev> (first (d/datoms db :aevt :db/doc :task/title))
#datom[67 62 "title of the task" 13194139534313 true]

You can then get the entity id for a given ident with entid.

dev> (d/entid db :task/title)
67

And then back again with ident:

dev> (d/ident db 67)
:task/title

The Entity API

One of the most commonly used features in Datomic is the entity API. It gives you access to entity data in a convenient, idiomatic way for Clojure. Let's take a look at Jane Doe. But how do you access her record? She wasn't given an ident, so you can't just use that keyword. You have to access her entity id somehow.

Here you can take advantage of a great feature of identity attributes: their values can be used in place of entity ids in most places that accept them. When used this way, they're called "lookup refs" and the format is:

[attribute-name value]

We know Jane's login from the example data is “janed,” and since login is an identity attribute, you can use that as a lookup ref for her entity. The entity API is accessed with the entity function.

dev> (d/entity db [:user/login "janed"])
{:db/id 285873023222776}

You might be thinking that this looks strange—there's definitely supposed to be more data about Jane than just an id. Now is a good time to step back and learn a few things about the entity maps returned by Datomic's entity API.

  • They are lazily loaded; only the :db/id is present at first.
  • They work with most of Clojure's map functions.
  • You navigate to related entities through normal map access.

Here are some examples of how you can treat entities like maps:

dev> (def jane (d/entity db [:user/login "janed"]))
#'dev/jane
dev> (keys jane)
(:user/login :user/password :user/email :user/account)
dev> (:user/login jane)
"janed"
dev> (vals jane)
("janed" "$2a$11$W2juQqRpaxVqXt4u..4qz.asyhbfR53K1a3stjQ3wpUYOCcagH8VK"
 "[email protected]" {:db/id 285873023222777})
dev> (into {} jane)
{:user/login "janed", :user/password "$2a$11$W2j ...<snip>",
 :user/email "[email protected]", :user/account {:db/id 285873023222777}}
dev> (get-in jane [:user/account :account/type])
:account.type/paid

The Datomic API also includes a function, touch, that loads all of the entity's attributes, as well as every component entity. Here it is used on Jane's account entity.

dev> (d/touch (:user/account jane))
{:db/id 285873023222777, :account/type :account.type/paid,
 :account/current-balance 2.55M,
 :account/transaction #{
  {:db/id 285873023222779, :transaction/type :transaction.type/payment,
   :transaction/amount -7.99M}
  {:db/id 285873023222778, :transaction/type :transaction.type/charge,
   :transaction/amount 7.99M}
  {:db/id 285873023222780, :transaction/type :transaction.type/charge,
   :transaction/amount 2.55M}}}

You see how the results contain all of the attributes of Jane's account, plus all of the related transactions since they're components.

The Query API

There are two ways to access the query API: the more traditional q function, and a query function that takes arguments in a slightly different format and supports setting a query timeout. Let's use the q function here, as it's considered more idiomatic for most types of queries.

Here's a simple example, adapted from the earlier section on queries.

(d/q '[:find ?login :in $ ?email
       :where [?user :user/email ?email]
              [?user :user/login ?login]]
     db "[email protected]")
#{["janed"]}

This query looks up the login of the user with a given email. You see that q takes the query as the first parameter. After that, there are some new things here. First, you see this :in form that wasn't in the previous queries. This is used to specify the inputs to the query, in the same positional order as they are passed to q.

Here the db is the first argument, and it's called $ in the inputs. This is how you refer to the db or dbs—it's possible to query against more than one db!—that the query is targeting. Much like the reader-macro syntax for Clojure anonymous functions, if there's only one input db you can just call it $, and it's used as the implicit database for all of your where patterns. If there are multiples, you need to delineate them with $db1 $db2, for example, and each pattern needs to begin with the datasource it's targeting.
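
Here's a sketch of the multi-database form, with hypothetical db values users-db and tasks-db, just to show the shape. (Joining across dbs on entity ids like this assumes the ids line up, for example two points in time of the same database.)

;; each :where pattern names the datasource it runs against
(d/q '[:find ?login ?title
       :in $users $tasks ?email
       :where [$users ?user :user/email ?email]
              [$users ?user :user/login ?login]
              [$tasks ?task :task/user ?user]
              [$tasks ?task :task/title ?title]]
     users-db tasks-db "[email protected]")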

The second argument is an input binding that can be used anywhere in the query. In this query you're passing in the email address that you're looking for. You can have an arbitrary number of these inputs. You can also pass in collections, in which case you need to slightly change the format of the input binding. Here is the same query, except this time you're looking up the logins for a number of email addresses.

(d/q '[:find ?login :in $ [?email ...]
       :where [?user :user/email ?email]
              [?user :user/login ?login]]
      db
      ["[email protected]" "[email protected]" "[email protected]"])

Notice how this uses [?email ...] as the binding form. This is the required format when you're binding a collection.

The Pull API

Often you want to find some specific entities with a query, and then retrieve more information about those entities. You can do this with the entity API, but that can get somewhat verbose, and it isn't very declarative or elegant, particularly when compared to the queries. To solve this problem, Datomic introduced the pull API. This lets you describe the "shape" of the data you want returned about an entity or entities, and you can even integrate it directly into your queries.

When you use the pull API, normal Clojure data structures are returned: vectors for collections of results and maps for data about entities. Let's look at an example, building on the previous query, but returning the user's login and some information about their account.

dev> (d/q '[:find (pull ?user [:user/login
                               {:user/account [{:account/type [:db/ident]}]}])
            :in $ ?email
            :where [?user :user/email ?email]]
           db "[email protected]")
[[{:user/login "janed", :user/account
   {:account/type {:db/ident :account.type/paid}}}]]

In pull specs, vectors specify lists of attributes to return, and the maps specify relationships. Here you are specifying that you want the :user/login, and then in the account via the :user/account relationship, you want the :db/ident of the :account/type.

Using pull specs as part of queries is often the best way to retrieve data that's meant to be returned to an external system, since the results can be directly serialized as edn and sent over the wire without any further transformation.
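
You can also use a pull spec outside of a query with the pull function, which takes a db, a pattern, and an entity id or lookup ref:

dev> (d/pull db [:user/login {:user/account [{:account/type [:db/ident]}]}]
             [:user/login "janed"])
{:user/login "janed",
 :user/account {:account/type {:db/ident :account.type/paid}}}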

Transactions

You can perform transactions with the transact function. It accepts a connection and the transaction data, and returns a future containing a transaction result map. Let's take a look at an example, and what the various parts of the result map mean.

dev> (pprint @(d/transact conn [{:db/id (d/tempid :db.part/user)
                                 :task/title "Hello world"}]))
{:db-before datomic.db.Db@c7f1e174,
 :db-after datomic.db.Db@d62413b7,
 :tx-data
 [#datom[13194139534349 50 #inst "2016-02-06T20:22:19.369-00:00"
         13194139534349 true]
  #datom[17592186045454 67 "Hello world" 13194139534349 true]],
 :tempids {-9223350046623220340 17592186045454}}

The result map contains a :db-before key that contains the db value from immediately before the transaction, and a :db-after key that contains the db value immediately after it. You can pull either of these dbs directly out of the map and work with them as with any other db value. Using the :db-after in particular is quite common.
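
A common pattern is to read your own write by combining :db-after and :tempids with the resolve-tempid function. A sketch, where tx-data is your transaction data and tempid is a temp id you created with d/tempid and used in that data:

(let [{:keys [db-after tempids]} @(d/transact conn tx-data)
      eid (d/resolve-tempid db-after tempids tempid)]
  ;; read the entity from the post-transaction db value
  (d/entity db-after eid))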

The :tx-data key contains a collection of all of the Datoms written during this transaction. Here you see there are two: one for the :task/title Datom that was asserted, and another for the transaction entity.

Finally, the :tempids key contains a mapping from the tempids created for the transaction to the actual entity ids written in the database. In this case the entity created has an id of 17592186045454. Let's see what happens when you try to assert a duplicate Datom for this entity.

dev> (pprint @(d/transact conn
                          [[:db/add 17592186045454
                            :task/title "Hello world"]]))
{:db-before datomic.db.Db@d62413b7,
 :db-after datomic.db.Db@10110222,
 :tx-data
 [#datom[13194139534351 50 #inst "2016-02-08T20:27:16.653-00:00"
         13194139534351 true]],
 :tempids {}}

You tried to add the Datom [17592186045454 :task/title "Hello world"], but that Datom already existed. Notice that the transaction succeeds, but the :tx-data key only has the Datom for the transaction entity. Datom assertions are idempotent. If you ran the same experiment in the other direction—that is, retracted the same Datom twice—you'd see a similar result: only the first transaction would have the retraction Datom in its :tx-data, and the second would include only the transaction Datom.

Time Travel Will Never Be Impossible Forever

Once confined to fantasy and science fiction, time travel is now simply an engineering problem.

—Michio Kaku

Datomic's treatment of time as first class, and the database as a value, gives you as a developer the power of time travel. Let's look at the more obvious type of time travel in Datomic: journeying to the past.

Let's start by simulating time advancing forward by adding some data. Create a new task with an issue-id, and then use the issue-id to upsert several new values of the task's description. Each time you'll capture the resulting db value.

dev> (def db1 (:db-after @(d/transact conn [{:db/id (d/tempid :db.part/user)
                                        :task/issue-id "Hello"}])))
dev> (def db2 (:db-after @(d/transact conn [{:db/id (d/tempid :db.part/user)
                                        :task/issue-id "Hello"
                                        :task/description "First description"}])))
dev> (def db3 (:db-after @(d/transact conn [{:db/id (d/tempid :db.part/user)
                                        :task/issue-id "Hello"
                                        :task/description "Second description"}])))

It's time to introduce another Datomic API function: as-of, which takes a db value and a t value—the result of calling basis-t, a transaction id, or a date—and returns a new db value as of that point in time. Now let's see how you can travel back in time.

dev> (def now-db (d/db conn)) ;; get the most current db
dev> (:task/description (d/entity now-db [:task/issue-id "Hello"]))
"Second description"
dev> (:task/description (d/entity (d/as-of now-db (d/basis-t db2))
                                  [:task/issue-id "Hello"]))
"First description"

You can use the db from the past anywhere in the Datomic API that expects a db: even queries.

dev> (def description-query '[:find ?i . :in $ ?desc
                              :where [?i :task/description ?desc]])
dev> (d/q description-query now-db "Second description")
17592186045446
dev> (d/q description-query (d/as-of now-db (d/basis-t db2)) "First description")
17592186045446

What's perhaps more unexpected is that you can travel into the future—at least, into a speculative possible future like the kind that Ebenezer Scrooge was brought to in A Christmas Carol. This is done with the somewhat innocuously named with function in the Datomic API. It takes a db and some transaction data, and returns the result of what would occur if the data were transacted, including a :db-after key with a new db value. This lets you chain multiple simulated transactions together, like so:

dev> (def future-db (-> now-db
                        (d/with [{:db/id (d/tempid :db.part/user)
                                  :task/issue-id "Hello"
                                  :task/description "Third description"}])
                        :db-after
                        (d/with [{:db/id (d/tempid :db.part/user)
                                  :task/issue-id "Hello"
                                  :task/title "Hello world"}])
                        :db-after))
dev> (d/touch (d/entity future-db [:task/issue-id "Hello"]))
{:db/id 17592186045446, :task/description "Third description",
 :task/title "Hello world", :task/issue-id "Hello"}

BUILDING APPLICATIONS WITH DATOMIC

Let's take a look at how to create the data access layer for the task tracker app.

User Functions

Start with the user code. In the application source folder, create a user.clj file with these initial contents:

(ns chapter-6.user
  "Database functions for user operations"
  (:require [datomic.api :as d]
            [crypto.password.bcrypt :as password]))
(defn entity
  "Returns user entity from :db/id, string login, or arg if already
   an entity."
  [db user]
  (cond (instance? datomic.query.EntityMap user) user
        (string? user) (d/entity db [:user/login user])
        :else (d/entity db user)))
(defn id
  "Returns :db/id for a user when passed an entity, login, or long id."
  [db user]
  (:db/id (entity db user)))

This sets up our namespace declaration, including the bcrypt functionality from the crypto.password library that we'll use to encrypt our passwords. You can also see there are two utility functions intended to help support more flexible APIs. The entity function will coerce several types of arguments into a Datomic entity for the user. If it's already an entity, it will return what's passed. If it's a string, it assumes it's a user login and creates an entity using the lookup ref. Otherwise, it assumes it's an entity id or something that can be passed directly to Datomic's entity function.

The id helper function returns the :db/id of a user passed in any of the forms that entity accepts.

Now let's create a function to check if a login and password are correct. This function can take advantage of the login as a lookup ref since it's being passed in, use that to retrieve the user entity and its encrypted password, and check the submitted plaintext password against the crypted one.

(defn check-login
  "Checks if login and password are correct, and if so returns the
   user entity."
  [db login password]
  (when-let [user (d/entity db [:user/login login])]
    (when (password/check password (:user/password user))
      user)))

Often in user registration forms, you want to quickly check if the user's login or email address is already taken, and if so give a message to the user that they're not available. Let's write two simple functions to perform those availability checks.

(defn login-available?
  "Checks if a login is available. Returns false if already used by a
   user."
  [db login]
  (nil? (d/entity db [:user/login login])))
(defn email-available?
  "Checks if an email address is available. Returns false if already
   used by a user."
  [db email]
  (nil? (d/q '[:find ?user .
               :in $ ?email
               :where [?user :user/email ?email]]
             db email)))

The login availability check is somewhat simpler to do since you can use login as a lookup ref. The email check is a little more involved, since you have to perform a query. It's a simple query, however, that will return either a single user id matching the email, or nil if none is found.

Ensuring Data Integrity with Transaction Functions

Next, let's add the function that creates users. Since this function needs to perform a transaction, it will take a Datomic connection rather than a db. However, there's one nagging problem here. Since user logins are upsertable, you need a way to ensure that the user you're creating doesn't have a login that's already in use.

You could try using the availability check function to see if it already exists. This would cover most of the cases, but if you have a lot of experience with databases you might already see the problem with this approach. Whenever you do a read and then a write in this way, you open yourself up to a race condition where, in this case, a user with that login gets created by someone else between the read and the write.

Relational databases solve this with transactions spanning both reads and writes and complicated locking semantics. Datomic doesn't need that because it solves the problem with transaction functions.

Transaction functions are run by the transactor as part of transaction processing. Datomic comes with two built-in functions: a compare-and-swap operation (:db.fn/cas) and an entity retraction operation (:db.fn/retractEntity). But neither of these quite meets our needs. What's needed is a transaction function that checks whether the identity value already exists at the time of the transaction.

In Datomic, functions are represented as entities with a :db/fn attribute that has the value of a Datomic function object. You can create function objects by calling the Datomic API's function method with a map of information about that function, or with the #db/fn reader literal with a map in that same format. The map describes what language the function is in (either Java or Clojure), what parameters it accepts, a string containing the function body, and optional import and require lists.

Another nifty feature of Datomic function objects is that they are callable as regular Clojure functions. Here's a simple example that adds two numbers:

(def add
  (d/function {:lang "clojure"
               :params '[x y]
               :code "(+ x y)"}))
(add 1 2) ;; => 3

Transaction functions are a special type of Datomic function. They have some additional requirements:

  • The first argument is a Datomic db.
  • They should return transaction data, or throw an exception to fail the transaction.

Let's create a transaction function that will ensure that an identity value for a given attribute doesn't exist, and add it to the schema file.

{:db/id #db/id[:db.part/task]
 :db/ident :add-identity
 :db/fn #db/fn{:lang "clojure"
               :params [db e ident-attr value]
               :code "(if (d/entity db [ident-attr value])
                        (throw (ex-info (str value \" already exists for \"
                                             ident-attr)
                                        {:e e
                                         :attribute ident-attr
                                         :value value}))
                        [[:db/add e ident-attr value]])"}}

This function takes a db value, an entity id to add the identity for, the identity attribute, and the value. The function looks to see if that identity value already exists, and if so it throws an exception. Otherwise, it returns transaction data that adds the identity to the entity id. Give it an ident, :add-identity, so you can refer to it in transactions.

You can make use of this in the function to create users.

(defn create
  "Attempts to create a new user entity with given login, password,
   and email. If paid? is true, creates a paid account and link to the
   user, otherwise creates a free account. Returns the transaction
   data if successful. Will throw an exception if the login or email
   already belong to another user."
  ([conn login password email]
   (create conn login password email false))
  ([conn login password email paid?]
   (let [tempid (d/tempid :db.part/account)
         user-tx [{:db/id tempid
                   :user/email email
                   :user/password (password/encrypt password)
                   :user/account {:account/type (if paid?
                                                  :account.type/paid
                                                  :account.type/free)}}
                  ;; db function ensures login doesn't exist already
                  [:add-identity tempid :user/login login]]]
     @(d/transact conn user-tx))))

You can see how the calling syntax for database functions is fairly close to the way the list-syntax for transaction data works. The first list element is the transaction function name, the following elements are the arguments. You didn't need to use this transaction function for emails since the email was made with a unique value attribute, and Datomic throws an exception that aborts the transaction if there's a duplicate.

Account Functions

Let's take a look at some functions that deal with the charges, payments, and balance of the users' accounts in our app. The account model maintains two sources for an account's current balance: the sum of all the charges, payments, and adjustment transactions against the account; and the :account/current-balance attribute. Keeping those in sync can be a challenge.

Here is another place where transaction functions can be helpful. Instead of manually calculating and asserting the new balance each time you add a transaction, and opening up the risk of another possible race condition, you can let the transactor handle this for you. You have two choices: you can use Datomic's compare-and-swap function, which will fail the transaction if the current balance has changed since the time it was read, or you can create your own transaction function that automatically adjusts the balance by applying the transaction amount.

Using compare-and-swap has the advantage of being built in, but the disadvantage is that it will fail transactions when you don't really need it to. If you want to use it in the transaction, be sure to add something like this:

[:db.fn/cas account-id :account/current-balance
            expected-current-balance new-balance]

Writing your own transaction function for this gives you something like:

[{:db/id #db/id[:db.part/account]
  :db/ident :account/update-balance
  :db/fn #db/fn{:lang "clojure"
                :params [db a amt]
                :code "(let [acct (d/entity db a)
                             balance (or (:account/current-balance acct) 0)]
                         [[:db/add a :account/current-balance
                           (bigdec (+ balance amt))]])"}}]

If you use this approach, here is the function to add a new transaction:

(defn add-transaction
  "Takes a conn, user or account entity, transaction type (ident of
   the enum), and amount, and adds the transaction to the user's
   account. Returns the transaction data."
  [conn user-or-account trans-type amount]
  (let [account (or (:user/account user-or-account) user-or-account)
        amount (bigdec amount)
        charge-tx [{:db/id (d/tempid :db.part/account)
                    :transaction/type trans-type
                    :transaction/amount amount
                    :account/_transaction (:db/id account)}
                   [:account/update-balance (:db/id account) amount]]]
    @(d/transact conn charge-tx)))

You can also add helper functions for each transaction type, such as add-charge, add-payment, and add-adjustment, as sketched below.
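
These helpers might look something like the following, each delegating to add-transaction with the appropriate enum ident:

(defn add-charge
  [conn user-or-account amount]
  (add-transaction conn user-or-account :transaction.type/charge amount))
(defn add-payment
  [conn user-or-account amount]
  (add-transaction conn user-or-account :transaction.type/payment amount))
(defn add-adjustment
  [conn user-or-account amount]
  (add-transaction conn user-or-account :transaction.type/adjustment amount))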

Task Functions

Now for the heart of the system: tasks. Let's start with the basics of task creation in a new namespace for the task database functions. Create a new file in the project's source folder called task.clj and add the basic namespace declaration, along with a few dependencies that are needed.

(ns chapter-6.task
  (:require [clojure.string :as str]
            [datomic.api :as d]
            [chapter-6.user :as user]))

For the task creation API, the function needs to accept a connection, some reference to the user that is creating the task, and various data for the task: a title, description, status, issue-id, tags, and the parent task if it's a subtask.

To keep the required data to a minimum, let's say only the user and the title are required. The rest you can either leave empty, give a useful default, or generate yourself.

You'll definitely need an issue-id, since that's the identity attribute. If one isn't supplied, let's write a function that will generate it from the issue's title. It will take the first word of the title, capitalize it, and try incremental numeric postfixes until it finds one that doesn't yet exist in the database.

(defn issue-id-from-title
  "Takes a db and task title, returns an issue-id using the first
   word in the title with a numeric postfix that does not already
   exist in the db."
  [db title]
  (->> (range)
       (map (partial str (.toUpperCase (first (str/split title #"\s+"))) "-"))
       (remove (fn [issue-id] (d/entity db [:task/issue-id issue-id])))
       first))

You can also add an entity function as you had in the user namespace to make the API more flexible, for example, by accepting different forms for the parent task.

(defn entity
  "Returns task entity from :db/id, string issue-id, or arg if already
   an entity."
  [db task]
  (cond (instance? datomic.query.EntityMap task) task
        (string? task) (d/entity db [:task/issue-id task])
        :else (d/entity db task)))

Now you can add the create function. Since you're adding an issue-id identity attribute, and it may suffer from the same kind of race condition problems that you noticed with user logins, you should make use of the database function you created for that.

(defn create
  "Takes a Datomic connection, a user entity, and a map with task
   info, and attempts to create a task in the database. Returns
   transaction data. Task info map has keys:
   * :title (required)
   * :description
   * :status - one of :todo, :in-progress, :done [:todo]
   * :issue-id - defaults to the first word in the title with a numeric
      postfix
   * :tags - a set of strings
   * :parent - task entity, issue id, or :db/id"
  [conn user
   {:keys [title description status issue-id tags parent]
    :or {status :todo}}]
  (assert title ":title is required")
  (let [tempid (d/tempid :db.part/task)
        db (d/db conn)
        status (keyword "task.status" (name status))
        issue-id (or issue-id (issue-id-from-title db title))
        tags (some->> tags
                      (map (partial hash-map :tag/name))
                      (set))
        parent (entity db parent)]
    @(d/transact conn
                 [(cond-> {:db/id tempid
                           :task/user (:db/id user)
                           :task/title title
                           :task/issue-id issue-id
                           :task/status status}
                    description (assoc :task/description description)
                    tags (assoc :task/tag tags)
                    parent (assoc :task/parent (:db/id parent)))
                  [:add-identity tempid :task/issue-id issue-id]])))

Notice that since tags are upsertable via :tag/name, you don't have to care whether they already exist: you can simply assert them here, and Datomic will either create them or wire the task up to the existing ones. The :add-identity database function is the one created for user logins, but it was made flexible enough to work with any identity attribute.

Next, you can add a straightforward function that assigns a task a new parent, supporting the operation in the example app where the user is re-organizing their tasks.

(defn set-parent
  "Sets parent of task to parent."
  [conn task parent]
  (let [db (d/db conn)]
    @(d/transact conn [[:db/add (:db/id (entity db task))
                        :task/parent (:db/id (entity db parent))]])))

You can add similar functions to set task's status, and add and remove tags.

(defn set-status
  "Sets status of task to status, one of :todo :done :in-progress."
  [conn task status]
  (let [db (d/db conn)]
    @(d/transact conn [[:db/add (:db/id (entity db task))
                        :task/status (keyword "task.status" (name status))]])))
(defn add-tag
  "Adds a tag to task. Tag is the string name of the tag."
  [conn task tag]
  (let [db (d/db conn)]
    @(d/transact conn [{:db/id (:db/id (entity db task))
                        :task/tag #{{:tag/name tag}}}])))
(defn remove-tag
  "Removes tag from a task. Tag is the string name of the tag."
  [conn task tag]
  (let [db (d/db conn)]
    @(d/transact conn [[:db/retract (:db/id (entity db task))
                        :task/tag [:tag/name tag]]])))

Now, let's add a query function that returns all of the top-level tasks for a user. You'll add an optional parameter that flags whether the subtasks should be returned. Since you're using the pull API, let's first write the pull-spec.

(def task-pull-spec
  "Basic pull spec for task info."
  [:task/title :task/description {:task/status [:db/ident]}
   :task/issue-id {:task/tag [:tag/name]}])

This pull-spec includes the basic fields for the task, the ident of the task's status, and the name of all the task's tags. Now you can add your query function.

(defn for-user
  "Returns all the top-level tasks for a user. If sub-tasks? is true,
   recursively returns the sub-tasks as well."
  ([db user]
   (for-user db user false))
  ([db user subtasks?]
   (let [pull-spec (cond-> task-pull-spec
                     subtasks? (conj {:task/_parent '...}))
         query '[:find (pull ?task spec)
                 :in $ ?user spec
                 :where [?task :task/user ?user]
                        (not [?task :task/parent _])]]
     (d/q query db (user/id db user) pull-spec))))

If the subtasks flag is true, this function adds something interesting to the pull-spec. The addition of {:task/_parent '...} means that it will apply the entire pull-spec recursively using the parent reverse reference. That is, it will recursively pull all of the task's children, and their children, and so forth. The ... turns it into a recursive spec, or you could have put a number there instead to limit the recursion depth. The result is that the entire subtask tree of a task will be pulled for the app to use or display for the user.

Also notice that you can pass pull-specs into the query as arguments and bind them as plain symbols, without the ? prefix used for ordinary variables.
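
To see exactly what the query receives when subtasks? is true, here is the effective pull-spec that cond-> produces:

(conj task-pull-spec {:task/_parent '...})
;; => [:task/title :task/description {:task/status [:db/ident]}
;;     :task/issue-id {:task/tag [:tag/name]}
;;     {:task/_parent ...}]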

Finally, let's look at how you can use the fulltext search index added for the :task/title attribute to search for tasks by title. You can reuse the pull-spec created earlier. The query function looks like this:

(defn search-title
  "Searches for tasks using a fulltext search on the title. Returns
   the matching tasks as well as the match score against the search
   string."
  [db user search]
  (d/q '[:find (pull ?task spec) ?score
         :in $ ?user ?search spec
         :where [(fulltext $ :task/title ?search) [[?task _ _ ?score]]]
                [?task :task/user ?user]]
       db (user/id db user) search task-pull-spec))

Notice the use of the built-in fulltext function in the query. It accepts the db source, the attribute to search, and the search string, and returns a relation of four-tuples of the form [?entity ?value ?tx ?score]. Since this function cares only about the entity and the match score, the value and the transaction are ignored with underscores.
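
As a usage sketch, a call might look like the following; the user, search string, and the returned score are all illustrative, not real output from the example app:

(search-title (d/db conn) "jess" "groceries")
;; => [[{:task/title "Buy groceries", ...} 0.625]
;;     ...]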

Wrapping Up

Notice how little actual code it took to power a large amount of functionality in this chapter. Datomic's powerful API, and the investment made in the application's data model, really pay off when it comes to building the application: you get a lot of leverage out of concise code.

Testing is outside the scope of this chapter, but the value semantics of Datomic dbs, and the ease with which you can stand up in-memory Datomic databases, make it possible to write a thorough automated test suite for your application code with a minimum of machinery.
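
As a minimal sketch of what that machinery might look like, the fixture below stands up a uniquely named in-memory database and installs the schema. The fixture itself and its schema-tx parameter (your schema transaction data) are illustrative, not part of the example app:

(require '[datomic.api :as d])

(defn fresh-conn
  "Returns a connection to a fresh, uniquely named in-memory database
   with the given schema transaction installed. Each call yields an
   isolated database, so tests cannot interfere with one another."
  [schema-tx]
  (let [uri (str "datomic:mem://" (d/squuid))]
    (d/create-database uri)
    (let [conn (d/connect uri)]
      @(d/transact conn schema-tx)
      conn)))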

Deployment

In order to deploy a production Datomic system with a transactor and persistent store, you need to download one of the release distributions. As of this writing, there are three tiers of Datomic license:

  • Datomic Free. It is, as the name implies, free of cost and doesn't require registration. It lets you run a transactor using the “dev” storage with a maximum of two peers. The “dev” storage is actually an embedded SQL database (H2) that the peers access via the transactor.
  • Datomic Pro Starter. This is also free of cost, although it requires registration and an access code to download and use in your project. It allows you to use any of the storage backends with a limited number of peers.
  • Datomic Pro. This is the paid edition of Datomic, and is licensed by the number of peers. It gives access to all of the features, including all of the storage backends, high-availability transactor mode, and memcached.

Once you have a version of Datomic downloaded, you need to choose your storage backend and set up the transactor. The Datomic documentation section on deployment is an excellent guide to this process, and is available here: http://docs.datomic.com/deployment.html.
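
Once the transactor is running, pointing a peer at it is just a matter of the connection URI. Here is an illustrative sketch using the “dev” storage protocol; the host, the default port, and the "tasks" database name are all assumptions:

(require '[datomic.api :as d])

(def uri "datomic:dev://localhost:4334/tasks")

(d/create-database uri)   ; idempotent; returns false if it already exists
(def conn (d/connect uri))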

The Limitations

No system can handle every possible use case and scenario. Every system has tradeoffs, and those tradeoffs impose limitations on how the system can be used. Datomic is no exception. In building Datomic, Rich, Stuart, et al. made design decisions that make Datomic well suited for use as a transactional system for storing domain data, in the way many applications use a relational database today.

Those same design decisions place some limitations on what you can do with Datomic, and it is helpful to know up-front what those are, and what types of use cases are not a good fit for Datomic.

Hard (and Soft) Limits

The most obvious limitation of Datomic is that there is only one writer: the transactor. This means that write scalability is limited to vertical scaling, and applications built on Datomic can only handle the write volume that a single transactor node can handle. In practice, this is rarely a critical limitation, since it's not difficult to achieve hundreds of sustained writes per second through the transactor, with bursts into the thousands per second.

If your application needs more write volume than that, you are generally already outside the range where a relational database would perform well without tuning and designing your schema to optimize your read patterns. In other words, you've already given up many of the benefits of a relational database, and one of the NoSQL databases designed around that kind of write volume might be a better fit than Datomic.

Another limit is the number of “schema” elements. This includes both schema attributes and partitions. There is a hard maximum of 2^20 of these, a little over one million. It is unlikely you'll ever reach this limit (in fact, something has probably gone terribly wrong if you have), but it's something to be aware of.

The maximum number of entity ids in any partition is 2^42, which is around 4.4 trillion. This is a purely theoretical limitation, included here just so you don't worry about it: it is effectively impossible to get that many entities into Datomic, principally because of the final, most imposing limitation.

There is a soft limit on the total number of Datoms in a database of around 10 billion, including assertions, retractions, and history. It's a soft limit because nothing will stop you from adding more; in fact, Cognitect is aware of a production Datomic database with more than 14 billion Datoms. It is nevertheless a limit to be concerned with, for three reasons.

  • Cognitect's automated testing of Datomic includes database sizes of up to 10 billion Datoms, so there's reason to be confident that Datomic works at that volume. Above that, however, you are officially in uncharted waters.
  • The index data structures start to become quite bloated at this size, such that query performance is likely to suffer and there will be increased resource demands on every peer node.
  • Transactor performance will take a real hit as the number of directory nodes and segments increases; the magnitude of the hit depends on how varied your write patterns are and how well you have designed your partitions. If the transactor's re-indexing job needs to rewrite an excessively large number of data segments (because the indexes have grown very large and your writes are spread across many segments), it may be unable to complete, and your database can fall over to the point of requiring a restore from a previous safe backup.

SUMMARY

This chapter covered in detail how Datomic changes the way you build database-backed applications. Datomic's emphasis on flexibility and expressiveness is enabled by its simple yet powerful data model and its innovative ideas about design and immutability. With so many new ideas in one system, Datomic in many ways represents a fundamental rethinking of the database.

While this chapter aimed to provide enough breadth to show what's possible with Datomic, and enough depth to leave you with the information and understanding you need to begin building applications with it, there is more to cover. The Datomic documentation is a huge trove of valuable information, and there are a number of great presentations, blog posts, and videos covering different aspects of Datomic, along with case studies.

We hope that you are now aware of the possibilities that Datomic opens up for your applications, and that you share a glimmer of the excitement we feel about working with it.
