Glossary

I can’t believe what a bunch of nerds we are. We’re looking up “money laundering” in a dictionary.

Peter, Office Space

This glossary provides definitions of some of the terms that are important to understand when working with Apache Cassandra. There’s some really good material at http://wiki.apache.org/cassandra, but reading it for the first time can be tricky, as each new term seems to be explained only with other new terms. Many of these concepts are daunting to beginning or even intermediate web developers or database administrators, so they’re presented here in an easy reference. Much of the information in this glossary is repeated and expanded upon in relevant sections throughout this book.

Anti-Entropy

Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.

Here’s how it works. During a major compaction (see Compaction), the server initiates a TreeRequest/TreeResponse conversation to exchange Merkle trees with neighboring nodes. The Merkle tree is a hash representing the data in that column family. If the trees from the different nodes don’t match, then they have to be reconciled (or “repaired”) in order to determine the latest data values they should all be set to. This tree comparison validation is the responsibility of the org.apache.cassandra.service.AntiEntropyService class. AntiEntropyService implements the Singleton pattern and defines the static Differencer class as well. This class is used to compare two trees, and if it finds any differences, it launches a repair for the ranges that don’t agree.

Anti-entropy is used in Amazon’s Dynamo, and Cassandra’s implementation is modeled on that (see Section 4.7 of the Dynamo paper).

In Dynamo, they use a Merkle tree for anti-entropy (see Merkle Tree). Cassandra does too, but the implementation is a little different. In Cassandra, each column family has its own Merkle tree; the tree is created as a snapshot during a major compaction operation, and it is kept only as long as is required to send it to the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk I/O.

See Read Repair for more information on how these repairs occur.

Async Write

Sometimes called “async writes” in documentation and user lists, this simply means “asynchronous writes” and refers to the fact that Cassandra makes heavy use of java.util.concurrent library components such as ExecutorService and Future<T> for writing data to buffers.
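As a rough illustration of this pattern only (the class and method names here are hypothetical, not Cassandra’s actual internals), a write can be handed to an executor and tracked with a Future:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /** Sketch of the asynchronous-write pattern; illustrative names only. */
    public class AsyncWriteSketch {
        private final ExecutorService writeStage = Executors.newFixedThreadPool(8);

        /** Submits the write and returns immediately; the Future tracks completion. */
        public Future<?> write(byte[] key, byte[] value) {
            return writeStage.submit(() -> buffer(key, value));
        }

        // In Cassandra the real work would be appending to the commit log
        // and updating the memtable; this method only stands in for it.
        private void buffer(byte[] key, byte[] value) { /* ... */ }
    }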

Avro

Avro is (probably) replacing Thrift as the RPC client for interacting with Cassandra. Avro is a subproject of the Apache Hadoop project, created by Doug Cutting (creator of Hadoop and Lucene). It provides functionality similar to Thrift, but is a dynamic data serialization library that has an advantage over Thrift in that it does not require static code generation. Another reason that the project is migrating to Avro is that Thrift was originally created by Facebook and then donated to Apache, but since that time has received little active development attention.

This means that the Cassandra server will be ported from org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer. As of this writing, this is underway but not yet complete.

You can find out more about Avro at its project page, http://avro.apache.org.

Bigtable

Bigtable is a distributed database created at Google in 2006 as a high-performance columnar database on top of Google File System (GFS). Bigtable and Amazon’s Dynamo database are the direct parents of Cassandra. Cassandra inherits these aspects from Bigtable: sparse array data and disk storage using an SSTable.

Apache HBase is an open source Bigtable implementation.

You can read the complete Google Bigtable paper at http://labs.google.com/papers/bigtable.html.

Bloom Filter

In simple terms, a Bloom filter is a very fast, nondeterministic algorithm for testing whether an element is a member of a set. It is nondeterministic because a false-positive read is possible, but a false-negative is not. Bloom filters work by hashing the values in a dataset into a bit array, condensing a larger dataset into a compact digest. The digest, by definition, uses a much smaller amount of memory than the original data would.

Cassandra uses Bloom filters to reduce disk access, which can be expensive, on key lookups. Every SSTable has an associated Bloom filter; when a query is performed, the Bloom filter is checked first before accessing disk. Because false-negatives are not possible, if the filter indicates that the element does not exist in the set, it certainly doesn’t; if the filter thinks that the element is in the set, the disk is accessed to make sure.

Although the possibility of false-positives is a disadvantage of Bloom filters, their advantage is that they can be very fast because they use space so efficiently: unlike simple arrays, hashtables, or linked lists, they do not store their elements completely, keeping only the compact digest in memory. The trade is a small amount of memory for greatly reduced disk access. One consequence is that, for a filter of fixed size, the rate of false-positives increases as the number of elements increases.
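To make the mechanism concrete, here is a minimal Java sketch of a Bloom filter. It is illustrative only, not Cassandra’s implementation, and the double-hashing scheme is simplified:

    import java.util.Arrays;
    import java.util.BitSet;

    /** Minimal Bloom filter sketch: k hash functions set and test bits in a bit array. */
    public class BloomFilterSketch {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        public BloomFilterSketch(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        // Derive the i-th bit index from two base hashes (double hashing).
        private int index(byte[] key, int i) {
            int h1 = Arrays.hashCode(key);
            int h2 = (h1 >>> 16) | 1; // odd stride so the k indexes differ
            return Math.abs((h1 + i * h2) % size);
        }

        public void add(byte[] key) {
            for (int i = 0; i < hashCount; i++) bits.set(index(key, i));
        }

        /** False means definitely absent; true means only “probably present”. */
        public boolean mightContain(byte[] key) {
            for (int i = 0; i < hashCount; i++)
                if (!bits.get(index(key, i))) return false;
            return true;
        }
    }

In Cassandra’s terms, a false return from a check like mightContain means the SSTable certainly does not hold the key; a true return still requires going to disk to confirm.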

Bloom filters are used by Apache Hadoop, Google Bigtable, and Squid Proxy Cache. They are named for their inventor, Burton Bloom.

Cassandra

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. She was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would accurately predict everything that would happen, yet no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her.

The data store itself is an Apache project available at http://cassandra.apache.org. It started in incubator status in January of 2009. It has the following key properties: it is decentralized, elastic, fault-tolerant, tuneably consistent, highly available, and designed to massively scale on commodity servers spread across different data centers. It is in use at companies such as Digg, Facebook, Twitter, Cloudkick, Cisco, IBM, Reddit, Rackspace, SimpleGeo, Ooyala, and OpenX.

Cassandra was originally written at Facebook to solve their Inbox Search problem. The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and search team engineer Prashant Malik as key engineers. The code was released as an open source Google Code project in July of 2008. In March of 2009, it was moved to an Apache Incubator project, and on February 17, 2010, it was voted into a top-level project.

A central paper on Cassandra by Facebook’s Lakshman and Malik, called “Cassandra: A Decentralized Structured Storage System,” is available at http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf.

A blog post from 2008 by Avinash Lakshman describes how they were using Cassandra at Facebook: http://www.facebook.com/note.php?note_id=24413138919&id=9445547199&index=9.

It is easy to see why the Cassandra database is aptly named: its community asserts that Cassandra and other related NoSQL databases are the future. Despite widespread use of eventually consistent databases at companies such as Amazon, Google, Facebook, and Twitter, there remain many skeptics (“nonbelievers”) of such a model. It is further speculated that by naming the database Cassandra, after the Greek seer of the future, its creators are making a kind of joking reference to the Oracle database.

The Java client Hector by Ran Tavory is named for Cassandra’s brother.

Chiton

In ancient Greece, a chiton was a cloth garment, typically sleeveless, worn by both men and women. It is the namesake for the open source project Chiton by Brandon Williams, which is a Python GTK-based browser for Apache Cassandra. It is currently hosted at http://github.com/driftx/chiton.

A related project is Telephus, a low-level client API for Cassandra written in Twisted Python. It is currently hosted at http://github.com/driftx/Telephus.

Cluster

A cluster is two or more Cassandra instances acting in concert. These instances communicate with one another using Gossip.

When you configure a new instance to introduce to your cluster, you’ll need to do a few things. First, indicate a Seed Node. Next, indicate the ports on which Cassandra should listen for two things: Gossip and the Thrift interface. Once your cluster is configured, use the Node Tool to verify that it is set up correctly.

Column

A column is the most basic unit of representation in the Cassandra data model. A column is a triplet of a name (sometimes referred to as a “key”), a value, and a timestamp. A column’s name, value, and timestamp are all supplied by the client. The data type for the name and value are Java byte arrays. The data type for the timestamp is a long primitive. Columns are immutable in order to prevent multithreading issues.

Columns are organized into column families.

The column is defined in Cassandra by the org.apache.cassandra.db.IColumn interface, which allows a variety of operations, including getting the value of the column as a byte[] or getting its subcolumns as a Collection<IColumn>, and finding the time of the most recent change.
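As a rough model only (a hypothetical class, not the real implementation behind IColumn), a column can be pictured as a simple triplet:

    /** Illustrative model of Cassandra’s column triplet; not the actual class. */
    public class ColumnSketch {
        public final byte[] name;     // supplied by the client
        public final byte[] value;    // supplied by the client
        public final long timestamp;  // also client-supplied; see Timestamp

        public ColumnSketch(byte[] name, byte[] value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }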

Within a row, columns are sorted according to the comparator type defined for their column family, which is one of AsciiType, BytesType, LexicalUUIDType, LongType, TimeUUIDType, or UTF8Type.

See also Column Family.

Column Family

A column family is roughly analogous to a table in a relational model. It is a container for an ordered collection of columns.

Because each column family is stored in a separate file, be sure to define columns that you are likely to access together in the same column family.

You define your application’s column families in the Cassandra configuration file. For each column family, you can supply values for the size of the row cache, the size of the key cache, and the “read repair chance.” Column families can be one of two types: standard or super.

See also Column, Keyspace, Super Column.

Column Name

The name part of the name/value pair stored within a Row.

Column Value

The value part of the name/value pair stored within a Row. The size of a column value is limited by machine memory.

Commit Log

The commit log is responsible for the durability of all write operations in Cassandra. When you perform a write, it first enters the commit log so the data won’t be lost in the event of failure; then the value is populated in the memtable so it’s available in memory for performance. Once the memtable fills up, the data is flushed to disk as an SSTable.

It is represented by the org.apache.cassandra.db.commitlog.CommitLog class. On every write or delete, an entry in the form of a RowMutation object is serialized and appended to the commit log. These entries are organized into commit log segments. By default, a commit log segment rolls once it reaches a size threshold of 128MB, at which point a new segment is created to accept incoming writes; this setting is configurable.
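The write path just described can be sketched as follows. The names are hypothetical and the real classes are far more involved; the point is only the ordering, durable append first, memtable second:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.ConcurrentSkipListMap;

    /** Illustrative write path: append to the log, then update the memtable. */
    public class WritePathSketch {
        private final DataOutputStream log;
        private final ConcurrentSkipListMap<String, byte[]> memtable =
                new ConcurrentSkipListMap<>();

        public WritePathSketch(String logPath) throws IOException {
            this.log = new DataOutputStream(new FileOutputStream(logPath, true)); // append mode
        }

        public synchronized void write(String key, byte[] value) throws IOException {
            // 1. Append to the commit log so the write survives a crash.
            log.writeUTF(key);
            log.writeInt(value.length);
            log.write(value);
            log.flush(); // a real server syncs periodically or per batch window
            // 2. Populate the memtable so the value is readable from memory.
            memtable.put(key, value);
        }
    }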

Compaction

Compaction is the process of freeing up space by merging large accumulated datafiles. This is roughly analogous to rebuilding a table in the relational world. On compaction, the merged data is sorted, a new index is created over the sorted data, and the freshly merged, sorted, and indexed data is written to a single new file.

The operations that are performed during compaction to free up space include merging keys, combining columns, and deleting tombstones. This process is managed by the class org.apache.cassandra.db.CompactionManager. CompactionManager implements an MBean interface so it can be introspected.

There are different types of compaction in Cassandra.

A major compaction is triggered in one of two ways: via a node probe or automatically. A node probe sends a TreeRequest message to the nodes that neighbor the target. When a node receives a TreeRequest, it immediately performs a read-only compaction in order to validate the column family.

A read-only compaction has the following steps:

  1. Get the key distribution from the column family.

  2. Once the rows have been added to the validator, if the column family needs to be validated, it will create the Merkle tree and broadcast it to the neighboring nodes.

  3. The Merkle trees are brought together in a “rendezvous” as a list of Differencers (trees that need validating or comparison).

  4. The comparison is executed by the StageManager class, which is responsible for handling concurrency issues in executing jobs. In this case, the StageManager uses an anti-entropy stage. This uses the org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor class, which executes the compaction within a single thread and makes the operation available as an MBean for inspection.

Compression

Data compression is on the road map for future versions, but as of 0.6 it is not yet supported.

Consistency

Consistency means that a transaction does not leave the database in an illegal state, and that no integrity constraints are violated. This is considered a crucial aspect of transactions in relational databases and is one of the ACID properties (Atomic, Consistent, Isolated, Durable). In Cassandra, the relative degree of consistency can be calculated by the following:

N = the number of nodes that store replicas of the data

W = the number of replicas that must acknowledge receipt of a write before it can be said to be successful

R = the number of replicas that are contacted when a data object is accessed in a read operation

W + R > N: strong consistency

W + R <= N: eventual consistency

For example, with N = 3, W = 2, and R = 2, every read contacts at least one replica that acknowledged the latest write (because 2 + 2 > 3), so consistency is strong; with W = 1 and R = 1, reads can miss the latest write, so consistency is merely eventual.

Consistency Level

This configurable setting allows you to decide how many replicas in the cluster must acknowledge a write operation or respond to a read operation in order to be considered successful. The consistency level is set according to your stated Replication Factor, not the raw number of nodes in the cluster.

There are multiple levels of consistency that you can tune for performance; the weaker the consistency level, the better the performance. The levels mean different things for writing and reading. This is covered in Chapter 7.

For write operations:

  • ZERO: Write operations will be handled in the background, asynchronously. This is the fastest way to write data, and the one that offers the least confidence that your operations will succeed.

  • ANY: This level was introduced in Cassandra 0.6 and means that you can be assured that your write operation was successful on at least one node, even if the acknowledgment is only for a hint (see Hinted Handoff). This is a relatively weak level of consistency.

  • ONE: Ensures that the write operation was written to at least one node, including its commit log and memtable. If a single node responds, the operation is considered successful.

  • QUORUM: A quorum is a number of nodes that represents consensus on an operation. It is determined by <ReplicationFactor> / 2 + 1. So if you have a replication factor of 10, then 6 replicas would have to acknowledge the operation to gain a quorum.

  • DCQUORUM: A version of quorum that prefers replicas in the same data center in order to balance the high consistency level of quorum with the lower latency of preferring to perform operations on replicas in the same data center.

  • ALL: Every node as specified in your <ReplicationFactor> configuration entry must successfully acknowledge the write operation. If any nodes do not acknowledge the write operation, the write fails. This has the highest level of consistency and the lowest level of performance.

For read operations:

  • ONE: This returns the value on the first node that responds. Performs a read repair in the background.

  • QUORUM: Queries all nodes and returns the record with the most recent timestamp after a quorum of nodes have responded, where a “quorum” is (replication factor / 2) + 1 (see the sketch below).

  • DCQUORUM: Ensures that only nodes in the same data center are queried. Applicable when using Rack-Aware placement strategy.

  • ALL: Queries all nodes and returns the value with the most recent timestamp. This level waits for all nodes to respond, and if one doesn’t, it fails the read operation.

Note that there is no such thing as READ ZERO, as it doesn’t make sense to specify that you want to read some data and don’t need any node to respond.
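To make the quorum arithmetic concrete, here is a small sketch of the calculation behind the QUORUM levels above (integer division, as described):

    /** Quorum arithmetic: (replication factor / 2) + 1, using integer division. */
    public class QuorumMath {
        public static int quorumFor(int replicationFactor) {
            return replicationFactor / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println(quorumFor(3));  // 2: two of three replicas must respond
            System.out.println(quorumFor(10)); // 6: matches the example above
        }
    }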

Data Center Shard Strategy

See Replication Strategy.

Decentralized

Cassandra is considered decentralized because it defines no master server, and instead uses a peer-to-peer approach in order to prevent bottlenecks and single points of failure. Decentralization is important in Cassandra because it is what allows it to scale up and also to scale down; peers can enter or exit the cluster as they like, with minimal disruption.

Denormalization

In relational databases, denormalization, or the creation of redundant data, is sometimes applied in order to improve performance of read-mostly applications, such as in online analytical processing (OLAP). In Cassandra, it is typical to see denormalized data, as this improves performance and reflects the fact that data is structured according to the queries you’ll need, in contrast to standard relational databases, where data is typically structured around the object model, independent of the queries.

Durability

When a database is durable, it means that writes will permanently survive, even in the event of a server crash or sudden power failure.

Cassandra accomplishes durability by appending writes to the end of the commit log, allowing the server to avoid having to seek the location in the data file. Only the commit log needs to be synced with the file system, and this happens either periodically or in a specified batch window.

When working in a single server node, Cassandra does not immediately synchronize a file’s in-core state with the storage device. That can mean that if the server is shut down immediately after a write is performed, the write may not be present on restart. Note that a single server node is not recommended for production.

See also Commit Log.

Dynamo

Dynamo is a distributed key-value store created at Amazon in 2006; along with Google’s Bigtable, it is a direct parent of Cassandra. From Dynamo, Cassandra gets the following: a key-value store, a symmetric peer-to-peer architecture, gossip-based discovery, eventual consistency, and tunability per operation.

You can read the complete paper “Dynamo: Amazon’s Highly Available Key-Value Store” at http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html.

Elastic

Read and write throughput can increase linearly as more machines are added to the cluster.

Eventual Consistency

Consistency is the property that describes the internal integrity of the data following an operation. In practical terms, for a strongly consistent database this means that once a client has performed a write operation, all readers will immediately see the new value. An eventually consistent database will not generally be consistent immediately, but rather eventually, where “eventually” is typically a matter of milliseconds: the time it takes to send the new value to all replicas, which depends on the amount of data, the number of nodes, and the geographical distribution of those nodes. DNS is an example of a popular eventually consistent architecture. Eventual consistency is sometimes called “weak consistency.”

Eventual consistency has become popular in the last few years because it offers the ability to support massive scalability. Although it is possible to achieve high scalability in traditional fully consistent databases, the management overhead can become a burden. Of course, eventual consistency presents certain disadvantages, such as additional complexity in the programming model.

Though the design of eventual consistency in Cassandra is based on how it is used in Amazon’s Dynamo, Cassandra is probably better characterized as “tuneably” consistent, rather than eventually consistent. That is, Cassandra allows you to configure the Consistency Level across the spectrum—including ensuring that Cassandra blocks until all replicas are readable (which is equivalent to full consistency).

Riak, Voldemort, MongoDB, Apache HBase, CouchDB, Dynomite, and Amazon’s SimpleDB and Dynamo are other data stores commonly grouped in this eventually consistent family.

Failure Detection

Failure detection is the process of determining which nodes in a distributed fault-tolerant system have failed. Cassandra’s implementation is based on the idea of Accrual Failure Detection, first advanced by the Japan Advanced Institute of Science and Technology (JAIST) in 2004. Accrual failure detection is based on two primary ideas: that failure detection should be flexible, decoupled from the application being monitored, and that it should output a continuous level of “suspicion” regarding how confident the monitor is that a node has failed, rather than a simple binary up/down assessment. This is desirable because it can take into account fluctuations in the network environment: suspicion offers a more fluid and proactive indication of the weaker or stronger possibility of failure, based on interpretation of sampled heartbeats.

Failure detection is implemented in Cassandra by the org.apache.cassandra.gms.FailureDetector class.
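Here is a much-simplified sketch of the accrual idea, assuming exponentially distributed heartbeat intervals; the real FailureDetector keeps a richer statistical window, but the shape is the same: suspicion grows continuously the longer a node stays silent.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Simplified phi accrual sketch (exponential-interval assumption). */
    public class PhiSketch {
        private final Deque<Long> intervals = new ArrayDeque<>();
        private long lastHeartbeat = System.currentTimeMillis();

        public synchronized void heartbeat() {
            long now = System.currentTimeMillis();
            intervals.addLast(now - lastHeartbeat);
            if (intervals.size() > 1000) intervals.removeFirst(); // bounded sample window
            lastHeartbeat = now;
        }

        /** phi = -log10(P(silence this long)); higher phi means stronger suspicion. */
        public synchronized double phi(long nowMillis) {
            double mean = intervals.stream().mapToLong(Long::longValue)
                                   .average().orElse(1000.0);
            double elapsed = nowMillis - lastHeartbeat;
            return (elapsed / mean) * Math.log10(Math.E);
        }
    }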

You can read the original Phi Accrual Failure Detection paper at http://ddg.jaist.ac.jp/pub/HDY+04.pdf.

Fault Tolerance

Fault tolerance is the system’s ability to continue operating in the event of a failure of one or more of its components. Fault tolerance is also referred to as graceful degradation, meaning that if the system operation degrades following a failure, the degraded performance is relative only to the failed component(s).

Gossip

The gossiper is responsible for ensuring that all of the nodes in a cluster are aware of the important state information in the other nodes. The gossiper runs every second to ensure that even nodes that have failed or are not yet online are able to receive node states. It is designed to perform predictably, even at sharply increased loads. The gossip protocol supports rebalancing of keys across the nodes and supports Failure Detection. Gossip is an important part of the anti-entropy strategy.

The state information that the gossiper shares is structured as key/value pairs. In Cassandra, the gossip protocol continues to gossip state information to other nodes until it is made obsolete by newer data.

When a server node is started, it registers itself with the gossiper. For more information, check out the org.apache.cassandra.service.StorageService class.

Also see the paper on gossip-based anti-entropy by van Renesse et al. at http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf.

Hector

An open source project created by Ran Tavory of Outbrain and hosted at GitHub, Hector is a Cassandra client written in Java. It wraps Thrift and offers JMX, connection pooling, and failover.

Hinted Handoff

This is a mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs and a node that is intended to receive that write goes down, a note (the “hint”) is given (“handed off”) to a different live node to indicate that it should replay the write operation to the unavailable node when it comes back online. This does two things: it reduces the amount of time that it takes for a node to get all the data it missed once it comes back online, and it improves write performance in lower consistency levels. That is, a hinted handoff does not count as a sufficient acknowledgment for a write operation if the consistency level is set to ONE, QUORUM, or ALL. A hint does count as a write for consistency level ANY, however. Another way of putting this is that hinted writes are not readable in and of themselves.

The node that received the hint will know very quickly when the unavailable node comes back online again, because of Gossip. If, for some reason, the hinted handoff doesn’t work, the system can still perform a read repair.

Key

See Row Key.

Keyspace

A keyspace is a container for column families. It is roughly analogous to the database in the relational model, used in Cassandra to separate applications. Where a relational database is a collection of tables, a keyspace is an ordered collection of column families. You define your application’s keyspace in the Cassandra configuration file or by using the definition methods in the API. When you define a keyspace, you can also define its replication factor and its replica placement strategy. Within a given Cassandra cluster, you can have one or more keyspaces, typically one for each application.

See also Column Family.

Lexicographic Ordering

Lexicographic ordering is dictionary-style ordering: sequences are compared element by element, so the order of two sequences is decided by the first position at which they differ. Formally, it is the natural ordering induced on the Cartesian product of two ordered sets.

Memtable

An in-memory representation of data that has been recently written. Once the memtable is full, it is flushed to disk as an SSTable.

Merkle Tree

Perhaps better known as a “hash tree,” a Merkle tree is a binary tree data structure that summarizes in short form the data in a larger dataset. In a hash tree, the leaves are hashes of the data blocks (typically files on a filesystem) to be summarized. Every parent node in the tree is a hash of its direct children, which tightly compacts the summary.

In Cassandra, the Merkle tree is implemented in the org.apache.cassandra.utils.MerkleTree class.

Merkle trees are used in Cassandra so that neighboring nodes can verify that they hold the same data while exchanging only compact hashes rather than the data itself. They are used in cryptography as well to verify the contents of files and transmissions, and were used in the Google Wave product. They are named for their inventor, Ralph Merkle.
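A minimal sketch of the hash-tree idea follows. It is illustrative only, not the org.apache.cassandra.utils.MerkleTree implementation, and it assumes at least one leaf:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    /** Hash-tree sketch: each parent hashes the concatenation of its children. */
    public class MerkleSketch {
        public static byte[] root(byte[][] leaves) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[][] level = new byte[leaves.length][];
            for (int i = 0; i < leaves.length; i++) level[i] = md.digest(leaves[i]);
            while (level.length > 1) {
                byte[][] next = new byte[(level.length + 1) / 2][];
                for (int i = 0; i < next.length; i++) {
                    md.reset();
                    md.update(level[2 * i]);
                    if (2 * i + 1 < level.length) md.update(level[2 * i + 1]); // odd node carries up
                    next[i] = md.digest();
                }
                level = next;
            }
            return level[0];
        }
    }

Two datasets match exactly when their root hashes match, which is what lets nodes compare column family contents by exchanging only small digests rather than the data itself.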

Multiget

Query by column name for a set of keys.

Multiget Slice

Query to get a subset of columns for a set of keys.

Node

An instance of Cassandra. Typically a Cassandra cluster will have many nodes, sometimes collectively called the node ring, or just “the ring.” A node refers to any Cassandra server in a cluster, whereas “replica” refers to a node that specifically has a copy of some data from another node.

Node Tool

This is an executable file with the path bin/nodetool that inspects a cluster to determine whether it is properly configured and performs a variety of maintenance operations. The commands available on nodetool are cleanup, clearsnapshot, compact, cfstats, decommission, drain, flush, info, loadbalance, move, repair, ring, snapshot [snapshotname], removetoken, and tpstats.

For example, you can use nodetool drain to prevent the commit log from accepting any new writes.

NoSQL

“NoSQL” is a general name for the collection of databases that do not use Structured Query Language (SQL) or a relational data model. It is sometimes taken to mean “Not Only SQL,” to indicate that the proponents of various nonrelational databases do not suggest that relational databases are a bad choice, but rather that they are not the only choice for data storage. The term was coined by Cassandra committer Eric Evans of Rackspace, but he has since distanced himself from it in favor of the term “Big Data,” to highlight the fact that this family of nonrelational databases is defined not by what they’re not (implementations of SQL), but rather by what they do (handle huge data loads). This term, in my view, has reached the end of its useful life, because it’s confusing: it has proven too tempting to discuss as a group a set of databases that really have few goals, design decisions, or features in common. Let Cassandra be Cassandra, let CouchDB be CouchDB, and let Riak be Riak.

Order-Preserving Partitioner

This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort order. Configuring your column family to use order-preserving partitioning allows you to perform range slices, meaning that Cassandra knows which nodes have which keys.

This partitioner is somewhat the opposite of the Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly distributing keys.

The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra.dht.OrderPreservingPartitioner class.

There is a special kind of OPP called the collating order-preserving partitioner (COPP). This acts like a regular OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering. For this reason, it is useful for locale-aware applications. The COPP is implemented by the org.apache.cassandra.dht.CollatingOrderPreservingPartitioner class.

See also Token.

Partition

In general terms, a partition refers to a network partition, which is a break in the network that prevents one machine from interacting directly with another. A partition can be caused by failed switches, routers, or network interfaces. Consider a cluster of five machines {A, B, C, D, E} where {A, B} are on one subnet and {C, D, E} are on a second subnet. If the switch to which {C, D, E} are connected fails, then you have a network partition that isolates the two subclusters {A, B} and {C, D, E}.

Cassandra is a fault-tolerant database, and network partitions are one such fault. As such, it is able to continue operating in the face of a network partition and merge data in replication once the partition is healed again.

Partitioner

The partitioner controls how your data is distributed over your nodes. In order to find a set of keys, Cassandra must know what nodes have the range of values you’re looking for. There are three types of partitioner: random partitioner, which is the default; order-preserving partitioner; and collating order-preserving partitioner. You configure this in storage-conf.xml or cassandra.yaml (for 0.7) using the <Partitioner> element: <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>. Note that partitioning applies to the sorting of row keys, not columns.

Once you have chosen a partitioner type, you cannot change it without destroying your data (because an SSTable is immutable).

See also Order-Preserving Partitioner and Random Partitioner.

Quorum

A majority of nodes that respond to an operation. This is a configurable consistency level. In a quorum read, the proxy waits for a majority of nodes to respond with the same value. This makes for a slower read operation, but also helps ensure that you don’t get returned stale data.

Rack-Aware Strategy

See Replication Strategy.

Random Partitioner

This is a kind of Partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of causing inefficient range queries. This is the default partitioner.

See also Partitioner and Order-Preserving Partitioner.

Range Slice

Query to get a subset of columns for a range of keys.

Read Repair

This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the nodes with stale data to get them up to date with the newer data returned from the original read operation. It does this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.

Reconciliation is handled by the classes in the org.apache.cassandra.streaming package.

Replication

In general distributed systems terms, replication refers to storing multiple copies of data on multiple machines so that if one machine fails or becomes unavailable due to a Partition, the cluster can still make data available. Caching is a simple form of replication. In Cassandra, replication is a means of providing high performance and availability/fault-tolerance.

Replication Factor

Cassandra offers a configurable replication factor, which allows you essentially to decide how much you want to pay in performance to gain more consistency. That is, your consistency level for reading and writing data is based on the replication factor, as it refers to the number of nodes across which you have replicated data. The replication factor is set in the configuration file or the API.

See also Consistency Level.

Replication Strategy

The replication strategy, sometimes referred to as the placement strategy, determines how replicas will be distributed. The first replica is always placed in the node claiming the key range of its Token. All remaining replicas are distributed according to a configurable replication strategy.

The Gang of Four Strategy pattern is employed to allow a pluggable means of replication, but Cassandra comes with three out of the box. Choosing the right replication strategy is important because in determining which nodes are responsible for which key ranges, you’re also determining which nodes should receive write operations; this has a big impact on efficiency in different scenarios. The variety of pluggable strategies allows you greater flexibility, so that you can tune Cassandra according to your network topology and needs.

Replication strategies are an extension of the org.apache.cassandra.locator.AbstractReplicationStrategy class. You can write your own replication strategy if you like by extending that class.

The replication placement strategy is configured per keyspace using the <ReplicaPlacementStrategy> element. They are discussed in depth in Chapter 6.

Row

In a column family, a row is a sorted map that matches column names to column values. In a super column, a row is a sorted map that matches super column names to maps matching column names to column values. The Row Key identifies the individual row, and the row holds the name/value pairs of the columns. Because a row is never split across machines, the size of a single row cannot exceed the disk space of a single node.

Rows are sorted by their Partitioner, which is one of these types: RandomPartitioner, OrderPreservingPartitioner, or CollatingOrderPreservingPartitioner.

Rows are defined by the class org.apache.cassandra.db.Row.

See also Row Key.

Row Key

Sometimes called simply “key,” a row key is analogous to a primary key for an object in the relational model. It represents a way to identify a single row of columns and is an arbitrary length string.

In the Thrift interface, the Java client always assumes that row keys are encoded as UTF-8, but this is not the case for clients in other languages, where you may need to manually encode ASCII strings as UTF-8.

SEDA (Staged Event-Driven Architecture)

Cassandra employs a Staged Event-Driven Architecture to gain massive throughput under highly concurrent conditions. SEDA attempts to overcome the overhead associated with threads, which is due to scheduling, lock contention, and cache misses. The effect of SEDA is that work is not started and completed by the same thread; this can make for a more complex code base, but also yields better performance. Therefore, much of the key work in Cassandra—such as reading, mutation, gossiping, memtable flushing, and compaction—is performed in stages (the “S” in SEDA). A stage is essentially a separated event queue.

As events enter the incoming queue, the event handler supplied by the application is invoked. The controller is capable of dynamically tuning the number of threads allocated to each stage as demand dictates.

The advantages of SEDA are higher concurrency and better management of CPU, disk, and network resources.

You can read more about SEDA as it was originally proposed by Matt Welsh, David Culler, and Eric Brewer at http://www.eecs.harvard.edu/~mdw/proj/seda.

See also Stage.

Seed Node

A seed is a node that already exists in a Cassandra cluster and is used by newly added nodes to get up and running. The newly added node can start gossiping with the seed node to get state information and learn the topology of the node ring. There may be many seeds in a cluster.

Slice

This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use get_range_slice() to return a subset of columns for a range of keys.

Snitch

A snitch is Cassandra’s way of mapping a node to a physical location in the network. It helps determine the location of a node relative to another node in order to assist with discovery and ensure efficient request routing. There are different kinds of snitches. The EndpointSnitch (or RackInferringSnitch), for instance, determines whether two nodes are in the same data center or the same rack. Its strategy for doing so is essentially to guess at the relative distance of two nodes in a data center and rack based on reading the second and third octets of their IP addresses.
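That octet-inspection idea can be sketched as follows (a hypothetical class assuming IPv4 addresses; the real snitches live in org.apache.cassandra.locator):

    import java.net.InetAddress;

    /** Sketch of inferring topology from IPv4 octets, as described above. */
    public class OctetSnitchSketch {
        /** Guess: nodes share a data center when their second octets match. */
        public static boolean sameDataCenter(InetAddress a, InetAddress b) {
            return a.getAddress()[1] == b.getAddress()[1];
        }

        /** Guess: nodes share a rack when the second and third octets both match. */
        public static boolean sameRack(InetAddress a, InetAddress b) {
            return sameDataCenter(a, b) && a.getAddress()[2] == b.getAddress()[2];
        }
    }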

The DataCenterEndpointSnitch allows you to specify IP subnets for racks, grouped by which data center the racks are in.

The PropertyFileSnitch allows you to map IP addresses to rack and data centers in a properties file called cassandra-rack.properties.

The snitch strategy classes can be found in the org.apache.cassandra.locator package.

Sparse

In the relational model, every row in a table must have a value for every column, even if that value is sometimes null. Cassandra, on the other hand, represents a sparse or “schema-free” data model, which means that rows may have values for as many or as few of the defined columns as you like. This allows for a degree of efficiency. For example, consider a 1000 × 1000 cell spreadsheet, similar to a relational table: if many of the cells have empty values, the storage model is inefficient.

SSTable

SSTable stands for Sorted String Table. Inherited from Google’s Bigtable, an SSTable is how data is stored on disk in Cassandra: an append-only file. In-memory tables (memtables) are used in front of SSTables for buffering and sorting data. SSTables allow for high performance on writes and can be compacted.

SSTables are immutable. Once a memtable is flushed to disk as an SSTable, the file is never modified by the application; Compaction changes the on-disk representation only by merging SSTables into new files.

To import or export data as JavaScript Object Notation (JSON), check out the org.apache.cassandra.tools.SSTableImport and SSTableExport classes.

Stage

Part of Cassandra’s Staged Event-Driven Architecture (SEDA), a stage wraps a basic unit of work. A single operation can flow between various stages to complete, rather than getting completed in the same thread that started the work.

A stage consists of an incoming event queue, an event handler, and an associated thread pool. Stages are managed by a controller that determines scheduling and thread allocation; Cassandra implements this kind of concurrency model using the thread pool java.util.concurrent.ExecutorService. To see specifically how this works, check out the org.apache.cassandra.concurrent.StageManager class.

A few additional operations are implemented as stages too, including working with memtables in the ColumnFamilyStore class; the consistency manager is likewise a stage in the StorageService.

An operation may start with one thread, which then hands off the work to another thread, which may hand it off to other threads. This handing-off is not directly between threads, however; it occurs between stages.
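A toy two-stage pipeline illustrates the hand-off; the stage names and pool sizes here are invented for the example:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** Two stages, each with its own queue (inside the executor) and thread pool. */
    public class StageSketch {
        private final ExecutorService readStage  = Executors.newFixedThreadPool(4);
        private final ExecutorService replyStage = Executors.newFixedThreadPool(2);

        public void handleRead(String key) {
            // Stage 1: a thread from one pool performs the read...
            readStage.submit(() -> {
                byte[] value = readFromDisk(key);
                // ...then hands the work to the next stage instead of finishing it.
                replyStage.submit(() -> sendReply(key, value));
            });
        }

        private byte[] readFromDisk(String key) { return new byte[0]; }
        private void sendReply(String key, byte[] value) { /* ... */ }
    }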

See also SEDA (Staged Event-Driven Architecture).

Strong Consistency

For reads, strong consistency means that if it is detected that a repair needs to be made, first perform the read repair, then return the result.

Super Column

A super column is a column whose value is not a string, but instead a named list of other columns, which in this context are called subcolumns. The subcolumns are ordered, and the number of columns you can define is unbounded. Super columns also differ from regular columns in that they do not have an associated timestamp.

Super columns are not recursive; that is, they go only one level deep. A super column can hold a map only of other columns, and not a map of more super columns.

They are defined in SuperColumn.java, which implements both the IColumn and IColumnContainer interfaces. The interface allows you to perform a variety of operations, including the following: get all of the subcolumns in a super column, get a single subcolumn by name, add a subcolumn, remove a subcolumn, check the number of subcolumns in the super column, and check when a subcolumn was last modified.

Super columns were one of the updates added by Facebook to the original data model of Google’s Bigtable.

See also Column Family.

Thrift

Thrift is the name of the RPC client used to communicate with the Cassandra server. It statically generates an interface for serialization in a variety of languages, including C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml. It is this mechanism that allows you to interact with Cassandra from any of these client languages.

It was created in April 2007 at Facebook and donated to Apache as an incubator project in May 2008. At the time of this writing, the Thrift interface is most likely being replaced by the newer and more active Apache project Avro. Another advantage of Avro is that it does not require static code generation.

You can read more about Thrift on its project page at http://incubator.apache.org/thrift.

Timestamp

In Cassandra, timestamps for column values are supplied by the client, so it is important to synchronize client clocks. The timestamp is by convention the number of microseconds since the Unix epoch (midnight, January 1, 1970).
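In Java clients, a conventional way to produce such a timestamp is to scale the millisecond clock, as in this small sketch:

    /** Conventional client-supplied timestamp: microseconds since the Unix epoch. */
    public class Timestamps {
        public static long now() {
            return System.currentTimeMillis() * 1000;
        }
    }

Note that this is only millisecond resolution scaled up; what matters is that all clients of a cluster follow the same convention.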

Token

Each node in the node ring has a single token that is used to claim a range of keys, based on the value of the token in the previous node in the ring. You can specify your own token or let Cassandra generate one for you. The representation of the token is dependent on the kind of partitioner used.

With a Random Partitioner, the token is an integer in the range 0 to 2^127, generated by applying an MD5 hash to keys. This is represented by the org.apache.cassandra.dht.BigIntegerToken class.
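Conceptually, that derivation looks like the following sketch (not the exact Cassandra routine):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    /** Sketch: map a key to a non-negative 128-bit token via MD5. */
    public class TokenSketch {
        public static BigInteger token(byte[] key) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            // Interpret the 128-bit digest as an integer and fold it into [0, 2^127].
            return new BigInteger(digest).abs();
        }
    }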

With an Order-Preserving Partitioner, the token is a UTF-8 string, based on the key. This is represented by the org.apache.cassandra.dht.StringToken class.

Tokens are represented in Cassandra by the org.apache.cassandra.dht.Token class.

Tombstone

Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a “tombstone,” an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be propagated to other replicas.

Tombstones are discarded on major Compaction.

Vector Clock

Vector clocks allow for partial, causal ordering of events in a distributed system. A vector clock maintains an array of logical clocks, one for each process, and each process keeps its own local copy of the entire vector.

In order to keep the entire set of processes in a consistent logical state, a process sends its clock along with each message to another process, which then updates its own. To ensure consistency, some version of the following steps is typically followed (sketched in code after the list).

All clocks start at 0. Then:

  1. Each time a process experiences a local event, it increments its own clock by 1.

  2. Each time a process prepares to send a message, this too counts as an event: it increments its own clock by 1 and then sends its entire vector along with the message.

  3. Each time a process receives a message, this too counts as an event: it increments its own clock by 1, and then updates each entry of its vector to the maximum of its own value and the corresponding value in the incoming vector.
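Here is a minimal Java sketch of those steps. It is illustrative only, since (as noted below) vector clocks are not yet part of Cassandra:

    import java.util.Arrays;

    /** Minimal vector clock following the steps above. */
    public class VectorClockSketch {
        private final long[] clocks; // one logical clock per process
        private final int self;      // index of the owning process

        public VectorClockSketch(int processCount, int self) {
            this.clocks = new long[processCount];
            this.self = self;
        }

        /** Step 1: a local event increments our own entry. */
        public void tick() {
            clocks[self]++;
        }

        /** Step 2: sending counts as an event; ship a copy of the whole vector. */
        public long[] send() {
            tick();
            return Arrays.copyOf(clocks, clocks.length);
        }

        /** Step 3: receiving counts as an event; then take the elementwise maximum. */
        public void receive(long[] incoming) {
            tick();
            for (int i = 0; i < clocks.length; i++) {
                clocks[i] = Math.max(clocks[i], incoming[i]);
            }
        }
    }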

A vector clock event synchronization strategy will likely be introduced in a future version of Cassandra.

Weak Consistency

For reads, weak consistency improves performance by first returning results, and afterward performing any necessary Read Repair.
