For those among us who like instant gratification, we’ll start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as we walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and understand Cassandra in its larger context.
Cassandra is available for download from the Web at http://cassandra.apache.org. Just click the link on the home page to download the latest release version as a gzipped tarball. The prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download is around 10MB.
The simplest way to get started is to download the prebuilt binary. You can unpack the compressed file using any regular ZIP utility. On Linux, GZip extraction utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is freeware. You can download the freeware program 7-Zip from http://www.7-zip.org.
Open your extracting program. You might have to extract the ZIP file and the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
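On Linux, the separate unzip and untar steps can usually be collapsed into a single tar invocation. The following is a minimal sketch, assuming a tar that supports the -z (gzip) and -C (target directory) options, as GNU and BSD tar do; unpack_cassandra and its two arguments are hypothetical names chosen for illustration and are not part of the Cassandra distribution.

```shell
#!/bin/sh
# Sketch: unpack a gzipped Cassandra tarball into a destination directory.
# unpack_cassandra is a hypothetical helper, not something Cassandra ships.
unpack_cassandra() {
    tarball=$1
    dest=$2
    # Create the destination if needed, then gunzip and untar in one step.
    mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}
```

For example, running unpack_cassandra apache-cassandra-x.x.x-bin.tar.gz "$HOME" should leave a folder called apache-cassandra-x.x.x in your home directory.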
Once you decompress the tarball, you’ll see that the Cassandra binary distribution includes several directories. Let’s take a moment to look around and see what we have.
The bin directory contains the executables to run Cassandra and the command-line interface (CLI) client. It also has scripts to run nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and to perform a variety of maintenance operations. We look at nodetool in depth later. It also has scripts for converting SSTables (the datafiles) to JSON and back.
The conf directory, which is present in the source version at this location under the package root, contains the files for configuring your Cassandra instance. There are three basic functions: the storage-conf.xml file allows you to create your data store by configuring your keyspace and column families; there are files related to setting up authentication; and finally, the log4j properties let you change the logging levels to suit your needs. We see how to use all of these when we discuss configuration in Chapter 6.
For versions 0.6 and earlier, the interface directory contains a single file, called cassandra.thrift. This file represents the Remote Procedure Call (RPC) client API that Cassandra makes available. The interface is defined using the Thrift syntax and provides an easy means to generate clients. For a quick way to see all of the operations that Cassandra supports, open this file in a regular text editor. You can see that Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and C# through this interface.
The javadoc directory contains a documentation website generated using Java’s JavaDoc tool. Note that JavaDoc reflects only the comments that are stored directly in the Java code, and as such does not represent comprehensive documentation. It’s helpful if you want to see how the code is laid out. Moreover, Cassandra is a wonderful project, but the code contains precious few comments, so you might find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the source files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc, open the javadoc/index.html file in a browser.
The lib directory contains all of the external libraries that Cassandra needs to run. For example, it uses two different JSON serialization libraries, the Google collections project, and several Apache Commons libraries. This directory also includes the Thrift and Avro RPC libraries for interacting with Cassandra.
Cassandra uses Apache Ant for its build scripting language and the Ivy plug-in for dependency management.
You can download Ant from http://ant.apache.org. You don’t need to download Ivy separately just to build Cassandra.
Ivy requires Ant, and building from source requires the complete JDK, version 1.6.0_20 or better, not just the JRE. If you see a message about how Ant is missing tools.jar, either you don’t have the full JDK or you’re pointing to the wrong path in your environment variables.
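One quick way to tell a full JDK from a JRE is to look for tools.jar yourself. This sketch assumes the JDK 6 layout, in which tools.jar sits in the lib directory under the JDK home; has_tools_jar is a hypothetical helper name, and it falls back to JAVA_HOME when no directory is given.

```shell
#!/bin/sh
# Sketch: succeed if the given JDK home (default: $JAVA_HOME) contains lib/tools.jar.
# has_tools_jar is a hypothetical helper; the lib/tools.jar location assumes JDK 6.
has_tools_jar() {
    jdk_home=${1:-${JAVA_HOME:-}}
    # Fail when no directory is known, or when tools.jar is absent.
    [ -n "$jdk_home" ] && [ -f "$jdk_home/lib/tools.jar" ]
}
```

You might call it as has_tools_jar || echo "JAVA_HOME does not look like a full JDK" before running Ant.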
If you want to download the most cutting-edge builds, you can get the source from Hudson, which the Cassandra project uses as its Continuous Integration tool. See http://hudson.zones.apache.org/hudson/job/Cassandra/ for the latest builds and test coverage information.
If you are a Git fan, you can get a read-only trunk version of the Cassandra source using this command:
>git clone git://git.apache.org/cassandra.git
Git is a source code management system created by Linus Torvalds to manage development of the Linux kernel. It’s increasingly popular and is used by projects such as Android, Fedora, Ruby on Rails, Perl, and many Cassandra clients (as we’ll see in Chapter 8). If you’re on a Linux distribution such as Ubuntu, it couldn’t be easier to get Git. At a console, just type >apt-get install git and it will be installed and ready for commands. For more information, visit http://git-scm.com/.
Because Ivy takes care of all the dependencies, it’s easy to build Cassandra once you have the source. Just make sure you’re in the root directory of your source download and execute the ant program, which will look for a file called build.xml in the current directory and execute the default build target. Ant and Ivy take care of the rest. To execute the Ant program and start compiling the source, just type:
>ant
That’s it. Ivy will retrieve all of the necessary dependencies, and Ant will build the nearly 350 source files and execute the tests. If all went well, you should see a BUILD SUCCESSFUL message. If all did not go well, make sure that your path settings are all correct, that you have the most recent versions of the required programs, and that you downloaded a stable Cassandra build. You can check the Hudson report to make sure that the source you downloaded actually can compile.
If you want to see detailed information on what is happening during the build, you can pass Ant the -v option to cause it to output verbose details regarding each operation it performs.
To compile the server, you can simply execute ant as shown previously. But there are a couple of other targets in the build file that you might be interested in:
Users will probably find this the most helpful, as it executes the battery of unit tests. You can also check out the unit test sources themselves for some useful examples of how to interact with Cassandra.
This target generates the Apache Thrift client interface for interacting with the database in Java.
This target generates the Thrift client interface for Python users.
To create a Java Archive (JAR) file for distribution, execute the command >ant jar. This will perform a complete build and output a file into the build directory called apache-cassandra-x.x.x.jar.
The original authors of Cassandra apparently didn’t care much for Maven, so the early releases did not include any Maven POM file. But because so many Java developers have begun to favor Maven over Ant, and the tooling support in IDEs for Maven has become so strong, there’s a pom.xml contribution to the project so you can build from Maven if you prefer.
To build the source from Maven, navigate to <cassandra-home>/contrib/maven and execute this command:
$ mvn clean install
If you have any difficulties building with Maven, you may have to get some of the required JARs manually. As of version 0.6.3, the Maven POM doesn’t work out of the box because some dependencies, such as the libthrift.jar file, are unavailable in a repository.
Few developers are using Maven with Cassandra, so Maven lacks strong support. Which is to say, use caution, because the Maven POM is often broken.
In earlier versions of Cassandra, before you could start the server there was a bit of fiddling to be done with Ivy and setting environment variables. But the developers have done a terrific job of making it very easy to start using Cassandra immediately.
Cassandra requires Java Standard Edition JDK 6. Preferably, use 1.6.0_20 or greater. It has been tested on both the Open JDK and Sun’s JDK. You can check your installed Java version by opening a command prompt and executing >java -version. If you need a JDK, you can get one at http://java.sun.com/javase/downloads.
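If you would rather script the version check than read the output by eye, something like the following may help. It is a sketch under two assumptions: that java -version prints a line containing version "x.y.z_nn" (it writes this to standard error), and that versions can be compared field by field as numbers. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check that a Java version string is at least a required version.

# Extract the quoted version from `java -version` style output on stdin,
# for example: java version "1.6.0_20"  ->  1.6.0_20
parse_java_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if version $1 >= version $2, comparing the numeric fields
# separated by '.' or '_' from left to right. An empty $1 always fails.
version_at_least() {
    [ -n "$1" ] || return 1
    printf '%s %s\n' "$1" "$2" | awk '{
        n = split($1, have, /[._]/)
        m = split($2, want, /[._]/)
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            h = (i <= n) ? have[i] + 0 : 0
            w = (i <= m) ? want[i] + 0 : 0
            if (h > w) exit 0
            if (h < w) exit 1
        }
        exit 0
    }'
}
```

A typical call would be version_at_least "$(java -version 2>&1 | parse_java_version)" 1.6.0_20 && echo "Java is recent enough".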
Once you have the binary or the source downloaded and compiled, you’re ready to start the database server.
You also might need to set your JAVA_HOME environment variable. To do this on Windows 7, click the Start button and then right-click on Computer and choose Properties. Click Advanced system settings, and then click the Environment Variables button. Click New to create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable Value field, type the path to your JDK installation. This is probably something like C:\Program Files\Java\jdk1.6.0_20.
Remember that if you create a new environment variable, you’ll need to reopen any currently open terminals in order for the system to become aware of the new variable. To make sure your environment variable is set correctly and that Cassandra can subsequently find Java on Windows, execute this command in a new terminal: >echo %JAVA_HOME%. This prints the value of your environment variable.
Once you’ve started the server for the first time, Cassandra will add two directories to your system. The first is C:\var\lib\cassandra, which is where it will store its data, including the commit log in a directory called commitlog. The other is C:\var\log\cassandra; logs will be written to a file called system.log. If you encounter any difficulties, consult the files in these directories to see what might have happened. If you’ve been trying different versions of the database and aren’t worried about losing data, you can delete these directories and restart the server as a last resort.
The process on Linux is similar to that on Windows. Make sure that your JAVA_HOME variable points to a JDK of version 1.6.0_20 or better. Then, you need to extract the Cassandra gzipped tarball using gunzip and tar. Finally, create a couple of directories for Cassandra to store its data and logs, and give them the proper permissions, as shown here:
ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra
Instead of ehewitt, of course, substitute your own username.
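Before starting the server, it can be worth confirming that the directories exist and that your user can write to them. Here is a minimal sketch; check_writable_dirs is a hypothetical helper, and the directories to check are passed as arguments rather than hardcoded.

```shell
#!/bin/sh
# Sketch: report whether each given directory exists and is writable
# by the current user. Returns non-zero if any directory fails the check.
# check_writable_dirs is a hypothetical helper name.
check_writable_dirs() {
    status=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing or not writable: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

With the layout used here, you would run check_writable_dirs /var/lib/cassandra /var/log/cassandra.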
To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the <cassandra-directory> where you unpacked Cassandra, and run the following command to start your server. In a clean installation, you should see some log statements like this:
eben@morpheus$ bin/cassandra -f
INFO 13:23:22,367 DiskAccessMode 'auto' determined to be standard, indexAccessMode
is standard
INFO 13:23:22,475 Couldn't detect any schema definitions in local storage.
INFO 13:23:22,476 Found table data in data directories.
Consider using JMX to call org.apache.cassandra.service.StorageService
.loadSchemaFromYaml().
INFO 13:23:22,497 Cassandra version: 0.7.0-beta1
INFO 13:23:22,497 Thrift API version: 10.0.0
INFO 13:23:22,498 Saved Token not found. Using qFABQw5XJMvs47lg
INFO 13:23:22,498 Saved ClusterName not found. Using Test Cluster
INFO 13:23:22,502 Creating new commitlog segment /var/lib/cassandra/commitlog/
CommitLog-1282508602502.log
INFO 13:23:22,507 switching in a fresh Memtable for LocationInfo at CommitLogContext(
file='/var/lib/cassandra/commitlog/CommitLog-1282508602502.log', position=276)
INFO 13:23:22,510 Enqueuing flush of Memtable-LocationInfo@29857804(178 bytes,
4 operations)
INFO 13:23:22,511 Writing Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,691 Completed flushing /var/lib/cassandra/data/system/
LocationInfo-e-1-Data.db
INFO 13:23:22,701 Starting up server gossip
INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160
INFO 13:23:22,752 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 13:23:22,753 Listening for thrift clients...
INFO 13:23:22,792 mx4j successfuly loaded
HttpAdaptor version 3.0.2 started on port 8081
Using the -f switch tells Cassandra to stay in the foreground instead of running as a background process, so that all of the server logs will print to standard out and you can see them in your terminal window, which is useful for testing.
Congratulations! Now your Cassandra server should be up and running with a new single-node cluster called Test Cluster listening on port 9160.
The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).
If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the folders in /var/lib/cassandra and /var/log/cassandra.
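If you find yourself resetting an installation often, the cleanup can be scripted. The sketch below empties the directories you name while leaving the directories themselves (and their ownership) in place. It assumes a find that supports -mindepth and -maxdepth, as GNU and BSD find do, and clear_cassandra_dirs is a hypothetical helper name. It deletes data permanently, so use it only on a test installation.

```shell
#!/bin/sh
# Sketch: remove everything inside the given directories, keeping the
# directories themselves. Refuses an empty argument or the root directory.
# clear_cassandra_dirs is a hypothetical helper; this destroys data.
clear_cassandra_dirs() {
    for dir in "$@"; do
        case $dir in
            ''|/) echo "refusing to clear '$dir'" >&2; return 1 ;;
        esac
        # Skip directories that do not exist yet.
        [ -d "$dir" ] || continue
        # Delete each top-level entry (files and subdirectories) under $dir.
        find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
    done
}
```

For the default locations, the call would be clear_cassandra_dirs /var/lib/cassandra /var/log/cassandra, run while the server is stopped.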
Now that you have a Cassandra installation up and running, let’s give it a quick try to make sure everything is set up properly. On Linux, running the command-line interface just works. On Windows, you might have to do a little additional work.
On Windows, navigate to the Cassandra home directory and open a new terminal in which to run our client process:
>bin\cassandra-cli
It’s possible that on Windows you will see an error like this when starting the client:
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/cli/CliMain
This probably means that you started Cassandra directly from within the bin directory, and it therefore sets up its Java classpath incorrectly and can’t find the CliMain file to start the client. You can define an environment variable called CASSANDRA_HOME that points to the top-level directory where you have placed or built Cassandra, so you don’t have to pay as much attention to where you’re starting Cassandra from.
For a little reminder on setting environment variables on Windows, see the section On Windows.
To run the command-line interface program on Linux, navigate to the Cassandra home directory and run the cassandra-cli program in the bin directory:
>bin/cassandra-cli
The Cassandra client will start:
eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
You now have an interactive shell at which you can issue commands.
Note, however, that if you’re used to Oracle’s SQL*Plus or similar command-line database clients, you may become frustrated. The Cassandra CLI is not intended to be used as a full-blown client, as it’s really for development. That makes it a good way to get started using Cassandra, because you don’t have to write lots of code to test interactions with your database and get used to the environment.
Before we get too deep into how Cassandra works, let’s get an overview of the client API so that you can see what kinds of commands you can send to the server. We’ll see how to use the basic environment commands and how to do a round trip of inserting and retrieving some data.
To get help for the command-line interface, type help or ? to see the list of available commands. The following list shows only the commands related to metadata and configuration; there are other commands for getting and setting values that we explore later.
[default@Keyspace1] help
List of all CLI commands:
?                                                Display this message.
help                                             Display this help.
help <command>                                   Display detailed, command-specific help.
connect <hostname>/<port>                        Connect to thrift service.
use <keyspace> [<username> 'password']           Switch to a keyspace.
describe keyspace <keyspacename>                 Describe keyspace.
exit                                             Exit CLI.
quit                                             Exit CLI.
show cluster name                                Display cluster name.
show keyspaces                                   Show list of keyspaces.
show api version                                 Show server API version.
create keyspace <keyspace> [with <att1>=<value1> [and <att2>=<value2> ...]]
                                                 Add a new keyspace with the specified attribute and value(s).
create column family <cf> [with <att1>=<value1> [and <att2>=<value2> ...]]
                                                 Create a new column family with the specified attribute and value(s).
drop keyspace <keyspace>                         Delete a keyspace.
drop column family <cf>                          Delete a column family.
rename keyspace <keyspace> <keyspace_new_name>   Rename a keyspace.
rename column family <cf> <new_name>             Rename a column family.
Starting the client this way does not automatically connect to a Cassandra server instance. So to connect to a particular server after you have started Cassandra this way, use the connect command:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] connect localhost/9160
Connected to: "Test Cluster" on localhost/9160
[default@unknown]
As a shortcut, you can start the client and connect to a particular server instance by passing the host and port parameters at startup, like this:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
If you see this error while trying to connect to a server:
Exception connecting to localhost/9160 - java.net.ConnectException: Connection refused: connect
make sure that a Cassandra instance is started at that host and port and that you can ping the host you’re trying to reach. There may be firewall rules preventing you from connecting. Also make sure that you’re using the new 0.7 syntax as described earlier, as it has changed from previous versions.
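To separate "the server isn’t listening" from a CLI problem, you can probe the port directly. This is only a sketch: it assumes a netcat (nc) that supports -z (connect without sending data) and -w (timeout in seconds), which common implementations do but not necessarily all, and port_open is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if something accepts TCP connections on host $1, port $2.
# Assumes nc supports -z and -w; returns 2 if nc is not installed.
# port_open is a hypothetical helper name.
port_open() {
    command -v nc >/dev/null 2>&1 || { echo "nc not found" >&2; return 2; }
    # -z: just test the connection; -w 2: give up after two seconds.
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}
```

For example, port_open localhost 9160 && echo "Thrift port is reachable" should print the message only while a server is listening there.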
The CLI indicates that you’re connected to a Cassandra server cluster called “Test Cluster”. That’s because this cluster of one node at localhost is set up for you by default.
In a production environment, be sure to remove the Test Cluster from the configuration.
After connecting to your Cassandra instance Test Cluster, if you’re using the binary distribution, an empty keyspace, or Cassandra database, is set up for you to test with.
To see the name of the current cluster you’re working in, type:
[default@unknown] show cluster name
Test Cluster
To see which keyspaces are available in the cluster, issue this command:
[default@unknown] show keyspaces
system
If you have created any of your own keyspaces, they will be shown as well. The system keyspace is used internally by Cassandra, and isn’t for us to put data into. In this way, it’s similar to the master and tempdb databases in Microsoft SQL Server. This keyspace contains the schema definitions and is aware of any modifications to the schema made at runtime. It can propagate any changes made in one node to the rest of the cluster based on timestamps.
To see the version of the API you’re using, type:
[default@Keyspace1] show api version
10.0.0
There are a variety of other commands with which you can experiment. For now, let’s add some data to the database and get it back out again.
A Cassandra keyspace is sort of like a relational database. It defines one or more column families, which are very roughly analogous to tables in the relational world. When you start the CLI client without specifying a keyspace, the output will look like this:
>bin/cassandra-cli --host localhost --port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
Your shell prompt is for default@unknown because you haven’t authenticated as a particular user (which we’ll see how to do in Chapter 6) and you didn’t specify a keyspace.
This authentication scheme is familiar if you’ve used MySQL before. Authentication and authorization are very much works in progress at the time of this writing. The recommended deployment is to put a firewall around your cluster.
Let’s create our own keyspace so we have something to write data to:
[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc
Don’t worry about the replication_factor for now. That’s a setting we’ll look at in detail later. After you have created your own keyspace, you can switch to it in the shell by typing:
[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace]
We’re “authenticated” to the keyspace because MyKeyspace doesn’t require credentials.
Now we can create a column family in our keyspace. To do this on the CLI, use the following command:
[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
[default@MyKeyspace]
This creates a new column family called “User” in our current keyspace, and takes the defaults for column family settings. We can use the CLI to get a description of a keyspace using the describe keyspace command, and make sure it has our column family definition, as shown here:
[default@MyKeyspace] describe keyspace MyKeyspace
Keyspace: MyKeyspace
Column Family Name: User
Column Family Type: Standard
Column Sorted By: org.apache.cassandra.db.marshal.BytesType
flush period: null minutes
------
[default@MyKeyspace]
We’ll worry about the Type, Sorted By, and flush period settings later. For now, we have enough to get started.
Now that we have a keyspace and a column family, we’ll write some data to the database and read it back out again. It’s OK at this point not to know quite what’s going on. We’ll come to understand Cassandra’s data model in depth later. For now, you have a keyspace (database), which has a column family. For our purposes here, it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.
To write a value, use the set command:
[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='[email protected]'
Value inserted.
[default@MyKeyspace]
Here we have created two columns for the key ehewitt, to store a set of related values. The column names are fname and email. We can use the count command to make sure that we have written two columns for our single key:
[default@MyKeyspace] count User['ehewitt']
2 columns
Now that we know the data is there, let’s read it, using the get command:
[default@MyKeyspace] get User['ehewitt']
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, [email protected], timestamp=1282510313429000)
Returned 2 results.
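The column values in this output look like strings of hex digits rather than the names we typed. They appear to be the hex encoding of the column names’ bytes: 666e616d65 decodes to fname and 656d61696c to email. If you want to check a name yourself, a small decoder is enough. This is a sketch that assumes plain ASCII names; hex_to_ascii is a hypothetical helper, not a CLI command.

```shell
#!/bin/sh
# Sketch: decode a hex string such as 666e616d65 into its ASCII text.
# hex_to_ascii is a hypothetical helper, not a Cassandra CLI command.
hex_to_ascii() {
    hex=$1
    # Two hex digits encode one byte, so the length must be even.
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}            # everything after the first two digits
        byte=${hex%"$rest"}       # the first two digits
        hex=$rest
        # Convert the byte to octal, then let printf emit that character.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}
```

Calling hex_to_ascii 666e616d65 prints fname.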
You can delete a column using the del command. Here we will delete the email column for the ehewitt row key:
[default@MyKeyspace] del User['ehewitt']['email']
column removed.
Now we’ll clean up after ourselves by deleting the entire row. It’s the same command, but we don’t specify a column name:
[default@MyKeyspace] del User['ehewitt']
row removed.
To make sure that it’s removed, we can query again:
[default@Keyspace1] get User['ehewitt']
Returned 0 results.