Chapter 2. Installing Cassandra

For those among us who like instant gratification, we’ll start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as we walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and understand Cassandra in its larger context.

Installing the Binary

Cassandra is available for download from the Web at http://cassandra.apache.org. Just click the link on the home page to download the latest release version as a gzipped tarball. The prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download is around 10MB.

Extracting the Download

The simplest way to get started is to download the prebuilt binary. You can unpack the compressed file using any regular ZIP utility. On Linux, GZip extraction utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is freeware. You can download the freeware program 7-Zip from http://www.7-zip.org.

Open your extracting program. You might have to decompress the gzip file and extract the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
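If you would rather script the extraction than click through a GUI tool, the following is a minimal sketch using Python’s standard tarfile module, which handles the gzip and tar layers in one step. The function name extract_tarball and its arguments are our own choices for illustration; substitute the real archive name you downloaded.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir="."):
    """Decompress and unpack a .tar.gz archive; return the top-level names it contains."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" reads through the gzip layer and the tar layer together.
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted({member.name.split("/")[0] for member in archive.getmembers()})
        archive.extractall(dest)
    return top_level
```

Calling extract_tarball with the path to your downloaded apache-cassandra-x.x.x-bin.tar.gz should leave an apache-cassandra-x.x.x folder in the destination directory.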

What’s In There?

Once you decompress the tarball, you’ll see that the Cassandra binary distribution includes several directories. Let’s take a moment to look around and see what we have.

bin

This directory contains the executables to run Cassandra and the command-line interface (CLI) client. It also has scripts to run the nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and to perform a variety of maintenance operations. We look at nodetool in depth later. It also has scripts for converting SSTables (the datafiles) to JSON and back.

conf

This directory, which sits at the same location under the package root in the source version, contains the files for configuring your Cassandra instance. These files serve three basic functions: the storage-conf.xml file allows you to create your data store by configuring your keyspace and column families; there are files related to setting up authentication; and finally, the log4j properties let you change the logging levels to suit your needs. We see how to use all of these when we discuss configuration in Chapter 6.

interface

For versions 0.6 and earlier, this directory contains a single file, called cassandra.thrift. This file represents the Remote Procedure Call (RPC) client API that Cassandra makes available. The interface is defined using the Thrift syntax and provides an easy means to generate clients. For a quick way to see all of the operations that Cassandra supports, open this file in a regular text editor. You can see that Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and C# through this interface.

javadoc

This directory contains a documentation website generated using Java’s JavaDoc tool. Note that JavaDoc reflects only the comments that are stored directly in the Java code, and as such does not represent comprehensive documentation. It’s helpful if you want to see how the code is laid out. Moreover, Cassandra is a wonderful project, but the code contains precious few comments, so you might find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the source files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc, open the javadoc/index.html file in a browser.

lib

This directory contains all of the external libraries that Cassandra needs to run. For example, it uses two different JSON serialization libraries, the Google collections project, and several Apache Commons libraries. This directory includes the Thrift and Avro RPC libraries for interacting with Cassandra.

Building from Source

Cassandra uses Apache Ant for its build scripting language and the Ivy plug-in for dependency management.

Note

You can download Ant from http://ant.apache.org. You don’t need to download Ivy separately just to build Cassandra.

Ivy requires Ant, and building from source requires the complete JDK, version 1.6.0_20 or better, not just the JRE. If you see a message about how Ant is missing tools.jar, either you don’t have the full JDK or you’re pointing to the wrong path in your environment variables.
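If you’re not sure whether your JAVA_HOME points at a full JDK, a quick check is to look for lib/tools.jar underneath it, since that is the file Ant complains about. Here is a minimal sketch; the helper name has_full_jdk is ours, and the lib/tools.jar location assumes a JDK 6-style layout.

```python
import os
from pathlib import Path


def has_full_jdk(java_home):
    """Return True if java_home contains lib/tools.jar (assumes a JDK 6-style layout)."""
    if not java_home:
        return False
    return (Path(java_home) / "lib" / "tools.jar").is_file()


java_home = os.environ.get("JAVA_HOME")
if has_full_jdk(java_home):
    print("JAVA_HOME points to a full JDK:", java_home)
else:
    print("No lib/tools.jar under JAVA_HOME; Ant may complain. JAVA_HOME =", java_home)
```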

Note

If you want to download the most cutting-edge builds, you can get the source from Hudson, which the Cassandra project uses as its Continuous Integration tool. See http://hudson.zones.apache.org/hudson/job/Cassandra/ for the latest builds and test coverage information.

If you are a Git fan, you can get a read-only trunk version of the Cassandra source using this command:

>git clone git://git.apache.org/cassandra.git

Note

Git is a source code management system created by Linus Torvalds to manage development of the Linux kernel. It’s increasingly popular and is used by projects such as Android, Fedora, Ruby on Rails, Perl, and many Cassandra clients (as we’ll see in Chapter 8). If you’re on a Linux distribution such as Ubuntu, it couldn’t be easier to get Git. At a console, just type >apt-get install git and it will be installed and ready for commands. For more information, visit http://git-scm.com/.

Because Ivy takes care of all the dependencies, it’s easy to build Cassandra once you have the source. Just make sure you’re in the root directory of your source download and execute the ant program, which will look for a file called build.xml in the current directory and execute the default build target. Ant and Ivy take care of the rest. To execute the Ant program and start compiling the source, just type:

>ant

That’s it. Ivy will retrieve all of the necessary dependencies, and Ant will build the nearly 350 source files and execute the tests. If all went well, you should see a BUILD SUCCESSFUL message. If all did not go well, make sure that your path settings are all correct, that you have the most recent versions of the required programs, and that you downloaded a stable Cassandra build. You can check the Hudson report to make sure that the source you downloaded actually can compile.

Note

If you want to see detailed information on what is happening during the build, you can pass Ant the -v option to cause it to output verbose details regarding each operation it performs.

Additional Build Targets

To compile the server, you can simply execute ant as shown previously. But there are a couple of other targets in the build file that you might be interested in:

test

Users will probably find this the most helpful, as it executes the battery of unit tests. You can also check out the unit test sources themselves for some useful examples of how to interact with Cassandra.

gen-thrift-java

This target generates the Apache Thrift client interface for interacting with the database in Java.

gen-thrift-py

This target generates the Thrift client interface for Python users.

build-jar

To create a Java Archive (JAR) file for distribution, execute the command >ant jar. This will perform a complete build and output a file into the build directory called apache-cassandra-x.x.x.jar.

Building with Maven

The original authors of Cassandra apparently didn’t care much for Maven, so the early releases did not include any Maven POM file. But because so many Java developers have begun to favor Maven over Ant, and the tooling support in IDEs for Maven has become so strong, there’s a pom.xml contribution to the project so you can build from Maven if you prefer.

To build the source from Maven, navigate to <cassandra-home>/contrib/maven and execute this command:

$ mvn clean install

If you have any difficulties building with Maven, you may have to get some of the required JARs manually. As of version 0.6.3, the Maven POM doesn’t work out of the box because some dependencies, such as the libthrift.jar file, are unavailable in a repository.

Note

Few developers are using Maven with Cassandra, so Maven lacks strong support. Which is to say, use caution, because the Maven POM is often broken.

Running Cassandra

In earlier versions of Cassandra, before you could start the server there was a bit of fiddling to be done with Ivy and setting environment variables. But the developers have done a terrific job of making it very easy to start using Cassandra immediately.

Note

Cassandra requires Java Standard Edition JDK 6. Preferably, use 1.6.0_20 or greater. It has been tested on both the Open JDK and Sun’s JDK. You can check your installed Java version by opening a command prompt and executing >java -version. If you need a JDK, you can get one at http://java.sun.com/javase/downloads.
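If you want to automate the version check, the sketch below parses the output of java -version and compares it against 1.6.0_20. The function names are our own, and the regular expression assumes the classic version "1.6.0_20" output format; other JVMs may print something different.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Extract (major, minor, patch, update) from `java -version` output, or None."""
    match = re.search(r'version "(\d+)\.(\d+)\.(\d+)(?:_(\d+))?', version_output)
    if match is None:
        return None
    # A missing update number (for example "1.6.0") is treated as update 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_java_version():
    """Run `java -version`, which writes to standard error, and parse the result."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_version(result.stderr)


def meets_minimum(version, minimum=(1, 6, 0, 20)):
    """Return True if the parsed version tuple is at least the minimum."""
    return version is not None and version >= minimum
```

With Java on your path, meets_minimum(installed_java_version()) tells you whether the installed version is recent enough.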

On Windows

Once you have the binary or the source downloaded and compiled, you’re ready to start the database server.

You also might need to set your JAVA_HOME environment variable. To do this on Windows 7, click the Start button and then right-click on Computer. Click Advanced System Settings, and then click the Environment Variables... button. Click New... to create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable Value field, type the path to your JDK installation. This is probably something like C:\Program Files\Java\jdk1.6.0_20. Remember that if you create a new environment variable, you’ll need to reopen any currently open terminals in order for the system to become aware of the new variable. To make sure your environment variable is set correctly and that Cassandra can subsequently find Java on Windows, execute this command in a new terminal: >echo %JAVA_HOME%. This prints the value of your environment variable.

Once you’ve started the server for the first time, Cassandra will add two directories to your system. The first is C:\var\lib\cassandra, which is where it will store its data files and its commit log, in subdirectories such as data and commitlog. The other is C:\var\log\cassandra; logs will be written to a file called system.log. If you encounter any difficulties, consult the files in these directories to see what might have happened. If you’ve been trying different versions of the database and aren’t worried about losing data, you can delete these directories and restart the server as a last resort.

On Linux

The process on Linux is similar to that on Windows. Make sure that your JAVA_HOME variable is properly set to version 1.6.0_20 or better. Then, you need to extract the Cassandra gzipped tarball using gunzip and tar. Finally, create a couple of directories for Cassandra to store its data and logs, and give them the proper permissions, as shown here:

ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra

Instead of ehewitt, of course, substitute your own username.

Starting the Server

To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the directory where you unpacked Cassandra, and run the following command to start your server. In a clean installation, you should see some log statements like this:

eben@morpheus$ bin/cassandra -f
 INFO 13:23:22,367 DiskAccessMode 'auto' determined to be standard, indexAccessMode is standard
 INFO 13:23:22,475 Couldn't detect any schema definitions in local storage.
 INFO 13:23:22,476 Found table data in data directories. Consider using JMX to call org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
 INFO 13:23:22,497 Cassandra version: 0.7.0-beta1
 INFO 13:23:22,497 Thrift API version: 10.0.0
 INFO 13:23:22,498 Saved Token not found. Using qFABQw5XJMvs47lg
 INFO 13:23:22,498 Saved ClusterName not found. Using Test Cluster
 INFO 13:23:22,502 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1282508602502.log
 INFO 13:23:22,507 switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282508602502.log', position=276)
 INFO 13:23:22,510 Enqueuing flush of Memtable-LocationInfo@29857804(178 bytes, 4 operations)
 INFO 13:23:22,511 Writing Memtable-LocationInfo@29857804(178 bytes, 4 operations)
 INFO 13:23:22,691 Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-1-Data.db
 INFO 13:23:22,701 Starting up server gossip
 INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160
 INFO 13:23:22,752 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 13:23:22,753 Listening for thrift clients...
 INFO 13:23:22,792 mx4j successfuly loaded
HttpAdaptor version 3.0.2 started on port 8081

Note

Using the -f switch tells Cassandra to stay in the foreground instead of running as a background process, so that all of the server logs will print to standard out and you can see them in your terminal window, which is useful for testing.

Congratulations! Your Cassandra server should now be up and running, with a new single-node cluster called Test Cluster listening on port 9160.

Note

The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).

If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the folders in /var/lib/cassandra and /var/log/cassandra.
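If you find yourself resetting a test installation often, you can script the cleanup. The sketch below empties a directory while leaving the directory itself in place, so the ownership you set up earlier is preserved. The function name clear_directory_contents is ours, and we assume the server is stopped before you run it. It permanently deletes data, so use it only on a throwaway installation.

```python
import shutil
from pathlib import Path


def clear_directory_contents(directory):
    """Delete everything inside directory but keep the directory itself.

    Returns the names of the entries that were removed.
    """
    root = Path(directory)
    removed = []
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and everything under it
        else:
            entry.unlink()  # remove a plain file or a symlink
        removed.append(entry.name)
    return removed
```

For the default locations used in this chapter, you would call it once with /var/lib/cassandra and once with /var/log/cassandra.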

Running the Command-Line Client Interface

Now that you have a Cassandra installation up and running, let’s give it a quick try to make sure everything is set up properly. On Linux, running the command-line interface just works. On Windows, you might have to do a little additional work.

On Windows, navigate to the Cassandra home directory and open a new terminal in which to run our client process:

>bin\cassandra-cli

It’s possible that on Windows you will see an error like this when starting the client:

Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/cassandra/cli/CliMain

This probably means that you started Cassandra directly from within the bin directory, and it therefore sets up its Java classpath incorrectly and can’t find the CliMain file to start the client. You can define an environment variable called CASSANDRA_HOME that points to the top-level directory where you have placed or built Cassandra, so you don’t have to pay as much attention to where you’re starting Cassandra from.

Note

For a little reminder on setting environment variables on Windows, see the section On Windows.

To run the command-line interface program on Linux, navigate to the Cassandra home directory and run the cassandra-cli program in the bin directory:

>bin/cassandra-cli

The Cassandra client will start:

eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] 

You now have an interactive shell at which you can issue commands.

Note, however, that if you’re used to Oracle’s SQL*Plus or similar command-line database clients, you may become frustrated. The Cassandra CLI is not intended to be used as a full-blown client, as it’s really for development. That makes it a good way to get started using Cassandra, because you don’t have to write lots of code to test interactions with your database and get used to the environment.

Basic CLI Commands

Before we get too deep into how Cassandra works, let’s get an overview of the client API so that you can see what kinds of commands you can send to the server. We’ll see how to use the basic environment commands and how to do a round trip of inserting and retrieving some data.

Help

To get help for the command-line interface, type help or ? to see the list of available commands. The following list shows only the commands related to metadata and configuration; there are other commands for getting and setting values that we explore later.

[default@Keyspace1] help
List of all CLI commands:
?                                                          Display this message.
help                                                          Display this help.
help <command>                          Display detailed, command-specific help.
connect <hostname>/<port>                             Connect to thrift service.
use <keyspace> [<username> 'password']                     Switch to a keyspace.
describe keyspace <keyspacename>                              Describe keyspace.
exit                                                                   Exit CLI.
quit                                                                   Exit CLI.
show cluster name                                          Display cluster name.
show keyspaces                                           Show list of keyspaces.
show api version                                        Show server API version.
create keyspace <keyspace> [with <att1>=<value1> [and <att2>=<value2> ...]]
                   Add a new keyspace with the specified attribute and value(s).
create column family <cf> [with <att1>=<value1> [and <att2>=<value2> ...]]
           Create a new column family with the specified attribute and value(s).
drop keyspace <keyspace>                                      Delete a keyspace.
drop column family <cf>                                  Delete a column family.
rename keyspace <keyspace> <keyspace_new_name>                Rename a keyspace.
rename column family <cf> <new_name>                     Rename a column family.

Connecting to a Server

Starting the client this way does not automatically connect it to a Cassandra server instance. To connect to a particular server after you have started the client this way, use the connect command:

eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] connect localhost/9160
Connected to: "Test Cluster" on localhost/9160
[default@unknown] 

As a shortcut, you can start the client and connect to a particular server instance by passing the host and port parameters at startup, like this:

eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] 

Note

If you see this error while trying to connect to a server:

Exception connecting to localhost/9160 - java.net.ConnectException:
Connection refused: connect

make sure that a Cassandra instance is started at that host and port and that you can ping the host you’re trying to reach. There may be firewall rules preventing you from connecting. Also make sure that you’re using the new 0.7 syntax as described earlier, as it has changed from previous versions.
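To separate a network problem from a CLI problem, you can test whether anything is accepting TCP connections on the Thrift port at all. This is a minimal sketch using Python’s standard socket module; the helper name can_connect and the two-second timeout are our own choices.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

If can_connect("localhost", 9160) returns False, the server probably isn’t running or a firewall is in the way; if it returns True, look at the connect syntax you’re using instead.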

The CLI indicates that you’re connected to a Cassandra server cluster called “Test Cluster”. That’s because this cluster of one node at localhost is set up for you by default.

Note

In a production environment, be sure to remove the Test Cluster from the configuration.

Describing the Environment

If you’re using the binary distribution, an empty keyspace (a Cassandra database) is set up for you to test with once you connect to your Cassandra instance, Test Cluster.

To see the name of the current cluster you’re working in, type:

[default@unknown] show cluster name
Test Cluster

To see which keyspaces are available in the cluster, issue this command:

[default@unknown] show keyspaces
system

If you have created any of your own keyspaces, they will be shown as well. The system keyspace is used internally by Cassandra, and isn’t for us to put data into. In this way, it’s similar to the master and tempdb databases in Microsoft SQL Server. This keyspace contains the schema definitions and is aware of any modifications to the schema made at runtime. It can propagate any changes made in one node to the rest of the cluster based on timestamps.

To see the version of the API you’re using, type:

[default@unknown] show api version
10.0.0

There are a variety of other commands with which you can experiment. For now, let’s add some data to the database and get it back out again.

Creating a Keyspace and Column Family

A Cassandra keyspace is sort of like a relational database. It defines one or more column families, which are very roughly analogous to tables in the relational world. When you start the CLI client without specifying a keyspace, the output will look like this:

>bin/cassandra-cli --host localhost --port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]

Your shell prompt is for default@unknown because you haven’t authenticated as a particular user (which we’ll see how to do in Chapter 6) and you didn’t specify a keyspace.

Note

This authentication scheme is familiar if you’ve used MySQL before. Authentication and authorization are very much works in progress at the time of this writing. The recommended deployment is to put a firewall around your cluster.

Let’s create our own keyspace so we have something to write data to:

[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc

Don’t worry about the replication_factor for now. That’s a setting we’ll look at in detail later. After you have created your own keyspace, you can switch to it in the shell by typing:

[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace]

We’re “authenticated” to the keyspace because MyKeyspace doesn’t require credentials.

Now we can create a column family in our keyspace. To do this on the CLI, use the following command:

[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
[default@MyKeyspace]

This creates a new column family called “User” in our current keyspace, and takes the defaults for column family settings. We can use the CLI to get a description of a keyspace using the describe keyspace command, and make sure it has our column family definition, as shown here:

[default@MyKeyspace] describe keyspace MyKeyspace
Keyspace: MyKeyspace

Column Family Name: User
Column Family Type: Standard
Column Sorted By: org.apache.cassandra.db.marshal.BytesType
flush period: null minutes
------
[default@MyKeyspace]

We’ll worry about the Type, Sorted By, and flush period settings later. For now, we have enough to get started.

Writing and Reading Data

Now that we have a keyspace and a column family, we’ll write some data to the database and read it back out again. It’s OK at this point not to know quite what’s going on. We’ll come to understand Cassandra’s data model in depth later. For now, you have a keyspace (database), which has a column family. For our purposes here, it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.

To write a value, use the set command:

[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='[email protected]'   
Value inserted.
[default@MyKeyspace]

Here we have created two columns for the key ehewitt, to store a set of related values. The column names are fname and email. We can use the count command to make sure that we have written two columns for our single key:

[default@MyKeyspace] count User['ehewitt']
2 columns

Now that we know the data is there, let’s read it, using the get command:

[default@MyKeyspace] get User['ehewitt'] 
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, value=[email protected], timestamp=1282510313429000)
Returned 2 results.
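The column names in this output look cryptic because the CLI prints them as hexadecimal bytes rather than as text. As a quick sanity check, you can decode them yourself. Here is a minimal sketch in Python; the helper name decode_column_name is ours, and we assume the names were written as plain UTF-8 text.

```python
def decode_column_name(hex_name, encoding="utf-8"):
    """Turn a hex-encoded column name from the CLI output back into text."""
    return bytes.fromhex(hex_name).decode(encoding)


for hex_name in ("666e616d65", "656d61696c"):
    print(hex_name, "->", decode_column_name(hex_name))
# prints:
# 666e616d65 -> fname
# 656d61696c -> email
```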

You can delete a column using the del command. Here we will delete the email column for the ehewitt row key:

[default@MyKeyspace] del User['ehewitt']['email']
column removed.

Now we’ll clean up after ourselves by deleting the entire row. It’s the same command, but we don’t specify a column name:

[default@MyKeyspace] del User['ehewitt']         
row removed.

To make sure that it’s removed, we can query again:

[default@MyKeyspace] get User['ehewitt']
Returned 0 results.
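If it helps to have a mental model of what just happened, you can picture the column family as a map of row keys to maps of columns. The toy class below mirrors the set, count, get, and del commands from this session using nested Python dictionaries. It is purely illustrative: the class and method names are our own, it runs entirely in memory, and sorting the column names alphabetically is a simplification we chose, not a claim about how the server orders or stores data.

```python
import time


class ColumnFamilyModel:
    """Toy in-memory model: row key -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def set(self, key, column, value):
        # The CLI output shows timestamps in microseconds since the epoch.
        timestamp = int(time.time() * 1_000_000)
        self.rows.setdefault(key, {})[column] = (value, timestamp)

    def get(self, key):
        # Return (name, value, timestamp) tuples sorted by column name.
        columns = self.rows.get(key, {})
        return [(name, *columns[name]) for name in sorted(columns)]

    def count(self, key):
        return len(self.rows.get(key, {}))

    def delete(self, key, column=None):
        if column is None:
            self.rows.pop(key, None)  # no column given: remove the whole row
        else:
            self.rows.get(key, {}).pop(column, None)
```

Notice that nothing about the columns is declared ahead of time; a column exists for a row simply because a value was set for it.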

Summary

Now you should have a Cassandra installation up and running. You’ve worked with the CLI client to insert and retrieve some data, and you’re ready to take a step back and get the big picture on Cassandra before really diving into the details.
