Preface

Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.

Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.

Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a schema-free data model.

Is This Book for You?

This book is intended for a variety of audiences. It should be useful to you if you are:

  • A developer working with large-scale, high-volume websites, such as Web 2.0 social applications

  • An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores

  • A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store

  • A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy

  • A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options

This book is a technical guide. In many ways, Cassandra represents a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well-versed in thinking about data in purely relational or object-oriented terms. Cassandra’s data model is very different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).

Using Cassandra does not mean that you have to be a Java developer. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Although it’s not strictly necessary to know Java, it can help you to better understand exceptions, how to build the source code, and how to use some of the popular clients. Many of the examples in this book are in Java. But because of the interface used to access Cassandra, you can use Cassandra from a wide variety of languages, including C#, Scala, Python, and Ruby.

Finally, it is assumed that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ivy is used to build Cassandra, and a popular client (Hector) is available via Git. In cases where I speculate that you’ll need to do a little setup of your own in order to work with the examples, I try to support that.

What’s in This Book?

This book is designed with the chapters acting, to a reasonable extent, as standalone guides. This is important for a book on Cassandra, which has a variety of audiences and is changing rapidly. To borrow from the software world, I wanted the book to be “modular”—sort of. If you’re new to Cassandra, it makes sense to read the book in order; if you’ve passed the introductory stages, you will still find value in later chapters, which you can read as standalone guides.

Here is how the book is organized:

Chapter 1, Introducing Cassandra

This chapter introduces Cassandra and discusses what’s exciting and different about it, who is using it, and what its advantages are.

Chapter 2, Installing Cassandra

This chapter walks you through installing Cassandra on a variety of platforms.

Chapter 3, The Cassandra Data Model

Here we look at Cassandra’s data model to understand what columns, super columns, and rows are. Special care is taken to bridge the gap between the relational database world and Cassandra’s world.

Chapter 4, Sample Application

This chapter presents a complete working application that translates from a relational model in a well-understood domain to Cassandra’s data model.

Chapter 5, The Cassandra Architecture

This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.

Chapter 6, Configuring Cassandra

This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices.

Chapter 7, Reading and Writing Data

This is the moment we’ve been waiting for. We present an overview of what’s different about Cassandra’s model for querying and updating data, and then get to work using the API.

Chapter 8, Clients

There are a variety of clients that third-party developers have created for many different languages, including Java, C#, Ruby, and Python, in order to abstract Cassandra’s lower-level API. We help you understand this landscape so you can choose one that’s right for you.

Chapter 9, Monitoring

Once your cluster is up and running, you’ll want to monitor its usage, memory patterns, and thread patterns, and understand its general activity. Cassandra has a rich Java Management Extensions (JMX) interface baked in, which we put to use to monitor all of these and more.

Chapter 10, Maintenance

The ongoing maintenance of a Cassandra cluster is made somewhat easier by some tools that ship with the server. We see how to decommission a node, load-balance the cluster, get statistics, and perform other routine operational tasks.

Chapter 11, Performance Tuning

One of Cassandra’s most notable features is its speed—it’s very fast. But there are a number of things, including memory settings, data storage, hardware choices, caching, and buffer sizes, that you can tune to squeeze out even more performance.

Chapter 12, Integrating Hadoop

In this chapter, written by Jeremy Hanna, we put Cassandra in a larger context and see how to integrate it with the popular implementation of Google’s Map/Reduce algorithm, Hadoop.

Appendix A

Many new databases have cropped up in response to the need to scale at Big Data levels, or to take advantage of a “schema-free” model, or to support more recent initiatives such as the Semantic Web. Here we contextualize Cassandra against a variety of the more popular nonrelational databases, examining document-oriented databases, distributed hashtables, and graph databases, to better understand Cassandra’s offerings.

Glossary

It can be difficult to understand something that’s really new, and Cassandra has many terms that might be unfamiliar to developers or DBAs coming from the relational application development world, so I’ve included this glossary to make it easier to read the rest of the book. If you’re stuck on a certain concept, you can flip to the glossary to help clarify things such as Merkle trees, vector clocks, hinted handoffs, read repairs, and other exotic terms.

Note

This book is developed against Cassandra 0.6 and 0.7. The project team is working hard on Cassandra, and new minor releases and bug fix releases come out frequently. Where possible, I have tried to call out relevant differences, but you might be using a different version by the time you read this, and the implementation may have changed.

Finding Out More

If you’d like to find out more about Cassandra, and to get the latest updates, visit this book’s companion website at http://www.cassandraguide.com.

It’s also an excellent idea to follow me on Twitter at @ebenhewitt.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip, suggestion, or general note.

Caution

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Cassandra: The Definitive Guide by Eben Hewitt. Copyright 2011 Eben Hewitt, 978-1-449-39041-9.”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at .

Safari® Enabled

Note

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707 829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://oreilly.com/catalog/0636920010852/

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

http://www.oreilly.com

Acknowledgments

There are many wonderful people to whom I am grateful for helping bring this book to life.

Thanks to Jeremy Hanna, for writing the Hadoop chapter, and for being so easy to work with.

Thank you to my technical reviewers. Stu Hood’s insightful comments in particular really improved the book. Robert Schneider and Gary Dusbabek contributed thoughtful reviews.

Thank you to Jonathan Ellis for writing the foreword.

Thanks to my editor, Mike Loukides, for being a charming conversationalist at dinner in San Francisco.

Thank you to Rain Fletcher for supporting and encouraging this book.

I’m inspired by the many terrific developers who have contributed to Cassandra. Hats off for making such a pretty and powerful database.

As always, thank you to Alison Brown, who read drafts, gave me notes, and made sure that I had time to work; this book would not have happened without you.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset