Tuesday, April 26, 2011

Getting Started With Cassandra tutorial

Based on Ronald Mathies’ intro articles to Cassandra and a few other resources I’ve been gathering, I thought I should put together a detailed guide to getting started with Cassandra. As one would expect, the ☞ first post briefly introduces Cassandra and covers the distribution details and installation steps. It should be noted that Windows may not be the best environment to install Cassandra on. Also, if after the brief intro you’d like to see more details, you should check Gary Dusbabek’s presentation on Cassandra or watch Eric Evans’ Cassandra presentation at FOSDEM.

The ☞ second article focuses on the Cassandra data model. If you are not familiar with it, this is the part you’ll want to focus on.

Column:
A Column is a tuple (triplet) that contains a name, a value and a timestamp. It is the smallest data container there is.
SuperColumn:
A SuperColumn is a tuple with a name and a value; it doesn’t have a timestamp like the Column tuple. Notice that the value in this case is not a binary value but more of a Map-style container. The map contains key/Column combinations. What is important here is that the key has the same value as the name of the Column it refers to. So, to put it simply, a SuperColumn is a container for one or more Columns. You will see that this also makes a big difference later on when we discuss the ColumnFamily and SuperColumnFamily.
ColumnFamily:
A ColumnFamily is a structure that can hold an effectively unlimited number of rows. For most people with an RDBMS background, this is the structure that most resembles a Table. When you look at the diagram you can see that a ColumnFamily has a name (comparable to the name of a Table) and a map with a key (comparable to a row identifier) and a value (which is a Map containing Columns). The map with the Columns follows the same rule as the SuperColumn: the key has the same value as the name of the Column it refers to.
SuperColumnFamily:
Finally we have the largest container, the SuperColumnFamily. If you understand the ColumnFamily, then this construction isn’t much harder: instead of having Columns in the innermost Map we have SuperColumns, so it just adds an extra dimension. As displayed in the image, the key of the Map which contains the SuperColumns must be the same as the name of the SuperColumn (just like with the ColumnFamily).
Keyspace:
Keyspaces are quite simple again. From an RDBMS point of view you can compare a keyspace to your schema; normally you have one per application. A keyspace contains the ColumnFamilies. Note, however, that there is no relationship between the ColumnFamilies; they are just separate containers.
Probably the best explanation of the Cassandra data model can be found in Arin Sarkissian’s ☞ WTF is a SuperColumn? There are other recommended resources about Cassandra, and Jonathan Ellis, the Cassandra project chair, has a suggested Cassandra reading list.
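To make the nesting of these containers a bit more concrete, here is a minimal Python sketch that models them as plain nested maps. It is only an illustration of the structure described above, not how Cassandra stores anything internally: the class names mirror the terms in the text, while the sample row key, column names and values are made up, and strings are used where Cassandra itself works with byte arrays.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    """Smallest container: a (name, value, timestamp) triplet."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A name plus a map of Columns; it carries no timestamp of its own."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def add(self, column: Column) -> None:
        # The map key is always the name of the Column it refers to.
        self.columns[column.name] = column


# ColumnFamily: row key -> {column name -> Column}
users: Dict[str, Dict[str, Column]] = {}
for col in (Column("email", "alice@example.com", 1), Column("city", "Oslo", 1)):
    users.setdefault("alice", {})[col.name] = col

# SuperColumnFamily: row key -> {super column name -> SuperColumn}
home = SuperColumn("home")
home.add(Column("street", "1 Main St", 1))
home.add(Column("zip", "0150", 1))
addresses: Dict[str, Dict[str, SuperColumn]] = {"alice": {home.name: home}}

# Keyspace: a named collection of otherwise unrelated column families.
keyspace = {"Users": users, "Addresses": addresses}

print(keyspace["Users"]["alice"]["email"].value)
print(sorted(keyspace["Addresses"]["alice"]["home"].columns))
```

Reading a value always means walking the maps from the outside in: keyspace, column family, row key, then column name (with one extra hop through the SuperColumn name in a SuperColumnFamily).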

The ☞ third article in the series focuses on Cassandra’s sorting capabilities:

By default Cassandra sorts the data as soon as you store it in the database, and it remains sorted. This gives you an enormous performance boost; however, you need to think before you start storing data.

Sorting can be specified on the ColumnFamily CompareWith attribute. These are the options you can choose from (it is possible to create custom sorting behavior, but we will cover that later):

BytesType
UTF8Type
LexicalUUIDType
TimeUUIDType
AsciiType
LongType
There is also a way to define your own custom Cassandra sorting types, described in this ☞ post.
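Why the choice of comparator deserves some thought up front is easier to see with a small example. The Python sketch below only emulates the idea behind two of the types listed above, BytesType (compare the raw bytes) and LongType (compare as 64-bit numbers); it is not Cassandra’s comparator code, and the helper names long_name, bytes_type_key and long_type_key are hypothetical.

```python
import struct


def long_name(n: int) -> bytes:
    """Encode an integer as an 8-byte big-endian signed value."""
    return struct.pack(">q", n)


def bytes_type_key(name: bytes) -> bytes:
    # BytesType-style ordering: compare raw bytes, unsigned, left to right.
    return name


def long_type_key(name: bytes) -> int:
    # LongType-style ordering: decode the 8 bytes and compare numerically.
    return struct.unpack(">q", name)[0]


# Numbers stored as text sort by their characters, not by their value.
text_names = [b"10", b"9", b"2"]
print(sorted(text_names, key=bytes_type_key))

# The same 8-byte names come out in a different order per comparator:
# byte-wise comparison places the negative value last, numeric first.
numeric_names = [long_name(n) for n in (10, 9, 2, -1)]
by_bytes = [long_type_key(n) for n in sorted(numeric_names, key=bytes_type_key)]
by_long = [long_type_key(n) for n in sorted(numeric_names, key=long_type_key)]
print(by_bytes)
print(by_long)
```

Since the data is kept in comparator order from the moment it is written, picking the type that matches how the column names will be read back is the part worth thinking about before storing anything.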

By now you should be ready to start using Cassandra, and this is exactly the subject of ☞ part 4 and ☞ part 5 of the series, which cover the Thrift Cassandra client. Understanding how writes and reads are performed might be useful, so you should check Cassandra write operation and Cassandra read operation, which also talk about the performance of these operations.

While initially you might not have enough data to have to decide how to partition a Cassandra cluster, once you get to that point I’m pretty sure you’ll appreciate some more details on Cassandra partitioning strategies.

Last but not least, here is a list of known Cassandra use cases that might give you a good idea of where Cassandra will fit in your next app, and then you should be absolutely ready to experiment with Cassandra.

References
A Cassandra Glossary
Cassandra: Tunning for Performance
Cassandra Reads Performance Explained
Cassandra: Modeling A Facebook-Style Messenger
Presentation: Introduction to Cassandra
Cassandra Installation Guide for Ubuntu and Debian
Presentation: Cassandra Basics - Indexing
RESTful Cassandra
Tutorial: MapReduce with Riak
Why Redis? And Memcached, Cassandra, Lucene, ElasticSearch
Cassandra Gets (Better) Documentation
Cassandra reading list
Your Chance to Review the FOSDEM NoSQL Event
Cassandra Usecases: Survey Results
Cassandra Partitioning Strategies
Cassandra Write Operation Performance Explained
Getting Started with Cassandra on Windows
Presentation: Gary Dusbabek (Rackspace) on Cassandra
