Sunday, June 26, 2011

Cassandra Keyspaces - what are they?



Keyspaces
A cluster is a container for keyspaces—typically a single keyspace. A keyspace is the
outermost container for data in Cassandra, corresponding closely to a relational database.
Like a relational database, a keyspace has a name and a set of attributes that define
keyspace-wide behavior. Although people frequently advise that it’s a good idea to
create a single keyspace per application, this doesn’t appear to have much practical
basis. It’s certainly an acceptable practice, but it’s perfectly fine to create as many keyspaces
as your application needs. Note, however, that you will probably run into trouble
creating thousands of keyspaces per application.
Depending on your security constraints and partitioner, it’s fine to run multiple keyspaces
on the same cluster. For example, if your application is called Twitter, you
would probably have a cluster called Twitter-Cluster and a keyspace called Twitter.
To my knowledge, there are currently no naming conventions in Cassandra for such
items.
In Cassandra, the basic attributes that you can set per keyspace are:
Replication factor
In simplest terms, the replication factor refers to the number of nodes that will act
as copies (replicas) of each row of data. If your replication factor is 3, then three
nodes in the ring will have copies of each row, and this replication is transparent
to clients.
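The replication factor essentially lets you decide how much performance you are
willing to pay for redundancy and consistency: every extra replica is another copy
that must be written. The consistency level you choose for reads and writes is
expressed relative to the replication factor, since it sets how many of those
replicas must respond before an operation counts as successful.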
Replica placement strategy
The replica placement refers to how the replicas will be placed in the ring. There
are different strategies that ship with Cassandra for determining which nodes
will get copies of which keys. These are SimpleStrategy (formerly known as
RackUnawareStrategy), OldNetworkTopologyStrategy (formerly known as
RackAwareStrategy), and NetworkTopologyStrategy (formerly known as
DatacenterShardStrategy).
Column families
In the same way that a database is a container for tables, a keyspace is a container
for a list of one or more column families. A column family is roughly analogous to
a table in the relational model, and is a container for a collection of rows. Each row
contains ordered columns. Column families represent the structure of your data.
Each keyspace has at least one and often many column families.
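
To make the three attributes above concrete, here is a minimal Python sketch that models a keyspace as a plain data structure. It is an illustration only, not Cassandra's API: the Keyspace class, its field names, and the sample "Users" column family are hypothetical names chosen for this example. The quorum helper uses the usual majority formula, floor(RF / 2) + 1, to show how the number of replicas a QUORUM operation waits for follows from the replication factor.

```python
from dataclasses import dataclass, field


@dataclass
class Keyspace:
    """Hypothetical in-memory model of a keyspace (not a Cassandra API)."""

    name: str
    replication_factor: int
    placement_strategy: str = "SimpleStrategy"
    # column family name -> {row key -> {column name -> value}}
    # Cassandra keeps columns sorted; a plain dict is used here for brevity.
    column_families: dict = field(default_factory=dict)

    def quorum(self) -> int:
        """Number of replicas that form a majority for this keyspace."""
        return self.replication_factor // 2 + 1


# Assumed example values, following the Twitter naming used in the text.
twitter = Keyspace(name="Twitter", replication_factor=3)
twitter.column_families["Users"] = {"jsmith": {"email": "jsmith@example.com"}}

print(twitter.quorum())  # prints 2
```

With a replication factor of 3, a majority is 2 replicas, so a quorum read or write can still succeed when one of the three replicas is down.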

I mention the replication factor and replica placement strategy here because they are
set per keyspace. However, they don’t have an immediate impact on your data model
per se.
It is possible, but generally not recommended, to create multiple keyspaces per application.
The only time you would want to split your application into multiple keyspaces
is if you wanted a different replication factor or replica placement strategy for some of
the column families. For example, if you have some data that is of lower priority,
you could put it in its own keyspace with a lower replication factor so that Cassandra
doesn’t have to work as hard to replicate it. But this may be more complicated than it’s
worth. It’s probably a better idea to start with one keyspace and see whether you really
need to tune at that level.
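
The trade-off described above can be sketched in a few lines of Python. This is a simplified model in the spirit of SimpleStrategy, which puts the first replica on the node that owns the row's token and the remaining replicas on the next nodes around the ring. The six-node ring, the token values, the node names, and the MD5-based token function are all assumptions made up for this example; real clusters use their configured partitioner and tokens.

```python
from bisect import bisect_left
from hashlib import md5

# Hypothetical ring: (token, node) pairs sorted by token.
RING = [(0, "node-a"), (20, "node-b"), (40, "node-c"),
        (60, "node-d"), (80, "node-e"), (100, "node-f")]
RING_SIZE = 120  # tokens in this toy ring fall in the range 0..119


def token_for(row_key: str) -> int:
    """Map a row key to a position on the toy ring."""
    return int(md5(row_key.encode("utf-8")).hexdigest(), 16) % RING_SIZE


def replicas_for(row_key: str, replication_factor: int) -> list:
    """Return the nodes holding a row: the owner, then its clockwise neighbours."""
    tokens = [token for token, _ in RING]
    # First node whose token is >= the row's token, wrapping past the end.
    start = bisect_left(tokens, token_for(row_key)) % len(RING)
    count = min(replication_factor, len(RING))
    return [RING[(start + i) % len(RING)][1] for i in range(count)]


# The same row stored in a keyspace with RF 3 and in one with RF 1.
print(replicas_for("user:42", 3))
print(replicas_for("user:42", 1))
```

A keyspace with a replication factor of 1 writes each row to a single node, while a factor of 3 writes it to three, which is the extra replication work the lower-priority keyspace would avoid.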


1 comment:

  1. Based on my application requirements, it makes sense to have a keyspace per user. What kind of trouble would I run into if I were to create thousands of keyspaces per application?
