Sunday, June 26, 2011

Cassandra Column Families - what are they?


Column Families
A column family is a container for an ordered collection of rows, each of which is itself
an ordered collection of columns. In the relational world, when you are physically
creating your database from a model, you specify the name of the database (keyspace),
the names of the tables (remotely similar to column families, but don’t get stuck on the
idea that column families equal tables—they don’t), and then you define the names of
the columns that will be in each table.
There are a few good reasons not to go too far with the idea that a column family is like
a relational table. First, Cassandra is considered schema-free because although the
column families are defined, the columns are not. You can freely add any column to
any column family at any time, depending on your needs. Second, a column family has
two attributes: a name and a comparator. The comparator value indicates how columns
will be sorted when they are returned to you in a query—according to long, byte, UTF8,
or other ordering.
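
To make the comparator idea concrete, here is a small Python sketch of how the same column names can come back in a different order depending on how they are compared. The function names are hypothetical and only mimic the effect of a comparator; this is not Cassandra's implementation.

```python
import struct

# Hypothetical helpers: each one orders raw column names (bytes) the way a
# comparator of that kind would. Illustration only, not Cassandra code.

def sort_as_bytes(names):
    # Plain byte-by-byte ordering of the raw names.
    return sorted(names)

def sort_as_utf8(names):
    # Decode each name as UTF-8 text and order by the decoded string.
    return sorted(names, key=lambda raw: raw.decode("utf-8"))

def sort_as_long(names):
    # Treat each 8-byte name as a big-endian signed 64-bit integer.
    return sorted(names, key=lambda raw: struct.unpack(">q", raw)[0])

def decode_longs(names):
    return [struct.unpack(">q", raw)[0] for raw in names]

# Column names that are really numbers, packed into 8 bytes each.
long_names = [struct.pack(">q", n) for n in (10, -1, 2)]

by_bytes = decode_longs(sort_as_bytes(long_names))
by_long = decode_longs(sort_as_long(long_names))

# Ordinary text column names sort alphabetically under a UTF8-style ordering.
text_names = [s.encode("utf-8") for s in ("zip", "name", "city")]
by_utf8 = [raw.decode("utf-8") for raw in sort_as_utf8(text_names)]

print(by_bytes, by_long, by_utf8)
```

The byte ordering puts -1 last because its two's-complement bytes are all 0xFF, while the long ordering puts it first. That difference is why the comparator should match what the column names actually represent.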
In a relational database, it is frequently transparent to the user how tables are stored
on disk, and it is rare to hear of recommendations about data modeling based on how
the RDBMS might store tables on disk. That’s another reason to keep in mind that a
column family is not a table. Because column families are each stored in separate files
on disk, it’s important to keep related columns defined together in the same column
family.
Another way that column families differ from relational tables is that relational tables
define only columns, and the user supplies the values, which are the rows. But in Cassandra,
a column family can hold columns, or it can be defined as a super column family. The
benefit of using a super column family is to allow for nesting.
For a standard column family, which is the default, you set the type to Standard; for a
super column family, you set the type to Super.
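
One rough way to picture the nesting is with plain Python dictionaries: a standard column family maps a row key to columns, while a super column family adds one more level in between. The room data and the helper names below are made up for illustration.

```python
# Standard column family: row key -> column name -> value
standard_cf = {
    "AZC_043": {"name": "Cambria Suites Hayden", "state": "AZ"},
}

# Super column family: row key -> super column name -> column name -> value
# (the room_* entries are hypothetical data, not from the original example)
super_cf = {
    "AZC_043": {
        "room_101": {"beds": "2", "rate": "129"},
        "room_102": {"beds": "1", "rate": "99"},
    },
}

def get_column(cf, row_key, column):
    # Two lookups are enough in a standard column family.
    return cf[row_key][column]

def get_subcolumn(cf, row_key, super_column, column):
    # A super column family needs one extra lookup for the nested level.
    return cf[row_key][super_column][column]

state = get_column(standard_cf, "AZC_043", "state")
rate = get_subcolumn(super_cf, "AZC_043", "room_102", "rate")
print(state, rate)
```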


When you write data to a column family in Cassandra, you specify values for one or
more columns. That collection of values, together with a unique identifier, is called a
row. The identifier is the row key, which acts like a primary key
for that row. So while it’s not incorrect to call it column-oriented, or
columnar, it might be easier to understand the model if you think of rows as containers
for columns. This is also why some people refer to Cassandra column families as similar
to a four-dimensional hash:
[Keyspace][ColumnFamily][Key][Column]
We can use a JSON-like notation to represent a Hotel column family, as shown here:
Hotel {
key: AZC_043 { name: Cambria Suites Hayden, phone: 480-444-4444,
address: 400 N. Hayden Rd., city: Scottsdale, state: AZ, zip: 85255}
key: AZS_011 { name: Clarion Scottsdale Peak, phone: 480-333-3333,
address: 3000 N. Scottsdale Rd, city: Scottsdale, state: AZ, zip: 85255}
key: CAS_021 { name: W Hotel, phone: 415-222-2222,
address: 181 3rd Street, city: San Francisco, state: CA, zip: 94103}
key: NYN_042 { name: Waldorf Hotel, phone: 212-555-5555,
address: 301 Park Ave, city: New York, state: NY, zip: 10019}
}


In this example, the row key is a unique primary key for the hotel, and the columns are
name, phone, address, city, state, and zip. Although these rows happen to define values
for all of the same columns, you could easily have one row with 4 columns and another
row in the same column family with 400 columns, and none of them would have to
overlap.
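
The four-dimensional hash can be sketched with nested Python dictionaries. This is only a mental model, not how Cassandra stores data; the keyspace name Hotelier is the one used in the CLI query in this post, and the pool and pets columns are hypothetical, added to show that rows do not need to share columns.

```python
# [Keyspace][ColumnFamily][Key][Column] as nested dictionaries
cluster = {
    "Hotelier": {                      # keyspace
        "Hotel": {                     # column family
            "NYN_042": {               # row key
                "name": "Waldorf Hotel",
                "phone": "212-555-5555",
                "city": "New York",
                "state": "NY",
            },
            "CAS_021": {               # a sparser row with different columns
                "name": "W Hotel",
                "pool": "rooftop",     # hypothetical column
                "pets": "allowed",     # hypothetical column
            },
        }
    }
}

# Four lookups, one per dimension.
phone = cluster["Hotelier"]["Hotel"]["NYN_042"]["phone"]

def column_names(row_key):
    return set(cluster["Hotelier"]["Hotel"][row_key])

# Columns the two rows have in common, and columns only one of them has.
shared = column_names("NYN_042") & column_names("CAS_021")
only_in_one = column_names("NYN_042") ^ column_names("CAS_021")
print(phone, sorted(shared), sorted(only_in_one))
```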


We can query a column family such as this one using the CLI, like this:
cassandra> get Hotelier.Hotel['NYN_042']
=> (column=zip, value=10019, timestamp=3894166157031651)
=> (column=state, value=NY, timestamp=3894166157031651)
=> (column=phone, value=212-555-5555, timestamp=3894166157031651)
=> (column=name, value=The Waldorf=Astoria, timestamp=3894166157031651)
=> (column=city, value=New York, timestamp=3894166157031651)
=> (column=address, value=301 Park Ave, timestamp=3894166157031651)
Returned 6 results.
This indicates that we have one hotel in New York, New York, but we see six results
because the results are column-oriented, and there are six columns for that row in the
column family. Note that while there are six columns for that row, other rows might
have more or fewer columns.
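
The "one row, six results" behaviour can be imitated in a few lines of Python. This is a toy simulation under my own assumptions (the Column type and both helper functions are hypothetical), not a Cassandra client; it only shows that a read of one row key yields one result per column, each carrying its own timestamp.

```python
from typing import NamedTuple

class Column(NamedTuple):
    name: str
    value: str
    timestamp: int

TS = 3894166157031651  # timestamp copied from the CLI output

# One row of the Hotel column family, stored as a list of columns.
hotel_cf = {
    "NYN_042": [
        Column("zip", "10019", TS),
        Column("state", "NY", TS),
        Column("phone", "212-555-5555", TS),
        Column("name", "The Waldorf=Astoria", TS),
        Column("city", "New York", TS),
        Column("address", "301 Park Ave", TS),
    ],
}

def get_row(column_family, row_key):
    # A missing row simply has no columns.
    return list(column_family.get(row_key, []))

def format_results(columns):
    lines = [
        f"=> (column={c.name}, value={c.value}, timestamp={c.timestamp})"
        for c in columns
    ]
    lines.append(f"Returned {len(columns)} results.")
    return lines

lines = format_results(get_row(hotel_cf, "NYN_042"))
print("\n".join(lines))
```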

Column Family Options
There are a few additional parameters that you can define for each column family. These
are:
keys_cached
The number of row-key locations to keep cached per SSTable. This refers not to
column names or values, but to the keys themselves: the locations of rows in the
column family, kept in memory in least-recently-used order.
rows_cached
The number of rows whose entire contents (the complete list of name/value pairs
for that unique row key) will be cached in memory.
comment
This is just a standard comment that helps you remember important things about
your column family definitions.
read_repair_chance
This is a value between 0 and 1 that represents the probability that read repair
operations will be performed when a query is performed without a specified quorum,
and it returns the same row from two or more replicas and at least one of the
replicas appears to be out of date. You may want to lower this value if you are
performing a much larger number of reads than writes.
preload_row_cache
Specifies whether you want to prepopulate the row cache on server startup.
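
Two of the options above are easier to grasp with a small sketch: the least-recently-used behaviour behind keys_cached, and the probability check behind read_repair_chance. The KeyCache class and should_read_repair function are hypothetical names of my own, and the code only illustrates the ideas; it is not how Cassandra implements them.

```python
import random
from collections import OrderedDict

class KeyCache:
    """Toy LRU cache mapping a row key to its location (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity          # plays the role of keys_cached
        self._locations = OrderedDict()

    def get(self, row_key):
        if row_key not in self._locations:
            return None
        # A hit makes this key the most recently used one.
        self._locations.move_to_end(row_key)
        return self._locations[row_key]

    def put(self, row_key, location):
        self._locations[row_key] = location
        self._locations.move_to_end(row_key)
        if len(self._locations) > self.capacity:
            # Evict the least recently used key.
            self._locations.popitem(last=False)

def should_read_repair(read_repair_chance, rng=random):
    # rng.random() is in [0.0, 1.0), so 0 never triggers and 1 always does.
    return rng.random() < read_repair_chance

cache = KeyCache(capacity=2)
cache.put("AZC_043", 10)
cache.put("AZS_011", 20)
cache.get("AZC_043")          # touch AZC_043 so AZS_011 becomes the oldest
cache.put("CAS_021", 30)      # over capacity: AZS_011 is evicted

print(cache.get("AZS_011"), cache.get("AZC_043"), should_read_repair(0.1))
```

With a capacity of 2, the third put pushes out whichever key was used least recently, which is the same trade-off you are tuning when you pick a value for keys_cached.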



