Tuesday, June 28, 2011

Cassandra Data Model Pitfalls


For me, the main decision is whether to use an order-preserving partitioner (OrderPreservingPartitioner) or the RandomPartitioner.
If you use the RandomPartitioner, range scans over a meaningful range of keys are not possible, because rows are stored and returned in hash order rather than key order. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.
So if you've got a lot of churn, then unless you have some magic way of knowing exactly which keys you've inserted stuff for, with the random partitioner you can easily "lose" stuff. That is a disk space leak, and it will eventually consume all storage.
On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" and it'll tell you. You can then clean them up.
However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.
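To make the imbalance concrete, here is a toy sketch in Python. It is not Cassandra code: the four-node ring, the node tokens, and the `user:NNNN` key format are all made up for illustration, and each node is simply assumed to own every token up to and including its own.

```python
import bisect
import hashlib
from collections import Counter

def owner(token, node_tokens):
    """Index of the first node whose token is >= the row token (wrapping around)."""
    return bisect.bisect_left(node_tokens, token) % len(node_tokens)

def placement(keys, node_tokens, to_token):
    """Count how many keys land on each node of the toy ring."""
    return Counter(owner(to_token(k), node_tokens) for k in keys)

def hashed_token(key):
    """Toy stand-in for a random partitioner: first 4 hex digits of MD5."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:4]

# Hypothetical keys that all share a common prefix, as application keys often do.
keys = ["user:%04d" % i for i in range(1000)]

# Ordered ring: the key itself is the token, so every "user:..." key
# falls between "t" and "z" and lands on the last node.
ordered = placement(keys, ["g", "n", "t", "z"], lambda k: k)

# Hashed ring: tokens are spread over the whole 16-bit hex range.
hashed = placement(keys, ["3fff", "7fff", "bfff", "ffff"], hashed_token)

print("ordered:", dict(ordered))
print("hashed: ", dict(hashed))
```

With these made-up tokens, the ordered ring puts all 1000 keys on one node, while the hashed ring spreads them over all four.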
I don't have any easy answer for this, except that in some cases you can get the "best of both worlds" by putting a short hash value on the beginning of your keys. The hash should be of something you can enumerate easily from other data sources. For example, a 16-bit hash of the user ID, written in hex, gives you 4 hex digits, which you follow with whatever key you really wanted to use. The hash prefix spreads the keys across the key space, while the ordered partitioner still lets you range scan within one prefix.
Then if you had a list of recently-deleted users, you can just hash their IDs and range scan to clean up anything related to them.
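A minimal sketch of the prefixed-key idea in Python, under a few assumptions of mine: MD5 as the hash, `:` as a separator, the user ID repeated after the prefix so that two users with the same 16-bit hash don't get mixed up, and keys that compare as plain ASCII byte strings. None of these names come from Cassandra itself.

```python
import hashlib

def user_prefix(user_id):
    """16-bit hash of the user ID as 4 hex digits (MD5 is an arbitrary choice)."""
    return hashlib.md5(user_id.encode("utf-8")).digest()[:2].hex()

def make_key(user_id, real_key):
    """Row key: hash prefix, then user ID, then the key you really wanted."""
    return "%s:%s:%s" % (user_prefix(user_id), user_id, real_key)

def scan_bounds(user_id):
    """Start (inclusive) and end (exclusive) keys covering one user's rows.

    ';' is the ASCII character right after ':', so the end key sorts
    after every key that begins with '<prefix>:<user_id>:'.
    """
    base = "%s:%s" % (user_prefix(user_id), user_id)
    return base + ":", base + ";"

# Clean-up for a deleted user: recompute the bounds from the user ID alone,
# then hand them to whatever range scan your client library offers.
start, end = scan_bounds("alice")
print(make_key("alice", "profile"), start, end)
```

The point is that the bounds are derived purely from the user ID, so a list of recently-deleted users is enough to find every row that belongs to them.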
The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or store a pointer from one key to the other. These pointers also need to be cleaned up when the thing they point to no longer exists, but there's no easy way of querying for dangling pointers, so your app needs to Just Remember.
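Here is roughly what "Just Remember" looks like, sketched in Python with two plain dicts standing in for two column families. The class and field names are hypothetical. The thing to notice is that every write and delete has to touch the pointer as well as the data.

```python
class IndexedStore:
    """Toy model: one 'column family' for data, one for email -> ID pointers."""

    def __init__(self):
        self.users_by_id = {}   # primary rows, keyed by user ID
        self.id_by_email = {}   # hand-maintained pointers, keyed by email

    def put_user(self, user_id, email, data):
        old = self.users_by_id.get(user_id)
        # If the email changed, drop the stale pointer first.
        if old is not None and old["email"] != email:
            if self.id_by_email.get(old["email"]) == user_id:
                del self.id_by_email[old["email"]]
        self.users_by_id[user_id] = {"email": email, "data": data}
        self.id_by_email[email] = user_id

    def get_by_email(self, email):
        user_id = self.id_by_email.get(email)
        if user_id is None:
            return None
        return self.users_by_id.get(user_id)

    def delete_user(self, user_id):
        row = self.users_by_id.pop(user_id, None)
        # The app must remember to remove the pointer too, or it dangles.
        if row is not None and self.id_by_email.get(row["email"]) == user_id:
            del self.id_by_email[row["email"]]
```

In a real cluster the two writes are not atomic, so a crash or bug between them is exactly how orphaned pointers appear.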
And application bugs may leave orphaned keys that you've forgotten about, and you'll have no way of easily detecting them, unless you write some garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
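A chunked garbage collector along those lines might look like the Python sketch below. A sorted list stands in for the database, and `scan_chunk` and `is_needed` are hypothetical helpers: in practice the first would be a paged range scan through your client library, and the second a check against whatever other data source says the key is still wanted.

```python
import bisect

def scan_chunk(sorted_keys, start_after, limit):
    """Return up to `limit` keys that sort strictly after `start_after`."""
    i = 0 if start_after is None else bisect.bisect_right(sorted_keys, start_after)
    return sorted_keys[i:i + limit]

def find_orphans(sorted_keys, is_needed, chunk_size=100):
    """Walk every key in chunks and collect those that are no longer needed."""
    orphans = []
    last_seen = None
    while True:
        chunk = scan_chunk(sorted_keys, last_seen, chunk_size)
        if not chunk:
            break
        orphans.extend(k for k in chunk if not is_needed(k))
        # Resume the next chunk just after the last key we looked at.
        last_seen = chunk[-1]
    return orphans

live = {"a", "c", "e"}
print(find_orphans(["a", "b", "c", "d", "e"], lambda k: k in live, chunk_size=2))
```

Because each chunk resumes from the last key seen, the scan can be stopped and restarted later as long as you save that key.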
None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.
