Getting started with cassandra: Bulk Insert Cassandra

Monday, June 27, 2011

Bulk Insert Cassandra

Binary Memtable is the name of Cassandra's bulk-load interface. It avoids several kinds of overhead associated with the normal Thrift API:

Converting to Thrift from the internal structures and back
Routing (copying) from a coordinator node to the replica nodes
Writing to the commitlog
Serializing the internal structures to on-disk format

The tradeoff you make is that it is considerably less convenient to use than Thrift:

You must use the StorageProxy API, which is only available from Java code
You must pre-serialize the rows yourself
The rows you send are not live for querying until a flush occurs (either normally because the Binary Memtable fills up, or because you request one with nodetool)
You must write an entire row at once
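
To make the "pre-serialize" and "entire row at once" constraints concrete, here is a minimal sketch of turning one complete row into a single byte array before handing it off. The byte layout below is hypothetical and is not Cassandra's real on-disk format, and the names RowSketch and serializeRow are invented for illustration. Real code would use Cassandra's own serializer classes, as the bmt_example contrib code does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.SortedMap;

/** Illustration only: a hypothetical layout, NOT Cassandra's on-disk format. */
final class RowSketch {
    private RowSketch() {}

    /**
     * Serializes one complete row into a single buffer: the key, the column
     * count, then every column (name, value, timestamp) in sorted name order.
     * The whole row is built up front because it must be written at once.
     */
    static byte[] serializeRow(String key, SortedMap<String, String> columns, long timestamp) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeLengthPrefixed(out, key);
            out.writeInt(columns.size());
            for (Map.Entry<String, String> column : columns.entrySet()) {
                writeLengthPrefixed(out, column.getKey());
                writeLengthPrefixed(out, column.getValue());
                out.writeLong(timestamp);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Writes a 4-byte big-endian length followed by the UTF-8 bytes of the string. */
    private static void writeLengthPrefixed(DataOutputStream out, String s) throws IOException {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(raw.length);
        out.write(raw);
    }
}
```

A caller would collect all of a row's columns in a sorted map, call serializeRow once, and send the resulting bytes as a unit. Nothing in this sketch appends columns to a row that has already been sent, which mirrors the whole-row restriction above.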

The example at https://svn.apache.org/repos/asf/cassandra/trunk/contrib/bmt_example/ uses Hadoop to load data through the Binary Memtable interface.
