Monday, June 27, 2011

Bulk Insert in Cassandra


Binary Memtable is the name of Cassandra's bulk-load interface. It avoids several kinds of overhead associated with the normal Thrift API:
  • Converting to Thrift from the internal structures and back
  • Routing (copying) from a coordinator node to the replica nodes
  • Writing to the commitlog
  • Serializing the internal structures to on-disk format
The tradeoff is that it is considerably less convenient to use than Thrift:
  • You must use the StorageProxy API, which is only available as Java code
  • You must pre-serialize the rows yourself
  • The rows you send are not available for querying until a flush occurs, either normally when the Binary Memtable fills up or because you request one with nodetool
  • You must write an entire row at once
There is an example of using Hadoop to load data through the Binary Memtable interface at https://svn.apache.org/repos/asf/cassandra/trunk/contrib/bmt_example/.
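
To make the "pre-serialize" and "entire row at once" requirements concrete, here is a minimal sketch of the client-side bookkeeping they imply: collect every column for a row key first, then turn each complete row into a single byte payload. It does not call Cassandra at all. The function names and the length-prefixed byte layout are invented for illustration and are not Cassandra's on-disk format or its StorageProxy API; the real loader has to be written in Java, as in the bmt_example linked above.

```python
import struct
from collections import defaultdict


def group_by_row(records):
    """Collect (row_key, column_name, value) tuples into whole rows.

    The bulk-load path expects an entire row in one write, so every
    column for a key is gathered before anything is serialized.
    """
    rows = defaultdict(dict)
    for row_key, column, value in records:
        rows[row_key][column] = value
    return rows


def _pack(data):
    """Prefix a byte string with its length as a 4-byte big-endian int."""
    return struct.pack(">I", len(data)) + data


def serialize_row(row_key, columns):
    """Pack one complete row into bytes.

    Hypothetical layout (NOT Cassandra's format):
    key, column count, then (name, value) pairs sorted by column name.
    """
    parts = [_pack(row_key.encode("utf-8")), struct.pack(">I", len(columns))]
    for name in sorted(columns):
        parts.append(_pack(name.encode("utf-8")))
        parts.append(_pack(columns[name]))
    return b"".join(parts)


def build_payloads(records):
    """Return one serialized payload per row key."""
    rows = group_by_row(records)
    return {key: serialize_row(key, cols) for key, cols in rows.items()}


if __name__ == "__main__":
    # Columns for "user1" arrive interleaved with another row's data.
    records = [
        ("user1", "name", b"Ann"),
        ("user2", "name", b"Bob"),
        ("user1", "age", b"30"),
    ]
    for key, payload in build_payloads(records).items():
        print(key, len(payload), "bytes")
```

Grouping before serializing is the reason a Hadoop job fits this interface well: the reduce phase already receives all values for one key together, which lines up with the one-write-per-row requirement.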
