Tuesday, September 18, 2012

Apache Cassandra - Load Balancing the cluster


Load balancing


When adding new nodes to the cluster, the data does not automatically get shared across new nodes equally and share load proportionately. This will make the cluster completely unbalanced.
 In order to make the data get shared equally we need to shift the token range some by using the nodetool move command.
Token ranges must be calculated in a way that will make sharing of data almost equal in each of the node.
Here's a python program which can be used to calculate new tokens for the nodes. 
  • def tokens(nodes):
    • for x in xrange(nodes):
      • print 2 ** 127 / nodes * x
Run this py program for each node and try the nodetool ring command . This will tell you the load on each of the node connected to the cluster. Run the program on each node again if needed until you see the nodetool ring display the load to be shared equally between each node.

In versions of Cassandra 0.7.* and lower, there's also nodetool loadbalance: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.
The status of move and balancing operations can be monitored using nodetool with the netstat argument. (Cassandra 0.6.* and lower use the streams argument).

Apache Cassandra - Moving a node to a different token range


Moving nodes


nodetool move: move the target node to a given Token. Moving is both a convenience over and more efficient than decommission + bootstrap. After moving a node, nodetool cleanup should be run to remove any unnecessary data.
As with bootstrap, see Streaming for how to monitor progress.


Streaming :

Transfer


The following steps occur for Stream Transfers.
  1. Source has a list of ranges it must transfer to another node.
  2. Source copies the data in those ranges to sstable files in preparation for streaming. This is called anti-compaction (because compaction merges multiple sstable files into one, and this does the opposite).
  3. Source builds a list of PendingFile's which contains information on each sstable to be transfered.
  4. Source starts streaming the first file from the list, followed by the log "Waiting for transfer to $some_node to complete". The header for the stream contains information on the streamed file for the Destination to interpret what to do with the incoming stream.
  5. Destination receives the file writes it to disk and sends a FileStatus.
  6. On successful transfer the Source streams the next file until its done, on error it re-streams the same file.

Request


  1. Destination compiles a list of ranges it needs from another node.
  2. Destination sends a StreamRequestMessage to the Source node with the list of ranges.
  3. Source prepares the SSTables for those ranges and creates the PendingFile's.
  4. Source starts streaming the first file in the list. The header for the first stream contains info of the current stream and a list of the remaining PendingFile's that fall in the requested ranges.
  5. Destination receives the stream and writes it to disk, followed by the log message "Streaming added org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard1-e-1-Data.db')".
  6. Destination then takes the lead and requests the remaining files one at a time. If an error occurs it re-requests the same file, if not continues with the next file until done.
  7. Source streams each of the requested files. The files are already anti-compacted, so it just streams them to the Destination.

Apache Cassandra - How to remove nodes from a live cluster

If you have decided to remove a node from a live cluster you can follow one of the two options based on which one suits you better.

Nodetool is available in the cassandra home directory

{CASSANDRA_HOME}/bin/nodetool decommission
{CASSANDRA_HOME}/bin/nodetool removetoken

Removing nodes entirely


* Decommission :

You can take a node out of the cluster with nodetool decommission to a live node. This will assign the ranges the old node was responsible for to other nodes, and replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned node. 
No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at a different token on the ring, it should be removed manually.

* Remove Token :

You can execute nodetool removetoken on any available machine to remove a dead one. This will assign the token range the old node was applicable to other nodes. If removetoken is used, the data will stream from the remaining replicas.