Tuesday, August 7, 2012

Cassandra replication


About Replication in Cassandra

Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance.
Cassandra stores copies, called replicas, of each row based on the row key. You set the number of replicas when you create a keyspace using the replica placement strategy. In addition to setting the number of replicas, this strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology.
The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes afterwards, as sketched below. When the replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met.
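As a rough sketch of that sequence (the keyspace name demo is hypothetical, and this uses the CQL replication map syntax), you raise the replication factor with ALTER KEYSPACE, add the new nodes, and then repair so the new replicas get populated:

    -- raise the replication factor from 2 to 3 (hypothetical keyspace)
    ALTER KEYSPACE demo
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    # after the new nodes have joined the ring, stream the missing
    # replicas to them (run on each node)
    nodetool repair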
To determine the physical location of nodes and their proximity to each other, the replication strategy also relies on the cluster-configured snitch, which is described below.

Replication Strategy

The available strategies are:

SimpleStrategy

Use SimpleStrategy for simple single data center clusters. This strategy is the default replica placement strategy when creating a keyspace using the Cassandra CLI. See Creating a Keyspace. When using the Cassandra Query Language interface, you must explicitly specify a strategy. See CREATE KEYSPACE.
SimpleStrategy places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering rack or data center location.
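As a minimal example (the keyspace name demo is again hypothetical), creating such a keyspace in CQL looks like this:

    -- single data center keyspace: 3 copies of each row,
    -- placed clockwise around the ring with no rack awareness
    CREATE KEYSPACE demo
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};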


NetworkTopologyStrategy

Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple data centers. This strategy specifies how many replicas you want in each data center.
When deciding how many replicas to configure in each data center, the two primary considerations are (1) being able to satisfy reads locally, without incurring cross-datacenter latency, and (2) failure scenarios. The two most common ways to configure multiple data center clusters are:
  • Two replicas in each data center. This configuration tolerates the failure of a single node per replication group and still allows local reads at a consistency level of ONE.
  • Three replicas in each data center. This configuration tolerates the failure of one node per replication group at a strong consistency level of LOCAL_QUORUM, or tolerates multiple node failures per data center using consistency level ONE.
Asymmetrical replication groupings are also possible. For example, you can have three replicas per data center to serve real-time application requests and use a single replica for running analytics.
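A hedged sketch of that asymmetric layout in CQL (the keyspace and data center names app_data, DC1, and DC2 are hypothetical; the data center names must match what your snitch reports):

    -- 3 replicas in DC1 for real-time requests,
    -- 1 replica in DC2 for analytics
    CREATE KEYSPACE app_data
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC1': 3,
                          'DC2': 1};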
The NetworkTopologyStrategy determines replica placement independently within each data center as follows:
  • The first replica is placed according to the partitioner (same as with SimpleStrategy).
  • Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed on different nodes in the same rack.
NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) can fail at the same time due to power, cooling, or network issues.
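For the strategy to know about racks and data centers at all, the snitch has to supply that topology. As one example, with PropertyFileSnitch you describe every node in conf/cassandra-topology.properties (the IPs, data center names, and rack names below are made up):

    # ip=data center:rack
    175.56.12.105=DC1:RAC1
    175.56.12.106=DC1:RAC2
    120.53.24.101=DC2:RAC1
    120.53.24.102=DC2:RAC1

    # assignment for nodes not listed above
    default=DC1:RAC1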
