Tuesday, June 28, 2011

Cassandra Hector Failover


Hector Dev's have added a very simple load balancing feature, as well as improved failover behavior to Hector. Hector is a Java Cassandra client, to read more about it please see my previous post Hector – a Java Cassandra client.
In version 0.5.0-6 I added poor-man’s load balancing as well as improved failover behavior.
The interface CassandraClientPool used to have this method for obtaining clients:
/**
 * Borrows a client from the pool defined by url:port
 * @param url
 * @param port
 * @return
 */
CassandraClient borrowClient(String url, int port)
    throws IllegalStateException, PoolExhaustedException, Exception;
Now with the added LB and failover it has:
/**
 * Borrow a load-balanced client, a random client from the array of given client addresses.
 *
 * This method is typically used to allow load balancing b/w the list of given client URLs. The
 * method will return a random client from the array of the given url:port pairs.
 * The method will try connecting each host in the list and will only stop when there's one
 * successful connection, so in that sense it's also useful for failover.
 *
 * @param clientUrls An array of "url:port" cassandra client addresses.
 *
 * @return A randomly chosen client from the array of clientUrls.
 * @throws Exception
 */
CassandraClient borrowClient(String[] clientUrls) throws Exception;
And usage looks like that:
// Get a connection to any of the hosts cas1, ca2 or cas3
CassandraClient client = pool.borrowClient(new String[] {"cas1:9160", "cas2:9160", "cas3:9160"});
So, when calling borrowClient(String[]) the method randomly chooses any of the clients in the array and connects to it. That’s what I call poor man’s load balancing, just plain dumb random, not real load balancing. By all means, true load balancing which takes into account performance measurements such as response time and throughput is infinitely better than the plain random selection I’m employing here and in my opinion should be left out for your ops folks to deal with and not to the program, however, if you only need a very simplistic approach of random selection, then this method may suite your needs.
A nice side effect of using this method is improved failover. In previous versions hector implemented failover, but in order to find out about the ring structure it had to connect to at least one host in the ring first and query it to learn about the rest. The result was that if a new connection is made and it’s so unfortunate that this new connections is made to unavailable host, then this new client cannot connect to the host to learn about other live hosts so it fails right away. With this new method which sends an array of hosts the client keeps connecting to hosts in the list in random order until it finds one that’s up. In the example above the client may choose to connect to cas2 first; if cas2 is down it’ll try to connect to (say) cas3 and if cas3 is also down it’ll try to connect to cas1; only if all three hosts are down will it give up and return an error. Failing to connect to hosts is considered an error, but a recoverable error, so it’s transparent to the client of hector but is reported to JMX and has its own special counter (RecoverableLoadBalancedConnectErrors).

No comments:

Post a Comment