Tuesday, June 28, 2011

Cassandra Hector throwing error even after Cassandra is up again


Hector state when a Cassandra node goes down

Hi,

I am using hector-0.7.0-26 in my application, which persists data over
the network to a Cassandra cluster of 3 nodes.
Everything works well until I manually bring down one of the Cassandra
nodes, which I need to do as part of negative testing on my
application. After the node goes down, I see Hector exceptions on the
client side, and it seems that Hector is trying to connect to the
downed host.
The exception trace below is seen repeatedly in the logs.
My question is: when Hector sees that a node is down, doesn't it close
the connections to that node and stop trying until it detects that the
node is up again?
What should be done on the client side (while using Hector) to ensure
that Hector cleans up the connections to a dead node and stops trying
to reuse it?

[pool-1-thread-1] ERROR (HThriftClient.java:88) - Unable to open transport to asp.corp.apple.com(17.108.122.70):9162
org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
 at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
 at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
 at me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:84)
 at me.prettyprint.cassandra.connection.CassandraHostRetryService$RetryRunner.verifyConnection(CassandraHostRetryService.java:114)
 at me.prettyprint.cassandra.connection.CassandraHostRetryService$RetryRunner.run(CassandraHostRetryService.java:94)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.ConnectException: Connection refused
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
 at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
 at java.net.Socket.connect(Socket.java:529)
 at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
 ... 13 more

Actually, this is working as intended. The following line from the
stack trace:

me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:

indicates that this is the host retry service (running in a background
thread) attempting to connect to the downed host every 10 seconds (by
default); the CassandraHostRetryService$RetryRunner frames in the same
trace show this as well. What was just fixed in master and on the tip
of 0.7.0 was an issue with incorrect handling of UnavailableException
on a consistency level failure. That fix will be released at some
point today (marked as 0.7.0-28).

Looking at the above trace again, that error could probably be clearer
about what is going on. I'll clean that up today as well.
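
For readers who want to picture the retry behaviour described above,
here is a minimal sketch of the general technique: a background thread
that periodically probes hosts marked as down and returns them to the
live set once a connection succeeds. This is not Hector's
CassandraHostRetryService code. The class and method names
(DownedHostRetrySketch, markDown, retryOnce, canConnect), the 2 second
connect timeout, and the use of Java 8 features are all assumptions
made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch of a downed-host retry loop; NOT Hector's implementation.
class DownedHostRetrySketch {
    private final Set<InetSocketAddress> downed = ConcurrentHashMap.newKeySet();
    private final Set<InetSocketAddress> live = ConcurrentHashMap.newKeySet();
    private final Predicate<InetSocketAddress> probe;

    // Single daemon thread, so the retry loop never blocks application threads
    // and never keeps the JVM alive on its own.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "downed-host-retry");
                t.setDaemon(true);
                return t;
            });

    // The probe is injectable so the loop can be exercised without a network.
    DownedHostRetrySketch(Predicate<InetSocketAddress> probe) {
        this.probe = probe;
    }

    // Default probe: try to open a plain TCP connection (assumed 2 s timeout).
    static boolean canConnect(InetSocketAddress host) {
        try (Socket socket = new Socket()) {
            socket.connect(host, 2000);
            return true;
        } catch (IOException e) {
            // "Connection refused" lands here; the host stays in the downed set.
            return false;
        }
    }

    // Called when a request to the host fails: stop handing it out to callers.
    void markDown(InetSocketAddress host) {
        live.remove(host);
        downed.add(host);
    }

    // One pass over the downed hosts; hosts that answer go back to the live set.
    void retryOnce() {
        for (InetSocketAddress host : downed) {
            if (probe.test(host)) {
                downed.remove(host);
                live.add(host);
            }
        }
    }

    // Run retryOnce repeatedly, waiting delaySeconds between passes.
    void start(long delaySeconds) {
        scheduler.scheduleWithFixedDelay(
                this::retryOnce, delaySeconds, delaySeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    boolean isLive(InetSocketAddress host) {
        return live.contains(host);
    }

    boolean isDown(InetSocketAddress host) {
        return downed.contains(host);
    }
}
```

With this sketch, something like
new DownedHostRetrySketch(DownedHostRetrySketch::canConnect).start(10)
would mirror the 10 second default mentioned above. A failed probe only
returns false here; a service that logs each failed attempt would
produce the kind of repeated trace shown in the question.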
