Cassandra nodetool repair

Cassandra nodetool provides a command to run repair on a specified endpoint. The repair command uses Anti-Entropy service which detects inconsistencies and repairs data across all replicas of the specified endpoint. Cassandra provides high availability and durability by storing multiple copies (replicas) of data. Data inconsistencies can occur due to server, disk failures or network partitions. 

nodetool repair is executed on a specified endpoint in the cluster and provides several options to control the amount of data to be repaired. Data is organized in to keyspaces and column families and distributed to the nodes using data partitioning scheme. User can execute repair on a specified column family or all column families in a specified keyspace or repair all column families from all keyspaces.

At storage layer, each column family data is stored separately in SSTables. A column family contains a set of rows. Each row is mapped to a token using its key and it is assigned to a node using data partitioning scheme. Rows in a column family are sorted based on their token order in SSTables. Rows are replicated to other nodes using replication strategy and the number of replicas are determined by replication factor. Each keyspace can have different replication strategies, so Cassandra repairs each keyspace data separately.

In order to perform repair, we need to identify replicas of a row.  Each row can have a different set of replicas based on its token. But all the rows in a token range will have same set of replicas. So instead of repairing each row separately, Cassandra repairs rows of a token range together.
command t
Cassandra uses token ranges to split a node repair activity in to small repair jobs. By default all token ranges assigned to a given node are repaired. These token ranges include primary partition ranges as well as replication ranges. nodetool repair -pr command can be used to repair only primary partition ranges of a node.

The target node which receives repair request from nodetool repair command is also called Repair coordinator node.  Repair coordinator node creates repair sessions for each token range being repaired and each repair session is assigned with a unique UUID.
A repair session (RepairSession) is responsible to perform repair on all the column families in a specified keyspace. it creates repair jobs (RepairJob) for each column family to be repaired and adds jobs to a queue. A repair job repairs a single token range across all replicas associated to its token range. Repair job contains two phases. The first phase is Validation phase in which Merkle trees are built at each replica. The second phase is called Sync phase where each replica's Merkle tree is compared with each other replica's Merkle tree and the differing rows are synced. Repair jobs in a repair session are either can be executed sequentially or in parallel.

Validation Phase

In validation phase, repair coordinator node requests Merkle tree from each replica for the token range being repaired.  Each replica builds a Merkle tree that contains hash values of rows from the requested column family and token range. A validation compaction is triggered for the requested column family. For each row, a MD5 hash value of its contents is computed and added to Merkle tree. It is not feasible to construct a Merkle tree to its maximum hash depth to include each row hash value in a separate leaf node. Cassandra builds Merkle tree of variable depth based on the estimated number of keys in the requested token range. It limits the max depth of Merkle tree to 20. Each leaf node's hash value is computed by mixing hash values of rows that belong to its sub token range. If a leaf node's sub token range do not contain any rows then an empty hash value is used as its hash value.

Sync Phase

In this phase, received Merkle trees are compared. Each replica's Merkle tree is compared with that of each other replica and differing sub token ranges identified between replicas. If root hash values of two Merkle trees match then the data in the token range being compared is consistent. If root hash values are not equal then recursively hash values of left branches are compared followed by hash values of right branches until all the nodes in trees are traversed. For each differing sub token range, streaming repair is performed between the replicas to sync up the data.

No comments:

Post a Comment