CockroachDB - Introduction - Basic Concepts

  • Keyspace - the basic model of cluster data: a single ordered set of key-value pairs, with the full path to each record (which table/index it lives in plus the primary key of the row) encoded in the key part of each entry
  • The keyspace is divided into ranges - when a range grows beyond a certain limit (512 MiB by default in recent versions; older releases used 64 MB), it gets split further, and so on
  • Replica - a copy of a range; replicas are distributed among the nodes in the cluster to keep the cluster balanced
  • Replication factor - the number of replicas of each range in the cluster; 3 by default in CockroachDB
  • The replication factor will in general be an odd number so that a majority (quorum) can always be formed for voting, which keeps the cluster available during failures; it is simply the number of times a range is replicated in the cluster
  • The replication factor can be set differently for each table/database (see the sketch below)
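
A rough sketch of what that per-database/per-table setting looks like in practice, assuming a local insecure cluster on localhost:26257, a hypothetical bank database with an accounts table, and the lib/pq Postgres-wire driver (none of these names come from the notes above; CONFIGURE ZONE USING num_replicas is the standard statement for changing the replication factor):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the pgwire protocol
)

func main() {
	// Hypothetical connection string: a local, insecure, single-node cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Replication factor for a whole (hypothetical) database...
	if _, err := db.Exec(`ALTER DATABASE bank CONFIGURE ZONE USING num_replicas = 5`); err != nil {
		log.Fatal(err)
	}
	// ...and a different one for a single table inside it.
	if _, err := db.Exec(`ALTER TABLE bank.accounts CONFIGURE ZONE USING num_replicas = 7`); err != nil {
		log.Fatal(err)
	}
}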



  • Raft is an algorithm that allows a distributed set of servers to agree (arrive at consensus) on values without losing the record of those values, even in the face of node failure. CockroachDB uses it to perform all writes. 

  • Recall that CockroachDB organizes its data into a keyspace which is divided into ranges and distributes replicas of each range throughout the cluster based on the replication factor.
  • In CockroachDB, each range defines a Raft group. 
  • If the cluster has seven ranges, there will be seven Raft groups. (Number of ranges = Number of Raft groups)
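
A quick way to see that equivalence for one table is to count its ranges. This is purely illustrative and reuses the same hypothetical bank.accounts table, local insecure cluster, and lib/pq driver as the sketch above:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Each range backing the (hypothetical) table is its own Raft group,
	// with its own leader and followers.
	var ranges int
	if err := db.QueryRow(`SELECT count(*) FROM [SHOW RANGES FROM TABLE bank.accounts]`).Scan(&ranges); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bank.accounts spans %d ranges => %d Raft groups\n", ranges, ranges)
}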

  • CockroachDB has a concept called a lease, which it assigns to one of a range's replicas; that replica is called the leaseholder.

  • The leaseholder's job is to serve reads on its own, bypassing Raft, while also keeping track of write commits so that it knows not to show writes until they are durable. 

  • Replicas play two Raft roles
    1. Leaders - "coordinate" the distributed write process
    2. Followers - "assist" in the distributed write process; each follower replicates commands from the leader onto its own Raft log

CockroachDB's "Distributed Write process"

  1. Leaseholder (receives the write and tells the Leader to begin the write process) --> Leader
  2. Leader (first appends the command to its own Raft log)
  3. Leader (then proposes the write to each Follower) --> Followers
  4. Followers (each Follower replicates the command onto its own Raft log and acknowledges)
If a follower doesn't see a heartbeat from the leader within a randomized timeout, it declares itself a candidate and calls for an election; a majority vote makes it the new leader. This election process takes seconds.

  • Raft log - a kind of transaction log: an ordered set of commands on disk, appended to as writes occur. Every replica has its own Raft log.

  • Writes are kicked off by the leaseholder, which tells the leader to begin the process whenever there is an INSERT/write. 
  • The leader first appends the command to its Raft log. 
  • The leader then proposes the write to the followers. 
  • Each follower will replicate the command on its own Raft log.
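
The following is a toy, in-memory sketch of that flow -- not CockroachDB's actual code -- just to make the "append to own Raft log, propose to followers, commit on majority" sequence concrete. All names here (Replica, RaftGroup, write) are made up for illustration:

package main

import "fmt"

// Replica is a toy stand-in for one copy of a range; its raftLog is the
// ordered set of commands that have been appended to it so far.
type Replica struct {
	name    string
	raftLog []string
}

func (r *Replica) appendCmd(cmd string) {
	r.raftLog = append(r.raftLog, cmd)
}

// RaftGroup is one leader plus its followers (one group per range).
type RaftGroup struct {
	leader    *Replica
	followers []*Replica
}

// write models the flow above: the leader appends the command to its own Raft
// log, proposes it to the followers, and the write counts as committed once a
// majority of the group has appended it. (In the real system the remaining
// followers catch up asynchronously; that part is not modeled here.)
func (g *RaftGroup) write(cmd string) bool {
	g.leader.appendCmd(cmd) // leader appends to its own Raft log first
	acks := 1               // the leader counts toward the majority
	quorum := (1+len(g.followers))/2 + 1
	for _, f := range g.followers {
		f.appendCmd(cmd) // follower replicates the command onto its own Raft log
		acks++
		if acks >= quorum {
			return true // majority reached: the write is durable/committed
		}
	}
	return acks >= quorum
}

func main() {
	// Three replicas of one range: a leader and two followers.
	group := &RaftGroup{
		leader:    &Replica{name: "n1"},
		followers: []*Replica{{name: "n2"}, {name: "n3"}},
	}
	if group.write("INSERT row k=42") {
		fmt.Println("write committed; the leaseholder may now show it to readers")
	}
}

In the real cluster the leaseholder withholds the new value from readers until this commit point, which is exactly the "don't show writes until they're durable" job described above.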


Some important terms/keywords:
##########################################
Raft
Leader
Follower
Lease
Leaseholder - the replica holding the range's lease; ensures that readers only see committed writes
Raft group - the replicas of a range taken together; the group elects one leader
Node: A single system in the Cluster
Cluster: Set of Nodes
Range: a contiguous chunk of table data stored as key-value pairs; 512 MiB max by default
Gateway: whichever node in the cluster the client connects to; it routes queries to the appropriate nodes
Load Balancer: avoids the problem of finding a new gateway when a node goes away, by spreading client connections across the nodes
Fragile state: the state of a cluster while some ranges are under-replicated (e.g. right after a node failure)
Up replicate: create additional replicas of under-replicated ranges on other nodes to restore the replication factor
Under replicated ranges: ranges that currently have fewer replicas than the replication factor requires
##########################################
Node is declared dead after 5 minutes by default.
The cluster recovers to a resilient state through up-replication.
Increasing the replication factor to 5 or more improves cluster resiliency.

Raft concepts were explained through a wonderful animation in this link - The Secret Lives of Data 






####################
Node: 
the smallest unit of compute required: a single running instance of the CockroachDB software.
A node can be given a "locality", which helps with data residency and with deciding how best to store/query data distributed across geographies.
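
Locality is assigned when a node is started (the --locality flag), and the SHOW LOCALITY statement reports the locality tiers of the node the connection landed on (the gateway). A minimal sketch, assuming a local insecure cluster and the lib/pq driver:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Reports the locality tiers (e.g. region=...,zone=...) that the node this
	// connection landed on (the gateway) was started with via --locality.
	var locality string
	if err := db.QueryRow(`SHOW LOCALITY`).Scan(&locality); err != nil {
		log.Fatal(err)
	}
	fmt.Println("gateway node locality:", locality)
}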

Cluster:
a group of interconnected nodes working together to balance/rebalance data and provide read/write capacity to customers.

Range: a.k.a. a shard in other databases; a chunk of data stored as key-value (KV) pairs.
A contiguous block/chunk of table data; the default maximum size is 512 MiB. 
A table is split into ranges that are distributed across the nodes in the cluster by way of range partitioning, NOT hash partitioning. 
No explicit sharding is needed
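
The split threshold is exposed through the range_min_bytes/range_max_bytes zone-config variables. The sketch below (same hypothetical bank.accounts table and local insecure cluster as earlier) lowers the maximum range size so splits happen earlier; it is illustrative only, not something these notes prescribe:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the split threshold for a hypothetical table: ranges now split once
	// they grow past 64 MiB instead of the 512 MiB default (min must stay below max).
	_, err = db.Exec(`ALTER TABLE bank.accounts CONFIGURE ZONE USING
		range_min_bytes = 16777216, range_max_bytes = 67108864`)
	if err != nil {
		log.Fatal(err)
	}
}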

Replica: simply a copy of a range.
Each range is replicated at least 3 times across different nodes.
The number of copies of each range is determined by the "replication factor", which can be configured at the database, table, index, or partition level.
The default "replication factor" is 3 for a single-region cluster
and 5 for a multi-region cluster.

One replica of each range holds a role called the "lease holder", which serves reads and coordinates writes for that range.

Gateway Node: can be any node behind the load balancer; it is whichever node the client connection lands on

Authoritative: Data from lease holder 
Non Authoritative: Data from non lease holder

Client connection: READ
1. Client Connection is established to "Gateway Node"
2. Gateway Node sends read request to "lease holder"
3. "lease holder" retrieves data and responds to "Gateway Node"
4. "Gateway Node" responds to "Client"

Client connection: WRITE
1. Client Connection is established to "Gateway Node"
2. "Gateway Node" sends write request to "lease holder"
3. "lease holder" sends write to "Replicas"/"followers" 
4. Once a majority of replicas have the write (with 3 replicas, the "lease holder" plus the first "follower" to send an ACK), the write is committed
5. "lease holder" informs "Gateway Node" that the write is complete. Synchronous till here
6. "Gateway Node" responds to "Client"
7. The remaining "replicas" apply the write asynchronously from here on

If a node is down/lost, it will trigger rebalancing of data.
By default, CRDB waits for 5 minutes (configurable) before declaring the node dead. 
Once a node is declared dead, its under-replicated ranges are re-replicated onto the remaining nodes to satisfy the "num_replicas constraint"/replication factor.
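
That 5-minute window is the server.time_until_store_dead cluster setting. A minimal sketch of changing it (raised here to a hypothetical 10 minutes, on a local insecure cluster, e.g. to ride out longer maintenance windows before up-replication kicks in):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Default is 5m0s; nodes unreachable longer than this are declared dead
	// and their ranges get up-replicated elsewhere.
	if _, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '10m0s'`); err != nil {
		log.Fatal(err)
	}
}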

In CRDB, not only the data but also SQL execution is distributed. Execution usually takes place as close to the data as possible.
CRDB has a cost-based optimizer that is aware of distributed SQL and of geography.
Writes use pipelined, parallel execution, which makes them faster.
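
EXPLAIN shows the plan the cost-based optimizer picked before it is distributed and executed near the data. A minimal sketch, assuming the same hypothetical accounts table; the EXPLAIN output columns vary across versions, so the rows are printed generically:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/bank?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The plan rows describe how the optimizer will scan/aggregate the table.
	rows, err := db.Query(`EXPLAIN SELECT count(*) FROM accounts`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		vals := make([]any, len(cols))
		ptrs := make([]any, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		line := make([]string, len(vals))
		for i, v := range vals {
			if b, ok := v.([]byte); ok {
				line[i] = string(b)
			} else {
				line[i] = fmt.Sprint(v)
			}
		}
		fmt.Println(strings.Join(line, " | "))
	}
}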

CRDB uses an HLC (Hybrid Logical Clock), which combines a physical component (wall-clock time, i.e. time elapsed) with a logical component (an event counter, e.g. a count of messages).
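
A simplified sketch of the classic HLC update rules (not CRDB's exact implementation), just to make the physical + logical split concrete:

package main

import (
	"fmt"
	"time"
)

// HLC is a toy Hybrid Logical Clock: a physical component (highest wall-clock
// time observed so far) plus a logical counter that orders events falling
// within the same physical tick.
type HLC struct {
	wall    int64 // highest physical time observed so far (nanoseconds)
	logical int32 // event counter within that physical time
	now     func() int64
}

func NewHLC() *HLC {
	return &HLC{now: func() int64 { return time.Now().UnixNano() }}
}

// Now is used for local events and for timestamps attached to outgoing messages.
func (c *HLC) Now() (int64, int32) {
	pt := c.now()
	if pt > c.wall {
		c.wall, c.logical = pt, 0 // physical clock moved forward: reset the counter
	} else {
		c.logical++ // same (or older) physical tick: bump the logical counter
	}
	return c.wall, c.logical
}

// Update merges a timestamp received from another node so that causality
// (a message is received after it was sent) shows up in later timestamps.
func (c *HLC) Update(wall int64, logical int32) {
	pt := c.now()
	switch {
	case pt > c.wall && pt > wall:
		c.wall, c.logical = pt, 0 // local physical clock is ahead of both
	case wall > c.wall:
		c.wall, c.logical = wall, logical+1 // sender's clock is ahead: adopt it
	case c.wall > wall:
		c.logical++ // our clock is ahead: just count the event
	default: // equal physical components: keep the larger counter, then count
		if logical > c.logical {
			c.logical = logical
		}
		c.logical++
	}
}

func main() {
	clock := NewHLC()
	w, l := clock.Now()
	fmt.Println("local event:   ", w, l)
	clock.Update(w+5_000_000, 3) // pretend a message arrived from a node slightly ahead
	w, l = clock.Now()
	fmt.Println("after receive: ", w, l)
}

The point of the hybrid scheme is that nodes whose physical clocks differ slightly can still assign timestamps that respect the order of messages between them.
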
######################

