Q4: How is this guaranteed? Can there be intermittent failures when a host that owns shard 1 goes down? How is that handled? How long can this delay be? I assume that during this delay new workflows on this shard can't be created.
We use ringpop, a gossip-based membership library. It is responsible for detecting host failures and updating the cluster membership in a timely manner. The delay is usually within a second. The frontend host retries a request against another host if shard membership changed during the request, and the retry usually goes through. The SDK client code also retries requests if needed. We've seen use cases with pretty tight SLAs on workflow starts, and we were able to ensure high availability during cluster deployments with a 3-second external client timeout.
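For illustration, here is a minimal Go SDK sketch of starting a workflow under a tight 3-second client-side deadline, as in the scenario above; the server address, workflow ID, task queue, and workflow name are hypothetical.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the Temporal frontend; the address is illustrative.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	// Bound the start call with a tight external timeout (3s, as mentioned above).
	// The SDK retries transient failures (e.g. a shard moving hosts) within this budget.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	we, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
		ID:        "order-12345",      // hypothetical workflow ID
		TaskQueue: "order-task-queue", // hypothetical task queue
	}, "OrderWorkflow") // workflow registered elsewhere; name is illustrative
	if err != nil {
		log.Fatalf("unable to start workflow: %v", err)
	}
	log.Printf("started workflow: WorkflowID=%s RunID=%s", we.GetID(), we.GetRunID())
}
```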
Q5: If a history host goes down, then who gets the shards that were on that host? Also, how are the shards rebalanced if we scale hosts up or down?
Whichever host owns them according to the consistent hashing algorithm. Every time the cluster membership changes, the shards are rebalanced. Consistent hashing ensures that the majority of the shards stay on their current hosts during resharding.
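As a rough illustration of that shard-to-host mapping, here is a self-contained consistent-hash-ring sketch in Go; it is not Temporal's actual ringpop-based implementation, and the host names and shard counts are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a toy consistent-hash ring: each host gets several virtual points on a
// 32-bit circle; a shard is owned by the first point clockwise from its hash.
type Ring struct {
	points []uint32
	owner  map[uint32]string
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(hosts []string, replicas int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, host := range hosts {
		for i := 0; i < replicas; i++ {
			p := hash(fmt.Sprintf("%s#%d", host, i))
			r.points = append(r.points, p)
			r.owner[p] = host
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the host that currently owns the given shard ID.
func (r *Ring) Owner(shardID int) string {
	h := hash(fmt.Sprintf("shard-%d", shardID))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"history-1", "history-2", "history-3"}, 16)
	for shard := 1; shard <= 8; shard++ {
		fmt.Printf("shard %d -> %s\n", shard, ring.Owner(shard))
	}
	// Adding or removing one host only remaps the shards whose nearest point
	// belonged to that host; the majority keep their current owner.
}
```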
Q6: On shard ownership, how is it guaranteed that the range id will be unique for each host? If there's a conflict there, it could lead to multiple writers for a workflow.
The shard hashing and the gossip-based cluster membership can lead to situations where more than one host believes it owns a shard. To avoid inconsistencies, every update to the history service database is protected by a conditional update on the so-called range-id value. So Temporal relies on the DB to ensure 100% consistency in these situations. This video contains some details about this mechanism.
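The following Go sketch shows the general compare-and-swap idea behind rangeID protection, assuming a hypothetical SQL schema with `shards` and `executions` tables; it is not Temporal's real persistence code or schema.

```go
// Package shard sketches the rangeID conditional-update pattern described above.
package shard

import (
	"database/sql"
	"errors"
)

var ErrShardOwnershipLost = errors.New("shard ownership lost: rangeID changed")

// Acquire bumps the shard's rangeID with a compare-and-swap. If another host
// acquired the shard first, the expected rangeID no longer matches and the
// update affects zero rows.
func Acquire(db *sql.DB, shardID int, host string, expectedRangeID int64) (int64, error) {
	newRangeID := expectedRangeID + 1
	res, err := db.Exec(
		`UPDATE shards SET range_id = ?, owner_host = ?
		  WHERE shard_id = ? AND range_id = ?`,
		newRangeID, host, shardID, expectedRangeID,
	)
	if err != nil {
		return 0, err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return 0, ErrShardOwnershipLost
	}
	return newRangeID, nil
}

// WriteExecution persists workflow state only while this host's rangeID is still
// current, so a stale owner can never overwrite data written by the new owner.
func WriteExecution(db *sql.DB, shardID int, ownedRangeID int64, workflowID string, state []byte) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Lock the shard row and re-check the rangeID inside the transaction.
	var current int64
	if err := tx.QueryRow(
		`SELECT range_id FROM shards WHERE shard_id = ? FOR UPDATE`, shardID,
	).Scan(&current); err != nil {
		return err
	}
	if current != ownedRangeID {
		return ErrShardOwnershipLost // another host took over; stop writing
	}
	if _, err := tx.Exec(
		`UPDATE executions SET state = ? WHERE shard_id = ? AND workflow_id = ?`,
		state, shardID, workflowID,
	); err != nil {
		return err
	}
	return tx.Commit()
}
```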
Q7: What happens if a shard writes the workflow data to the persistence layer and moves the task to the transfer queue, but dies before it could move the task to the matching service's task list? Are the Cadence workers meant to handle these scenarios?
The new host picks up the shard and then processes all unacknowledged tasks from the transfer queue, so the task is redelivered to the matching engine. Note that this doesn't mean workers will receive a duplicate, as tasks are deduped before being sent to a worker.
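Below is a deliberately simplified in-memory sketch of that replay-then-dedupe idea. All names are illustrative, and in the real system a replayed task's validity is checked against the workflow's persisted state rather than an in-memory set.

```go
// Package transfer sketches re-processing of unacknowledged transfer tasks
// after a shard moves to a new owner, with dedup of already-dispatched tasks.
package transfer

// Task is a minimal stand-in for a transfer task.
type Task struct {
	TaskID     int64
	WorkflowID string
}

// Queue holds tasks ordered by TaskID together with the last acked TaskID.
type Queue struct {
	tasks    []Task // ordered by TaskID
	ackLevel int64  // highest TaskID known to be fully processed
}

// Unacked returns all tasks above the ack level; after a shard moves, the new
// owner replays these, so some may already have been delivered once.
func (q *Queue) Unacked() []Task {
	var out []Task
	for _, t := range q.tasks {
		if t.TaskID > q.ackLevel {
			out = append(out, t)
		}
	}
	return out
}

// Deduper drops tasks that were already dispatched to a worker, so a replayed
// transfer task does not turn into a duplicate task for the worker.
type Deduper struct {
	dispatched map[int64]bool
}

func NewDeduper() *Deduper { return &Deduper{dispatched: map[int64]bool{}} }

// ShouldDispatch returns true the first time a given TaskID is seen.
func (d *Deduper) ShouldDispatch(t Task) bool {
	if d.dispatched[t.TaskID] {
		return false
	}
	d.dispatched[t.TaskID] = true
	return true
}
```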
Q8: In the video, Samar says that if a task is lost, it'll cause a timeout, and the workflows would need to handle those scenarios. What timeout field is being talked about here? Can we control it? What happens if that value is very high, say 10 years? Does the workflow stay stuck for 10 years?
It depends on the task type. The relevant activity timeout is StartToClose. All the timeouts are configurable by the workflow code. If an activity's StartToClose timeout is set to 10 years (and the workflow run timeout is set to the same or a larger value), the workflow will wait for that timeout for 10 years.
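For reference, this is roughly where that timeout is set with the Go SDK; the 10-minute value and the workflow and activity names are illustrative (the workflow-level run timeout would be set on the client side in StartWorkflowOptions).

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// MyWorkflow shows where the StartToClose timeout is set in workflow code.
// If this value were 10 years, a lost activity task would keep the workflow
// waiting that long, as described above.
func MyWorkflow(ctx workflow.Context, input string) (string, error) {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute, // per-attempt activity timeout
		// ScheduleToCloseTimeout, ScheduleToStartTimeout, and HeartbeatTimeout
		// can also be configured here.
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var result string
	// "MyActivity" is an illustrative activity registered elsewhere.
	err := workflow.ExecuteActivity(ctx, "MyActivity", input).Get(ctx, &result)
	return result, err
}
```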
Q9: Why is Cadence called a write-heavy system? Shouldn't it be read-heavy due to fetching workflow state and history on async polls?
Because of caching. Almost everything is cached on worker and history service nodes, so fetching workflow states and full histories is only needed when state is lost, either because of process restarts or because data is evicted from caches in an LRU manner.
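To make the eviction behavior concrete, here is a minimal LRU cache sketch in Go; it is illustrative only, not Temporal's cache implementation.

```go
// Package lru sketches an LRU cache of the kind used to keep workflow state and
// histories in memory so they rarely need to be re-read from the database.
package lru

import "container/list"

type entry struct {
	key   string
	value interface{}
}

type Cache struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func New(capacity int) *Cache {
	return &Cache{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns a cached value; a miss means the caller must fetch state/history from the DB.
func (c *Cache) Get(key string) (interface{}, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // mark as recently used
		return el.Value.(*entry).value, true
	}
	return nil, false
}

// Put inserts or updates a value, evicting the least recently used entry when full.
func (c *Cache) Put(key string, value interface{}) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	el := c.order.PushFront(&entry{key: key, value: value})
	c.items[key] = el
	if c.order.Len() > c.capacity {
		// Evict the least recently used entry; its next read becomes a DB fetch.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```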