I have a production Temporal cluster. Each service runs on its own EC2 instance. We started with 2 nodes each of frontend, matching, worker, and history. The persistence layer is Cassandra. Eventually we hit a bottleneck at history persistence, so we doubled the number of history nodes to 4 and doubled the size of the Cassandra cluster as well.
That helped a lot. However, there are still occasional spurts of metrics like shard_item_created and shard_closed. Every time this happens we also see service error metrics “ResourceExhausted” with cause “BusyWorkflow”. What is causing this? Are the “shard created” and “shard closed” metrics to be expected in normal operation, or only as part of history node rebalancing?
Can you share your graphs for
sum(rate(sharditem_created_count{service_name="history"}[1m]))
sum(rate(sharditem_removed_count{service_name="history"}[1m]))
sum(rate(sharditem_closed_count{service_name="history"}[1m]))
along with
sum(restarts{})
Expected shard movement can happen during cluster scaling and restarts, so I would check whether you see shard movement outside of those events.
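Since sum(restarts{}) aggregates across all services, a per-service breakdown can help tie shard movement to restarts of a particular role. A minimal sketch, assuming the restarts counter carries the same service_name label as the other server metrics in your deployment:
sum(increase(restarts{}[5m])) by (service_name)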
“ResourceExhausted” with cause “BusyWorkflow”
The BusyWorkflow resource exhausted cause basically means that the workflow lock could not be obtained within 500ms. You can check workflow lock latency via:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
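It can also help to break the resource-exhausted errors down to confirm that BusyWorkflow is the dominant cause and to see which operations it hits. A sketch, assuming your server version emits the service_errors_resource_exhausted counter with resource_exhausted_cause and operation labels:
sum(rate(service_errors_resource_exhausted{resource_exhausted_cause="BusyWorkflow"}[1m])) by (operation)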
A typical reason for this is scheduling large numbers of async activities/child workflows in a single workflow execution.
Note that before server version 1.26.0 it could also be caused by using workflow reset on the base execution of a workflow chain (the continue-as-new use case); see the issue here.
Thanks so much for the input, tihomir. I’ll check on these when I can. The issue hasn’t happened again yet, although it had happened occasionally before.