I have a production Temporal cluster. Each service runs on its own EC2 instance. We started with 2 nodes each of frontend, matching, worker, and history. The persistence layer is Cassandra. Eventually we hit a bottleneck at history persistence, so we doubled the number of history nodes to 4 and doubled the size of the Cassandra cluster as well.
That helped a lot. However, there are still occasional spurts of metrics like shard_item_created and shard_closed. Every time this happens we also see service error metrics “ResourceExhausted” with cause “BusyWorkflow”. What is causing this? Are the “shard created” and “shard closed” metrics to be expected in normal operation, or only as part of history node rebalancing?
Can you share your graphs for
sum(rate(sharditem_created_count{service_name="history"}[1m]))
sum(rate(sharditem_removed_count{service_name="history"}[1m]))
sum(rate(sharditem_closed_count{service_name="history"}[1m]))
along with
sum(restarts{})
Expected shard movement can happen during cluster scaling and restarts, so I would check whether you see shard movement outside of those events.
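Since sum(restarts{}) aggregates across all services, a per-service breakdown can help tie shard movement to restarts of a particular role. A minimal sketch, assuming the restarts counter carries the same service_name label as the other server metrics in your deployment:
sum(increase(restarts{}[5m])) by (service_name)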
“ResourceExhausted” with cause “BusyWorkflow”
The BusyWorkflow resource exhausted cause basically means that the workflow lock could not be obtained within 500ms. You can check workflow lock latency via:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
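It can also help to break the resource-exhausted errors down to confirm that BusyWorkflow is the dominant cause and to see which operations it hits. A sketch, assuming your server version emits the service_errors_resource_exhausted counter with resource_exhausted_cause and operation labels:
sum(rate(service_errors_resource_exhausted{resource_exhausted_cause="BusyWorkflow"}[1m])) by (operation)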
A typical reason for this is scheduling large numbers of async activities/child workflows in a single workflow execution.
Note that before server version 1.26.0 it could also be caused by using workflow reset on the base execution of a workflow chain (the continue-as-new use case); see the issue here.
Thanks so much for the input, tihomir. I’ll check on these when I can. The issue hasn’t happened again yet, although it had happened occasionally before.