Could you provide the full history for your execution:
tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json
Did you have a chance to look through the worker tuning guide in the docs?
Could you provide info on your sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
Ideally it should be above 99%. If the sync match rate is low, it means your workers are unable to keep up and you need to increase worker capacity.
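If you run multiple task queues, it can also help to break this down per task queue (assuming your server metrics carry a taskqueue label; label names can vary by setup):
sum(rate(poll_success_sync{}[1m])) by (taskqueue) / sum(rate(poll_success{}[1m])) by (taskqueue)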
Another thing to look at is the SDK task_schedule_to_start_latency metric. Can you measure this latency as well? A high latency would also indicate that you need to add more workers.
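For example, if you export SDK metrics to Prometheus, something along these lines should give you the p95 (the exact metric name, prefix, and unit suffix depend on your SDK and metrics setup, so treat this as a sketch):
histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{}[1m])) by (le))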
- numHistoryShards changed to 32
I think this is too low; typically you would go with 512 for a small-scale setup, and for a prod setup you would start with 4K.
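For reference, the value lives in the server's static config under persistence (with the Helm chart this is typically server.config.numHistoryShards; the exact path depends on your deployment). Note that numHistoryShards cannot be changed after the cluster has been created, so it needs to be sized up front.
persistence:
  numHistoryShards: 512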
Another thing to look at is persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for operations: CreateWorkflowExecution, UpdateWorkflowExecution, UpdateShard
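If you want to look at just those operations, you can filter the same query by the operation label, e.g.:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation=~"CreateWorkflowExecution|UpdateWorkflowExecution|UpdateShard"}[1m])) by (operation, le))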
Following is my Temporal setup on a GKE single-node cluster.
How many instances of the Temporal services are you running in your test env? See here for recommendations for a prod setup.