Workflow task timed out on GKE

Could you provide the full history for your execution:

tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json

Did you have a chance to look through the worker tuning guide in the docs?

Could you provide info on your sync match rate:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

Ideally it should be above 99%. A low sync match rate means your workers are unable to keep up, and you would need to increase worker capacity.
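
If you want to see which pollers are attached to your task queue, you can also check with tctl (flags may differ slightly between tctl versions):

tctl taskqueue describe --taskqueue <task-queue-name>

The output lists the workers currently polling that queue and when they last polled, which helps confirm whether added workers are actually picking up tasks.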

Another thing to look at is the SDK's task_schedule_to_start_latency metric. Can you measure this latency as well? A high latency would also indicate that you should add more workers.
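
If you export SDK metrics to Prometheus, a query along these lines should work (the exact metric name here is an assumption, it depends on your SDK and configured metrics prefix; with the Go SDK's Prometheus reporter it is typically temporal_workflow_task_schedule_to_start_latency / temporal_activity_schedule_to_start_latency):

histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_bucket{}[1m])) by (le))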

  1. numHistoryShards changed to 32

I think this is too low. Typically you would go with 512 for a small-scale setup; for a prod setup I would start with 4K.
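
For reference, this is the numHistoryShards value under the persistence section of the server's static config (or the corresponding value in your Helm chart, depending on how you deploy), e.g.:

persistence:
  numHistoryShards: 512

Keep in mind that numHistoryShards cannot be changed after the cluster has been created, so pick a value with enough headroom.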

Another thing to look at is persistence latencies:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for operations: CreateWorkflowExecution, UpdateWorkflowExecution, UpdateShard

Following is my Temporal setup on a GKE single-node cluster.

How many instances of the Temporal services are you running in your test env? See here for recommendations for a prod setup.