Could you provide the full history for your execution:
tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json
Did you have a chance to look through the worker tuning guide in the docs?
Could you provide info on your sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
Ideally it should be above 99%. If the sync match rate is low, it means your workers are unable to keep up and you need to increase worker capacity.
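If you run multiple task queues, it can also help to break this down per task queue (assuming your server metrics carry a taskqueue label; label names can vary by setup):
sum(rate(poll_success_sync{}[1m])) by (taskqueue) / sum(rate(poll_success{}[1m])) by (taskqueue)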
Another thing to look at is the SDK task_schedule_to_start_latency metric. Can you measure this latency as well? A high latency would also indicate that you need to add more workers.
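For example, if you export SDK metrics to Prometheus, something along these lines should give you the p95 (the exact metric name, prefix, and unit suffix depend on your SDK and metrics setup, so treat this as a sketch):
histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{}[1m])) by (le))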
- numHistoryShards changed to 32
I think this is too low; typically you would go with 512 for a small-scale setup, and for a prod setup you would start with 4K.
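For reference, the value lives in the server's static config under persistence (with the Helm chart this is typically server.config.numHistoryShards; the exact path depends on your deployment). Note that numHistoryShards cannot be changed after the cluster has been created, so it needs to be sized up front.
persistence:
  numHistoryShards: 512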
Another thing to look at is persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for operations: CreateWorkflowExecution, UpdateWorkflowExecution, UpdateShard
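If you want to look at just those operations, you can filter the same query by the operation label, e.g.:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation=~"CreateWorkflowExecution|UpdateWorkflowExecution|UpdateShard"}[1m])) by (operation, le))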
Following is my Temporal setup on a GKE single-node cluster.
How many instances of the Temporal services are you running in your test env? See here for recommendations for a prod setup.