Workflow task timed out on GKE


Following is my temporal setup on GKE single node cluster.

  1. Temporal cluster pointing to REMOTE Cassandra DB(TLS)
  2. NoHistoryShards changed to 32
  3. Match read/write partitions - 8
  4. maxConns(Temporal): 200
  5. Activity pollers: 50
  6. Workflow pollers: 50
  7. Frontend.NamespaceCount: 3000

My workflow consist of 3 activities which makes http calls to microservices present in same cluster(assume latency of 40ms)

Following were my readings for while running a load of 50 TPS for 1 min:
avg=235.59ms min=215.6ms med=218.31ms max=520.25ms p(90)=317.25ms p(95)=319.59ms

Following were my readings when i increase the TPS to 60:
avg=3.09s min=215.34ms med=518.56ms max=11.06s p(90)=10.23s p(95)=10.25s

Following are the metric graphs captured for 60 TPS:

As you can see temporal_workflow_task_schedule_to_start_latency_seconds_bucket, temporal_workflow_task_execution_latency_seconds_sum, temporal_workflow_endtoend_latency_seconds_bucket, temporal_activity_schedule_to_start_latency_seconds_sum, temporal_activity_execution_latency_seconds_sum are very low. But still as you see in web ui, all the workflow tasks are getting timed out. Can you please help?

Could you provide full history for your execution:

tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json

Did you have the chance look through the worker tuning guide in docs?

Could you provide info on your sync match rate:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

ideally it should be above 99%. if sync match rate is low it would mean your workers are unable to keep up (need to increase worker capacity)

another thing to look at is sdk task_schedule_to_start_latency metric, can you measure this latency as well? a high latency would indicate to add more workers.

  1. NoHistoryShards changed to 32

i think this is too low, typically you would go with 512 for a small scale setup. for prod setup would start with 4K.

Another thing to look at are persistence latencies:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for operations: CreateWorkflowExecution, UpdateWorkflowExecution, UpdateShard

Following is my temporal setup on GKE single node cluster.

How man instances of temporal services are you running on your test env? See here for recommendations for a prod setup

Following is the required json output. Note that I have removed certain fields like input and results which has sensitive data:

    1  WorkflowExecutionStarted    {WorkflowType:{Name:IRequestOtpWorkflow}, ParentInitiatedEventId:0, TaskQueue:{Name:RequestOtpWorkflowTaskQueue, Kind:Normal},       
                                  Input:[], WorkflowExecutionTimeout:0s, WorkflowRunTimeout:0s, WorkflowTaskTimeout:10s,                                             
                                  Initiator:Unspecified, OriginalExecutionRunId:66fa450b-ced0-4475-bc36-78728d7d8fb1,                                                  
                                  FirstExecutionRunId:66fa450b-ced0-4475-bc36-78728d7d8fb1, Attempt:1, FirstWorkflowTaskBackoff:0s,                                    
   2  WorkflowTaskScheduled       {TaskQueue:{Name:RequestOtpWorkflowTaskQueue,                                                                                        
                                  Kind:Normal}, StartToCloseTimeout:10s,                                                                                               
   3  WorkflowTaskStarted         {ScheduledEventId:2,                                                                                                                 
   4  WorkflowTaskCompleted       {ScheduledEventId:2, StartedEventId:3,                                                                                               
   5  MarkerRecorded              {MarkerName:LocalActivity, Details:map{activityId:["08d4dbf4-366f-32e3-a5b6-80911d97ff20"],                                          
                                  input:[{}], time:[1654609796207],                      
                                  type:["Validate"]}, WorkflowTaskCompletedEventId:4}                                                                                  
   6  ActivityTaskScheduled       {ActivityId:8a240c73-55c2-30ba-b0fe-22ee16c5090c, ActivityType:{Name:GenerateOTP},                                                   
                                  TaskQueue:{Name:OtpProcessorActivityTaskQueue, Kind:Normal},                                                    
                                  ScheduleToCloseTimeout:0s, ScheduleToStartTimeout:0s, StartToCloseTimeout:5s,                                                        
                                  HeartbeatTimeout:0s, WorkflowTaskCompletedEventId:4, RetryPolicy:{InitialInterval:1s,                                                
                                  BackoffCoefficient:2, MaximumInterval:1m40s, MaximumAttempts:1,                                                                      
   7  ActivityTaskStarted         {ScheduledEventId:6,                                                                                                                 
                                  RequestId:8d92c17a-46cf-4d15-91be-d84e9c679b9e, Attempt:1}                                                                           
   8  ActivityTaskCompleted       {Result:[{"errorDetail":"Mock                                                                                                        
                                  ScheduledEventId:6, StartedEventId:7,                                                                                                
   9  WorkflowTaskScheduled       {TaskQueue:{Name:1@issuer-switch-upi-registration-switch-66db64df-jw5c6:28f2f188-f52f-4129-b21b-06b3d2076318,                        
                                  Kind:Sticky}, StartToCloseTimeout:10s, Attempt:1}                                                                                    
  10  WorkflowTaskTimedOut        {ScheduledEventId:9,                                                                                                                 
  11  WorkflowTaskScheduled       {TaskQueue:{Name:RequestOtpWorkflowTaskQueue,                                                                                        
                                  Kind:Normal}, StartToCloseTimeout:10s,                                                                                               
  12  WorkflowTaskStarted         {ScheduledEventId:11,                                                                                                                                               
  13  WorkflowTaskCompleted       {ScheduledEventId:11, StartedEventId:12,                                                                                                                                                  
  14  ActivityTaskScheduled       {ActivityId:a6cf7ea8-9f4a-3430-bed3-ac612ba5a95b, ActivityType:{Name:DeliverOtp},                                                    
                                  TaskQueue:{Name:OtpCommunicationActivityTaskQueue, Kind:Normal}, ScheduleToCloseTimeout:0s, ScheduleToStartTimeout:0s, StartToCloseTimeout:5s,                                              
                                  HeartbeatTimeout:0s, WorkflowTaskCompletedEventId:13, RetryPolicy:{InitialInterval:1s,                                               
                                  BackoffCoefficient:2, MaximumInterval:1m40s, MaximumAttempts:1,                                                                      
  15  ActivityTaskStarted         {ScheduledEventId:14,                                                                                                                
                                  RequestId:b0b92566-d116-429f-ba5f-792df6251738, Attempt:1}                                                                           
  16  ActivityTaskCompleted       {Result:[{"responseCode":"00","message":"Success"}],                                                                                 
                                  ScheduledEventId:14, StartedEventId:15,                                                                                              
  17  WorkflowTaskScheduled       {TaskQueue:{Name:1@issuer-switch-upi-registration-switch-66db64df-jw5c6:28f2f188-f52f-4129-b21b-06b3d2076318,                        
                                  Kind:Sticky}, StartToCloseTimeout:10s, Attempt:1}                                                                                    
  18  WorkflowTaskTimedOut        {ScheduledEventId:17,                                                                                                                
  19  WorkflowTaskScheduled       {TaskQueue:{Name:RequestOtpWorkflowTaskQueue,                                                                                        
                                  Kind:Normal}, StartToCloseTimeout:10s,                                                                                               
  20  WorkflowTaskStarted         {ScheduledEventId:19,                                                                                                                
  21  WorkflowTaskCompleted       {ScheduledEventId:19, StartedEventId:20,                                                                                             
  22  ActivityTaskScheduled       {ActivityId:fc7968a0-1fd7-3271-ad20-b5f9cadf8eb5, ActivityType:{Name:SendOtpResponse}, TaskQueue:{Name:RequestOtpActivityTaskQueue,  
                                  ScheduleToCloseTimeout:0s, ScheduleToStartTimeout:0s, StartToCloseTimeout:5s, HeartbeatTimeout:0s, WorkflowTaskCompletedEventId:21,  
                                  RetryPolicy:{InitialInterval:1s, BackoffCoefficient:2, MaximumInterval:1m40s, MaximumAttempts:1, NonRetryableErrorTypes:[]}}         
  23  ActivityTaskStarted         {ScheduledEventId:22,                                                                                                                
                                  RequestId:d0e106e2-0b3e-4d8f-bd74-11f57af91d70, Attempt:1}                                                                           
  24  ActivityTaskCompleted       {Result:[], ScheduledEventId:22, StartedEventId:23,                                                                               
  25  WorkflowTaskScheduled       {TaskQueue:{Name:1@issuer-switch-upi-registration-switch-66db64df-jw5c6:28f2f188-f52f-4129-b21b-06b3d2076318,                        
                                  Kind:Sticky}, StartToCloseTimeout:10s, Attempt:1}                                                                                    
  26  WorkflowTaskTimedOut        {ScheduledEventId:25,                                                                                                                
  27  WorkflowTaskScheduled       {TaskQueue:{Name:RequestOtpWorkflowTaskQueue,                                                                                        
                                  Kind:Normal}, StartToCloseTimeout:10s,                                                                                               
  28  WorkflowTaskStarted         {ScheduledEventId:27,                                                                                                                                                
  29  WorkflowTaskCompleted       {ScheduledEventId:27, StartedEventId:28,                                                                                                                                                   
  30  WorkflowExecutionCompleted  {Result:[{}],                 

Yes Tihomir, we have gone through this link and we had tried for different configs, but whenever we go above 50 tps, we start to see very high latency.

Uploading: Screenshot 2022-06-08 at 12.33.55 PM.png…

When you say “add more workers/ increase worker capacity”, does that mean increasing concurrent pollers/ increasing execution size/ increasing pods of application where we create workers or something else?

We had tried this with 512 shards, and had observed low latency with default matching partitions, but as we try the same run with 8 matching service partitions, the latency again increases. We want to understand why this is happening.

We have gone through this link as well. Currently there are only 1 pod per temporal service. We wanted to see with this deployment how much TPS can be achieved.

@tihomir Also, some followup questions regarding above setup.

  1. We were able to achieve decent response time of workflow execution till 60 tps…but it goes very worse above that despite of changing no of shards, increasing temporal services pods, increasing no. of partitions. No luck yet.
  2. Another thing, looks like when it goes above 60 tps…workflow active thread count drastically increases, sync match rate drops and the service latency that invokes async workflow execution increases. Is there any parameter to check the latency of a workflow invocation?

Please do help and guide us.

  1. NoHistoryShards changed to 32
  2. Match read/write partitions - 8
  3. maxConns(Temporal): 200

try not to use single k8s pod, but one pod for frontend / matching / history
try 256 as history shards, 1 read / write partition (matching)

what is maxConns?

how many workflows are running concurrently?

@Wenquan_Xing We do have one pod each per temporal service i.e frontend, history, matching etc
MaxConns is defined as 200…it is the value provided as part of yaml file for default and visibility databases…any recommendation you have for this value?
We are trying to hit workflows around 70 TPS…our target is to achieve more than 3000 TPS.

  1. are you using advance visibility?
  2. MaxConns is defined as 200, try 8?

We are trying to hit workflows around 70 TPS…our target is to achieve more than 3000 TPS.

TPS: what is your definition? 70 workflow completion per second?
a workflow with 3 sequential activities roughly translate to 15 DB transactions
you may want to check if your DB is overloaded already