Following is my Temporal setup on a GKE single-node cluster.
Temporal cluster pointing to a remote Cassandra DB (TLS)
numHistoryShards changed to 32
Matching task queue read/write partitions: 8
maxConns (Temporal persistence config): 200
Activity pollers: 50
Workflow pollers: 50
frontend.namespaceCount: 3000
MaxConcurrentWorkflowTaskExecutionSize: 200
MaxConcurrentActivityExecutionSize: 200
My workflow consists of 3 activities that make HTTP calls to microservices in the same cluster (assume a latency of 40 ms).
Following were my readings while running a load of 50 TPS for 1 min:
avg=235.59ms min=215.6ms med=218.31ms max=520.25ms p(90)=317.25ms p(95)=319.59ms
Following were my readings when I increased the load to 60 TPS:
avg=3.09s min=215.34ms med=518.56ms max=11.06s p(90)=10.23s p(95)=10.25s
Following are the metric graphs captured for 60 TPS:
As you can see, temporal_workflow_task_schedule_to_start_latency_seconds_bucket, temporal_workflow_task_execution_latency_seconds_sum, temporal_workflow_endtoend_latency_seconds_bucket, temporal_activity_schedule_to_start_latency_seconds_sum, and temporal_activity_execution_latency_seconds_sum are all very low. But still, as you can see in the Web UI, all the workflow tasks are getting timed out. Can you please help?
The sync match rate ideally should be above 99%. If the sync match rate is low, it would mean your workers are unable to keep up (you need to increase worker capacity).
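If it helps, something along these lines can chart it in Prometheus (a sketch, assuming the matching-service counters poll_success and poll_success_sync are being scraped; names/labels may differ in your setup):
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))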
Another thing to look at is the SDK task_schedule_to_start_latency metric; can you measure this latency as well? A high latency would also indicate that you should add more workers.
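For example, a p95 over the SDK histogram you already listed could look roughly like this (a sketch, assuming it is exported to Prometheus with the _seconds_bucket suffix shown in your screenshots):
histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket{}[1m])) by (le))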
numHistoryShards changed to 32
I think this is too low; typically you would go with 512 for a small-scale setup. For a prod setup I would start with 4K.
Another thing to look at is persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for the operations CreateWorkflowExecution, UpdateWorkflowExecution, and UpdateShard.
Following is my Temporal setup on a GKE single-node cluster.
How many instances of the Temporal services are you running in your test env? See here for recommendations for a prod setup.
Yes Tihomir, we have gone through this link and tried different configs, but whenever we go above 50 TPS, we start to see very high latency.
When you say “add more workers / increase worker capacity”, does that mean increasing concurrent pollers, increasing execution size, increasing the number of pods of the application where we create workers, or something else?
We had tried this with 512 shards and observed low latency with the default matching partitions, but when we try the same run with 8 matching service partitions, the latency increases again. We want to understand why this is happening.
We have gone through this link as well. Currently there is only 1 pod per Temporal service. We wanted to see how much TPS can be achieved with this deployment.
@tihomir Also, some follow-up questions regarding the above setup.
We were able to achieve decent workflow execution response times up to 60 TPS, but it gets much worse above that despite changing the number of shards, increasing the Temporal service pods, and increasing the number of partitions. No luck yet.
Another thing: it looks like when we go above 60 TPS, the workflow active thread count increases drastically, the sync match rate drops, and the latency of the service that invokes the async workflow execution increases. Is there any parameter to check the latency of a workflow invocation?
@Wenquan_Xing We do have one pod each per Temporal service, i.e. frontend, history, matching, etc.
MaxConns is defined as 200; it is the value provided as part of the YAML file for the default and visibility databases. Do you have any recommendation for this value?
We are trying to hit workflows at around 70 TPS; our target is to achieve more than 3000 TPS.
We are trying to hit workflows at around 70 TPS; our target is to achieve more than 3000 TPS.
TPS: what is your definition? 70 workflow completions per second?
A workflow with 3 sequential activities roughly translates to 15 DB transactions, so 70 workflows per second already means on the order of 1,000 DB transactions per second.
You may want to check if your DB is overloaded already.
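One rough way to see that load is to chart the persistence request rate per operation (a sketch, assuming the persistence_requests counter is exported alongside the persistence_latency histogram quoted above; the exact metric name may differ by server version):
sum(rate(persistence_requests{}[1m])) by (operation)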