High latency on Starting Workflow/Activity execution

Hi Team,

We have developed a basic POC app that orchestrates activities (11 in total). These activities are one-liners that only log a simple message to the console.

We deployed this app to the azure-westeurope region in a k8s cluster close to our self-hosted Temporal cluster (westeurope). Upon executing requests, we saw the entire workflow taking close to 3s to complete. This is very high latency for such a simple use case; the same application in a non-Temporal environment takes less than 100ms. We tried Local Activities, and the workflow completed in about 400ms, but this would be our last resort, as we plan to have dedicated workers for activities and to leverage rate-limiting.
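For context, here is a simplified sketch of how the regular and Local Activity variants were wired up in the workflow code (PocWorkflow, PocActivities, logStep and the timeouts are illustrative names/values, not our actual code):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

@ActivityInterface
interface PocActivities {
    void logStep(String step); // each activity is a one-liner that logs a message
}

@WorkflowInterface
interface PocWorkflow {
    @WorkflowMethod
    void run();
}

public class PocWorkflowImpl implements PocWorkflow {

    // Regular activity stub: every call is dispatched through the server's task
    // queue and picked up by an activity worker (the ~3s path we measured).
    private final PocActivities activities =
        Workflow.newActivityStub(
            PocActivities.class,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .build());

    // Local Activity stub: runs inside the same worker process as the workflow task,
    // skipping the server round trip (the ~400ms path, our last resort).
    private final PocActivities localActivities =
        Workflow.newLocalActivityStub(
            PocActivities.class,
            LocalActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .build());

    @Override
    public void run() {
        // 11 one-liner activities invoked sequentially.
        for (int i = 1; i <= 11; i++) {
            activities.logStep("step-" + i);
        }
    }
}
```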

We enabled SDK tracing, and this is how it looks:

As we can see from the traces, most of the time is spent on starting workflow & activity execution.

Our CPU & memory were normal, which rules out the worker being overutilized.


Below are the workflow/activity schedule_to_start latencies & gRPC request_latency:


We could see that schedule_to_start is comparatively lower than request_latency. The request_latency is on par with the overall latencies we observed in the traces. How is schedule_to_start latency measured? Does request_latency also include the schedule_to_start interval?

We also tried increasing setMaxConcurrentActivityTaskPollers from the default of 5, but that didn't yield any difference in response latencies.
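For reference, the poller change was applied via WorkerOptions, roughly like this (the task queue name, target address and values are illustrative, and PocWorkflowImpl/PocActivitiesImpl stand for the implementations sketched above):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerStarter {
    public static void main(String[] args) {
        // Connect to the self-hosted cluster (address is illustrative).
        WorkflowServiceStubs service =
            WorkflowServiceStubs.newServiceStubs(
                WorkflowServiceStubsOptions.newBuilder()
                    .setTarget("temporal-frontend.westeurope.internal:7233")
                    .build());
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);

        // Raised from the default of 5; this made no difference to response latencies.
        WorkerOptions workerOptions =
            WorkerOptions.newBuilder()
                .setMaxConcurrentActivityTaskPollers(20)
                .build();

        Worker worker = factory.newWorker("poc-task-queue", workerOptions);
        worker.registerWorkflowImplementationTypes(PocWorkflowImpl.class);
        worker.registerActivitiesImplementations(new PocActivitiesImpl());

        factory.start();
    }
}
```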

Temporal server version - 1.18.4
Java SDK version - 1.18

Could you please help troubleshoot this issue?

Thanks,
Dhanraj

How is schedule_to_start latency measured?

This is measured from the time a workflow/activity task is placed on a matching host task queue partition until your worker picks it up for processing.

Does request_latency also include the schedule_to_start interval?

request_latency and long_request_latency measure only the gRPC-level latencies, i.e. how long the client/worker requests themselves take. Schedule-to-start latencies deal with how long it takes for tasks to be dispatched to your workers.

Regarding your RPC latencies for StartWorkflowExecution, can you share your persistence latencies (server metrics):

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

And frontend service latencies (server metrics):

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

We also tried increasing setMaxConcurrentActivityTaskPollers from the default of 5, but that didn't yield any difference in response latencies.

Please share the sync match rate as well:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

Hi @tihomir,

Please find the query results for the metrics you asked for.

Persistence latencies


Frontend service latencies


Sync match rate

Please let me know if you need more info.

@Andrey_Dubnik - fyi

Thanks,
Dhanraj

We would be really grateful if someone could help us understand what StartActivity and RunActivity in the distributed trace correspond to. If StartActivity is the schedule-to-start and RunActivity is the execution duration, then there is an unexplained gap between the end of the StartActivity measurement and the start of the RunActivity measurement.

@Dhanraj
Poll latency being high is expected; if it's low, it means your workers rarely need to wait for tasks (which probably means you don't have enough workers).
PollWorkflowExecutionHistory is a long-poll request too, so it being high is imo ok as well.

Regarding the high latencies for StartWorkflowExecution, can you check if your frontend and/or history hosts have high CPU usage?

There is no load on frontend or history.

@tihomir - As Andrey mentioned, there was no load on our Frontend & History hosts. Could you please help us troubleshoot this issue?
Please let me know if you need any details/metrics.