Hi Team,
We have developed a basic POC app that orchestrates 11 activities. Each activity is a one-liner that only logs a simple message to the console.
We deployed this app to a Kubernetes cluster in the Azure westeurope region, close to our self-hosted Temporal cluster (also westeurope). When executing requests, the entire workflow takes close to 3s to complete, which is very high latency for such a simple use case. The same application in a non-Temporal environment takes less than 100ms. We tried local activities and the workflow completed in about 400ms, but that will be our last resort, as we plan to have dedicated workers for activities and leverage rate limiting.
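For reference, here is roughly how we switched from regular activity stubs to local activity stubs (a minimal sketch, not our actual code; the GreetingActivities interface, timeout values and names below are illustrative):

import java.time.Duration;
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;

// Illustrative activity interface (our real one has 11 one-liner methods):
@ActivityInterface
public interface GreetingActivities {
    void logStep(String message);
}

// Inside the workflow implementation:

// Regular activity stub: each call goes through the activity task queue,
// so it pays the full schedule/dispatch round trip per activity.
GreetingActivities activities = Workflow.newActivityStub(
        GreetingActivities.class,
        ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(5))
                .build());

// Local activity stub: executed directly in the workflow worker process,
// which is what brought our end-to-end time down to ~400ms.
GreetingActivities localActivities = Workflow.newLocalActivityStub(
        GreetingActivities.class,
        LocalActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(5))
                .build());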
We enabled SDK tracing and this is how it looks:
As we can see from the traces, most of the time is spent on starting the workflow and executing the activities.
Our CPU and memory were normal, which rules out the worker being overutilized.
Below are the workflow/activity schedule_to_start latencies and the gRPC request_latency.
We can see that schedule_to_start is comparatively lower than request_latency, and request_latency is on par with the overall latencies we observed in the traces. How is schedule_to_start latency measured? Does request_latency also include the schedule_to_start interval?
We also tried increasing setMaxConcurrentActivityTaskPollers from the default of 5, but that didn't make any difference in response latencies.
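For context, this is roughly how the worker is configured (a minimal sketch with illustrative values; PocWorkerSetup, the task queue name, the target address, and the poller/rate-limit numbers are placeholders, not our actual setup):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class PocWorkerSetup {
    public static void main(String[] args) {
        // Connect to the self-hosted cluster (placeholder address).
        WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
                WorkflowServiceStubsOptions.newBuilder()
                        .setTarget("temporal-frontend.westeurope.example:7233")
                        .build());
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);

        WorkerOptions workerOptions = WorkerOptions.newBuilder()
                .setMaxConcurrentActivityTaskPollers(20) // raised from the default of 5
                .setMaxConcurrentWorkflowTaskPollers(10)
                // Rate limiting we plan to use with dedicated activity workers:
                .setMaxTaskQueueActivitiesPerSecond(100)
                .build();

        Worker worker = factory.newWorker("poc-task-queue", workerOptions);
        // worker.registerWorkflowImplementationTypes(...) and
        // worker.registerActivitiesImplementations(...) go here.
        factory.start();
    }
}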
Temporal server version - 1.18.4
Java SDK version - 1.18
Could you please help troubleshoot this issue?
Thanks,
Dhanraj
How is schedule_to_start latency measured?
This is measured from the time a workflow/activity task is placed on a matching host task queue partition until your worker picks it up for processing.
Does request_latency also include the schedule_to_start interval?
request_latency and long_request_latency measure only the gRPC-level latencies, i.e. how long the client/worker requests themselves take. Schedule-to-start latencies deal with how long it takes for tasks to be dispatched to your workers.
Regarding your RPC latencies for StartWorkflowExecution, can you share your persistence latencies (server metrics):
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
And frontend service latencies (server metrics):
histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))
We also tried increasing setMaxConcurrentActivityTaskPollers from the default of 5, but that didn't make any difference in response latencies.
Please share the sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
Hi @tihomir,
Please find the query results for the metrics you asked for.
Persistence latencies
Frontend service latencies
Sync match rate
Please let me know if you need more info.
@Andrey_Dubnik - fyi
Thanks,
Dhanraj
We would be really grateful if someone could help us understand what StartActivity and RunActivity in the distributed trace correspond to. If StartActivity is the schedule-to-start interval and RunActivity is the execution duration, then there is an unexplained gap between the end of the StartActivity span and the start of the RunActivity span.
@Dhanraj
Poll latency being high is expected; if it's low, it means your workers rarely need to wait for tasks (meaning you probably don't have enough workers).
PollWorkflowExecutionHistory is a long-poll request too, so it being high is, imo, OK as well.
Regarding the high latencies for StartWorkflowExecution, can you check if your frontend and/or history hosts have high CPU usage?
There is no load on frontend or history.
@tihomir - As Andrey mentioned, there was no load on our Frontend & History hosts. Could you please help us troubleshoot this issue?
Please let me know if you need any details/metrics.