Seeing high latencies between two subsequent activity task executions

From the provided info it looks as latencies are happening due to your workers being under-provisioned.

Also it looks from description that you are shooting for > 100K state transitions per second as far as performance goes, is this true?

Each workflow execution takes 250 ms, there are 8 worker pods with 50 pollers each for workflow and activity, concurrent execution size of 200(we even tried with 300), and we are hitting 250 Workflows per second with 3 activities each activity taking 40-45 ms as can be seen from the trace attached. Which means we are effectively giving ~30 workflows per pod, with the given latencies and pollers and threads(600), I don’t believe the workers are underprovisioned. And as per the trace clearly the time between activity to activity execution where temporal executes workflows task seems to increase. But the workflow task schedule to start and execution latencies are a few 10s of ms. So it’s unclear where are the latencies inside temporal.
Please let me know if I can provide any other metrics or information.

Some of the things we tried since we posted the query -

  1. Increased pollers on the application pods
  2. Increased application pods where worker are running
  3. Increased worker pods from 1 to 2
  4. Increase concurrent execution size of workflows and activities to 600
  5. Increase cache size and max workflow thread count to 1200
  6. Increased matching queue partitions to 16
  7. Tried to create 250 tps load with two different workflows having seperate queues 4 each(1 for workflow and 3 for activities)
  8. Create a new workflow factory instance for each new worker

None of the above helped us moved past the bottleneck of 250 TPS. @samar Can you please suggest what else are we missing to configure/check?

@tihomir to your query, so this workflow has about 30 steps in workflow execution history(1 local activity and 3 activities) on web ui and wenquan had mentioned that comes upto roughly 15 db queries on one of our old queries. So on this setup once we achieve ~250 ms at 250-300 TPS, we want to be able to add another node and hit 600 TPS as our immediate next step of scaling. And we want to be able to scale to 3K workflows/second by the end of the year and to 10K w/s by the next year.

Hi Samar @samar
We were finally able to get past the bottleneck of 250 when we increased out matching pods from 4 to 6. We could only ever go upto 200 with 4 matching pods, now able to move past and hit upto 300 with deterioration nonetheless but no bottleneck atleast when we scaled out to 6 pods.

Is there any a limiting threshold of 50 workflows per pod on matching? Like i mentioned above we tried everything in the book like increasing cpu, memory, pollers, concurrent execution size, cache size, max thread count, partitions and maxdb connections. Nothing seemed to help for the 250 workflows bottleneck except adding more pods of matching service.

Also wanted to check are the partitions of a single task queue shared across the matching pods?

It seems to me like worker under provision issue. You can verify this by looking at your server side metrics: persistence_requests with tag operation=CreateTask, this metrics indicate that task fail to deliver to a poller and was written to database and become backlog. It will later be dispatched to a poller when available.
You can also verify that by looking at async match: sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m])), ideally it should stay at 0.

Thanks for the response Yimin. @Yimin_Chen I do see peaks in those metrics during the bottleneck. Attaching screenshots of our 15-20 minute runs for your reference. We run 50,100,150,200,250,300 TPS runs for 2 mins each.
This bottleneck was seen at 200 TPS at 5.36 PM -

And this one seen at 7.10 PM when we hit the system at 300 TPS after we increased the matching service pods from 4 to 6 -

So we were able to move from 200 to 300 workflows/second without any change to the pods hosting our workflow and activity workers. We have 6 pods of workers having 20 pollers each, default 200 concurrent execution size, 600 max threads and cache size. So thats a total of 1200 worker slots per workflow/activity. Do you believe that is still under provisioned?
We also see temporal_worker_task_slots_available is almost always at 1190 except during bottleneck we see it at 1175…does this mean even though threads and slots are available, the workflows arent actually getting executed in parallel? or since our execution latencies are as low as 1-2 ms, we dont see enough slots filled up at any given point? Want to clarify that I mention 1-2 ms here but the activity execution latencies in the graphs pasted on first post show upto 50 ms latency…post that we removed all latencies(120ms) in our application pods to see if we are able to go past the 250 tps bottleneck if end-to-end latency is reduced.

You can increase your poller count per each worker pod (for example from 20 to 40), and see if that would move the needle.

We have tried moving it up to 64, did not change the outcome at 250 TPS. @Yimin_Chen As mentioned the time between activity executions where workflow task is executed steadily increasing at higher TPS, but workflow task schedule to start and execution latencies remain low. It’s the time after an activity is completed and before the workflow task is put in the matching queue that seems to grow. There is no metric or trace that captures that latency. How can we check the latencies before the task is scheduled? I.e the status of completed activity is updated and the next workflow task is added to the queue.

Hello @Yimin_Chen we took a closer look at what you suggested in provisioning more pollers and also added more worker pods to reduce the async match count. What we noticed happens is even though there are 10 pods running with 50 pollers and 200 maxconcurrentexecutionsize, we dont see 10*concurrent_execution_size showing up in temporal_worker_task_slots_available metric on grafana. We tried looking for it in prometheus and see random number of pods show up with workflow workers not equal to pod count but sometimes 8 or 1 or 2 or even 0. That may have been the root cause the whole time, even though we are starting 6 worker pods, it still doesnt have sufficient pollers/workers at just 200 workflows per second each taking only 100 ms on avg. Attaching a screenshot when we dont see any slots available at all.

@maxim Can you please suggest if this is bug or expected behavior or something we need to check in our code?
Why does this metric keep shifting up and down like seen in the graph…can you please explain…

We see activity workers are consistently created on the activity task queue for each pod, but dont see that consistent behavior with workflow workers on the workflow task.

As long as you see persistence_request {operation=“CreateTask”} it means there is no poller waiting when the task is dispatched.
Did you take a look at this doc: How to tune Workers | Temporal Documentation

Yes the issue we are facing is all pods don’t seem to create workflow workers. So even if we increase worker pods which ideally would have created more pollers and worker slots, they don’t seem to be getting initialised as can be verified from Prometheus. Activity workers are always created equal to no of pods but not workflow workers. Seeing this consistently whenever we restart pods, only a random number of pods are seen in worker slots metric. Wanted to check what could cause that?
When there are sufficient workflow workers created, even if it’s only 6 we are able to hit 150 tps without degradation in latency(100 ms). But sometimes even with 10 pods, 150 tps goes into seconds latency as behind the scenes we see only one of the pods had created a workflow worker.
@Yimin_Chen due to this even though we have 10 pods with 100 pollers each, we see bottleneck at 200 workflows per second. Is Prometheus metric a reliable way to check for number of pollers and slots?

What SDK do you use? Could you share your worker initialization code?

Hello @Yimin_Chen
Temporak Sdk 1.13.0, Server 1.15.0
we did take a dive into our initialization code after you pointed it out…what we had was code similar to GitHub - applicaai/spring-boot-starter-temporal: The driver making it convenient to use Temporal with Spring Boot.. So we had all initiation code in one library and shared across our microservices containing workers. We removed all code from the library and all other task queues except the one we are testing, and simplified the iniitialization code in our springboot service project. Now we see that the slots show up perfectly i.e #pods*executionsize. We will add back all other workflows code and check exactly what made the jvm create the workers correctly on Monday.
But one question though is…even though we have 6 pods of workers with 100 pollers each and 600 concurrent execution, during the run, we only ever see 4-8 slots getting used in grafana at a given point. Wanted to understand why that is so? We are tracking the temporal_worker_task_slots_available metric. So after the code changes it did help us move from 200 tps bottleneck but when we try 250 workflows per second, even though we scaled our pods provided 400 more pollers and 1200 more execution size, it still shows peak at async match count.

Please check server side metric
sum(service_pending_requests) by (temporal_namespace) to see if server does receive those poll requests.
Other things to look at is the latency metrics (service_latency and persistence_latency).

Can you also confirm history shard count and task queue partition counts?

Service latency and persistence latency graphs are attached on the first post @Yimin_Chen do those numbers look on the higher side? And yes we see high service pending requests matching the poller count on all the pods.
We are running 5 pods of Cassandra with 16 GB heap size with G1GC 48 GB storage and 4K shards on 5 16 core 64 GB t2D servers with replication factor 3 . Matching task queue partitions are at 4 and we even tried with 8. @samar
Attaching the read and write latencies seen on Cassandra

One query I have is, is the cores given to workers anyway related to the concurrent execution size of the tasks? We seemed to cross the 200 bottleneck when we gave 8 cores to each of our worker pods, but we pushed down to 2 and 4 since we saw low cpu utilisation , we once again saw rapid deterioration in latency(bottleneck) with same amount of pollers and pods.

So after we found a workaround for the worker slots issue (yet to identify root cause of it) we see correct slots as per pods of the worker application.
We changed the temporal cluster settings to -
Front end - 3 pods, 4 cores 4GB each ; History - 5 pods, 8 cores 8GB each; Matching - 3 pods, 4 cores 4GB each, 4K shards, matching queue partitions at 4.
No matter how many pods, and there by pollers, workers, cache we provisioned we continued to see bottleneck at 200 TPS (highest configuration being 12 pods, 64 pollers-200 max concurrent execution size-600 max thread and cache size each)
We then noticed there were service_errors_resource_exhausted errors and gradually increased rpslimits on frontend, history and matching service to 24000 and were finally able to cross the 200 TPS bottleneck upto 350 TPS(although with a 100ms increase in end to end latency) but faced similar bottleneck at 400 TPS with a peak in aysnc match rate. But when we moved to 36000 & 48000 on the rps limit we once saw application seeing peaks in async match count and latencies going into seconds with upto 1.2K resource exhausted errors for a 1 minute run at 300 TPS. Following queries on the behavior noticed below:

  1. With no other change in the configuration(worker pods), we were able to scale upto 350 TPS when we increased rps limits to 24000, why do we see once again see increase in async match at 300 when we increase rps limit to 36000/48000
  2. Attached graphs show run 1 minute run executions of 100,150,200,250,300,350,400(bottleneck) workflows per second -

    2.a. Why do we only see upto 25 activity worker slots getting used?
    2.b. Do the sticky cache evictions seen imply some config change needed? One other observation we noted is, no matter the worker configuration of pollers, cache size, pods, we always see similar peaks at that respective TPS…always! Is that expected?
    2.c Not seen in the graphs attached but, I only see 1000 workflow worker slots used during that bottleneck run. Does that mean the task execution was so fast during good runs, it only ever utilized 25 slots.
  3. What should be the logic behind setting up suitable rpslimits. Dont see a lot of documentation on this.

You should see why you still get the resource exhausted errors. Check the resource_exhausted_cause tag on that metrics to see if it is rps limit or concurrent limit or system overload. You may need to increase frontend.namespaceCount if you see concurrent limit. You need to increase persistence rate limit if you see system overload: [frontend|history|matching].persistenceMaxQPS.
And you probably need to increase frontend.namespaceRPS as well, but I guess you already done so.

The StickyCacheEviction indicate your worker’s sticky cache might not be big enough. You can either increase worker count, or increase sticky cache size.

@Yimin_Chen Can you please explain the difference between below attributes :
FrontendRPS: “frontend.rps”,
FrontendMaxNamespaceRPSPerInstance: “frontend.namespaceRPS”,
FrontendMaxNamespaceCountPerInstance: “frontend.namespaceCount”,
FrontendGlobalNamespaceRPS: “frontend.globalNamespacerps”,

  1. We have a single namespace “default” and tried increasing “frontend.rps” values till 48K, but still we are getting resource exhausted errors with cause as rps limit. Hence, need a clarification on “namespaceRPS” attribute as well.
    Also, what are the max values that these attributes support?

  2. We also see service_errors_entity_not_found in dashboard, what config needs to be verified for these kind of errors?

frontend.rps / history.rps / matching.rps sets RPS limit per service pod.
frontend.namespaceRPS sets per namespace RPS limit.
There is no max value limit on those configs.

service_errors_entity_not_found is expected error, it means workflow (or other entities like activity) cannot be found. This could happen if there are some tasks (timer/transfer/visibility tasks) after workflow is deleted (like due to retention). Or if you try to send signal to non-exists workflow.