Seeing high latencies between two subsequent activity task executions

Hi,

We are currently doing stress testing on our application which uses Temporal for micro-service orchestration.
Our current setup is a GKE cluster as defined below -
2 T2D nodes with 32 cores and 64 GB memory each, hosting
8 application pods - each pod with 50 workflow pollers, 50 activity pollers, maxConcurrentWorkflowTaskExecutionSize 200, maxConcurrentActivityTaskExecutionSize 200, maxWorkflowThreadCount 600, workflowCacheSize 600
4 frontend pods, 8 history pods, 4 matching pods, 1 worker pod, 8 matching queue partitions, 4K shards
All application and Temporal pods run on the same 2 nodes mentioned above.
5 T2D nodes with 16 cores and 64 GB memory hosting 5 Cassandra pods with replication factor 3 and LOCAL_QUORUM consistency. All application and database nodes are in the same zone.
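For reference, the per-pod worker tuning above corresponds roughly to the following Java SDK options (a simplified sketch, not our exact code; the task queue name and frontend address are placeholders):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
    WorkflowServiceStubsOptions.newBuilder()
        .setTarget("temporal-frontend:7233")   // placeholder frontend address
        .build());
WorkflowClient client = WorkflowClient.newInstance(service);

WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
    .setMaxWorkflowThreadCount(600)   // maxWorkflowThreadCount
    .setWorkflowCacheSize(600)        // workflowCacheSize
    .build();
WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

WorkerOptions workerOptions = WorkerOptions.newBuilder()
    .setMaxConcurrentWorkflowTaskPollers(50)          // workflow pollers
    .setMaxConcurrentActivityTaskPollers(50)          // activity pollers
    .setMaxConcurrentWorkflowTaskExecutionSize(200)   // workflow task slots
    .setMaxConcurrentActivityExecutionSize(200)       // activity task slots
    .build();

Worker worker = factory.newWorker("APP_TASK_QUEUE", workerOptions);   // placeholder queue name
// worker.registerWorkflowImplementationTypes(...);
// worker.registerActivitiesImplementations(...);
factory.start();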

We are executing a workflow with 1 local activity & 3 activities. Our microservices take up to 150 ms of the overall end-to-end latency. The end-to-end latencies noted for the workflow are as follows -
10 TPS - avg=247.5ms min=232.38ms med=244.97ms max=466.14ms p(90)=255.34ms p(95)=257.52ms
50 TPS - avg=237.03ms min=220ms med=232.71ms max=645.33ms p(90)=244ms p(95)=252.37ms
100 TPS - avg=235.01ms min=214.71ms med=231.74ms max=633.84ms p(90)=245.9ms p(95)=254.31ms
150 TPS - avg=243.06ms min=216.82ms med=233.8ms max=731.17ms p(90)=260.86ms p(95)=275.8ms
We noticed degradation in latency past this point -
200 TPS - avg=262.14ms min=220.5ms med=246.04ms max=891.28ms p(90)=298.95ms p(95)=352.7ms
And it starts to bottleneck at 250 TPS - avg=1.33s min=222.38ms med=261.42ms max=11.25s p(90)=5.26s p(95)=5.69s

The workflow schedule-to-start latency, workflow task execution latency, activity schedule-to-start latency and activity execution latency all look fine in the metrics below. But as you can see, the end-to-end latency is in seconds.

From the traces, the latency between the end of one runActivity and the next startActivity increases sharply as we increase throughput. That gap should correspond to the workflow task schedule-to-start and workflow task execution latencies, but both of those stay well within a few milliseconds.

Service latencies and persistence latencies also seem to be doing fine -


As the latencies grow, we see the sync match rate also drop sharply.
We are also seeing some sticky cache evictions, but are unsure whether they are the cause of the bottleneck or a symptom of it. At 250 TPS spread across 8 application pods, each with a cache size of 600, I don't see them as the cause of the bottleneck.

We have tried tweaking all the params suggested by the team and still hit this bottleneck at 250 TPS with all the pods underutilized and the nodes at only 50% CPU utilization. We need your help debugging this: we want to scale throughput without latency deterioration up to 60% node utilization, and we plan to add more nodes to scale this setup horizontally. We also need to understand why there is degradation between 150 and 200 TPS and tune that as well.


Hi @Himaja

I asked the server team to provide input on your question and will report back.


Attaching matching service metrics noted during the run for your reference




From the provided info it looks as if the latencies are due to your workers being under-provisioned.

Also, from the description it looks like you are shooting for > 100K state transitions per second as far as performance goes - is that true?

Each workflow execution takes 250 ms. There are 8 worker pods with 50 pollers each for workflow and activity tasks, a concurrent execution size of 200 (we even tried 300), and we are hitting 250 workflows per second with 3 activities, each activity taking 40-45 ms as can be seen from the attached trace. That means we are effectively running ~30 workflows per pod; with the given latencies, pollers and threads (600), I don't believe the workers are under-provisioned. And as the trace shows, the time between one activity and the next, where Temporal executes the workflow task, clearly increases, yet the workflow task schedule-to-start and execution latencies are only a few tens of milliseconds. So it's unclear where the latency inside Temporal is coming from.
Please let me know if I can provide any other metrics or information.

Some of the things we tried since we posted the query -

  1. Increased pollers on the application pods
  2. Increased application pods where the workers are running
  3. Increased worker pods from 1 to 2
  4. Increased the concurrent execution size of workflows and activities to 600
  5. Increased cache size and max workflow thread count to 1200
  6. Increased matching queue partitions to 16
  7. Tried to create 250 TPS load with two different workflows, each with 4 separate queues (1 for the workflow and 3 for activities)
  8. Created a new workflow factory instance for each new worker

None of the above helped us move past the bottleneck of 250 TPS. @samar Can you please suggest what else we are missing to configure/check?

@tihomir to your query: this workflow has about 30 steps in the workflow execution history (1 local activity and 3 activities) on the web UI, and wenquan had mentioned in one of our earlier queries that this comes to roughly 15 DB queries. So on this setup, once we achieve ~250 ms at 250-300 TPS, we want to be able to add another node and hit 600 TPS as our immediate next scaling step. And we want to be able to scale to 3K workflows/second by the end of the year and to 10K workflows/second by next year.

Hi Samar @samar
We were finally able to get past the bottleneck of 250 when we increased our matching pods from 4 to 6. We could only ever reach 200 with 4 matching pods; after scaling out to 6 pods we can move past that and hit up to 300, still with some latency deterioration but at least without a hard bottleneck.

Is there a limiting threshold of around 50 workflows per matching pod? Like I mentioned above, we tried everything in the book: increasing CPU, memory, pollers, concurrent execution size, cache size, max thread count, partitions and max DB connections. Nothing seemed to help with the 250-workflow bottleneck except adding more matching service pods.

Also wanted to check: are the partitions of a single task queue shared across the matching pods?

It seems to me like a worker under-provisioning issue. You can verify this by looking at your server-side metrics: persistence_requests with the tag operation=CreateTask. This metric indicates that a task could not be delivered to a poller and was written to the database as backlog; it will later be dispatched to a poller when one becomes available.
You can also verify it by looking at the async match rate: sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m])); ideally it should stay at 0.

Thanks for the response, Yimin. @Yimin_Chen I do see peaks in those metrics during the bottleneck. Attaching screenshots of our 15-20 minute runs for your reference. We run 50, 100, 150, 200, 250 and 300 TPS runs for 2 minutes each.
This bottleneck was seen at 200 TPS at 5.36 PM -


And this one seen at 7.10 PM when we hit the system at 300 TPS after we increased the matching service pods from 4 to 6 -

So we were able to move from 200 to 300 workflows/second without any change to the pods hosting our workflow and activity workers. We have 6 worker pods with 20 pollers each, the default 200 concurrent execution size, and 600 max threads and cache size. So that's a total of 1200 worker slots each for workflows and activities. Do you believe that is still under-provisioned?
We also see temporal_worker_task_slots_available is almost always at 1190, except during the bottleneck when we see it at 1175. Does this mean that even though threads and slots are available, the workflows aren't actually getting executed in parallel? Or, since our execution latencies are as low as 1-2 ms, do we just not see enough slots filled at any given point? To clarify: I mention 1-2 ms here, but the activity execution latencies in the graphs in the first post show up to 50 ms; after that we removed all the artificial latencies (120 ms) in our application pods to see whether we could get past the 250 TPS bottleneck with a reduced end-to-end latency.

You can increase the poller count on each worker pod (for example from 20 to 40) and see if that moves the needle.

We have tried moving it up to 64; it did not change the outcome at 250 TPS. @Yimin_Chen As mentioned, the time between activity executions, where the workflow task is executed, steadily increases at higher TPS, while the workflow task schedule-to-start and execution latencies remain low. It's the time after an activity completes and before the next workflow task is put in the matching queue that seems to grow, and there is no metric or trace that captures that latency. How can we check the latencies before the task is scheduled, i.e. between the completed activity's status being recorded and the next workflow task being added to the queue?

Hello @Yimin_Chen, we took a closer look at what you suggested about provisioning more pollers and also added more worker pods to reduce the async match count. What we noticed is that even though there are 10 pods running with 50 pollers and maxConcurrentExecutionSize of 200 each, we don't see 10 * concurrent_execution_size showing up in the temporal_worker_task_slots_available metric in Grafana. We tried looking for it in Prometheus and see a random number of pods reporting workflow workers, not equal to the pod count but sometimes 8, 1, 2 or even 0. That may have been the root cause the whole time: even though we are starting 6 worker pods, the setup still doesn't have sufficient pollers/workers at just 200 workflows per second, with each workflow taking only 100 ms on average. Attaching a screenshot where we don't see any slots available at all.


@maxim Can you please suggest whether this is a bug, expected behavior, or something we need to check in our code?
Why does this metric keep shifting up and down as seen in the graph? Can you please explain?

We see activity workers consistently created on the activity task queue for each pod, but we don't see the same consistent behavior with workflow workers on the workflow task queue.

As long as you see persistence_requests{operation="CreateTask"}, it means there was no poller waiting when the task was dispatched.
Did you take a look at this doc: How to tune Workers | Temporal Documentation

Yes, the issue we are facing is that not all pods seem to create workflow workers. So even if we increase worker pods, which ideally would create more pollers and worker slots, they don't seem to be getting initialised, as can be verified from Prometheus. Activity workers are always created equal to the number of pods, but not workflow workers. We see this consistently whenever we restart pods: only a random subset of pods shows up in the worker slots metric. Wanted to check what could cause that?
When sufficient workflow workers are created, even if it's only 6, we are able to hit 150 TPS without degradation in latency (100 ms). But sometimes, even with 10 pods, 150 TPS latency goes into seconds because behind the scenes only one of the pods had created a workflow worker.
@Yimin_Chen because of this, even though we have 10 pods with 100 pollers each, we see a bottleneck at 200 workflows per second. Is the Prometheus metric a reliable way to check the number of pollers and slots?
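For context, the workers export SDK metrics to Prometheus roughly along these lines (a minimal sketch of the standard Micrometer/Tally wiring, not our exact code; the report interval and registry setup are illustrative):

import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import com.uber.m3.util.Duration;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

// Bridge SDK metrics (including temporal_worker_task_slots_available)
// into a Prometheus registry that Grafana scrapes.
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
Scope scope = new RootScopeBuilder()
    .reporter(new MicrometerClientStatsReporter(registry))
    .reportEvery(Duration.ofSeconds(10));   // illustrative report interval

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
    WorkflowServiceStubsOptions.newBuilder()
        .setMetricsScope(scope)
        .build());
// The stubs are then used to build the WorkflowClient and WorkerFactory as usual.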

What SDK do you use? Could you share your worker initialization code?

Hello @Yimin_Chen
Temporal SDK 1.13.0, Server 1.15.0.
We did take a dive into our initialization code after you pointed it out. What we had was code similar to GitHub - applicaai/spring-boot-starter-temporal: The driver making it convenient to use Temporal with Spring Boot. So we had all the initialization code in one library shared across our microservices containing workers. We removed all code from that library and all task queues other than the one we are testing, and simplified the initialization code in our Spring Boot service project. Now we see that the slots show up correctly, i.e. #pods * executionSize. We will add back all the other workflow code on Monday and check exactly what made the JVM create the workers correctly.
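For context, the simplified initialization now looks roughly like this (a sketch only; the workflow/activity classes and queue name are placeholders, and the WorkflowClient bean is assumed to be defined elsewhere):

import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TemporalWorkerConfig {

  // Register the workflow and activity implementations explicitly on one worker
  // so every pod starts both a workflow worker and an activity worker.
  @Bean(destroyMethod = "shutdown")
  public WorkerFactory workerFactory(WorkflowClient client) {
    WorkerFactory factory = WorkerFactory.newInstance(
        client,
        WorkerFactoryOptions.newBuilder()
            .setMaxWorkflowThreadCount(600)
            .setWorkflowCacheSize(600)
            .build());

    Worker worker = factory.newWorker("APP_TASK_QUEUE");                    // placeholder queue name
    worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);       // placeholder class
    worker.registerActivitiesImplementations(new MyActivitiesImpl());       // placeholder class
    factory.start();
    return factory;
  }
}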
One question though: even though we have 6 worker pods with 100 pollers each and a concurrent execution size of 600, during the run we only ever see 4-8 slots in use in Grafana at any given point (we are tracking the temporal_worker_task_slots_available metric). We want to understand why that is. The code changes did help us move past the 200 TPS bottleneck, but when we try 250 workflows per second, even though we scaled our pods and provided 400 more pollers and 1200 more execution slots, we still see a peak in the async match count.

Please check the server-side metric sum(service_pending_requests) by (temporal_namespace) to see if the server does receive those poll requests.
Other things to look at are the latency metrics (service_latency and persistence_latency).

Can you also confirm history shard count and task queue partition counts?

Service latency and persistence latency graphs are attached on the first post. @Yimin_Chen do those numbers look on the higher side? And yes, we see high service_pending_requests matching the poller count on all the pods.
We are running 5 Cassandra pods with a 16 GB heap (G1GC), 48 GB storage and 4K shards, on 5 T2D servers with 16 cores and 64 GB each, with replication factor 3. Matching task queue partitions are at 4 and we even tried 8. @samar
Attaching the read and write latencies seen on Cassandra



One question I have: are the cores given to the workers in any way related to the concurrent execution size of the tasks? We seemed to cross the 200 bottleneck when we gave 8 cores to each of our worker pods, but when we pushed that down to 2 and 4 (since we saw low CPU utilisation), we once again saw rapid deterioration in latency (bottleneck) with the same number of pollers and pods.

So after we found a workaround for the worker slots issue (we have yet to identify its root cause), we see the correct number of slots for the worker application pods.
We changed the Temporal cluster settings to -
Frontend - 3 pods, 4 cores 4 GB each; History - 5 pods, 8 cores 8 GB each; Matching - 3 pods, 4 cores 4 GB each; 4K shards; matching queue partitions at 4.
No matter how many pods (and thereby pollers, workers and cache) we provisioned, we continued to see a bottleneck at 200 TPS (the highest configuration being 12 pods with 64 pollers, 200 max concurrent execution size, and 600 max thread and cache size each).
We then noticed service_errors_resource_exhausted errors, gradually increased the RPS limits on the frontend, history and matching services to 24000, and were finally able to cross the 200 TPS bottleneck and reach 350 TPS (although with a 100 ms increase in end-to-end latency), but faced a similar bottleneck at 400 TPS with a peak in async match rate. However, when we moved the RPS limit to 36000 and 48000, we again saw peaks in async match count and latencies going into seconds, with up to 1.2K resource exhausted errors for a 1-minute run at 300 TPS. Some questions on the behavior noticed below:

  1. With no other change in the configuration (worker pods), we were able to scale up to 350 TPS when we increased the RPS limits to 24000. Why do we once again see an increase in async match at 300 TPS when we increase the RPS limit to 36000/48000?
  2. The attached graphs show 1-minute run executions of 100, 150, 200, 250, 300, 350 and 400 (bottleneck) workflows per second -

    2.a. Why do we only see up to 25 activity worker slots being used?
    2.b. Do the sticky cache evictions imply some config change is needed? One other observation: no matter the worker configuration of pollers, cache size or pods, we always see similar peaks at the respective TPS…always! Is that expected?
    2.c. Not shown in the attached graphs, but I only see 1000 workflow worker slots used during that bottleneck run. Does that mean task execution was so fast during the good runs that it only ever utilized 25 slots?
  3. What should the logic be for setting suitable RPS limits? We don't see a lot of documentation on this.