Workflow Performance with Java SDK

Hello!
Together with my team, we are implementing Temporal. Our workflow consists of 4 activities (1 of them local). When generating a load of 200 RPS, Temporal works correctly, but I'm not satisfied with its speed. During testing, 12000 workflows are launched, and most of them get stuck in the WorkflowTaskScheduled state.

The temporal_workflow_task_schedule_to_start_latency_seconds metric also keeps growing throughout the duration of the test.

Increasing maxConcurrentWorkflowTaskExecutionSize, maxConcurrentActivityExecutionSize, maxWorkflowThreadCount, and so on does not affect the result.
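For reference, a minimal sketch of where those knobs live in the Java SDK (the task queue name and values here are placeholders, not our actual config): maxWorkflowThreadCount is a WorkerFactoryOptions setting, while the execution-size limits are per-worker WorkerOptions.

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    // Connect to a locally reachable frontend (newer SDK versions; target is a placeholder).
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // maxWorkflowThreadCount is a factory-level (per-process) setting.
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setMaxWorkflowThreadCount(800)
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // The concurrent execution slot sizes are per-worker settings.
    WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskExecutionSize(200)
        .setMaxConcurrentActivityExecutionSize(200)
        .build();
    Worker worker = factory.newWorker("load-test-queue", workerOptions);

    // worker.registerWorkflowImplementationTypes(...);
    // worker.registerActivitiesImplementations(...);
    factory.start();
  }
}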

Additional info:
frontend replica count - 3
matching replica count - 3
history replica count - 5
worker replica count - 2
cassandra cluster size - 5
numHistoryShards - 4096

Please help.

I would start by looking at the sync match rate (server metrics):

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

It should stay around 1 (100%). If you see it dip, or it never reaches 100%, then most likely you need either more worker pods or more workflow task / activity task pollers on your workers.

Another server metric to look at is

sum(rate(persistence_requests{operation="CreateTask"}[1m]))

If this graph shows values > 0, then again it seems you might have under-provisioned workers (this metric counts tasks written to the database when no pollers are available to pick them up).

On the worker side, please look at CPU and memory utilization (%) and share the results. We need to see whether to add more workers or increase the poller counts.
If you can, please share the results of all of the above, along with your dynamic config values for

numTaskqueueWritePartitions
numTaskqueueReadPartitions

as well as the number of pollers you define in your worker config.
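For reference, on recent Java SDK versions the poller counts are set per worker via WorkerOptions; a minimal sketch with placeholder values (older SDK releases expose the same knobs under the deprecated names setWorkflowPollThreadCount / setActivityPollThreadCount):

import io.temporal.worker.WorkerOptions;

public class PollerTuning {
  static WorkerOptions pollerOptions() {
    return WorkerOptions.newBuilder()
        // Number of long-poll requests kept open against the workflow task queue.
        .setMaxConcurrentWorkflowTaskPollers(10)
        // Number of long-poll requests kept open against the activity task queue.
        .setMaxConcurrentActivityTaskPollers(20)
        .build();
  }
}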

Also please share your graphs of the SDK metrics for workflow task and activity schedule-to-start latencies.