Temporal benchmark test with Maru

I am trying to measure Temporal cluster throughput with Maru. I deployed the Temporal cluster with the Helm chart on three k8s nodes; each service has two pod instances, and the database is PostgreSQL.
To limit the history service's memory usage, I modified the history cache sizes:
history.cacheInitialSize: 50
history.cacheMaxSize: 100
history.eventsCacheInitialSize: 50
history.eventsCacheMaxSize: 100
I also set the number of history shards to 4.

My Maru YAML configuration is:
step.count: 1000
step.ratePerSecond: 50
step.concurrency: 100
I got the following table. The total workflow execution time was 760s, but it actually took only 170s to execute 962 workflows; the rest of the time was spent processing the remaining 38 workflows (backlog).
My question is how to locate the performance bottleneck of the cluster. In other words, which key performance metrics should I look at, and which parameters should I adjust to improve the cluster's throughput?
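To make the gap between the two phases concrete, here is a small sketch that works out the throughput from the figures quoted above. It only uses the numbers from the post (1000 workflows, 760s total, 962 workflows in the first 170s); the variable names are illustrative.

```python
# Figures quoted in the post above.
total_workflows = 1000   # step.count
total_seconds = 760      # total workflow execution time reported
fast_workflows = 962     # workflows completed in the first phase
fast_seconds = 170       # duration of the first phase

# What is left over is the backlog phase.
backlog_workflows = total_workflows - fast_workflows
backlog_seconds = total_seconds - fast_seconds

# Throughput in workflows per second for each phase and overall.
fast_rate = fast_workflows / fast_seconds
backlog_rate = backlog_workflows / backlog_seconds
overall_rate = total_workflows / total_seconds

print(f"first phase: {fast_rate:.2f} wf/s")
print(f"backlog:     {backlog_workflows} workflows in {backlog_seconds}s "
      f"({backlog_rate:.3f} wf/s)")
print(f"overall:     {overall_rate:.2f} wf/s")
```

Roughly 78% of the wall-clock time (590s of 760s) goes to the last 38 workflows, so the overall figure says more about the backlog tail than about steady-state throughput.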

Following another topic, here are some metric graphs from my test environment.

Could you show your persistence latencies by operation?

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
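For readers unfamiliar with how that query turns bucket counters into a latency figure, below is a simplified sketch of the linear interpolation that `histogram_quantile` performs on classic (bucketed) histograms. It is not Prometheus code: it assumes at least two buckets, a quantile in (0, 1], and ignores edge cases such as NaN values. The bucket values in the example are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (le, cumulative_count) pairs.

    The pair with le == math.inf must be present; it carries the total count.
    """
    buckets = sorted(buckets)      # order by upper bound; +Inf sorts last
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total               # position of the requested observation
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (le, cum) in enumerate(buckets):
        if cum >= rank:
            break
    if math.isinf(le):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (le - lower) * (rank - prev) / (cum - prev)

# Hypothetical latency buckets in seconds (not taken from the thread).
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

In the query above the inputs are per-second rates over `[1m]` rather than raw counts, but the interpolation works the same way. The result can never be more precise than the bucket boundaries allow.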

Also, it seems your sync match rate is not optimal (from the last two graphs). Can you show:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
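As a rough illustration of what that ratio measures, the sketch below computes it from two readings of each counter taken over the same window. It is a simplification: it ignores counter resets and the extrapolation that `rate()` applies, and the sample readings are hypothetical. Because both rates use the same `[1m]` window, the window length cancels out and only the counter increases matter.

```python
def sync_match_rate(sync_start, sync_end, all_start, all_end):
    """Fraction of successful polls that were sync-matched in the window.

    Arguments are readings of poll_success_sync and poll_success at the
    start and end of the same time window.
    """
    sync_delta = sync_end - sync_start
    all_delta = all_end - all_start
    if all_delta <= 0:
        return None  # no successful polls in the window, ratio is undefined
    return sync_delta / all_delta

# Hypothetical counter readings, not taken from the thread.
ratio = sync_match_rate(sync_start=1200, sync_end=1500,
                        all_start=2000, all_end=2500)
print(f"sync match rate: {ratio:.0%}")
```

A value close to 1 means most tasks were handed directly to a waiting poller; a lower value means more tasks were persisted first and picked up later.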

Is there a reason you set the number of history shards to 4? There is a nice writeup on this here if it helps.
Regarding the reduction of the history cache sizes (dynamic config) from the defaults: these are per-shard configurations. Were you running out of resources on your history pods during the load test? What was the CPU/memory % during the test?
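Since the cache limits are per shard, the cluster-wide ceiling scales with the shard count. A quick sketch of that arithmetic with the values from the original post, taking the per-shard statement above at face value and not modelling how shards are distributed across the two history pods:

```python
# Values from the original post.
num_history_shards = 4
cache_max_size = 100          # history.cacheMaxSize
events_cache_max_size = 100   # history.eventsCacheMaxSize

# Per-shard limits multiplied by the number of shards give the upper
# bound on cached entries across the whole cluster.
total_cache_entries = num_history_shards * cache_max_size
total_events_cache_entries = num_history_shards * events_cache_max_size

print(total_cache_entries, total_events_cache_entries)
```

With only 4 shards the ceiling is 400 entries for each cache across the cluster, which is why the actual CPU and memory usage during the test matters before tuning these values further.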

Hi @tihomir, thanks a lot for your reply.

I also tried running two workers (previously there was just one), but 'sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))' doesn't seem to have changed much, and the cluster throughput doesn't seem to have improved.

The reason I set the number of shards to 4 and reduced the history cache sizes is to limit the memory consumption of the Temporal cluster as much as possible while still meeting our performance requirements.
My goal is to find a balance between performance and resource consumption.

Hi @tihomir, one thing confuses me: both the PollWorkflowTaskQueue and PollActivityTaskQueue latencies are on the order of minutes, while the latencies of the other operations are on the order of milliseconds. The PollWorkflowTaskQueue and PollActivityTaskQueue latencies also drop once workflows start executing. How should I interpret these two metrics?