Temporal performance issues

While performance-testing Temporal, I ran into a problem: the workflow execution time is too high.

The workflow consists of a single activity with the simplest possible logic (log.info()).

In terms of resource utilization, all components run without reaching their limits.

The load on Temporal is 200 RPS.

All settings are listed below. Please, help.

Resources of temporal server components:

Component name   CPU     RAM      Count of replicas
Frontend         1600m   1024Mi   1
History          6400m   8192Mi   1
Matching         1600m   1024Mi   1
Worker           200m    512Mi    1

Count of workers = 2

I would look at:

  1. an increase in service requests, sum(rate(service_requests[5m])), and whether it goes up around the same time as your workflow and activity schedule-to-start latencies go up.
  2. whether you had any resource exhausted issues during this time (this will also show if you got rate limited by the frontend hosts' RPS limits):
    sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
  3. your sync match rate:
    sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
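
To make the sync match rate check concrete, here is a minimal Python sketch of the same ratio computed over a few samples. It assumes you have already pulled the two per-second rates (poll_success_sync and poll_success) out of your metrics backend. The sample values, the helper names, and the 0.99 threshold are all hypothetical and only there for illustration.

```python
from typing import List, Optional, Tuple

def sync_match_rate(sync_rate: float, total_rate: float) -> Optional[float]:
    """Ratio of sync-matched polls to all successful polls.

    Returns None when there were no successful polls, since the
    ratio is undefined in that case.
    """
    if total_rate <= 0:
        return None
    return sync_rate / total_rate

def find_dips(
    samples: List[Tuple[str, float, float]],
    threshold: float = 0.99,  # assumed cutoff, tune for your setup
) -> List[Tuple[str, float]]:
    """Return (timestamp, ratio) for every sample below the threshold."""
    dips = []
    for timestamp, sync_rate, total_rate in samples:
        ratio = sync_match_rate(sync_rate, total_rate)
        if ratio is not None and ratio < threshold:
            dips.append((timestamp, ratio))
    return dips

# Hypothetical samples: (timestamp, poll_success_sync rate, poll_success rate)
samples = [
    ("12:00", 200.0, 200.0),  # everything sync matched
    ("12:01", 150.0, 200.0),  # a quarter of polls were not sync matched
    ("12:02", 0.0, 0.0),      # no polls at all, skipped
]

print(find_dips(samples))
```

A dip like the one at 12:01 is the kind of thing to line up against the schedule-to-start latency graphs from step 1.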

If you did not get rate limited and your sync match rate looks OK (around 100%, no dips), I would check whether you have "stuck" executions (workflow task failed/timed out).
The server metric for that is workflow_task_attempt (a histogram metric). Also look at the history hosts' logs, specifically for "Critical attempts processing workflow task".

Also check the start-to-close timeout for workflow tasks, server metric: