Hello!
While load testing Temporal, I ran into a problem with workflow execution time being too high.
The workflow consists of a single activity with the simplest possible logic (just a log.info() call).
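For illustration, a minimal sketch of such a workflow and activity (Java SDK assumed from the log.info() call; all names here are hypothetical):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@WorkflowInterface
interface LoadTestWorkflow {
  @WorkflowMethod
  void run();
}

@ActivityInterface
interface LogActivity {
  @ActivityMethod
  void logOnce();
}

// The activity's only "work" is a single log statement.
class LogActivityImpl implements LogActivity {
  private static final Logger log = LoggerFactory.getLogger(LogActivityImpl.class);

  @Override
  public void logOnce() {
    log.info("activity executed");
  }
}

// The workflow schedules exactly one activity and waits for it to complete.
class LoadTestWorkflowImpl implements LoadTestWorkflow {
  private final LogActivity activity =
      Workflow.newActivityStub(
          LogActivity.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(10))
              .build());

  @Override
  public void run() {
    activity.logOnce();
  }
}
```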
Judging by resource utilization, all components run without hitting their limits.
The load on Temporal is 200 RPS.
All settings are listed below. Please help.
Resources of the Temporal server components:
| Component name | CPU | RAM | Replicas |
|---|---|---|---|
| Frontend | 1600m | 1024Mi | 1 |
| History | 6400m | 8192Mi | 1 |
| Matching | 1600m | 1024Mi | 1 |
| Worker | 200m | 512Mi | 1 |
Number of application workers = 2
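For context, a rough sketch of how each of the two application worker processes might be wired up, reusing the workflow and activity from the sketch above (the task queue name and frontend address are assumptions):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

class LoadTestWorker {
  public static void main(String[] args) {
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporal-frontend:7233") // assumed frontend address
                .build());
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Both worker processes poll the same task queue.
    Worker worker = factory.newWorker("load-test-task-queue"); // assumed name
    worker.registerWorkflowImplementationTypes(LoadTestWorkflowImpl.class);
    worker.registerActivitiesImplementations(new LogActivityImpl());

    factory.start();
  }
}
```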
I would look at:
- An increase in service requests:
  sum(rate(service_requests[5m]))
  See if it goes up around the same time that your workflow and activity schedule-to-start latencies go up.
- Whether you had any resource-exhausted issues during this time (this also shows whether you were rate limited by the frontend hosts' RPS limits):
  sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
- Your sync match rate:
  sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
If you did not get rate limited and your sync match rate looks OK (around 100%, no dips), I would check whether you have “stuck” executions (workflow tasks failing or timing out).
The server metric for that is workflow_task_attempt (a histogram metric); also look at the history hosts' logs, specifically for "Critical attempts processing workflow task".
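To chart that histogram, something along these lines should work, assuming the Prometheus reporter exposes it with the standard _bucket suffix (adjust to your setup):
histogram_quantile(0.95, sum(rate(workflow_task_attempt_bucket[5m])) by (le))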
Also check for workflow task start-to-close timeouts; the server metric for that is:
sum(rate(start_to_close_timeout{operation="TimerActiveTaskWorkflowTaskTimeout"}[1m]))