We are currently using Temporal in a project, but we are not able to meet the required performance.
We would appreciate your help checking whether we are using the right settings/recommendations.
Here is our case:
We are using Go: just 1 microservice with 18 activities. Each activity only makes a simple HTTP call to a backend with a ~100ms response delay (no complex logic).
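In the Go SDK, the workflow and one of these activities look roughly like the following minimal sketch (the function names, backend URL, timeouts, retry policy, and the sequential loop are placeholders for illustration, not our actual code):

package app

import (
    "context"
    "io"
    "net/http"
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

// CallBackendActivity stands in for one of the 18 activities: a plain HTTP GET
// against a backend that responds in roughly 100ms.
func CallBackendActivity(ctx context.Context, url string) (string, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return "", err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

// SampleWorkflow chains the activities (shown here sequentially) and relies on
// Temporal's retry mechanism instead of hand-written retries.
func SampleWorkflow(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 5 * time.Second,
        RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 3},
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    for i := 0; i < 18; i++ {
        var out string
        if err := workflow.ExecuteActivity(ctx, CallBackendActivity, "http://backend.example/step").Get(ctx, &out); err != nil {
            return err
        }
    }
    return nil
}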
We run the Go microservice and the Temporal server in AWS EKS. The Temporal server services use the Docker image temporalio/auto-setup:1.16.2.
We run 4 Temporal server services, 1 pod each: temporal-server-frontend, temporal-server-history, temporal-server-matching, and temporal-server-worker.
We are using Cassandra and Elasticsearch, each with 3 replicas.
We already ran a performance test and here are the results. The performance requirement is about 200 workflows/sec.
100 workflows completed in 4 seconds. ~25 workflows/sec
1000 workflows completed in 1 minute 6 seconds. ~15 workflows/sec
5000 workflows completed in 6 minutes 23 seconds. ~13 workflows/sec.
I already tried using temporalio/server:1.16.2 and set up Cassandra & Elasticsearch myself (created the tables & indexes).
But I got the error below.
"error":"Unable to decode search attributes: invalid search attribute type: Unspecified"
{"level":"error","ts":"2022-05-24T09:07:13.104Z","msg":"Fail to process task","shard-id":1101,"address":"10.1.0.101:7234","component":"visibility-queue-processor","wf-namespace-id":"bb11fa9a-42d3-40fc-9219-1c0a705ddda2","wf-id":"tx1_server_test2","wf-run-id":"e7a9443f-a782-41ea-8d90-d88bfb43b86c","queue-task-id":1048579,"queue-task-visibility-timestamp":"2022-05-24T09:07:10.209Z","queue-task-type":"VisibilityStartExecution","queue-task":{"NamespaceID":"bb11fa9a-42d3-40fc-9219-1c0a705ddda2","WorkflowID":"tx1_server_test2","RunID":"e7a9443f-a782-41ea-8d90-d88bfb43b86c","VisibilityTimestamp":"2022-05-24T09:07:10.209487532Z","TaskID":1048579,"Version":0},"wf-history-event-id":0,"error":"Unable to decode search attributes: invalid search attribute type: Unspecified","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:333","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/home/builder/temporal/service/history/taskProcessor.go:333\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/home/builder/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/home/builder/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/home/builder/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/home/builder/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/home/builder/temporal/service/history/taskProcessor.go:247\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/home/builder/temporal/service/history/taskProcessor.go:177"}
As a workaround, I scaled up using temporalio/auto-setup:1.16.2 first, then rolled back to temporalio/server:1.16.2, and got no errors.
Changes I have already tested based on your recommendations:
Using image temporalio/server:1.16.2 in all services.
Using NUM_HISTORY_SHARDS=4096.
Scaling up the Go SDK service to 4 pods.
But there is still no impact on performance.
1000 workflows still complete in about 1 minute.
We noticed that the request operations flatten out at around 250-280 ops/s, and there is also a huge gap between "execution started" and "task started" in the Temporal UI monitoring. Is this related to limited throughput from the workflow worker? Is there any other configuration we can tune to improve the throughput?
Is there any other configuration we can tune to improve the throughput?
One thing you can check is your persistence latencies. Try this Prometheus query:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for the CreateWorkflowExecution, UpdateWorkflowExecution, and UpdateShard operations. You will have to establish your own baseline for these latencies, but it is a good thing to look at.
Another thing to consider is worker capacity (number of pollers).
See this post and our worker tuning guide for more info.
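In the Go SDK these poller and concurrency knobs are set on worker.Options; here is a minimal sketch of where they live (the task queue name and the numbers are placeholders, not tuned recommendations):

package main

import (
    "log"

    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    c, err := client.Dial(client.Options{}) // defaults to localhost:7233
    if err != nil {
        log.Fatalln("unable to create Temporal client:", err)
    }
    defer c.Close()

    w := worker.New(c, "my-task-queue", worker.Options{
        // Number of goroutines polling the matching service for tasks.
        MaxConcurrentWorkflowTaskPollers: 10,
        MaxConcurrentActivityTaskPollers: 10,
        // Upper bounds on tasks executing concurrently in this worker process.
        MaxConcurrentWorkflowTaskExecutionSize: 200,
        MaxConcurrentActivityExecutionSize:     200,
    })
    // w.RegisterWorkflow(...) and w.RegisterActivity(...) as usual.
    if err := w.Run(worker.InterruptCh()); err != nil {
        log.Fatalln("worker exited:", err)
    }
}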
Regarding shard count, try looking at shard lock contention:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
If this latency is too high, it indicates that you did not set up enough shards.
How many frontend servers do you have handling client requests? Typically you want to have an L7 load balancer to round-robin client load among the available frontend servers.
Another thing you could look at is resource-exhausted issues.
Prometheus query: sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
Look for error causes: RpsLimit, ConcurrentLimit, and SystemOverloaded. These are per service instance.
From this exploration, we noticed that the RPS limits, the activity/task pollers, and the matching task queue partitions are tightly related.
However, we are still in a blind spot when it comes to figuring out the optimal settings to fulfill our needs. Are there any rules of thumb for tuning these parameters?
Here are some questions that came up and need clarification or guidance, so that we can tune Temporal better in our environment.
How should we correlate the RPS limits between frontend, matching, and history? In our experiment, it seems matching needs the highest RPS while history needs the lowest.
Is the RPS limit in our experiment considered too high (51200 RPS)?
How many activity/task pollers are considered too many?
What is the impact of setting the RPS number too high? Same question for partitions and pollers.
How do we determine how many partitions we need? When we tried to increase the queue partitions, our processing became slower.
try to use 3 frontend pods; 3 matching pods; 3 history pods; 1 worker pod;
check CPU / mem utilization of the above pods when running the load
make sum(rate(poll_success_sync[1m]))/sum(rate(poll_success[1m])) as close to 100% as possible by increasing the MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers
use 4 for both matching.numTaskqueueReadPartitions and matching.numTaskqueueWritePartitions
try to use local activities, which will improve latency (reduce unnecessary round trips); see the sketch after this list
assuming my understanding is correct: per workflow there will be 18 normal activities, and the target is 100 workflows per sec → this translates to roughly 5K DB transactions per sec
check if your DB is overloaded (DB CPU / mem)
check the metrics emitted from the Temporal persistence layer: histogram_quantile(0.99, sum(rate(persistence_latency_bucket[1m])) by (operation, le)) → p99 latency, target is ~50ms (CreateWorkflowExecution, UpdateWorkflowExecution)
try to use more workflows for testing; the start-to-completion rate for 1000 workflows may not be accurate (long tail?), so try 10,000 or maybe more
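To illustrate the local activity suggestion, here is a minimal sketch of the same hypothetical 18-step loop using local activities; note that workflow.LocalActivityOptions also accepts a RetryPolicy, so retries can still be configured declaratively (CallBackendActivity is the placeholder activity from the sketch earlier in the thread):

package app

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

// SampleWorkflowWithLocalActivities runs each step as a local activity, avoiding the
// extra server round trips of normal activities while keeping a declarative retry policy.
func SampleWorkflowWithLocalActivities(ctx workflow.Context) error {
    lao := workflow.LocalActivityOptions{
        ScheduleToCloseTimeout: 5 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval: 100 * time.Millisecond,
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithLocalActivityOptions(ctx, lao)
    for i := 0; i < 18; i++ {
        var out string
        if err := workflow.ExecuteLocalActivity(ctx, CallBackendActivity, "http://backend.example/step").Get(ctx, &out); err != nil {
            return err
        }
    }
    return nil
}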
@maxim
Have you considered using local activities for short calls?
Hi Maxim,
We haven't considered local activities because, even though the activities are short calls, we initially wanted to leverage Temporal's retry mechanism to keep our workflow code simple. From what I read, when using local activities we need to implement retries ourselves and need to consider the workflow timeout boundary for that particular activity.
try to use 3 frontend pods; 3 matching pods; 3 history pods; 1 worker pod;
Hi Wen Quan,
Thank you for the setup recommendation; we will try out these settings + 10k workflows and report back to this forum once it's done.
make sum(rate(poll_success_sync[1m]))/sum(rate(poll_success[1m])) as close to 100% as possible by increasing the MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers
For the poll_success_sync / poll_success ratio, we did try to increase the poller settings, but it degraded performance; we suspect this is due to the RPS limit configuration in the Temporal server's services. Could you please enlighten us on how to correlate the RPS limits between the Temporal services, so we can figure out better how to set up the RPS limits? Is there any documentation on tuning the Temporal services via dynamic config?
assuming my understanding is correct: per workflow there will be 18 normal activities, and the target is 100 workflows per sec → this translates to roughly 5K DB transactions per sec
Last question from your feedback: how do you estimate that 1 workflow with 18 normal activities, at 100 workflows per second, translates to 5K DB transactions per second?
Could you please enlighten us on how to correlate the RPS limits between the Temporal services, so we can figure out better how to set up the RPS limits? Is there any documentation on tuning the Temporal services via dynamic config?