We are evaluating Temporal as a solution for our firm and, as part of that, are benchmarking its performance. In our workflow design, each parent workflow starts 50 child workflows, and each child workflow calls 2 synchronous activities in sequence. The two activities together take around 30 ms.
We plan to benchmark workloads of up to 600K parent workflows.
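For a sense of scale, the workload can be sized with simple arithmetic. The sketch below is only illustrative: the class and method names are made up for this post, and the inputs are the figures quoted above (50 child workflows per parent, 2 activities per child).

```java
// Back-of-envelope sizing of the benchmark workload.
// Class and method names are illustrative, not part of any Temporal API.
final class WorkloadSizing {
    static final int CHILDREN_PER_PARENT = 50;
    static final int ACTIVITIES_PER_CHILD = 2;

    // Total child workflows started by the given number of parents.
    static long totalChildWorkflows(long parents) {
        return parents * CHILDREN_PER_PARENT;
    }

    // Total activity executions across all child workflows.
    static long totalActivities(long parents) {
        return totalChildWorkflows(parents) * ACTIVITIES_PER_CHILD;
    }

    public static void main(String[] args) {
        long[] runs = {14_000L, 100_000L, 600_000L};
        for (long parents : runs) {
            System.out.printf("%d parents -> %d child workflows, %d activities%n",
                    parents, totalChildWorkflows(parents), totalActivities(parents));
        }
    }
}
```

At the 600K target this works out to 30 million child workflows and 60 million activity executions.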
Deployed resources:
Temporal services:
- History - 8 instances with 8 cores, 8GB memory each
- Frontend - 8 instances with 6 cores, 4GB memory each
- Matching - 6 instances with 4 cores, 4GB memory each
- No server workers
- Cassandra as persistent store - multi-node cluster with sufficient resources
- ES for visibility - multi-node cluster with sufficient resources
Application workers are deployed separately for workflows and activities (the parent and child workflow workers run on the same instances):
- Workflow services - 10 instances with 3 cores, 4GB memory each
- Activity services - 15 instances with 3 cores, 5GB memory each
We tried several workload volumes. Two of them:
- 14K parent workflows, each with 50 child workflows
  Results/Findings:
  - Parent workflow throughput: ~12 wf/sec
  - Child workflow throughput: ~150 wf/sec
  - Total run time: 70 minutes to complete all parents
- 100K parent workflows, each with 50 child workflows
  Results/Findings:
  - Parent workflow throughput: ~2 wf/sec
  - Child workflow throughput: ~145 wf/sec (only around 470K completed)
  - Total run time: ~9 hours to complete all parents
  - Throughput dropped toward the end of the run, and some workflows took a very long time to complete.
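As a cross-check, the average throughput over a whole run can be derived from the run size and total run time. This is a minimal sketch with made-up names, using only the run sizes and durations listed above. The reported rates may have been measured over a different window (for example a steady-state interval), so they need not match these whole-run averages.

```java
// Average throughput over a whole run: completed items / elapsed seconds.
// Names are illustrative; inputs are the run sizes and durations reported above.
final class AverageThroughput {
    static double perSecond(long completed, double elapsedMinutes) {
        return completed / (elapsedMinutes * 60.0);
    }

    public static void main(String[] args) {
        // Run 1: 14K parents x 50 children in 70 minutes.
        System.out.printf("run 1 parents/sec:  %.1f%n", perSecond(14_000L, 70));
        System.out.printf("run 1 children/sec: %.1f%n", perSecond(14_000L * 50, 70));
        // Run 2: 100K parents in about 9 hours (540 minutes).
        System.out.printf("run 2 parents/sec:  %.1f%n", perSecond(100_000L, 540));
    }
}
```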
Expectations:
- Parent workflow throughput: ~170 wf/sec
- Child workflow throughput: ~4000 wf/sec
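The two targets can be related to each other through the 50-children-per-parent fan-out. The sketch below (illustrative names, not a Temporal API) shows the child rate implied by a given parent rate, the parent rate implied by a given child rate, and how long the 600K run would take at the target parent rate. It assumes every child of a parent must complete before the parent does.

```java
// Relates parent and child throughput targets through the fan-out factor.
// Names are illustrative; 50 children per parent is taken from the design above.
final class TargetCheck {
    static final int CHILDREN_PER_PARENT = 50;

    static double impliedChildRate(double parentsPerSecond) {
        return parentsPerSecond * CHILDREN_PER_PARENT;
    }

    static double impliedParentRate(double childrenPerSecond) {
        return childrenPerSecond / CHILDREN_PER_PARENT;
    }

    // Minutes needed to complete the given number of parents at a steady rate.
    static double minutesToComplete(long parents, double parentsPerSecond) {
        return parents / parentsPerSecond / 60.0;
    }

    public static void main(String[] args) {
        System.out.printf("170 parents/sec implies %.0f children/sec%n", impliedChildRate(170));
        System.out.printf("4000 children/sec implies %.0f parents/sec%n", impliedParentRate(4000));
        System.out.printf("600K parents at 170/sec take %.1f minutes%n",
                minutesToComplete(600_000L, 170));
    }
}
```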
Questions:
- Is the expected throughput achievable?
- Would any change to the workflow/activity design help achieve this throughput? Would converting the parent-child design into a single workflow that just calls the activities help?
- We have tried multiple combinations of vertical and horizontal scaling of the Temporal services and of the workflow and activity workers, and have not seen a significant throughput improvement. Is there a known bottleneck?
- What is a desirable combination of number of task queues, partitions per task queue, and workers polling each task queue, for workflows and for activities?
- We have tried changing the following default settings:
  WorkerOptions.setMaxConcurrentActivityExecutionSize - 200 to 2000
  WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize - 200 to 2000
  WorkerOptions.setWorkflowPollThreadCount - 2 to 10
  WorkerOptions.setActivityPollThreadCount - 5 to 20
  We have not seen a significant throughput improvement. What is a suitable configuration for such load?
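One rough way to reason about the execution-size settings is Little's law: average concurrency equals arrival rate times time spent per item. The sketch below applies it to the activity side, assuming the target of 4000 child workflows per second, about 30 ms of activity time per child, and the 15 activity worker instances listed earlier. The names are illustrative, and the estimate ignores scheduling and queueing latency, so it is a lower bound rather than a sizing recommendation.

```java
// Little's law estimate of concurrent activity executions.
// Names are illustrative; inputs are the targets and timings quoted in this post.
final class SlotEstimate {
    // Average concurrency = arrival rate (per second) * time per item (seconds).
    static double concurrentExecutions(double ratePerSecond, long millisEach) {
        return ratePerSecond * millisEach / 1000.0;
    }

    // Concurrent executions each instance must hold if load is spread evenly.
    static int perInstance(double totalConcurrent, int instances) {
        return (int) Math.ceil(totalConcurrent / instances);
    }

    public static void main(String[] args) {
        double total = concurrentExecutions(4000, 30); // 4000 children/sec, 30 ms of activities each
        System.out.printf("concurrent activity executions: %.0f%n", total);
        System.out.printf("per activity worker instance:   %d%n", perInstance(total, 15));
    }
}
```

Under these assumptions the required concurrency per instance is well below the default of 200, which would be consistent with seeing no improvement from raising the execution sizes to 2000.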