Temporal throughput not improving

We are considering Temporal as a solution for our firm. As part of that, we are benchmarking Temporal's performance with a workflow design in which each parent workflow has 50 child workflows, and each child workflow calls 2 synchronous, sequential activities. The two activities together take around 30 ms.

We will be benchmarking workloads of up to 600K parent workflows.

Deployed resources:
Temporal services

  1. History - 8 instances with 8 cores, 8GB memory each

  2. Frontend - 8 instances with 6 cores, 4GB memory each

  3. Matching - 6 instances with 4 cores, 4GB memory each

  4. No Server workers

  5. Cassandra as persistent store - Multi node cluster with sufficient resources

  6. ES for visibility - Multi node cluster with sufficient resources

Application workers are deployed separately for workflows (parent and child workflow workers run on the same instance) and for activities:

  1. Workflow services - 10 instances with 3 cores, 4GB memory each
  2. Activity services - 15 instances with 3 cores, 5GB memory each

We tried multiple workload volumes; two of them are:

  1. 14K parent workflows, each with 50 child workflows
    Results/Findings:
  • Parent workflow throughput: ~12 wf/sec
  • Child workflow throughput: ~150 wf/sec
  • Total run time: 70 minutes to complete all parents
  2. 100K parent workflows, each with 50 child workflows
    Results/Findings:
  • Parent workflow throughput: ~2 wf/sec
  • Child workflow throughput: ~145 wf/sec (only around 470K completed)
  • Total run time: ~9 hours to complete all parents
  • Throughput was lower towards the end of the run, and some workflows took a very long time to complete.

Expectation:
Parent workflow throughput: ~170 wf/sec
Child workflow throughput: ~4000 wf/sec

Queries:

  1. Is the expected throughput achievable?
  2. Is there any workflow/activity design change that would help achieve this throughput? Would converting the parent-child design into a single workflow that just calls activities help?
  3. We have tried multiple combinations of vertical and horizontal scaling of the Temporal services and of the workflow and activity services, without seeing a significant throughput improvement. Is there a known bottleneck?
  4. What is a desirable combination of the number of task queues, their partitions, and the number of workers polling each task queue, for workflows and for activities?
  5. We have tried changing the following default settings:
    WorkerOptions.setMaxConcurrentActivityExecutionSize - 200 to 2000
    WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize - 200 to 2000
    WorkerOptions.setWorkflowPollThreadCount - 2 to 10
    WorkerOptions.setActivityPollThreadCount - 5 to 20
    We have not seen a significant throughput improvement. What is a suitable configuration for this load?
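
For context on the settings in query 5, here is a rough estimate of how many activity executions would be in flight at the target rate, using Little's law (average in-flight work = arrival rate × time in flight). This is only a sketch: it assumes the ~30 ms per child workflow is spent entirely in activity execution and ignores scheduling, polling, and persistence latency. The class and method names are illustrative and not part of any Temporal API.

```java
public class ActivityConcurrencyEstimate {

    // Little's law: average number of executions in flight
    // = arrival rate (per second) * time each execution is in flight (seconds).
    static double requiredConcurrency(double perSecond, double secondsEach) {
        return perSecond * secondsEach;
    }

    public static void main(String[] args) {
        double targetChildPerSec = 4000.0;      // expected child workflow throughput
        double activitySecondsPerChild = 0.030; // ~30 ms for both activities combined
        int activityInstances = 15;             // deployed activity services

        double total = requiredConcurrency(targetChildPerSec, activitySecondsPerChild);
        double perInstance = total / activityInstances;

        System.out.printf("activities in flight (total): %.0f%n", total);
        System.out.printf("activities in flight (per instance): %.1f%n", perInstance);
    }
}
```

Under these assumptions the target needs roughly 120 activities in flight, about 8 per activity instance, which is well below the default limit of 200. That would be consistent with raising the limit to 2000 making little difference, although it does not by itself identify where the bottleneck is.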
  1. what is the size of your Cassandra cluster?

  2. child workflows are comparatively more expensive than activities.

  3. make sure numHistoryShards is large enough; try 16K as a start. ref: temporal/development.yaml at v1.9.2 · temporalio/temporal · GitHub

  4. closely monitor the CPU / memory utilization of your setup. i guess the existing capacity is not enough (check this after changing the number of shards, see 3)

  5. the parent to child ratio is 1:50, so i would expect the Expectation section to follow the same ratio?

  6. it may be worth coming to our slack channel to talk about your workflow design

  7. we also have cloud, if you are interested
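
The ratio question in point 5 can be checked with simple arithmetic on the numbers from the Expectation section. A minimal sketch, assuming every parent starts exactly 50 children; the class and method names are illustrative only.

```java
public class FanOutRatioCheck {

    // Child rate implied by a given parent rate at a fixed fan-out.
    static long impliedChildRate(long parentPerSec, int childrenPerParent) {
        return parentPerSec * childrenPerParent;
    }

    // Parent rate implied by a given child rate at a fixed fan-out.
    static double impliedParentRate(double childPerSec, int childrenPerParent) {
        return childPerSec / childrenPerParent;
    }

    public static void main(String[] args) {
        int childrenPerParent = 50;

        // 170 parents/sec would require this many child workflows per second.
        System.out.println("children/sec for 170 parents/sec: "
                + impliedChildRate(170, childrenPerParent));

        // 4000 children/sec corresponds to this many parents per second.
        System.out.println("parents/sec for 4000 children/sec: "
                + impliedParentRate(4000.0, childrenPerParent));
    }
}
```

At a 1:50 fan-out, 170 parents/sec implies 8500 children/sec, while 4000 children/sec implies 80 parents/sec, so the two stated targets cannot both describe the same steady state.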