Navigating through the internals of the workflow lifecycle

I have set up a single-node Temporal server and a single-node worker (client), backed by Postgres. It is an 8-core machine with an SSD, and I want to achieve 500 TPS on a workflow with 3 activities, each of which makes a REST API call (almost entirely I/O). The REST APIs are hosted on the same network but on a different machine, and they can deliver more than 700 TPS on their own.
Below are my findings:

  • The network is healthy (under a 500 TPS load).
  • The DB is healthy, but I only ever see 4-6 active connections, with 50 idle connections.
  • Irrespective of load (I tried 10, 100, and 500 concurrent requests), I consistently get 20 TPS, with a maximum of 300% CPU utilization.
  • At a load of 10 requests/second the average response time is 300-400 ms, whereas it rises to 4 s at 500 requests/second.
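
As a rough sanity check on these numbers, here is a minimal sketch that applies Little's Law (average in-flight work = throughput × average latency) to the figures above. The class and method names are hypothetical, and the 0.35 s value is simply the midpoint of the 300-400 ms range I measured. It assumes the system is in a steady state, which may not hold exactly during a load test.

```java
public class LittlesLawCheck {
    // Little's Law: average in-flight requests = throughput * average latency.
    static double inFlight(double throughputPerSec, double latencySec) {
        return throughputPerSec * latencySec;
    }

    // Rearranged: throughput that a fixed number of in-flight slots can sustain.
    static double maxThroughput(double slots, double latencySec) {
        return slots / latencySec;
    }

    public static void main(String[] args) {
        double observedTps = 20.0;          // measured throughput
        double latencyUnderLoadSec = 4.0;   // measured at 500 requests/second
        double latencyLightLoadSec = 0.35;  // assumed midpoint of 300-400 ms

        // Workflows that appear to be in flight at the observed ceiling.
        System.out.println("Implied in-flight workflows: "
                + inFlight(observedTps, latencyUnderLoadSec));

        // In-flight workflows needed to reach the target at light-load latency.
        System.out.println("In-flight workflows needed for 500 TPS: "
                + inFlight(500.0, latencyLightLoadSec));
    }
}
```

If the implied in-flight count stays roughly constant while latency grows with load, that would be consistent with a fixed concurrency limit somewhere in the path rather than with a saturated CPU, network, or database.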

Below is my configuration:

  • Server Level

    • history.numTasklistPartitions: 64
    • history.persistence.numHistoryShards: 32
    • history.defaultWorkflowTaskTimeout: "10s"
    • matching.numTasklistPartitions: 64
    • worker.taskQueue.activitiesPerSecond: 500
    • worker.taskQueue.activitiesPerTaskQueue: 500
    • worker.maxConcurrentActivityExecutionSize: 1000
    • worker.maxConcurrentWorkflowTaskExecutionSize: 1000
    • persistence.sql.maxConns: 250
    • persistence.sql.maxIdleConns: 10
    • persistence.sql.maxOpenConns: 200
  • Worker (Client) Level (refer: community forum)

    • WorkerOptions#workflowPollThreadCount: 40
    • WorkerOptions#activityPollThreadCount: 80
    • WorkerOptions#maxConcurrentWorkflowTaskExecutionSize: 20
    • WorkerOptions#maxConcurrentActivityExecutionSize: 40
    • WorkerFactoryOptions#maxWorkflowThreadCount: 200
    • WorkerFactoryOptions#workflowCacheSize: 20
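
To relate these limits to the target, here is a small arithmetic sketch of the activity rate that 500 workflows per second would require. It assumes that the activitiesPerSecond value of 500 acts as a cap on activity executions per second for the whole task queue; I have not confirmed that this is how the server applies it. The class and method names are hypothetical.

```java
public class ActivityRateCheck {
    // Activity executions per second needed for a given workflow rate.
    static double requiredActivityRate(double workflowsPerSec, int activitiesPerWorkflow) {
        return workflowsPerSec * activitiesPerWorkflow;
    }

    // Workflow rate that a cap on activity executions per second would allow.
    static double workflowRateUnderCap(double activityCapPerSec, int activitiesPerWorkflow) {
        return activityCapPerSec / activitiesPerWorkflow;
    }

    public static void main(String[] args) {
        int activitiesPerWorkflow = 3;   // from the workflow described above
        double targetWorkflowTps = 500;  // target throughput
        double activityCap = 500;        // assumed meaning of activitiesPerSecond

        System.out.printf("Activities/s needed for target: %.0f%n",
                requiredActivityRate(targetWorkflowTps, activitiesPerWorkflow));
        System.out.printf("Workflows/s allowed by the cap: %.1f%n",
                workflowRateUnderCap(activityCap, activitiesPerWorkflow));
    }
}
```

Under that assumption, the cap alone would hold the workflow rate well below the target, although it would not by itself explain a ceiling as low as 20 TPS.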

It seems that the latency is introduced somewhere between the Temporal server and the client (at the task queue level?).

  1. Is my configuration enough to support 500 TPS, or am I missing something? What I really want is to measure how much TPS I can get from this single-node setup (after appropriate tuning) so that I can extrapolate accordingly.
  2. Why is the number of active DB connections always 4-6, with 50 idle connections, regardless of load, when I have configured a shard count of 32?

Please advise.