Navigating the internals of the workflow lifecycle

Hi, I am working to build a very high-throughput system based on Temporal. Currently, per the requirements, I am testing Temporal with Postgres (I will explore the same with Cassandra as well).

I have implemented a simple workflow with 4 activities, and I want to know the exact steps the Temporal server executes for an entire workflow execution, especially all the DB interactions taking place. I tried enabling the Postgres debug log, but that is too noisy.

Is there a recommended way to find this information?

I am also doing performance testing of my workflow. Are there any established performance statistics for Postgres? I found scattered documentation mentioning some of the tuning points, but is there a consolidated list?

Thanks in advance…

Define “very high throughput”. Postgres is not the right technology for really high-throughput scenarios, as it doesn’t scale horizontally.

Enable SDK and service metrics before any performance testing.
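For the Java SDK, wiring up metrics looks roughly like this. This is a sketch following the Micrometer/Prometheus reporter pattern from the SDK's metrics samples, not something taken from this thread; class names and the `newServiceStubs` call may differ across SDK versions:

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import com.uber.m3.util.Duration;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsSetup {
  public static WorkflowServiceStubs stubsWithMetrics() {
    // Export SDK metrics (poll counts, schedule-to-start latency, etc.)
    // through a Prometheus-backed Micrometer registry.
    PrometheusMeterRegistry registry =
        new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    Scope scope = new RootScopeBuilder()
        .reporter(new MicrometerClientStatsReporter(registry))
        .reportEvery(Duration.ofSeconds(1));
    return WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setMetricsScope(scope)
            .build());
  }
}
```

The `registry` can then be scraped by Prometheus to see where time is spent before and during the load test.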

The Performance Testing Temporal Presentation from Replay Conference might be helpful.

I understand @maxim, but my plan is to see how far we can go with Postgres, which will help us advise clients on when to go for Cassandra. Our initial target is 500 TPS with Postgres (is that not possible?); please let me know if my expectation is wrong. Once this is achieved with Postgres, we will migrate to Cassandra for more TPS.

However, as I mentioned, apart from performance tuning, please advise on how to learn the exact steps the Temporal server executes for an entire workflow execution, especially all the DB interactions taking place.

Also, thanks for the performance tuning link :slight_smile:

I have set up a single-node Temporal server and a single-node worker client backed by Postgres. It is an 8-core machine with an SSD, and I want to achieve 500 TPS on a workflow with 3 activities, each making a REST API call (mainly I/O operations). The REST APIs are hosted on the same network but on a different machine, and they can provide more than 700 TPS on their own.
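The workflow shape described above (three sequential activities, each wrapping a REST call) can be sketched as follows. All interface, method, and option names here are illustrative, not taken from the actual code under test:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface RestActivities {
  // Each method would issue one REST call to the downstream service.
  String callServiceA(String input);
  String callServiceB(String input);
  String callServiceC(String input);
}

@WorkflowInterface
interface ThreeStepWorkflow {
  @WorkflowMethod
  String run(String input);
}

class ThreeStepWorkflowImpl implements ThreeStepWorkflow {
  private final RestActivities activities = Workflow.newActivityStub(
      RestActivities.class,
      ActivityOptions.newBuilder()
          // Timeout chosen for illustration; tune to the REST calls' latency.
          .setStartToCloseTimeout(Duration.ofSeconds(10))
          .build());

  @Override
  public String run(String input) {
    // Three sequential activity invocations, each an I/O-bound REST call.
    String a = activities.callServiceA(input);
    String b = activities.callServiceB(a);
    return activities.callServiceC(b);
  }
}
```

Since the activities are sequential, each workflow execution needs at least three activity-task round trips through the server, which matters when interpreting the TPS numbers below.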
Below are my findings:

  • the network is healthy (under a 500 TPS load)
  • the DB is healthy, but I can see only 4-6 active connections, with 50 idle connections at all times
  • irrespective of load (I tried with 10, 100, and 500 concurrent requests), I am getting 20 TPS consistently, with at most 300% CPU utilization
  • at a load of 10 requests/second the average response time is 300-400 ms, whereas it goes up to 4 s at 500 requests/second

Below is my configuration:

  • Server Level

    • history.numTasklistPartitions: 64
    • history.persistence.numHistoryShards: 32
    • history.defaultWorkflowTaskTimeout: "10s"
    • matching.numTasklistPartitions: 64
    • worker.taskQueue.activitiesPerSecond: 500
    • worker.taskQueue.activitiesPerTaskQueue: 500
    • worker.maxConcurrentActivityExecutionSize: 1000
    • worker.maxConcurrentWorkflowTaskExecutionSize: 1000
    • persistence.sql.maxConns: 250
    • persistence.sql.maxIdleConns: 10
    • persistence.sql.maxOpenConns: 200
  • Worker (Client) Level (refer: community forum)

    • WorkerOptions#workflowPollThreadCount: 40
    • WorkerOptions#activityPollThreadCount: 80
    • WorkerOptions#maxConcurrentWorkflowTaskExecutionSize: 20
    • WorkerOptions#maxConcurrentActivityExecutionSize: 40
    • WorkerFactoryOptions#maxWorkflowThreadCount: 200
    • WorkerFactoryOptions#workflowCacheSize: 20
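In the Java SDK, the worker-level settings above correspond to builder calls along these lines. This is a sketch mirroring the listed values; the setter names follow older SDK releases and may be renamed or deprecated in current versions:

```java
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerTuning {
  // Per-worker limits: poller thread counts and concurrent task slots.
  static final WorkerOptions WORKER_OPTIONS = WorkerOptions.newBuilder()
      .setWorkflowPollThreadCount(40)
      .setActivityPollThreadCount(80)
      .setMaxConcurrentWorkflowTaskExecutionSize(20)
      .setMaxConcurrentActivityExecutionSize(40)
      .build();

  // Factory-wide limits: workflow thread pool and sticky cache size.
  static final WorkerFactoryOptions FACTORY_OPTIONS =
      WorkerFactoryOptions.newBuilder()
          .setMaxWorkflowThreadCount(200)
          .setWorkflowCacheSize(20)
          .build();
}
```

Note that `maxConcurrentActivityExecutionSize` caps how many activities a single worker runs at once, so with 3 I/O-bound activities per workflow, a slot limit of 40 constrains the achievable TPS of this one-worker setup.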

It seems that the latency is somewhere between the Temporal server and the client (at the task-queue level?).

  1. Is my configuration enough to support 500 TPS, or am I missing something? Actually, I want to measure how much TPS I can get with this single-node setup (after appropriate tuning) so that I can extrapolate accordingly.
  2. Why is the number of active DB connections always 4-6, with 50 idle connections, regardless of load, when I have configured a shard count of 32?

Please advise.