Bad performance when deployed in Kubernetes - How to diagnose bottleneck?

Hi Temporal team,

We are currently running a PoC with Temporal and would like to measure the maximum number of transactions per second (TPS) that Temporal can sustain.
So far, performance has been poor, and we would appreciate your help in understanding where the issues might come from and how to locate the bottleneck.

Our approach is to use remote activities, dispatched through task queues, to communicate between our microservices. For the performance test, we have several Spring Boot applications:

  • The main one hosts the workflow along with colocated activities
  • The other 3 Spring Boot apps host activities only (no workflow)

Task queues are used to perform remote execution from the workflow implementation.

Our workflow (this is a first version; once performance is acceptable with it, we would like to test a second version that uses child workflows for request processing)

First results:
At under 20 TPS, the 90th percentile latency is around 2.6 seconds. Many workflow executions take less than 400 ms, but at some point workflows appear to get "stuck" and take more than 50 seconds to complete.
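
For reference, here is a minimal sketch of how a percentile like the one above can be computed, assuming the nearest-rank method (our load-testing tool may use a different interpolation). The class name and the sample values are made up for illustration; they are not our actual measurements. It only shows how a few slow executions pull the tail up while the median stays low.

```java
import java.util.Arrays;

public class LatencyPercentile {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone(); // do not mutate the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical workflow latencies in milliseconds (illustration only).
        long[] samples = {300, 320, 350, 380, 390, 395, 398, 399, 2600, 50000};
        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p90 = " + percentile(samples, 90) + " ms");
        System.out.println("max = " + percentile(samples, 100) + " ms");
    }
}
```

With these made-up samples, the median is 390 ms while the 90th percentile is 2600 ms, which is the same shape as what we observe.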

In addition, the Temporal services log many DB errors such as:

{"level":"error","ts":"2021-12-02T17:59:51.222Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"shard-id":547,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:220"}

{"level":"error","ts":"2021-12-02T17:59:50.348Z","msg":"Error refreshing namespace cache","service":"worker","error":"GetMetadata operation failed. Error: driver: bad connection","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"}

{"level":"error","ts":"2021-12-02T17:59:48.062Z","msg":"Membership upsert failed.","service":"worker","error":"UpsertClusterMembership operation failed. Error: EOF","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}

Here is a description of our setup:

  • Activities are empty shells – we just perform some logging – since our goal is to measure the overhead of Temporal (mainly gRPC calls)
  • Temporal and microservices are deployed in a Kubernetes cluster
    • 1 pod for each Spring Boot app
      • The one that instantiates workflows:
        requests: { cpu: 200m, memory: 512Mi }, limits: { cpu: 4000m, memory: 6Gi }
      • The other activity-only Spring Boot apps:
        requests: { cpu: 200m, memory: 512Mi }, limits: { cpu: 200m, memory: 512Mi }
    • 2 frontend pods
      requests: { cpu: 500m, memory: 128Mi }, limits: { cpu: 1000m, memory: 256Mi }
    • 3 history pods
      requests: { cpu: 500m, memory: 512Mi }, limits: { cpu: 1000m, memory: 512Mi }
    • 2 matching pods
      requests: { cpu: 500m, memory: 256Mi }, limits: { cpu: 674m, memory: 256Mi }
    • 1 worker pod
      requests: { cpu: 250m, memory: 128Mi }, limits: { cpu: 500m, memory: 356Mi }
    • 1 web pod
      requests: { cpu: 300m, memory: 200Mi }, limits: { cpu: 300m, memory: 200Mi }
    • 1 postgresql pod
      maxConnections: "1000"
      requests: { cpu: 200m, memory: 1Gi }, limits: { cpu: 4000m, memory: 3Gi }
      • Connection pool settings:
        • default:
          maxConns: 20, maxConnLifetime: "1h"
        • visibility:
          maxConns: 5, maxConnLifetime: "1h"
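
As a sanity check on the connection settings above, here is a rough sketch of the worst-case number of PostgreSQL connections the Temporal services could open. It assumes that every frontend, history, matching and worker pod opens both the default and the visibility pool up to its maxConns; that is our assumption, not something we have verified. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectionBudget {
    // Worst-case connection count, assuming each pod fills both pools.
    public static int totalConnections(Map<String, Integer> podsPerService,
                                       int defaultMaxConns,
                                       int visibilityMaxConns) {
        int perPod = defaultMaxConns + visibilityMaxConns;
        int pods = 0;
        for (int count : podsPerService.values()) {
            pods += count;
        }
        return pods * perPod;
    }

    public static void main(String[] args) {
        // Pod counts taken from the setup described above.
        Map<String, Integer> pods = new LinkedHashMap<>();
        pods.put("frontend", 2);
        pods.put("history", 3);
        pods.put("matching", 2);
        pods.put("worker", 1);

        int total = totalConnections(pods, 20, 5); // default and visibility maxConns
        int serverLimit = 1000;                    // postgresql maxConnections
        System.out.println("worst case: " + total + " of " + serverLimit + " connections");
    }
}
```

Under that assumption, the 8 pods could open at most 200 connections, well below the 1000 allowed by PostgreSQL, so the server-side connection limit itself does not look like what we are hitting.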

Any help or advice would be greatly appreciated. Feel free to ask for further information if I have not been clear enough or if something important is missing.

Thanks a lot!