Bad performance when deployed in Kubernetes - How to diagnose bottleneck?

Hi Temporal team,

We are currently doing a PoC with Temporal, and we would like to measure how many transactions per second Temporal can support at most.
So far the performance has not been good, and we need your help to understand where the issues could come from and how to figure out where the bottleneck is.

Our approach is to leverage remote activities, using task queues to communicate between our microservices. For this perf test, we have several Spring Boot applications:

  • The main one hosts the workflow with collocated activities
  • The three other Spring Boot apps host activities only (no workflows)

Task queues are used to perform remote execution from the workflow implementation.
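For illustration, here is a simplified sketch of one of the activity-only apps; all names (task queue, service, classes, the frontend endpoint) are placeholders, not our real ones:

```java
// Simplified sketch of one of the "activities only" Spring Boot apps.
// All names (task queue, service, classes) are placeholders.
import io.temporal.activity.ActivityInterface;
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@ActivityInterface
interface ServiceAActivities {
    String handle(String input);
}

class ServiceAActivitiesImpl implements ServiceAActivities {
    private static final Logger log = LoggerFactory.getLogger(ServiceAActivitiesImpl.class);

    @Override
    public String handle(String input) {
        // Empty shell: only logging, to isolate the Temporal/gRPC overhead.
        log.info("handled {}", input);
        return input;
    }
}

public class ServiceAWorkerApp {
    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
                WorkflowServiceStubsOptions.newBuilder()
                        .setTarget("temporal-frontend:7233") // in-cluster frontend service
                        .build());
        WorkflowClient client = WorkflowClient.newInstance(service);

        // This app registers activity implementations only -- no workflow types.
        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("SERVICE_A_TASK_QUEUE");
        worker.registerActivitiesImplementations(new ServiceAActivitiesImpl());
        factory.start();
    }
}
```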

Our workflow is sketched below (this is a first version; we would like to test a second version with child workflows for request processing once the performance is OK with this version).
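Simplified, with placeholder names and timeouts, and reusing the ServiceAActivities interface from the sketch above, the structure is along these lines:

```java
// Simplified workflow sketch; timeouts and names are illustrative only.
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface LocalActivities {
    String enrich(String requestId);
}

@WorkflowInterface
public interface RequestProcessingWorkflow {
    @WorkflowMethod
    String processRequest(String requestId);
}

class RequestProcessingWorkflowImpl implements RequestProcessingWorkflow {

    // Collocated activities, registered in the same Spring Boot app as the workflow.
    private final LocalActivities local = Workflow.newActivityStub(
            LocalActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(10))
                    .build());

    // Remote activities: the stub is bound to the task queue polled by the service A app.
    private final ServiceAActivities serviceA = Workflow.newActivityStub(
            ServiceAActivities.class,
            ActivityOptions.newBuilder()
                    .setTaskQueue("SERVICE_A_TASK_QUEUE")
                    .setStartToCloseTimeout(Duration.ofSeconds(10))
                    .build());

    @Override
    public String processRequest(String requestId) {
        String enriched = local.enrich(requestId);      // runs in this pod
        return serviceA.handle(enriched);               // runs in the service A pod
    }
}
```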

First results:
At less than 20 TPS, the 90th percentile is around 2.6 seconds. Many workflow executions take less than 400 ms, but at some point workflows seem to get "stuck" and take more than 50 seconds to execute.
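For reference, here is a hedged sketch of how end-to-end latency per workflow execution can be measured from the client side (simplified, not our actual load driver; class and queue names are placeholders):

```java
// Hedged sketch of a per-request measurement; a load test would fire many of
// these in parallel at a target rate.
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import java.util.UUID;

public class LatencyProbe {
    public static void main(String[] args) {
        // Frontend target omitted for brevity; same connection settings as the apps above.
        WorkflowClient client = WorkflowClient.newInstance(WorkflowServiceStubs.newInstance());

        RequestProcessingWorkflow workflow = client.newWorkflowStub(
                RequestProcessingWorkflow.class,
                WorkflowOptions.newBuilder()
                        .setTaskQueue("MAIN_TASK_QUEUE")
                        .setWorkflowId("perf-" + UUID.randomUUID())
                        .build());

        long start = System.nanoTime();
        workflow.processRequest("payload");  // typed stub call blocks until the workflow completes
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("end-to-end latency: " + elapsedMs + " ms");
    }
}
```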

In addition, there are a lot of DB errors such as:

{"level":"error","ts":"2021-12-02T17:59:51.222Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"shard-id":547,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:220"}

{"level":"error","ts":"2021-12-02T17:59:50.348Z","msg":"Error refreshing namespace cache","service":"worker","error":"GetMetadata operation failed. Error: driver: bad connection","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"}

{"level":"error","ts":"2021-12-02T17:59:48.062Z","msg":"Membership upsert failed.","service":"worker","error":"UpsertClusterMembership operation failed. Error: EOF","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}

Please find hereafter a description of our setup:

  • Activities are empty shells – we just perform some logging – since our goal is to measure the overhead of Temporal (mainly gRPC calls)
  • Temporal and microservices are deployed in a Kubernetes cluster
    • 1 pod for each spring boot app
      • The one that instantiates workflows:
        requests: cpu: 200m memory: 512Mi limits: cpu: 4000m memory: 6Gi
      • The other "activities only" Spring Boot apps:
        requests: cpu: 200m memory: 512Mi limits: cpu: 200m memory: 512Mi
    • 2 frontend pods
      requests: cpu: 500m memory: 128Mi limits: cpu: 1000m memory: 256Mi
    • 3 history pods
      requests: cpu: 500m memory: 512Mi limits: cpu: 1000m memory: 512Mi
    • 2 matching pods
      requests: cpu: 500m memory: 256Mi limits: cpu: 674m memory: 256Mi
    • 1 worker pod
      requests: cpu: 250m memory: 128Mi limits: cpu: 500m memory: 356Mi
    • 1 web pod
      requests: cpu: 300m memory: 200Mi limits: cpu: 300m memory: 200Mi
    • 1 postgresql pod
      maxConnections: "1000" resources: requests: cpu: 200m memory: 1Gi limits: cpu: 4000m memory: 3Gi
      • Connection pools settings:
        • default
          maxConns: 20 maxConnLifetime: "1h"
        • visibility:
          maxConns: 5 maxConnLifetime: "1h"

Any help or advice would be greatly appreciated. Feel free to ask for further information if I have not been clear enough or if some important information is missing.

Thanks a lot!


@SebastienJ, I think you did a great job writing up your question, and I'm sad that no one seems to have responded.

I ran into a similar situation (running the whole dev instance on a single cluster), and we couldn't figure out where the bottleneck was.
Some of that is due to a lack of ops experience with k8s, and some to a lack of familiarity with the internals of Temporal.

If I learn anything, I’m happy to share it here, unless there is a place where the performance management folks hang out.

We try to get to every single question both here and on our Slack, but it seems this is one of the few questions that got lost in the mix; apologies for that. If your question does not get a response for a while, feel free to "bump" it by posting a "hey, I'm still here :slight_smile:" type of reply, or feel free to send me a private reminder message here as well.

@SebastienJ can you please let us know if your question is still outstanding? Thanks!