Bad performance when deployed in Kubernetes - How to diagnose bottleneck?

Hi Temporal team,

We are currently doing a PoC with Temporal, and we would like to measure how many transactions per second Temporal can support at most.
So far the performance has not been good, and we need your help to understand where the issues could come from and how to figure out where the bottleneck is.

Our approach is to leverage remote activities, using task queues to communicate between our microservices. For this perf test, we have several Spring Boot applications:

  • The main one hosts the workflow with collocated activities
  • The three other Spring Boot apps host activities only (no workflows)

Task queues are used to perform remote execution from the workflow implementation.
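For illustration, here is a simplified sketch of one of the activity-only apps; all names (task queue, service, classes, the frontend endpoint) are placeholders, not our real ones:

```java
// Simplified sketch of one of the "activities only" Spring Boot apps.
// All names (task queue, service, classes) are placeholders.
import io.temporal.activity.ActivityInterface;
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@ActivityInterface
interface ServiceAActivities {
    String handle(String input);
}

class ServiceAActivitiesImpl implements ServiceAActivities {
    private static final Logger log = LoggerFactory.getLogger(ServiceAActivitiesImpl.class);

    @Override
    public String handle(String input) {
        // Empty shell: only logging, to isolate the Temporal/gRPC overhead.
        log.info("handled {}", input);
        return input;
    }
}

public class ServiceAWorkerApp {
    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
                WorkflowServiceStubsOptions.newBuilder()
                        .setTarget("temporal-frontend:7233") // in-cluster frontend service
                        .build());
        WorkflowClient client = WorkflowClient.newInstance(service);

        // This app registers activity implementations only -- no workflow types.
        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("SERVICE_A_TASK_QUEUE");
        worker.registerActivitiesImplementations(new ServiceAActivitiesImpl());
        factory.start();
    }
}
```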

Our workflow is sketched below (this is a first version; we would like to test a second version with child workflows for request processing once the performance is OK with this version).
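Simplified, with placeholder names and timeouts, and reusing the ServiceAActivities interface from the sketch above, the structure is along these lines:

```java
// Simplified workflow sketch; timeouts and names are illustrative only.
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface LocalActivities {
    String enrich(String requestId);
}

@WorkflowInterface
public interface RequestProcessingWorkflow {
    @WorkflowMethod
    String processRequest(String requestId);
}

class RequestProcessingWorkflowImpl implements RequestProcessingWorkflow {

    // Collocated activities, registered in the same Spring Boot app as the workflow.
    private final LocalActivities local = Workflow.newActivityStub(
            LocalActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(10))
                    .build());

    // Remote activities: the stub is bound to the task queue polled by the service A app.
    private final ServiceAActivities serviceA = Workflow.newActivityStub(
            ServiceAActivities.class,
            ActivityOptions.newBuilder()
                    .setTaskQueue("SERVICE_A_TASK_QUEUE")
                    .setStartToCloseTimeout(Duration.ofSeconds(10))
                    .build());

    @Override
    public String processRequest(String requestId) {
        String enriched = local.enrich(requestId);      // runs in this pod
        return serviceA.handle(enriched);               // runs in the service A pod
    }
}
```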

First results:
At less than 20 TPS, the 90th percentile is around 2.6 seconds. Many workflow executions take less than 400 ms, but at some point workflows seem to get "stuck" and take more than 50 seconds to execute.
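For reference, here is a hedged sketch of how end-to-end latency per workflow execution can be measured from the client side (simplified, not our actual load driver; class and queue names are placeholders):

```java
// Hedged sketch of a per-request measurement; a load test would fire many of
// these in parallel at a target rate.
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import java.util.UUID;

public class LatencyProbe {
    public static void main(String[] args) {
        // Frontend target omitted for brevity; same connection settings as the apps above.
        WorkflowClient client = WorkflowClient.newInstance(WorkflowServiceStubs.newInstance());

        RequestProcessingWorkflow workflow = client.newWorkflowStub(
                RequestProcessingWorkflow.class,
                WorkflowOptions.newBuilder()
                        .setTaskQueue("MAIN_TASK_QUEUE")
                        .setWorkflowId("perf-" + UUID.randomUUID())
                        .build());

        long start = System.nanoTime();
        workflow.processRequest("payload");  // typed stub call blocks until the workflow completes
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("end-to-end latency: " + elapsedMs + " ms");
    }
}
```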

In addition, there are a lot of DB errors such as:

{"level":"error","ts":"2021-12-02T17:59:51.222Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"shard-id":547,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:220"}

{"level":"error","ts":"2021-12-02T17:59:50.348Z","msg":"Error refreshing namespace cache","service":"worker","error":"GetMetadata operation failed. Error: driver: bad connection","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"}

{"level":"error","ts":"2021-12-02T17:59:48.062Z","msg":"Membership upsert failed.","service":"worker","error":"UpsertClusterMembership operation failed. Error: EOF","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}

Please find hereafter a description of our setup:

  • Activities are empty shells – we just perform some logging – since our goal is to measure the overhead of Temporal (mainly gRPC calls)
  • Temporal and microservices are deployed in a Kubernetes cluster
    • 1 pod for each spring boot app
      • The one that instantiates workflows:
        requests: cpu: 200m memory: 512Mi limits: cpu: 4000m memory: 6Gi
      • The other "activities only" Spring Boot apps:
        requests: cpu: 200m memory: 512Mi limits: cpu: 200m memory: 512Mi
    • 2 frontend pods
      requests: cpu: 500m memory: 128Mi limits: cpu: 1000m memory: 256Mi
    • 3 history pods
      requests: cpu: 500m memory: 512Mi limits: cpu: 1000m memory: 512Mi
    • 2 matching pods
      requests: cpu: 500m memory: 256Mi limits: cpu: 674m memory: 256Mi
    • 1 worker pod
      requests: cpu: 250m memory: 128Mi limits: cpu: 500m memory: 356Mi
    • 1 web pod
      requests: cpu: 300m memory: 200Mi limits: cpu: 300m memory: 200Mi
    • 1 postgresql pod
      maxConnections: "1000" resources: requests: cpu: 200m memory: 1Gi limits: cpu: 4000m memory: 3Gi
      • Connection pools settings:
        • default
          maxConns: 20 maxConnLifetime: "1h"
        • visibility:
          maxConns: 5 maxConnLifetime: "1h"

Any help or advice would be greatly appreciated. Feel free to ask for further information if I have not been clear enough or if some important information is missing.

Thanks a lot!


@SebastienJ, I think you did a great job writing up your question, and I'm sad that no one seems to have responded.

I ran into a similar situation (running the whole dev instance on a single cluster), and we couldn't figure out where the bottleneck was.
Some of that is due to a lack of ops experience with k8s, and some to a lack of familiarity with the internals of Temporal.

If I learn anything, I’m happy to share it here, unless there is a place where the performance management folks hang out.

We try to get to every single question both here and on our Slack, but it seems this is one of the few questions that got lost in the mix; apologies for that. If your question does not get a response for a while, feel free to "bump" it by posting a "hey, I'm still here :slight_smile:" type of reply, or feel free to send me a private reminder message here as well.

@SebastienJ can you please let us know if your question is still outstanding? Thanks!