Low TPS (~60) Despite High Concurrency Settings - Need Help Identifying Bottleneck

Hi Temporal Community,

I’m experiencing unexpectedly low throughput (approximately 60 TPS) in my Temporal deployment, and scaling up pods doesn’t seem to improve performance. I’d appreciate any guidance on identifying the bottleneck.

Workflow Details:

  • Simple test workflow with 2 sequential activities (one logs “hello”, the other logs “world”)
  • Using the Temporal Go SDK v1.38.0

Deployment Architecture:

  • Running Temporal services in separate pods (not all-in-one)
  • 2 pods per service type: Frontend, History, Matching, Worker, UI, and worker-app (the worker-app pods hold the activity and workflow code)
  • Each service has its own dedicated pods for better isolation

Temporal Server Configuration (per service pod):

Frontend Service:

  • No extra configuration, just port-related settings (membershipPort, grpcPort, etc.)

History Service:

  • persistenceMaxQPS: 3000
  • persistenceGlobalMaxQPS: 0
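These are set through Temporal's dynamic config file; roughly like this (a sketch, assuming the standard dynamic config key names):

```yaml
# dynamic config sketch; key names follow Temporal's standard dynamic config
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
history.persistenceGlobalMaxQPS:
  - value: 0        # 0 = global limit disabled
    constraints: {}
```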

Matching Service:

  • No extra configuration, just port-related settings (membershipPort, grpcPort, etc.)

Persistence Layer:

  • History shards: 64
  • Database: PostgreSQL 12
  • Default store connections: maxConns=50, maxIdleConns=20, maxConnLifetime=1h
  • Visibility store connections: maxConns=400, maxIdleConns=100, maxConnLifetime=1h
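For context, the persistence block of our server config looks roughly like this (a sketch; the plugin name and store keys may differ slightly in your setup):

```yaml
persistence:
  numHistoryShards: 64
  defaultStore: default
  visibilityStore: visibility
  datastores:
    default:
      sql:
        pluginName: "postgres12"   # assumed plugin name for PostgreSQL 12+
        maxConns: 50
        maxIdleConns: 20
        maxConnLifetime: "1h"
    visibility:
      sql:
        pluginName: "postgres12"
        maxConns: 400
        maxIdleConns: 100
        maxConnLifetime: "1h"
```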

Worker Configuration (per worker pod):

  • MaxConcurrentWorkflowTaskExecutionSize: 1000
  • MaxConcurrentActivityExecutionSize: 1000
  • MaxConcurrentLocalActivityExecutionSize: 1000
  • MaxConcurrentWorkflowTaskPollers: 64
  • MaxConcurrentActivityTaskPollers: 64
  • WorkerActivitiesPerSecond: not configured
  • A single task queue is used and all workflows start on it; each of the two temporal-worker-app pods runs one worker with the configuration above
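Concretely, each worker pod creates its worker roughly like this (a sketch; the task queue name is a placeholder, and the option field names are from go.temporal.io/sdk/worker):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Dials the frontend service (address/namespace options omitted here).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "test-task-queue" is a placeholder; all workflows start on this one queue.
	w := worker.New(c, "test-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskExecutionSize:  1000,
		MaxConcurrentActivityExecutionSize:      1000,
		MaxConcurrentLocalActivityExecutionSize: 1000,
		MaxConcurrentWorkflowTaskPollers:        64,
		MaxConcurrentActivityTaskPollers:        64,
		// WorkerActivitiesPerSecond left at its default (no limit)
	})

	// The workflow and activities are registered here in the real code:
	// w.RegisterWorkflow(...); w.RegisterActivity(...)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```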

Issue:
Despite these high concurrency settings and having 2 pods for each service type, I’m only achieving ~60 TPS. When I scale up the number of pods (tried increasing worker pods), the TPS remains the same, suggesting I’ve hit some kind of ceiling or bottleneck.

Questions:

  1. With 64 history shards, 2 History service pods, and these worker settings, what could be limiting throughput to 60 TPS?

  2. Could the History service’s persistenceMaxQPS of 3000 be throttling at a lower level?

  3. Is the database connection pool (50 connections for default store) the bottleneck? Should I increase this?

  4. With 2 pods per service, are there any service-level rate limits or configurations I’m missing?

  5. Should I increase history shards beyond 64 for better parallelism across the 2 History pods?

  6. Could the Matching service be the limiting factor with only 2 pods?

If any other details are required, please let me know.

Any insights would be greatly appreciated.

A few metric queries you may refer to:

```
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
```

Also, I see that GetTaskQueueUserData, PollActivityTaskQueue, and PollWorkflowTaskQueue are taking a long time. Is this expected?