Many activity ScheduleToStart timeouts

We are running a Temporal cluster on EKS:

The problem we are facing: we have around 100k workflows per day, and each workflow has many activities. Each activity's task queue is served by a different worker pod.
Most of the activities work perfectly, but one specific activity hits its ScheduleToStart timeout (5m) almost 50% of the time.
We have tried horizontally scaling the worker pods to a large number, but that made no difference.
We also tried increasing the pods' CPU, again with almost no change.
The poll sync success rate is almost 0%, and no matter how many pollers we add to the pod it has no effect. The worker setup is sketched below.
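
For reference, the workers are configured roughly like this (a minimal sketch using the standard temporalio Worker options; the activity, task queue name, endpoint, and numbers are illustrative, not our real values):

import asyncio
from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


# Stand-in for the real activity; name and body are illustrative only.
@activity.defn
async def send_mail(recipient: str) -> None:
    ...


async def main() -> None:
    # Illustrative frontend address and the "mail" namespace from the metrics above.
    client = await Client.connect("temporal-frontend:7233", namespace="mail")

    worker = Worker(
        client,
        task_queue="send-mail",  # illustrative task queue name
        activities=[send_mail],
        # The knobs we have been raising, with no visible effect:
        max_concurrent_activity_task_polls=10,  # concurrent long-poll requests per pod
        max_concurrent_activities=200,          # concurrent activity executions per pod
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())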

The bottom line:
We run the same setup on Temporal Cloud with no issues at all, which makes me think there has to be some issue with the self-hosted cluster.

These are some metrics for that specific activity:


(left) - sum by(taskqueue) (rate(poll_success_sync{exported_namespace="mail"}[1m])) / sum by(taskqueue) (rate(poll_success{exported_namespace="mail"}[1m]))
(right) - histogram_quantile(0.95, sum by(le, task_queue) (rate(temporal_activity_schedule_to_start_latency_bucket[5m])))


(left) - histogram_quantile(0.95, sum by(operation, le, taskqueue) (rate(asyncmatch_latency_bucket{service_name=~"matching"}[5m])))
(right) - sum by(taskqueue) (rate(workflow_success{exported_namespace="mail"}[5m]))


(left) - histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
(right) - histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))


(left) - sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
(right) - histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Ignore the spike around 14:20; I was upgrading the DB instance type at that time.

EKS - 1.29
Temporal - 1.25.2
Database - RDS Postgres 16.3, db.t3.xlarge (4 vCPU, 16 GB RAM)
Workers - Python SDK

helm values:

server:
  history:
    resources:
      limits:
        memory: 5Gi
      requests:
        cpu: 2000m
        memory: 5Gi
  matching:
    resources:
      limits:
        memory: 1024Mi
      requests:
        cpu: 1000m
        memory: 1024Mi
  metrics:
    serviceMonitor:
      enabled: true
  config:
    numHistoryShards: 4096
    persistence:
      default:
        driver: "sql"

        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXXXXXX
          port: 5432
          database: temporal
          user: postgres
          password: 'XXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false

      visibility:
        driver: "sql"

        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXX
          port: 5432
          database: temporal_visibility
          user: postgres
          password: 'XXXXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false
    namespaces:
      # Enable this to create namespaces
      create: true
      namespace:
        - name: default
          retention: 7d
        - name: mail
          retention: 1d

cassandra:
  enabled: false

mysql:
  enabled: false

postgresql:
  enabled: true

prometheus:
  enabled: false

grafana:
  enabled: false

elasticsearch:
  enabled: false

schema:
  createDatabase:
    enabled: true
  setup:
    enabled: true
  update:
    enabled: true