Many activity ScheduleToStart timeouts

We are running a Temporal cluster on EKS:

The problem we are facing: we have around 100k workflows per day, and each workflow has many activities. Each activity's task queue is served by a different worker pod.
Most of the activities work perfectly, but one specific activity hits its ScheduleToStart timeout (5m) almost 50% of the time.
We have tried horizontally scaling the worker pods to a large number, but that made no difference.
We also tried increasing the pods' CPU, again with almost no change.
The poll sync success rate is almost 0%, and no matter how many pollers we add to the pod it has no effect. The worker setup is sketched below.
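
For reference, the workers are configured roughly like this (a minimal sketch using the standard temporalio Worker options; the activity, task queue name, endpoint, and numbers are illustrative, not our real values):

import asyncio
from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


# Stand-in for the real activity; name and body are illustrative only.
@activity.defn
async def send_mail(recipient: str) -> None:
    ...


async def main() -> None:
    # Illustrative frontend address and the "mail" namespace from the metrics above.
    client = await Client.connect("temporal-frontend:7233", namespace="mail")

    worker = Worker(
        client,
        task_queue="send-mail",  # illustrative task queue name
        activities=[send_mail],
        # The knobs we have been raising, with no visible effect:
        max_concurrent_activity_task_polls=10,  # concurrent long-poll requests per pod
        max_concurrent_activities=200,          # concurrent activity executions per pod
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())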

The bottom line:
We run the same setup on Temporal Cloud with no issues at all, which makes me think there has to be some issue with the self-hosted cluster.

These are some metrics for that specific activity:


(left) - sum by(taskqueue) (rate(poll_success_sync{exported_namespace="mail"}[1m])) / sum by(taskqueue) (rate(poll_success{exported_namespace="mail"}[1m]))
(right) - histogram_quantile(0.95, sum by(le, task_queue) (rate(temporal_activity_schedule_to_start_latency_bucket[5m])))


(left) - histogram_quantile(0.95, sum by(operation, le, taskqueue) (rate(asyncmatch_latency_bucket{service_name=~"matching"}[5m])))
(right) - sum by(taskqueue) (rate(workflow_success{exported_namespace="mail"}[5m]))


(left) - histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
(right) - histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))


(left) - sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
(right) - histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Ignore the spike around 14:20; I was upgrading the DB instance type at that time.

EKS - 1.29
Temporal - 1.25.2
Database - RDS Postgres 16.3, db.t3.xlarge (4 vCPU, 16 GB RAM)
Workers - Python SDK

helm values:

server:
  history:
    resources:
      limits:
        memory: 5Gi
      requests:
        cpu: 2000m
        memory: 5Gi
  matching:
    resources:
      limits:
        memory: 1024Mi
      requests:
        cpu: 1000m
        memory: 1024Mi
  metrics:
    serviceMonitor:
      enabled: true
  config:
    numHistoryShards: 4096
    persistence:
      default:
        driver: "sql"

        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXXXXXX
          port: 5432
          database: temporal
          user: postgres
          password: 'XXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false

      visibility:
        driver: "sql"

        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXX
          port: 5432
          database: temporal_visibility
          user: postgres
          password: 'XXXXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false
    namespaces:
      # Enable this to create namespaces
      create: true
      namespace:
        - name: default
          retention: 7d
        - name: mail
          retention: 1d

cassandra:
  enabled: false

mysql:
  enabled: false

postgresql:
  enabled: true

prometheus:
  enabled: false

grafana:
  enabled: false

elasticsearch:
  enabled: false

schema:
  createDatabase:
    enabled: true
  setup:
    enabled: true
  update:
    enabled: true