We are running a self-hosted Temporal cluster on EKS.
The problem we are facing: we run around 100k workflows per day, and each workflow has many activities. Each activity's task queue is served by a different worker pod.
Most of the activities work perfectly, but one hits the ScheduleToStart timeout (5m) almost 50% of the time.
We tried horizontally scaling the worker pods to a large count, but that made no difference.
We tried increasing the CPU of the pods, also with almost no change.
The poll success rate is almost 0%, and no matter how much we increased the pollers on the pod, it had no effect.
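For reference, "increasing the pollers" on our side looked roughly like this (a sketch using the temporalio Python SDK; the activity, task queue name, frontend address, and numbers are placeholders, not our exact values):

```python
# Sketch of the worker-side tuning we tried (temporalio Python SDK).
# Activity body, task queue name, and frontend address are placeholders.
import asyncio

from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def my_activity(payload: str) -> str:
    # Placeholder for the real activity that keeps timing out.
    return payload


async def main() -> None:
    client = await Client.connect("temporal-frontend:7233", namespace="mail")
    worker = Worker(
        client,
        task_queue="problem-activity-queue",
        activities=[my_activity],
        # Raising poller concurrency did not move the poll success rate:
        max_concurrent_activity_task_polls=32,
        max_concurrent_activities=200,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```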
The bottom line:
We run the same setup on Temporal Cloud with no issues at all, which makes me think there has to be some problem with the self-hosted cluster.
These are some metrics for that specific activity:
(left) -
sum by(taskqueue) (rate(poll_success_sync{exported_namespace="mail"}[1m])) / sum by(taskqueue) (rate(poll_success{exported_namespace="mail"}[1m]))
(right) -
histogram_quantile(0.95, sum by(le, task_queue) (rate(temporal_activity_schedule_to_start_latency_bucket[5m])))
(left) -
histogram_quantile(0.95, sum by(operation, le, taskqueue) (rate(asyncmatch_latency_bucket{service_name=~"matching"}[5m])))
(right) -
sum by(taskqueue) (rate(workflow_success{exported_namespace="mail"}[5m]))
(left) -
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
(right) -
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
(left) -
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
(right) -
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
Ignore the spike around 14:20; I was resizing the DB instance at that time.
EKS - 1.29
Temporal - 1.25.2
Database - RDS Postgres 16.3 db.t3.xlarge 4cpu 16ram
Workers - Python SDK
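One thing I keep coming back to is the ratio between numHistoryShards, maxConns, and the DB size. Quick sanity math on the numbers from this post (the pod counts are assumptions for illustration, not our real deployment sizes):

```python
# Back-of-the-envelope check on the persistence settings in this post.
# numHistoryShards, maxConns, and vCPUs come from the config below;
# the per-service pod counts are assumed for illustration.
num_history_shards = 4096   # config.numHistoryShards
max_conns = 20              # persistence.default.sql.maxConns (per service pod)
db_vcpus = 4                # RDS db.t3.xlarge

# Every Temporal service pod (frontend, history, matching, worker) opens its
# own pool of up to max_conns connections against the same Postgres instance.
assumed_pods = {"frontend": 2, "history": 2, "matching": 2, "worker": 1}
total_possible_conns = sum(assumed_pods.values()) * max_conns
print(total_possible_conns)  # 140 connections with these assumed pod counts

# With 4096 shards spread over the history pods, each pod's shards all
# contend for that single 20-connection pool:
shards_per_history_pod = num_history_shards // assumed_pods["history"]
print(shards_per_history_pod)  # 2048
```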
helm values:
server:
  history:
    resources:
      limits:
        memory: 5Gi
      requests:
        cpu: 2000m
        memory: 5Gi
  matching:
    resources:
      limits:
        memory: 1024Mi
      requests:
        cpu: 1000m
        memory: 1024Mi
  metrics:
    serviceMonitor:
      enabled: true
  config:
    numHistoryShards: 4096
    persistence:
      default:
        driver: "sql"
        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXXXXXX
          port: 5432
          database: temporal
          user: postgres
          password: 'XXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false
      visibility:
        driver: "sql"
        sql:
          driver: "postgres12"
          host: XXXXXXXXXXXXXXXX
          port: 5432
          database: temporal_visibility
          user: postgres
          password: 'XXXXXXXXXXXX'
          maxConns: 20
          maxConnLifetime: "1h"
          tls:
            enabled: true
            enableHostVerification: false
    namespaces:
      # Enable this to create namespaces
      create: true
      namespace:
        - name: default
          retention: 7d
        - name: mail
          retention: 1d
cassandra:
  enabled: false
mysql:
  enabled: false
postgresql:
  enabled: true
prometheus:
  enabled: false
grafana:
  enabled: false
elasticsearch:
  enabled: false
schema:
  createDatabase:
    enabled: true
  setup:
    enabled: true
  update:
    enabled: true