Hi Team,
I have the following setup: k8s, Temporal Helm Chart 0.65.0, Bitnami PostgreSQL-HA Helm Chart 16.0.22, and some other containers (activity workers; a sketch of how they connect follows the pod list), each running one replica to make testing easier:
```text
core-79566f69bf-d8smx                  1/1   Running   0   16h
...
postgres-pgpool-9f44b7fb7-8qhb4        1/1   Running   0   12h
postgres-postgresql-0                  1/1   Running   0   12h
postgres-postgresql-1                  1/1   Running   0   12h
postgres-postgresql-2                  1/1   Running   0   12h
temporal-admintools-78b788bcf9-l7s22   1/1   Running   0   44h
temporal-frontend-5d94d697bd-bcjvp     1/1   Running   0   44h
temporal-history-7d87cc6cfd-hhtdw      1/1   Running   0   44h
temporal-matching-64ddbc5cff-2p98d     1/1   Running   0   44h
temporal-web-74dff6cd44-wlmdc          1/1   Running   0   44h
temporal-worker-6f5f8d79cb-h75kj       1/1   Running   0   11h
```
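The activity workers themselves are nothing special; for context, here is a minimal sketch of how each one connects and polls (Go SDK; the task queue and activity names are placeholders, not my real ones):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// MyActivity is a placeholder for the real activity implementation.
func MyActivity(ctx context.Context, input string) (string, error) {
	return "done: " + input, nil
}

func main() {
	// HostPort points at the in-cluster frontend service from the pod list above.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// One worker per pod, polling a single task queue ("my-task-queue" is a placeholder).
	w := worker.New(c, "my-task-queue", worker.Options{})
	w.RegisterActivity(MyActivity)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```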
Sometimes an activity gets stuck in the ‘Pending Activity’ state (or in some other state, seemingly at random).
I’ve read many similar threads, but I haven’t found a clear answer that applies to my case.
In my use case, I need to start (let’s say) 1000 workflows at a time and then wait up to one day for them to complete (currently it doesn’t work even for 5 workflows at a time). I can’t use a short ‘schedule to start’ timeout, because in normal situations some tasks may legitimately wait in the queue for hours, depending on worker load; a sketch of the options I mean follows.
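For illustration, the pattern I mean is something like this (Go sketch; “MyActivity”, the timeouts, and the retry policy are placeholder values, not my actual code). The idea is to bound the running attempt rather than the queue wait:

```go
package app

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func MyWorkflow(ctx workflow.Context, input string) (string, error) {
	ao := workflow.ActivityOptions{
		// No ScheduleToStartTimeout: tasks may legitimately sit in the
		// queue for hours, so only the running attempt is bounded.
		StartToCloseTimeout: 24 * time.Hour,
		// Only meaningful if the activity calls activity.RecordHeartbeat;
		// it lets the server retry on a crashed worker instead of leaving
		// the activity pending until the timeout.
		HeartbeatTimeout: time.Minute,
		RetryPolicy:      &temporal.RetryPolicy{MaximumAttempts: 3},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// "MyActivity" refers to the activity registered by the worker sketch above.
	var result string
	err := workflow.ExecuteActivity(ctx, "MyActivity", input).Get(ctx, &result)
	return result, err
}
```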
One thread suggested that connecting directly to the primary replica avoids the problem, and that may be true (I didn’t observe the problem in that case), but I need to connect through Pgpool anyway.
I have synchronous replication turned on, by the way:
```text
POSTGRESQL_SYNCHRONOUS_REPLICAS_MODE=FIRST
POSTGRESQL_NUM_SYNCHRONOUS_REPLICAS=2
```
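To verify replication looks healthy, I can check the primary from inside the cluster like this (assuming the Bitnami image exposes POSTGRES_PASSWORD in the container environment, which I believe it does; with the settings above I’d expect two standbys with sync_state = ‘sync’):

```sh
kubectl exec -it postgres-postgresql-0 -- bash -c \
  'PGPASSWORD="$POSTGRES_PASSWORD" psql -U postgres \
   -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"'
```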
Example of a hung workflow:
I don’t have metrics integration yet…
Even under higher load on the database side, I would expect some tasks to run more slowly, not to get stuck entirely.
What am I missing? Is it a problem with the Postgres configuration or with the Temporal setup?
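For reference, this is how I inspect one of the stuck runs from the admintools pod (the workflow ID is a placeholder, and I’m assuming the chart already points the CLI at the frontend and the default namespace); the ‘Pending Activities’ section of the output should show the attempt count and last failure for the stuck activity:

```sh
kubectl exec -it deploy/temporal-admintools -- \
  temporal workflow describe --workflow-id <workflow-id>
```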
Temporal setup:
```yaml
## TEMPORAL CHART - from https://github.com/temporalio/helm-charts
temporal:
  enabled: true
  debug: true
  imagePullSecrets:
    - name: *repoUser
  server:
    enabled: true
    image:
      repository: *temporalServerRepository
      tag: *temporalServerVersion
      pullPolicy: *pullPolicy
    replicaCount: *temporalServerReplicas
    metrics:
      annotations:
        enabled: true
      tags: { }
      excludeTags: { }
      prefix:
      serviceMonitor:
        enabled: false
        interval: 30s
      prometheus:
        timerType: histogram
    podLabels: { }
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1024Mi
    affinity: { }
    additionalVolumes: [ ]
    additionalVolumeMounts: [ ]
    additionalEnv: [ ]
    securityContext:
      fsGroup: 1000
      runAsUser: 1000
    config:
      logLevel: "debug,info"
      # IMPORTANT: This value cannot be changed, once it's set.
      numHistoryShards: 512
      persistence:
        defaultStore: default
        additionalStores: { }
        default:
          driver: "sql"
          sql:
            driver: "postgres12"
            host: postgres-pgpool
            port: 5432
            database: temporal
            user: temporal
            existingSecret: *temporalSecret
            maxConns: 20
            maxIdleConns: 20
            maxConnLifetime: "1h"
        visibility:
          driver: "sql"
          sql:
            driver: "postgres12"
            host: postgres-pgpool
            port: 5432
            database: temporal-visibility
            user: temporal
            existingSecret: *temporalSecret
            maxConns: 20
            maxIdleConns: 20
            maxConnLifetime: "1h"
      namespaces:
        create: false
    frontend:
      service:
        # Evaluated as template
        annotations: { }
        type: ClusterIP
        port: 7233
        membershipPort: 6933
        httpPort: 7243
      ingress:
        enabled: false
        annotations: { }
        hosts:
          - "/"
        tls: [ ]
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        # enabled: false
        prometheus: { }
        # timerType: histogram
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        limits:
          cpu: 2000m
          memory: 4096Mi
        requests:
          cpu: 100m
          memory: 128Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    internalFrontend:
      # Enable this to create internal-frontend
      enabled: false
    history:
      service:
        # type: ClusterIP
        port: 7234
        membershipPort: 6934
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        enabled: true
        prometheus: { }
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      additionalEnvSecretName: ""
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    matching:
      service:
        # type: ClusterIP
        port: 7235
        membershipPort: 6935
      metrics:
        annotations:
          enabled: false
        serviceMonitor: { }
        enabled: true
        prometheus: { }
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    worker:
      service:
        # type: ClusterIP
        port: 7239
        membershipPort: 6939
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        # enabled: false
        prometheus: { }
        # timerType: histogram
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
  admintools:
    enabled: true
    image:
      repository: *temporalAdminToolsVersionRepository
      tag: *temporalAdminToolsVersion
      pullPolicy: *pullPolicy
    service:
      type: ClusterIP
      port: 22
      annotations: { }
    tolerations: [ ]
    affinity: { }
    additionalEnv: [ ]
    additionalEnvSecretName: ""
    containerSecurityContext: { }
    securityContext: { }
  web:
    enabled: true
    replicaCount: 1
    image:
      repository: *temporalUiVersionRepository
      tag: *temporalUiVersion
      pullPolicy: *pullPolicy
    service:
      # set type to NodePort if access to web needs access from outside the cluster
      # for more info see https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types
      type: ClusterIP
      # The below clusterIP setting can be set to "None" to make the temporal-web service headless.
      # Note that this requires the web.service.type to be the default ClusterIP value.
      # clusterIP:
      port: 8080
      annotations: { }
      # loadBalancerIP:
    ingress:
      enabled: false
      annotations: { }
      hosts:
        - "/"
      tls: [ ]
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    nodeSelector: { }
    tolerations: [ ]
    affinity: { }
    additionalVolumes: [ ]
    additionalVolumeMounts: [ ]
    # Adjust Web UI config with environment variables:
    # https://docs.temporal.io/references/web-ui-environment-variables
    additionalEnv: [ ]
    additionalEnvSecretName: ""
    containerSecurityContext: { }
    securityContext: { }
    topologySpreadConstraints: [ ]
    podDisruptionBudget: { }
  schema:
    createDatabase:
      enabled: true
    setup:
      enabled: true
      backoffLimit: 100
    update:
      enabled: true
      backoffLimit: 100
    podAnnotations: { }
    podLabels: { }
    resources: { }
    containerSecurityContext: { }
    securityContext: { }
  elasticsearch:
    enabled: false
  prometheus:
    enabled: false
    nodeExporter:
      enabled: false
  grafana:
    enabled: false
  cassandra:
    enabled: false
  mysql:
    enabled: false
```
I’d appreciate it if you could point me in the right direction.