Workflow gets stuck in random activities

Hi Team,
I have the following setup: k8s, Temporal Helm Chart 0.65.0, PostgreSQL-HA Bitnami Helm Chart 16.0.22, and some other containers (activity workers), each with one replica to make testing easier:

core-79566f69bf-d8smx                          1/1     Running   0             16h
...
postgres-pgpool-9f44b7fb7-8qhb4                1/1     Running   0             12h
postgres-postgresql-0                          1/1     Running   0             12h
postgres-postgresql-1                          1/1     Running   0             12h
postgres-postgresql-2                          1/1     Running   0             12h
temporal-admintools-78b788bcf9-l7s22           1/1     Running   0             44h
temporal-frontend-5d94d697bd-bcjvp             1/1     Running   0             44h
temporal-history-7d87cc6cfd-hhtdw              1/1     Running   0             44h
temporal-matching-64ddbc5cff-2p98d             1/1     Running   0             44h
temporal-web-74dff6cd44-wlmdc                  1/1     Running   0             44h
temporal-worker-6f5f8d79cb-h75kj               1/1     Running   0             11h

Sometimes an activity gets stuck in the ‘Pending Activity’ state (or some other seemingly random state).
I’ve read many similar threads, but I haven’t found a clear answer that applies to my case.

In my use case, I need to start (let’s say) 1000 workflows at a time and then wait up to 1 day for them to complete (currently it doesn’t work even for 5 workflows at a time). I can’t use a short ‘schedule to start’ timeout, because in normal situations some tasks may wait in the queue for hours (depending on the worker load).
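To illustrate the kind of burst I mean, a minimal sketch with a hypothetical workflow type and task queue (the real names don’t matter here):

for i in $(seq 1 1000); do
  temporal workflow start \
    --workflow-id "load-test-$i" \
    --type MyWorkflow \
    --task-queue my-task-queue
done

All 1000 runs should then sit on the task queue and drain as worker slots free up, however long that takes.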

One thread suggested that a direct connection to the master doesn’t cause any problems… and that might be true (I didn’t observe the problem in that case), but I need to connect through pgpool anyway.
By the way, I have synchronous replication turned on:
POSTGRESQL_SYNCHRONOUS_REPLICAS_MODE=FIRST
POSTGRESQL_NUM_SYNCHRONOUS_REPLICAS=2
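
As far as I understand the Bitnami image, these env vars boil down to the standard Postgres settings, roughly like this (the standby names are illustrative):

# postgresql.conf equivalent (sketch)
synchronous_commit = on
synchronous_standby_names = 'FIRST 2 (standby_1, standby_2)'

i.e. a commit only returns after the first two standbys confirm they have received and flushed the WAL.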

Example of a hung workflow:

I don’t have metrics integration yet…
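
For now, I can only inspect a stuck run by hand from the admintools pod, along these lines (a sketch; the workflow id is a placeholder, and I’m assuming the chart wires the frontend address into the pod, which it does by default):

kubectl exec -it deploy/temporal-admintools -- \
  temporal workflow describe --workflow-id <stuck-workflow-id>

The Pending Activities section of the output shows the attempt count, last failure and last heartbeat, which at least tells me whether a worker ever picked the task up.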

Even under higher load on the database side, I would expect some tasks to run slower, not to get stuck entirely.
What am I missing?
Is it a problem with the Postgres configuration or with the Temporal setup?

Temporal setup:

## TEMPORAL CHART - from https://github.com/temporalio/helm-charts
temporal:
  enabled: true
  debug: true
  imagePullSecrets:
    - name: *repoUser
  server:
    enabled: true
    image:
      repository: *temporalServerRepository
      tag: *temporalServerVersion
      pullPolicy: *pullPolicy
    replicaCount: *temporalServerReplicas
    metrics:
      annotations:
        enabled: true
      tags: { }
      excludeTags: { }
      prefix:
      serviceMonitor:
        enabled: false
        interval: 30s
      prometheus:
        timerType: histogram
    podLabels: { }
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1024Mi
    affinity: { }
    additionalVolumes: [ ]
    additionalVolumeMounts: [ ]
    additionalEnv: [ ]
    securityContext:
      fsGroup: 1000
      runAsUser: 1000
    config:
      logLevel: "debug,info"
      # IMPORTANT: This value cannot be changed, once it's set.
      numHistoryShards: 512
      persistence:
        defaultStore: default
        additionalStores: { }
        default:
          driver: "sql"
          sql:
            driver: "postgres12"
            host: postgres-pgpool
            port: 5432
            database: temporal
            user: temporal
            existingSecret: *temporalSecret
            maxConns: 20
            maxIdleConns: 20
            maxConnLifetime: "1h"
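            # NOTE: each Temporal service (frontend/history/matching/worker) keeps
            # its own pool per store, so pgpool must accept roughly
            # services x stores x maxConns connections; if that exceeds
            # num_init_children in pgpool.conf, clients will queue for a slot.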
        visibility:
          driver: "sql"
          sql:
            driver: "postgres12"
            host: postgres-pgpool
            port: 5432
            database: temporal-visibility
            user: temporal
            existingSecret: *temporalSecret
            maxConns: 20
            maxIdleConns: 20
            maxConnLifetime: "1h"
      namespaces:
        create: false
    frontend:
      service:
        # Evaluated as template
        annotations: { }
        type: ClusterIP
        port: 7233
        membershipPort: 6933
        httpPort: 7243
      ingress:
        enabled: false
        annotations: { }
        hosts:
          - "/"
        tls: [ ]
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        # enabled: false
        prometheus: { }
        # timerType: histogram
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        limits:
          cpu: 2000m
          memory: 4096Mi
        requests:
          cpu: 100m
          memory: 128Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    internalFrontend:
      # Enable this to create internal-frontend
      enabled: false
    history:
      service:
        # type: ClusterIP
        port: 7234
        membershipPort: 6934
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        enabled: true
        prometheus: { }
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      additionalEnvSecretName: ""
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    matching:
      service:
        # type: ClusterIP
        port: 7235
        membershipPort: 6935
      metrics:
        annotations:
          enabled: false
        serviceMonitor: { }
        enabled: true
        prometheus: { }
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
    worker:
      service:
        # type: ClusterIP
        port: 7239
        membershipPort: 6939
      metrics:
        annotations:
          enabled: true
        serviceMonitor: { }
        # enabled: false
        prometheus: { }
        # timerType: histogram
      deploymentLabels: { }
      deploymentAnnotations: { }
      podAnnotations: { }
      podLabels: { }
      replicaCount: 1 # FIXME
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 2000m
          memory: 2048Mi
      nodeSelector: { }
      tolerations: [ ]
      affinity: { }
      additionalEnv: [ ]
      containerSecurityContext: { }
      topologySpreadConstraints: [ ]
      podDisruptionBudget: { }
  admintools:
    enabled: true
    image:
      repository: *temporalAdminToolsVersionRepository
      tag: *temporalAdminToolsVersion
      pullPolicy: *pullPolicy
    service:
      type: ClusterIP
      port: 22
      annotations: { }
    tolerations: [ ]
    affinity: { }
    additionalEnv: [ ]
    additionalEnvSecretName: ""
    containerSecurityContext: { }
    securityContext: { }
  web:
    enabled: true
    replicaCount: 1
    image:
      repository: *temporalUiVersionRepository
      tag: *temporalUiVersion
      pullPolicy: *pullPolicy
    service:
      # set type to NodePort if access to web needs access from outside the cluster
      # for more info see https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types
      type: ClusterIP
      # The below clusterIP setting can be set to "None" to make the temporal-web service headless.
      # Note that this requires the web.service.type to be the default ClusterIP value.
      # clusterIP:
      port: 8080
      annotations: { }
      # loadBalancerIP:
    ingress:
      enabled: false
      annotations: { }
      hosts:
        - "/"
      tls: [ ]
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    nodeSelector: { }
    tolerations: [ ]
    affinity: { }
    additionalVolumes: [ ]
    additionalVolumeMounts: [ ]
    # Adjust Web UI config with environment variables:
    # https://docs.temporal.io/references/web-ui-environment-variables
    additionalEnv: [ ]
    additionalEnvSecretName: ""
    containerSecurityContext: { }
    securityContext: { }
    topologySpreadConstraints: [ ]
    podDisruptionBudget: { }
  schema:
    createDatabase:
      enabled: true
    setup:
      enabled: true
      backoffLimit: 100
    update:
      enabled: true
      backoffLimit: 100
    podAnnotations: { }
    podLabels: { }
    resources: { }
    containerSecurityContext: { }
    securityContext: { }
  elasticsearch:
    enabled: false
  prometheus:
    enabled: false
    nodeExporter:
      enabled: false
  grafana:
    enabled: false
  cassandra:
    enabled: false
  mysql:
    enabled: false

I’d appreciate it if you could point me in the right direction.

A few more examples (sorry, I can’t paste everything in one comment):

And what it is supposed to look like (and in most cases does):

Hm, it seems I may have accidentally solved my problem :smiley:

Changing

disableLoadBalancingOnWrite: transaction

to

disableLoadBalancingOnWrite: always

means I’m no longer experiencing any of the described problems… :thinking: :man_shrugging:

Temporal requires full database consistency. It seems that with your setup, setting disableLoadBalancingOnWrite to always helped in that regard.
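
For reference, that chart value presumably maps to pgpool-II’s disable_load_balance_on_write parameter, so the underlying configuration is roughly:

# pgpool.conf (sketch, assuming the Bitnami value maps through 1:1)
load_balance_mode = on
disable_load_balance_on_write = 'always'

With 'always', once a session has performed a write, all of its later reads stay on the primary instead of being load-balanced to a standby. That matters because synchronous_commit = on only waits for standbys to flush the WAL, not to apply it, so a read routed to a replica can still briefly see stale data. The blunter alternative is load_balance_mode = off, which sends every query to the primary.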
