Workflow Activity gets stuck intermittently

Hi Team,

For context, we have long-running activities in Temporal, and we self-host the Temporal cluster.

We’re observing two cases where workflows get stuck intermittently and stop progressing indefinitely:

  1. The workflow gets stuck right after starting: the workflow execution has started and a workflow task has been scheduled, but it stays in this state indefinitely.

In the metrics dashboard, we observe that in this case the Matching Service doesn't receive the AddWorkflowTask from the History Service, which we believe it should.

  2. The activities are stuck in PENDING_ACTIVITY_STATE_SCHEDULED.

In the metrics dashboard, we observe that in this case the Matching Service receives the AddWorkflowTask but doesn't receive the AddActivityTask from the History Service. For successful workflows, we observed that both events are received as expected. We believe this isn't related to any specific activity code, as we see it happening across activities (both long- and short-running).
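For reference, we confirm the stuck pending-activity state by describing the workflow. A minimal sketch, assuming the current temporal CLI and a placeholder workflow ID (tctl workflow describe gives equivalent output on older tooling):

temporal workflow describe --workflow-id <stuck-workflow-id>
# the Pending Activities section shows the activity still in the Scheduled state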

Note:

  • We checked the resource utilization of all components in the Temporal service; none of them exceeds 15%.

Temporal Service Version - v1.24.3


Did you check your server logs for any error messages related to AddActivityTask, and look at your server metrics:

sum(rate(service_errors{}[1m]) or on () vector(0))

Also persistence errors:

sum(rate(persistence_error_with_type[1m])) by (operation, service_name)

On running sum(rate(persistence_error_with_type[1m])) by (operation, service_name), we do see errors; specifically, they are related to the GetTaskQueueUserData and GetTaskQueue operations.

No results for sum(rate(service_errors{}[1m]) or on () vector(0)), and no error logs on the services.
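To drill into those further, the same metric can be narrowed to the failing operations (a sketch; the exact label names, e.g. error_type, may differ by server version):

sum(rate(persistence_error_with_type{operation=~"GetTaskQueue|GetTaskQueueUserData"}[1m])) by (error_type, service_name)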

@tihomir

I am working with @inishchith on this; adding more context that might be useful (this applies to workflows where the activity tasks are stuck but the workflow task has completed, i.e. the second case in the original message):

  1. In the Temporal Server metrics, for all failed workflows we don't see any “AddActivityTask” events (however, we do see “AddWorkflowTask” and “RespondWorkflowTaskCompleted”).
  2. In the History service dashboard, we don't see any “TransferActiveTaskActivity” or “TimerActiveTaskActivityTimeout” events.

Failed workflow-

Successful workflow-

@maxim @tihomir is there any direction you could point us in?

On further investigation with the following setup (1 read-write master instance and 2 read-only replicas), we found that this occurs quite frequently when Temporal is configured to connect through PgPool. It has yet to occur when Temporal connects directly to postgres-master.

This suggests a requirement for strong consistency, i.e. minimal to no replication lag on reads. But we would love to read further on this.

docker-compose

services:
  postgres-master:
    image: postgres:14.3
    container_name: postgres-master
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: temporal
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - ./pg_hba.conf:/pg_hba.conf
      - postgres_master_data:/var/lib/postgresql/data
    command: >
      postgres -c listen_addresses='*' -c wal_level=replica -c max_wal_senders=5 -c hot_standby=on -c wal_log_hints=on -c hba_file=/pg_hba.conf -c max_connections=100
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  postgres-replica1:
    image: postgres:14.3
    container_name: postgres-replica1
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      PGDATA: /var/lib/postgresql/data/pgdata
    depends_on:
      - postgres-master
    volumes:
      - postgres_replica1_data:/var/lib/postgresql/data
    command: >
      bash -c "until pg_isready -h postgres-master -p 5432 -U postgres; do echo 'Waiting for master...'; sleep 2; done && 
      rm -rf /var/lib/postgresql/data/pgdata/* && 
      mkdir -p /var/lib/postgresql/data/pgdata && 
      chmod 0700 /var/lib/postgresql/data/pgdata && 
      PGPASSWORD=postgres pg_basebackup -h postgres-master -U postgres -D /var/lib/postgresql/data/pgdata -Fp -Xs -P -R && 
      postgres"

  postgres-replica2:
    image: postgres:14.3
    container_name: postgres-replica2
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      PGDATA: /var/lib/postgresql/data/pgdata
    depends_on:
      - postgres-master
    volumes:
      - postgres_replica2_data:/var/lib/postgresql/data
    command: >
      bash -c "until pg_isready -h postgres-master -p 5432 -U postgres; do echo 'Waiting for master...'; sleep 2; done && 
      rm -rf /var/lib/postgresql/data/pgdata/* && 
      mkdir -p /var/lib/postgresql/data/pgdata && 
      chmod 0700 /var/lib/postgresql/data/pgdata && 
      PGPASSWORD=postgres pg_basebackup -h postgres-master -U postgres -D /var/lib/postgresql/data/pgdata -Fp -Xs -P -R && 
      postgres"

  pgpool:
    image: bitnami/pgpool:latest
    container_name: pgpool
    networks:
      - postgres-network
    environment:
      PGPOOL_BACKEND_NODES: "0:postgres-master:5432,1:postgres-replica1:5432,2:postgres-replica2:5432"
      PGPOOL_POSTGRES_USERNAME: postgres
      PGPOOL_POSTGRES_PASSWORD: postgres
      PGPOOL_ADMIN_USERNAME: admin
      PGPOOL_ADMIN_PASSWORD: adminpassword
      PGPOOL_SR_CHECK_USER: postgres
      PGPOOL_SR_CHECK_PASSWORD: postgres
      PGPOOL_HEALTH_CHECK_USER: postgres
      PGPOOL_HEALTH_CHECK_PASSWORD: postgres
      PGPOOL_ENABLE_LOAD_BALANCING: "yes"
      PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING: "yes"
      PGPOOL_SR_CHECK_PERIOD: "10"
      PGPOOL_SR_CHECK_DATABASE: "postgres"
      PGPOOL_POSTGRES_HOST: "postgres-master"
      PGPOOL_POSTGRES_PORT: "5432"
      PGPOOL_BACKEND_APPLICATION_NAME: "pgpool"
      PGPOOL_HEALTH_CHECK_PERIOD: "10"
      PGPOOL_HEALTH_CHECK_TIMEOUT: "5"
      PGPOOL_HEALTH_CHECK_MAX_RETRIES: "3"
      PGPOOL_HEALTH_CHECK_RETRY_DELAY: "1"
      PGPOOL_NUM_INIT_CHILDREN: "32"
      PGPOOL_MAX_POOL: "4"
      PGPOOL_AUTH_METHOD: "md5"
      PGPOOL_BACKEND_DATA_DIRECTORY0: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_DATA_DIRECTORY1: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_DATA_DIRECTORY2: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_FLAG0: "ALLOW_TO_FAILOVER"
      PGPOOL_BACKEND_FLAG1: "ALLOW_TO_FAILOVER"
      PGPOOL_BACKEND_FLAG2: "ALLOW_TO_FAILOVER"
      PGPOOL_FAILOVER_ON_BACKEND_ERROR: "off"  # Prevent automatic failover during initial setup
      PGPOOL_FAIL_OVER_ON_BACKEND_ERROR: "off"
    ports:
      - "5433:5432"
    depends_on:
      postgres-master:
        condition: service_healthy
      postgres-replica1:
        condition: service_started
      postgres-replica2:
        condition: service_started
    healthcheck:
      test: ["CMD", "/opt/bitnami/scripts/pgpool/healthcheck.sh"]
      interval: 10s
      timeout: 5s
      retries: 5

  temporal:
    image: temporalio/auto-setup:1.24.3
    container_name: temporal
    networks:
      - postgres-network
    depends_on:
      pgpool:
        condition: service_healthy
    environment:
      DB: postgres12
      POSTGRES_USER: postgres
      POSTGRES_PWD: postgres
      POSTGRES_SEEDS: pgpool
      DB_PORT: "5432"
      # POSTGRES_DB: temporal
      POSTGRES_ENABLE_SSL: "false"
      TEMPORAL_ADDRESS: temporal:7233
    ports:
      - "7233:7233"
      # - "8233:8233"
  
  
  temporal-ui:
    container_name: temporal-ui
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TEMPORAL_CORS_ORIGINS=http://localhost:3000
    image: temporalio/ui:latest
    networks:
      - postgres-network
    ports:
      - 8233:8080


volumes:
  postgres_master_data:
  postgres_replica1_data:
  postgres_replica2_data:

networks:
  postgres-network:
    driver: bridge
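If stale reads from the replicas are the problem, one way to test that hypothesis is to disable PgPool's load balancing so every statement is routed to the primary. A sketch of the change against the pgpool service above (the same env vars, flipped to "no"):

  pgpool:
    environment:
      PGPOOL_ENABLE_LOAD_BALANCING: "no"
      PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING: "no"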

Temporal only works with strongly consistent databases.
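PgPool with statement load balancing can route reads to asynchronously replicated standbys, so Temporal may read stale state. Pointing the server directly at the primary (as you already observed) avoids that; a minimal sketch against the compose file above:

  temporal:
    environment:
      POSTGRES_SEEDS: postgres-master  # connect to the primary directly rather than through pgpool
      DB_PORT: "5432"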
