Workflow Activity gets stuck intermittently

inishchith · March 25, 2025, 10:11am

Hi Team,

For context, we have long-running activities in Temporal, and we self-host the Temporal cluster.

We’re observing two cases in which workflows intermittently get stuck and never progress:

Case 1: The workflow is stuck. The workflow execution has started and a workflow task has been scheduled, but the workflow stays in this state indefinitely.

[Screenshot 2025-03-25 at 3.32.07 PM]

[Screenshot 2025-03-25 at 3.34.45 PM]

In the metrics dashboard, we observe that the Matching Service never receives the AddWorkflowTask call from the History Service in this case, although I believe it should.

Case 2: The activities are stuck in PENDING_ACTIVITY_STATE_SCHEDULED.

[Screenshot 2025-03-25 at 3.33.30 PM]

In the metrics dashboard, we observe that the Matching Service receives the AddWorkflowTask call but does not receive the AddActivityTask call from the History Service in this case. For a successful workflow, both calls are received. We believe this isn’t related to any specific activity code, as we see it happening across activities (both long- and short-running).

Note:

We checked the resource utilization of all components in the Temporal service; none of them exceeds 15%.

Temporal Service Version - v1.24.3

tihomir · March 25, 2025, 2:25pm

Did you check your server logs for any error messages related to AddActivityTask? Also look at your server metrics:

sum(rate(service_errors{}[1m]) or on () vector(0))

Also persistence errors:

sum(rate(persistence_error_with_type[1m])) by (operation, service_name)

inishchith · March 25, 2025, 2:57pm

On running sum(rate(persistence_error_with_type[1m])) by (operation, service_name), I see the following; specifically, the errors are related to GetTaskQueueUserData and GetTaskQueue.

No results for sum(rate(service_errors{}[1m]) or on () vector(0)), and no error logs on the services.

Sanil_Khurana1 · March 26, 2025, 4:53am

@tihomir

I am working with @inishchith on this and adding more context that might be useful. This applies to workflows where the activity tasks are stuck but the workflow task has completed, i.e. the second case in the original message:

In the Temporal Server metrics, none of the failed workflows shows an “AddActivityTask” event (although we do see “AddWorkflowTask” and “RespondWorkflowTaskCompleted”).
In the History Service dashboard, we don’t see any “TransferActiveTaskActivity” or “TimerActiveTaskActivityTimeout” events.

Failed workflow:

Sanil_Khurana1 · March 26, 2025, 5:00am

Successful workflow:

inishchith · March 26, 2025, 5:30pm

@maxim @tihomir is there any direction you could point us in?

inishchith · March 29, 2025, 9:36am

On further investigation with the following setup (one master/read-write instance and two read-only replicas), we found that this occurs quite frequently when Temporal is set up to connect through PgPool. It has yet to occur when Temporal connects directly to postgres-master.

This suggests a requirement for strong consistency, with minimal to no read latency, but we would love to read further on this.

docker-compose

services:
  postgres-master:
    image: postgres:14.3
    container_name: postgres-master
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: temporal
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - ./pg_hba.conf:/pg_hba.conf
      - postgres_master_data:/var/lib/postgresql/data
    command: >
      postgres -c listen_addresses='*' -c wal_level=replica -c max_wal_senders=5 -c hot_standby=on -c wal_log_hints=on -c hba_file=/pg_hba.conf -c max_connections=100
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  postgres-replica1:
    image: postgres:14.3
    container_name: postgres-replica1
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      PGDATA: /var/lib/postgresql/data/pgdata
    depends_on:
      - postgres-master
    volumes:
      - postgres_replica1_data:/var/lib/postgresql/data
    command: >
      bash -c "until pg_isready -h postgres-master -p 5432 -U postgres; do echo 'Waiting for master...'; sleep 2; done && 
      rm -rf /var/lib/postgresql/data/pgdata/* && 
      mkdir -p /var/lib/postgresql/data/pgdata && 
      chmod 0700 /var/lib/postgresql/data/pgdata && 
      PGPASSWORD=postgres pg_basebackup -h postgres-master -U postgres -D /var/lib/postgresql/data/pgdata -Fp -Xs -P -R && 
      postgres"

  postgres-replica2:
    image: postgres:14.3
    container_name: postgres-replica2
    user: postgres
    networks:
      - postgres-network
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      PGDATA: /var/lib/postgresql/data/pgdata
    depends_on:
      - postgres-master
    volumes:
      - postgres_replica2_data:/var/lib/postgresql/data
    command: >
      bash -c "until pg_isready -h postgres-master -p 5432 -U postgres; do echo 'Waiting for master...'; sleep 2; done && 
      rm -rf /var/lib/postgresql/data/pgdata/* && 
      mkdir -p /var/lib/postgresql/data/pgdata && 
      chmod 0700 /var/lib/postgresql/data/pgdata && 
      PGPASSWORD=postgres pg_basebackup -h postgres-master -U postgres -D /var/lib/postgresql/data/pgdata -Fp -Xs -P -R && 
      postgres"

  pgpool:
    image: bitnami/pgpool:latest
    container_name: pgpool
    networks:
      - postgres-network
    environment:
      PGPOOL_BACKEND_NODES: "0:postgres-master:5432,1:postgres-replica1:5432,2:postgres-replica2:5432"
      PGPOOL_POSTGRES_USERNAME: postgres
      PGPOOL_POSTGRES_PASSWORD: postgres
      PGPOOL_ADMIN_USERNAME: admin
      PGPOOL_ADMIN_PASSWORD: adminpassword
      PGPOOL_SR_CHECK_USER: postgres
      PGPOOL_SR_CHECK_PASSWORD: postgres
      PGPOOL_HEALTH_CHECK_USER: postgres
      PGPOOL_HEALTH_CHECK_PASSWORD: postgres
      PGPOOL_ENABLE_LOAD_BALANCING: "yes"
      PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING: "yes"
      PGPOOL_SR_CHECK_PERIOD: "10"
      PGPOOL_SR_CHECK_DATABASE: "postgres"
      PGPOOL_POSTGRES_HOST: "postgres-master"
      PGPOOL_POSTGRES_PORT: "5432"
      PGPOOL_BACKEND_APPLICATION_NAME: "pgpool"
      PGPOOL_HEALTH_CHECK_PERIOD: "10"
      PGPOOL_HEALTH_CHECK_TIMEOUT: "5"
      PGPOOL_HEALTH_CHECK_MAX_RETRIES: "3"
      PGPOOL_HEALTH_CHECK_RETRY_DELAY: "1"
      PGPOOL_NUM_INIT_CHILDREN: "32"
      PGPOOL_MAX_POOL: "4"
      PGPOOL_AUTH_METHOD: "md5"
      PGPOOL_BACKEND_DATA_DIRECTORY0: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_DATA_DIRECTORY1: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_DATA_DIRECTORY2: "/var/lib/postgresql/data"
      PGPOOL_BACKEND_FLAG0: "ALLOW_TO_FAILOVER"
      PGPOOL_BACKEND_FLAG1: "ALLOW_TO_FAILOVER"
      PGPOOL_BACKEND_FLAG2: "ALLOW_TO_FAILOVER"
      PGPOOL_FAILOVER_ON_BACKEND_ERROR: "off"  # Prevent automatic failover during initial setup
      PGPOOL_FAIL_OVER_ON_BACKEND_ERROR: "off"
    ports:
      - "5433:5432"
    depends_on:
      postgres-master:
        condition: service_healthy
      postgres-replica1:
        condition: service_started
      postgres-replica2:
        condition: service_started
    healthcheck:
      test: ["CMD", "/opt/bitnami/scripts/pgpool/healthcheck.sh"]
      interval: 10s
      timeout: 5s
      retries: 5

  temporal:
    image: temporalio/auto-setup:1.24.3
    container_name: temporal
    networks:
      - postgres-network
    depends_on:
      pgpool:
        condition: service_healthy
    environment:
      DB: postgres12
      POSTGRES_USER: postgres
      POSTGRES_PWD: postgres
      POSTGRES_SEEDS: pgpool
      DB_PORT: "5432"
      # POSTGRES_DB: temporal
      POSTGRES_ENABLE_SSL: "false"
      TEMPORAL_ADDRESS: temporal:7233
    ports:
      - "7233:7233"
      # - "8233:8233"
  
  
  temporal-ui:
    container_name: temporal-ui
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TEMPORAL_CORS_ORIGINS=http://localhost:3000
    image: temporalio/ui:latest
    networks:
      - postgres-network
    ports:
      - 8233:8080


volumes:
  postgres_master_data:
  postgres_replica1_data:
  postgres_replica2_data:

networks:
  postgres-network:
    driver: bridge
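
For comparison, the direct-connection variant mentioned above (the one in which the problem has yet to occur) could be sketched as follows. This is only an illustration built from the values already in this compose file: it changes POSTGRES_SEEDS and depends_on for the temporal service and assumes everything else stays as posted.

```yaml
# Sketch only: the temporal service from the compose file above, pointed
# straight at the primary instead of at pgpool. Services not shown here are
# assumed to stay exactly as posted.
services:
  temporal:
    image: temporalio/auto-setup:1.24.3
    container_name: temporal
    networks:
      - postgres-network
    depends_on:
      postgres-master:
        condition: service_healthy      # reuses the pg_isready healthcheck on postgres-master
    environment:
      DB: postgres12
      POSTGRES_USER: postgres
      POSTGRES_PWD: postgres
      POSTGRES_SEEDS: postgres-master   # was: pgpool
      DB_PORT: "5432"
      POSTGRES_ENABLE_SSL: "false"
      TEMPORAL_ADDRESS: temporal:7233
    ports:
      - "7233:7233"
```

With this variant every read and write from Temporal goes to the single read-write instance, so no query can be answered by a replica that has not yet replayed a recent write.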

maxim · March 29, 2025, 4:49pm

Temporal only works with strongly consistent databases.
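
Applied to the compose file earlier in the thread, one reading of this is that Temporal's reads must not be served by streaming replicas, which can lag behind the primary. A minimal sketch, assuming the bitnami/pgpool image honours the same two load-balancing variables that the posted file already sets, would switch read load balancing off so that pgpool forwards statements to the primary only. Whether this alone is sufficient has not been verified in this thread; the only configuration reported here as not reproducing the problem is the direct connection to postgres-master.

```yaml
# Sketch only: override for the pgpool service from the compose file in this
# thread. Assumption: with load balancing disabled, pgpool stops routing
# SELECTs to postgres-replica1 / postgres-replica2.
services:
  pgpool:
    environment:
      PGPOOL_ENABLE_LOAD_BALANCING: "no"            # was "yes" in the posted file
      PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING: "no"  # was "yes" in the posted file
```

The replicas would then serve only as standbys for failover, not as a read path for Temporal.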
