Workflow deletion batch job fails despite reporting successful completion

Hey folks,

To make temporal workflow list and its Web UI equivalent usable again, I reduced the retention period from 30d to 7d and submitted two batch jobs to delete workflows that closed more than 7d ago. Both batch jobs completed successfully, reporting that they had deleted all of those old workflows, but I’m still seeing many workflows that closed before the cutoff date.
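
For reference, the retention change itself was a namespace update along these lines (a sketch rather than my exact invocation; flag and argument forms vary a bit between CLI versions):

# sketch: reduce retention on the default namespace from 30d to 7d (168h)
temporal operator namespace update --retention 168h default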

Before running the batch jobs, I ran temporal workflow count, which reported 594,313 workflows closed before the 7d cutoff date (--query 'CloseTime < "2025-07-22T23:59:59Z"'). As of this writing, the same command reports 283,507 workflows, so only about half appear to have been deleted.
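
For reference, the count command is along these lines:

temporal workflow count --query 'CloseTime < "2025-07-22T23:59:59Z"' --command-timeout 90s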

Batch job #1

I started with a smaller batch to get a sense of how Temporal and its DB would handle a bigger one, so I deleted records that closed before 1st July, which, given the original 30d retention, amounted to roughly one week’s worth of workflows.

This is the command I ran yesterday:

temporal workflow delete --query 'CloseTime < "2025-06-30T23:59:59Z"' --command-timeout 300s --rps 10
Start batch against approximately 30278 workflow(s)? y/N y
Started batch for job ID: 176074b3-8405-43bf-9fe4-a33b793f2a41

And this is its status as of today:

temporal batch describe --job-id 176074b3-8405-43bf-9fe4-a33b793f2a41 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "176074b3-8405-43bf-9fe4-a33b793f2a41",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T09:48:45.509343914Z",
  "closeTime": "2025-07-29T10:38:44.424677935Z",
  "totalOperationCount": "29649",
  "completeOperationCount": "29649",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-55v5f",
  "reason": "<redacted>"
}

Unfortunately, I forgot to check that the workflows were actually deleted with temporal workflow list.
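
In retrospect, a count over the same cutoff would have been enough to verify it:

temporal workflow count --query 'CloseTime < "2025-06-30T23:59:59Z"' --command-timeout 90s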

Batch job #2

I then ran the next batch job to delete all remaining workflows that closed more than 7d ago, with a higher RPS, having seen how Temporal and its DB coped with the previous job:

temporal workflow delete \
  --query 'CloseTime < "2025-07-22T00:00:00Z"' \
  --reason "<redacted>" \
  --rps 25 \
  --command-timeout 300s
Start batch against approximately 537082 workflow(s)? y/N y
Started batch for job ID: 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938

And this is its status as of this writing:

temporal batch describe --job-id 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T12:23:02.197788770Z",
  "closeTime": "2025-07-29T18:40:31.324505932Z",
  "totalOperationCount": "536921",
  "completeOperationCount": "536921",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-mqt29",
  "reason": "<redacted>"
}

Logs

There was a massive spike of error and warning logs (~55M) between yesterday at 3pm UTC and today at 5am UTC (the Datadog screenshot below uses UTC+1):

Zooming out, this is what the past 7 days look like:

There are two groups of error logs:

  • Operation failed with an error (~21M logs). Example:
    {
      "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154
        go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric
        /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217
      go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution
        /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:135
      go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:360
      go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:129
      go.temporal.io/server/service/history/queues.(*executableImpl).Execute
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:374
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:223
      go.temporal.io/server/common/backoff.ThrottleRetry.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62
      go.temporal.io/server/common/backoff.ThrottleRetryContext
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89
      go.temporal.io/server/common/backoff.ThrottleRetry
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211",
      "level": "error",
      "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217",
      "ts": "2025-07-30T04:28:54.046Z",
      "error": "context deadline exceeded"
    }
    
  • Critical error processing task, retrying. (~12.6M logs). Example (see the spot-check sketched after this list):
    {
      "unexpected-error-attempts": 69,
      "queue-task-type": "VisibilityDeleteExecution",
      "shard-id": 441,
      "address": "10.160.48.225:7234",
      "wf-namespace-id": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed",
      "level": "error",
      "wf-id": "autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z",
      "wf-history-event-id": 0,
      "queue-task": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed/autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z/0198049b-415b-7910-8d88-e351db0fc0a9",
      "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68",
      "attempt": 70,
      "wf-run-id": "0198049b-415b-7910-8d88-e351db0fc0a9",
      "component": "visibility-queue-processor",
      "queue-task-key": {
        "FireTime": "1970-01-01T00:00:00Z",
        "TaskID": 817889465
      },
      "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154
      go.temporal.io/server/common/log.(*lazyLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68
      go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr.func1
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:535
      go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:599
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:224
      go.temporal.io/server/common/backoff.ThrottleRetry.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62
      go.temporal.io/server/common/backoff.ThrottleRetryContext
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89
      go.temporal.io/server/common/backoff.ThrottleRetry
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211",
      "operation-result": "OperationCritical",
      "ts": "2025-07-30T03:56:31.010Z",
      "error": "unable to delete custom search attributes: context deadline exceeded"
    }
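
To figure out whether these executions were removed from the history store but left behind in visibility, a spot-check along these lines (using the wf-id and wf-run-id from the second example above) should tell:

temporal workflow describe \
  --workflow-id 'autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z' \
  --run-id 0198049b-415b-7910-8d88-e351db0fc0a9

If this returns a not-found error while the execution still shows up in temporal workflow list, the mutable state was deleted but its visibility record was not.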
    

DB metrics

These are some metrics between yesterday at 00:00 UTC and now (11am UTC):




Environment

  • Temporal server version: 1.27.2.
  • DB: MySQL v8 (Amazon RDS).
  • Helm chart version: 0.63.0. Each of the four deployments gets three pods.
  • Number of shards: 512 (we don’t know why it was set that high, so we’re planning to shrink it in the future).

I’d really appreciate any help to get to the bottom of this issue! Thanks!

After posting the message above, I realised that I had accidentally run temporal workflow delete with --rps 25 when I actually meant to use --rps 15. I suppose that could’ve overwhelmed Temporal and/or its DB.

So, I re-ran the delete subcommand with 15 RPS:

temporal workflow delete \
  --query 'CloseTime < "2025-07-22T00:00:00Z"' \
  --reason "<redacted>" \
  --rps 15 \
  --command-timeout 300s
Start batch against approximately 283507 workflow(s)? y/N y
Started batch for job ID: ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e

But it isn’t actually deleting anything, despite having run for half an hour and reporting 11% completion without any failures:

temporal batch describe --job-id ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e",
  "state": "BATCH_OPERATION_STATE_RUNNING",
  "startTime": "2025-07-30T10:59:50.051215987Z",
  "totalOperationCount": "283507",
  "completeOperationCount": "31000",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-qtg8l",
  "reason": "<redacted>"
}

temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
Total: 283507
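
To keep an eye on the new job, a small loop like this compares what the batch job reports against what visibility still counts (it assumes jq is available in the admintools pod):

while true; do
  temporal batch describe --job-id ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e -o json \
    | jq -r '"\(.completeOperationCount)/\(.totalOperationCount) operations reported complete"'
  temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
  sleep 60
done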

I don’t see anything unusual in the DB metrics, but I do see a spike in warnings on Datadog:

The message is List query exceeded threshold. Example:

{
  "visibility-query": "CloseTime < \"2025-07-22T00:00:00Z\"",
  "level": "warn",
  "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:146",
  "ts": "2025-07-30T11:38:27.315Z",
  "namepsace": "default",
  "duration": 1.968041448
}

There are zero error messages so far.

UPDATE: The job completed but it didn’t actually delete anything:

temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
Total: 283507

I just created an issue on GitHub.

I also realised that batch size has nothing to do with this, as even a tiny batch behaves the same way:

$ temporal workflow delete --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --reason 'Reproducing batch deletion for upcoming GH issue' --rps 5 --command-timeout 90s
Start batch against approximately 56 workflow(s)? y/N y
Started batch for job ID: 497cbd13-f55a-4a7e-b384-68bc93afd5fd

$ temporal batch describe --job-id 497cbd13-f55a-4a7e-b384-68bc93afd5fd -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "497cbd13-f55a-4a7e-b384-68bc93afd5fd",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-31T09:36:57.740641188Z",
  "closeTime": "2025-07-31T09:37:10.744771544Z",
  "totalOperationCount": "56",
  "completeOperationCount": "56",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-qtg8l",
  "reason": "Reproducing batch deletion for upcoming GH issue"
}

$ temporal workflow count --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --command-timeout 90s
Total: 56
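
To pin down a concrete example for the GitHub issue, listing one of those 56 executions and then describing it should show whether it only survives in the visibility store (a sketch; the placeholders are to be filled in from the list output):

$ temporal workflow list --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --limit 1 -o json
$ temporal workflow describe --workflow-id '<workflow id from the list output>' --run-id '<run id from the list output>'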