Workflow deletion batch job fails despite reporting successful completion

Hey folks,

To make temporal workflow list and its Web UI equivalent usable again, I reduced the retention period from 30d to 7d and submitted two batch jobs to delete workflows that closed more than 7d ago. Both batch jobs completed successfully, reporting that they had deleted all of those old workflows, but I’m still seeing many workflows that closed before the cutoff date.
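
For reference, the retention change itself was a namespace update along these lines (a sketch rather than my exact invocation; flag and argument forms vary a bit between CLI versions):

# sketch: reduce retention on the default namespace from 30d to 7d (168h)
temporal operator namespace update --retention 168h default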

Before running the batch jobs, I ran temporal workflow count, which reported 594,313 workflows closed before the 7d cutoff date (--query 'CloseTime < "2025-07-22T23:59:59Z"'). As of this writing, the same command reports 283,507 workflows, so only about half appear to have been deleted.
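
For reference, the count command is along these lines:

temporal workflow count --query 'CloseTime < "2025-07-22T23:59:59Z"' --command-timeout 90s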

Batch job #1

I started with a smaller batch to get a sense of how Temporal and its DB would handle a bigger one, so I deleted records that closed before 1st July, which, given the original 30d retention, amounted to roughly one week’s worth of workflows.

This is the command I ran yesterday:

temporal workflow delete --query 'CloseTime < "2025-06-30T23:59:59Z"' --command-timeout 300s --rps 10
Start batch against approximately 30278 workflow(s)? y/N y
Started batch for job ID: 176074b3-8405-43bf-9fe4-a33b793f2a41

And this is its status as of today:

temporal batch describe --job-id 176074b3-8405-43bf-9fe4-a33b793f2a41 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "176074b3-8405-43bf-9fe4-a33b793f2a41",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T09:48:45.509343914Z",
  "closeTime": "2025-07-29T10:38:44.424677935Z",
  "totalOperationCount": "29649",
  "completeOperationCount": "29649",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-55v5f",
  "reason": "<redacted>"
}

Unfortunately, I forgot to check that the workflows were actually deleted with temporal workflow list.
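
In retrospect, a count over the same cutoff would have been enough to verify it:

temporal workflow count --query 'CloseTime < "2025-06-30T23:59:59Z"' --command-timeout 90s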

Batch job #2

I then ran the next batch job to delete all remaining workflows that closed more than 7d ago, with a higher RPS, having seen how Temporal and its DB coped with the previous job:

temporal workflow delete \
  --query 'CloseTime < "2025-07-22T00:00:00Z"' \
  --reason "<redacted>" \
  --rps 25 \
  --command-timeout 300s
Start batch against approximately 537082 workflow(s)? y/N y
Started batch for job ID: 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938

And this is its status as of this writing:

temporal batch describe --job-id 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T12:23:02.197788770Z",
  "closeTime": "2025-07-29T18:40:31.324505932Z",
  "totalOperationCount": "536921",
  "completeOperationCount": "536921",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-mqt29",
  "reason": "<redacted>"
}

Logs

There was a massive spike of error and warning logs (~55M) between yesterday at 3pm UTC and today at 5am UTC (the Datadog screenshot below uses UTC+1):

Zooming out, this is what the past 7 days look like:

There are two groups of error logs:

  • Operation failed with an error (~21M logs). Example:
    {
      "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154
        go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric
        /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217
      go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution
        /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:135
      go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:360
      go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:129
      go.temporal.io/server/service/history/queues.(*executableImpl).Execute
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:374
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:223
      go.temporal.io/server/common/backoff.ThrottleRetry.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62
      go.temporal.io/server/common/backoff.ThrottleRetryContext
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89
      go.temporal.io/server/common/backoff.ThrottleRetry
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211",
      "level": "error",
      "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217",
      "ts": "2025-07-30T04:28:54.046Z",
      "error": "context deadline exceeded"
    }
    
  • Critical error processing task, retrying. (~12.6M logs). Example (see the spot-check sketched after this list):
    {
      "unexpected-error-attempts": 69,
      "queue-task-type": "VisibilityDeleteExecution",
      "shard-id": 441,
      "address": "10.160.48.225:7234",
      "wf-namespace-id": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed",
      "level": "error",
      "wf-id": "autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z",
      "wf-history-event-id": 0,
      "queue-task": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed/autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z/0198049b-415b-7910-8d88-e351db0fc0a9",
      "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68",
      "attempt": 70,
      "wf-run-id": "0198049b-415b-7910-8d88-e351db0fc0a9",
      "component": "visibility-queue-processor",
      "queue-task-key": {
        "FireTime": "1970-01-01T00:00:00Z",
        "TaskID": 817889465
      },
      "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154
      go.temporal.io/server/common/log.(*lazyLogger).Error
        /home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68
      go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr.func1
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:535
      go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr
        /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:599
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:224
      go.temporal.io/server/common/backoff.ThrottleRetry.func1
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62
      go.temporal.io/server/common/backoff.ThrottleRetryContext
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89
      go.temporal.io/server/common/backoff.ThrottleRetry
        /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233
      go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
        /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211",
      "operation-result": "OperationCritical",
      "ts": "2025-07-30T03:56:31.010Z",
      "error": "unable to delete custom search attributes: context deadline exceeded"
    }
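
To figure out whether these executions were removed from the history store but left behind in visibility, a spot-check along these lines (using the wf-id and wf-run-id from the second example above) should tell:

temporal workflow describe \
  --workflow-id 'autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z' \
  --run-id 0198049b-415b-7910-8d88-e351db0fc0a9

If this returns a not-found error while the execution still shows up in temporal workflow list, the mutable state was deleted but its visibility record was not.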
    

DB metrics

These are some metrics between yesterday at 00:00 UTC and now (11am UTC):




Environment

  • Temporal server version: 1.27.2.
  • DB: MySQL v8 (Amazon RDS).
  • Helm chart version: 0.63.0. Each of the four deployments gets three pods.
  • Number of shards: 512 (we don’t know why it was set that high, so we’re planning to shrink it in the future).

I’d really appreciate any help to get to the bottom of this issue! Thanks!

After posting the message above, I realised that I had accidentally run temporal workflow delete with --rps 25 when I actually meant to use --rps 15. I suppose that could’ve overwhelmed Temporal and/or its DB.

So, I re-ran the delete subcommand with 15 RPS:

temporal workflow delete \
  --query 'CloseTime < "2025-07-22T00:00:00Z"' \
  --reason "<redacted>" \
  --rps 15 \
  --command-timeout 300s
Start batch against approximately 283507 workflow(s)? y/N y
Started batch for job ID: ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e

But it isn’t actually deleting anything, despite having run for half an hour and reporting 11% completion without any failures:

temporal batch describe --job-id ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e",
  "state": "BATCH_OPERATION_STATE_RUNNING",
  "startTime": "2025-07-30T10:59:50.051215987Z",
  "totalOperationCount": "283507",
  "completeOperationCount": "31000",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-qtg8l",
  "reason": "<redacted>"
}

temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
Total: 283507
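
To keep an eye on the new job, a small loop like this compares what the batch job reports against what visibility still counts (it assumes jq is available in the admintools pod):

while true; do
  temporal batch describe --job-id ba1e0270-7e3a-4c0f-a0e4-1ba6ad08392e -o json \
    | jq -r '"\(.completeOperationCount)/\(.totalOperationCount) operations reported complete"'
  temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
  sleep 60
done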

I don’t see anything unusual in the DB metrics, but I do see a spike in warnings on Datadog:

The message is List query exceeded threshold. Example:

{
  "visibility-query": "CloseTime < \"2025-07-22T00:00:00Z\"",
  "level": "warn",
  "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:146",
  "ts": "2025-07-30T11:38:27.315Z",
  "namepsace": "default",
  "duration": 1.968041448
}

There are zero error messages so far.

UPDATE: The job completed but it didn’t actually delete anything:

temporal workflow count --query 'CloseTime < "2025-07-22T00:00:00Z"' --command-timeout 90s
Total: 283507

I just created an issue on GitHub.

I also realised that batch size has nothing to do with this, as even a tiny batch behaves the same way:

$ temporal workflow delete --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --reason 'Reproducing batch deletion for upcoming GH issue' --rps 5 --command-timeout 90s
Start batch against approximately 56 workflow(s)? y/N y
Started batch for job ID: 497cbd13-f55a-4a7e-b384-68bc93afd5fd

$ temporal batch describe --job-id 497cbd13-f55a-4a7e-b384-68bc93afd5fd -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "497cbd13-f55a-4a7e-b384-68bc93afd5fd",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-31T09:36:57.740641188Z",
  "closeTime": "2025-07-31T09:37:10.744771544Z",
  "totalOperationCount": "56",
  "completeOperationCount": "56",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-qtg8l",
  "reason": "Reproducing batch deletion for upcoming GH issue"
}

$ temporal workflow count --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --command-timeout 90s
Total: 56
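
To pin down a concrete example for the GitHub issue, listing one of those 56 executions and then describing it should show whether it only survives in the visibility store (a sketch; the placeholders are to be filled in from the list output):

$ temporal workflow list --query 'CloseTime > "2025-07-02T01:30:00Z" AND CloseTime < "2025-07-02T01:31:00Z"' --limit 1 -o json
$ temporal workflow describe --workflow-id '<workflow id from the list output>' --run-id '<run id from the list output>'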