Hey folks,
To make temporal workflow list and its Web UI equivalent usable again, I reduced the retention period from 30d to 7d and submitted two batch jobs to delete workflows that closed more than 7d ago. Both batch jobs completed successfully and reported having deleted all of the targeted workflows, but I’m still seeing many workflows that closed before the cutoff date.
Before running the batch jobs, I ran temporal workflow count, which reported 594,313 workflows closed before the 7d cutoff (--query 'CloseTime < "2025-07-22T23:59:59Z"'). As of this writing, the same command reports 283,507 workflows, so only about half of them appear to have been deleted.
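For completeness, this is the exact count command (the timestamp is the 7d cutoff at the time I first checked):

temporal workflow count --query 'CloseTime < "2025-07-22T23:59:59Z"'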
Batch job #1
I started with a smaller batch to get a sense of how Temporal and its DB would handle a bigger one, so I first deleted workflows that closed before 1st July, which, given the original 30d retention, amounted to around one week’s worth of workflows.
This is the command I ran yesterday:
temporal workflow delete --query 'CloseTime < "2025-06-30T23:59:59Z"' --command-timeout 300s --rps 10
Start batch against approximately 30278 workflow(s)? y/N y
Started batch for job ID: 176074b3-8405-43bf-9fe4-a33b793f2a41
And this is its status as of today:
temporal batch describe --job-id 176074b3-8405-43bf-9fe4-a33b793f2a41 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "176074b3-8405-43bf-9fe4-a33b793f2a41",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T09:48:45.509343914Z",
  "closeTime": "2025-07-29T10:38:44.424677935Z",
  "totalOperationCount": "29649",
  "completeOperationCount": "29649",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-55v5f",
  "reason": "<redacted>"
}
Unfortunately, I forgot to verify with temporal workflow list that those workflows had actually been deleted.
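In hindsight, a count scoped to that first cutoff would have been a quick sanity check; if the deletions had fully propagated to visibility, it should return 0:

temporal workflow count --query 'CloseTime < "2025-06-30T23:59:59Z"'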
Batch job #2
I then ran the next batch job to delete all remaining workflows that closed over 7d ago, with a higher RPS this time, having seen how Temporal and its DB coped with the previous job:
temporal workflow delete \
--query 'CloseTime < "2025-07-22T00:00:00Z"' \
--reason "<redacted>" \
--rps 25 \
--command-timeout 300s
Start batch against approximately 537082 workflow(s)? y/N y
Started batch for job ID: 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938
And this is its status as of this writing:
temporal batch describe --job-id 9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938 -o json
{
  "operationType": "BATCH_OPERATION_TYPE_DELETE",
  "jobId": "9e17a1b4-bd3f-44a1-a9b3-c7e2b11a4938",
  "state": "BATCH_OPERATION_STATE_COMPLETED",
  "startTime": "2025-07-29T12:23:02.197788770Z",
  "closeTime": "2025-07-29T18:40:31.324505932Z",
  "totalOperationCount": "536921",
  "completeOperationCount": "536921",
  "identity": "temporal-cli:temporal@temporal-admintools-6578f9bc6d-mqt29",
  "reason": "<redacted>"
}
Logs
There was a massive spike of error and warning logs (55M) between yesterday at 3pm UTC and today at 5am UTC (Datadog screenshot below uses UTC+1):
Zooming out, this is what the past 7 days look like:
There are two groups of error logs:
- Operation failed with an error (~21M logs). Example:

{ "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154 go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217 go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution /home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:135 go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:360 go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute /home/runner/work/docker-builds/docker-builds/temporal/service/history/visibility_queue_task_executor.go:129 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:374 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:223 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89 go.temporal.io/server/common/backoff.ThrottleRetry /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211", "level": "error", "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:217", "ts": "2025-07-30T04:28:54.046Z", "error": "context deadline exceeded" }

- Critical error processing task, retrying (~12.6M logs). Example:

{ "unexpected-error-attempts": 69, "queue-task-type": "VisibilityDeleteExecution", "shard-id": 441, "address": "10.160.48.225:7234", "wf-namespace-id": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed", "level": "error", "wf-id": "autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z", "wf-history-event-id": 0, "queue-task": "e8ded80f-c3e7-43d8-9061-3fe119bda9ed/autoRotateSignerAuthorities/t-yaimdsom6j/td-axmffjrsje-2025-07-13T16:25:00Z/0198049b-415b-7910-8d88-e351db0fc0a9", "logging-call-at": "/home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68", "attempt": 70, "wf-run-id": "0198049b-415b-7910-8d88-e351db0fc0a9", "component": "visibility-queue-processor", "queue-task-key": { "FireTime": "1970-01-01T00:00:00Z", "TaskID": 817889465 }, "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error /home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154 go.temporal.io/server/common/log.(*lazyLogger).Error /home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68 go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr.func1 /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:535 go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr /home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:599 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:224 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:62 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89 go.temporal.io/server/common/backoff.ThrottleRetry /home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:63 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211", "operation-result": "OperationCritical", "ts": "2025-07-30T03:56:31.010Z", "error": "unable to delete custom search attributes: context deadline exceeded" }
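My current reading of these errors is that the deletions went through in the main store, but the corresponding VisibilityDeleteExecution tasks keep timing out against the visibility store, so stale rows keep showing up in the list. I haven’t confirmed this yet; if it’s right, describing one of the leftover workflows should return a not-found error even though it still appears in temporal workflow list. A spot check along these lines (the workflow and run IDs are placeholders for one of the leftover entries):

# placeholders: use the IDs of any workflow that still shows up in temporal workflow list
temporal workflow describe \
  --workflow-id "<a-workflow-id-still-returned-by-list>" \
  --run-id "<its-run-id>"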
DB metrics
These are some metrics between yesterday at 00:00 UTC and now (11am UTC):
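On the DB side, I haven’t queried the visibility database directly yet, but a check along the following lines should show how many visibility rows still predate the 7d cutoff. This is a sketch assuming the default MySQL visibility schema (an executions_visibility table with a close_time column) and a visibility database named temporal_visibility; the host, user, and database names are placeholders:

# table/column names assume the default MySQL visibility schema; host, user and DB name are placeholders
mysql -h <rds-endpoint> -u <user> -p temporal_visibility \
  -e "SELECT COUNT(*) FROM executions_visibility WHERE close_time < '2025-07-22 00:00:00';"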
Environment
- Temporal server version: 1.27.2.
- DB: MySQL v8 (Amazon RDS).
- Helm chart version: 0.63.0. Each of the four deployments gets three pods.
- Number of shards: 512 (we don’t know why it was set this high, so we’re planning to shrink it in the future).
I’d really appreciate any help getting to the bottom of this issue. Thanks!





