Activity stuck and not processing until start-to-close timeout is hit

Hi there,

When starting a bunch of workflows within a short timeframe, we notice that some activities (different activities each time, seemingly at random) get stuck and do not process anything until they run into their start-to-close timeout and are retried.

I was able to pinpoint an error log message from temporal-history (version 1.23.1) that exactly matches the stuck activity of a given workflow, stating

Fail to process task

with these extras:

{
  "component": "transfer-queue-processor",
  "error": "context deadline exceeded",
  "lifecycle": "Processing failed",
  "level": "error"
}

Including this stack trace:

go.temporal.io/server/common/log.(*zapLogger).Error
	/home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:156
go.temporal.io/server/common/log.(*lazyLogger).Error
	/home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68
go.temporal.io/server/service/history/queues.(*executableImpl).HandleErr
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:421
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
	/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:224
go.temporal.io/server/common/backoff.ThrottleRetry.func1
	/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:117
go.temporal.io/server/common/backoff.ThrottleRetryContext
	/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:143
go.temporal.io/server/common/backoff.ThrottleRetry
	/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:118
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
	/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
	/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211

The last frame points to temporal/common/tasks/fifo_scheduler.go at v1.23.1 (temporalio/temporal on GitHub).

Can you help me understand what is failing here, and how we can prevent an activity from being stuck until the start-to-close timeout is actually hit?

If this is transient it can be ignored; if not, I would look at system overload / DB overload. Do you have service metrics?

Thank you for pointing me in the right direction. I noticed that our DB was under full CPU load at that time, which probably explains the context deadline exceeded error.

Long-running activities must heartbeat and set a HeartbeatTimeout to detect failures faster than the StartToCloseTimeout.
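A minimal sketch of that pattern with the Go SDK (the item loop, the processItem helper, and all timeout values are illustrative assumptions, not from this thread):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// processItem stands in for the real unit of work (hypothetical helper).
func processItem(ctx context.Context, item string) error {
	// ... actual work ...
	return nil
}

// LongRunningActivity heartbeats after each item so the server can detect a
// lost worker within HeartbeatTimeout instead of waiting for StartToCloseTimeout.
func LongRunningActivity(ctx context.Context, items []string) error {
	for i, item := range items {
		if err := processItem(ctx, item); err != nil {
			return err
		}
		activity.RecordHeartbeat(ctx, i) // details (here: a progress index) are optional
	}
	return nil
}

func SampleWorkflow(ctx workflow.Context, items []string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		// With heartbeating in place, a dead worker is noticed after ~30s.
		HeartbeatTimeout: 30 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, LongRunningActivity, items).Get(ctx, nil)
}
```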

That I am aware of; the thing is that the activities that get stuck normally finish within a couple of seconds.

When the stuck activities occur, I can't see any logs or traces showing that they are actually running on the workers. To me it looks like the Temporal server thinks a worker is working on an activity when in fact none is. Could that be the case?
Are there any metrics that would support this?

Temporal relies on timeouts for failure detection, so the server doesn't know anything about the worker's state; it waits for each activity timeout to react.
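Given that, a common mitigation for activities that normally finish in seconds is to keep StartToCloseTimeout close to the expected runtime and lean on the retry policy, so a task the server believes is running, but that no worker actually picked up, gets retried quickly. A sketch with the Go SDK; the activity name and all values are illustrative assumptions:

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// QuickActivityWorkflow bounds how long a lost task can stall: with a tight
// StartToCloseTimeout plus retries, the server re-dispatches after ~15s
// rather than waiting out a long timeout.
func QuickActivityWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 15 * time.Second, // close to the expected few-second runtime
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	// "QuickActivity" is a hypothetical activity registered on the worker.
	return workflow.ExecuteActivity(ctx, "QuickActivity").Get(ctx, nil)
}
```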