Activity timeout bug

I have a simple workflow with 2 activity steps. The activities are long-running (up to an hour) and memory-intensive tasks (a host can only run 1 activity at a time). I currently don't do any heartbeating on the activities.

So my worker is configured with MaxConcurrentActivityExecutionSize as 1,
and my activity stub is configured with ScheduleToStartTimeout and StartToCloseTimeout of 1 hour each.
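
Roughly, the activity stub is configured like this (a minimal sketch; FeedIngestionActivities is a placeholder for the actual activity interface name):

import java.time.Duration;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;

// Sketch of the activity stub configuration described above.
// FeedIngestionActivities stands in for the real activity interface.
private final FeedIngestionActivities activities =
        Workflow.newActivityStub(
                FeedIngestionActivities.class,
                ActivityOptions.newBuilder()
                        .setScheduleToStartTimeout(Duration.ofHours(1))
                        .setStartToCloseTimeout(Duration.ofHours(1))
                        .build());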

I am noticing this bug: my first activity completes fine, and my second activity is scheduled but not executed immediately. Instead, after an hour (the activity timeout), the second activity gets executed twice.

So here is the sequence of events (from the Temporal UI plus my logs):
00:00 workflow started
00:00 first activity scheduled
00:00 first activity started
00:01 first activity complete

00:01 second activity scheduled; I can see from the logs that this did not start executing for an hour

– 1 hour later –
01:01 second activity started (last failure: “activity timeout”; see screenshot)
01:02 second activity complete

– duplicate (from host logs) –
01:03 second activity triggered again
01:04: PollerOptions: uncaught exception java.lang.RuntimeException: NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed

Workflow error (screenshot):

If you look at my activity_schedule_to_start p99 metric, it never goes above 13 minutes.

I am trying to understand why my second activity task is not getting executed immediately even though it shows as scheduled. And after the timeout happens, why is it getting executed twice?

Let me know if you need any more details or if there is a better place I should report bugs to.

From the log it looks like your activity completed on the 2nd attempt and the last failure was “activity timeout”. This means that on the first attempt your activity did not respond back (possibly the worker crashed/restarted or the activity hung). In this situation there is no way for Temporal to distinguish an activity that is still running from one that is no longer doing the work. This is exactly the situation that activity heartbeat is supposed to address, by providing a shorter time frame for detecting unresponsive long-running activities.

We won’t be able to tell what exactly happened to the first execution by looking at our UI, since all we know is that it timed out. But what you could do is look at the logs, find which worker picked up that activity at 11:30, and try to understand what exactly happened after that.

Regardless of the outcome of your investigation, I strongly recommend using activity heartbeat with a short heartbeat interval (e.g. 1 min in your case) in order to avoid long wait times, which are otherwise guaranteed to occur at least when you restart/redeploy your worker fleet.
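
As a rough sketch of what that looks like in the activity implementation (workComplete and doNextChunkOfWork are placeholders, not your code):

import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;

public void runIngestion() {
    ActivityExecutionContext ctx = Activity.getExecutionContext();
    long processed = 0;
    while (!workComplete()) {      // placeholder for the real completion check
        doNextChunkOfWork();       // placeholder for the real work
        processed++;
        ctx.heartbeat(processed);  // reports liveness (and progress) to the service
    }
}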

Please post here if you find any additional data, but so far I don’t see any bugs in the system based on the info that you’ve provided.

Thanks @Vitaly. That was my first hunch as well (that my activity host crashed on first try), but that does not explain the following:

  1. After an hour the activity is executed 2 times within a couple of seconds (based on my host logs).
  2. The first line in my activity is a log statement, and I don't see that log until after an hour. Is there a way to know which host picked up the first attempt and when?

Also this seems to be happening fairly regularly. I can see the following exception 20-30 times a day:
PollerOptions: uncaught exception java.lang.RuntimeException: NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed

Heartbeating is definitely something I am looking into, I just didn't get to it yet.

The above exception indicates that your activity indeed runs longer than the timeout interval and that it completes after a new execution has started (or the workflow completed). Is it possible that of the two executions you see, one is from attempt number 1 and the other is from attempt number 2?

There are a couple of options for figuring out which host picked up the first activity execution. One way is to increase the log level to TRACE, but that would result in a lot of logging (all RPC calls). From the UI you should be able to see pending activities (those that are currently running) on the workflow summary view; I believe it also shows which host runs them, but you would need to look at the affected workflow at the right moment in order to see it.
What I would actually recommend is adding some extra logging in your own code (make sure to log the attempt, activity id, workflow id and run id from Activity.getExecutionContext().getInfo()); this way you can tell when each activity execution starts and finishes.
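
For example (a sketch, assuming an SLF4J-style logger named log in the activity implementation):

import io.temporal.activity.Activity;
import io.temporal.activity.ActivityInfo;

ActivityInfo info = Activity.getExecutionContext().getInfo();
log.info("activity started: workflowId={} runId={} activityId={} attempt={}",
        info.getWorkflowId(), info.getRunId(), info.getActivityId(), info.getAttempt());
// ... actual activity work ...
log.info("activity finished: workflowId={} runId={} activityId={} attempt={}",
        info.getWorkflowId(), info.getRunId(), info.getActivityId(), info.getAttempt());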

The first and last statements of my activity are log statements, so I know for sure that both activity runs only started after an hour and that each of them only took a few seconds from start to finish. The two executions, I am guessing, are from attempt #1 and #2; what I am trying to understand is why they get executed together after the timeout.

Can you post raw logs that include workflow id/run id/activity id/attempt at the time when activity is started/finished?

Unfortunately I cannot get this since I don't have TRACE logs enabled. I only have my own logs with the activity-specific details that I emit.

Does the server store any more history than is listed in the UI? For long-running activities (even with heartbeating), the last event type is “ActivityTaskScheduled” until the activity is complete; the “ActivityTaskStarted” event only shows up after the host has completed the activity.

Is it possible to get the details of which host has picked up the task and when for all attempts?

You don’t need TRACE logs to get these details. Just use the activity execution context, mentioned above, to get all the details and log them as part of your activity execution. Then you can simply grep your logs and see everything that was happening. I hope this is helpful.

@Vitaly I updated my workflow to start heartbeating and also to emit more logs, and I am still seeing the same activity timeout errors.

My Workflow params:

Activity ScheduleToStartTimeout = 1 hour
Activity StartToCloseTimeout = 1 hour
Activity HeartbeatTimeout = 5 mins
Activity heartbeating rate in my code: every 30 seconds

Example workflow with the issue:

From Temporal UI:
WorkflowExecutionStarted Dec 14th 6:30:30 pm

WorkflowTaskScheduled     Dec 14th 6:30:30 pm
WorkflowTaskStarted       Dec 14th 6:30:30 pm
WorkflowTaskCompleted     Dec 14th 6:30:30 pm

ActivityTaskScheduled     Dec 14th 6:30:30 pm   activityId:05822e91-dea5-3c47-b16a-14440b4b4b1b, activityType.name:StartIngestion
ActivityTaskStarted       Dec 14th 6:36:00 pm   attempt:2, lastFailure: "activity timeout"
ActivityTaskCompleted     Dec 14th 6:36:05 pm

WorkflowTaskScheduled     Dec 14th 6:36:05 pm
....

From host logs:

6:30:30.487 PST: Started workflow with workflowId=3c113957-1e8e-4ead-bb30-ce11e74d1a51 and runId=f8f052fa-a034-404e-b57e-034de128b608

<different host>
6:35:40.315 PST: Executing feed_ingestion activity with following details: id 05822e91-dea5-3c47-b16a-14440b4b4b1b for workflow id 3c113957-1e8e-4ead-bb30-ce11e74d1a51 for attempt 1.
6:35:45.064 PST: Execution for feed_ingestion activity complete with following details: id 05822e91-dea5-3c47-b16a-14440b4b4b1b for workflow id 3c113957-1e8e-4ead-bb30-ce11e74d1a51 for attempt 1.
6:35:45.073 PST: io.grpc.StatusRuntimeException: NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed

<different host>
6:36:00.610 PST: Executing feed_ingestion activity with following details: id 05822e91-dea5-3c47-b16a-14440b4b4b1b for workflow id 3c113957-1e8e-4ead-bb30-ce11e74d1a51 for attempt 2.
6:36:05.139 PST: Execution for feed_ingestion activity complete with following details: id 05822e91-dea5-3c47-b16a-14440b4b4b1b for workflow id 3c113957-1e8e-4ead-bb30-ce11e74d1a51 for attempt 2.
....

From my host logs you can see that attempt #1 only started executing at 6:35. Could you help me understand why the activity is timing out and getting retried? Let me know if you need any more details.

Also, this is my schedule_to_start latency graph:

It is possible that the activity task took a few minutes to get picked up after being scheduled. But the heartbeat timeout should start from when the activity was picked up by a host, not from when it was scheduled, so it still doesn’t add up.

It looks like your workers might be running behind: assuming your activity schedule-to-start graph shows p99 in milliseconds, you have a backlog of up to 5 minutes. This would explain why your first attempt was delayed.

On the other hand, you are correct that the heartbeat timeout should be calculated from the time the activity execution started, and seeing an activity timeout 5 seconds after the start does look suspicious. Let me try to reproduce this issue locally.

For now, can you add a call to activity heartbeat at the start of that activity and see if the problem persists?

From my logs you can see that the first attempt’s execution starts only 5 minutes after the activity is scheduled, so moving the heartbeating earlier likely won’t make a difference, but I can give it a try.

After I set up heartbeating and added a 5 minute heartbeat timeout (previously I had only the 1 hour activity timeout), I am seeing these poller exceptions more frequently. Although the workflows are completing fine, my logs are filled with poller errors. I would love to understand the root cause of this; please let me know if there is anything else I can help with.

Also adding my worker configuration if that helps:

final WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskExecutionSize(200)
        .setMaxConcurrentActivityExecutionSize(1)
        .setWorkflowPollThreadCount(2)
        .setActivityPollThreadCount(5)
        .build();
final Worker worker = workerFactory.newWorker(workflowTaskQueueName, workerOptions);
worker.registerWorkflowImplementationTypes(FeedUploadWorkflowImpl.class);
worker.registerActivitiesImplementations(activities);

And my worker factory:

WorkerFactory.newInstance(workflowClient);

@Steven_Sun suggested that I reduce ActivityPollThreadCount to 1 since my MaxConcurrentActivityExecutionSize is 1. With more pollers than execution slots, my activity pollers might be pulling activity tasks but not working on them, causing them to time out. Going to give that a try.

I don’t think reducing ActivityPollThreadCount would help. It is a good thing to do to reduce unnecessary load on the service, but I believe a poll request is not initiated until there is spare capacity in the activity execution thread pool.

Makes sense. You are right, I just tried reducing it and that hasn’t solved the issue.

Hi @maxim, what my colleague is observing here is this: he wants only one activity running on a host at a time. What ends up happening is that the first activity succeeds, the second activity’s 1st attempt times out on heartbeat and the 2nd attempt kicks off, but then the 1st attempt actually starts running together with the 2nd attempt. Any idea how to prevent this?

In short: what could cause two attempts of the same activity to run at roughly the same time?

There are two separate issues here.

  1. Activity timeout is enforced on the service side only. So if an activity doesn’t heartbeat within the specified interval, it is timed out and retried. This doesn’t stop an activity execution that failed to heartbeat for some reason; on the next heartbeat, the original activity will receive an exception and can exit (see the sketch after this list).

  2. Are you saying that even with MaxConcurrentActivityExecutionSize set to 1 you see two activities executed by the same worker at the same time? If that is the case, then it is a bug that has to be fixed.
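
As a rough illustration of point 1 (ActivityCompletionException is the Java SDK exception type; the work methods are placeholders), the stale first attempt would see something like this:

import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;
import io.temporal.client.ActivityCompletionException;

ActivityExecutionContext ctx = Activity.getExecutionContext();
try {
    while (!workComplete()) {   // placeholder for the real loop condition
        doNextChunkOfWork();    // placeholder for the real work
        // Once the service has already timed this attempt out (or the
        // workflow has completed), this heartbeat throws and the attempt can stop.
        ctx.heartbeat(null);
    }
} catch (ActivityCompletionException e) {
    // The service no longer recognizes this attempt; abandon the work.
    throw e;
}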

Max concurrent executions are working as expected; I see only 1 execution running at a particular time on a single host. What Steven was mentioning is that both attempts of the activity execution happen close to each other (on different hosts, in the examples I looked at).

Then it is by design, as I described in 1. You can increase the initial retry interval in the retry options to ensure that the retry happens after a long enough period to avoid double execution.
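
A sketch of that on the activity stub options (the interval values are illustrative only, not a recommendation for this workload):

import java.time.Duration;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;

ActivityOptions options = ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofHours(1))
        .setHeartbeatTimeout(Duration.ofMinutes(5))
        .setRetryOptions(RetryOptions.newBuilder()
                // Illustrative: wait long enough before attempt 2 that a
                // delayed attempt 1 is unlikely to overlap with it.
                .setInitialInterval(Duration.ofMinutes(10))
                .build())
        .build();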

@maxim the problem here is the delay in the activity task getting started, not the activity execution taking too long or skipping heartbeats. My understanding is that the activity execution for attempt #1 is getting delayed in starting, by which time the server thinks the activity has timed out and kicks off another attempt. So increasing the initial retry interval won’t solve the problem, since the activity could be delayed by a non-deterministic amount of time.