Activity Not Being Called On Retry

I have an activity called “AwaitState” that polls the state of an object in the database using server-side retry. As long as the state is In Progress, the activity throws an error, so it is retried according to the RetryOptions specified.
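
To make the pattern concrete, here is a minimal sketch of this polling-via-retry approach using the Temporal Java SDK. The class, method, and database-access names are placeholders, not my actual code:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import java.time.Duration;

@ActivityInterface
public interface AwaitStateActivity {
  void awaitState(String objectId);
}

// Activity implementation: fail while the object is still In Progress so that
// the server retries the attempt according to the RetryOptions below.
public class AwaitStateActivityImpl implements AwaitStateActivity {
  @Override
  public void awaitState(String objectId) {
    String state = loadStateFromDatabase(objectId); // placeholder DB lookup
    if ("In Progress".equals(state)) {
      throw new IllegalStateException("Object " + objectId + " is still In Progress");
    }
  }

  private String loadStateFromDatabase(String objectId) {
    // Placeholder for the real database query.
    return "In Progress";
  }
}

// Inside the workflow: the stub’s RetryOptions drive the polling interval.
AwaitStateActivity activities = Workflow.newActivityStub(
    AwaitStateActivity.class,
    ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofHours(1))
        .setRetryOptions(RetryOptions.newBuilder()
            .setInitialInterval(Duration.ofSeconds(30))
            .setBackoffCoefficient(1.0) // keep a fixed 30-second polling interval
            .build())
        .build());
```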

Recently, we have been facing an issue with this logic. The AwaitState activity hangs after some retries and doesn’t get executed. For example, the activity retries 5 times; at attempt number 6, I can see that my activity is pending but nothing inside it got executed. After 1 hour, attempt 6 times out, my activity is retried, and it behaves normally.

My issue is that attempt number 6 didn’t get executed at all. I have a log statement at the start of the activity, and it was never printed.
What could be the reason for such an issue?

What is the state of the pending activity? Could it be waiting for the StartToCloseTimeout?

Where can I check the pending activity state?

Also, why would the activity wait for the StartToClose timeout instead of being executed?

Where can I check the pending activity state?

In the “pending activities” view in the UI, or via the temporal workflow describe CLI command.
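
For example (the workflow ID below is a placeholder):

```
temporal workflow describe --workflow-id <your-workflow-id>
```

The output should list the pending activities along with their state, attempt count, and last failure.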

Also, why would the activity wait for the StartToClose timeout instead of being executed?

Because a process crash or some other failure can prevent it from being executed. Timeouts are protection against such situations. See The 4 Types of Activity timeouts | Temporal Technologies
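
For reference, a rough sketch of how the four timeouts map onto ActivityOptions in the Java SDK (the durations are arbitrary examples, not recommendations):

```java
import io.temporal.activity.ActivityOptions;
import java.time.Duration;

ActivityOptions options = ActivityOptions.newBuilder()
    // Total time allowed for the activity, including all retries.
    .setScheduleToCloseTimeout(Duration.ofHours(2))
    // How long a single attempt may wait in the task queue before a worker picks it up.
    .setScheduleToStartTimeout(Duration.ofMinutes(5))
    // How long a single attempt may run; this is what detects a lost or dropped attempt.
    .setStartToCloseTimeout(Duration.ofMinutes(2))
    // Maximum gap between heartbeats before the attempt is considered failed.
    .setHeartbeatTimeout(Duration.ofSeconds(30))
    .build();
```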

I see, thanks, I’ll check it out. Can I still see these failures or crashes in the workflow history even if the activity retried after the timeout and succeeded?

Also, in my case it’s the StartToClose timeout that is expiring.

Could the reason for the timeout be that the worker went down while the activity was waiting for the next attempt? I know that if the worker restarts while an activity is executing and the activity doesn’t heartbeat, it will wait for the StartToCloseTimeout. But in my case the activity was waiting for the next attempt when the worker went down, so this shouldn’t be an issue, right?

If an activity is waiting for a retry, then restarting the worker shouldn’t affect it. But if the activity task was in transit when the worker went down, then the task might be lost, and the StartToClose timeout will be used to detect that.


Alright, thanks. Also, one last thing about the first question: can I see the failures/crashes that led the activity to wait for the StartToClose timeout in the workflow history, or might they not be there? In my case, I can’t see any failures in the history.

If the activity is still pending, this information is available in the pending activities view.

After an activity completes, the information about the failure that caused the last retry is stored in the ActivityTaskStartedEvent.
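
If the activity has already retried or completed, those events (including the last failure recorded on the ActivityTaskStarted event of the retried attempt) can be inspected in the history, for example via the CLI (the workflow ID is a placeholder):

```
temporal workflow show --workflow-id <your-workflow-id>
```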

The issue happened again. The state of the activity is PENDING_ACTIVITY_STATE_STARTED.

The last failure message was the custom error message I throw to force the activity to fail and retry to achieve the polling.

Here are some screenshots.

The activity is on attempt number 6.

This is the retry policy. It should retry every 30 seconds.

But you can see that the activity has been running for 37 minutes and is still on attempt 6.

Here’s some other information about the activity.

What could be the reason? This is very weird in my opinion.

The StartToClose timeout is 1 hour, so the service is waiting for that hour to expire before marking this activity as timed out and taking the next action, such as a retry.

Yes, but why would that happen? No suspicious or unexpected failures happened during the process. The activity was behaving fine, but it suddenly hung on attempt #6, waiting for the StartToCloseTimeout.

Is it expected with Temporal that some activity attempts are not executed and rely on the StartToCloseTimeout to try again?

I don’t know your environment. There are many possible reasons for an activity task being dropped, such as network issues, worker process crashes, frontend and matching engine restarts, etc. Temporal cannot prevent them and relies on timeouts to detect these failures. So setting an appropriate StartToClose (and heartbeat if the StartToClose needs to be long) timeout is essential.
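
As a rough sketch of that advice in the Java SDK: when a single attempt legitimately needs a long StartToClose timeout, heartbeating lets the service detect a lost worker within the HeartbeatTimeout instead of waiting out the full StartToClose (the method and variable names here are illustrative):

```java
import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;

public void longRunningWork(String objectId, int totalSteps) {
  ActivityExecutionContext ctx = Activity.getExecutionContext();
  for (int step = 0; step < totalSteps; step++) {
    doOneStep(objectId, step); // placeholder for the real unit of work
    // Record progress; if heartbeats stop arriving within the HeartbeatTimeout,
    // the service fails the attempt quickly instead of waiting for StartToClose.
    ctx.heartbeat(step);
  }
}
```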


Alright, thank you! These will help me debug the issue. I solved it by reducing the StartToCloseTimeout, since it doesn’t need to be that long, but I wanted to understand more about why this usually happens.