I have an activity called “AwaitState” that polls the state of an object in the database using server-side retries. As long as the state is In Progress, the activity throws an error, so it gets retried according to the RetryOptions specified.
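Roughly, the setup looks like this (a simplified sketch, not the real code; `ObjectRepository` and the state values are placeholders):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;

@ActivityInterface
public interface AwaitStateActivity {
    @ActivityMethod
    void awaitState(String objectId);
}

// Hypothetical DB accessor, only for this sketch.
interface ObjectRepository {
    String getState(String objectId);
}

class AwaitStateActivityImpl implements AwaitStateActivity {
    private final ObjectRepository repository;

    AwaitStateActivityImpl(ObjectRepository repository) {
        this.repository = repository;
    }

    @Override
    public void awaitState(String objectId) {
        String state = repository.getState(objectId);
        if ("In Progress".equals(state)) {
            // Failing the attempt makes the server schedule the next attempt
            // according to the RetryOptions configured for this activity.
            throw new IllegalStateException("Object " + objectId + " is still In Progress");
        }
        // Any other state: return normally and the activity completes.
    }
}
```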
Recently, we have been facing an issue with this logic. The AwaitState activity hangs after some retries and never actually executes. For example, the activity retries 5 times; at attempt number 6, I can see that the activity is pending, but nothing inside it is executed. After 1 hour, attempt 6 times out, the activity is retried, and it behaves normally again.
My issue is that attempt number 6 didn’t get executed at all. I have a log statement at the start of the activity and it never got printed.
What could be the reason for such an issue?
I see, thanks, I’ll check it out. Can I still see these failures or crashes in the workflow history even if the activity retried after the timeout and succeeded?
Also, in my case it’s the StartToClose timeout that is firing.
Could the reason for the timeout be that the worker went down while the activity was waiting for its next attempt? I know that if the worker restarts while an activity is being executed and the activity doesn’t heartbeat, it will wait for the StartToCloseTimeout. But in my case the activity was waiting for its next attempt when the worker went down, so this shouldn’t be an issue, right?
If an activity is waiting for a retry, then restarting the worker shouldn’t affect it. But if the activity task was in transit when the worker went down, then the task might be lost, and the StartToClose timeout will be used to detect that.
Alright, thanks. And one last thing about the first question: can I see the failures/crashes that led the activity to wait for the StartToClose timeout in the workflow history, or might they not be there? In my case, I can’t see any failures in the history.
Yes, but why would that happen? No suspicious or unexpected failures happened during the process. The activity was behaving fine, but it suddenly hung on attempt #6 waiting for the StartToCloseTimeout.
Is it expected with Temporal that some activity attempts are simply never executed, and that it relies on the StartToCloseTimeout to try again?
I don’t know your environment. There are many possible reasons for an activity task being dropped, such as network issues, worker process crashes, frontend and matching engine restarts, etc. Temporal cannot prevent them and relies on timeouts to detect these failures. So setting an appropriate StartToClose timeout (and a heartbeat timeout if StartToClose needs to be long) is essential.
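For example, something along these lines (assuming the Java SDK since you mentioned RetryOptions; the durations are just illustrative, pick values that match your real per-attempt time):

```java
import java.time.Duration;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;

ActivityOptions options = ActivityOptions.newBuilder()
        // Keep this close to how long a single attempt really takes, so a
        // lost activity task is detected quickly instead of after an hour.
        .setStartToCloseTimeout(Duration.ofMinutes(2))
        // Only needed if a single attempt can legitimately run for a long
        // time; heartbeats let the server notice a dead worker sooner.
        .setHeartbeatTimeout(Duration.ofSeconds(30))
        .setRetryOptions(RetryOptions.newBuilder()
                .setInitialInterval(Duration.ofSeconds(10))
                .setBackoffCoefficient(2.0)
                .setMaximumInterval(Duration.ofMinutes(5))
                .build())
        .build();
```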
Alright, thank you! These will help me debug the issue. I resolved it by reducing the StartToCloseTimeout, since it doesn’t need to be that long, but I wanted to better understand why this usually happens.