I have an activity called “AwaitState” that polls the state of an object in the database using server-side retries. As long as the state is In Progress, the activity throws an error, so it gets retried according to the RetryOptions specified.
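Roughly, the setup looks like this (a simplified sketch, not the real code; `ObjectRepository` and the state values are placeholders):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;

@ActivityInterface
public interface AwaitStateActivity {
    @ActivityMethod
    void awaitState(String objectId);
}

// Hypothetical DB accessor, only for this sketch.
interface ObjectRepository {
    String getState(String objectId);
}

class AwaitStateActivityImpl implements AwaitStateActivity {
    private final ObjectRepository repository;

    AwaitStateActivityImpl(ObjectRepository repository) {
        this.repository = repository;
    }

    @Override
    public void awaitState(String objectId) {
        String state = repository.getState(objectId);
        if ("In Progress".equals(state)) {
            // Failing the attempt makes the server schedule the next attempt
            // according to the RetryOptions configured for this activity.
            throw new IllegalStateException("Object " + objectId + " is still In Progress");
        }
        // Any other state: return normally and the activity completes.
    }
}
```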
Recently, we have been facing an issue with this logic. The AwaitState activity hangs after some retries and never actually executes. For example, the activity retries 5 times; at attempt number 6, I can see that the activity is pending, but nothing inside it is executed. After 1 hour, attempt 6 times out, the activity is retried, and it behaves normally again.
My issue is that attempt number 6 didn’t get executed at all. I have a log statement at the start of the activity and it never got printed.
What could be the reason for such an issue?
I see, thanks, I’ll check it out. Can I still see these failures or crashes in the workflow history even if the activity retried after the timeout and succeeded?
Also, in my case it’s the StartToClose timeout that is firing.
Could the reason for the timeout be that the worker went down while the activity was waiting for its next attempt? I know that if the worker restarts while an activity is being executed and the activity doesn’t heartbeat, it will wait for the StartToCloseTimeout. But in my case the activity was waiting for its next attempt when the worker went down, so this shouldn’t be an issue, right?
If an activity is waiting for a retry, then restarting the worker shouldn’t affect it. But if the activity task was in transit when the worker went down, then the task might be lost, and the StartToClose timeout will be used to detect that.
Alright, thanks. And one last thing about the first question: can I see the failures/crashes that led the activity to wait for the StartToClose timeout in the workflow history, or might they not be there? In my case, I can’t see any failures in the history.
Yes, but why would that happen? No suspicious or unexpected failures happened during the process. The activity was behaving fine, but it suddenly hung on attempt #6 waiting for the StartToCloseTimeout.
Is it expected with Temporal that some activity attempts are simply never executed, and that it relies on the StartToCloseTimeout to try again?
I don’t know your environment. There are many possible reasons for an activity task being dropped, such as network issues, worker process crashes, frontend and matching engine restarts, etc. Temporal cannot prevent them and relies on timeouts to detect these failures. So setting an appropriate StartToClose timeout (and a heartbeat timeout if StartToClose needs to be long) is essential.
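For example, something along these lines (assuming the Java SDK since you mentioned RetryOptions; the durations are just illustrative, pick values that match your real per-attempt time):

```java
import java.time.Duration;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;

ActivityOptions options = ActivityOptions.newBuilder()
        // Keep this close to how long a single attempt really takes, so a
        // lost activity task is detected quickly instead of after an hour.
        .setStartToCloseTimeout(Duration.ofMinutes(2))
        // Only needed if a single attempt can legitimately run for a long
        // time; heartbeats let the server notice a dead worker sooner.
        .setHeartbeatTimeout(Duration.ofSeconds(30))
        .setRetryOptions(RetryOptions.newBuilder()
                .setInitialInterval(Duration.ofSeconds(10))
                .setBackoffCoefficient(2.0)
                .setMaximumInterval(Duration.ofMinutes(5))
                .build())
        .build();
```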
Alright, thank you! These will help me debug the issue. I resolved it by reducing the StartToCloseTimeout, since it doesn’t need to be that long, but I wanted to better understand why this usually happens.