We have a multi-step workflow that runs for several days. Some activities submit a batch job to an external service and poll for the results in a sleep-poll-sleep loop; a job can take days to complete. The polling activity is idempotent: when restarted it resumes polling the last submitted job, and each poll call is accompanied by a Temporal heartbeat. However, the workers are redeployed periodically (CI/CD), so a running worker pod often gets killed and replaced by a new worker pod. Temporal detects the missed heartbeat and reschedules the activity, but counts that as a retry. After 3 retries the workflow is failed by Temporal (our current max_retries is 3), even though the job was still being polled successfully and would have succeeded had the workflow not failed.

Here is the problem: if we set retries to infinite, that also applies to genuine exceptions and timeouts raised by the Python code the activity runs, which should not be retried beyond a reasonable number of attempts. The obvious patch is to handle program exceptions manually (or with tenacity), but that feels like working around Temporal's core strengths instead of using them. Is there another way to do this?
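For reference, the polling activity looks roughly like this sketch (the names `poll_batch_job` and `get_job_status`, and the 60-second interval, are illustrative, not our actual code):

```python
import asyncio

from temporalio import activity


@activity.defn
async def poll_batch_job(job_id: str) -> str:
    """Idempotent poller: on restart it resumes polling the same job_id."""
    while True:
        # Hypothetical client call to the external batch service.
        status = await get_job_status(job_id)
        if status.done:
            return status.result
        # Heartbeat on every poll so Temporal knows the activity is alive;
        # when the worker pod is killed, the missed heartbeat makes Temporal
        # reschedule the activity -- and count it as a retry attempt.
        activity.heartbeat(job_id)
        await asyncio.sleep(60)  # sleep-poll-sleep loop
```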
Restrict retries with the activity's ScheduleToClose timeout rather than an attempt count: set the retry policy to unlimited attempts, so retries caused by heartbeat timeouts (worker redeploys) can never exhaust it, and let ScheduleToClose bound the total time the activity may take. Alternatively, add custom logic in the activity code that raises a non-retryable failure for genuine program errors, based on the activity's initial schedule time, see here.
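A sketch of both suggestions with the Temporal Python SDK, assuming hypothetical names (`poll_until_done`, `BatchWorkflow`) and illustrative timeout values; it cannot run without a worker and server, so treat it as a configuration sketch:

```python
from datetime import datetime, timedelta, timezone

from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ApplicationError


@activity.defn
async def poll_batch_job(job_id: str) -> str:
    try:
        return await poll_until_done(job_id)  # hypothetical polling helper
    except Exception as err:
        # Option 2: custom non-retryable logic for genuine program errors.
        # Info.scheduled_time is the activity's initial schedule time
        # (Info.current_attempt_scheduled_time is the current attempt's).
        elapsed = datetime.now(timezone.utc) - activity.info().scheduled_time
        if elapsed > timedelta(hours=1):  # illustrative error budget
            raise ApplicationError(str(err), non_retryable=True) from err
        raise  # early failures stay retryable


@workflow.defn
class BatchWorkflow:
    @workflow.run
    async def run(self, job_id: str) -> str:
        return await workflow.execute_activity(
            poll_batch_job,
            job_id,
            # A killed worker pod is detected via the missed heartbeat...
            heartbeat_timeout=timedelta(minutes=2),
            # ...and retried without an attempt cap (0 = unlimited),
            # so redeploys alone can never fail the workflow:
            retry_policy=RetryPolicy(maximum_attempts=0),
            # Option 1: bound the total duration instead of the attempts.
            schedule_to_close_timeout=timedelta(days=5),
        )
```

With unlimited attempts, exceptions you know are fatal can also be listed in `RetryPolicy(non_retryable_error_types=[...])` (matched by exception class name, e.g. `"ValueError"`) so they fail the activity immediately instead of burning the time budget.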