By default Temporal will retry forever (or 10 years, according to this forum) to run a workflow or activity. What's the approach if you've missed an exception (and didn't make it non-retryable) or introduced a bug in a workflow or an activity that causes infinite retries? Even if you set a workflow timeout of a week, I guess you still want to monitor failures and find out about them as soon as possible. I guess we could send a notification on every failure. Is that the best approach?
By default temporal will retry forever (or 10 years according to this forum) to run a workflow or activity.
Workflow executions do not have a default retry policy, so they do not retry unless you explicitly say they should (which is pretty rarely needed).
Yes, activities do have a default retry policy, and it does not limit the number of attempts.
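Since unlimited attempts are the default, the interval between attempts just grows with exponential backoff until it hits the interval cap, and then retries keep firing at that cap forever. A minimal stdlib sketch of that schedule, assuming Temporal's documented retry-policy defaults (initial interval 1s, backoff coefficient 2.0, maximum interval 100s, unlimited attempts):

```python
# Sketch of an exponential-backoff retry schedule like Temporal's
# default activity retry policy. The default values below are
# assumptions taken from Temporal's documented defaults.
def retry_interval(attempt: int,
                   initial: float = 1.0,
                   coefficient: float = 2.0,
                   maximum: float = 100.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    return min(initial * coefficient ** (attempt - 1), maximum)

if __name__ == "__main__":
    # With no maximum_attempts the schedule never ends; it just
    # plateaus at the interval cap (100s from attempt 8 onward here).
    print([retry_interval(a) for a in range(1, 10)])
    # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0, 100.0]
```

The fix for the original question is to override this policy on the activity options: cap `maximum_attempts` and/or list the exception types you know are permanent as non-retryable error types, so a missed bug can't spin forever.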
The “10 years” I think is just a number; in practice it's the db capacity that will ultimately determine how many running executions you can have in your system.
Even if you set a timeout for a workflow for a week
Yes, you can set workflow run/execution timeouts, which means that when those timers fire the service will terminate/time out your execution. This comes with side effects, though: you can't react to it on the worker side, which can cause issues, as activities running at that moment keep executing on your workers until they either heartbeat or complete/fail. As an alternative, you could look into workflow timers, which you can react to.
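The timer alternative boils down to racing the workflow's main logic against a deadline you control, so the timeout path runs inside your code where you can clean up or notify. A rough sketch of the pattern using plain asyncio (standing in for a real Temporal workflow timer, which in actual workflow code must go through the SDK's deterministic timer APIs):

```python
import asyncio

async def do_work() -> str:
    # Stand-in for the workflow's main logic.
    await asyncio.sleep(10)  # pretend this takes far too long
    return "done"

async def workflow_body(deadline_seconds: float) -> str:
    # Race the work against a timer we own, so we can react
    # (cancel activities, send a notification, fail gracefully)
    # instead of being terminated from the outside.
    try:
        return await asyncio.wait_for(do_work(), timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # React here: this branch never runs with a service-enforced
        # execution timeout, which kills the workflow externally.
        return "timed out, cleaned up"

if __name__ == "__main__":
    print(asyncio.run(workflow_body(0.05)))  # → timed out, cleaned up
```

The design point is exactly the one in the answer above: a service-side execution timeout gives you no hook, while a timer inside the workflow gives you a normal code path on expiry.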
I guess you still want to monitor failures
The SDK emits a temporal_activity_execution_failed counter metric, which is recorded per activity attempt.
In activity code you can also get the retry attempt count via the activity context and use it for logging or for pushing custom metrics if needed.
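That attempt count is also the answer to "notify on every failure?": alert once when the attempt number crosses a threshold instead of paging on each retry. A hypothetical sketch (the threshold, the `maybe_alert` helper, and the explicit `attempt` parameter are illustrative; in real activity code the attempt would come from the activity context):

```python
# Assumed threshold — tune per activity; not a Temporal setting.
ALERT_AFTER_ATTEMPTS = 5

alerts_sent: list[str] = []

def maybe_alert(activity_name: str, attempt: int) -> None:
    # Fire a single alert when retries cross the threshold,
    # rather than notifying on every failed attempt.
    if attempt == ALERT_AFTER_ATTEMPTS:
        alerts_sent.append(f"{activity_name} is on attempt {attempt}")

if __name__ == "__main__":
    # Simulate seven consecutive failing attempts of one activity.
    for attempt in range(1, 8):
        maybe_alert("charge_card", attempt)
    print(alerts_sent)  # → ['charge_card is on attempt 5']
```

Exactly one notification goes out despite seven failures, which is usually what you want for a stuck-retry alarm.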
The original Temporal release had a 10-year limit. That limit was removed a long time ago, so it is fully unlimited now. The practical implication is that no 10-year timer is created if a limit is not specified.