We have a cron workflow and each night, it creates an activity to purge 1500 orphan rows in a database table and continues to create activities to do this until the orphans are all removed. Normally, this runs fine. Occasionally, there are so many orphans that the workflow runs until it exceeds some limit and is terminated by the server.
This would be OK if it would pick up the next night but once terminated, it needs a code change to get rescheduled.
Is there a setting we can use to tell it to not be terminated but just end so that it will pick up the next time it is scheduled to run?
until it exceeds some limit and is terminated by the server.
I think this can be one of two things: either the execution reaches the workflow execution timeout set in your WorkflowOptions, or your workflow is starting so many activities that it reaches the 50K event limit on the history for this execution. From your description, it seems it could be the second case.
Each activity invocation creates 3 history events: activity scheduled, started, and then completed or failed.
Given the number of activities you invoke, you could be reaching this limit, and if it is reached the server automatically terminates your workflow execution.
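To put rough numbers on that, here is a minimal sketch in plain Python (no Temporal SDK involved). It takes the 3 events per activity and the 50K limit from this thread; the function name and the `fixed_overhead` parameter are made up for illustration. Real histories also contain workflow task events, so treat the result as an upper bound rather than an exact figure.

```python
# Rough history budget for a workflow that only runs purge activities.
# EVENTS_PER_ACTIVITY = 3 comes from the post above; real histories also
# contain workflow task events, so the result is an upper bound.

HISTORY_LIMIT = 50_000      # history event limit mentioned in this thread
EVENTS_PER_ACTIVITY = 3     # scheduled, started, completed/failed
ROWS_PER_ACTIVITY = 1_500   # batch size from the original question


def max_activities(limit: int = HISTORY_LIMIT,
                   events_per_activity: int = EVENTS_PER_ACTIVITY,
                   fixed_overhead: int = 0) -> int:
    """Upper bound on activities that fit in one execution's history.

    fixed_overhead is a hypothetical allowance for events that are not
    tied to an activity (workflow start, timers, and so on).
    """
    if events_per_activity <= 0:
        raise ValueError("events_per_activity must be positive")
    return max(0, (limit - fixed_overhead) // events_per_activity)


if __name__ == "__main__":
    n = max_activities()
    print(n, "activities,", n * ROWS_PER_ACTIVITY, "rows at most")
```

Under these assumptions a single execution tops out at about 16,666 activities, so a night with more orphans than that can purge in one run will hit the limit.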
You could check this via tctl for example:
tctl wf desc -w <wfid> -r <runid> | jq .workflowExecutionInfo.historyLength
The workaround for this is typically to use continueAsNew to continue your execution after X history events so that you do not reach this limit. By the way, you really do not want to even get close to this limit: the larger the history, the more your workers have to pull from the server when they need the whole history for an execution. Typically you would want to call continueAsNew around the 20K mark or so, in my opinion.
The reason I say "typically" is that, unfortunately, continueAsNew currently has problems with cron executions; see issue here.
My recommendation for this use case would be to have a main workflow, which is a cron workflow, that starts, let’s say (if we use your 1500 example), 5 child workflows (in parallel if you need), where each child workflow processes 500 rows (calling however many activities it needs). Your child workflows can call continueAsNew to process much more if needed; to the parent workflow this will still look like a single child workflow invocation.
Make sure that in your main workflow you limit the invocation of child workflows so you do not hit the 50K limit.
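As a rough illustration of capping the child workflows, here is a plain-Python planning sketch (again no Temporal SDK). The function name, the assumed 3 parent-history events per child, and the 20K parent budget are all assumptions for the example, not values confirmed by the server; adjust them to what you actually see in your parent's history.

```python
import math
from typing import Tuple


def plan_children(total_rows: int,
                  rows_per_child: int,
                  events_per_child: int = 3,        # assumed cost per child in the parent history
                  parent_event_budget: int = 20_000  # assumed safety margin below the 50K limit
                  ) -> Tuple[int, int]:
    """Return (children_to_start, rows_left_for_next_run).

    The parent starts at most as many children as fit in its own
    history budget; anything left over waits for the next cron run.
    """
    if rows_per_child <= 0 or events_per_child <= 0:
        raise ValueError("rows_per_child and events_per_child must be positive")
    needed = math.ceil(total_rows / rows_per_child)
    cap = parent_event_budget // events_per_child
    started = min(needed, cap)
    covered = min(total_rows, started * rows_per_child)
    return started, total_rows - covered


if __name__ == "__main__":
    print(plan_children(2_500, 500))
    print(plan_children(10_000_000, 500, parent_event_budget=30))
```

The point of returning the leftover row count is that the parent can finish normally even when it cannot cover everything, and the next scheduled run picks up the rest.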
Hope this helps.
Thank you for this. We are hitting the 50K max which is causing the termination. I also saw the issue you mentioned so I think that won’t work. I think we need to use your recommendation or simply keep track of activities and when we reach so many, just end so the next day can pick up the work. And possibly run it several times a day. Thanks for these ideas and the issue link, it is very helpful.
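For anyone landing here later, the "keep track of activities and just end" idea can be sketched like this in plain Python. `purge_batch` stands in for the real purge activity and `make_fake_purge` is a made-up helper for trying it out; neither is Temporal API. The only assumption is that the activity reports how many rows it removed.

```python
from typing import Callable, Tuple


def run_bounded_purge(purge_batch: Callable[[], int],
                      max_activities: int) -> Tuple[int, bool]:
    """Call purge_batch until no orphans remain or the budget is spent.

    Returns (rows_deleted, finished). Returning normally in both cases
    is the point: the run ends cleanly instead of being terminated, so
    the next scheduled run can continue the work.
    """
    deleted = 0
    for _ in range(max_activities):
        removed = purge_batch()
        if removed == 0:
            return deleted, True   # nothing left to purge
        deleted += removed
    return deleted, False          # budget spent, leave the rest for the next run


def make_fake_purge(orphans: int, batch_size: int = 1_500) -> Callable[[], int]:
    """Hypothetical stand-in for the purge activity, backed by a counter."""
    remaining = orphans

    def purge() -> int:
        nonlocal remaining
        removed = min(batch_size, remaining)
        remaining -= removed
        return removed

    return purge


if __name__ == "__main__":
    print(run_bounded_purge(make_fake_purge(4_000), max_activities=10))
    print(run_bounded_purge(make_fake_purge(4_000), max_activities=2))
```

Picking `max_activities` well below the point where the history reaches 50K events keeps each run safe, and running the cron several times a day makes up for the smaller batch per run.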