Workflow cancellation timeouts

I currently have a Temporal workflow with around 400k live executions. When updates are made, we need to restart each execution with a cancel call followed by a start call. When I tried to do this through an SQS queue, I ran into timeouts: the cancellation would be placed on the task queue but would time out before the start task ran. This caused rampant "workflow already started" errors, because the workflows had not been cancelled by the time we reached the start task. Is there a way to ensure that the cancel task completes before the start task runs?

Have you considered using the TERMINATE_EXISTING WorkflowIdConflictPolicy when starting the workflow?
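
To make that suggestion concrete, here is a minimal sketch using the Temporal Python SDK, where the policy is passed as `id_conflict_policy` on `start_workflow`. The workflow type `OrderWorkflow`, the ID `order-123`, the task queue `orders`, the input payload, and the server address are all placeholders I made up for illustration, not details from the original post.

```python
import asyncio
from typing import Any


async def restart_workflow(
    client: Any,
    workflow_type: str,
    workflow_id: str,
    task_queue: str,
    arg: Any,
    conflict_policy: Any,
) -> Any:
    """Start a run under workflow_id in a single call.

    With a terminate-existing conflict policy, the server closes any run
    that is still open under this ID and starts the new one, so no separate
    cancel request has to finish first.
    """
    return await client.start_workflow(
        workflow_type,
        arg,
        id=workflow_id,
        task_queue=task_queue,
        id_conflict_policy=conflict_policy,
    )


async def main() -> None:
    # SDK imports live here so the helper above can be exercised with a stub
    # client in unit tests.
    from temporalio.client import Client
    from temporalio.common import WorkflowIDConflictPolicy

    client = await Client.connect("localhost:7233")  # assumed address
    handle = await restart_workflow(
        client,
        "OrderWorkflow",      # hypothetical workflow type
        "order-123",          # hypothetical workflow ID
        "orders",             # hypothetical task queue
        {"version": 2},       # hypothetical input
        WorkflowIDConflictPolicy.TERMINATE_EXISTING,
    )
    print(handle.id)
```

Run it with `asyncio.run(main())`. One thing to keep in mind: this policy terminates the old run rather than cancelling it, so the old run gets no chance to execute cleanup logic. If cleanup matters, one of the in-workflow approaches is probably a better fit.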

Another option is for the workflow to call continue-as-new after it finishes handling the cancellation. That way there is no need to start it from outside.

That is a great idea. After looking into the issue, though, I think updating the workflow to accept a signal that triggers restartAsNew behavior may be better.
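
For anyone landing here later, this is roughly what the sending side of the signal approach could look like with the Temporal Python SDK. It is only a sketch: the signal name `restart_as_new`, the `max_in_flight` limit, and the shape of `new_input` are my own assumptions. It also assumes the workflow already has a matching signal handler that records the new input and lets the main workflow method call continue-as-new; that part is not shown.

```python
import asyncio
from typing import Any, Iterable, List


async def request_restarts(
    client: Any,
    workflow_ids: Iterable[str],
    new_input: Any,
    signal_name: str = "restart_as_new",  # hypothetical signal name
    max_in_flight: int = 50,              # hypothetical concurrency limit
) -> List[str]:
    """Signal each workflow to restart itself; return the IDs that failed."""
    semaphore = asyncio.Semaphore(max_in_flight)
    failed: List[str] = []

    async def signal_one(workflow_id: str) -> None:
        # Bound the number of concurrent signal calls to the server.
        async with semaphore:
            try:
                handle = client.get_workflow_handle(workflow_id)
                await handle.signal(signal_name, new_input)
            except Exception:
                # Broad catch for the sketch: for example, the workflow may
                # already be closed. Collect the ID so the caller can retry.
                failed.append(workflow_id)

    await asyncio.gather(*(signal_one(wid) for wid in workflow_ids))
    return failed
```

Because the workflow restarts itself, the queue consumer only has one call per execution and there is no cancel/start ordering to get wrong. The returned list can be pushed back onto the queue for a retry.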
