Problems cancelling a running activity from parent workflow

I have a workflow that, based on an incoming signal, needs to stop any currently running activity (i.e. on a cancel event, we need to stop whatever we are doing and go to cleanup).

First, I create an activityCtx with a cancel function:

activityCtx, cancelActivity := workflow.WithCancel(ctx)

Then I launch my activity using that activityCtx.

And when I get that signal, I invoke cancelActivity().

I can see the activity is cancelled in the workflow, as I get a CanceledError when calling Get on the future returned by ExecuteActivity:

activityErr = workflow.ExecuteActivity(activityCtx, SomeActivity, payload).Get(activityCtx, nil)
  • this returns CanceledError

Questions:

  • In SomeActivity, I am doing a select on ctx.Done(). But the signal never comes, and hence my activity keeps running. I am sending a heartbeat inside my activity: activity.RecordHeartbeat(ctx, 100), where ctx is the context passed into the activity method (as its first argument). Is my expectation correct? (i.e. should I get ctx.Done() inside the activity when the cancel function is called?)
  • In the Go SDK docs, there is this statement: "Cancellation is only delivered to Activities that call RecordActivityHeartbeat". Is it really that specific function, or is activity.RecordHeartbeat OK?

Also attached are the options I use to start the activity:

options := workflow.ActivityOptions{
	StartToCloseTimeout: time.Hour,
	HeartbeatTimeout:    time.Minute,
	// Optionally provide a customized RetryPolicy.
	// Temporal retries failures by default, this is just an example.
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 1.6,
		MaximumInterval:    5 * time.Minute,
		MaximumAttempts:    1,
	},
}

Does the activity call heartbeat periodically? Calling it once is not enough.

Yes, we have a ticker which calls heartbeat periodically. We have tried various intervals: 1 sec, 5 sec, 0.5 sec…

I can see that even after cancelFunc() is called in the parent workflow, the activity keeps going with those heartbeats (I am logging the attempts).

To give a bit more context, I launch a goroutine inside the activity to wait for a successful REST call (I believe you can launch a goroutine inside an activity).

Then I have a ticker channel and a select that either receives from ctx.Done(), receives the result the goroutine sends back on a channel, or just logs a heartbeat.

I have tried sending a heartbeat once, or not at all, before launching the goroutine. It does not make a difference.

Do you see the activity cancellation in the workflow history? It is of ActivityTaskCancelRequested event type.

Oh, I think the cancel event triggered the workflow to be closed too soon?

I see my workflow is completed, but with a pending activity:

activityId: 39
activityType.name: SomeActivity
state: CancelRequested
heartbeatDetails: [100]
lastHeartbeatTime: May 18, 2021 5:09 PM
lastStartedTime: 2021-05-19T00:09:50.000Z

I do see ActivityTaskCancelRequested

I suppose I need to wait until all activities are closed before shutting down the workflow? How do you synchronize? Or do we need to?

The activity will get its context cancelled on its next heartbeat if the workflow is closed. So there is no need to wait for it to cancel unless you need to execute some logic after that.

  1. Is there any way to make the activity stop right away? How long does it take after ActivityTaskCancelRequested shows up before the activity is shut down?
  2. I am seeing that even after the workflow is closed, the heartbeat logging is still happening (meaning the activity is still running).
  3. There might be another activity which needs to clean up after the cancel, so we need to make sure the current activity is indeed done first.

I do notice if I adjust the heartbeat timeout to be shorter when starting an activity, it will close sooner.

I am still confused: given that I see ActivityTaskCancelRequested, should I expect ctx.Done() in my activity to be triggered?

To avoid excessive calls to the service, heartbeats are throttled: a heartbeat may not be sent to the service until up to 80% of the heartbeat timeout has elapsed. So if you want to speed up the notification of the activity about the cancellation or workflow completion, reduce the heartbeat timeout from 1 minute to some lower value.

  1. There might be another activity which needs to cleanup after cancel, so we need to make sure current activity is indeed done first

Set ActivityOptions.WaitForCancellation to true to wait (by blocking on the activity Future) for an activity cancellation.
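As a minimal sketch, the options posted earlier in the thread with that flag set might look like the fragment below. The 20-second heartbeat timeout is only an illustrative value (an assumption, not a recommendation), and the retry policy is omitted for brevity.

```go
options := workflow.ActivityOptions{
	StartToCloseTimeout: time.Hour,
	// Illustrative value (an assumption): a shorter heartbeat timeout means
	// the activity hears about the cancellation sooner.
	HeartbeatTimeout: 20 * time.Second,
	// Make Get on the activity Future block until the activity has actually
	// finished handling the cancellation, rather than returning right away.
	WaitForCancellation: true,
}
```

With this set, any cleanup activity scheduled after the Future's Get returns should only start once the cancelled activity has really stopped.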

Will try that for sure.

Going back to my previous question: I should get the ctx.Done() signal inside my activity, correct?

Indeed, when I set WaitForCancellation to true, the function never returns… but again, if ctx.Done() never fires in my activity, I cannot exit.

I think I understand now how it works… I just want to confirm my understanding.

I have a heartbeat timeout of 20 seconds, so 80% is 16 seconds.
I heartbeat every 2 seconds.

Cancel is called… so at around the 16-second mark I will get ctx.Done(), which seems to be what I am observing… is that correct?

Yes, it is correct. We have plans to separate cancellation from the heartbeat and deliver it immediately. But for now, this is the way it works.

Got it thanks… Thanks so much for your help

Thanks @stuartccng for asking this question! @maxim can you add this to the documentation (Workflows in Go | Temporal documentation)? This answer makes a lot of sense but if you aren’t aware of it you can spend a lot of time running in circles trying to figure things out (like I did)!

@tdk where exactly would you like to see this added? Some guidance (or a PR with suggested language) would be very helpful.

@swyx The issue I and others have run into is the expectation of when an activity or workflow will be canceled. The assumption was that after sending the cancelation command, the workflow and associated activities would receive the cancelation in the context quickly as long as we were sending heartbeats.

It turns out that this can be delayed by up to 80% of the heartbeat timeout period, so it was very challenging to debug why our activities were not being canceled. If you happen to set the heartbeat timeout to minutes, it could seem like the cancelation was never going to occur.

The documentation could be something as simple as a note on the page/anchor I provided above, saying something like:

“ctx.Done() will be signaled when a heartbeat is sent to the service. The library throttles this, so a heartbeat may not be sent to the service until 80% of the heartbeat timeout has elapsed. For example, if your heartbeat timeout is 20 seconds, ctx.Done() will not be signaled until 80% of 20 seconds (or around 16 seconds) has elapsed. To increase or decrease the delay of cancelation, modify the heartbeat timeout defined for the activity context.”

Does that make sense?


Hey @tdk! That does, and thank you so much for obliging with some clarification! (It was hard to follow along with this thread as a newcomer to this issue; I personally didn't even know how this worked.) I've spun up a PR that reflects your lessons: add activity cancel docs by sw-yx · Pull Request #563 · temporalio/documentation · GitHub

feel free to add any comments if you wish but otherwise thank you for taking the time to explain! I’m sure someone reading this in future is going to appreciate you greatly.