Invalid history builder state for action: add-activitytask-cancel-requested-event

We have a parent workflow that starts a few child workflows in parallel, each of which runs a number of activities in parallel or sequentially.
In one use case the parent cancels all the child workflows. This works most of the time, but occasionally one of the child workflows gets stuck while cancelling, with this error:
"BadRequestCancelActivityAttributes: invalid history builder state for action: add-activitytask-cancel-requested-event"
The child workflow then stays stuck until we terminate it manually. This only happens intermittently. Any hints on what we should check on our side?

More details:
Golang SDK version 1.11.1
Code to start child workflows:

childCtx, childCancel := workflow.WithCancel(ctx)
...
cwo := workflow.ChildWorkflowOptions{
    WorkflowID:          workflowID,
    WorkflowTaskTimeout: time.Minute,
    WaitForCancellation: true,
    TaskQueue:           taskQueue,
}
cctx = workflow.WithChildOptions(childCtx, cwo)
childFuture := workflow.ExecuteChildWorkflow(cctx, DSLWorkflow, dslWflw)
// Wait until childworkflow is started
errChWflwExec := childFuture.GetChildWorkflowExecution().Get(cctx, nil)
...

Inside the child workflows, activities are started following the DSL pattern in https://github.com/temporalio/samples-go/blob/main/dsl/workflow.go:

ao := workflow.ActivityOptions{
    HeartbeatTimeout:    models.HeartbeatTimeout,
    RetryPolicy:         retryPolicy,
    WaitForCancellation: true,
    TaskQueue:           taskQueue, // same as parent workflow's
}
ctx = workflow.WithActivityOptions(ctx, ao)
...
childCtx, cancelHandler := workflow.WithCancel(ctx)
selector := workflow.NewSelector(ctx)
var activityErr error
result := int64(0)

for _, s := range p.Branches {
    f := executeAsync(s, childCtx, bindings)
    selector.AddFuture(f, func(f workflow.Future) {
        var res int64
        err := f.Get(ctx, &res)
        if err != nil {
            // cancel all pending activities
            cancelHandler()
            if !temporal.IsCanceledError(err) || activityErr == nil {
                activityErr = err
            }
        } else {
            result += res
        }
    })
}

for i := 0; i < len(p.Branches); i++ {
    selector.Select(ctx) // this will wait for one branch
}


@Chad_Retz @Spencer_Judge
This looks like part of "Canceling workflow can cause infinite replay attempts" (Issue #481, temporalio/sdk-go)
and was not part of the fix in "State machine issue after activity cancellation" (Pull Request #625 by cretz, temporalio/sdk-go).

wdyt?

This type of error usually results from non-deterministic code in the workflow or from a bug in the SDK. We have just opened a PR, "Fix invalid command ID expectation on child workflow cancel" (Pull Request #647 by cretz, temporalio/sdk-go), that is related to child workflow cancellation, so it is possible that it fixes this.

I will attempt to replicate using the sample in samples-go. Are there any map iterations in your workflow code that could result in non-determinism? Or anything else non-deterministic? There is one in the sample, but it only builds activity arguments, so it should be safe.

Can you reliably replicate it? If not, I’m afraid we might need to see the code and the history of a failed execution to try to replicate.
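For readers checking their own workflow code for the map-iteration concern raised above: Go randomizes map iteration order, so any workflow logic whose commands depend on that order can differ between the original run and a replay. A minimal sketch of one common workaround, iterating over sorted keys, is below. The helper name sortedKeys and the example arguments are hypothetical and are not taken from the poster's code or from the sample.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order so that callers can
// walk the map in the same order on every execution and every replay.
// (Hypothetical helper, shown for illustration only.)
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical activity arguments keyed by name.
	args := map[string]string{"region": "eu", "account": "42", "mode": "dry-run"}

	// Ranging over the sorted keys instead of the map itself keeps the
	// order stable across runs.
	for _, k := range sortedKeys(args) {
		fmt.Printf("%s=%s\n", k, args[k])
	}
}
```

The same idea applies wherever map order decides which activity or child workflow is scheduled first.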

Unfortunately, I still haven’t found a way to reproduce it reliably. I do have an export of a failing run, though, that I can try to sanitize and share.

Are there any map iterations in your workflow code that could result in non-determinism? Or anything else non-deterministic?

I can’t find anything, but I can share a somewhat sanitized export and our DSL workflow code. What would be the best way to share these with you?

One way is to take that history JSON, create a replayer with worker.NewWorkflowReplayer, register your workflow, and use one of the replayer's methods to rerun the history. Assuming there is no non-determinism, this should reliably cause the error. If you can reliably cause it, you can see whether the patch in "Fix invalid command ID expectation on child workflow cancel" (Pull Request #647 by cretz, temporalio/sdk-go) fixes it.

You can find me as @Chad Retz on the community Slack (probably the best option), DM me here, or email chad@temporal.io.

A PR has been opened to address this issue: "Remove pending activity cancellations when activity completion occurs" (Pull Request #726 by cretz, temporalio/sdk-go). Thanks for connecting with me off-forum to help me build a reproducer!