We have a parent workflow that starts a few child workflows in parallel, each of which runs a number of activities in parallel or sequentially.
We have a use case in which the parent cancels all the child workflows. This works most of the time, but occasionally one of the child workflows gets stuck trying to cancel, with this error: "BadRequestCancelActivityAttributes: invalid history builder state for action: add-activitytask-cancel-requested-event"
The child workflow then stays stuck until we terminate it manually. This happens only intermittently. Any hints on what we should check on our side?
More details:
Golang SDK version 1.11.1
Code to start child workflows:
ao := workflow.ActivityOptions{
	HeartbeatTimeout:    models.HeartbeatTimeout,
	RetryPolicy:         retryPolicy,
	WaitForCancellation: true,
	TaskQueue:           taskQueue, // same as parent workflow's
}
ctx = workflow.WithActivityOptions(ctx, ao)
...
childCtx, cancelHandler := workflow.WithCancel(ctx)
selector := workflow.NewSelector(ctx)
var activityErr error
result := int64(0)
for _, s := range p.Branches {
	f := executeAsync(s, childCtx, bindings)
	selector.AddFuture(f, func(f workflow.Future) {
		var res int64
		err := f.Get(ctx, &res)
		if err != nil {
			// cancel all pending activities
			cancelHandler()
			// record the error; a cancellation error never overwrites one already recorded
			if !temporal.IsCanceledError(err) || activityErr == nil {
				activityErr = err
			}
		} else {
			result += res
		}
	})
}
for i := 0; i < len(p.Branches); i++ {
	selector.Select(ctx) // this will wait for one branch
}
I will attempt to replicate this using the sample in samples-go. Are there any map iterations in your workflow code that could result in non-determinism, or anything else non-deterministic? There is one in the sample, but it only builds activity arguments, so it should be safe.
Can you reliably reproduce it? If not, I’m afraid we might need to see the code and the history of a failed execution to try to replicate.