Error in child workflow when parent is cancelled

Error:

"BadRequestCancelActivityAttributes: invalid history builder state for action: add-activitytask-cancel-requested-event"

SDK version used : go.temporal.io/sdk v1.7.0

Usecase:
ParentWorkFlow will sleep for a long duration waiting for the scheduled time.
After it wakes up, multiple childWorkflows are run in parallel via workflow.Go. The parentWorkflow waits for the children to complete.
The childWorkFlow executes some activities and waits for an external signal for its completion, and then notifies the parent.

Issue:
While the parent is waiting in sleep for the scheduled time, if the parent workflow is cancelled via an external signal, upon receiving the cancel signal, the parent wakes up from sleep and starts the child workflows.
One childworkflow completes properly by receiving the cancel message, but another gets the above mentioned error.

NotWorking ChildFlow’s history:

Working ChildFlow’s history:

Hey there,

Thanks for the detailed report. If you happen to be able to share the workflow code that produced this, I’d love to get a peek at that too. I’ve opened a bug here to track fixing this: Propagation of cancellations to child workflows can cause state machine panic · Issue #483 · temporalio/sdk-go · GitHub

Please feel free to add any additional details there. Thanks!

Please find below the test code used :

func ParentWorkflow(ctx workflow.Context, sleepDuration time.Duration) error {
	terminalResultChannel := workflow.NewChannel(ctx)
	
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	logger := workflow.GetLogger(ctx)

	workflowID := workflow.GetInfo(ctx).WorkflowExecution.ID
	var msg DummyRequest
	msg.Id = "1234"

	var a *Activities // Used to call Activities by function pointer
	defer func() {
		// When the Workflow is canceled, it has to get a new disconnected context to execute any Activities
		newCtx, _ := workflow.NewDisconnectedContext(ctx)
		err := workflow.ExecuteActivity(newCtx, a.CleanupActivity, " *** parent *** ").Get(ctx, nil)
		if err != nil {
			logger.Error("CleanupActivity failed", "Error", err)
		}
	}()

	// Sleep till the scheduled time arrives
	logger.Info("sleepDuration = ", sleepDuration)
	if sleepDuration > 0 {
		logger.Info("Sleep for ", sleepDuration, " seconds")
		_ = workflow.Sleep(ctx, sleepDuration)
		logger.Info("Woke up from sleep after ", sleepDuration, " seconds")
	}

	devPrefix := "D0001"
	numOfRuns := 2
	for i := 0; i < numOfRuns; i++ {
		device := devPrefix + strconv.Itoa(i)
		workflow.Go(ctx, func(ctx workflow.Context) {
			var result DummyResponse
			cwo := workflow.ChildWorkflowOptions{
				WorkflowID: workflowID + "_" + device,
				//ParentClosePolicy: enums.PARENT_CLOSE_POLICY_REQUEST_CANCEL,
			}
			ctx = workflow.WithChildOptions(ctx, cwo)
			err := workflow.ExecuteChildWorkflow(ctx, ChildWorkflow, device, workflowID, msg).Get(ctx, &result)
			logger.Info("Parent received response from child . ", result)
			if err != nil {
				logger.Info("Failed for ", device, ", error = ", zap.Error(err))
			}
			terminalResultChannel.Send(ctx, result)
		})
	}

	var results []DummyResponse
	for i := 0; i < numOfDevices; i++ {
		var result DummyResponse
		terminalResultChannel.Receive(ctx, &result)
		results = append(results, result)
	}
	log.Println("Parent Workflow Execution complete. ", results)
	return nil
}


func ChildWorkflow(ctx workflow.Context, device string, workflowid string, msg DummyRequest) (DummyResponse, error) {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		//HeartbeatTimeout:    5 * time.Second,
		WaitForCancellation: true,
		//ParentClosePolicy: enums.PARENT_CLOSE_POLICY_REQUEST_CANCEL,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	logger := workflow.GetLogger(ctx)

	var a *Activities // Used to call Activities by function pointer
	defer func() {
		// When the Workflow is canceled, it has to get a new disconnected context to execute any Activities
		newCtx, _ := workflow.NewDisconnectedContext(ctx)
		err := workflow.ExecuteActivity(newCtx, a.CleanupActivity, " *** child *** ").Get(ctx, nil)
		if err != nil {
			logger.Error("CleanupActivity failed", "Error", err)
		}
	}()

	var result string
	err := workflow.ExecuteActivity(ctx, a.ActivityToBeCanceled, device).Get(ctx, &result)
	logger.Info(fmt.Sprintf("ActivityToBeCanceled returns %v, %v", result, err))

	var event DummyResponse
	signalName := SIGNAL_PREFIX + device
	statusCh := workflow.GetSignalChannel(ctx, signalName)

	selector := workflow.NewSelector(ctx)
	selector.AddReceive(statusCh, func(c workflow.ReceiveChannel, more bool) {
		c.Receive(ctx, &event)
		logger.Info("ChildWorkflow Received signal  ", signalName, " with event = ", event)
	})

	selector.AddReceive(ctx.Done(), func(c workflow.ReceiveChannel, more bool) {
		logger.Info("!!!!!!!!!!! ChildWorkflow got cancelled !!!!!!!!!!!")
		event.Result = "Cancelled"
	})

	selector.Select(ctx)
	log.Println("Child Workflow Execution complete.", event)

	return event, nil
}

type Activities struct{}

func (a *Activities) ActivityToBeCanceled(ctx context.Context, device string) (string, error) {
	logger := activity.GetLogger(ctx)
	return "sent", nil
}

func (a *Activities) CleanupActivity(ctx context.Context, from string) error {
	logger := activity.GetLogger(ctx)
	logger.Info("Cleanup Activity started for " + from)
	return nil
}

@sunils

So I started digging into this and I’m a bit confused. The child workflows aren’t started until after the parent is cancelled, and you’re expecting them to be cancelled as well? The real bug here may be that we even allow that, I think ExecuteChildWorkflow should just immediately return an error in this case, but I’m confirming that and will fix that if it’s indeed the behavior we expect.

I’d suggest that you take a different approach here. I’m not sure exactly what your business logic is, but rather than spawning new child workflows when the parent is already cancelled, it would likely make more sense to run whatever cleanup activities need to be from the parent workflow itself.

@Spencer_Judge

That is excatly what even I expected, since a cancel request was recieved by the parent workflow, I thought the child worflows wouldnt start.
But looking at the history, the child workflows are started and then they receive the cancel request (which was having the cancellation issue).

My usecase is : At a specific time I need to start a bunch of child worflows in parallel which inturn run a few activities.
As you mentioned, i would want to just run a cleanup of the parent if the child workflows hadnt started yet.

That different apporach that you mentioned is excatly what I need. Could you please provide some code snippet on how to do that?
I have a cleanup handler in the parent workflow, that does get called.
But how do i make sure that the child worflows dont get executed when the parent is already cancelled?

1 Like

@sunils How about this, right after the sleep?

if err := ctx.Err(); err != nil {
  logger.Info("context cancelled/timed out", err)
  return err
}

That should provide a workaround until the API is fixed to fail ExecuteChildWorkflow.

This has been fixed with Do not allow starting child workflows from an already-cancelled context by Sushisource · Pull Request #485 · temporalio/sdk-go · GitHub

Thanks everyone!