Error "Failed finding child for sequence ..."

I’m getting this error in a parent workflow:

{
    "message": "Failed finding child for sequence 34",
    "stackTrace": "   at Temporalio.Worker.WorkflowInstance.ApplyResolveChildWorkflowExecution(ResolveChildWorkflowExecution resolve)\n   at Temporalio.Worker.WorkflowInstance.Activate(WorkflowActivation act)",
    "applicationFailureInfo":
    {
        "type": "InvalidOperationException"
    }
}

But I can’t seem to find any information about this.
The workflow is something like this:

[WorkflowRun]
public async Task<PostCallBatchWorkflowOutput> RunAsync(PostCallBatchWorkflowInput input)
{
    
    // execute activity

    // execute child workflow

    PostCallRequest[] requests; //usually 100
    {
        var handles = await Workflow.WhenAllAsync<ChildWorkflowHandle<PostCallRequestWorkflow>>(requests.Select(r =>
        {
            return Workflow.StartChildWorkflowAsync<PostCallRequestWorkflow>(
                wf => wf.RunAsync(new PostCallRequestWorkflowInput(...)),
                new ChildWorkflowOptions
                {
                    Id = $"wf_post_call_request_{WorkflowWrapper.GenRandomUUID()}",
                    ParentClosePolicy = ParentClosePolicy.Abandon,
                    ExecutionTimeout = TimeSpan.FromMinutes(5),
                    RetryPolicy = new Temporalio.Common.RetryPolicy
                    {
                        InitialInterval = TimeSpan.FromSeconds(2),
                        BackoffCoefficient = 2.0f,
                        MaximumInterval = TimeSpan.FromSeconds(8),
                        MaximumAttempts = 4,
                    },
                });
        }));

        // Wait for every child to complete. The original called task.ConfigureAwait(true)
        // and discarded the result, which has no effect, so the call is dropped here.
        await Workflow.WhenAllAsync(handles.Select(handle => handle.GetResultAsync()));
    }

    await Workflow.WaitConditionAsync(() => Workflow.AllHandlersFinished);

    return new PostCallBatchWorkflowOutput(.....);
}

I’m waiting for everything to finish so that the parent stays open until all the child workflows have finished.
Am I doing something wrong?

Thank you.

In the rare case where a child workflow is initiated, started, and completed all within the same workflow task, the .NET SDK had a bug, which we recently fixed in pull request #492 of temporalio/sdk-dotnet ("Fix issue where child workflow starts and completes in same activation"). The fix will be part of the next release (ideally soon, but there is no exact timeline).

Ok, it makes sense: we’re using 1.7.0.
Can you confirm that the issue doesn’t exist on 1.6.0?

If it is related to the fix (that is, the child workflow was initiated, started, and completed in the same workflow task), I cannot confirm that. In fact, from my understanding of the underlying algorithm, this has been an issue since the .NET SDK’s inception, though we would have to test on older versions to be sure. For most users it is rare for a child workflow to complete so quickly, and/or for the parent workflow’s task processing to be slow enough, that the child’s entire lifecycle falls within one workflow task, which may be why you were not seeing it before.

Do you need more information that I could provide to confirm that?

Also, we reverted back to 1.4.0 and the issue seems fixed on that version, so maybe something happened after 1.4.0?

It is technically possible though I see nothing obvious. Some questions to help us figure this out:

  1. Is this a regular occurrence? Is it possible to even reliably replicate? If so, is it possible to reduce the replication down to something simple enough where we could replicate on our side? Understood if replication may involve a loop or racy situations.
  2. Does the exact same history fail in the latest version but pass in 1.4.0? To test this, take the history of a workflow that succeeded on 1.4.0 and that you believe would fail on 1.7.0. Run it through the WorkflowReplayer on 1.4.0 and confirm it does not fail, then run it on 1.7.0 and confirm it does fail. If the exact same history fails on 1.7.0 but passes on 1.4.0, then something definitely changed. Note that you can’t do the inverse (e.g. grab a history from 1.7.0 and replay it on 1.4.0), as history is mostly only forward compatible.
  3. Can you provide a history dump of the failing workflow history? We just want to confirm whether it is the same start-and-complete-same-task issue. This can be provided via a ticket if you are a cloud user, or if not, can be provided via DM to me, cretz, in our public Slack or via email attachment to me w/ my address being first name at temporal.io
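
For the replay check in item 2, a minimal sketch might look like the following. It assumes the parent workflow class is named PostCallBatchWorkflow (a hypothetical name; the thread only shows its RunAsync method) and that the history has been exported as JSON to a file called history.json (also hypothetical). It has to be compiled in the project that defines the workflow, once against 1.4.0 and once against 1.7.0, so the two outcomes can be compared.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Temporalio.Common;
using Temporalio.Worker;

public static class ReplayCheck
{
    public static async Task Main(string[] args)
    {
        // Hypothetical path to a history exported as JSON from a run on 1.4.0.
        var historyPath = args.Length > 0 ? args[0] : "history.json";
        var historyJson = await File.ReadAllTextAsync(historyPath);

        // PostCallBatchWorkflow is an assumed name for the parent workflow class.
        var replayer = new WorkflowReplayer(
            new WorkflowReplayerOptions().AddWorkflow<PostCallBatchWorkflow>());

        try
        {
            // The workflow ID is only a placeholder label for the loaded history.
            await replayer.ReplayWorkflowAsync(
                WorkflowHistory.FromJson("replayed-workflow-id", historyJson));
            Console.WriteLine("Replay succeeded on this SDK version.");
        }
        catch (Exception e)
        {
            // To my understanding, the replayer throws by default when replay fails.
            Console.WriteLine($"Replay failed on this SDK version: {e}");
        }
    }
}
```

The history JSON can usually be downloaded from the Web UI or, assuming the Temporal CLI is available, exported with something like `temporal workflow show --workflow-id <id> --output json`. If the same file replays cleanly on 1.4.0 and fails on 1.7.0, that points to a behavior change between the two versions.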

OK, so we’ve confirmed that this is indeed the issue you mentioned.
We’ll wait for the next release.

Thanks,
