Parent workflow unable to complete after the child workflow completed

Hi!

I’m facing an issue where the parent workflow never completes
after the child workflow is completed.

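The child workflow completed successfully; we can confirm this from the artifacts it produced and from the logs.

Describing the parent workflow, however, still shows it as Running, with the child listed under pendingChildren: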

{
  "executionConfig": {
    "taskQueue": {
      "name": "parent-cycle",
      "kind": "Normal"
    },
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s"
  },
  "workflowExecutionInfo": {
    "execution": {
      "workflowId": "parent-cycle-1",
      "runId": "db80c230-3bdf-11ee-90d7-00155d1836bf"
    },
    "type": {
      "name": "ParentWorkflowV1"
    },
    "startTime": "2023-08-07T00:17:29.270762753Z",
    "status": "Running",
    "historyLength": "10",
    "executionTime": "2023-08-07T00:17:29.270762753Z",
    "memo": {

    },
    "searchAttributes": {
      "indexedFields": {
        "BuildIds": "[\"unversioned\"]"
      }
    },
    "autoResetPoints": {

    },
    "stateTransitionCount": "6",
    "historySizeBytes": "5277",
    "mostRecentWorkerVersionStamp": {

    }
  },
  "pendingChildren": [
    {
      "workflowId": "child-cycle-1",
      "runId": "fcf59eea-3bdf-11ee-8c3b-00155d1836bf",
      "workflowTypeName": "ChildWorkflowV1",
      "initiatedId": "6",
      "parentClosePolicy": "Abandon"
    }
  ]
}

The child workflow's history and execution no longer seem to exist:

$ TEMPORAL_ADDRESS=localhost:7777 temporal workflow describe  --namespace customer1 --workflow-id child-cycle-1
Error: workflow describe failed: sql: no rows in result set
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

The namespace configuration:

$ TEMPORAL_ADDRESS=localhost:7777 temporal operator namespace describe customer1
  NamespaceInfo.Name                    customer1
  NamespaceInfo.Id                      ac179d56-3be0-11ee-a34e-00155d1836bf
  NamespaceInfo.Description
  NamespaceInfo.OwnerEmail
  NamespaceInfo.State                   Registered
  Config.WorkflowExecutionRetentionTtl  24h0m0s   
  ReplicationConfig.ActiveClusterName   active
  ReplicationConfig.Clusters            [&ClusterReplicationConfig{ClusterName:active,}]
  Config.HistoryArchivalState           Disabled
  Config.VisibilityArchivalState        Disabled
  IsGlobalNamespace                     false
  FailoverVersion                       0
  FailoverHistory                       []

I’m not sure whether this is relevant, but at some point the cluster was under-provisioned and the queues (task_queue, timer_queue) became quite large, around 20 million rows.
My assumption is that a race condition happened between the history cleanup of the child workflow and the signal to the parent workflow.

Setup:
version: 1.21.4
database: postgres 13 (aurora)
os: EKS

What could cause this behavior?

Thanks

Hi, sorry for the late response. Are you still running into this issue? My guess is that the child workflow was removed after completion by the namespace retention policy, which is set to 24h according to the info you shared.
I’m not yet sure why that would prevent your parent workflow from completing. Can you share the event history of this parent execution?
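
In case it helps, here is a minimal sketch of how the parent's event history could be exported, reusing the address, namespace, and workflow ID from the commands in your post. The `dump_parent_history` wrapper is only a hypothetical name for illustration, and `--output json` is assumed to be available in your CLI version; drop that flag if it is not.

```shell
# Hypothetical helper: prints the event history of the parent workflow.
# Address, namespace, and workflow ID are copied from the original post;
# --output json is an assumption about the installed CLI version.
dump_parent_history() {
  TEMPORAL_ADDRESS=localhost:7777 temporal workflow show \
    --namespace customer1 \
    --workflow-id parent-cycle-1 \
    --output json
}
```

Running `dump_parent_history > parent-history.json` should then produce a file you can attach here.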