Hi!
I’m facing an issue where the parent workflow never completes
after the child workflow is completed.
The child workflow is completed successfully and we can assert the claim
by looking at the artifacts produced and the logs.
{
"executionConfig": {
"taskQueue": {
"name": "parent-cycle",
"kind": "Normal"
},
"workflowExecutionTimeout": "0s",
"workflowRunTimeout": "0s",
"defaultWorkflowTaskTimeout": "10s"
},
"workflowExecutionInfo": {
"execution": {
"workflowId": "parent-cycle-1",
"runId": "db80c230-3bdf-11ee-90d7-00155d1836bf"
},
"type": {
"name": "ParentWorkflowV1"
},
"startTime": "2023-08-07T00:17:29.270762753Z",
"status": "Running",
"historyLength": "10",
"executionTime": "2023-08-07T00:17:29.270762753Z",
"memo": {
},
"searchAttributes": {
"indexedFields": {
"BuildIds": "[\"unversioned\"]"
}
},
"autoResetPoints": {
},
"stateTransitionCount": "6",
"historySizeBytes": "5277",
"mostRecentWorkerVersionStamp": {
}
},
"pendingChildren": [
{
"workflowId": "child-cycle-1",
"runId": "fcf59eea-3bdf-11ee-8c3b-00155d1836bf",
"workflowTypeName": "ChildWorkflowV1",
"initiatedId": "6",
"parentClosePolicy": "Abandon"
}
]
}
The child workflow history and execution doesn’t seem to exist anymore.
$ TEMPORAL_ADDRESS=localhost:7777 temporal workflow describe --namespace customer1 --workflow-id child-cycle-1
Error: workflow describe failed: sql: no rows in result set
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
The namespace configuration.
$ TEMPORAL_ADDRESS=localhost:7777 temporal operator namespace describe customer1
NamespaceInfo.Name customer1
NamespaceInfo.Id ac179d56-3be0-11ee-a34e-00155d1836bf
NamespaceInfo.Description
NamespaceInfo.OwnerEmail
NamespaceInfo.State Registered
Config.WorkflowExecutionRetentionTtl 24h0m0s
ReplicationConfig.ActiveClusterName active
ReplicationConfig.Clusters [&ClusterReplicationConfig{ClusterName:active,}]
Config.HistoryArchivalState Disabled
Config.VisibilityArchivalState Disabled
IsGlobalNamespace false
FailoverVersion 0
FailoverHistory []
I’m not if this information can help but at some point the cluster was under provisioned and the queues (task_queue, timer_queue) become quiet large. ~20millions rows.
My assumption is that a race condition happened between the history cleanup of the child workflow and the signal to the parent workflow.
Setup:
version: 1.21.4
database: postgres 13 (aurora)
os: EKS
What can cause such behavior ?
Thanks