Workflow gets removed from Temporal (can't find it by ID after a while). Need help investigating/troubleshooting

Hello there, I have a problem with workflows being removed from Temporal history (basically, I’m no longer able to find them in the Temporal admin panel after a while).

I have a workflow that starts a child workflow like this:

  private fun sendAnnulmentReports(order: Order, participant: Participant): Promise<WorkflowExecution> {
    val annulmentReason = AnnulmentReason.fromCloseReason(closeParams.reason)
    val reports = reportActivity.createAnnulmentReports(order.id, order.extension, annulmentReason, order.products.mapNotNull { product -> product.gtin }.toSet(), participant)
    val options = ChildWorkflowOptions.newBuilder()
      .setWorkflowId("${order.id}:${Workflow.randomUUID()}")
      .setTaskQueue(TemporalConfiguration.REPORTS_WF_QUEUE)
      .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
      .validateAndBuildWithDefaults()
    val annulmentReportsWorkflow = Workflow.newChildWorkflowStub(AnnulmentReportsWorkflow::class.java, options)
    Async.procedure(
      annulmentReportsWorkflow::sendReports, reports, participant,
      AnnulmentReportWorkflowProps(workflowProps.orderWorkflowProps.timeToDeleteAnnulmentReports, workflowProps.orderWorkflowProps.shouldCheckReceipt)
    )
    return Workflow.getWorkflowExecution(annulmentReportsWorkflow)
  }
...
annulmentReportsExecution?.get()

This child workflow is long-running and takes 30 days to complete; meanwhile, the parent workflow proceeds and can finish while the child workflow is sleeping. I expect that if I set ParentClosePolicy.ABANDON, the child workflow isn’t affected by the parent's completion/termination and proceeds as if it were standalone. Am I wrong?


The first part of the child workflow ID (before the colon) is the ID of the parent workflow. When I search the Temporal admin panel for the parent workflow, I can’t find anything (not even a completed or terminated workflow, literally nothing).

This makes me wonder whether workflows get removed from Temporal a certain amount of time after completion. What might be going on here?

When the child workflow wakes up, it starts to fail with the following exception:

2022-10-11 11:46:03.822  WARN 1 --- [="default": 651] i.t.i.w.WorkflowWorker                   : Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=1f0e94d5-be8b-462c-b011-e45d290d4daa:58666d14-52c6-31dc-afb4-78145efcd55c, RunId=49e7e7f1-f6d1-4fbc-8ffe-407403183a37, startedEventId=20
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: invalid command sequence: [CompleteWorkflowExecution, RecordMarker], command CompleteWorkflowExecution must be the last command.
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.48.1.jar:1.48.1]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.48.1.jar:1.48.1]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.48.1.jar:1.48.1]
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondWorkflowTaskCompleted(WorkflowServiceGrpc.java:3262) ~[temporal-serviceclient-1.16.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:334) ~[temporal-sdk-1.16.0.jar:?]
	at io.temporal.internal.retryer.GrpcRetryer.lambda$retry$0(GrpcRetryer.java:53) ~[temporal-serviceclient-1.16.0.jar:?]
	at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:64) ~[temporal-serviceclient-1.16.0.jar:?]
	at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:61) ~[temporal-serviceclient-1.16.0.jar:?]
	at io.temporal.internal.retryer.GrpcRetryer.retry(GrpcRetryer.java:51) ~[temporal-serviceclient-1.16.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:328) ~[temporal-sdk-1.16.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:241) ~[temporal-sdk-1.16.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:180) ~[temporal-sdk-1.16.0.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93) ~[temporal-sdk-1.16.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]

What might be the reason?

Hi @Nicholas_Neutron

You are right. As per the documentation, if you set ParentClosePolicy.ABANDON, nothing is done to the child workflow when the parent closes: Temporal Java SDK developer's guide | Temporal Documentation

Yes, the history of a closed workflow is removed after the retention period; please see What is a Temporal Cluster? | Temporal Documentation. The default retention period is 2 days.

You can check the retention period of your namespace and update it.

Could it be that, because the parent workflow was removed per the retention period, my child workflow starts to fail with the exception above even though I specified ParentClosePolicy.ABANDON when starting the child workflow?

@Nicholas_Neutron I don’t think so.

I think it might be related to what is mentioned in this post: Exceptions from parallel activities causing Workflow restart infinitely - #11 by tihomir

Could you share the code of your child workflow?

Sure, here it is:

class AnnulmentReportsWorkflowImpl : AnnulmentReportsWorkflow {

  private val log = Workflow.getLogger(AnnulmentReportsWorkflowImpl::class.java)

  private lateinit var workflowProps: AnnulmentReportWorkflowProps

  private val docStorageActivity: DocStorageActivity = Workflow.newLocalActivityStub(
    DocStorageActivity::class.java,
    LocalActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofSeconds(30))
      .validateAndBuildWithDefaults()
  )

  private val txActivity: TxActivity = Workflow.newLocalActivityStub(
    TxActivity::class.java,
    LocalActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofSeconds(30))
      .validateAndBuildWithDefaults()
  )

  private val registrarReportActivity = Workflow.newLocalActivityStub(RegistrarReportActivity::class.java)

  private val reportActivity: ReportActivity = Workflow.newLocalActivityStub(
    ReportActivity::class.java,
    LocalActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofMinutes(30))
      .validateAndBuildWithDefaults()
  )

  override fun sendReports(reports: List<AnnulmentReport>, participant: Participant, props: AnnulmentReportWorkflowProps) {
    log.info("Annulment report workflow: {} {}", reports, props.shouldCheckReceipt)
    workflowProps = props
    val promises = reports.map { report -> Async.function { sendReport(report, participant) } }
    Promise.allOf(promises).get()
    Workflow.sleep(props.timeToDelete)
    val deleteRequests = reports.map { report ->
      Async.procedure { reportActivity.deleteReport(report.id) }
    }
    Promise.allOf(deleteRequests)
  }

  private fun sendReport(report: AnnulmentReport, participant: Participant) {
    val newReport = if (workflowProps.shouldCheckReceipt) {
      val saveAnnulmentReport = docStorageActivity.saveAnnulmentReport(report, participant)
      val txId = Workflow.randomUUID()
      val newReport = txActivity.createAnnulmentReportTransaction(txId, report, participant, Event(success = true, txEventType = TxEventType.ANNULMENT_CODES))
        .run { report.copy(txId = txId) }
        .also { reportActivity.setResultDocId(report.id, txId) }
        .also { newReport -> txActivity.appendAnnulmentReportEvent(newReport, participant, Event(success = true, txEventType = TxEventType.ANNULMENT_CODES), saveAnnulmentReport) }
      newReport
    } else {
      report
    }

    val (success, rejectionReason) = registrarReportActivity.sendAnnulmentReport(newReport, participant)

    val newStatus = if (success) ReportStatus.SENT else ReportStatus.REJECTED
    reportActivity.updateReportStatus(newReport.id, ReportStatus.PENDING, newStatus, rejectionReason, removeSntins = true)

    if (workflowProps.shouldCheckReceipt) {
      txActivity.appendAnnulmentReportEvent(
        newReport, participant,
        Event(
          success = success,
          errorMessage = rejectionReason,
          txEventType = TxEventType.ANNULMENT_CODES_EMISSION_SEND
        )
      )
    }
  }
}

Can you try calling get() to wait for your local activities to complete before the workflow method finishes?

Promise.allOf(deleteRequests).get()
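
For context, the error lists CompleteWorkflowExecution ahead of RecordMarker, which suggests the workflow method returned while the local activities started with Async.procedure were still in flight. Promise.allOf only combines the promises; it is the get() call that blocks. The same distinction can be seen outside Temporal. Here is a minimal sketch that uses java.util.concurrent.CompletableFuture as a stand-in for Temporal's Promise (an assumption for illustration, not the Temporal API); startDeletes and the gate latch are hypothetical names.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for reportActivity.deleteReport: each task waits
// until the gate opens, then counts one deletion.
fun startDeletes(
  ids: List<String>,
  gate: CountDownLatch,
  deleted: AtomicInteger
): List<CompletableFuture<Void>> =
  ids.map { _ ->
    CompletableFuture.runAsync {
      gate.await()
      deleted.incrementAndGet()
    }
  }

fun main() {
  val gate = CountDownLatch(1)
  val deleted = AtomicInteger(0)
  val requests = startDeletes(listOf("r1", "r2", "r3"), gate, deleted)

  // Combining the futures does not wait for them: the gate is still
  // closed, so no deletion can have happened yet.
  val all = CompletableFuture.allOf(*requests.toTypedArray())
  println("after allOf: done=${all.isDone}, deleted=${deleted.get()}")

  // Open the gate, then block until every task has finished.
  gate.countDown()
  all.join()
  println("after join: done=${all.isDone}, deleted=${deleted.get()}")
}
```

With the gate still closed, the first line reports done=false and deleted=0; after join(), the second reports done=true and deleted=3. In the workflow, get() plays the role that join() plays here.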

Yeah, I fixed it exactly that way just before I saw your answer :) It slipped right past me! Thanks a lot for the help!