ERROR Workflow panic - Error lookup failed for scheduledEventID to activityID

Hi, I am running a couple of workflows, using a lot of workflow.Sleep methods.

For some of those, I am getting: ERROR Workflow panic - Error lookup failed for scheduledEventID to activityID.

I have tried to identify the problem for a while but without luck. I do not know how to identify the place causing this error. Do you have any suggestions on how to find a root of a problem?

Full stack trace:

2022/08/17 08:25:27 ERROR Workflow panic Namespace temporal-dev TaskQueue dev-queue WorkerID 1551413@ip-172-31-13-130@ WorkflowType CollectV007S01 WorkflowID bf-CollectV007S01-9bc6954c-6409-4a17-9ade-7d6b49099726-1845d12c-c841-4538-90d9-2cdf3f8ab686 RunID 46aee63d-95e8-434f-974b-37ffe823a279 Attempt 4098 Error lookup failed for scheduledEventID to activityID: scheduleEventID: 45, activityID: 45 StackTrace process event for dev-queue [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_command_state_machine.go:480
go.temporal.io/sdk/internal.(*commandsHelper).handleActivityTaskScheduled(0xc000bcf770, {0xc001115218, 0x2}, 0x0)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_command_state_machine.go:1052 +0x10a
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc0010aaae0, 0xc001107340, 0xd8, 0x0)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_event_handlers.go:858 +0x28b
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0025a4c80, 0xc00292ad20)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_task_handlers.go:903 +0xd68
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc00016d8c0, 0xc00292ad20, 0xc00292b170)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_task_handlers.go:750 +0x493
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0004680d0, 0xc00292ad20)
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_task_pollers.go:290 +0x2cd
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0004680d0, {0xeee4e0, 0xc00292ad20})
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_task_pollers.go:261 +0x6c
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc00047c000, {0xeee0a0, 0xc000881bf0})
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_worker_base.go:418 +0x167
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
	/Users/konrad/go/pkg/mod/go.temporal.io/sdk@v1.16.0/internal/internal_worker_base.go:323 +0xba

Can you share the history of this workflow please?

For example with tctl:
tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json

Does this happen randomly or after some client trigger like cancellation?

Yes, you can find it here (sorry, it is too large to paste). In stack trace, it additionally says, “possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition”, but this happens without any change or even restart of worker.

I am running it on go version go1.17.5 linux/amd64

but this happens without any change or even restart of worker

Before the panic you have a timer (sleep) for 3 days it seems:
Timer Started: “eventTime”: “2022-07-30T10:53:10.979829652Z”,
Timer Fired: “eventTime”: “2022-08-02T10:53:11.027287627Z”,

Workers have in-memory workflow executions cache and when your workflow execution is blocked (on a sleep for example) workers can evict it from it’s cache to allow other ones to be processed (this is optimization technique workers in all sdks have).

I think this is what may be happening here, during this long sleep worker evicted it from cache and when timer fired, worker picked up the workflow task to continue, but since this workflow exec no longer in its cache it has to replay the whole history from the start.

During this replay it seems that you might have some non-deterministic code, or this is SDK bug, check if maybe you are using system time or a lib that uses system time when you define or calculate
your sleep durations.
Try debugging using WorkflowReplayer and see if you are able to find the exact place.

If you could share your workflow code or could provide a reproducer that would help as well.

Check the sleep that you call after invocation of Action4 activity (on the dev-queue task queue). Looks as maybe thats where this happens.

Just to add for reference, here is a great Go SDK troubleshooting guide that can help especially this section for your question.

@tihomir thanks for links, I think I know the issue after reading links you suggested. I will confirm tomorrow but I was getting Intrinsic non-deterministic logic problem.
If someone will hit same issue I am adding some more explanations below - in general do not use default time API as @tihomir suggested above.

My code looks like this:

d := promise.GetStart() // <- calculate execution date "d" (closest Monday, for current week it is in past)
sleepTime := d.Sub(workflow.Now(ctx))  // <- this was time.Now() !!!
if sleepTime > 0 { //<- if it is in future
  workflow.Sleep(ctx, sleepTime)
  ... perform some other activities here
}

workflow.Sleep(ctx, 24 * time.Hour)
... perform some other activities here

So I was getting “Intrinsic non-deterministic logic” described here.

Hope switching to API for time from Temporal SDK will help.

Thank you a lot!