Hi, I am running a couple of workflows that make heavy use of workflow.Sleep.
For some of those, I am getting: ERROR Workflow panic - Error lookup failed for scheduledEventID to activityID.
I have been trying to track down the problem for a while without luck, and I do not know how to identify the place that causes this error. Do you have any suggestions on how to find the root cause?
Yes, you can find it here (sorry, it is too large to paste). The stack trace additionally says “possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition”, but this happens without any code change or even a worker restart.
I am running it on go version go1.17.5 linux/amd64
Regarding “but this happens without any change or even restart of worker”:
Before the panic, it seems you have a timer (sleep) of 3 days:
Timer Started: “eventTime”: “2022-07-30T10:53:10.979829652Z”,
Timer Fired: “eventTime”: “2022-08-02T10:53:11.027287627Z”,
Workers keep an in-memory cache of workflow executions, and when your workflow execution is blocked (on a sleep, for example), workers can evict it from their cache so that other executions can be processed (this is an optimization technique that workers in all SDKs use).
I think this is what may be happening here: during this long sleep the worker evicted the execution from its cache, and when the timer fired, the worker picked up the workflow task to continue. Since the execution was no longer in its cache, the worker had to replay the whole history from the start.
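The eviction-then-replay behaviour can be sketched with a toy model in standard-library Go. This is not the Temporal SDK API: the worker type, handleTask, evict, the "wf-1" ID and the event names are all hypothetical, chosen only to show that a cache miss forces every event since the start of the history to be applied again.

```go
package main

import "fmt"

// workflowState is a toy stand-in for a cached workflow execution.
type workflowState struct {
	applied int // number of history events already applied to this state
}

// worker is a toy model of a worker's in-memory workflow cache.
type worker struct {
	cache map[string]*workflowState
}

func newWorker() *worker {
	return &worker{cache: make(map[string]*workflowState)}
}

// evict drops a blocked workflow execution from the cache.
func (w *worker) evict(id string) { delete(w.cache, id) }

// handleTask brings the cached state up to date with history and returns
// how many events had to be applied. On a cache miss the state is rebuilt
// from scratch, so the whole history is replayed from the first event.
func (w *worker) handleTask(id string, history []string) int {
	st, ok := w.cache[id]
	if !ok {
		st = &workflowState{}
		w.cache[id] = st
	}
	n := len(history) - st.applied
	st.applied = len(history)
	return n
}

func main() {
	history := []string{"WorkflowExecutionStarted", "TimerStarted"}

	w := newWorker()
	fmt.Println(w.handleTask("wf-1", history)) // 2

	// While the long timer is pending, the worker evicts the execution.
	w.evict("wf-1")

	// The timer fires; the new task has to replay the full history.
	history = append(history, "TimerFired", "WorkflowTaskScheduled")
	fmt.Println(w.handleTask("wf-1", history)) // 4 (cache miss: full replay)
}
```

Without the eviction, the second task would only apply the 2 new events; after it, all 4 are replayed, which is when non-deterministic code gets a chance to diverge.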
During this replay, it seems that you may have some non-deterministic code (or this is an SDK bug). Check whether you are using the system time, or a library that uses the system time, when you define or calculate your sleep durations.
Try debugging with the WorkflowReplayer and see whether you can find the exact place where the replay diverges.
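In the Go SDK the replayer is created with worker.NewWorkflowReplayer(), your workflow is registered on it, and a saved history is fed to it (for example with ReplayWorkflowHistoryFromJSONFile); it reports an error when the code no longer produces what the history recorded. The core idea of "find the first mismatch" can be sketched without the SDK. The firstDivergence function and the command strings below are hypothetical, for illustration only.

```go
package main

import "fmt"

// firstDivergence compares the commands recorded in history with the ones
// produced by replaying the workflow code and returns the index of the first
// mismatch, or -1 if the two sequences are identical.
func firstDivergence(recorded, replayed []string) int {
	n := len(recorded)
	if len(replayed) < n {
		n = len(replayed)
	}
	for i := 0; i < n; i++ {
		if recorded[i] != replayed[i] {
			return i
		}
	}
	if len(recorded) != len(replayed) {
		return n // one sequence is a strict prefix of the other
	}
	return -1
}

func main() {
	recorded := []string{"StartTimer 72h0m0s", "StartTimer 24h0m0s"}
	replayed := []string{"StartTimer 24h0m0s"} // the first Sleep was skipped
	i := firstDivergence(recorded, replayed)
	fmt.Printf("diverged at command %d: recorded %q\n", i, recorded[i])
}
```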
If you could share your workflow code or provide a reproducer, that would help as well.
@tihomir thanks for the links; after reading them I think I know what the issue is. I will confirm tomorrow, but I was hitting the “Intrinsic non-deterministic logic” problem.
In case someone hits the same issue, I am adding some more explanation below. In general, do not use the standard time API (e.g. time.Now()) in workflow code, as @tihomir suggested above.
My code looks like this:
```go
d := promise.GetStart() // execution date "d" (closest Monday; for the current week it is in the past)
sleepTime := d.Sub(workflow.Now(ctx)) // <- this was time.Now() !!!
if sleepTime > 0 { // only sleep if "d" is in the future
	if err := workflow.Sleep(ctx, sleepTime); err != nil {
		return err // assumes the workflow function returns only an error
	}
	// ... perform some other activities here
}
if err := workflow.Sleep(ctx, 24*time.Hour); err != nil {
	return err
}
// ... perform some other activities here
```
So I was getting the “Intrinsic non-deterministic logic” problem described here.
Hopefully switching to the time API from the Temporal SDK (workflow.Now) will help.