Hello,
The issue we are facing is similar to this topic.
In some of our long-running workflows, the worker hits the following panic:
getState: illegal access from outside of workflow context
StackTrace:
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/internal_workflow.go", line 653, in go.temporal.io/sdk/internal.getState
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/internal_workflow.go", line 659, in go.temporal.io/sdk/internal.assertNotInReadOnlyState
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/workflow.go", line 568, in go.temporal.io/sdk/internal.ExecuteActivity
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/workflow/workflow.go", line 121, in go.temporal.io/sdk/workflow.ExecuteActivity
Using the Temporal UI, it seems there are two situations that trigger this error on the worker side:
- WorkflowTaskTimedOut
Here the worker seems to hang once it has started the workflow task, even though there is practically no code between the previous activity execution and the next one. Example Temporal history:
{ "eventId": "11991", "eventTime": "2024-09-11T01:14:59.380443295Z", "eventType": "WorkflowTaskScheduled", "version": "0", "taskId": "21055344", "workerMayIgnore": false, "workflowTaskScheduledEventAttributes": { "taskQueue": { "name": "my-service-68686d5f9f-jd22n:4eacb871-7886-4f03-930a-6fc4f18e5ad0", "kind": "Sticky" }, "startToCloseTimeout": "120s", "attempt": 1 } }, { "eventId": "11992", "eventTime": "2024-09-11T01:14:59.401451526Z", "eventType": "WorkflowTaskStarted", "version": "0", "taskId": "21055348", "workerMayIgnore": false, "workflowTaskStartedEventAttributes": { "scheduledEventId": "11991", "identity": "1@my-service-68686d5f9f-jd22n@", "requestId": "d7b75a92-bc00-409a-b8e2-60ef6660316d", "suggestContinueAsNew": false, "historySizeBytes": "0" } }, { "eventId": "11993", "eventTime": "2024-09-11T01:16:59.403518309Z", "eventType": "WorkflowTaskTimedOut", "version": "0", "taskId": "21055352", "workerMayIgnore": false, "workflowTaskTimedOutEventAttributes": { "scheduledEventId": "11991", "startedEventId": "11992", "timeoutType": "StartToClose" } }
I’m having trouble understanding what the worker is doing during these two minutes, since no deadlock is detected (see the goroutine-dump sketch after these two histories).
The worker does not appear to be under heavy load: the same instance keeps executing other workflows during this time span, at regular intervals and with periods of inactivity.
- WorkflowTaskFailed
Here a deadlock is detected 1s after the workflow task has started on the worker side. Example Temporal history:
{ "eventId": "53613", "eventTime": "2024-09-15T03:24:32.848769352Z", "eventType": "WorkflowTaskScheduled", "version": "0", "taskId": "22183038", "workerMayIgnore": false, "workflowTaskScheduledEventAttributes": { "taskQueue": { "name": "my-service-7fd4b658b-hg248:36ca2738-b743-4428-9f1d-5e563715d855", "kind": "Sticky" }, "startToCloseTimeout": "120s", "attempt": 1 } }, { "eventId": "53614", "eventTime": "2024-09-15T03:24:32.860723791Z", "eventType": "WorkflowTaskStarted", "version": "0", "taskId": "22183042", "workerMayIgnore": false, "workflowTaskStartedEventAttributes": { "scheduledEventId": "53613", "identity": "1@my-service-7fd4b658b-hg248@", "requestId": "d76d17f5-ae42-4adf-8696-59b89365b554", "suggestContinueAsNew": false, "historySizeBytes": "0" } }, { "eventId": "53615", "eventTime": "2024-09-15T03:24:33.885275290Z", "eventType": "WorkflowTaskFailed", "version": "0", "taskId": "22183046", "workerMayIgnore": false, "workflowTaskFailedEventAttributes": { "scheduledEventId": "53613", "startedEventId": "53614", "cause": "WorkflowWorkerUnhandledFailure", "failure": { "message": "Potential deadlock detected: workflow goroutine \"root\" didn't yield for over a second", "source": "GoSDK", "stackTrace": "process event for default-queue [panic]:\ngo.temporal.io/sdk/internal.(*coroutineState).call(0xc008f33090, 0x3b9aca00)\n\t/go/pkg/mod/go.temporal.> "encodedAttributes": null, "cause": null, "applicationFailureInfo": { "type": "PanicError", "nonRetryable": true, "details": null } }, "identity": "1@my-service-7fd4b658b-hg248@", "baseRunId": "", "newRunId": "", "forkEventVersion": "0", "binaryChecksum": "e0870f5e78775d3afcfcc5a97b5987a3" } }
Here is a simplified version of our workflow code:
func testWorkflow(ctx workflow.Context) (err error) {
	defer func() {
		// XXX: THIS IS WHERE WE PANIC
		deferErr := workflow.ExecuteActivity(ctx, testActivity3).Get(ctx, nil)
		if deferErr != nil {
			err = deferErr
		}
	}()

	err = workflow.ExecuteActivity(ctx, testActivity1).Get(ctx, nil)
	if err != nil {
		return err
	}

	// XXX: THIS IS WHERE THE WORKFLOW TASK FAILS / TIMES OUT
	err = workflow.ExecuteActivity(ctx, testActivity2).Get(ctx, nil)
	if err != nil {
		return err
	}

	return nil
}
So it seems that when the workflow task fails or times out, our defer statement is triggered and tries to execute another activity from a “dead” workflow context, which panics.
After the workflow task failure, Temporal indeed retries the workflow task on the normal (non-sticky) task queue, so the workflow is replayed (sometimes on the same worker) and continues normally.
This is not causing any functional problems, but the panics are triggering alerts in our monitoring. So I’m looking for help answering these two questions:
- Why is the workflow task failing / timing out randomly?
- Why is the defer statement triggered when the workflow task fails or times out?
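In the meantime, as a stopgap to silence the alerts, we are considering guarding the deferred cleanup so it skips the activity call when the coroutine is unwinding because of a panic rather than a normal return. This is purely our own idea (the recover/re-panic guard is not something we found in the Temporal docs), and we have not verified that it is safe with respect to how the SDK drives workflow coroutines:

defer func() {
	// If the workflow goroutine is being unwound via a panic (deadlock detector,
	// eviction from the sticky cache, ...), re-raise it instead of calling into
	// the workflow context, which the SDK now treats as read-only.
	if r := recover(); r != nil {
		panic(r)
	}
	deferErr := workflow.ExecuteActivity(ctx, testActivity3).Get(ctx, nil)
	if deferErr != nil {
		err = deferErr
	}
}()

Feedback on whether that guard is reasonable would also be appreciated.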
Temporal Go SDK version: 1.25.1
Temporal Server version: 1.19.1
Thanks for your help!