Hello,
The issue we are facing is similar to this topic.
In some of our long-running workflows, the worker hits the following panic:
getState: illegal access from outside of workflow context
StackTrace:
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/internal_workflow.go", line 653, in go.temporal.io/sdk/internal.getState
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/internal_workflow.go", line 659, in go.temporal.io/sdk/internal.assertNotInReadOnlyState
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/internal/workflow.go", line 568, in go.temporal.io/sdk/internal.ExecuteActivity
File "/go/pkg/mod/go.temporal.io/sdk@v1.25.1/workflow/workflow.go", line 121, in go.temporal.io/sdk/workflow.ExecuteActivity
Using the Temporal UI, it seems there are two situations that trigger this error on the worker side:
- WorkflowTaskTimedOut
Here the worker seems to hang once it has started the workflow task, even though there is practically no code between the previous activity execution and the next one. Example Temporal history:
{ "eventId": "11991", "eventTime": "2024-09-11T01:14:59.380443295Z", "eventType": "WorkflowTaskScheduled", "version": "0", "taskId": "21055344", "workerMayIgnore": false, "workflowTaskScheduledEventAttributes": { "taskQueue": { "name": "my-service-68686d5f9f-jd22n:4eacb871-7886-4f03-930a-6fc4f18e5ad0", "kind": "Sticky" }, "startToCloseTimeout": "120s", "attempt": 1 } }, { "eventId": "11992", "eventTime": "2024-09-11T01:14:59.401451526Z", "eventType": "WorkflowTaskStarted", "version": "0", "taskId": "21055348", "workerMayIgnore": false, "workflowTaskStartedEventAttributes": { "scheduledEventId": "11991", "identity": "1@my-service-68686d5f9f-jd22n@", "requestId": "d7b75a92-bc00-409a-b8e2-60ef6660316d", "suggestContinueAsNew": false, "historySizeBytes": "0" } }, { "eventId": "11993", "eventTime": "2024-09-11T01:16:59.403518309Z", "eventType": "WorkflowTaskTimedOut", "version": "0", "taskId": "21055352", "workerMayIgnore": false, "workflowTaskTimedOutEventAttributes": { "scheduledEventId": "11991", "startedEventId": "11992", "timeoutType": "StartToClose" } }
I’m having trouble understanding what the worker is doing during these two minutes, since no deadlock is detected (see the goroutine-dump sketch after these two histories).
The worker does not appear to be under heavy load: the same instance keeps executing other workflows during this time span, at regular intervals and with periods of inactivity.
- WorkflowTaskFailed
Here a deadlock is detected 1s after the workflow task has started on the worker side. Example Temporal history:
{ "eventId": "53613", "eventTime": "2024-09-15T03:24:32.848769352Z", "eventType": "WorkflowTaskScheduled", "version": "0", "taskId": "22183038", "workerMayIgnore": false, "workflowTaskScheduledEventAttributes": { "taskQueue": { "name": "my-service-7fd4b658b-hg248:36ca2738-b743-4428-9f1d-5e563715d855", "kind": "Sticky" }, "startToCloseTimeout": "120s", "attempt": 1 } }, { "eventId": "53614", "eventTime": "2024-09-15T03:24:32.860723791Z", "eventType": "WorkflowTaskStarted", "version": "0", "taskId": "22183042", "workerMayIgnore": false, "workflowTaskStartedEventAttributes": { "scheduledEventId": "53613", "identity": "1@my-service-7fd4b658b-hg248@", "requestId": "d76d17f5-ae42-4adf-8696-59b89365b554", "suggestContinueAsNew": false, "historySizeBytes": "0" } }, { "eventId": "53615", "eventTime": "2024-09-15T03:24:33.885275290Z", "eventType": "WorkflowTaskFailed", "version": "0", "taskId": "22183046", "workerMayIgnore": false, "workflowTaskFailedEventAttributes": { "scheduledEventId": "53613", "startedEventId": "53614", "cause": "WorkflowWorkerUnhandledFailure", "failure": { "message": "Potential deadlock detected: workflow goroutine \"root\" didn't yield for over a second", "source": "GoSDK", "stackTrace": "process event for default-queue [panic]:\ngo.temporal.io/sdk/internal.(*coroutineState).call(0xc008f33090, 0x3b9aca00)\n\t/go/pkg/mod/go.temporal.> "encodedAttributes": null, "cause": null, "applicationFailureInfo": { "type": "PanicError", "nonRetryable": true, "details": null } }, "identity": "1@my-service-7fd4b658b-hg248@", "baseRunId": "", "newRunId": "", "forkEventVersion": "0", "binaryChecksum": "e0870f5e78775d3afcfcc5a97b5987a3" } }
Here is a simplified version of our workflow code:
func testWorkflow(ctx workflow.Context) (err error) {
	defer func() {
		// XXX: THIS IS WHERE WE PANIC
		deferErr := workflow.ExecuteActivity(ctx, testActivity3).Get(ctx, nil)
		if deferErr != nil {
			err = deferErr
		}
	}()

	err = workflow.ExecuteActivity(ctx, testActivity1).Get(ctx, nil)
	if err != nil {
		return err
	}

	// XXX: THIS IS WHERE THE WORKFLOW TASK FAILS / TIMES OUT
	err = workflow.ExecuteActivity(ctx, testActivity2).Get(ctx, nil)
	if err != nil {
		return err
	}

	return nil
}
So it seems that when the workflow task fails or times out, our defer statement is triggered and tries to execute another activity from a “dead” workflow context, which panics.
After the workflow task failure, Temporal indeed retries the workflow task on the normal (non-sticky) task queue, so the workflow is replayed (sometimes on the same worker) and continues normally.
This is not causing any functional problems, but the panics are triggering alerts in our monitoring. So I’m looking for help answering these two questions:
- Why is the workflow task failing / timing out randomly?
- Why is the defer statement triggered when the workflow task fails or times out?
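In the meantime, as a stopgap to silence the alerts, we are considering guarding the deferred cleanup so it skips the activity call when the coroutine is unwinding because of a panic rather than a normal return. This is purely our own idea (the recover/re-panic guard is not something we found in the Temporal docs), and we have not verified that it is safe with respect to how the SDK drives workflow coroutines:

defer func() {
	// If the workflow goroutine is being unwound via a panic (deadlock detector,
	// eviction from the sticky cache, ...), re-raise it instead of calling into
	// the workflow context, which the SDK now treats as read-only.
	if r := recover(); r != nil {
		panic(r)
	}
	deferErr := workflow.ExecuteActivity(ctx, testActivity3).Get(ctx, nil)
	if deferErr != nil {
		err = deferErr
	}
}()

Feedback on whether that guard is reasonable would also be appreciated.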
Temporal Go SDK version: 1.25.1
Temporal Server version: 1.19.1
Thanks for your help!