Workflow doesn't save state

Hi!
My workflow consists of 3 sequential activities (RemoveEntity, WaitEntity, UpdateEntity). Two of them execute without any errors. The third returns an error. As far as I understand, Temporal should save all activity calls and results to the workflow history and resume from the third activity, “UpdateEntity”. But it doesn’t. When the last activity returns an error, the workflow starts from the beginning. I can see this in my console log, because the 1st activity makes HTTP requests. In the admin UI I see 5 history entries, the last of which is the activity “RemoveEntity” being scheduled. Why isn’t the state of the activities saved to the history, and why does the workflow start from the beginning?

	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{StartToCloseTimeout: time.Hour})
	slog.Debug("remove from", "ref", data.Ref)
	err := workflow.ExecuteActivity(ctx, a.RemoveEntity, data).Get(ctx, nil)
	if err != nil {
		return err
	}

	slog.Debug("waiting to finish")
	err = workflow.ExecuteActivity(ctx, ops.WaitEntity, data).Get(ctx, nil)
	if err != nil {
		return err
	}

	slog.Debug("error emulation!!!")

	return errors.New("failed to sync")
	//return workflow.ExecuteActivity(ctx, ops.UpdateEntity, data).Get(ctx, nil)

4:52PM DBG error emulation!!! group service=unknown

4:52PM DBG remove security group activity with  ref=9d6eac43-be2a-4f91-8e9b-e1cc365416d8 service=unknown
4:52PM DBG ExecuteActivity ActivityID=5 ActivityType=RemoveEntityActivity Attempt=2 Namespace=default RunID=15266c1e-624d-4fc2-976a-9bbb5260960a TaskQueue=edgeclient WorkerID=91173@sc-mac-01014@ WorkflowID="RemoveEntityWorkflow(GroupName: sg-xxxxx)" WorkflowType=RemoveEntityWorkflow service=unknown

Would you post the execution history of this workflow?

sure

{
  "events": [
    {
      "eventId": "1",
      "eventTime": "2024-06-23T09:24:28.954758878Z",
      "eventType": "WorkflowExecutionStarted",
      "version": "0",
      "taskId": "1049582",
      "workerMayIgnore": false,
      "workflowExecutionStartedEventAttributes": {
        "workflowType": {
          "name": "RemoveEntityWorkflow"
        },
        "parentWorkflowNamespace": "",
        "parentWorkflowNamespaceId": "",
        "parentWorkflowExecution": null,
        "parentInitiatedEventId": "0",
        "taskQueue": {
          "name": "edgeclient",
          "kind": "Normal",
          "normalName": ""
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "encodingDecoded": "json/plain"
              }
            }
          ]
        },
        "workflowExecutionTimeout": "3600s",
        "workflowRunTimeout": "3600s",
        "workflowTaskTimeout": "10s",
        "continuedExecutionRunId": "8b1f8674-330f-4ae3-b07a-09be6519ac1b",
        "initiator": "Retry",
        "continuedFailure": {
          "message": "failed to sync",
          "source": "GoSDK",
          "stackTrace": "",
          "encodedAttributes": null,
          "cause": {
            "message": "failed to sync",
            "source": "GoSDK",
            "stackTrace": "",
            "encodedAttributes": null,
            "cause": null,
            "applicationFailureInfo": {
              "type": "",
              "nonRetryable": false,
              "details": null
            }
          },
          "applicationFailureInfo": {
            "type": "Error",
            "nonRetryable": false,
            "details": null
          }
        },
        "lastCompletionResult": null,
        "originalExecutionRunId": "e58b43d4-3a67-4a42-9b77-8fe5ae7a9dd0",
        "identity": "",
        "firstExecutionRunId": "8b1f8674-330f-4ae3-b07a-09be6519ac1b",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 5,
          "nonRetryableErrorTypes": []
        },
        "attempt": 2,
        "workflowExecutionExpirationTime": "2024-06-23T10:24:27.875Z",
        "cronSchedule": "",
        "firstWorkflowTaskBackoff": "1s",
        "memo": {
          "fields": {
            "EdgeId": {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "MQ=="
            }
          }
        },
        "searchAttributes": {
          "indexedFields": {
            "ProjectIDAttribute": {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "type": "S2V5d29yZA=="
              },
              "data": "IiI="
            }
          }
        },
        "prevAutoResetPoints": {
          "points": [
            {
              "binaryChecksum": "78439d26766b093dd8b7799315c3116d",
              "runId": "8b1f8674-330f-4ae3-b07a-09be6519ac1b",
              "firstWorkflowTaskCompletedId": "4",
              "createTime": "2024-06-23T09:24:27.893540503Z",
              "expireTime": "2024-06-24T09:24:28.954758878Z",
              "resettable": true
            }
          ]
        },
        "header": {
          "fields": {}
        },
        "parentInitiatedEventVersion": "0",
        "workflowId": "RemoveEntityWorkflow(GroupName: sg-e56701_XXX49)",
        "sourceVersionStamp": null
      }
    },
    {
      "eventId": "2",
      "eventTime": "2024-06-23T09:24:29.957190545Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "1049590",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "edgeclient",
          "kind": "Normal",
          "normalName": ""
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "3",
      "eventTime": "2024-06-23T09:24:29.967224462Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "1049593",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "2",
        "identity": "30862@sc-mac-01014@",
        "requestId": "1442d663-f57e-4990-8c34-9ce71b3b7331",
        "suggestContinueAsNew": false,
        "historySizeBytes": "1791"
      }
    },
    {
      "eventId": "4",
      "eventTime": "2024-06-23T09:24:29.976581295Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "1049597",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "2",
        "startedEventId": "3",
        "identity": "30862@sc-mac-01014@",
        "binaryChecksum": "",
        "workerVersion": {
          "buildId": "78439d26766b093dd8b7799315c3116d",
          "bundleId": "",
          "useVersioning": false
        },
        "sdkMetadata": {
          "coreUsedFlags": [],
          "langUsedFlags": [
            3
          ],
          "sdkName": "temporal-go",
          "sdkVersion": "1.26.1"
        },
        "meteringMetadata": {
          "nonfirstLocalActivityExecutionAttempts": 0
        }
      }
    },
    {
      "eventId": "5",
      "eventTime": "2024-06-23T09:24:29.976685170Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "1049598",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "5",
        "activityType": {
          "name": "RemoveEntityActivity"
        },
        "taskQueue": {
          "name": "edgeclient",
          "kind": "Normal",
          "normalName": ""
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "encodingDecoded": "json/plain"
              },
            }
          ]
        },
        "scheduleToCloseTimeout": "3600s",
        "scheduleToStartTimeout": "3600s",
        "startToCloseTimeout": "3600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "4",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        },
        "useCompatibleVersion": true
      }
    }
  ]
}

It looks like your workflow is retrying from the beginning. Can you post the history of the previous workflow run that failed?

I don’t get what you mean by “previous run”. I got this JSON from “Event history”.


I can see from my console logs that the first 2 activities ran fine until the error was returned at the end of the workflow. The second time, the workflow started from the first activity again and hit an external HTTP error, because the entity had already been deleted by the first run. It looks like the activity completion results aren’t saved at all.

I’m pretty sure I’m not crazy, because the console logs show exactly what the code outputs.

	slog.Debug("remove security group activity with ", "ref", data.Ref)
	err := workflow.ExecuteActivity(ctx, nullVMOps.RemoveSecurityGroupActivity, data).Get(ctx, nil)
	if err != nil {
		return err
	}

	slog.Debug("waiting for security group to finish")
	err = workflow.ExecuteActivity(ctx, nullVMOps.WaitSecurityGroupActivity, data).Get(ctx, nil)
	if err != nil {
		return err
	}

	slog.Debug("syncing security group")

	fmt.Println("error emulation!!!")
	return errors.New("failed to sync")

12:01PM DBG remove security group activity with  ref=d9dbfa60-3569-4c03-b854-f1e9b78202b9 service=unknown
12:01PM DBG ExecuteActivity ActivityID=5 ActivityType=RemoveSecurityGroupActivity Attempt=1 Namespace=default RunID=be14bcde-7140-4153-a593-feaaaab4b1f2 TaskQueue=edgeclient WorkerID=85144@sc-mac-01014@ WorkflowID="RemoveSecurityGroupWorkflow(GroupName: sg-e56701_XXX53)" WorkflowType=RemoveSecurityGroupWorkflow service=unknown
12:01PM INF begin http request props="[method=DELETE url=https://some_url/v1/security-groups/d9dbfa60-3569-4c03-b854-f1e9b78202b9]" service=unknown
12:01PM INF finish http request http.latency=355.548333ms props="[method=DELETE url=https://some_url/v1/security-groups/d9dbfa60-3569-4c03-b854-f1e9b78202b9]" service=unknown
12:01PM DBG waiting for security group to finish service=unknown
12:01PM DBG ExecuteActivity ActivityID=11 ActivityType=WaitSecurityGroupActivity Attempt=1 Namespace=default RunID=be14bcde-7140-4153-a593-feaaaab4b1f2 TaskQueue=edgeclient WorkerID=85144@sc-mac-01014@ WorkflowID="RemoveSecurityGroupWorkflow(GroupName: sg-e56701_XXX53)" WorkflowType=RemoveSecurityGroupWorkflow service=unknown
12:01PM DBG syncing security group service=unknown
error emulation!!!
12:01PM DBG remove security group activity with  ref=d9dbfa60-3569-4c03-b854-f1e9b78202b9 service=unknown
12:01PM DBG ExecuteActivity ActivityID=5 ActivityType=RemoveSecurityGroupActivity Attempt=2 Namespace=default RunID=c63d83c0-d529-4a33-b34c-83e3d79fe29c TaskQueue=edgeclient WorkerID=85144@sc-mac-01014@ WorkflowID="RemoveSecurityGroupWorkflow(GroupName: sg-e56701_XXX53)" WorkflowType=RemoveSecurityGroupWorkflow service=unknown
12:01PM ERR error http request props="[method=DELETE url=https://some_url/v1/security-groups/d9dbfa60-3569-4c03-b854-f1e9b78202b9]" request failed: code=404 service=unknown

The second execution of the RemoveSecurityGroupActivity activity failed with HTTP 404 because the first run succeeded and deleted the entity.

I found a clue: if I put a time.Sleep after the Println, I can finally see the previous activity results. It seems the workflow failed too fast and there wasn’t enough time for the activities to save their results. Can that be true?

You failed the workflow, which caused it to retry (apparently, you specified the retry options). By design, when the workflow is retried, all its previous state is lost.

What are you trying to achieve? Failing and retrying workflows is an anti-pattern.

Just trying to understand how it works. I thought that a failed workflow restarts from the first activity that hasn’t completed, skipping the activities that already executed successfully. Where in the docs is this behaviour described?

Your understanding is incorrect. A failed workflow is considered failed. It can be retried, but a retry executes it from the beginning.

Workflow (Durable Execution) is a function that executes exactly once and is unaffected by infrastructure and process failures.

What is the use case that you are trying to model?

Trying to figure out why I get these 404 errors in production. I thought that if my last activity failed even with retries, the workflow would start execution from that failed activity again. I guess it’s better to make the “removing group” activity idempotent (ignoring 404 errors if the group was already removed).

If an activity failed (which means that you limited the duration or number of retries) and the workflow didn’t handle the error, then the workflow fails. There is no “workflow starts execution from that failed activity again” unless you reset the workflow explicitly.

So you don’t want to fail the activity if you want to preserve the state. Why did you limit the retries?