Non-determinism issue while replaying mutable side-effect behind workflow versioning

Hi team,

We have encountered a potential workflow replaying issue in the go-sdk. Need your help to validate this and suggest on potential fixes.

Issue Description
We are seeing non-determinism error when using mutable side-effects behind the workflow versioning markers.


PanicError: lookup failed for scheduledEventID to activityID: scheduleEventID: 13, activityID: 13 
process event for example-task-queue [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_command_state_machine.go:455
go.temporal.io/sdk/internal.(*commandsHelper).handleActivityTaskScheduled(0x14001a315e0, {0x14001507798, 0x2}, 0x40?)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_command_state_machine.go:1038 +0xf0
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0x1400089ef90, 0x14001a29b00, 0x20?, 0x0)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_event_handlers.go:1031 +0x248
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0x140015fd3b0, 0x1400137c600)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_handlers.go:962 +0xed0
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0x14000cd6c30, 0x1400137c600, 0x140018daa20)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_handlers.go:780 +0x364
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0x14000ccc900, 0x1400137c600)
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_pollers.go:329 +0x2d0
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0x14000ccc900, {0x106c9dac0?, 0x1400137c600})
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_pollers.go:302 +0x70
go.temporal.io/sdk/internal.(*baseWorker).processTask(0x14000ca28c0, {0x106c9d640, 0x140019515a0})
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_worker_base.go:440 +0x15c
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
	/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_worker_base.go:334 +0x98

How to Reproduce
Try to replay workflow history or perform reset post-side-effect marker.

I have created a sample workflow to reproduce the same.

// Workflow is a Hello World workflow definition.
func Workflow(ctx workflow.Context, name string) (string, error) {
	uid := ""
	logger := workflow.GetLogger(ctx)
	logger.Info("HelloWorld workflow started", "name", name)

	v := workflow.GetVersion(ctx, "mutable-side-effect-bug", workflow.DefaultVersion, 1)
	if v == 1 {
		var err error
		uid, err = generateUUID(ctx)
		if err != nil {
			logger.Error("failed to generated uuid", "Error", err)
			return "", err
		}

		logger.Info("generated uuid", "uuid-val", uid)
	}

	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var result string
	err := workflow.ExecuteActivity(ctx, Activity, name).Get(ctx, &result)
	if err != nil {
		logger.Error("Activity failed.", "Error", err)
		return "", err
	}

	logger.Info("HelloWorld workflow completed.", "result", result)

	return uid, nil
}

func generateUUID(ctx workflow.Context) (string, error) {
	var generatedUUID string

	err := workflow.MutableSideEffect(ctx, "generate-random-uuid", func(ctx workflow.Context) interface{} {
		return uuid.NewString()
	}, func(a, b interface{}) bool {
		return a.(string) == b.(string)
	}).Get(&generatedUUID)
	if err != nil {
		return "", err
	}

	return generatedUUID, nil
}

func Activity(ctx context.Context, name string) (string, error) {
	logger := activity.GetLogger(ctx)
	logger.Info("Activity", "name", name)
	return "Hello " + name + "!", nil
}

Potential Hypothesis
As part of this change, we are not recording side effect marker command while replaying based on entry in wc.mutableSideEffectsRecorded map which works in normal scenario. However, there is a catch, it seems while replaying GetVersion doesn’t generate a command explicitly (don’t know the reason). This is leading to mis-match in wc.commandsHelper.nextCommandEventID which is used as key to perform marker record lookup and hence leading to non-determinism.

Again, this is based on the debugging that I was able to do based on my limited knowledge. Please feel free to correct me if I am wrong. Thanks.

1 Like

cc: @Chad_Retz

This looks like a bug and that’s a great simple replication, thanks! Can you open an issue on the Go SDK repo with this same information?

Sure, thanks for confirming.