Hi team,
We have encountered a potential workflow replaying issue in the go-sdk. Need your help to validate this and suggest on potential fixes.
Issue Description
We are seeing non-determinism error when using mutable side-effects behind the workflow versioning markers.
PanicError: lookup failed for scheduledEventID to activityID: scheduleEventID: 13, activityID: 13
process event for example-task-queue [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_command_state_machine.go:455
go.temporal.io/sdk/internal.(*commandsHelper).handleActivityTaskScheduled(0x14001a315e0, {0x14001507798, 0x2}, 0x40?)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_command_state_machine.go:1038 +0xf0
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0x1400089ef90, 0x14001a29b00, 0x20?, 0x0)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_event_handlers.go:1031 +0x248
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0x140015fd3b0, 0x1400137c600)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_handlers.go:962 +0xed0
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0x14000cd6c30, 0x1400137c600, 0x140018daa20)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_handlers.go:780 +0x364
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0x14000ccc900, 0x1400137c600)
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_pollers.go:329 +0x2d0
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0x14000ccc900, {0x106c9dac0?, 0x1400137c600})
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_task_pollers.go:302 +0x70
go.temporal.io/sdk/internal.(*baseWorker).processTask(0x14000ca28c0, {0x106c9d640, 0x140019515a0})
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_worker_base.go:440 +0x15c
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
/Users/nitesh/go/pkg/mod/go.temporal.io/sdk@v1.22.2/internal/internal_worker_base.go:334 +0x98
How to Reproduce
Try to replay workflow history or perform reset post-side-effect marker.
I have created a sample workflow to reproduce the same.
// Workflow is a Hello World workflow definition.
func Workflow(ctx workflow.Context, name string) (string, error) {
uid := ""
logger := workflow.GetLogger(ctx)
logger.Info("HelloWorld workflow started", "name", name)
v := workflow.GetVersion(ctx, "mutable-side-effect-bug", workflow.DefaultVersion, 1)
if v == 1 {
var err error
uid, err = generateUUID(ctx)
if err != nil {
logger.Error("failed to generated uuid", "Error", err)
return "", err
}
logger.Info("generated uuid", "uuid-val", uid)
}
ao := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
var result string
err := workflow.ExecuteActivity(ctx, Activity, name).Get(ctx, &result)
if err != nil {
logger.Error("Activity failed.", "Error", err)
return "", err
}
logger.Info("HelloWorld workflow completed.", "result", result)
return uid, nil
}
func generateUUID(ctx workflow.Context) (string, error) {
var generatedUUID string
err := workflow.MutableSideEffect(ctx, "generate-random-uuid", func(ctx workflow.Context) interface{} {
return uuid.NewString()
}, func(a, b interface{}) bool {
return a.(string) == b.(string)
}).Get(&generatedUUID)
if err != nil {
return "", err
}
return generatedUUID, nil
}
func Activity(ctx context.Context, name string) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("Activity", "name", name)
return "Hello " + name + "!", nil
}
Potential Hypothesis
As part of this change, we are not recording side effect marker command while replaying based on entry in wc.mutableSideEffectsRecorded
map which works in normal scenario. However, there is a catch, it seems while replaying GetVersion
doesn’t generate a command explicitly (don’t know the reason). This is leading to mis-match in wc.commandsHelper.nextCommandEventID
which is used as key to perform marker record lookup and hence leading to non-determinism.
Again, this is based on the debugging that I was able to do based on my limited knowledge. Please feel free to correct me if I am wrong. Thanks.