Creating a test case to exercise the ScheduleToClose timeout on an always-failing Activity

I have a workflow with an activity that, in some situations, fails on every attempt. I want to give that activity a grace period to succeed, and if it hasn’t succeeded by the end of that grace period I want to continue the workflow anyway. This is a short activity, so I’m using ScheduleToCloseTimeout to set the grace period. StartToCloseTimeout is set short because the activity will either complete (pass/fail) within a minute, or it has hung and needs to be retried.

In the tests for my workflow I’m trying to create a test case that emulates retrying until the ScheduleToClose time runs out. I’m having a hard time figuring out what kind of error to return from the mocked Activity. I can’t figure out how to use time skipping to get the error raised, so instead I’m guessing at how to construct an error that I can then catch in my workflow code. I keep getting it wrong: in real life I’m not catching the timeout correctly and my workflow errors out.

Example workflow code:

	readyForRebootCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout:    60 * time.Second,
		ScheduleToCloseTimeout: 15 * time.Minute,
	})

	if err := workflow.ExecuteActivity(readyForRebootCtx, a.LifecycleReadyForRebootSsh, destroyInstanceReq.Hostname).Get(ctx, &lifecycleReadyForRebootResp); err != nil {
		// Timeout error here would only be ScheduleToClose as StartToClose is retryable.
		// Any other error is logged and returned
		if !temporal.IsTimeoutError(err) {
			// The workflow logger takes key/value pairs after the message.
			logger.Error("failed to run lifecycle ready for reboot", "error", err)
			return err
		}
		logger.Info("lifecycle ready for reboot reached ReadyToDestroyTimeout... moving on to instance destruction", "hostname", destroyInstanceReq.Hostname)
	}

Test code:

	testSuite := &testsuite.WorkflowTestSuite{}
	env := testSuite.NewTestWorkflowEnvironment()
	var a *activity.Activities

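	// enumspb is go.temporal.io/api/enums/v1; TIMEOUT_TYPE_SCHEDULE_TO_CLOSE is the named form of the literal 3.
	env.OnActivity(a.LifecycleReadyForRebootSsh, mock.Anything, mock.Anything).Return(
		&static.LifecycleResp{
			StdOut:          "",
			StdErr:          "NOTOKAY",
			Rc:              1,
			Hostname:        hostname,
			LifecycleScript: static.LifecycleReadyForReboot,
		}, temporal.NewTimeoutError(enumspb.TIMEOUT_TYPE_SCHEDULE_TO_CLOSE, fmt.Errorf("lifecycle ready for reboot failed")))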

	env.ExecuteWorkflow(DestroyInstanceWorkflow, workflowoptions.DestroyInstanceWorkflowOptions{Hostname: hostname, ScaleSetID: 1, ExplicitDestroy: false})
	s.True(env.IsWorkflowCompleted())
	s.NoError(env.GetWorkflowError())

The tests pass, but in real life what I’m seeing is ActivityTaskFailed.

So, is there a way to use TestWorkflowEnvironment to simulate the timeout so that it returns the error to the workflow like it would in real life, and my workflow code can catch the right thing?

> The tests pass, but in real life what I’m seeing is ActivityTaskFailed.

I think this is expected when your activity fails on attempt N before its StartToClose timeout (so the last attempt does not time out), but close enough to the ScheduleToClose timeout that the server determines it cannot schedule another retry of this activity. In this case you can still check the ActivityError retry state, something like:

	var actErr *temporal.ActivityError
	isActivityError := errors.As(err, &actErr)
	if isActivityError {
		// actErr.RetryState() should be enumspb.RETRY_STATE_TIMEOUT
	}

If the last attempt of the activity times out on its StartToClose and no more attempts can be scheduled because of ScheduleToClose, then you would see an ActivityTaskTimedOut event in history, if that’s what you want to test.

I can understand why that error type comes up, though. It makes sense that a retry that would start after the timeout would be skipped.

Mostly I’m bothered that I haven’t figured out a way to synthesize this scenario in my test framework. While I could carefully construct an error object to return from mocking my activity, all I’m doing is testing an error I generate with code I wrote to catch the error. I’d much rather write a test that causes Temporal to raise the error itself so that I can truly ensure that I’m catching the right thing.