Testing robustness to worker failure

We have a workflow which spins up two activities:

  1. Kick off an external long running process and return immediately
  2. Poll on the long running process, returning only when the external process is complete

We hadn’t implemented heartbeats and heartbeat timeouts on the polling activity, so when the workers go down (eg during a deployment) the workflow wouldn’t time out and reschedule another polling activity. This would result in users waiting a long time for workflows to complete (or the workflow itself would time out). We have now implemented a heartbeat + timeout to remedy this issue. I’ve tested this with the actual temporal server and if I crash the worker the polling task is running on it is restarted on another worker.

I thought it would be nice to write a test to ensure our workflows are resistant to timing out. My hack at this was to mock the polling activity so the first time it runs it just loops without heartbeating (busy waits), and the second time it runs it returns a value. My objective here is the heartbeat timeouts should make the first activity fail, and the second will succeed.

However I noticed in my test that the in-memory temporal server wouldn’t actually time out the first activity when the heartbeat timeout was reached. I’ve created a minimal test case to display this behaviour:

func (s *TestSuite) SetupTest() {
	s.env = s.NewTestWorkflowEnvironment()
}

func TestTestSuite(t *testing.T) {
	suite.Run(t, new(TestSuite))
}

func (s *UnitTestSuite) Test_SimpleWorkflow_Success() {
	s.env.SetTestTimeout(time.Second * 10)

	s.env.RegisterActivity(ExampleActivity)
	s.env.RegisterWorkflow(ExampleWorkflow)

	s.env.ExecuteWorkflow(ExampleWorkflow)

	s.True(s.env.IsWorkflowCompleted())
	s.NoError(s.env.GetWorkflowError())

	var resp string
	err := s.env.GetWorkflowResult(&resp)
	s.NoError(err)
	s.Equal("hello", resp)
}

func ExampleWorkflow(ctx workflow.Context) (string, error) {
	var resp string
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		HeartbeatTimeout:    1 * time.Second,
	}
	innerCtx := workflow.WithActivityOptions(ctx, ao)
	err := workflow.ExecuteActivity(innerCtx, ExampleActivity).Get(ctx, &resp)
	return resp, err
}

func ExampleActivity(ctx context.Context) (string, error) {
	fmt.Println("Running activity...")
	for {
		time.Sleep(5 * time.Second)
	}
	return "hello", nil
}

I would expect the first activity to start, print “Running activity”, then time out, then another would start up and print “Running activity” again. Instead, I’m only seeing one “Running activity” being printed out.

Questions:

  1. Does the in-memory temporal server not respect heartbeat timeouts as I suspect?
  2. Is there a better way to test the resilience of workflows with heartbeat timeouts?
  3. Is this structure of polling inside an activity with a short sleep between polls okay?

Aha, it looks like an issue has been created for this Heartbeat timeout not raised while testing · Issue #1282 · temporalio/sdk-go · GitHub