Testing robustness to worker failure

ehmt · September 26, 2023, 2:46am

We have a workflow which spins up two activities:

1. Kick off an external long-running process and return immediately.
2. Poll the long-running process, returning only when the external process is complete.

We hadn’t implemented heartbeats and a heartbeat timeout on the polling activity, so when the workers went down (e.g. during a deployment), the polling activity would not time out and be rescheduled. As a result, users waited a long time for workflows to complete (or the workflow itself timed out). We have now added heartbeats and a heartbeat timeout to remedy this. I’ve tested this against an actual Temporal server: if I crash the worker that the polling activity is running on, the activity is restarted on another worker.
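For readers following along, here is a minimal sketch of the kind of polling loop described above, written against the standard library only so it runs without the SDK. The names `pollUntilDone`, `check`, `heartbeat`, and `runDemo` are hypothetical and not taken from our code base; in a real activity the `heartbeat` callback would presumably wrap `activity.RecordHeartbeat(ctx, details...)`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntilDone calls check once per interval until it reports completion.
// It invokes heartbeat after every successful poll and returns the number
// of polls performed. (Hypothetical helper, standard library only.)
func pollUntilDone(
	ctx context.Context,
	interval time.Duration,
	check func(context.Context) (bool, error),
	heartbeat func(details ...interface{}),
) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for polls := 1; ; polls++ {
		done, err := check(ctx)
		if err != nil {
			return polls, err
		}
		// Heartbeat on every poll so that a dead worker is noticed within
		// the heartbeat timeout rather than the start-to-close timeout.
		heartbeat(polls)
		if done {
			return polls, nil
		}
		select {
		case <-ctx.Done():
			// Cancellation ends the loop instead of sleeping forever.
			return polls, ctx.Err()
		case <-ticker.C:
		}
	}
}

// runDemo polls a stand-in external process that completes on the third poll.
func runDemo() (polls int, beats int, err error) {
	remaining := 3
	check := func(context.Context) (bool, error) {
		remaining--
		return remaining == 0, nil
	}
	countBeat := func(...interface{}) { beats++ }
	polls, err = pollUntilDone(context.Background(), 10*time.Millisecond, check, countBeat)
	return polls, beats, err
}

func main() {
	polls, beats, err := runDemo()
	fmt.Println(polls, beats, err)
}
```

The poll interval should be comfortably shorter than the heartbeat timeout, otherwise a healthy activity could be timed out between two polls.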

I thought it would be nice to write a test to ensure our workflows are resilient to this kind of failure. My hack at this was to mock the polling activity so that the first time it runs it just loops without heartbeating (busy-waits), and the second time it runs it returns a value. The goal is that the heartbeat timeout makes the first attempt fail and the second attempt succeeds.
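To make the idea concrete, here is a standard-library-only sketch of the "stall once, then succeed" fake together with a small watchdog that abandons an attempt when no heartbeat arrives in time. This is a simulation under stated assumptions, not the Temporal test environment: `runWithHeartbeatTimeout`, `newStallOnceActivity`, `attemptFunc`, and `runDemo` are hypothetical names, and the retry loop only approximates what a server would do.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var errHeartbeatTimeout = errors.New("heartbeat timeout")

// attemptFunc is one activity attempt; it must call beat to stay alive.
type attemptFunc func(ctx context.Context, beat func()) (string, error)

// runWithHeartbeatTimeout runs a single attempt and abandons it when no
// heartbeat (and no result) arrives within timeout.
func runWithHeartbeatTimeout(fn attemptFunc, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // unblocks an abandoned attempt

	beats := make(chan struct{}, 1)
	beat := func() {
		select {
		case beats <- struct{}{}:
		default: // a heartbeat is already pending
		}
	}

	type result struct {
		val string
		err error
	}
	done := make(chan result, 1) // buffered so an abandoned attempt can exit
	go func() {
		v, err := fn(ctx, beat)
		done <- result{v, err}
	}()

	for {
		select {
		case r := <-done:
			return r.val, r.err
		case <-beats:
			// Looping creates a fresh timer, which restarts the window.
		case <-time.After(timeout):
			return "", errHeartbeatTimeout
		}
	}
}

// newStallOnceActivity returns a fake polling activity whose first attempt
// never heartbeats and whose later attempts succeed immediately.
func newStallOnceActivity() (attemptFunc, *int32) {
	var attempts int32
	fn := func(ctx context.Context, beat func()) (string, error) {
		if atomic.AddInt32(&attempts, 1) == 1 {
			<-ctx.Done() // simulate a dead worker: block without heartbeating
			return "", ctx.Err()
		}
		beat()
		return "hello", nil
	}
	return fn, &attempts
}

// runDemo retries the fake activity up to three times.
func runDemo() (string, int32, error) {
	activity, attempts := newStallOnceActivity()
	var res string
	var err error
	for i := 0; i < 3; i++ {
		res, err = runWithHeartbeatTimeout(activity, 50*time.Millisecond)
		if err == nil {
			break
		}
	}
	return res, atomic.LoadInt32(attempts), err
}

func main() {
	res, attempts, err := runDemo()
	fmt.Println(res, attempts, err)
}
```

The first attempt is abandoned after the 50 ms window and the second returns a value, which is the sequence the test is meant to assert.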

However, I noticed in my test that the in-memory Temporal server doesn’t actually time out the first activity when the heartbeat timeout is reached. I’ve created a minimal test case to demonstrate this behaviour:

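package example

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/stretchr/testify/suite"
	"go.temporal.io/sdk/testsuite"
	"go.temporal.io/sdk/workflow"
)

// UnitTestSuite runs workflows against the SDK's in-memory test environment.
type UnitTestSuite struct {
	suite.Suite
	testsuite.WorkflowTestSuite

	env *testsuite.TestWorkflowEnvironment
}

func (s *UnitTestSuite) SetupTest() {
	s.env = s.NewTestWorkflowEnvironment()
}

func TestUnitTestSuite(t *testing.T) {
	suite.Run(t, new(UnitTestSuite))
}

func (s *UnitTestSuite) Test_SimpleWorkflow_Success() {
	s.env.SetTestTimeout(time.Second * 10)

	s.env.RegisterActivity(ExampleActivity)
	s.env.RegisterWorkflow(ExampleWorkflow)

	s.env.ExecuteWorkflow(ExampleWorkflow)

	s.True(s.env.IsWorkflowCompleted())
	s.NoError(s.env.GetWorkflowError())

	var resp string
	err := s.env.GetWorkflowResult(&resp)
	s.NoError(err)
	s.Equal("hello", resp)
}

func ExampleWorkflow(ctx workflow.Context) (string, error) {
	var resp string
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		// Much shorter than the activity's 5-second sleep, so a missed
		// heartbeat should be detected long before StartToCloseTimeout.
		HeartbeatTimeout: 1 * time.Second,
	}
	innerCtx := workflow.WithActivityOptions(ctx, ao)
	err := workflow.ExecuteActivity(innerCtx, ExampleActivity).Get(ctx, &resp)
	return resp, err
}

// ExampleActivity deliberately never heartbeats and never returns.
func ExampleActivity(ctx context.Context) (string, error) {
	fmt.Println("Running activity...")
	for {
		time.Sleep(5 * time.Second)
	}
	// Unreachable in this minimal repro; kept to show the intended result.
	return "hello", nil
}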

I would expect the first attempt to start, print “Running activity...”, and then time out, after which another attempt would start and print “Running activity...” again. Instead, I only see “Running activity...” printed once.

Questions:

1. Does the in-memory Temporal server not respect heartbeat timeouts, as I suspect?
2. Is there a better way to test the resilience of workflows with heartbeat timeouts?
3. Is this structure of polling inside an activity, with a short sleep between polls, okay?

ehmt · October 30, 2023, 10:07pm

Aha, it looks like an issue has been created for this: “Heartbeat timeout not raised while testing” (Issue #1282, temporalio/sdk-go on GitHub).
