TestEnv Sleep Behavior

Hello all,

I am trying to write some unit tests for my Temporal code and am running into some odd behavior.

My Temporal workflow runs some jobs according to a schedule and uses Workflow.sleep() to wait between scheduled executions. I would like to write unit tests that verify this behavior works correctly over a period of virtual months or years.

I have been following these examples for guidance on how to write these tests: samples-java/src/test/java/io/temporal/samples at master (temporalio/samples-java on GitHub).

I’ve been hitting issues when I sleep my testEnv for too long. After much debugging, I determined that testEnv.sleep() returns instantly as long as the Temporal workflow is also sleeping, but if the testEnv sleeps longer than the workflow does, it blocks my test code.

For example: let’s say my Temporal workflow completes its work and sleeps 10 minutes. If I sleep the testEnv for 10 minutes, the testEnv sleep completes instantly. If I sleep it for 11 minutes, the first 10 minutes pass instantly, and then my test blocks in real time for the last minute.
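The arithmetic in that example can be restated as a small toy model. This is only a sketch of the observed behavior, not Temporal's implementation: the class `ObservedSleepModel` and its methods `skipped` and `blocked` are hypothetical names, and the rule "time is skipped only while the workflow's own sleep is pending" is an assumption drawn from timing the tests.

```java
import java.time.Duration;

/** Toy model of the observed behavior; hypothetical, not Temporal code. */
public class ObservedSleepModel {

    /** Portion of the test sleep that appears to pass instantly. */
    public static Duration skipped(Duration workflowSleep, Duration testSleep) {
        // Assumption: only the part that overlaps the workflow's sleep is skipped.
        return testSleep.compareTo(workflowSleep) <= 0 ? testSleep : workflowSleep;
    }

    /** Portion of the test sleep that appears to block in real time. */
    public static Duration blocked(Duration workflowSleep, Duration testSleep) {
        return testSleep.minus(skipped(workflowSleep, testSleep));
    }

    public static void main(String[] args) {
        Duration workflowSleep = Duration.ofMinutes(10);
        for (long minutes : new long[] {10, 11}) {
            Duration testSleep = Duration.ofMinutes(minutes);
            System.out.println("testEnv sleep " + testSleep
                + ": skipped " + skipped(workflowSleep, testSleep)
                + ", blocked " + blocked(workflowSleep, testSleep));
        }
    }
}
```

Under this model, an 11-minute test sleep against a 10-minute workflow sleep leaves 1 minute of real blocking, which matches what the test shows.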

Is this the intended behavior? It has been very painful to write these tests because trying to sync up the testEnv sleep time with the worker sleep time exactly right creates needless complications.

I’m not able to reproduce a problem.

  • What version of Temporal Java SDK do you use? If not the latest, would you try reproducing the problem with the latest version?
  • Could you post your unit test code?

I realized I was using version 0.27.0 of the SDK, but I’ve since upgraded to 1.0.0 and am still seeing the same issue. I’m currently working on getting upgraded to 1.0.6 (which I believe is the latest version).

As for posting the unit test code, I may be able to post it, but I’ll have to redact some things. Let me work on that.

So I upgraded temporal to 1.0.6 and I am still seeing the same problem. I’ve cleaned my test code so I can post it here. Most of the specifics of the workflows and activities have been scrubbed.

To give some context, REDACTED_WORKFLOW_1 asynchronously calls REDACTED_WORKFLOW_2 which asynchronously calls REDACTED_WORKFLOW_3. REDACTED_WORKFLOW_2 is the one that does most of the sleeping. In this case it sleeps for 10 minutes to ensure that REDACTED_WORKFLOW_3 ran successfully. This test passes, but it blocks for 1 minute due to the duration of testEnv.sleep(). I know that this 1 minute is unnecessary because if I change the plusMinutes(1) to plusSeconds(1), the test finishes after only 2 seconds.

Let me know if this is enough context or not.

protected TestWorkflowEnvironment testEnv;
protected Worker worker;
protected WorkflowClient workflowClient;

@Before
public void setUp() {
   initializeWorkflowClient(TASK_QUEUE, <REDACTED_WORKFLOW_1>.class);
   worker.registerWorkflowImplementationTypes(<REDACTED_WORKFLOW_2>.class);
   when(jobsActivities.monitorDataflowJobs(any())).thenReturn(true);
   worker.registerWorkflowImplementationTypes(<REDACTED_WORKFLOW_3>.class);

   DataflowJob dataflowJob = DataflowJob.newBuilder().setProjectId("0").build();
   DataflowJobs dataflowJobs = DataflowJobs.newBuilder().addJob(dataflowJob).build();
   when(activities.<REDACTED>(any())).thenReturn(dataflowJobs);
   when(activities.<REDACTED>(any())).thenReturn(dataflowJob);
}


public void initializeWorkflowClient(String taskQueue, Class<?> workflowImplementationClass) {
    testEnv = TestWorkflowEnvironment.newInstance();
    worker = testEnv.newWorker(taskQueue);
    worker.registerWorkflowImplementationTypes(workflowImplementationClass);
    workflowClient = testEnv.getWorkflowClient();
}


@Test
public void test_sendImmediately() throws IOException {
    <REDACTED_PROTOBUF_1> redactedProto1 =
        createRedactedProto1();
    <REDACTED_PROTOBUF_2> redactedProto2 =
        createRedactedProto2(redactedProto1);
    when(activities.<REDACTED>(any())).thenReturn(redactedProto1);

    worker.registerActivitiesImplementations(activities, jobsActivities);
    testEnv.start();

    WorkflowOptions options = WorkflowOptions.newBuilder().setTaskQueue(TASK_QUEUE).build();
    <REDACTED_WORKFLOW_1> workflow = workflowClient.newWorkflowStub(<REDACTED_WORKFLOW_1>.class, options);
    workflow.start(redactedProto2);

    testEnv.sleep(Duration.ofMinutes(10).plusMinutes(1));

    // run some assertions down here
}

You mentioned that workflows start each other asynchronously. How is it done?

ChildWorkflowOptions childOptions = ChildWorkflowOptions.newBuilder()
    .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
    .build();
REDACTED_WORKFLOW_2 redactedWorkflow2 = Workflow.newChildWorkflowStub(
    REDACTED_WORKFLOW_2.class, childOptions);

// Asynchronously start the workflow and wait for it to
// successfully start.
Async.procedure(redactedWorkflow2::start, updatedRequest);
Promise<WorkflowExecution> childExecution
   = Workflow.getWorkflowExecution(redactedWorkflow2);
childExecution.get();

If you could provide a repro that I could fork, that would really help in troubleshooting.

Unfortunately I cannot since my company’s repos are hosted on our intranet. I even had to get permission from my manager just to post the code snippets that I’ve provided here.

I see. I’ll try to reproduce this locally then.

As a workaround could you wait for the workflow completions instead of relying on sleep?

Yeah, I think that would work. Would you recommend doing that with awaitTermination(), or by implementing some kind of sleep loop where we loop until our workflow is completed and sleep the testEnv for 5s between checks?
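For reference, the sleep-loop option could be sketched roughly as follows. This is a generic helper, not part of the Temporal SDK: `SleepLoop`, `sleepUntil`, and the `maxSteps` guard are hypothetical names introduced for illustration, and the `main` method uses a plain counter as a stand-in for the test environment.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Hypothetical helper: poll a condition, sleeping a fixed step between checks. */
public class SleepLoop {

    /**
     * Calls sleeper with the given step until isDone reports true or maxSteps
     * sleeps have been performed. Returns true if the condition was met.
     */
    public static boolean sleepUntil(BooleanSupplier isDone, Consumer<Duration> sleeper,
                                     Duration step, int maxSteps) {
        for (int i = 0; i < maxSteps; i++) {
            if (isDone.getAsBoolean()) {
                return true;
            }
            sleeper.accept(step);
        }
        // One last check after the final sleep.
        return isDone.getAsBoolean();
    }

    public static void main(String[] args) {
        // Stand-in for the test environment: a counter of virtual seconds.
        long[] virtualSeconds = {0};
        boolean done = sleepUntil(
            () -> virtualSeconds[0] >= 600,            // pretend the workflow ends after 10 virtual minutes
            d -> virtualSeconds[0] += d.getSeconds(),  // stand-in for testEnv::sleep
            Duration.ofSeconds(5),
            1_000);
        System.out.println("done=" + done + " after " + virtualSeconds[0] + " virtual seconds");
    }
}
```

In the actual test, the sleeper would presumably be `testEnv::sleep`, and the completion check would be whatever signal the test has that the workflow finished. Note that if `testEnv.sleep()` blocks in real time once no workflow is sleeping, each extra iteration of this loop could still cost up to one step of wall-clock time.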

If you know the workflow id you can use the client to wait for its completion:

client.newUntypedWorkflowStub(workflowId...).getResult(...)

Hm, that seems to also be a blocking operation, unless I’m doing something wrong?

public void initializeWorkflowClient(String taskQueue,
  Class<?> workflowImplementationClass) {
  testEnv = TestWorkflowEnvironment.newInstance();
  worker = testEnv.newWorker(taskQueue);
  worker.registerWorkflowImplementationTypes(workflowImplementationClass);
  workflowClient = testEnv.getWorkflowClient();
}

public void sleepUntilEndOfWorkflow() {
    ListOpenWorkflowExecutionsRequest listOpenWorkflowExecutionsRequest =
        ListOpenWorkflowExecutionsRequest.newBuilder()
            .setNamespace(testEnv.getNamespace())
            .build();
    ListOpenWorkflowExecutionsResponse listOpenWorkflowExecutionsResponse =
        testEnv.getWorkflowService()
            .blockingStub()
            .listOpenWorkflowExecutions(listOpenWorkflowExecutionsRequest);
    List<WorkflowExecutionInfo> openWorkflowInfo =
        filterByTypeName(listOpenWorkflowExecutionsResponse.getExecutionsList(), "REDACTED_WORKFLOW_2");
    assertTrue("There should be only one running REDACTED_WORKFLOW_2", openWorkflowInfo.size() == 1);
    WorkflowExecution workflow = openWorkflowInfo.get(0).getExecution();
    workflowClient
        .newUntypedWorkflowStub(workflow, Optional.of(workflow.getWorkflowId()))
        .getResult(boolean.class);
}

It is by design a blocking operation as it waits for the workflow completion.

Nit: The second argument of the newUntypedWorkflowStub is the workflow type name, not its id. It is not a big deal as it is used only for error messages here.

Right, but my problem here is I need a non-blocking operation because I potentially need to wait days, weeks, or months for these workflows to complete. This is why testEnv.sleep() was appealing to me, but that one is blocking when there isn’t a workflow running, it seems.

It blocks until the workflow completes. Even if the workflow takes weeks to complete, the test environment can still skip time, so the test executes in milliseconds.

I’m sorry, I don’t understand. I just tried it out with a version of the workflow that runs immediately and then sleeps for 1 minute, and the test blocked for 1 minute while it waited for that sleep.

How can I skip time with this so I can complete the test in milliseconds?

What does it mean to run immediately and then sleep for 1 minute? What is the workflow code?

Workflow2 kicks off a child, workflow3, which runs some activities. In my unit test, the activities are mocked, so they just return success immediately.

Workflow2 sleeps an additional amount of time because in production the activities from workflow3 invoke an asynchronous service. The additional wait time allows for eventual consistency in that service. This wait time is configurable, and I have set it to just 1 minute for the purposes of the test. Workflow2 finishes by updating some metadata to mark the entire event complete and then terminates.

Note that this is a simplified version of how the whole system will work in production. Normally workflow2 would be kicking off multiple copies of workflow3 which will each sleep some amount of time before invoking their activities.

Is there any update here? Any success reproducing this issue in your environment? I’ve pivoted to working on something else, but I’m planning on coming back to this to try to reproduce it with a simpler case.