testEnv.sleep() not returning

I have a jest unit test that’s interacting with a workflow and it’s running multiple testEnv.sleep() and signals etc fine up to a certain point where the unit test doesn’t complete a simple await testEnv.sleep(1000 * 60);
Here is a larger snippet that includes console.logs and comments that show were it gets stuck.

  await handle.signal(Signals.applicationUpdated, {
    ...input,
    newStatus: 'IN_PROGRESS',
    previousStatus: 'IN_PROGRESS',
    updatedAt: dateAfterThirdSleep.getTime(),
    saveAndComeBack: true,
  });
  console.log(`xxx unit test - after update with saveAndComeBack: true`);
  // ^ last logged entry from the test thread.

  // wait for a minute to let the workflow digest the signal
  await testEnv.sleep(1000 * 60);
  console.log(`xxx unit test - after 1 min sleep`);
  // ^ this entry is not being logged for the entire duration the test runs (stuck) until the jest timeout fires.

At the time when the unit test is stuck, the workflow is waiting in a condition for subsequent updates (effect from signal handling) or timeout.

if (await wf.condition(() => !!applicationUpdate, msUntilExpiration)) {

I’m not anticipating any root causes in the workflow code here but that’s just me.

Thank you for any insights or debug options you might have.

Just to confirm, you are creating your test environment with { server: { type: 'time-skipping' } }, right?

How long does it takes for your test to fails due to timeout? Do you have other tests in that same file? If so, can you please try executing only that specific test, skipping all others?

Yes, I’m doing…
testEnv = await TestWorkflowEnvironment.createTimeSkipping();

The test fails after the defined timeout. In below case, it’s 40 secs. I can observe that the unit test is not moving until the 40 sec timeout.

describe('workflows/application/application-lifecycle.workflow', () => {
  it('runs through a happy path', async () => {
    ...
  }, 40000); // giving it plenty of time
});

I see. I have quite a few ideas of what could go wrong.

Just to help narrowing it down, can you please try the following:

  • Do you have any other tests in that same file? If so, can you please try executing only that specific test (ie. skip all others)?
  • Instead of using the Time Skipping test server, use the Local test environment; you may need to adjust the test timeout, as await test.sleep(1000 * 60) will literally

@jwatkins, this is my only test so no need to skip.

Regarding the time skipping test server, you’re suggesting to use TestWorkflowEnvironment.createLocal(). I’m reading that this doesn’t support time skipping which I’m heavily dependent on for my workflow. Timers wait for days, not milliseconds and I understand that test.sleep will actually sleep for the given amount of time in real-time. Therefore, it doesn’t seem to be a practical solution for this situation.

Do you have any other ideas why await testEnv.sleep(1000 * 60) would get stuck?

I’m not suggesting this as a solution, just trying to narrow down on possible causes. If that helps, we’ll know to work in that direction.

The workflow has a lifecycle that passes through multiple timers or timeouts before it expires or finalizes. I converted the workflow timers from days to seconds and executed my test in seconds instead to see if local test server without time skipping works.

Before I got there, I executed the test with time skipping (seconds instead of days) and it worked beyond the point where it gets stuck with when it’s set to ‘days’. I don’t see what the difference is that makes one work over the other.

@jwatkins, the failing unit test that’s using ‘days’ is now working as well.
I made a few changes and it’s unclear to me what exactly caused it to work (or fail). My suspicion is that it was the presence of an async signal handler. I moved the async function call out of the handler and into the workflow function and that may have been it.

I don’t understand how that could make the testEnv.sleep not return in the unit test code.

Ok, good. Thanks for confirming this. That’s pretty much what I was expecting.

Before going farther, let me give some context. Essentially, the time skipping server alternates between real time and fast forward execution mode. For instance, while a workflow task is being processed by the worker or while an activity is executing, a counter is incremented, which forces the server to fall back into real time mode. The server only resume fast forward mode once the counter decrements down to 0 (ie. no activity is running).

The fact that your workflow completes without issue if sleep time are reduce to a few seconds demonstrates that your workflow itself is technically correct (ie. the fact that is stops making progress is not due to a bug in your workflow). We can therefore deduce that, for some reason, in your test, the server’s counter is stuck at some non-zero value (ie. real time mode).

This could be due either to an improper usage of the Time Skipping’s service (for example: sleeping in an activity using the non-time-skipping-aware sleep function), or to a bug in the Time Skipping service (we are aware of a few specific cases where the counter may not be decremented properly).

If you need help in figuring out why this happening in your specific case, I will need you to provide more details on what happens in your test/workflow? For example:

  • Are you running multiple workflows concurrently in that same test environment?
  • Child workflows?
  • Continue-as-new?

Also, I suppose that you are very likely executing activities…

  • How long do they take to execute?
  • Does some of them fail or time out?
  • Do you sleep() inside those activities?

It would also be great if you can try to simplify your Workflow down to some minimal reproduction code.

I just moved the async code back into the signal handler and no, it’s still running through the previously blocked sleep. If I can isolate the change that solved the issue, I’ll let you know here.
Sorry for the confusion and thank you for offering your help!

You just posted at the same time so let me answer a couple of things…

yes, the structure of the workflow is using continueAsNew in multiple places.
The activities I’m executing are all mocked using jest mocks with simple mock impls like jest.fn().mockImplementation(async () => true).

My suspicion is that it was the presence of an async signal handler. I moved the async function call out of the handler and into the workflow function and that may have been it.

Do you mean that, previously, it was possible for the workflow to complete while there was non-completed activities? If so, then that could possibly be an instance of that issue.

The workflow activity that ran async in the signal handler started but it did not complete (console.log in the jest mock of this activity didn’t show). The workflow has 4 days of real-time time to complete but the time skipping that followed the unit test’s signal invocation might have skipped ahead while the async activity call in the handler was ongoing. Maybe that’s the situation that can cause the testEnv.sleep to block?

Update, I’m now able to confirm that putting the async execution back into the handler is causing the sleep to block.

Ok… That smells a lot like the issue I mentioned above. The only thing though is that this would not affect the workflow that started the activity / completed. That would affect other workflows on the same server (ie. running concurrently with the first workflow, or started afterward). Are you starting other workflows / child workflows?

I’m now able to confirm that putting the async execution back into the handler is causing the sleep to block.

Yeah… We strongly recommend not doing anything async in signal handlers, as in-progress async handlers will not prevent your workflow from completing. See my other answer here for more details on this subject.

I’m not starting child workflows nor other workflows but I’m using return wf.continueAsNew(newInput)

Oh, I see. Yes, continue-as-new would suffer from that same problem.

In that case, you have two possible solutions:

  1. Make sure that all activities by your workflow have completed before the workflow itself completes or does continue-as-new.
  2. Manually unblock the time skipping test server by calling unlockTimeSkipping() as often as required to reduce its lock counter down to zero (that should be once for each uncompleted activity at the moment the workflow completed), as shown below. Note that this uses internal APIs, which could change in a future release.
import { Connection } from '@temporalio/testing/lib/connection';

await (testEnv.connection as Connection).testService.unlockTimeSkipping({});

Ok, for now, moving the async handling out of the signal handler solved that problem so I won’t need to do that but thank you for that option will I ever need it. And I’m sure others will benefit from it too.

Thank you for your insight and help @jwatkins!