I am trying to write some unit tests for my Temporal code and am running into some odd behavior.
My Temporal workflow runs some jobs according to a schedule and uses Workflow.sleep() to wait in between scheduled executions. I would like to write unit tests that verify this behavior works correctly over a period of virtual months or years.
I’ve been hitting issues when I sleep my testEnv for too long. After much debugging, I determined that the testEnv sleeps instantaneously as long as the Temporal workflow is also sleeping, but if the testEnv sleeps LONGER than the workflow sleeps, it blocks my test code.
For example: Let’s say my Temporal workflow completes its work and sleeps 10 minutes. If I sleep the testEnv 10 minutes, then the testEnv sleep completes instantaneously. If I sleep 11 minutes, the first 10 minutes pass instantaneously, and then my test blocks for the last 1 minute of sleep.
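To put numbers on the example above, here is a toy model of the behavior as I observe it. It is only the arithmetic, not how the SDK is actually implemented; the class and method names (SleepBlockingModel, realBlocking) are made up for illustration, and it assumes time is skipped only while a workflow timer is pending.

```java
import java.time.Duration;

/** Toy model of the observed behavior; NOT the SDK's implementation. */
final class SleepBlockingModel {

    /**
     * Wall-clock time the test appears to block, assuming (as observed) that the
     * test environment skips time only while the workflow itself is sleeping.
     */
    static Duration realBlocking(Duration workflowSleep, Duration testEnvSleep) {
        Duration excess = testEnvSleep.minus(workflowSleep);
        // Anything beyond the workflow's own sleep is waited out in real time.
        return excess.isNegative() ? Duration.ZERO : excess;
    }

    public static void main(String[] args) {
        // Workflow sleeps 10 minutes, testEnv sleeps 10 minutes: no real waiting.
        System.out.println(realBlocking(Duration.ofMinutes(10), Duration.ofMinutes(10))); // PT0S
        // Workflow sleeps 10 minutes, testEnv sleeps 11 minutes: 1 minute of real waiting.
        System.out.println(realBlocking(Duration.ofMinutes(10), Duration.ofMinutes(11))); // PT1M
    }
}
```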
Is this the intended behavior? It has been very painful to write these tests because trying to sync up the testEnv sleep time with the worker sleep time exactly right creates needless complications.
I realized I was using version 0.27.0 of the SDK, but I’ve since upgraded to 1.0.0 and am still seeing the same issue. I’m currently working on getting upgraded to 1.0.6 (which I believe is the latest version).
As for posting the unit test code, I may be able to post it, but I’ll have to redact some things. Let me work on that.
So I upgraded Temporal to 1.0.6 and I am still seeing the same problem. I’ve cleaned up my test code so I can post it here. Most of the specifics of the workflows and activities have been scrubbed.
To give some context, REDACTED_WORKFLOW_1 asynchronously calls REDACTED_WORKFLOW_2 which asynchronously calls REDACTED_WORKFLOW_3. REDACTED_WORKFLOW_2 is the one that does most of the sleeping. In this case it sleeps for 10 minutes to ensure that REDACTED_WORKFLOW_3 ran successfully. This test passes, but it blocks for 1 minute due to the duration of testEnv.sleep(). I know that this 1 minute is unnecessary because if I change the plusMinutes(1) to plusSeconds(1), the test finishes after only 2 seconds.
Unfortunately I cannot since my company’s repos are hosted on our intranet. I even had to get permission from my manager just to post the code snippets that I’ve provided here.
Yeah, I think that would work. Would you recommend doing that with awaitTermination() or by implementing some kind of sleep loop where we loop until our workflow is completed and sleep the testEnv 5s in between checks?
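For the sleep-loop option, this is roughly what I have in mind. It is a minimal sketch with the Temporal pieces abstracted away: the helper name pollUntilDone, the isDone check, and the maxWait cap are all hypothetical, and the sleeper argument stands in for whatever advances time in the test.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Sketch of a "sleep in small steps until done" loop; names are hypothetical. */
final class PollUntilDone {

    /**
     * Sleeps in increments of step until isDone reports true or maxWait has been
     * slept in total. Returns the total amount of time handed to the sleeper.
     */
    static Duration pollUntilDone(BooleanSupplier isDone, Consumer<Duration> sleeper,
                                  Duration step, Duration maxWait) {
        Duration slept = Duration.ZERO;
        while (!isDone.getAsBoolean() && slept.compareTo(maxWait) < 0) {
            sleeper.accept(step);
            slept = slept.plus(step);
        }
        return slept;
    }

    public static void main(String[] args) {
        // Fake "workflow" that counts as complete after 12 seconds of virtual time.
        long[] virtualMillis = {0};
        Duration slept = pollUntilDone(
                () -> virtualMillis[0] >= 12_000,
                d -> virtualMillis[0] += d.toMillis(),
                Duration.ofSeconds(5),
                Duration.ofMinutes(1));
        System.out.println(slept); // PT15S
    }
}
```

In the real test I would pass testEnv::sleep as the sleeper and some check of the workflow's completion state as isDone. The loop can overshoot by at most one step (5s here), which is the most real blocking I would expect if the extra sleep is not skipped.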
It is by design a blocking operation as it waits for the workflow completion.
Nit: The second argument of the newUntypedWorkflowStub is the workflow type name, not its id. It is not a big deal as it is used only for error messages here.
Right, but my problem here is I need a non-blocking operation because I potentially need to wait days, weeks, or months for these workflows to complete. This is why testEnv.sleep() was appealing to me, but that one is blocking when there isn’t a workflow running, it seems.
I’m sorry, I don’t understand. I just tried it out with a version of the workflow that runs immediately and then sleeps for 1 minute, and the test blocked for 1 minute while it waited for that sleep.
How can I skip time with this so I can complete the test in milliseconds?
Workflow2 kicks off a child, workflow3, which runs some activities. In my unit test, the activities are mocked, so they just return success immediately.
Workflow2 sleeps an additional amount of time because in production the activities from workflow3 invoke an asynchronous service. The additional wait time is to allow for eventual consistency in the asynchronous service. This additional time is configurable, and I have configured it to just be 1 minute for the purposes of the test. Workflow 2 finishes by updating some metadata to mark the entire event complete and then terminates.
Note that this is a simplified version of how the whole system will work in production. Normally workflow2 would be kicking off multiple copies of workflow3 which will each sleep some amount of time before invoking their activities.
Is there any update here? Any success reproducing this issue in your environment? I’ve pivoted to working on something else, but I’m planning on coming back to this to try to reproduce it with a simpler case.