Open long-running workflows waiting on Workflow.await()

Suppose we have a million workflows waiting on a signal() / user input. In this case, especially in Java where threads are costly, we end up with a million workflow instances/runIDs waiting on a signal.
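A minimal sketch of one such workflow, assuming a hypothetical UserInputWorkflow interface (the names are illustrative, not from an actual codebase), where the workflow method parks on Workflow.await() until a signal delivers the input:

```java
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

// Hypothetical workflow: waits (possibly for days) until a user submits input via a signal.
@WorkflowInterface
public interface UserInputWorkflow {

  @WorkflowMethod
  String waitForInput();

  @SignalMethod
  void submitInput(String input);
}

class UserInputWorkflowImpl implements UserInputWorkflow {

  private String input;

  @Override
  public String waitForInput() {
    // Blocks this workflow until the signal handler below sets `input`.
    Workflow.await(() -> input != null);
    return input;
  }

  @Override
  public void submitInput(String input) {
    this.input = input;
  }
}
```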

  1. Does this mean we will have all these million workflow instances/runIDs as threads in a suspended state and in memory?
  2. WorkerFactoryOptions.Builder setWorkflowCacheSize(int workflowCacheSize) and
    WorkerFactoryOptions.Builder setMaxWorkflowThreadCount(int maxWorkflowThreadCount) help with caching and avoiding re-creation of workflows through replay, but with millions of workflows in mind, what would be acceptable values?
  3. If for some reason we hit the thread limits and workflows are purged/killed, and the signals arrive after a few days, will the workflow still be restarted and execution guaranteed? If yes, how is this achieved?

Hello @Abhijith_K

  1. Does this mean we will have all these million workflow instances/runIDs as threads in a suspended state and in memory?

No. Workers created from the same worker factory share up to MaxWorkflowThreadCount workflow threads. A worker can process many more workflow executions than that, because cached executions can be evicted and their threads reclaimed.

  2. WorkerFactoryOptions.Builder setWorkflowCacheSize(int workflowCacheSize) and
    WorkerFactoryOptions.Builder setMaxWorkflowThreadCount(int maxWorkflowThreadCount) help with caching and avoiding re-creation of workflows through replay, but with millions of workflows in mind, what would be acceptable values?

Those two settings matter, but so do the number of pollers, the maximum concurrent workflow tasks, and the maximum concurrent activity tasks, among others (please see Developer's guide - Worker performance | Temporal Documentation).

The right configuration depends on the nature of your workflows: how many of them will be running concurrently, and how many will mostly be sleeping/awaiting a signal? The number of worker replicas matters as well.
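A rough sketch of where these options are set (the values and the task queue name below are placeholders for illustration, not recommendations):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerTuningExample {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Options shared by all workers created from this factory.
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setWorkflowCacheSize(4000)        // sticky cache: executions kept in memory
        .setMaxWorkflowThreadCount(2000)   // cap on workflow threads across all workers
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Per-worker concurrency and poller settings (see the Worker performance guide).
    WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskExecutionSize(200)
        .setMaxConcurrentActivityExecutionSize(200)
        .setMaxConcurrentWorkflowTaskPollers(10)
        .setMaxConcurrentActivityTaskPollers(10)
        .build();

    Worker worker = factory.newWorker("user-input-task-queue", workerOptions);
    // worker.registerWorkflowImplementationTypes(UserInputWorkflowImpl.class);

    factory.start();
  }
}
```

Values like these need to be validated against your memory budget and the worker metrics under realistic load.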

  3. If for some reason we hit the thread limits and workflows are purged/killed, and the signals arrive after a few days, will the workflow still be restarted and execution guaranteed? If yes, how is this achieved?

The workflow history is persisted in the database. If a workflow execution is evicted from the cache (or your worker crashes), the same or another worker will pick up its next task; if it cannot find the execution in its cache (metric sticky_cache_miss), the worker fetches the workflow history and replays it to recover the execution state, then continues from there.

Let me know if it helps,

Thanks @antonio.perez for the help.