Is each Workflow.sleep and Workflow.await backed by a thread? If so, when is the thread relinquished?

Most of my workflows are long-running, and they either sleep for a specific period or await a signal/trigger.
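For reference, here is a minimal sketch of the kind of workflow I mean (illustrative only; the interface and names are made up, not my actual code):

    // Illustrative only: a long-running workflow that bootstraps, sleeps, then awaits a signal.
    // (The interface and implementation would normally live in separate files.)
    import io.temporal.workflow.SignalMethod;
    import io.temporal.workflow.Workflow;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;

    import java.time.Duration;

    @WorkflowInterface
    public interface LongRunningWorkflow {
      @WorkflowMethod
      void run();

      @SignalMethod
      void trigger();
    }

    public class LongRunningWorkflowImpl implements LongRunningWorkflow {
      private boolean triggered;

      @Override
      public void run() {
        // a little bit of bootstrapping here, then block
        Workflow.sleep(Duration.ofHours(1)); // sleeps for a specific period
        Workflow.await(() -> triggered);     // waits for a signal/trigger
      }

      @Override
      public void trigger() {
        triggered = true;
      }
    }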

When I create a workflow, it immediately enters sleep or await after doing a little bit of bootstrapping. If I submit 600 such workflows, I see that each of these waits/sleeps is actually backed by a thread from the workflow method pool, i.e. almost 600 threads,

and I see that all the threads are awaiting notification:

  "workflow-method": awaiting notification on [0x00000000f165ea08], holding [0x00000000f165ed58]
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
	at io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:83)
	at io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:410)
	at io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:45)
	at io.temporal.internal.sync.SyncWorkflowContext.await(SyncWorkflowContext.java:697)
	at io.temporal.internal.sync.WorkflowInternal.await(WorkflowInternal.java:309)
	at io.temporal.workflow.Workflow.await(Workflow.java:768)

I just see a huge number of threads, but I really don't see any issues or anything going wrong…
Is this normal?
Also, would a large number of threads cause CPU thrashing etc.?

How should I size my worker/Java process?

Apart from the huge number of threads waiting on notification, I don't really see anything else going wrong per se.

However, if I restart the process, the thread count does come down significantly (I am trying this on 1.0.5-SNAPSHOT, which already has a fix for the deadlock issue),

and the workflow still continues as normal. Is this because of the sticky cache in the worker? I see the sticky cache size in the SDK dashboard grow over time.

E.g. when I create 600 workflows I see:

When I restart the worker I see:

Overall, my sticky cache in the SDK dashboard looks like:

Is any tweaking required in the configs or Service/Client Options etc.?


TL;DR: This is normal and doesn't limit the number of parallel workflows that can be executed by the system.

Workflow Task

Every time the Temporal service receives a new external event such as a Workflow Start, a Signal, or an Activity Completion, it has to consult the workflow worker for the next commands to execute. Each such consultation is called a workflow task. After a task is completed, the workflow state can be removed from the SDK and recovered when the next workflow task needs processing.

Recovering the state requires shipping the whole workflow event history, which is a relatively expensive operation. So by default Temporal caches the workflow state in the SDK worker and only includes new events in workflow tasks.

Java SDK

The Java SDK uses real threads, so a cached workflow blocks one or more threads. This severely limits the number of workflows that can be cached in Java. But it doesn't limit the number of parallel workflows that can be executed by the system, as any cached workflow can be evicted from the cache when there are no threads available for other workflows to continue executing.

Questions

I just see a huge number of threads, but I really don't see any issues or anything going wrong…
Is this normal?

The number of threads is not huge, as the maximum number of them is limited by the FactoryOptions.MaxWorkflowThreadCount config. The default is 600 threads.
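For example, a rough sketch of setting these limits when creating the worker factory (method names assume the WorkerFactoryOptions builder in recent Java SDK versions; check the options class in your SDK version):

    // Sketch only: configuring the workflow thread and sticky cache limits on the worker factory.
    import io.temporal.client.WorkflowClient;
    import io.temporal.serviceclient.WorkflowServiceStubs;
    import io.temporal.worker.WorkerFactory;
    import io.temporal.worker.WorkerFactoryOptions;

    public class WorkerConfigExample {
      public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
        WorkflowClient client = WorkflowClient.newInstance(service);

        WorkerFactoryOptions factoryOptions =
            WorkerFactoryOptions.newBuilder()
                .setMaxWorkflowThreadCount(600) // cap on threads backing cached workflows (default 600)
                .setWorkflowCacheSize(600)      // sticky cache: workflow executions kept in memory
                .build();

        WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);
        // register workflow implementations on a worker and call factory.start() as usual
      }
    }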

Also, would a large number of threads cause CPU thrashing etc.?

As these threads are blocked, they shouldn't cause CPU thrashing. They can cause other issues like memory exhaustion, so don't change the config to a very large number of threads.

How should I size my worker/Java process?

It heavily depends on your use case. I would recommend scale testing your worker before going to production.

However, if I restart the process, the thread count does come down significantly (I am trying this on 1.0.5-SNAPSHOT, which already has a fix for the deadlock issue),
and the workflow still continues as normal. Is this because of the sticky cache in the worker? I see the sticky cache size in the SDK dashboard grow over time.

Yes, you are correct. After the restart the cache is empty. So no threads are consumed.


Thanks for the detailed explanation, maxim.
So what I need to ensure is that I have enough memory to hold 600 workflows/histories in the cache, and that the histories do not grow too deep.