Performance guidelines

Hi,

We are doing some performance test with temporal. Our first goal is to see the throughput we can get creating workflows and how we should scale it.

We are using the java SDK, and mysql, and currently just trying to understand what are the elements that will limit our throughput.

We’ve found 2 things we are trying to understand:

1.- Once we have started many workflows and they are running (right now just dummy activities), we see that in a thread dump we name a good number of threads waiting for a lock.
This is the stack from a dump:

"workflow-method" #294 prio=5 os_prio=31 tid=0x00007ffbae935000 nid=0x15607 waiting on condition [0x000070000fee9000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000717c102c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:83)
        at io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:406)
        at io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:45)
        at io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:84)
        at io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:74)
        at io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:44)
        at io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:59)
        at io.temporal.internal.sync.ActivityInvocationHandler$$Lambda$327/1133104136.apply(Unknown Source)
        at io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:65)

Any idea what is the reason for this locks and how can we scale this? We have a high number of threads locked on this condition.

2.- When starting instances, using WorkflowClient.start (workflow.&run **as** Functions.Func1<...>, arguments) we see it takes top to 1 second to start when we increase the number of ‘starts’ running in parallel.
With small concurrency (1-3) it is a bit faster (~300ms).

Any guidelines of what are the areas or settings we need to pay attention to is highly appreciated.

Kind regards

1 Like

Juan, would you be able to share the code that you’ve used for this benchmark as well as configuration that you’ve used for the test? (did you run everything on one box? how many workers, how many server nodes?)
Generally speaking if you see workflow start latency as high as 1 second it likely means that temporal server is overloaded.

  1. Any idea what is the reason for these locks and how can we scale this? We have a high number of threads locked on this condition.

These are cached workflows waiting on activity completion. In Java, you can control the number of workflows that are cached through WorkerFactoryOptions.workflowCacheSize. Cached workflows consume Java threads. So another option WorkerFactoryOptions.maxWorkflowThreadCount controls how many threads workflows can use. In the majority of cases, the number of cached workflows is limited by the number of available threads and not the cache size.

2.- When starting instances, using WorkflowClient.start (workflow.&run **as** Functions.Func1<...>, arguments) we see it takes top to 1 second to start when we increase the number of ‘starts’ running in parallel.
With small concurrency (1-3) it is a bit faster (~300ms).

How many history shards does your cluster use? To check you can use tctl:

./tctl admin cluster metadata

Any cluster that is configured for high throughput should have at least 200 shards per host.

Vitaly,

Thanks for your answer. The code is basically a http listener that just starts s workflow with the params it get from the request, We are trying to stress it with K6, right now only with as low as 6 Virtual Users.

We have 3 boxes for the tests. all loaded with the Http service and Temporal, and the load if balanced to the 3 temporal instances.
When you say ‘Temporal server is overloaded’ , what are you pointing to? Memory , I/O, CPU? All those params seem behaving well, so I was more looking at the DB (mysql), but don’t know how to see if that is the case.
What should I look at?
Thanks

-Juan

Can you check the shard count, that Maxim has pointed out above?

Maxim,

Thanks for yous answer.

I did update the number of shard to 512 and looks it behaves a bit better but still quite slow in a quick&dirty test on my local box.
In the docker-compose, the default value was 4 which seems might be reasonable for a dev box, but not for a prod box.
Are there any other settings in the docker-compose template that are just appropriate for a dev box and that will change if I use for a production environment?

Thanks

  • Juan

I put to 512 now.

What about the test on the real cluster? Did change to 512 help?

Will let you know the results once done on the cluster.

Maxim, Vitaly,

Thanks for your help. Shards and # on connections have improved the performance hugely.

The test we do is with a dummy activity that just logs the time, and we see that the limitations factor in how many we can start at the same time is more on the worker side that on the server now.
Any ideas of what parameters of the Java DSK worker factory affects to the start throughput?

Kind regards and thanks for your promptly help

I would recommend looking at metrics to understand where the bottleneck is. Start from the DB latency and error metrics.

1 Like

Thanks.
Looking at the DB (shards + conns) was the starting key. Then fine tuning and looks much better now.

Thanks for the support.

If you could share your setup and the numbers you are getting it might be helpful for others.

Absolutely.
We are still in the process of doing several performance, stress and throughput tests.
But so far, this is an overview of the scenario we are testing.

We are running on AWS:

  • MySql (Aurora) as database. On startup we used 1.5K shards, and each temporal node is using up to 300 conns

  • Temporal Cluster + Workers: 3 boxes with 4cores, 16GB- (C5.2xlarge):
    For this test, we have on each box we have one node for temporal and 1 node of our worker-

  • Our client is acting as worker for activities and also have a http listener to ‘start’ the workflow. It is based on Vertx 4

Right now we are just focusing on how many workflow we can start from HTTP request.

The first results after tuning with your help, was ~1000 starts/second, with a mead response time <100ms

We are still doing more tests to see what is the limiting factor and how to scale beyond this numbers.