Very big schedule to start workflow latency (Java SDK)

Hi,

We are doing perf testing for our workflow (java SDK) and have some strange results. In a nutshell, no matter how many workers we deploy, we still have at least 5 sec workflow schedule-to-start latency (tried 5, 10, 20 workers).
worker CPU / memory usage is very low.

I tested different values for wf cache size; wf/activity concurrent exec limit;’ polling thread number for wf/act - with no effect.

When we have sync poll percentage around 80-90% and better - there’s still 5 sec latency.
when sync poll falls to 40-50%,the latency increases to 20-40 sec!!

Our main WF has 5 normal activities and 2 local, and it starts another child WF with 2 normal activities (without waiting for its completion).
In our tests I emulate calls to ext services in activities via Thread.sleep 100-300ms. Activities / WF are not CPU-bound.
Tested with loads 50-80 main wf/s. Max achieved state transition rate is 3100.

(Cassandra CPU is under 60-70% -with 12 i3.2xlarge nodes; other server-side metrics looks OK to me.
I can provide more detailed metrics / cluster specs if needed)

Appreciate any ideas / explanations why it works this way…
Is it expected?
Can the situation be improved with separate workers for WF and activities?
What else can be done to solve this?
Thanks!

1 Like

Hi @Maxim_Chuvilyaev
you might be running into a bottleneck with the number of task queue partitions.

Task queue partitions can be set via “matching.numTaskqueueWritePartitions” and “matching.numTaskqueueReadPartitions” in dynamic config.
Dynamic config for docker compose
Dynamic config for helm charts
Dynamic config knobs on server side:
read
write

Without knowing the number of workflow tasks you are running / second it’s hard to tell what number you could tune that to, but try updating the default 4 to 10 for example and see if that makes a difference for you.

On the server metrics side I think you can also look at all the different matching service metrics. poll_success_sync and poll_success should give you and idea of the overall sync-match rate and then compare it when you tune the numbers.

Hope this helps.

Hi @tihomir ,

Thanks for your answer!

This situation drives me crazy already.
We tested with 8 and 15 partitions without any noticeable effect. Now we are back to 8.

We observe a real big difference in behaviour between golang and java workers.
Our test workflow has same structure in both cases.

When we scaled up cassandra (6 nodes 16 vCPU each) and increased history temporal nodes count, we can get up to 6000 state transitions / sec in golang client (with excellent sync poll > 99%)

But whatever we do with Java SDK worker (and that’s is our main platform) - we cannot go higher than 3000 state/sec :frowning: :frowning: . Cassandra load is about 40% CPU in this case.
I tried 5, 10, 20 worker instances without any output gain…

The only difference in metrics (besides that perf rate) that I was able to notice is that for Java SDK p90 PollActivityTaskQueue is 1 min under load. For golang it is several secs (even p95)

And like I said, we have huge client workflow task schedule-to-start latencies - last run up to 90 sec.
Same metric for activities stays around 2-3 sec max

We cannot just switch to golang for development, all our main services are in Java / kotlin.

Regards,
Maxim

Can you share info like server version, and go and java sdk versions you are using?
Can you also share your worker options? For example,
do you have in your test a large number of concurrent activity executions and do you set
WorkerOptions.maxConcurrentActivityExecutionSize? Any info could help.

You also mentioned you use Kotlin, can you share in what way?
Lastly do you see any possible issues on your jvm side?

Main performance difference between go and java SDKs comes from the fact that in java we have to allocate a thread in the thread pool for every cached (sticky) workflow, while in go we use more lightweight coroutines. When in java main thread pool is maxed out, SDK will try to purge older workflows in the LRU manner, however this garbage collection is not always fast and efficient.
PollActivityTaskQueue being at 1min indicates that there is not enough work on the activity task queue and activity workers are timing out on the long poll, which could indicate that workflow tasks are not getting processed enough and are not scheduling enough activities. I would recommend increasing number of concurrent workflow tasks by setting higher value on the maxConcurrentWorkflowTaskExecutionSize property in the WorkerOptions and seeing if it would change your performance profile. If workflow task polling is a bottleneck (I doubt it) then you may try increasing workflowPollThreadCount configuration.
I suggest looking at these two metrics:
temporal_workflow_task_execution_total_latency - will tell how fast workflow tasks are getting processed.
temporal_workflow_active_thread_count - this one will tell you how many threads in the thread pool are busy.
temporal_sticky_cache_total_forced_eviction - number of evicted workflows, if it’s too high then the system isn’t running very efficiently.

Thanks everyone for suggestions!

Some answers:

  • Versions in use: temporal server: 1.11.3; Java SDK: 1.1.0
  • Worker options:
    • activity poll thread count: 20 (also tested with default and with 100)
    • workflow poll thread count: 20 (also tested with default)
    • tested different values for max concurrent workflow task exec size (from default up to 2000)
    • tested different values for max concurrent activity exec size (up to 2000)
  • Factory options:
    • wf host local poll thread count : 20 (tested up to 100)
    • wf cache size and max threads: default (tested up to 2000)
  • Temporal task queue partitions set to 8 (but tested up to 15); we have 8192 shards.

We use kotlin for everything - WF and activities implementations etc. In our real WF we have to use java for async calls (when we run child WF and when we start 2 activities in parallel), otherwise tracing doesn’t work correctly (see Open tracing span context not propagated when activity or child workflow invoked asynchronously · Issue #537 · temporalio/sdk-java · GitHub - it kind of fixed but looks like not released yet). Anyway, in test WF we don’t use java at all.
I even tested coroutine wrappers around activities (with async completion) - that changes the profiling result, but doesn’t make things better. Internally almost all our real activities are coroutine-based.
In test activities I just use Thread.sleep (or delay() with coroutines)

Everything is deployed into our k8s (AWS EKS)
As we use NewRelic, I even tried to do remote JVM profiling, but it just shows, that 90% of time threads are in the blocked state :smiley:

Latest perf observations (load 80 parent wf/s, total 20 000 requests, 10 concurrent clients - using Maru framework):

  • Increasing maxConcurrentWorkflowTaskExecutionSize doesn’t improves performance.
  • temporal_workflow_task_execution_total_latency - max around 500ms, avg 20 ms
  • temporal_workflow_active_thread_count - this is a very strange metric, seems that it is only reported under heavy load, and values are around 10-150.
  • temporal_sticky_cache_total_forced_eviction - this one is 200 for child wf, 600 for parent wf
  • sync/async poll percentage is very bad, around 20%
  • huge schedule-to-start latencies
  • state transitions / sec - 3100 max

Everything works very good with 50 wf/s load, OK with 60 wf/s. With 80 wf/s - see results.
Adding more workers doesn’t help at all. Tested with 5, 10, 20 workers.

We created a similar WF structure with golang. Results are MUCH better. As I mentioned, we can get up to 6000 state transitions/sec with sync/async poll rate around 99% and quite good response to scaling.

This is the most puzzling - why golang workers can scale, but Java-based - no effect?
We can try implementing everything in pure java, with no kotlin - but I still don’t get why scaling doesn’t help.

We are obviously hitting some bottleneck(s), but have no idea what exactly :frowning:

Can you show activity metrics?
activity_schedule_to_start_latency, activity_poll_no_task, activity_execution_latency, activity_endtoend_latency, activity_execution_failed
Also interesting to compare workflow_task_schedule_to_start_latency between go and java.

  • activity_schedule_to_start is in the top right corner (pinkish line)
  • activity_endtoend_latency is in the bottom left :slight_smile:

I would think that activity_execution_failed is 0 (as it just Thread.sleep()), but will double check that.
Will provide other latencies soon.

Currently our Go worker doesn’t send any client metrics (unless this is supported by Maru?.. no idea, will ask). But the fact that sync/async poll is excellent, and all graphs are smooth - I would think that it should be OK as well…

PS
The metrics graphs that I provided are from NR dashboard. Our worker uses micrometer for all metrics (not just temporal), then we export those directly to NewRelic.

UPDATE 1 (I had some wrong headings on some charts, updated those):

UPDATE 2 (added more graphs, fixed more headings)

We are seeing similar things, and tried a lot of different knobs, without any significant success.

After scaling out temporal history nodes from 6 to 12, now we can get the double performance (6K+ state transition/sec). Which is very good, but one question is still valid - adding more workers doesn’t really help…
Also, we have now quite different client metrics. Workflow schedule-to-start latency is now fixed at 5 sec max (which is probably related to workflowHostLocalTaskQueueScheduleToStartTimeout). So when a worker is busy, the workflow task is jumping from one worker to another, and total execution latency is about 40 sec max.

It’s quite possible that now we can play with worker settings once again, but without deep understanding of Java SDK & temporal server internals, it looks like a cargo cult :smiley: