Thanks everyone for the suggestions!
- Versions in use: Temporal server 1.11.3; Java SDK 1.1.0
- Worker options (see the worker setup sketch after this list):
- activityPollThreadCount: 20 (also tested with the default and with 100)
- workflowPollThreadCount: 20 (also tested with the default)
- maxConcurrentWorkflowTaskExecutionSize: tested different values, from the default up to 2000
- maxConcurrentActivityExecutionSize: tested different values, up to 2000
- Factory options:
- workflowHostLocalPollThreadCount: 20 (tested up to 100)
- workflowCacheSize and maxWorkflowThreadCount: defaults (tested up to 2000)
- Temporal task queue partitions are set to 8 (tested up to 15); we have 8192 history shards.
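For concreteness, the worker wiring looks roughly like this (a minimal Kotlin sketch; the task queue name and the registered implementation classes are placeholders for our real ones, and the concrete values are just examples from the ranges listed above):

```kotlin
import io.temporal.client.WorkflowClient
import io.temporal.serviceclient.WorkflowServiceStubs
import io.temporal.worker.WorkerFactory
import io.temporal.worker.WorkerFactoryOptions
import io.temporal.worker.WorkerOptions

fun main() {
    // In the real deployment the stubs point at our cluster; newInstance() is just the local default.
    val service = WorkflowServiceStubs.newInstance()
    val client = WorkflowClient.newInstance(service)

    val factoryOptions = WorkerFactoryOptions.newBuilder()
        .setWorkflowHostLocalPollThreadCount(20) // tested up to 100
        // workflowCacheSize / maxWorkflowThreadCount left at defaults (tested up to 2000)
        .build()

    val workerOptions = WorkerOptions.newBuilder()
        .setActivityPollThreadCount(20)                 // also tested default and 100
        .setWorkflowPollThreadCount(20)                 // also tested default
        .setMaxConcurrentWorkflowTaskExecutionSize(200) // tested from the default up to 2000
        .setMaxConcurrentActivityExecutionSize(200)     // tested up to 2000
        .build()

    val factory = WorkerFactory.newInstance(client, factoryOptions)
    val worker = factory.newWorker("perf-test-queue", workerOptions)
    worker.registerWorkflowImplementationTypes(ParentWorkflowImpl::class.java) // child WF impl registered the same way
    worker.registerActivitiesImplementations(TestActivitiesImpl())
    factory.start()
}
```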
We use Kotlin for everything - workflow and activity implementations, etc. In our real WF we have to use Java for the async calls (when we run a child WF and when we start 2 activities in parallel), otherwise tracing doesn't work correctly (see Open tracing span context not propagated when activity or child workflow invoked asynchronously · Issue #537 · temporalio/sdk-java · GitHub - it seems to be fixed, but the fix doesn't look released yet). Anyway, in the test WF we don't use Java at all; its structure is roughly like the sketch below.
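For reference, the test parent WF has roughly this shape (a simplified Kotlin sketch; interface/method names, inputs and timeouts are made up, not our real code):

```kotlin
import io.temporal.activity.ActivityInterface
import io.temporal.activity.ActivityOptions
import io.temporal.workflow.Async
import io.temporal.workflow.Promise
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod
import java.time.Duration

@ActivityInterface
interface TestActivities {
    fun doWork(input: String): String
}

@WorkflowInterface
interface ChildWorkflow {
    @WorkflowMethod
    fun run(input: String): String
}

@WorkflowInterface
interface ParentWorkflow {
    @WorkflowMethod
    fun run(input: String): String
}

class ParentWorkflowImpl : ParentWorkflow {
    private val activities = Workflow.newActivityStub(
        TestActivities::class.java,
        ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofSeconds(30)).build()
    )

    override fun run(input: String): String {
        // Two activities started in parallel
        val a1: Promise<String> = Async.function(activities::doWork, "$input-a1")
        val a2: Promise<String> = Async.function(activities::doWork, "$input-a2")

        // Child workflow also started asynchronously
        val child = Workflow.newChildWorkflowStub(ChildWorkflow::class.java)
        val childResult: Promise<String> = Async.function(child::run, input)

        Promise.allOf(a1, a2, childResult).get()
        return childResult.get()
    }
}
```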
I even tested coroutine wrappers around activities (with async completion) - that changes the profiling picture, but doesn't make things any better. Internally, almost all our real activities are coroutine-based.
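By a coroutine wrapper with async completion I mean roughly the following (a simplified sketch; the completion client would come from WorkflowClient.newActivityCompletionClient(), and the class/field names are placeholders):

```kotlin
import io.temporal.activity.Activity
import io.temporal.client.ActivityCompletionClient
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

// Completes the activity asynchronously from a coroutine instead of blocking the executor thread.
class AsyncCompletingActivitiesImpl(
    private val completionClient: ActivityCompletionClient, // WorkflowClient.newActivityCompletionClient()
    private val scope: CoroutineScope
) : TestActivities {

    override fun doWork(input: String): String {
        val ctx = Activity.getExecutionContext()
        val taskToken = ctx.taskToken
        ctx.doNotCompleteOnReturn() // the SDK will not report completion when this method returns

        scope.launch {
            delay(100)                                   // simulated coroutine-based work
            completionClient.complete(taskToken, "done") // report the result asynchronously
        }
        return "" // ignored because of doNotCompleteOnReturn()
    }
}
```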
In the test activities I just use Thread.sleep() (or delay() with coroutines).
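So the baseline test activity is essentially just this (simplified):

```kotlin
// Baseline test activity: it just burns wall-clock time.
class TestActivitiesImpl : TestActivities {
    override fun doWork(input: String): String {
        Thread.sleep(100) // or runBlocking { delay(100) } in the coroutine variant
        return "done"
    }
}
```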
Everything is deployed to our Kubernetes cluster (AWS EKS).
Since we use New Relic, I even tried remote JVM profiling, but it only shows that threads are in the blocked state ~90% of the time.
Latest perf observations (load: 80 parent wf/s, 20,000 requests in total, 10 concurrent clients, using the Maru framework):
- Increasing maxConcurrentWorkflowTaskExecutionSize doesn't improve performance.
- temporal_workflow_task_execution_total_latency: max around 500 ms, avg 20 ms
- temporal_workflow_active_thread_count: a very strange metric - it seems to be reported only under heavy load, with values around 10-150
- temporal_sticky_cache_total_forced_eviction: around 200 for the child WF, 600 for the parent WF
- sync/async poll percentage is very bad, around 20%
- huge schedule-to-start latencies
- state transitions/sec: 3100 max
Everything works very well at 50 wf/s, OK at 60 wf/s. At 80 wf/s, see the results above.
Adding more workers doesn’t help at all. Tested with 5, 10, 20 workers.
We built a similar WF structure in Go, and the results are MUCH better. As I mentioned, we get up to 6000 state transitions/sec with a sync/async poll rate around 99%, and it responds well to scaling.
This is the most puzzling part: why do the Go workers scale, while adding Java-based workers has no effect?
We could try implementing everything in pure Java, without Kotlin, but I still don't understand why scaling out doesn't help.
We are obviously hitting some bottleneck(s), but we have no idea what exactly.