I read various topics on the community forum and was able to enable metrics on the client side (the side that hosts the Workflow and Activity workers).
- Is there a metric exposed by the Temporal server through which we could see the size of a task queue? We are planning to plot it in Grafana to check whether the backlog is okay on each task queue (similar to Kafka topic lag).
- On the SDK / client side, I could see “workflow_active_thread_count”, which I believe tells us how busy the workflow threads are (please correct me if I am wrong). In a similar way, is there a metric to see how busy the activity workers are?
- In our performance test we see the latencies below. What other metrics can we look at to understand what is happening?
In Frontend:
StartWorkflowExecution → mostly above 5 seconds, peaking at 15 seconds
SignalWorkflowExecution → 100-400 milliseconds
In Matching:
AddWorkflowTask → peaking at 2 minutes, with the average around 50 seconds
AddActivityTask → 50 ms to 300 ms
SDK: Java
Number of history shards: 512
All server components (Matching, Worker, History, Frontend) are hosted with:
CPU max: 1000m
Memory max: 800MB
Matching: 3 instances
History: 6 instances
Worker: 3 instances
Frontend: 3 instances
We are enabling the client-side metrics to see what's going on, but we would also like some input on the server-side latencies above so we know what to start checking.
For activity workers you can use the temporal_worker_task_slots_available SDK metric with the label worker_type="ActivityWorker". This can help you tune your WorkerOptions#getMaxConcurrentActivityExecutionSize setting.
Note that your activity code can start as many threads as it wants, so if you also want to monitor how many threads your activity code starts, you would need to use JVM metrics.
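For reference, here is a minimal sketch of where that option lives in the Java SDK (the task queue name and the value 200 are placeholders, not a recommendation):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class ActivityWorkerTuningSketch {
  public static void main(String[] args) {
    // Connects to a local Temporal frontend; adjust for your cluster.
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Placeholder value: raise or lower it while watching
    // temporal_worker_task_slots_available{worker_type="ActivityWorker"}.
    WorkerOptions options =
        WorkerOptions.newBuilder().setMaxConcurrentActivityExecutionSize(200).build();

    Worker worker = factory.newWorker("my-activity-task-queue", options);
    // worker.registerActivitiesImplementations(new MyActivitiesImpl());
    factory.start();
  }
}
```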
Thanks. Yet to monitor “temporal_worker_task_slots_available”; will do that. Update: I don't see this getting pushed to Prometheus. Is this a Go SDK only metric?
We see some strange behavior: we have 6 history nodes running and only one of them is at 100% CPU (4) while the others are at <1%, with 512 shards. Is there anything we are missing? We are wondering whether sharding is pushing everything to one node (which would be very strange, as the workflow id is a random UUID), or whether the frontend somehow sees only one history node and not the others for some reason.
Is there anything we can check to rule out both of the above behaviors? Or can we see the shard ranges assigned to each history node somewhere, for example in the database?
It's emitted from the Java SDK as well. Do you have metrics enabled for your workers?
If you run the metrics sample you should see it being emitted on http://localhost:8080/prometheus
for example:
temporal_worker_task_slots_available{activity_type="none",exception="none",namespace="default",operation="none",query_type="none",signal_name="none",status_code="none",task_queue="metricsqueue",workerCustomTag1="workerCustomTag1Value",workerCustomTag2="workerCustomTag2Value",worker_type="WorkflowWorker",workflow_type="none",} 200.0
Which SDK version are you using? I think it needs to be 1.8.0 or greater (see the release notes in Release v1.8.0 · temporalio/sdk-java · GitHub, where this metric was added).
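For reference, here is a minimal sketch of how such a scrape endpoint can be exposed from a Java worker process (this assumes a Micrometer PrometheusMeterRegistry and uses the sample's port 8080 and /prometheus path):

```java
import com.sun.net.httpserver.HttpServer;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class PrometheusScrapeEndpointSketch {
  // Serves the contents of the Micrometer registry that the Temporal
  // metrics scope reports into (see the metrics sample for the full wiring).
  public static HttpServer start(PrometheusMeterRegistry registry) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext(
        "/prometheus",
        exchange -> {
          byte[] body = registry.scrape().getBytes();
          exchange.sendResponseHeaders(200, body.length);
          try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
          }
        });
    server.start();
    return server;
  }
}
```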
I am on 1.7.1 - that might be the issue. Will update, thanks.
If you could help me with the other observation above, about only one of the history servers being much busier than the rest, that would be great.
If you could help me with the other observation above, about only one of the history servers being much busier than the rest
Even though there is no guarantee that you will have an even distribution regardless of the number of history services used, this high a discrepancy is likely not caused by uneven load.
One way this could happen is if you use a very small number of shards. Otherwise I would look into a possible misconfiguration.
But we would also like some input on the server-side latencies above so we know what to start checking.
Our server team provided the following input:
- The performance numbers do not look good. Frontend API latency at p95 should be below 100ms.
- For the matching service, AddWorkflowTask should ideally be 50ms to ~100ms.
- For the matching service, AddActivityTask might be slightly higher if there is contention between multiple concurrent activities for the same workflow execution.
- You should look at where your bottleneck lies. This could be compute on the server nodes (CPU / memory / network) or possibly persistence-layer latencies.
Thanks @tihomir for the input. We will look at the shards (I thought we already have a fair number of shards at 512) and also at the compute on the server nodes.
@tihomir Can I confirm a few of the metrics we are seeing based on the above recommendation, please?
We see activity execution latency and activity end-to-end latency.
- I would like to confirm that activity execution latency is the time taken to execute the activity code (after the task gets picked up by a thread from the activity worker's thread pool).
- And what is activity end-to-end latency, please? Is that the full round trip: activity task creation, placing the task on the queue, the activity poller picking it up and executing it, placing the activity task completion on the queue, and the response coming back to the workflow?
- In our case the execution latencies are in the range of 50-60 milliseconds, but the end-to-end latency is higher, around a second. I am assuming we need to check the persistence latency on the history service to see if that is adding the extra time to the end-to-end latency. Is there anything else we should check to see why the end-to-end latencies are higher when the actual execution took only 60 milliseconds?
- In the same context, can you please confirm whether the workflow execution latency covers the workflow code execution alone, from the time the workflow task got picked up until the task was completed?
- We are also seeing that schedule-to-start is high for workflows but okay for activities. I am guessing we might need to tune the workflow workers alone and not the activity workers (as we have different task queues for activities and workflow tasks).
- While tuning the workflow worker we see that there is a localHostPollThreadCount. May I get some clarity on this, please? What does it actually represent? With activities, pollers and thread pools are straightforward, since there is only a remote task queue. With workflows we see two thread counts: one is workflowWorkerPollThreadCount, which is the poller count that gets tasks from the remote task queue and places them on the internal executor, but I was unable to understand localHostPollThreadCount (a rough sketch of how we are currently wiring these options is included below).
Could I get some clarity on the above points, please? That will help us tune accordingly to increase throughput.
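For reference, this is roughly how we are wiring these options today (the builder method names are taken from WorkerFactoryOptions / WorkerOptions in the Java SDK; the numbers below are placeholders, not our real values):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkflowWorkerTuningSketch {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level settings: workflow thread pool, sticky cache size, and the
    // host-local (sticky queue) poller count we are asking about above.
    WorkerFactoryOptions factoryOptions =
        WorkerFactoryOptions.newBuilder()
            .setMaxWorkflowThreadCount(600)
            .setWorkflowCacheSize(600)
            .setWorkflowHostLocalPollThreadCount(5)
            .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Worker-level settings: pollers against the remote workflow task queue and
    // the number of workflow tasks executed concurrently.
    WorkerOptions workerOptions =
        WorkerOptions.newBuilder()
            .setWorkflowPollThreadCount(5)
            .setMaxConcurrentWorkflowTaskExecutionSize(200)
            .build();

    Worker worker = factory.newWorker("workflow-task-queue", workerOptions);
    // worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    factory.start();
  }
}
```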
@tihomir / @maxim Can I get some views on the above points, please?
This may be a stupid question: I am using the Java SDK; are these metrics automatically instrumented for all workers behind the scenes? I have a Temporal cluster with Prometheus and Grafana, and the only metrics recorded are for namespace=temporal_system. What do I need to do to see my activity metrics recorded?
This may be a stupid question: I am using the Java SDK; are these metrics automatically instrumented for all workers behind the scenes?
For SDK metrics you have to enable them (for both your workers and your client APIs); see the sample here, specifically here and here.
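For context, a minimal sketch of that enablement step as I understand it from the sample: build a Tally scope backed by a Micrometer Prometheus registry and pass it to the service stubs used by your workers and clients (class and method names are from the Java SDK and samples; the report interval is just an example):

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import com.uber.m3.util.Duration;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class SdkMetricsSetupSketch {
  public static void main(String[] args) {
    // Micrometer registry that a Prometheus scrape endpoint can later serve.
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope that the SDK reports its metrics into.
    Scope metricsScope =
        new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(Duration.ofSeconds(10));

    // Pass the scope to the service stubs; workers and clients built from these
    // stubs will then emit SDK metrics such as temporal_worker_task_slots_available.
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder().setMetricsScope(metricsScope).build());
  }
}
```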
Thanks @tihomir for the link. A follow-up question: what do I need to change in the Helm chart (the one provided by Temporal) so that Prometheus scrapes the metrics exposed by my workers and client? Apologies for the basic question, as I am new to Prometheus.
To set up additional Prometheus scrape points via the mentioned Helm charts, don't use the provided Prometheus config:
--set prometheus.enabled=false \
and instead install the Prometheus Operator and use a ServiceMonitor to configure all the needed scrape points.