Difference between user and no user service latency metrics?

nitesh237 · April 18, 2022, 10:36am

Hi team,

Can someone help with the difference between these two metrics?

service_latency_nouserlatency
service_latency_userlatency

Context: We are running load tests, and it seems like service_latency_nouserlatency is causing the major bottleneck in the history service. Understanding these metrics will help us identify the bottleneck in the cluster.

~Nitesh

tihomir · April 18, 2022, 11:05pm

service_latency_userlatency

This metric measures latency to obtain the per-workflow lock.

service_latency_nouserlatency

This is the overall service latency minus service_latency_userlatency.
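
To make the relationship concrete, here is a minimal sketch with made-up numbers. It is not Temporal server code; the function and variable names are hypothetical and only illustrate the subtraction described above.

```python
def no_user_latency(total_latency_ms: float, user_latency_ms: float) -> float:
    """Overall latency minus the time spent waiting for the per-workflow lock."""
    return total_latency_ms - user_latency_ms


# Hypothetical sample: a request took 120 ms in total,
# of which 45 ms was spent waiting for the workflow lock.
total_ms = 120.0
lock_wait_ms = 45.0

remaining_ms = no_user_latency(total_ms, lock_wait_ms)
print(remaining_ms)
```

Read this way, a high service_latency_nouserlatency means the time is going somewhere other than waiting for the workflow lock.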

seems like service_latency_nouserlatency is causing the major bottleneck in the history service

This can be caused by workflows scheduling too many activities concurrently. Are there any other metrics that stick out? There could be other causes as well. Are there any error logs on the DB or the history service?

Yimin_Chen · April 19, 2022, 4:26am

Which API calls are experiencing the high latency?
In general, the first thing to check is persistence_latency; specifically, look for operation="UpdateWorkflowExecution". If persistence_latency is high, you need to scale out your persistence layer.
The second thing to check is CPU usage on your history service instances: are they saturating their CPU limit?

nitesh237 · April 19, 2022, 4:42am

Thanks, folks, for the replies. These were useful.
I suspected the history service CPU was the bottleneck, so I reran the same benchmark with 2x the nodes. CPU utilization for the history service pods now seems to be in the 40-50% range.
However, I still see a difference between service latency and DB latency, as can be seen below.

@tihomir, to answer your question: task queue latency (both task_latency_queue and task_latency_queue_nouserlatency) and task processing latency (task_latency_processing) seem to be sticking out. Do we know what could lead to this?

The persistence latency seems to be in a decent range here, @Yimin_Chen.

Based on this, I checked the CPU utilization of my workflow/activity workers, which was in the 10-15% range. Is there anything else I should be looking into?

Yimin_Chen · April 19, 2022, 3:55pm

Could you group the service_latency metric by operation? Also check whether there are any service_errors metrics. (Right now there are different service_errors_xxx metrics for different types of errors, but we will consolidate them into one metric with an error type tag.)

nitesh237 · April 19, 2022, 7:56pm

Hi Yimin,

My bad. Please find attached the metrics grouped by operation.

Found error graphs for the following:

service_errors_entity_not_found

client_errors

service_errors_resource_exhausted from frontend

Yimin_Chen · April 20, 2022, 5:08pm

service_errors_resource_exhausted is what is causing trouble now. Look at the resource_exhausted_cause of that metric. There are two possible causes: RpsLimit or ConcurrentLimit. Both are set through dynamic config on the frontend.
frontend.rps → rps limit overall
frontend.namespaceRPS → rps limit per namespace
frontend.namespaceCount → concurrent poller limit per namespace
Adjust these three configs. They are all per-frontend-instance limits.
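
Because the limits apply per frontend instance, the cluster-wide ceiling depends on how many frontend instances you run. A rough sketch of that arithmetic, assuming traffic is spread evenly across instances; the numbers and the helper name are hypothetical and are not Temporal defaults:

```python
def cluster_wide_limit(per_instance_limit: int, frontend_instances: int) -> int:
    """Approximate aggregate limit when load is balanced evenly across frontends."""
    return per_instance_limit * frontend_instances


# Hypothetical per-instance values for the three dynamic config keys.
per_instance = {
    "frontend.rps": 1000,            # overall RPS limit
    "frontend.namespaceRPS": 500,    # RPS limit per namespace
    "frontend.namespaceCount": 300,  # concurrent poller limit per namespace
}
frontend_instances = 3  # assumed number of frontend instances

totals = {
    key: cluster_wide_limit(limit, frontend_instances)
    for key, limit in per_instance.items()
}
for key, total in totals.items():
    print(f"{key}: {total} across {frontend_instances} instances")
```

If the load balancer does not distribute requests evenly, a single instance can hit its limit well before the aggregate figure is reached.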

nitesh237 · April 28, 2022, 5:10am

Hey folks, thanks for your responses.

Turns out this was the root cause of the problem. The workflow specs I was benchmarking with were producing too many activities concurrently. I configured the right specs and it worked as expected.

To reduce the resource-exhausted errors in the frontend layer, I limited the number of task pollers per worker, and it worked fine after that.
