Can someone help with the difference between these two metrics?
Context: We are trying to conduct load testing, and it seems like
service_latency_nouserlatency is causing the major bottleneck in the history service. Understanding these metrics will help us identify the bottleneck in the cluster.
This metric measures the latency to obtain the per-workflow lock.
It is the overall latency minus service_latency_userlatency.
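The relationship between the two metrics can be sketched as simple arithmetic (the function name and sample values below are illustrative, not how the server actually computes the metric):

```go
package main

import "fmt"

// noUserLatency illustrates the decomposition described above:
//   service_latency = service_latency_userlatency + service_latency_nouserlatency
// so the "no user" portion is what remains after subtracting the time
// spent in user workflow/activity processing.
func noUserLatency(overallMs, userMs float64) float64 {
	return overallMs - userMs
}

func main() {
	// Hypothetical sample: 120ms overall request latency, of which 90ms
	// was user latency; the remaining 30ms is nouserlatency.
	fmt.Println(noUserLatency(120, 90))
}
```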
> service_latency_nouserlatency is causing the major bottleneck in the history service
This can be caused by workflows scheduling too many activities concurrently. Are there any other metrics that stick out, as there could be other causes as well? Any error logs on the DB or history service?
Which API calls are experiencing the high latency?
In general, the first thing to check is persistence_latency; specifically, look for operation="UpdateWorkflowExecution". If persistence_latency is high, it means you need to scale out your persistence.
Second, check the CPU usage on your history service instances: are they saturating their CPU limit?
Thanks, folks, for the replies. These were useful.
So, I suspected the history service CPU to be the bottleneck and ran the same benchmark with 2x nodes. The CPU utilization for history service pods now seems to be in the 40-50% range.
However, I still see a difference in service latency vs DB latency as can be seen below.
@tihomir, to answer your question: task queue latency (task_latency_queue_nouserlatency) and task processing latency (task_latency_processing) seem to be sticking out. Do we know what could lead to this?
The persistence latency seems to be in a decent range here, @Yimin_Chen.
Based on this, I checked my workflow/activity workers' CPU utilization, which was in the 10-15% range. Is there anything else I should be looking into?
Could you group the service_latency metric by operation? Also check if there are any service_errors metrics. (Right now there are different service_errors_xxx metrics for different types of errors, but we will consolidate them into one metric with an error type tag.)
My bad. Please find attached the metrics grouped by operation.
Found error graph for the following:
service_errors_resource_exhausted is what is causing trouble now. Look at the resource_exhausted_cause tag of that metric. There are two types of causes: RpsLimit or ConcurrentLimit. Both are dynamically configured at the frontend.
frontend.rps → rps limit overall
frontend.namespaceRPS → rps limit per namespace
frontend.namespaceCount → concurrent poller limit per namespace
Adjust these 3 configs. They are all per-frontend-instance limits.
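For reference, a dynamic config override for these keys can look roughly like this (a sketch assuming the file-based dynamic config; the values and the "load-test" namespace name are illustrative placeholders, not recommendations):

```yaml
# Sketch of a dynamic config file adjusting the three frontend limits
# mentioned above. Each value applies per frontend instance.
frontend.rps:
  - value: 2400            # overall RPS limit
frontend.namespaceRPS:
  - value: 1200            # RPS limit for one namespace
    constraints:
      namespace: "load-test"
frontend.namespaceCount:
  - value: 1200            # concurrent poller limit for one namespace
    constraints:
      namespace: "load-test"
```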
Hey folks thanks for your response.
Turns out this was the root cause of the problem. The workflow specs I was benchmarking with were producing too many activities concurrently. Once I configured the right specs, it worked as expected.
To reduce the resource exhausted errors in the FE layer, I limited the number of task pollers per worker, and it worked fine after that.
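In the Go SDK, capping pollers per worker can look like the fragment below (a sketch, not the poster's actual code; the client `c` and the task queue name are placeholders):

```go
// Fragment: limiting task pollers on a worker. Fewer pollers per worker
// means fewer concurrent long-poll requests counted against the
// frontend.namespaceCount limit on the frontend.
w := worker.New(c, "benchmark-task-queue", worker.Options{
	MaxConcurrentWorkflowTaskPollers: 2,
	MaxConcurrentActivityTaskPollers: 2,
})
```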