Seeing high latencies between two subsequent activity task executions

First, find out why you are still getting the resource exhausted errors. Check the resource_exhausted_cause tag on that metric to see whether the cause is the RPS limit, the concurrent limit, or system overload. If it is the concurrent limit, you may need to increase frontend.namespaceCount. If it is system overload, you need to increase the persistence rate limit: [frontend|history|matching].persistenceMaxQPS.
You probably need to increase frontend.namespaceRPS as well, but I guess you have already done so.
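
As a rough sketch, the limits named in this reply could be raised in the server's dynamic config file along these lines. The keys are the ones mentioned above; every numeric value is a placeholder chosen only for illustration, not a recommendation, and should be sized against your own load and database capacity.

```yaml
# Dynamic config sketch; all values are illustrative assumptions.

# Raise if resource_exhausted_cause reports the concurrent limit.
frontend.namespaceCount:
  - value: 2400
    constraints: {}

# Raise if resource_exhausted_cause reports system overload.
frontend.persistenceMaxQPS:
  - value: 3000
    constraints: {}
history.persistenceMaxQPS:
  - value: 6000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000
    constraints: {}
```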

StickyCacheEviction indicates that your worker's sticky cache might not be big enough. You can either increase the worker count or increase the sticky cache size.

@Yimin_Chen Can you please explain the difference between the attributes below:
FrontendRPS: “frontend.rps”,
FrontendMaxNamespaceRPSPerInstance: “frontend.namespaceRPS”,
FrontendMaxNamespaceCountPerInstance: “frontend.namespaceCount”,
FrontendGlobalNamespaceRPS: “frontend.globalNamespacerps”,

  1. We have a single namespace, “default”, and tried increasing the “frontend.rps” value up to 48K, but we are still getting resource exhausted errors with the RPS limit as the cause. Hence, we need clarification on the “namespaceRPS” attribute as well.
    Also, what are the maximum values that these attributes support?

  2. We also see service_errors_entity_not_found in the dashboard. Which config needs to be checked for this kind of error?

frontend.rps / history.rps / matching.rps set the RPS limit per service pod.
frontend.namespaceRPS sets the per-namespace RPS limit (per frontend instance, as the name FrontendMaxNamespaceRPSPerInstance suggests).
There is no maximum value for those configs.
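
To illustrate the distinction, here is a minimal dynamic config sketch that sets both limits. It assumes that a request has to stay under both of them, which would explain why raising frontend.rps alone did not remove the RPS limit errors. The values are placeholders for illustration only, and the namespace constraint uses the “default” namespace from the question.

```yaml
# Dynamic config sketch; values are illustrative assumptions.

# Limit for all requests handled by one frontend pod.
frontend.rps:
  - value: 4800
    constraints: {}

# Limit for a single namespace; here scoped to "default".
frontend.namespaceRPS:
  - value: 4800
    constraints:
      namespace: "default"
```

With an empty constraints block instead, the namespaceRPS value would apply to every namespace rather than only to “default”.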

service_errors_entity_not_found is an expected error. It means that a workflow (or another entity, such as an activity) cannot be found. This can happen if some tasks (timer/transfer/visibility tasks) run after the workflow has been deleted (for example, due to retention), or if you try to send a signal to a nonexistent workflow.
