Increase in workflow schedule-to-start times during specific intervals

Hello

We have been noticing increased schedule-to-start times for our workflows during specific intervals. For some workflows it climbs to as much as 1 hr, whereas tasks usually get scheduled within 100-150 ms.
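
To put numbers on this, we chart it with a query along these lines (the task_schedule_to_start_latency metric name and the taskqueue label are assumptions based on the standard server Prometheus setup, so verify them against your deployment):

histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{}[1m])) by (taskqueue, le))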

Upon further investigation, we found that there's a spike in persistence requests for GetWorkflowExecution, although we did not see any spike in new workflow creations during this specific hour.
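
For reference, we spotted the spike by breaking down persistence traffic per operation with something like the following (assuming the server exports the persistence_requests counter with an operation label):

sum(rate(persistence_requests{}[1m])) by (operation)

Filtering with operation="GetWorkflowExecution" isolates the spike described above.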

There’s also the following warn logs that appear during the same interval. Following is an example but it’s not specific to any namespace

	
{
  "level": "warn",
  "ts": "2023-12-04T12:38:03.484Z",
  "msg": "Transfer Task Processor: workflow mutable state not found, skip.",
  "shard-id": 1576,
  "address": "10.115.241.3:7234",
  "component": "transfer-queue-processor",
  "cluster-name": "active",
  "wf-namespace-id": "873381e3-3823-40ad-9992-610e2945fb90",
  "wf-id": "BGV_DCP_ACC3791486273713_sjDCOhTjfEknPMxc",
  "wf-run-id": "1f73edc4-5c48-4176-a35c-9a60fde454f1",
  "queue-task-id": 18162649786,
  "queue-task-visibility-timestamp": "2023-10-09T14:51:37.287Z",
  "queue-task-type": "TransferActivityTask",
  "queue-task": {
    "NamespaceID": "873381e3-3823-40ad-9992-610e2945fb90",
    "WorkflowID": "BGV_DCP_ACC3791486273713_sjDCOhTjfEknPMxc",
    "RunID": "1f73edc4-5c48-4176-a35c-9a60fde454f1",
    "VisibilityTimestamp": "2023-10-09T14:51:37.287475389Z",
    "TaskID": 18162649786,
    "TaskQueue": "digilocker-dcp-digio-queue-prod",
    "ScheduledEventID": 61,
    "Version": 0
  },
  "wf-history-event-id": 61,
  "logging-call-at": "nDCTaskUtil.go:102"
}

These workflows had ended successfully a long time ago, as suggested by the visibility timestamp. We currently have archival disabled for old workflows.

Could you please help? This issue occurs almost every day during the same hours.

Hi, what service version do you deploy?
Can you check your history hosts' stability during this time?

sum(rate(service_error_with_type{service_type="history"}[5m])) by (error_type)

(check service_type for “frontend” and “matching”, as well as the server restarts metric, just in case)
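
For example, the same query with the service_type label swapped, plus the restart counter (the restarts metric name and its labels are assumptions, adjust to whatever your setup exports):

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)
sum(rate(service_error_with_type{service_type="matching"}[5m])) by (error_type)
sum(rate(restarts{}[5m])) by (service_type)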

Can you also check persistence latencies?

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and resource exhausted:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

Hi @tihomir, we're currently using Temporal 1.18.1.

Starting with service errors by type:

Temporal history: we started receiving NotFound errors from 2:15 pm onwards, which is the same time the history-scanner job is scheduled.

Temporal frontend: same trend as above.

Temporal matching: this one correlates directly, with the “Cancelled” service errors starting at the same time as the persistence request spike.

Persistence latencies: from 5:30 onwards there were latencies in GetAllHistoryTreeBranches; we can't see any anomalies apart from that.

Resource exhaustion and restart graphs were stable during this time period.

Can you add operation to the sum by clause in your service error graphs, specifically for the one that shows serviceError.Cancelled, please?
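
Something like this, assuming the error metric also carries an operation label:

sum(rate(service_error_with_type{service_type="matching"}[5m])) by (operation, error_type)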

Regarding the GetAllHistoryTreeBranches latencies, this API is called by the worker service (history scavenger); it accesses the persistence layer directly and then the executions' mutable state. My guess at this point is that the NotFound errors come from trying to access that mutable state.
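
To isolate that call, the latency query above can be narrowed to the single operation:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation="GetAllHistoryTreeBranches"}[1m])) by (le))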

Sure

Yes, we noticed that yesterday as well, and it's also logging multiple times a day for a single workflow id. Could it be that it's retrying multiple times even in the NotFound case, which is causing the spike in GetWorkflowExecution as well?

Looks like we've found the issue. Shard rebalancing was occurring at the same time, and digging further we found that pods were also getting restarted at the same time because of node pool downscaling.
To confirm this, we disabled node pool downscaling for some time and no longer experienced the spikes. We've now shifted our history service to a node pool that does not have autoscaling enabled.
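
For anyone hitting the same pattern: shard movement can be watched next to the latency graphs with queries like the ones below (the sharditem_* counter names are an assumption based on the standard server metrics, so verify them against your version):

sum(rate(sharditem_created_count{}[5m]))
sum(rate(sharditem_removed_count{}[5m]))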