Increase in workflow schedule-to-start times during specific intervals

Hello

We have been noticing increased schedule-to-start times for our workflows during specific intervals. For some workflows it climbs to as much as 1 hr, whereas tasks usually get scheduled within 100-150 ms.
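
To put numbers on this, we chart it with a query along these lines (the task_schedule_to_start_latency metric name and the taskqueue label are assumptions based on the standard server Prometheus setup, so verify them against your deployment):

histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{}[1m])) by (taskqueue, le))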

Upon further investigation, we found that there's a spike in persistence requests for GetWorkflowExecution, although we did not see any spike in new workflow creations during this specific hour.
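
For reference, we spotted the spike by breaking down persistence traffic per operation with something like the following (assuming the server exports the persistence_requests counter with an operation label):

sum(rate(persistence_requests{}[1m])) by (operation)

Filtering with operation="GetWorkflowExecution" isolates the spike described above.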

There’s also the following warn logs that appear during the same interval. Following is an example but it’s not specific to any namespace

	
{
  "level": "warn",
  "ts": "2023-12-04T12:38:03.484Z",
  "msg": "Transfer Task Processor: workflow mutable state not found, skip.",
  "shard-id": 1576,
  "address": "10.115.241.3:7234",
  "component": "transfer-queue-processor",
  "cluster-name": "active",
  "wf-namespace-id": "873381e3-3823-40ad-9992-610e2945fb90",
  "wf-id": "BGV_DCP_ACC3791486273713_sjDCOhTjfEknPMxc",
  "wf-run-id": "1f73edc4-5c48-4176-a35c-9a60fde454f1",
  "queue-task-id": 18162649786,
  "queue-task-visibility-timestamp": "2023-10-09T14:51:37.287Z",
  "queue-task-type": "TransferActivityTask",
  "queue-task": {
    "NamespaceID": "873381e3-3823-40ad-9992-610e2945fb90",
    "WorkflowID": "BGV_DCP_ACC3791486273713_sjDCOhTjfEknPMxc",
    "RunID": "1f73edc4-5c48-4176-a35c-9a60fde454f1",
    "VisibilityTimestamp": "2023-10-09T14:51:37.287475389Z",
    "TaskID": 18162649786,
    "TaskQueue": "digilocker-dcp-digio-queue-prod",
    "ScheduledEventID": 61,
    "Version": 0
  },
  "wf-history-event-id": 61,
  "logging-call-at": "nDCTaskUtil.go:102"
}

These workflows had ended successfully a long time ago, as suggested by the visibility timestamp. We currently have archival disabled for old workflows.

Could you please help? This issue occurs almost every day during the same hours.

Hi, what service version do you deploy?
Can you check your history hosts' stability during this time?

sum(rate(service_error_with_type{service_type="history"}[5m])) by (error_type)

(check service_type for “frontend” and “matching”, as well as the server restarts metric, just in case)
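
For example, the same query with the service_type label swapped, plus the restart counter (the restarts metric name and its labels are assumptions, adjust to whatever your setup exports):

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)
sum(rate(service_error_with_type{service_type="matching"}[5m])) by (error_type)
sum(rate(restarts{}[5m])) by (service_type)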

Can you also check persistence latencies?

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and resource exhausted:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

Hi @tihomir, we're currently using Temporal 1.18.1.

Starting with service errors by type:

Temporal history: we started receiving NotFound errors from 2:15 pm onwards, which is the same time the history-scanner job is scheduled.

Temporal frontend: same trend as above.

Temporal matching: this one correlates directly, with the “Cancelled” service errors starting at the same time as the persistence request spike.

Persistence latencies: from 5:30 onwards there were latencies in GetAllHistoryTreeBranches; we can't see any anomalies apart from that.

Resource exhaustion and restart graphs were stable during this time period.

Can you add operation to the sum by clause in your service error graphs, specifically for the one that shows serviceError.Cancelled, please?
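
Something like this, assuming the error metric also carries an operation label:

sum(rate(service_error_with_type{service_type="matching"}[5m])) by (operation, error_type)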

Regarding the GetAllHistoryTreeBranches latencies, this API is called by the worker service (history scavenger); it accesses the persistence layer directly and then the executions' mutable state. My guess at this point is that the NotFound errors come from trying to access that mutable state.
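
To isolate that call, the latency query above can be narrowed to the single operation:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation="GetAllHistoryTreeBranches"}[1m])) by (le))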

Sure

Yes, we noticed that yesterday as well, and it's also logging multiple times a day for a single workflow id. Could it be that it's retrying multiple times even in the NotFound case, which is causing the spike in GetWorkflowExecution as well?

Looks like we've found the issue. Shard rebalancing was occurring at the same time, and digging further we found that pods were also getting restarted at the same time because of node pool downscaling.
To confirm this, we disabled node pool downscaling for some time and no longer experienced the spikes. We've now shifted our history service to a node pool that does not have autoscaling enabled.
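
For anyone hitting the same pattern: shard movement can be watched next to the latency graphs with queries like the ones below (the sharditem_* counter names are an assumption based on the standard server metrics, so verify them against your version):

sum(rate(sharditem_created_count{}[5m]))
sum(rate(sharditem_removed_count{}[5m]))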