Hello team,
We have some issues with our worker and possibly with local activities. This is a Kubernetes-based deployment of Temporal. Let me describe what we see. We have a small PoC setup with a single worker, and we recently saw a problem where all new workflow executions got stuck with the message:
temporal_sdk_core::worker::workflow::workflow_stream: Buffering WFT because cache is full
The suspicious thing is that it started happening when there were only 4 workflow executions running in parallel (all of them basically step-by-step serial business flows with no internal parallelism), while the limit for the WFT cache should be 100.
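For context, both limits involved here are set on the worker configuration. Below is a minimal sketch of where they live, assuming the Rust core SDK's WorkerConfigBuilder with the max_cached_workflows and max_outstanding_local_activities fields; names, defaults, and required fields differ between SDK versions, so this is an illustration rather than our exact setup:

```rust
// Sketch only: builder/field names assumed from the Rust core SDK
// (temporal_sdk_core::WorkerConfigBuilder); treat as illustration.
use temporal_sdk_core::WorkerConfigBuilder;

fn main() {
    let cfg = WorkerConfigBuilder::default()
        .namespace("default".to_string())
        .task_queue("headless-cc-ingress-s2s-queue".to_string())
        // WFT cache size: the "Buffering WFT because cache is full" message
        // should be gated by this limit, which we expect to be 100.
        .max_cached_workflows(100_usize)
        // Local-activity slot limit (relevant to the question further below).
        .max_outstanding_local_activities(5_usize)
        .build();

    match cfg {
        // In a real worker this config is passed to init_worker() together with a client.
        Ok(_cfg) => println!("worker config built"),
        Err(e) => eprintln!("other required fields missing on this SDK version: {e}"),
    }
}
```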
The full log message we see looks like this:
2025-03-07T15:50:05.819196883Z stdout F 2025-03-07T15:50:05.819152Z DEBUG new_stream_input:instantiate_or_update: temporal_sdk_core::worker::workflow::workflow_stream: Buffering WFT because cache is full run_id=3cc05818-9964-4b8c-8226-2fe32f0776e8 action=NewWft(PermittedWft { ValidWFT { task_token:
CiRiNGM5N2M4Mi01YmNkLTQ4N2QtOTkxOC04NTBmZThkZDU5MzASJGYwNGI1YTZlLThhM2ItNDE2NC1hNDVkLWQyNTlkOWE1Zjk1ZRokM2NjMDU4MTgtOTk2NC00YjhjLTgyMjYtMmZlMzJmMDc3NmU4IAIoAUoKCNYyEKeRgFQYAVADYgwIrausvgYQhvSsgwM=, task_queue: headless-cc-ingress-s2s-queue, workflow_execution: WorkflowExecution { workflow_id: "f04b5a6e-8a3b-4164-a45d-d259d9a5f95e", run_id: "3cc05818-9964-4b8c-8226-2fe32f0776e8" }, workflow_type: headless-cc-ingress-s2s, attempt: 1, previous_started_event_id: 0, started_event_id 3, history_length: 3, first_evt_in_hist_id: Some(1), legacy_query: None, queries: [] } }) run_id=3cc05818-9964-4b8c-8226-2fe32f0776e8 workflow_id=f04b5a6e-8a3b-4164-a45d-d259d9a5f95e
We recently started using local activities, and we are seeing a weird problem with them:
2025-03-07T15:49:32.895782098Z stdout F 2025-03-07T15:49:32.895750Z DEBUG poll_activity_task: temporal_sdk_core::worker::workflow::workflow_stream: Processing run update response from machines resp=FailRunUpdate(run_id: 15305b70-f122-4022-9060-fd521fe8d093, error: Fatal("Invalid transition resolving local activity (seq 2) in WaitingMarkerEventPreResolved"))
2025-03-07T15:49:32.895759748Z stdout F 2025-03-07T15:49:32.895726Z ERROR poll_activity_task: temporal_sdk_core::worker::workflow::managed_run: Error in run machines error=RunUpdateErr(Fatal("Invalid transition resolving local activity (seq 2) in WaitingMarkerEventPreResolved"))
2025-03-07T15:49:32.895734977Z stdout F 2025-03-07T15:49:32.895691Z DEBUG poll_activity_task: temporal_sdk_core::worker::workflow::managed_run: Applying local resolution resolution=LocalActivity(LocalActivityResolution { seq: 2, result: TimedOut(Failure { failure: Some(Failure { message: "Activity timed out", source: "", stack_trace: "", encoded_attributes: None, cause: None, failure_info: Some(TimeoutFailureInfo(TimeoutFailureInfo { timeout_type: StartToClose, last_heartbeat_details: None })) }) }), runtime: 5.001215574s, attempt: 1, backoff: None, original_schedule_time: Some(SystemTime { tv_sec: 1741362567, tv_nsec: 894254339 }) })
There is a clear time correlation between these local activity problems and the point at which we started seeing the WFT issues.
I have the following questions:
- Is it possible to see exactly which WFTs currently occupy the worker cache? (See the logging sketch below.)
- Is it possible that what we are in fact seeing is not that we ran out of WFT cache but that we ran out of local activity slots? There is a config attribute in WorkerConfig called max_outstanding_local_activities (Rust SDK) whose default is 5, which would coincide with the number of workflows that were running before we started seeing the problem. Could the log message be inaccurate in this sense?
- In the Temporal UI we see that those local activities finished. Is it possible that they stayed in the cache because of the error visible in the log?
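Regarding the first question: if there is no direct way to dump the cache contents, would raising the log verbosity for the workflow-management modules be the right approach? A minimal sketch, assuming the worker process installs its own tracing-subscriber (with the env-filter feature); if logging instead goes through core's telemetry options, we would put the same filter string there:

```rust
// Sketch only: assumes we control the tracing-subscriber setup in the worker.
use tracing_subscriber::EnvFilter;

fn main() {
    // Raise verbosity for the modules that emit the "Buffering WFT because
    // cache is full" and run-update messages, without flooding everything else.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new(
            "info,temporal_sdk_core::worker::workflow=trace",
        ))
        .init();

    // ... worker startup would follow here ...
}
```

The messages pasted above come from temporal_sdk_core::worker::workflow at DEBUG, so TRACE on that module tree is the closest thing to per-run cache visibility we have found so far; please let us know if there is a better way.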