Hello team,
We have some issues with our worker and possibly with local activities. This is a Kubernetes-based deployment of Temporal. Let me describe what we see. We have a small PoC setup with a single worker, and we recently saw a problem where all new workflow executions got stuck with the message:
temporal_sdk_core::worker::workflow::workflow_stream: Buffering WFT because cache is full
The suspicious thing is that it started happening when there were only 4 workflow executions running in parallel (all of them basically step-by-step serial business flows with no internal parallelism), while the limit for the WFT cache should be 100.
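For context, both limits involved here are set on the worker configuration. Below is a minimal sketch of where they live, assuming the Rust core SDK's WorkerConfigBuilder with the max_cached_workflows and max_outstanding_local_activities fields; names, defaults, and required fields differ between SDK versions, so this is an illustration rather than our exact setup:

```rust
// Sketch only: builder/field names assumed from the Rust core SDK
// (temporal_sdk_core::WorkerConfigBuilder); treat as illustration.
use temporal_sdk_core::WorkerConfigBuilder;

fn main() {
    let cfg = WorkerConfigBuilder::default()
        .namespace("default".to_string())
        .task_queue("headless-cc-ingress-s2s-queue".to_string())
        // WFT cache size: the "Buffering WFT because cache is full" message
        // should be gated by this limit, which we expect to be 100.
        .max_cached_workflows(100_usize)
        // Local-activity slot limit (relevant to the question further below).
        .max_outstanding_local_activities(5_usize)
        .build();

    match cfg {
        // In a real worker this config is passed to init_worker() together with a client.
        Ok(_cfg) => println!("worker config built"),
        Err(e) => eprintln!("other required fields missing on this SDK version: {e}"),
    }
}
```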
The full log message we see looks like this:
2025-03-07T15:50:05.819196883Z stdout F 2025-03-07T15:50:05.819152Z DEBUG new_stream_input:instantiate_or_update: temporal_sdk_core::worker::workflow::workflow_stream: Buffering WFT because cache is full run_id=3cc05818-9964-4b8c-8226-2fe32f0776e8 action=NewWft(PermittedWft { ValidWFT { task_token:
CiRiNGM5N2M4Mi01YmNkLTQ4N2QtOTkxOC04NTBmZThkZDU5MzASJGYwNGI1YTZlLThhM2ItNDE2NC1hNDVkLWQyNTlkOWE1Zjk1ZRokM2NjMDU4MTgtOTk2NC00YjhjLTgyMjYtMmZlMzJmMDc3NmU4IAIoAUoKCNYyEKeRgFQYAVADYgwIrausvgYQhvSsgwM=, task_queue: headless-cc-ingress-s2s-queue, workflow_execution: WorkflowExecution { workflow_id: "f04b5a6e-8a3b-4164-a45d-d259d9a5f95e", run_id: "3cc05818-9964-4b8c-8226-2fe32f0776e8" }, workflow_type: headless-cc-ingress-s2s, attempt: 1, previous_started_event_id: 0, started_event_id 3, history_length: 3, first_evt_in_hist_id: Some(1), legacy_query: None, queries: [] } }) run_id=3cc05818-9964-4b8c-8226-2fe32f0776e8 workflow_id=f04b5a6e-8a3b-4164-a45d-d259d9a5f95e
We recently started using local activities, and we are seeing a weird problem with them:
2025-03-07T15:49:32.895782098Z stdout F 2025-03-07T15:49:32.895750Z DEBUG poll_activity_task: temporal_sdk_core::worker::workflow::workflow_stream: Processing run update response from machines resp=FailRunUpdate(run_id: 15305b70-f122-4022-9060-fd521fe8d093, error: Fatal("Invalid transition resolving local activity (seq 2) in WaitingMarkerEventPreResolved"))
2025-03-07T15:49:32.895759748Z stdout F 2025-03-07T15:49:32.895726Z ERROR poll_activity_task: temporal_sdk_core::worker::workflow::managed_run: Error in run machines error=RunUpdateErr(Fatal("Invalid transition resolving local activity (seq 2) in WaitingMarkerEventPreResolved"))
2025-03-07T15:49:32.895734977Z stdout F 2025-03-07T15:49:32.895691Z DEBUG poll_activity_task: temporal_sdk_core::worker::workflow::managed_run: Applying local resolution resolution=LocalActivity(LocalActivityResolution { seq: 2, result: TimedOut(Failure { failure: Some(Failure { message: "Activity timed out", source: "", stack_trace: "", encoded_attributes: None, cause: None, failure_info: Some(TimeoutFailureInfo(TimeoutFailureInfo { timeout_type: StartToClose, last_heartbeat_details: None })) }) }), runtime: 5.001215574s, attempt: 1, backoff: None, original_schedule_time: Some(SystemTime { tv_sec: 1741362567, tv_nsec: 894254339 }) })
There is a clear time correlation between these local activity problems and the point at which we started seeing the WFT issues.
I have the following questions:
- Is it possible to see exactly which WFTs currently occupy the worker cache? (See the logging sketch below.)
- Is it possible that what we are in fact seeing is not that we ran out of WFT cache but that we ran out of local activity slots? There is a config attribute in WorkerConfig called max_outstanding_local_activities (Rust SDK) whose default is 5, which would coincide with the number of workflows that were running before we started seeing the problem. Could the log message be inaccurate in this sense?
- In the Temporal UI we see that those local activities finished. Is it possible that they stayed in the cache because of the error visible in the log?
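Regarding the first question: if there is no direct way to dump the cache contents, would raising the log verbosity for the workflow-management modules be the right approach? A minimal sketch, assuming the worker process installs its own tracing-subscriber (with the env-filter feature); if logging instead goes through core's telemetry options, we would put the same filter string there:

```rust
// Sketch only: assumes we control the tracing-subscriber setup in the worker.
use tracing_subscriber::EnvFilter;

fn main() {
    // Raise verbosity for the modules that emit the "Buffering WFT because
    // cache is full" and run-update messages, without flooding everything else.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new(
            "info,temporal_sdk_core::worker::workflow=trace",
        ))
        .init();

    // ... worker startup would follow here ...
}
```

The messages pasted above come from temporal_sdk_core::worker::workflow at DEBUG, so TRACE on that module tree is the closest thing to per-run cache visibility we have found so far; please let us know if there is a better way.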