History Service OOM exception

Hi,

We are seeing a lot of OOM exceptions in the History service pods. We are currently running 3 pods, each with 8 GB of memory.

Pods keep getting restarted when the upper memory limit is reached. I have attached the Grafana dashboard for memory. We are running Temporal server v1.20.2.

We keep seeing entries like the one below in the history logs:

{"level":"warn","ts":"2023-07-17T01:50:07.050Z","msg":"Critical attempts processing workflow task","service":"history","component":"shard-controller","address":"10.1.1.207:7234","shard-id":7998,"address":"10.1.1.207:7234","wf-namespace":"kairos-pp-main","wf-id":"273230334","wf-run-id":"61d50edd-a306-460d-a83c-e73eea7f3d34","attempt":407,"logging-call-at":"workflow_task_state_machine.go:887"}

Looking at the logs, it seems that some stale workflows keep retrying their workflow task but never get executed, so they are retained in memory and pile up. Could this be the cause of the memory leak?
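For reference, this is roughly how we are trying to spot these stale executions, a minimal sketch using the Go SDK. It assumes advanced visibility is enabled so list queries work; the namespace is taken from the log entry above, while the frontend address and the cutoff date are placeholders for our environment:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the Temporal frontend; host/port and namespace are
	// placeholders based on the log entry, adjust to your environment.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend:7233",
		Namespace: "kairos-pp-main",
	})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	ctx := context.Background()

	// List workflows that are still running and were started before an
	// assumed cutoff date -- candidates for the stale workflows that keep
	// retrying their workflow task.
	query := `ExecutionStatus = "Running" AND StartTime < "2023-07-01T00:00:00Z"`
	var nextPageToken []byte
	for {
		resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
			Namespace:     "kairos-pp-main",
			PageSize:      100,
			Query:         query,
			NextPageToken: nextPageToken,
		})
		if err != nil {
			log.Fatalf("list workflows failed: %v", err)
		}
		for _, e := range resp.Executions {
			wfID := e.Execution.GetWorkflowId()
			runID := e.Execution.GetRunId()
			fmt.Printf("stale candidate: %s (%s)\n", wfID, runID)
			// Terminating releases the execution's retry loop; left commented
			// out until the list has been reviewed.
			// _ = c.TerminateWorkflow(ctx, wfID, runID, "stale workflow stuck in workflow task retry")
		}
		nextPageToken = resp.NextPageToken
		if len(nextPageToken) == 0 {
			break
		}
	}
}
```

We have not terminated anything yet; we mainly want to confirm whether these retrying workflow tasks can actually cause the memory growth we are seeing.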

Also, we have noticed that only one pod is doing the heavy lifting while the other pods remain stable. When the pod in question is killed, another pod takes over the heavy lifting and the cycle continues. Is this expected behaviour? How can we achieve even memory distribution across pods?

Could you please help us troubleshoot this issue?

@Andrey_Dubnik @Kishore_Gunda @peeyushchawla @sidhu.sb - fyi

Thanks,
Dhanraj