History Service OOM exception

Hi,

We are seeing a lot of OOM exceptions in the History service pods. We are currently running 3 pods, each with 8 GB of memory.

Pods keep getting restarted when the upper memory limit is reached. I have attached the Grafana dashboard for memory. We are running Temporal server v1.20.2.

We keep seeing entries like the one below in the history logs:

{"level":"warn","ts":"2023-07-17T01:50:07.050Z","msg":"Critical attempts processing workflow task","service":"history","component":"shard-controller","address":"10.1.1.207:7234","shard-id":7998,"address":"10.1.1.207:7234","wf-namespace":"kairos-pp-main","wf-id":"273230334","wf-run-id":"61d50edd-a306-460d-a83c-e73eea7f3d34","attempt":407,"logging-call-at":"workflow_task_state_machine.go:887"}

Looking at the logs, it seems that some stale workflows keep retrying their workflow task but never get executed, so they are retained in memory and pile up. Could this be the cause of the memory leak?
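For reference, this is roughly how we are trying to spot these stale executions, a minimal sketch using the Go SDK. It assumes advanced visibility is enabled so list queries work; the namespace is taken from the log entry above, while the frontend address and the cutoff date are placeholders for our environment:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the Temporal frontend; host/port and namespace are
	// placeholders based on the log entry, adjust to your environment.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend:7233",
		Namespace: "kairos-pp-main",
	})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	ctx := context.Background()

	// List workflows that are still running and were started before an
	// assumed cutoff date -- candidates for the stale workflows that keep
	// retrying their workflow task.
	query := `ExecutionStatus = "Running" AND StartTime < "2023-07-01T00:00:00Z"`
	var nextPageToken []byte
	for {
		resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
			Namespace:     "kairos-pp-main",
			PageSize:      100,
			Query:         query,
			NextPageToken: nextPageToken,
		})
		if err != nil {
			log.Fatalf("list workflows failed: %v", err)
		}
		for _, e := range resp.Executions {
			wfID := e.Execution.GetWorkflowId()
			runID := e.Execution.GetRunId()
			fmt.Printf("stale candidate: %s (%s)\n", wfID, runID)
			// Terminating releases the execution's retry loop; left commented
			// out until the list has been reviewed.
			// _ = c.TerminateWorkflow(ctx, wfID, runID, "stale workflow stuck in workflow task retry")
		}
		nextPageToken = resp.NextPageToken
		if len(nextPageToken) == 0 {
			break
		}
	}
}
```

We have not terminated anything yet; we mainly want to confirm whether these retrying workflow tasks can actually cause the memory growth we are seeing.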

Also, we have noticed that only one pod is doing the heavy lifting while the other pods remain stable. When the pod in question is killed, another pod takes over the heavy lifting and the cycle continues. Is this expected behaviour? How can we achieve even memory distribution across pods?

Could you please help us troubleshoot this issue?

@Andrey_Dubnik @Kishore_Gunda @peeyushchawla @sidhu.sb - fyi

Thanks,
Dhanraj