Memory OOM issues with History Pod and Size-Based Cache configuration

Hello,

I am experiencing persistent out-of-memory (OOM) issues with my Temporal History pods and would like to seek some guidance.

Currently, I have configured both MutableState and Events caches to be size-based for precise memory management. My configuration is as follows:

```yaml
history.cacheSizeBasedLimit:
  - value: true
    constraints: {}
history.hostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}
```

Environment Details:

  • Total History Shards: 2048

  • Pod Resource: Request 4GB / Limit 5GB

  • Runtime Env: GOMEMLIMIT is set to 3000MiB.
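For reference, here is a quick sanity check of the numbers above, assuming the byte limits were strictly enforced (which the OOM behavior suggests they are not):

```python
# Back-of-envelope memory budget. Assumes the two configured byte limits are
# the only cache memory consumers -- in practice per-entry overhead and
# in-use entries may push actual usage above these limits.

MIB = 1024 * 1024

mutable_state_cache = 128 * MIB    # history.hostLevelCacheMaxSizeBytes
events_cache = 128 * MIB           # history.eventsHostLevelCacheMaxSizeBytes
gomemlimit = 3000 * MIB            # GOMEMLIMIT
pod_limit = 5 * 1024 * MIB         # container memory limit (5GB)

cache_total = mutable_state_cache + events_cache
print(f"configured cache total: {cache_total / MIB:.0f} MiB")                     # 256 MiB
print(f"headroom under GOMEMLIMIT: {(gomemlimit - cache_total) / MIB:.0f} MiB")   # 2744 MiB
print(f"pod limit minus GOMEMLIMIT: {(pod_limit - gomemlimit) / MIB:.0f} MiB")    # 2120 MiB
```

So even if both caches sat at their byte limits, they would account for only 256 MiB of the 3000 MiB GOMEMLIMIT, which makes the climb past 5GB hard to attribute to the configured limits themselves.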

Problem: Even though each cache is limited to 128MB, pod memory usage increases continuously under load, eventually exceeding the 5GB limit and triggering an OOM kill.

Interestingly, when I use a count-based configuration for the MutableState cache (as shown below), the memory usage stabilizes around 3GB and never triggers an OOM:

```yaml
history.cacheSizeBasedLimit:
  - value: false
    constraints: {}
history.hostLevelCacheMaxSize:
  - value: 10000
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}
```
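Since the 10,000-entry count limit behaves well while the 128MB byte limit does not, it may help to compare what the two limits imply about per-entry size. A small sketch (the 16 KiB average entry size below is an assumption, not a measured value):

```python
# Hypothetical comparison of the count-based and size-based limits.
# avg_entry_bytes is an ASSUMED figure -- measure real mutable-state sizes
# (e.g. via the cache_usage metric) before relying on this arithmetic.

MIB = 1024 * 1024
KIB = 1024

count_limit = 10_000            # history.hostLevelCacheMaxSize
size_limit_bytes = 128 * MIB    # history.hostLevelCacheMaxSizeBytes

# Per-entry budget implied by the byte limit at the same entry count:
implied_per_entry = size_limit_bytes / count_limit
print(f"implied budget per entry: {implied_per_entry / KIB:.1f} KiB")   # 13.1 KiB

# Conversely, an assumed 16 KiB average entry turns the count limit into:
avg_entry_bytes = 16 * KIB
equivalent_bytes = count_limit * avg_entry_bytes
print(f"count limit as bytes: {equivalent_bytes / MIB:.2f} MiB")        # 156.25 MiB
```

If average mutable-state size is above roughly 13 KiB, the 128MB byte limit is actually *tighter* than the 10,000-entry count limit, so a stricter cap alone would not explain higher memory usage; that may point toward memory the byte limit does not account for (per-entry overhead, or entries held in use).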

Questions:

  1. Is there a known issue or bug regarding size-based cache limits (cacheSizeBasedLimit: true)?

  2. Why does the size-based limit fail to constrain the memory, whereas the count-based limit works fine in the same environment?

  3. I prefer to manage both caches using size-based limits. Are there any additional configurations I might be missing (e.g., shard-level settings like history.cacheMaxSizeBytes) to make this work reliably?

I would appreciate any insights or recommendations on how to properly unify the cache settings to size-based without hitting OOM.
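In case it helps frame question 3, this is the kind of unified size-based configuration I am considering. The shard-level key `history.cacheMaxSizeBytes` is the one mentioned above; I have not verified these key names against the v1.29.1 dynamic config constants, so please treat this as a sketch:

```yaml
# Sketch only -- verify each key against the server's dynamicconfig
# constants for v1.29.1 before applying.
history.cacheSizeBasedLimit:
  - value: true
    constraints: {}
history.cacheMaxSizeBytes:               # shard-level byte limit (unverified key)
  - value: 2097152   # 2MB per shard; keep shards_per_host * value within budget
    constraints: {}
history.hostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB host-level
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}
```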

Hi, what server version are you deploying?

1.29.1, the latest version.

Hi, we are using the latest version of Temporal (1.29.1). I would appreciate any advice on configuration settings related to memory usage for the history pods. We monitor memory using the cache_usage metric, but managing memory as intended has been quite challenging.