Memory OOM issues with History Pod and Size-Based Cache configuration

Hello,

I am experiencing persistent Memory OOM (Out Of Memory) issues with my Temporal History pods and would like to seek some guidance.

Currently, I have configured both MutableState and Events caches to be size-based for precise memory management. My configuration is as follows:

YAML

history.cacheSizeBasedLimit:
  - value: true
    constraints: {}
history.hostLevelCacheMaxSizeBytes:
  - value: 134217728  # 128MB
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}

Environment Details:

  • Total History Shards: 2048

  • Pod Resource: Request 4GB / Limit 5GB

  • Runtime Env: GOMEMLIMIT is set to 3000MiB.

Problem: Even though I limited the caches to 128MB each, the pod memory usage continuously increases under load, exceeding the 5GB limit and triggering an OOM kill.

Interestingly, when I use a count-based configuration for the MutableState cache (as shown below), the memory usage stabilizes around 3GB and never triggers an OOM:

YAML

history.cacheSizeBasedLimit:
  - value: false
    constraints: {}
history.hostLevelCacheMaxSize:
  - value: 10000
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}

Questions:

  1. Is there a known issue or bug regarding size-based cache limits (cacheSizeBasedLimit: true)?

  2. Why does the size-based limit fail to constrain the memory, whereas the count-based limit works fine in the same environment?

  3. I prefer to manage both caches using size-based limits. Are there any additional configurations I might be missing (e.g., shard-level settings like history.cacheMaxSizeBytes) to make this work reliably?
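For question 3, a shard-level byte cap would sit alongside the host-level settings, and with 2048 shards any per-shard value multiplies accordingly. A hypothetical fragment (the key name is taken from the question itself; please verify it against the dynamic config definitions for your server version before relying on it):

```yaml
# Hypothetical shard-level byte cap for the MutableState cache.
# With 2048 shards on one host, even 1MB per shard could retain
# up to ~2GB on that host.
history.cacheMaxSizeBytes:
  - value: 1048576   # 1MB per shard
    constraints: {}
```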

I would appreciate any insights or recommendations on how to properly unify the cache settings to size-based without hitting OOM.

Hi, what server version are you deploying?

1.29.1, the latest version.

Hi, we are using the latest version of Temporal (1.29.1). I would appreciate it if you could advise on any configuration settings related to memory usage for the history pods. We are monitoring memory using the cache_usage metric, but managing memory as intended has been quite challenging.

Hi @andropler, your configuration looks correct.

What were the values of cache_usage{cache_type="mutablestate"} and
cache_usage{cache_type="events"} that you observed when this happens? Did you see those values increase continuously when you configured size-based limits? I ran a test and observed that both values were capped at the configured limit, and memory usage did not increase continuously.

Hi @prathyushpv, thanks for jumping in and for offering to investigate.

Just to clarify my original issue (in case I didn’t explain it clearly):

  • The mixed configuration works fine for us:

    • MutableState cache = count-based (history.hostLevelCacheMaxSize)

    • Events cache = size-based (history.eventsHostLevelCacheMaxSizeBytes)

    • With GOMEMLIMIT set, memory stays stable and we don’t see OOM.

  • The problem happens when we switch to size-based limits for the MutableState cache as well. In the problematic setup we use:

    • history.cacheSizeBasedLimit: true

    • history.hostLevelCacheMaxSizeBytes: <value>

    • history.enableHostLevelEventsCache: true

    • history.eventsHostLevelCacheMaxSizeBytes: <value>

    • Under load, the history pod’s RSS keeps growing, appears to exceed the configured cache byte limits, and can even surpass GOMEMLIMIT, eventually getting OOM-killed.

So my question is specifically: why does the size-based limit for the MutableState cache not effectively bound memory in our case, while the count-based limit does? Are there any known caveats/bugs, additional shard-level settings, or metrics we should use to validate what’s happening?

Additionally, could you please confirm how to interpret the cache_usage metric:

  1. If a given cache type is configured with count-based limits, does cache_usage{cache_type="..."} report number of entries?

  2. If it is configured with size-based limits, does cache_usage{cache_type="..."} report bytes (capacity/size)?

If helpful, I can share:

  • cache_usage{cache_type="mutablestate"} and cache_usage{cache_type="events"} time series for both configs

  • history pod memory RSS / Go heap metrics (go_memstats_heap_*), plus workload characteristics (shards=2048, pod limit=5Gi, GOMEMLIMIT=3000Mi)

  • exact dynamic config + Temporal version (1.29.1)

Appreciate any guidance on what to check next or what additional data would be most useful for you to debug this.