Memory OOM issues with History Pod and Size-Based Cache configuration

Hello,

I am experiencing persistent Memory OOM (Out Of Memory) issues with my Temporal History pods and would like to seek some guidance.

Currently, I have configured both MutableState and Events caches to be size-based for precise memory management. My configuration is as follows:

YAML

history.cacheSizeBasedLimit:
  - value: true
    constraints: {}
history.hostLevelCacheMaxSizeBytes:
  - value: 134217728  # 128MB
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}

Environment Details:

  • Total History Shards: 2048

  • Pod Resource: Request 4GB / Limit 5GB

  • Runtime Env: GOMEMLIMIT is set to 3000MiB.

Problem: Even though I limited the caches to 128MB each, the pod memory usage continuously increases under load, exceeding the 5GB limit and triggering an OOM kill.

Interestingly, when I use a count-based configuration for the MutableState cache (as shown below), the memory usage stabilizes around 3GB and never triggers an OOM:

YAML

history.cacheSizeBasedLimit:
  - value: false
    constraints: {}
history.hostLevelCacheMaxSize:
  - value: 10000
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728   # 128MB
    constraints: {}

Questions:

  1. Is there a known issue or bug regarding size-based cache limits (cacheSizeBasedLimit: true)?

  2. Why does the size-based limit fail to constrain the memory, whereas the count-based limit works fine in the same environment?

  3. I prefer to manage both caches using size-based limits. Are there any additional configurations I might be missing (e.g., shard-level settings like history.cacheMaxSizeBytes) to make this work reliably?
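For question 3, a shard-level byte cap would sit alongside the host-level settings, and with 2048 shards any per-shard value multiplies accordingly. A hypothetical fragment (the key name is taken from the question itself; please verify it against the dynamic config definitions for your server version before relying on it):

```yaml
# Hypothetical shard-level byte cap for the MutableState cache.
# With 2048 shards on one host, even 1MB per shard could retain
# up to ~2GB on that host.
history.cacheMaxSizeBytes:
  - value: 1048576   # 1MB per shard
    constraints: {}
```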

I would appreciate any insights or recommendations on how to properly unify the cache settings to size-based without hitting OOM.

Hi, what server version are you deploying?

1.29.1, the latest version.

Hi, we are using the latest version of Temporal (1.29.1). I would appreciate it if you could advise on any configuration settings related to memory usage for the history pods. We are monitoring memory using the cache_usage metric, but managing memory as intended has been quite challenging.

Hi @andropler, your configuration looks correct.

What were the values of cache_usage{cache_type="mutablestate"} and
cache_usage{cache_type="events"} that you observed when this happens? Did you see those values increase continuously when you configured size-based limits? I ran a test and observed that both values were capped at the configured limit, and memory usage did not increase continuously.

Hi @prathyushpv, thanks for jumping in and for offering to investigate.

Just to clarify my original issue (in case I didn’t explain it clearly):

  • The mixed configuration works fine for us:

    • MutableState cache = count-based (history.hostLevelCacheMaxSize)

    • Events cache = size-based (history.eventsHostLevelCacheMaxSizeBytes)

    • With GOMEMLIMIT set, memory stays stable and we don’t see OOM.

  • The problem happens when we switch to size-based limits for the MutableState cache as well. In the problematic setup we use:

    • history.cacheSizeBasedLimit: true

    • history.hostLevelCacheMaxSizeBytes: <value>

    • history.enableHostLevelEventsCache: true

    • history.eventsHostLevelCacheMaxSizeBytes: <value>

    • Under load, the history pod’s RSS keeps growing, appears to exceed the configured cache byte limits, and can even surpass GOMEMLIMIT, eventually getting OOM-killed.

So my question is specifically: why does the size-based limit for the MutableState cache not effectively bound memory in our case, while the count-based limit does? Are there any known caveats/bugs, additional shard-level settings, or metrics we should use to validate what’s happening?

Additionally, could you please confirm how to interpret the cache_usage metric:

  1. If a given cache type is configured with count-based limits, does cache_usage{cache_type="..."} report number of entries?

  2. If it is configured with size-based limits, does cache_usage{cache_type="..."} report bytes (capacity/size)?

If helpful, I can share:

  • cache_usage{cache_type="mutablestate"} and cache_usage{cache_type="events"} time series for both configs

  • history pod memory RSS / Go heap metrics (go_memstats_heap_*), plus workload characteristics (shards=2048, pod limit=5Gi, GOMEMLIMIT=3000Mi)

  • exact dynamic config + Temporal version (1.29.1)

Appreciate any guidance on what to check next or what additional data would be most useful for you to debug this.