Hello,
I am experiencing persistent out-of-memory (OOM) kills on my Temporal History pods and would like some guidance.
Currently, I have configured both MutableState and Events caches to be size-based for precise memory management. My configuration is as follows:
```yaml
history.cacheSizeBasedLimit:
  - value: true
    constraints: {}
history.hostLevelCacheMaxSizeBytes:
  - value: 134217728 # 128MB
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728 # 128MB
    constraints: {}
```
Environment Details:
- Total History Shards: 2048
- Pod Resources: Request 4GB / Limit 5GB
- Runtime Env: GOMEMLIMIT is set to 3000MiB
Problem: Even though each cache is capped at 128MB, pod memory usage grows continuously under load, eventually exceeding the 5GB limit and triggering an OOM kill.
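For context, the configured caps alone should leave plenty of headroom, so something beyond the caches must be consuming memory. A quick back-of-the-envelope check, using only the figures above:

```go
package main

import "fmt"

func main() {
	const (
		mutableStateCapMiB = 128  // history.hostLevelCacheMaxSizeBytes
		eventsCacheCapMiB  = 128  // history.eventsHostLevelCacheMaxSizeBytes
		goMemLimitMiB      = 3000 // GOMEMLIMIT
		podLimitMiB        = 5 * 1024
	)
	totalCacheCapMiB := mutableStateCapMiB + eventsCacheCapMiB
	fmt.Printf("combined configured cache cap: %d MiB\n", totalCacheCapMiB)                // 256 MiB
	fmt.Printf("headroom under GOMEMLIMIT:     %d MiB\n", goMemLimitMiB-totalCacheCapMiB)  // 2744 MiB
	fmt.Printf("headroom under the pod limit:  %d MiB\n", podLimitMiB-totalCacheCapMiB)    // 4864 MiB
}
```

In other words, even if both caches sat exactly at their byte caps, that accounts for only 256MiB of a 5GB pod, so the growth appears to come from memory the size-based accounting does not see.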
Interestingly, when I use a count-based configuration for the MutableState cache (as shown below), the memory usage stabilizes around 3GB and never triggers an OOM:
```yaml
history.cacheSizeBasedLimit:
  - value: false
    constraints: {}
history.hostLevelCacheMaxSize:
  - value: 10000
    constraints: {}
history.enableHostLevelEventsCache:
  - value: true
    constraints: {}
history.eventsHostLevelCacheMaxSizeBytes:
  - value: 134217728 # 128MB
    constraints: {}
```
Questions:
- Is there a known issue or bug regarding size-based cache limits (`cacheSizeBasedLimit: true`)?
- Why does the size-based limit fail to constrain memory, while the count-based limit works fine in the same environment?
- I would prefer to manage both caches with size-based limits. Are there additional configurations I might be missing (e.g., shard-level settings such as `history.cacheMaxSizeBytes`) to make this work reliably?
I would appreciate any insights or recommendations on how to run both caches with size-based limits without hitting OOMs.