History Mem Usage, Cache Size & TTL

Hi,
I have some questions regarding memory usage of history pods.

  • Scenario
I run 1 frontend pod, 1 history pod, 1 matching pod, and 1 worker pod.
    I use the dynamic config below, do not set the history cache size config at all (so the defaults apply), and set NUM_HISTORY_SHARDS=4096.
matching.numTaskqueueReadPartitions:
- value: 4
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 4
  constraints: {}
matching.rps:
- value: 204800
  constraints: {}
frontend.rps:
- value: 204800
  constraints: {}
history.rps:
- value: 204800
  constraints: {}

I load-tested the pods at 100 workflows/sec to observe the behaviour.

  • Question 1
The history pod used a lot of memory during the load test (as expected).
    But after being idle for 1 hour, the history pod's memory did not decrease at all and stayed at around 60% usage.


I read that the defaults for HistoryCacheTTL & EventsCacheTTL are 1 hour (time.Hour), so why did the history pod's memory usage not decrease at all after 1 hour of idle time?
    The pods run the docker image temporalio/server:1.16.2.

  • Question 2
So, regarding this post,
    my single history pod should be able to hold up to 4096 shards * 512 (the default HistoryCacheMaxSize) = 2,097,152 cached items.
    How do I calculate the memory required for that maximum number of cached items?


Hi, can anyone clarify the things raised here?
Let me know if there is any data or configuration you need from me.

For 1: cached items are removed only when the cache size limit is reached; there is no background thread that cleans them up.
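To illustrate why that can look like the TTL "not working" (a minimal sketch of the general lazy-eviction pattern, not Temporal's actual implementation; all names below are made up): a size-bounded LRU cache that checks TTL only when an entry is touched never shrinks while the service is idle.

package cache

import (
	"container/list"
	"time"
)

type entry struct {
	key       string
	value     interface{}
	createdAt time.Time
}

// lazyLRU evicts on size pressure and checks TTL lazily on access.
type lazyLRU struct {
	maxSize int
	ttl     time.Duration
	order   *list.List // front = most recently used
	items   map[string]*list.Element
}

func newLazyLRU(maxSize int, ttl time.Duration) *lazyLRU {
	return &lazyLRU{
		maxSize: maxSize,
		ttl:     ttl,
		order:   list.New(),
		items:   make(map[string]*list.Element),
	}
}

func (c *lazyLRU) expired(e *entry) bool {
	return c.ttl > 0 && time.Since(e.createdAt) > c.ttl
}

func (c *lazyLRU) Get(key string) (interface{}, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	ent := el.Value.(*entry)
	if c.expired(ent) {
		// TTL is enforced only here, when the entry is touched;
		// an idle cache never reaches this code and never shrinks.
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return ent.value, true
}

func (c *lazyLRU) Put(key string, value interface{}) {
	if el, ok := c.items[key]; ok {
		ent := el.Value.(*entry)
		ent.value, ent.createdAt = value, time.Now()
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.maxSize {
		// Memory is reclaimed by size pressure, not by a timer.
		if oldest := c.order.Back(); oldest != nil {
			c.order.Remove(oldest)
			delete(c.items, oldest.Value.(*entry).key)
		}
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value, createdAt: time.Now()})
}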

For 2: I think it depends on your workflows; there are a number of server metrics you can use:

execution_state_size
execution_info_size
mutable_state_size
history_size
buffered_events_size
signal_info_size
request_cancel_info_size
child_info_size
timer_info_size
activity_info_size
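As a rough worked example of combining these (the per-item size below is a hypothetical assumption, not a measurement): if the average mutable_state_size reported by your metrics were around 10 KB, a fully warm cache could hold up to 4096 shards * 512 items/shard * 10 KB ≈ 20 GB. Substitute the averages you actually observe for a realistic estimate.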


Hi @tihomir,
Thank you for your feedback.

Cached items are removed only when the cache size limit is reached.

  1. If cached items are not removed after some interval, then what are HistoryCacheTTL & EventsCacheTTL in the dynamic config for? I thought they were the cache TTLs:
// HistoryCacheTTL is TTL of history cache
HistoryCacheTTL = history.cacheTTL
// EventsCacheTTL is TTL of events cache
EventsCacheTTL = history.eventsCacheTTL
  2. Can you provide an example query for the history_size metric?

  3. Currently, I have a problem where the history pods always OOM.
    I already increased to 2 history pods with a max memory of 24GB, but still got OOMs.
    The current configuration uses 4096 shards and the default HistoryCacheMaxSize.
    Do you have any recommendation for managing the history pods so they do not OOM with 4096 shards?


The temporal-server-history pods always get OOM-killed.

Hi, we are also experiencing OOM errors on the history service.
Any update here?
Specifically, if the TTL is not working, how do we remove cached items from the history service?

It looks like the Temporal history service requires an unacceptably high amount of memory.
There are many issues about high memory usage, and the Temporal team has not yet given valid feedback.

Do you have any recommendation for managing the history pods so they do not OOM with 4096 shards?

How many history hosts do you deploy? Temporal tries to distribute shards evenly across history hosts (for example, with 4096 shards and 2 history pods, each pod would own roughly 2048 shards).

the Temporal team has not yet given valid feedback

Can you give more info: server version, persistence store used, namespace retention period? I don't think we haven't provided "valid" feedback; rather, the solution is often quite dependent on the user's deployment setup.
Temporal does provide dynamic configs:

history.cacheInitialSize default 128
history.cacheMaxSize default 512
history.eventsCacheInitialSize default 128
history.eventsCacheMaxSize default 512

These can be tuned while your service is running, if needed.
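For example, here is a minimal sketch (the values are arbitrary assumptions; size them to your workload) that caps the per-shard caches below the defaults, using the same dynamic config format as the scenario at the top of the thread:

history.cacheMaxSize:
- value: 256
  constraints: {}
history.eventsCacheMaxSize:
- value: 256
  constraints: {}

Lowering these trades cache hit rate (more reads from the persistence store) for a lower memory ceiling.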

@tihomir
Thanks for your reply. I have created an issue here.