History Mem Usage, Cache Size & TTL

Hi,
I have some questions regarding memory usage of history pods.

  • Scenario
    I run 1 frontend pod, 1 history pod, 1 matching pod, and 1 worker pod.
    I use the dynamic config below, do not set the history cache size at all (so the defaults apply), and use NUM_HISTORY_SHARDS=4096.
matching.numTaskqueueReadPartitions:
- value: 4
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 4
  constraints: {}
matching.rps:
- value: 204800
  constraints: {}
frontend.rps:
- value: 204800
  constraints: {}
history.rps:
- value: 204800
  constraints: {}

I load-test the pods at 100 workflows/sec to observe the behaviour.

  • Question 1
    The history pod used a lot of memory during the load (as expected).
    But after being idle for 1 hour, the history pod's memory did not decrease at all and stayed at around 60% usage.


    I read that the default for HistoryCacheTTL & EventsCacheTTL is 1 hour (time.Hour), so why did the history pod's memory usage not decrease at all after 1 hour idle?
    I use the Docker image temporalio/server:1.16.2 for the pods.

  • Question 2
    So, regarding this post,
    my 1 history pod should be holding up to 4096 * 512 (default HistoryCacheMaxSize) = 2,097,152 cached items.
    How do I calculate the memory required for the maximum number of cached items?
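A rough sketch of that arithmetic, assuming the cache limit applies per shard (as the calculation above does). The average item sizes used here are hypothetical placeholders, not measured values; the real figure depends on the workflows and has to be measured.

```python
# Sketch only: the shard count and per-shard limit come from the post;
# the average item sizes are HYPOTHETICAL placeholders, not measurements.
NUM_HISTORY_SHARDS = 4096
HISTORY_CACHE_MAX_SIZE = 512  # default per-shard limit quoted in the post


def max_cached_items(shards: int, per_shard_max: int) -> int:
    """Upper bound on cached items if every shard fills its cache."""
    return shards * per_shard_max


def estimated_cache_bytes(items: int, avg_item_bytes: int) -> int:
    """Memory needed if every cached item has the given average size."""
    return items * avg_item_bytes


items = max_cached_items(NUM_HISTORY_SHARDS, HISTORY_CACHE_MAX_SIZE)
print(f"max cached items: {items:,}")

# Try a few assumed average sizes to see how quickly the bound grows.
for avg_kib in (4, 16, 64):
    gib = estimated_cache_bytes(items, avg_kib * 1024) / 2**30
    print(f"avg item {avg_kib:>2} KiB -> up to {gib:.0f} GiB")
```

Even a small assumed item size gives a large upper bound when the cache is allowed to fill on every shard.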

Hi, can anyone clarify the points raised here?
Let me know if any data or configuration is needed.

For 1: Cached items should be removed if cache size limit is reached. There is no background thread running that cleans them up.
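To illustrate the behaviour described here, below is a minimal sketch of a cache with no background cleanup. It is not Temporal's actual cache code; the class name `LazyTTLCache` and its methods are hypothetical. Entries leave the cache only when the size limit forces an eviction or when an expired entry happens to be looked up, so an idle cache keeps its memory after the TTL has passed.

```python
import time
from collections import OrderedDict


class LazyTTLCache:
    """Illustrative LRU cache: TTL is checked on access only, no sweeper thread."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = OrderedDict()  # key -> (value, stored_at)

    def put(self, key, value):
        if key in self._items:
            self._items.pop(key)
        elif len(self._items) >= self.max_size:
            self._items.popitem(last=False)  # size limit reached: evict LRU entry
        self._items[key] = (value, self.clock())

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._items[key]  # expired entry is dropped only when touched
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def __len__(self):
        return len(self._items)


# Fake clock so the example does not need to sleep.
now = [0.0]
cache = LazyTTLCache(max_size=2, ttl_seconds=3600, clock=lambda: now[0])
cache.put("wf-1", "state-1")
cache.put("wf-2", "state-2")

now[0] = 2 * 3600           # two idle hours, TTL long past
idle_len = len(cache)       # nothing was cleaned up while idle
expired = cache.get("wf-1") # the lookup finally removes the expired entry
after_len = len(cache)
print(idle_len, expired, after_len)
```

In this sketch the TTL only decides whether a lookup may return an entry; it does not free memory on its own.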

For 2: I think it depends on your workflows. There are a number of server metrics you can use:

execution_state_size
execution_info_size
mutable_state_size
history_size
buffered_events_size
signal_info_size
request_cancel_info_size
child_info_size
timer_info_size
activity_info_size

Hi @tihomir,
Thank you for your feedback.

Cached items should be removed if cache size limit is reached.

  1. If cached items are not removed after some interval, then what are HistoryCacheTTL & EventsCacheTTL in the dynamic config for? I thought they set the cache TTL.
// HistoryCacheTTL is TTL of history cache
HistoryCacheTTL = history.cacheTTL
// EventsCacheTTL is TTL of events cache
EventsCacheTTL = history.eventsCacheTTL
  2. Can you provide an example query for the history_size metric?

  3. Currently my history pods always hit OOM.
    I already increased to 2 history pods and a 24GB memory limit, but they still hit OOM.
    My current configuration uses 4096 shards and the default HistoryCacheMaxSize.
    Do you have any recommendation for managing the history pods so that they do not hit OOM with 4096 shards?


    The temporal-server-history pods always hit OOM.
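For a sense of scale, here is a small sketch of the per-item memory budget implied by these numbers. It assumes the 24GB limit is per pod (taken as 24 GiB), that the 4096 shards split evenly across the 2 pods, and that the 512-item limit applies per shard. The function name is hypothetical, and the result ignores everything else the history service keeps in memory.

```python
def per_item_budget_bytes(pod_memory_bytes, total_shards, pods, per_shard_cache_max):
    """Memory available per cached item if every shard's cache is full."""
    shards_per_pod = total_shards // pods            # assumes an even split
    max_items_per_pod = shards_per_pod * per_shard_cache_max
    return pod_memory_bytes / max_items_per_pod


# ASSUMPTIONS: 24 GiB per pod, 2 pods, per-shard cache limit of 512.
budget = per_item_budget_bytes(24 * 2**30, 4096, 2, 512)
print(f"{budget / 1024:.0f} KiB per cached item")
```

If the average cached item is larger than this budget, full caches alone would exceed the pod's limit under these assumptions.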