History Mem Usage, Cache Size & TTL

Hi,
I have some questions regarding memory usage of history pods.

  • Scenario
    I run 1 frontend pod, 1 history pod, 1 matching pod, and 1 worker pod.
    I use the dynamic config below, leave the history cache size config unset (defaults), and set NUM_HISTORY_SHARDS=4096.
matching.numTaskqueueReadPartitions:
- value: 4
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 4
  constraints: {}
matching.rps:
- value: 204800
  constraints: {}
frontend.rps:
- value: 204800
  constraints: {}
history.rps:
- value: 204800
  constraints: {}

I load tested the pods at 100 workflows/sec to observe the behaviour.

  • Question 1
    The history pod used a lot of memory during load (as expected).
    But after being idle for 1 hour, the history pod's memory did not decrease at all and stayed at around 60% usage.


    I read that the defaults for HistoryCacheTTL & EventsCacheTTL are 1 hour (time.Hour), so why did the history pod's memory usage not decrease at all after 1 hour of idle?
    I use the temporalio/server:1.16.2 Docker image for the pods.

  • Question 2
    So, regarding this post,
    my single history pod should be caching up to 4096 * 512 (default HistoryCacheMaxSize) = 2,097,152 items.
    How do I calculate the memory required for the maximum number of cached items?


Hi, can anyone clarify the questions raised here?
Let me know if any additional data or configuration is needed.

For 1: cached items are only evicted when the cache size limit is reached; there is no background thread that cleans the cache up. As far as I know, the TTL is enforced lazily (checked when an entry is accessed), so an idle pod will not release memory just because the TTL has passed.

For 2: I think it depends on your workflows; there are a number of server metrics you can use:

execution_state_size
execution_info_size
mutable_state_size
history_size
buffered_events_size
signal_info_size
request_cancel_info_size
child_info_size
timer_info_size
activity_info_size
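
As a rough upper-bound sketch (the per-item size here is purely an assumption — measure yours with the metrics above): if your typical mutable_state_size is ~10 KB, then 4096 shards × 512 cached items (default HistoryCacheMaxSize) × 10 KB ≈ 20 GB for a single host owning every shard. The cache will rarely be full across all shards at once, but this shows how 4096 shards with default cache sizes can add up.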


Hi @tihomir,
Thank you for your feedback.

Cached items are only evicted when the cache size limit is reached.

  1. If cached items are not removed after some interval, what are HistoryCacheTTL & EventsCacheTTL in the dynamic config for? I thought they set the cache TTL:
// HistoryCacheTTL is TTL of history cache
HistoryCacheTTL = history.cacheTTL
// EventsCacheTTL is TTL of events cache
EventsCacheTTL = history.eventsCacheTTL
  2. Can you provide an example of how to query the history_size metric?

  3. Currently, my history pods always OOM.
    I already increased to 2 history pods with a 24GB memory limit, but they still get OOMKilled.
    My current configuration uses 4096 shards and the default HistoryCacheMaxSize.
    Do you have any recommendation for managing the history pods so they do not OOM with 4096 shards?


    The temporal-server-history pods always get OOMKilled.

Hi, we are also experiencing OOM errors on the history service.
Any update here?
Specifically, if the TTL does not evict items, how do we remove cached items from the history service?

It looks like the Temporal history service needs an unacceptably large amount of memory.
There are many issues about high memory usage, and the Temporal team has not yet given valid feedback.

Do you have any recommendation for managing the history pods so they do not OOM with 4096 shards?

How many history hosts do you deploy? Temporal tries to evenly distribute shards across history hosts.
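
To illustrate with the numbers from this thread (assuming an even split): with 4096 shards on 2 history hosts, each host owns roughly 4096 / 2 = 2048 shards, so its history cache can hold up to 2048 × 512 (default history.cacheMaxSize) = 1,048,576 items. Adding history hosts reduces the per-host shard count, and with it the per-host cache ceiling.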

the Temporal team has not yet given valid feedback

Can you give more info: server version, persistence store used, and namespace retention duration? I wouldn't say that we haven't provided “valid” feedback; rather, the solution often depends heavily on the user's deployment setup.
Temporal does provide dynamic configs:

history.cacheInitialSize default 128
history.cacheMaxSize default 512
history.eventsCacheInitialSize default 128
history.eventsCacheMaxSize default 512

These can be tuned while your service is running if needed.
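
For example, a dynamic config override lowering both cache limits could look like the snippet below, in the same format as the config earlier in the thread (the values are illustrative only, not a sizing recommendation — tune against your own metrics):

history.cacheMaxSize:
- value: 256
  constraints: {}
history.eventsCacheMaxSize:
- value: 256
  constraints: {}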

@tihomir
Thanks for your reply. I have created an issue here.