Temporal History Service Memory Usage

Hi,

We’ve been running Temporal in Kubernetes for a while now and have noticed that, compared to the other services, the history service is using substantially more memory. For context:

  • frontend: ~150MB

  • matching: ~140MB

  • worker: ~50MB

  • history: ~6.8GB

We have the retention period set to 3 days, and there is currently only a single pod for each service. Are there any suggestions as to why this service would be using so much memory?

I did notice that the Settings page in the web UI shows HISTORY ARCHIVAL as disabled. Is this relevant?

There also appear to be large CPU spikes in the history service every couple of minutes, where the average goes from 0.25 CPU to 1.5 CPU.

> I did notice that the Settings page in the web UI shows HISTORY ARCHIVAL as disabled. Is this relevant?

This has to do with the Archival feature, not the history service itself.

> There also appear to be large CPU spikes in the history service every couple of minutes

What’s your average CPU usage for the history service? What is the CPU limit?
How much memory do you allocate to the history service? Can you check its pod restart count (restarts could indicate OOM issues)?
I would try increasing the memory limit first to see if it makes a difference.
If increasing memory is not an option, you could try decreasing the history cache sizes (note this comes at the cost of increased persistence load and latencies).
Dynamic configuration knobs for this:

  • history.cacheInitialSize (default 128)

  • history.cacheMaxSize (default 512)

  • history.eventsCacheInitialSize (default 128)

  • history.eventsCacheMaxSize (default 512)

(Just note that decreasing cache sizes should only be considered when giving the history service more memory is not an option.)
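If you do go down the cache-size route, these knobs can be overridden through the dynamic config file your server loads (the exact file path depends on how you deploy it, e.g. via the Helm chart values). A minimal sketch, with purely illustrative values:

```yaml
# dynamic config overrides - values below are illustrative, tune for your workload
history.cacheInitialSize:
  - value: 64
    constraints: {}
history.cacheMaxSize:
  - value: 256
    constraints: {}
history.eventsCacheInitialSize:
  - value: 64
    constraints: {}
history.eventsCacheMaxSize:
  - value: 256
    constraints: {}
```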

Thanks @tihomir,

I will have a look at the options you have suggested.

Question: is the memory usage in the history service linked to the current number of “Running” workflows, i.e. the more workflows currently running, the more memory is used? I did notice a number of “old” workflows in our system that were left over from testing but are still in “Running” status.

@tihomir here are the stats you requested:

  1. There is currently no CPU limit imposed. The average is 0.25 CPU, with a spike to 1–1.5 CPU for about a minute every 2–3 minutes.

  2. There are currently no memory limits imposed, so the pod has never restarted. It seems to just gradually accumulate more memory over time.
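For context, there is currently no resources block at all on the history deployment; if we do impose limits it would be something along these lines (values purely illustrative, not a recommendation):

```yaml
# illustrative only - nothing like this is currently set on our history deployment
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    memory: 8Gi
```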

I did delete ~300 workflows that were in “Running” status, but I haven’t seen any reduction in memory usage. Could you confirm the following points to improve my understanding of the history service:

  1. History service memory usage is relative to the number of actively running workflows. For example, after cleaning up the unwanted workflows we now have ~30 cron workflows running, and these only trigger a couple of times a day.

  2. What else could account for such large memory usage? Even before cleaning up the 300 workflows, I would not have thought ~300 running workflows was a lot.

  3. Could the large memory footprint relate to some large history events, i.e. events whose inputs/outputs are large? If so, is there any way to check this, e.g. by querying one of the tables?
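For point 3, I was thinking of something along these lines (assuming the default PostgreSQL schema; table and column names may differ for other databases or schema versions):

```sql
-- rough per-shard view of how much history blob data is stored
SELECT shard_id,
       count(*)                AS history_nodes,
       sum(octet_length(data)) AS total_bytes
FROM history_node
GROUP BY shard_id
ORDER BY total_bytes DESC
LIMIT 10;
```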

I’m trying to get a better understanding of what could be consuming so much memory.

Hi @tihomir,

I created an additional 2 history pods (3 replicas in total) and could see from the logs that some of the shards were reallocated to the new pods. The memory usage only went down marginally, which surprised me.

If there are 3 history pods, are the shards allocated evenly amongst them? If so, I would have expected the memory usage to drop to roughly a third.
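For reference, my understanding is that the total number of shards is fixed in the static server config at cluster creation (it can’t be changed afterwards), and those shards are then distributed across whichever history hosts are in the membership ring, e.g.:

```yaml
# static server config - numHistoryShards is set once when the cluster is created
persistence:
  numHistoryShards: 512   # illustrative value only
```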

I also noticed some errors in the logs, such as:

"level":"error","ts":"2022-06-28T08:00:18.800Z","msg":"uncategorized error","operation":"RecordWorkflowTaskStarted",......

"level":"error","ts":"2022-06-28T08:00:24.550Z","msg":"Fail to process task","service":"history","shard-id":627,......

"level":"error","ts":"2022-06-29T04:00:31.931Z","msg":"Persistent store operation Failure","service":"history".....

Can you confirm whether errors like these could cause memory leaks in the history service, e.g. a shard not being released after processing a request? Also, are there any commands/tools available to see how many shards are being used?