Temporal History Service Memory Usage

Hi,

We’ve been running Temporal in Kubernetes for a while now and have noticed that, compared to the other services, the history service is using substantially more memory. For context:

  • frontend: ~150MB

  • matching: ~140MB

  • worker: ~50MB

  • history: ~6.8GB

We have the retention period set to 3 days, and there is currently only a single pod for each service. Are there any suggestions as to why this service would be using so much memory?

I did notice that the Settings page in the web UI shows HISTORY ARCHIVAL as disabled. Is this relevant?

There also appear to be large CPU spikes in the history service every couple of minutes, where the average goes from 0.25 CPU to 1.5 CPU.

> I did notice that the Settings page in the web UI shows HISTORY ARCHIVAL as disabled. Is this relevant?

This has to do with the Archival feature, not the history service itself.

> There also appear to be large CPU spikes in the history service every couple of minutes

What’s your average CPU usage for the history service? What is the CPU limit?
How much memory do you allocate to the history service? Can you check its pod restart count (restarts could indicate OOM issues)?
I would try increasing the memory limit first to see if it makes a difference.
If increasing memory is not an option, you could try decreasing the history cache sizes (note this comes at the cost of increased persistence load and latencies).
Dynamic configuration knobs for this:

  • history.cacheInitialSize (default 128)

  • history.cacheMaxSize (default 512)

  • history.eventsCacheInitialSize (default 128)

  • history.eventsCacheMaxSize (default 512)

(Just note that decreasing cache sizes should only be considered when giving the history service more memory is not an option.)
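If you do go down the cache-size route, these knobs can be overridden through the dynamic config file your server loads (the exact file path depends on how you deploy it, e.g. via the Helm chart values). A minimal sketch, with purely illustrative values:

```yaml
# dynamic config overrides - values below are illustrative, tune for your workload
history.cacheInitialSize:
  - value: 64
    constraints: {}
history.cacheMaxSize:
  - value: 256
    constraints: {}
history.eventsCacheInitialSize:
  - value: 64
    constraints: {}
history.eventsCacheMaxSize:
  - value: 256
    constraints: {}
```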

Thanks @tihomir,

I will have a look at the options you have suggested.

Question: is the memory usage in the history service linked to the current number of “Running” workflows, i.e. the more workflows currently running, the more memory is used? I did notice a number of “old” workflows in our system that were left over from testing but are still in “Running” status.

@tihomir here are the stats you requested:

  1. There is currently no CPU limit imposed. The average is 0.25 CPU, with a spike to 1–1.5 CPU for about a minute every 2–3 minutes.

  2. There are currently no memory limits imposed, so the pod has never restarted. It seems to just gradually accumulate more memory over time.
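For context, there is currently no resources block at all on the history deployment; if we do impose limits it would be something along these lines (values purely illustrative, not a recommendation):

```yaml
# illustrative only - nothing like this is currently set on our history deployment
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    memory: 8Gi
```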

I did delete ~300 workflows that were in “Running” status, but I haven’t seen any reduction in memory usage. Could you confirm the following points to improve my understanding of the history service:

  1. History service memory usage is relative to the number of actively running workflows. For example, after cleaning up the unwanted workflows we now have ~30 cron workflows running, and these only trigger a couple of times a day.

  2. What else could account for such large memory usage? Even before cleaning up the 300 workflows, I would not have thought ~300 running workflows was a lot.

  3. Could the large memory footprint relate to some large history events, i.e. events whose inputs/outputs are large? If so, is there any way to check this, e.g. by querying one of the tables?
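For point 3, I was thinking of something along these lines (assuming the default PostgreSQL schema; table and column names may differ for other databases or schema versions):

```sql
-- rough per-shard view of how much history blob data is stored
SELECT shard_id,
       count(*)                AS history_nodes,
       sum(octet_length(data)) AS total_bytes
FROM history_node
GROUP BY shard_id
ORDER BY total_bytes DESC
LIMIT 10;
```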

I’m trying to get a better understanding of what could be consuming so much memory.

Hi @tihomir,

I created an additional 2 history pods (3 replicas in total) and could see from the logs that some of the shards were reallocated to the new pods. The memory usage only went down marginally, which surprised me.

If there are 3 history pods, are the shards allocated evenly amongst them? If so, I would have expected the memory usage to drop to roughly a third.
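For reference, my understanding is that the total number of shards is fixed in the static server config at cluster creation (it can’t be changed afterwards), and those shards are then distributed across whichever history hosts are in the membership ring, e.g.:

```yaml
# static server config - numHistoryShards is set once when the cluster is created
persistence:
  numHistoryShards: 512   # illustrative value only
```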

I also noticed some errors in the logs, such as:

"level":"error","ts":"2022-06-28T08:00:18.800Z","msg":"uncategorized error","operation":"RecordWorkflowTaskStarted",......

"level":"error","ts":"2022-06-28T08:00:24.550Z","msg":"Fail to process task","service":"history","shard-id":627,......

"level":"error","ts":"2022-06-29T04:00:31.931Z","msg":"Persistent store operation Failure","service":"history".....

Can you confirm whether errors like these could cause memory leaks in the history service, e.g. a shard not being released after processing a request? Also, are there any commands/tools available to see how many shards are being used?