Memory leak in Temporal History service v1.18.3

During our performance testing we noticed that the History service memory usage rises over time and does not decrease, even if there are no running workflows:

  1. We started 30k workflows, waited for all of them to complete, and then observed that the memory_heapinuse metric stayed at 4 GB with no running workflows, even some time after completion. Container memory metrics confirmed this.
  2. Then we started another 30k workflows. Once they completed, memory was at 8 GB.
  3. After some time we started yet another 30k workflows, and memory was at 12 GB after completion.

We are running Temporal v1.18.3 with the following settings:
Number of shards: 4096
Service Scale: 1 / 1 / 1 / 1 (1 frontend container, 1 history container, etc)
Dynamic Config:

  - value: "on"
    constraints: {}
  - value: true
    constraints: {}
  - value: 255
    constraints: {}
  - value: true
  - value: true
  - value: true
  - value: true
  - value: true
  - value: true
# Remove this after upgrading to 1.19:
  - value: false

This is the heap profile of the history service after step 2 (via pprof):

This is the diff between the heap profiles taken before and after step 2 (go tool pprof -diff_base):

Edit: the initial post said 10k workflows, but I got that wrong - we start about 30k.

What's the persistence store you are using? What's the retention period set on the namespace where you start your workflow executions?
My current guess is that you have a long retention period set. Try setting it to a smaller value and monitor your memory use from workflow close until past the retention period.

Also, I think you have too few history nodes for the number of history shards. Typically you want at most around 500 shards per history node, so I would suggest adding more nodes to distribute the shards across them.

Hi @tihomir! We are using MySQL as the persistence store, and retention is 14 days. IMO the retention period should not be the issue here, because we execute 30k workflows several times within a single day; I was not clear about that in the initial post.

We start 30k workflows, and they all complete within 70 minutes. After 5 minutes we start another 30k workflows, and so on. Memory usage rises until it reaches the limit and the container is restarted (but not before a period of slowness, probably because of more frequent GCs).

I’ve also run the tests with 512 shards and see the same behaviour, although the memory increments are smaller (+2.5 GB after 30k workflows with 512 shards vs +4 GB after 30k workflows with 4096 shards).

Maybe the root cause is in the LRU cache implementation for workflow executions? I would expect entries to be evicted after the TTL (1 hour by default for history.cacheTTL), but it looks like they remain because eviction of old workflows is never triggered (there is no Get / Delete for old workflow executions and no iteration over all history cache entries).

In other words, what I assume happens is that old workflows remain in the history cache and are not evicted after the TTL; they can only be evicted once the cache reaches its max size. This is because the cache implementation evicts an entry based on TTL only on Get/Delete/Put for that specific entry, or on iteration over entries (which is never performed for the history cache).
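
To make the suspected behaviour concrete, here is a minimal Go sketch of an LRU cache that checks the TTL only when a specific key is touched. It is not Temporal's actual cache code: the `lazyCache` type, its methods and the injected `now` clock (in seconds) are hypothetical names chosen for illustration.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached item plus the time at which it expires.
type entry struct {
	key       string
	value     string
	expiresAt int64 // seconds on the injected clock
}

// lazyCache is a hypothetical LRU cache that checks the TTL only when a
// specific key is touched. It is NOT Temporal's implementation.
type lazyCache struct {
	maxSize int
	ttl     int64
	now     func() int64 // e.g. func() int64 { return time.Now().Unix() }
	order   *list.List   // front = most recently used
	items   map[string]*list.Element
}

func newLazyCache(maxSize int, ttl int64, now func() int64) *lazyCache {
	return &lazyCache{
		maxSize: maxSize,
		ttl:     ttl,
		now:     now,
		order:   list.New(),
		items:   make(map[string]*list.Element),
	}
}

// Put inserts or refreshes a key. The only eviction it performs is
// dropping the least recently used entry when the cache is full.
func (c *lazyCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.value = value
		e.expiresAt = c.now() + c.ttl
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.maxSize {
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(*entry).key)
		c.order.Remove(oldest)
	}
	e := &entry{key: key, value: value, expiresAt: c.now() + c.ttl}
	c.items[key] = c.order.PushFront(e)
}

// Get returns the value for key and drops the entry if its TTL has passed.
// Expired entries that are never requested again are never dropped here.
func (c *lazyCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	e := el.Value.(*entry)
	if c.now() >= e.expiresAt {
		c.order.Remove(el)
		delete(c.items, key)
		return "", false
	}
	c.order.MoveToFront(el)
	return e.value, true
}

// Len reports how many entries are held in memory, expired or not.
func (c *lazyCache) Len() int { return c.order.Len() }

func main() {
	clock := int64(0)
	c := newLazyCache(512, 3600, func() int64 { return clock })
	for i := 0; i < 59; i++ {
		c.Put(fmt.Sprintf("wf-%d", i), "mutable state")
	}
	clock += 2 * 3600    // two hours later every entry is past its TTL
	fmt.Println(c.Len()) // prints 59: nothing touched them, nothing evicted them
	c.Get("wf-0")        // touching one key evicts only that key
	fmt.Println(c.Len()) // prints 58
}
```

In this sketch, completed workflows that are never looked up again stay in memory until the cache fills up, which would match the growth pattern described above if the real cache behaves the same way.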

If we take 512 shards, we get 30k workflows / 512 shards ≈ 59 workflows per shard cache. To reach the max cache size of 512 entries we would need to run our 30k workflows 512 / 59 ≈ 8.7, i.e. about 9 times, but the RAM already runs out after 3 or 4 runs.
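
The estimate above can be checked with a few lines of Go. This is only a sketch: it assumes workflows are spread evenly across shards and that each workflow leaves exactly one cache entry behind, and the function names are made up for the example. It rounds up at each step, so it reports the number of full runs needed before any per-shard cache is full.

```go
package main

import "fmt"

// entriesPerRun returns how many cache entries one test run adds to each
// shard cache, assuming an even spread of workflows (rounded up).
func entriesPerRun(workflows, shards int) int {
	return (workflows + shards - 1) / shards
}

// runsToFill returns how many runs are needed before a per-shard cache
// holding maxSize entries is full (rounded up).
func runsToFill(maxSize, perRun int) int {
	return (maxSize + perRun - 1) / perRun
}

func main() {
	for _, shards := range []int{512, 4096} {
		perRun := entriesPerRun(30000, shards)
		fmt.Printf("shards=%d entries/run=%d runs to fill=%d\n",
			shards, perRun, runsToFill(512, perRun))
	}
}
```

With 4096 shards each run adds only about 8 entries per shard cache, so under these assumptions size-based eviction would start even later than with 512 shards.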

Is this by design? If not, maybe the cache implementation could be improved by adding configurable periodic TTL eviction for all entries (occasionally triggered on write for any entry for example)?
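
As a rough illustration of that suggestion, the sketch below sweeps all expired entries on every Nth write. It is a hypothetical design, not an existing Temporal option: `sweepingCache`, `sweepEvery` and the injected `now` clock (in seconds) are names invented here, and LRU ordering and locking are left out to keep the example short.

```go
package main

import "fmt"

// item is a cached value plus its expiry time in clock seconds.
type item struct {
	value     string
	expiresAt int64
}

// sweepingCache is a hypothetical TTL cache that removes ALL expired
// entries on every sweepEvery-th write, regardless of which key is written.
type sweepingCache struct {
	ttl        int64
	sweepEvery int
	writes     int
	now        func() int64
	items      map[string]item
}

func newSweepingCache(ttl int64, sweepEvery int, now func() int64) *sweepingCache {
	if sweepEvery < 1 {
		sweepEvery = 1 // sweep on every write rather than divide by zero
	}
	return &sweepingCache{
		ttl:        ttl,
		sweepEvery: sweepEvery,
		now:        now,
		items:      make(map[string]item),
	}
}

// Put stores the value and occasionally triggers a full TTL sweep.
func (c *sweepingCache) Put(key, value string) {
	c.items[key] = item{value: value, expiresAt: c.now() + c.ttl}
	c.writes++
	if c.writes%c.sweepEvery == 0 {
		c.sweep()
	}
}

// sweep walks every entry once and deletes those past their TTL.
// Deleting from a map while ranging over it is safe in Go.
func (c *sweepingCache) sweep() {
	now := c.now()
	for k, it := range c.items {
		if now >= it.expiresAt {
			delete(c.items, k)
		}
	}
}

// Len reports how many entries are currently held.
func (c *sweepingCache) Len() int { return len(c.items) }

func main() {
	clock := int64(0)
	c := newSweepingCache(3600, 10, func() int64 { return clock })
	for i := 0; i < 59; i++ {
		c.Put(fmt.Sprintf("wf-%d", i), "mutable state")
	}
	fmt.Println(c.Len()) // prints 59
	clock += 2 * 3600    // two hours later the first batch is past its TTL
	c.Put("wf-new", "mutable state")
	fmt.Println(c.Len()) // prints 1: the 60th write triggered a sweep
}
```

The trade-off is an occasional O(n) pass on the write path; a larger `sweepEvery` makes sweeps rarer but lets expired entries linger longer.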

Since you have only one service pod holding all your shards, you would want to reduce the per-shard cache size. The config history.cacheMaxSize is the entry count per shard.
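
For a rough sense of scale, the Go sketch below estimates how much heap the caches on a single history pod could hold at a given per-shard size. It rests on loud assumptions: that the observed +4 GB per 30k workflows is entirely cache entries, that "GB" means GiB, and that entries are of similar size. The 8 GiB budget and all function names are hypothetical.

```go
package main

import "fmt"

const gib = 1 << 30

// bytesPerEntry estimates the footprint of one cached workflow from the
// observed heap growth (assumption: all growth is cache entries).
func bytesPerEntry(growthBytes, workflows int64) int64 {
	return growthBytes / workflows
}

// worstCaseBytes is the heap needed if every shard cache on one pod
// filled up to maxSize entries.
func worstCaseBytes(shards, maxSize, perEntry int64) int64 {
	return shards * maxSize * perEntry
}

// maxSizeForBudget returns the largest per-shard entry count that keeps
// the worst case within budgetBytes.
func maxSizeForBudget(budgetBytes, shards, perEntry int64) int64 {
	return budgetBytes / (shards * perEntry)
}

func main() {
	perEntry := bytesPerEntry(4*gib, 30000) // from "+4 GB after 30k workflows"
	fmt.Println("bytes per entry:", perEntry)
	fmt.Println("worst case GiB (4096 shards x 512 entries):",
		worstCaseBytes(4096, 512, perEntry)/gib)
	fmt.Println("per-shard size for a hypothetical 8 GiB budget:",
		maxSizeForBudget(8*gib, 4096, perEntry))
}
```

Under these assumptions a per-shard size of 512 across 4096 shards on one pod allows far more cached state than the container can hold, which is why a smaller history.cacheMaxSize or more history nodes would bound memory earlier.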