We now have a memory leak problem

Our backend uses the Python Flask framework with Temporal as an asynchronous task queue to handle our scheduled and asynchronous tasks. In our system, Temporal is treated as a Celery-like component, and we redeployed a server to run the Temporal workers.
We then encountered a memory leak related to Temporal: the memory of the server running the Temporal worker keeps growing.
I would like to ask for troubleshooting measures and whether anyone has encountered similar problems.

For now we work around this by restarting Docker with a scheduled task, but the server's memory usage still climbs to nearly 100%.

We are not aware of any memory leaks, but memory usage can grow until limits are reached. Specifically, running activities use memory, and the number of running activities can be bounded by max_concurrent_activities. Workflows also use memory in a cache as an optimization and will fill that cache when they can; the cache size can be controlled by max_cached_workflows.
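
For readers following along, here is a minimal sketch of where those two options are set when constructing a worker. It assumes the Temporal Python SDK (temporalio); the workflow, activity, task-queue name, and limit values are placeholders, not recommendations.

```python
# Minimal sketch (Temporal Python SDK assumed) of bounding worker memory with
# the two options discussed above. The workflow, activity, task-queue name,
# and numbers are placeholders.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def say_hello(name: str) -> str:
    return f"Hello, {name}"


@workflow.defn
class HelloWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        return await workflow.execute_activity(
            say_hello, name, start_to_close_timeout=timedelta(seconds=10)
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[HelloWorkflow],
        activities=[say_hello],
        # Bounds memory from concurrently running activities.
        max_concurrent_activities=50,
        # Bounds memory from the per-worker workflow cache.
        max_cached_workflows=100,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```

Lower values for both limits cap how much memory a single worker can hold, at the cost of throughput (fewer concurrent activities, more workflow replays when the cache misses).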

I have a question. I feel this is the situation we are facing: our worker activities are always running, and after an old worker ends a new one is started, so each worker's workflow cache keeps using memory as an optimization. In that case our memory gradually grows to 100%. Is this the reason?

We configured the values of max_cached_workflows and max_concurrent_activities in the old code, but the server's memory continued to grow, which confused me, so I wondered whether this is the cause of the memory leak problem.

Usually workers live for the life of the process, but regardless, the workflow cache is per worker and is collected when the worker is. So yes, more workers mean more memory usage.

In the past few months, our server memory has continued to grow because of Temporal. We have upgraded the cloud server's memory from 8 GB to 16 GB, and now to 64 GB.

If we don't upgrade the physical memory, the proportion of memory used by Temporal keeps increasing until it occupies the entire memory space (this did happen).

We currently reduce Temporal's memory consumption by restarting our Docker containers with a daily scheduled task. So I suspect there may be a problem with our Temporal settings, or perhaps a memory leak somewhere in our system.

As time goes by, our memory usage eventually reaches 100%, even on a server with 64 GB of memory. This is a serious problem for us right now.

This could also be due to activities not completing and continually increasing memory. Are you properly heartbeating in your activities? Granted, the number of running activities will still be bounded by max_concurrent_activities.
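
For context, heartbeating looks roughly like the sketch below (again assuming the Python SDK; the activity body, names, and timeouts are placeholders). Heartbeats only have an effect if the calling workflow sets a heartbeat timeout.

```python
# Sketch of activity heartbeating (Temporal Python SDK assumed); the loop body
# stands in for real work.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def process_items(count: int) -> None:
    for i in range(count):
        # ... do one chunk of work here ...
        # Report progress so the server can detect a hung activity instead of
        # letting it sit (and hold memory) until start_to_close_timeout fires.
        activity.heartbeat(i)


@workflow.defn
class ProcessWorkflow:
    @workflow.run
    async def run(self, count: int) -> None:
        await workflow.execute_activity(
            process_items,
            count,
            start_to_close_timeout=timedelta(minutes=30),
            # Heartbeats only matter if a heartbeat timeout is set here.
            heartbeat_timeout=timedelta(seconds=30),
        )
```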

We are not aware of any memory leaks. If you are able to replicate this and then continually reduce it to a small standalone replication, we can help debug and see whether there is an issue with the SDK.

In our code, activities are automatically terminated after a certain duration, so there are no leftover activities from the past causing the memory increase. I can guarantee there is no situation where an activity fails to complete and keeps occupying memory.
So this is why I suspect there is a memory leak problem.


So this is where I’m confused now.

I am afraid that from these posts alone I cannot diagnose where the leak is. We are unaware of any leaks in the SDK at this time (though it is possible they exist in circumstances we haven't found before). We would need a small standalone replication to debug. If you can reliably replicate the issue, can you reduce it down to a small standalone replication and provide it so that we can debug the memory leak?
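
As a hedged illustration of what such a standalone replication might look like, the sketch below runs a trivial workflow in a loop against a local Temporal server and logs the worker process's resident memory. The server address, task-queue name, workflow, iteration count, and psutil-based logging are all assumptions for the sketch, not details from the reports above.

```python
# Minimal standalone replication sketch (Temporal Python SDK assumed).
# Assumes a Temporal server is reachable at localhost:7233; names and limits
# are placeholders.
import asyncio
from datetime import timedelta

import psutil  # third-party, used only to log the worker process's RSS
from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def noop_activity(n: int) -> int:
    return n


@workflow.defn
class NoopWorkflow:
    @workflow.run
    async def run(self, n: int) -> int:
        return await workflow.execute_activity(
            noop_activity, n, start_to_close_timeout=timedelta(seconds=10)
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    # Running the worker as an async context manager keeps it polling while the
    # loop below drives workflows through it.
    async with Worker(
        client,
        task_queue="leak-repro",
        workflows=[NoopWorkflow],
        activities=[noop_activity],
        max_cached_workflows=100,
        max_concurrent_activities=50,
    ):
        process = psutil.Process()
        for i in range(10_000):
            await client.execute_workflow(
                NoopWorkflow.run, i, id=f"leak-repro-{i}", task_queue="leak-repro"
            )
            if i % 100 == 0:
                # Log resident memory every 100 workflows.
                print(f"iteration={i} rss_mb={process.memory_info().rss / 1e6:.1f}")


if __name__ == "__main__":
    asyncio.run(main())
```

A steady climb in the logged RSS beyond what max_cached_workflows and max_concurrent_activities can explain would be the thing to capture and attach to a report.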