I am using Temporal to run schedules that sync data from an outside data source. It worked well for several months until a few hours ago, when I found the service was unavailable and the PostgreSQL container was down.
Checking the PostgreSQL container's logs, I saw it had run out of disk space. I ran docker system prune -af to free some space and the service recovered within minutes, but that is not a final solution.
I have 60 GB of space on the VM and I am sure my data should not eat all of that storage, so I did some research: I queried the storage consumption per database and per table and found that the temporal database uses 38 GB, most of it in the activity_info_maps table.
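For reference, a size check along these lines shows where the space goes (standard PostgreSQL catalog functions; the database name temporal is just what my setup uses):

-- Total size of the temporal database
SELECT pg_size_pretty(pg_database_size('temporal'));
-- Per-table sizes, largest first (includes indexes and TOAST data)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;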
I have retention set to 1 day for the namespace, and there are only around 100 workflow records in the UI, yet I can find records in activity_info_maps from scheduler runs six months ago.
So my question is: how can I manually clean up activity_info_maps as a short-term solution, and how should I configure the server so that outdated records are cleaned up automatically going forward?
Hello, it looks like we may somehow be missing the deletion of activity info for some workflow executions. This table contains information about activities in currently running workflows; entries in it are no longer needed once a workflow execution is closed.
However, I could not reproduce this issue. I set up a simple scheduled workflow (with activities) running every minute and did not notice any records left behind after a scheduled run completed.
Could you please provide the additional information listed below?
The Temporal Server version & Postgres version you are running.
The schedule (i.e., how frequently the workflow runs)
How many activities does it start in each run?
Any relevant errors you notice in server logs?
Regarding manual cleanup: you may be able to remove data that does not belong to any currently running workflow. Could you please share the output of the following two queries?
SELECT count(DISTINCT workflow_id) FROM activity_info_maps;
SELECT count(DISTINCT workflow_id) from current_executions WHERE workflow_id LIKE 'schedule_%';
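To sketch what such a cleanup could eventually look like (nothing to run yet: the column names below are assumed from the default Temporal Postgres schema, and you should take a backup and verify the counts above first), removing the rows that no longer map to a current execution would be something along these lines:

-- Sketch only: delete activity rows whose (shard, namespace, workflow, run)
-- is not the current execution of any workflow. Back up the database before
-- running anything destructive.
DELETE FROM activity_info_maps a
WHERE NOT EXISTS (
    SELECT 1
    FROM current_executions c
    WHERE c.shard_id = a.shard_id
      AND c.namespace_id = a.namespace_id
      AND c.workflow_id = a.workflow_id
      AND c.run_id = a.run_id
);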
The schedule (i.e., how frequently the workflow runs)
I have 2 schedules running: one runs every 30s (although each run usually takes more than 30s, so its actual interval may be longer) and the other runs every minute.
How many activities does it start in each run?
The schedule that runs every 30s starts hundreds of activities (maybe around 700 in this case?), each syncing data from an outside data source. The other one starts only a single activity.
Any relevant errors you notice in server logs
I found some error messages in the PostgreSQL container's logs after I released some space, but there were no more errors after a while:
2024-10-22 02:21:43.842 UTC [380] STATEMENT: DELETE FROM history_node WHERE shard_id = $1 AND tree_id = $2 AND branch_id = $3 AND node_id >= $4
2024-10-22 02:21:44.162 UTC [345] ERROR: canceling statement due to user request
Before that, the logs show:
2024-10-22 02:21:08.858 UTC [27] LOG: database system was not properly shut down; automatic recovery in progress
2024-10-22 02:21:08.874 UTC [27] LOG: redo starts at 534/F7BC40
2024-10-22 02:21:09.411 UTC [28] FATAL: the database system is starting up
and
2024-10-22 02:20:08.278 UTC [1] FATAL: could not write lock file "postmaster.pid": No space left on device
PostgreSQL Database directory appears to contain a database; Skipping initialization
Hello @Sczlog, I was able to reproduce the issue in my test setup. What I noticed is that we do not clean up the activity info map when activities are abandoned by the workflow. I was also not able to reproduce the issue with a Temporal internal workflow. So at this point I think the leak is caused by one of your workflows not waiting for all of its activities to complete.
To verify this, could you please confirm the following?
Check and make sure all of your workflows wait until all of their activities have completed.
Please report the total number of rows in activity_info_maps: SELECT count(1) FROM activity_info_maps;
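In addition to the total count, a direct count of the rows that no longer map to a current execution would also help (column names assumed from the default Postgres schema):

-- Rows in activity_info_maps with no matching current execution
SELECT count(1) AS orphaned_activity_rows
FROM activity_info_maps a
WHERE NOT EXISTS (
    SELECT 1
    FROM current_executions c
    WHERE c.shard_id = a.shard_id
      AND c.namespace_id = a.namespace_id
      AND c.workflow_id = a.workflow_id
      AND c.run_id = a.run_id
);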
Interesting. I’m curious to know what some of the top workflow IDs in that table are. Would it be possible to provide the results of the following queries?
SELECT count(1) from activity_info_maps;
SELECT DISTINCT workflow_id, count(1) as activity_count FROM activity_info_maps GROUP BY workflow_id ORDER BY activity_count DESC LIMIT 100;
SELECT DISTINCT workflow_id, count(1) as activity_count FROM activity_info_maps WHERE workflow_id LIKE 'temporal-sys%' GROUP BY workflow_id ORDER BY activity_count DESC LIMIT 100;
I’m mainly looking for any workflows whose IDs start with the temporal-sys- prefix.