I have been playing around with Temporal. I created several workflow executions on a namespace with a 3-day retention period.
Workflow executions seem to be purged from the UI after that period, but history_node keeps growing; I don’t see it being purged. The workflow executions have local and non-local activities and complete in less than 1 minute (no long executions or strange timers).
Activities’ arguments and return values are complex objects (I know this is not recommended and I should change that), but the table size is not stabilizing as expected after 4 days of running under steady traffic.
I tried to build a query over this table to understand whether the purge is happening, but I failed.
Is there any metric / log that I can check to see if history_node is being purged correctly?
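For context, the closest I got was periodically snapshotting row count and table size, since history_node has no timestamp column to filter on. A rough sketch, assuming PostgreSQL and the default Temporal schema table name:

```sql
-- Snapshot history_node periodically and compare: if retention-based cleanup
-- is running, these numbers should flatten out under steady traffic.
SELECT count(*)                                               AS node_rows,
       count(DISTINCT tree_id)                                AS history_trees,
       pg_size_pretty(pg_total_relation_size('history_node')) AS total_size
FROM history_node;
```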
What persistence store do you have configured? `tctl adm cl d | jq .persistenceStore`
What is your namespace retention set to? `tctl --ns <namespace_name> n desc`
The Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached the namespace retention period; however, it is currently only enabled when using Cassandra persistence. I will keep checking for SQL and report back.
A SQL datastore is configured; the namespace retention was 3 days, but I changed it to 1 day yesterday.
> Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached namespace retention, however currently it’s only enabled when using Cassandra persistence.
So on a SQL datastore, workflow execution data remains forever? That’s a huge problem.
I was able to get more info on this.
Yes, “garbage history” cleanup currently does not happen for SQL the way it does for Cassandra. This is something that will be addressed in future releases.
I was told, however, that the footprint of these “garbage” records that are not removed should be very small (they should contain only the first event of the execution history), so they should not cause significant issues in almost all cases.
We made some changes to the workflow to avoid storing data not related to the workflow itself, which reduced the event data by 25% on average. By doing so, I also got rid of the blob-size warning log.
I also drastically reduced the namespace’s retention period.
Finally, after applying the changes above, we ended up with about 50% dead tuples on that table, so we changed some configuration (I don’t remember exactly which settings right now) for the VACUUM strategy on that table and its associated TOAST table.
The table does not seem to be growing so far, VACUUM seems to be triggering normally, and the dead-tuple percentage seems under control; but I am monitoring closely.
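For anyone hitting the same issue, the checks and per-table settings involved look roughly like this (PostgreSQL; the exact thresholds below are illustrative, not necessarily the ones we used):

```sql
-- Check the dead-tuple ratio and last (auto)vacuum times for history_node.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'history_node';

-- Make autovacuum more aggressive for this table and its TOAST table only
-- (example values, tune for your workload).
ALTER TABLE history_node SET (
  autovacuum_vacuum_scale_factor       = 0.05,
  toast.autovacuum_vacuum_scale_factor = 0.05
);
```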
Which version should I upgrade to if I need this scavenger feature? Can I use `tctl adm db clean` to clean up the DB without upgrading the Temporal version? (We are at 1.17.0 with the MySQL engine.)
@tihomir Or is there any manual cleanup SQL I can use to clean up the history_node table? Then I could upgrade the Temporal cluster later.
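Not an official answer, but one assumption-heavy sketch is to identify history branches whose execution row is already gone, relying on the default schema convention that history_tree.tree_id matches the original run’s run_id. This is not supported tooling: it can false-positive for reset workflows (which share a tree across runs), so back up first and validate the candidate rows before deleting anything:

```sql
-- Find candidate orphaned history branches: history_tree rows with no matching
-- executions row. ASSUMES tree_id equals the original run's run_id (default
-- schema); reset workflows share a tree and can be flagged incorrectly.
SELECT ht.shard_id, ht.tree_id, ht.branch_id
FROM history_tree ht
LEFT JOIN executions e
       ON e.shard_id = ht.shard_id
      AND e.run_id   = ht.tree_id
WHERE e.run_id IS NULL;
```

The corresponding history_node rows share the same (shard_id, tree_id, branch_id) key.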