History_node keeps growing

I have been playing around with Temporal. I created several workflow executions on a namespace with a 3-day retention period.
Workflow executions seem to be purged from the UI after that period, but history_node keeps growing; I don’t see it being purged. The workflow executions have local and non-local activities and complete in less than 1 minute (no long-running executions or unusual timers).

The activities’ arguments and return values are complex objects (I know this is not recommended and I should change that), but the table size is not stabilizing as expected after 4 days of running with stable traffic.

I tried to build a query over this table to understand whether the purge is happening or not, but I failed.
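
For reference, this is roughly the kind of check I had in mind (a sketch only, assuming a PostgreSQL persistence store and the default Temporal schema, where history events live in the history_node table):

-- Total on-disk size of history_node, including indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('history_node')) AS total_size;

-- Live vs. dead tuples plus last (auto)vacuum times, to see whether rows
-- are being deleted and whether VACUUM is reclaiming the space
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'history_node';

Sampling those numbers over a few days should show whether rows are actually removed after retention or the table only ever grows.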

Is there any metric / log that I can check to see if history_node is being purged correctly?

Thanks in advance!

What persistence do you have configured?
tctl adm cl d | jq .persistenceStore

What is your namespace retention set to?
tctl --ns <namespace_name> n desc

The Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached namespace retention; however, it’s currently only enabled when using Cassandra persistence. Will keep checking for SQL and report back.

We have a SQL datastore configured; namespace retention was 3 days, but I changed it to 1 day yesterday.

The Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached namespace retention; however, it’s currently only enabled when using Cassandra persistence.

So on a SQL datastore, Workflow Execution data remains forever? That’s a huge problem.

Working with server team to clarify this and will get back to you.


The version of Temporal we are using is 1.16.2.

I did a quick search of the Temporal code and found temporal/scavenger.go at master · temporalio/temporal · GitHub, which looks like a cleanup task for history events. I’m not a good Go dev, but the comment here sheds some light: temporal/scavenger.go at master · temporalio/temporal · GitHub

I was able to get more info on this.
Yes, “garbage history” cleanup does not currently happen for SQL like it does for Cassandra. This is something that will be addressed in future releases.
I was also told, however, that the footprint of these “garbage” records that are not removed should be very small (they should contain only the first event of the execution history), so it should not cause significant issues in almost all cases.

Is there any way I can track those records manually?
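
For example, if node_id = 1 corresponds to the first event batch of a branch (that is my assumption, not something I have confirmed), would a rough count of branches that contain nothing beyond that first node look something like this sketch?

-- Rough count of history branches that only contain the first node
-- (assumption: node_id = 1 is the first event batch of a branch)
SELECT COUNT(*) AS single_node_branches
FROM (
    SELECT shard_id, tree_id, branch_id
    FROM history_node
    GROUP BY shard_id, tree_id, branch_id
    HAVING MAX(node_id) = 1
) b;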

I’m interested in any updates to this conversation. We use MySQL for persistence and have a history_node table that’s 1.6 TB and growing steadily.

Thanks for reporting, which server version are you running?

This cluster is on 1.8. The size of the history_node table is one reason we haven’t upgraded past that version.

We made some changes to the workflow to avoid storing data not related to the workflow itself, which reduced the event data by about 25% on average. By doing so, I also got rid of the blob-size warning in the logs.

On the other hand, I drastically reduced the namespace retention period.

Finally, after applying the above changes, we ended up with 50% dead tuples on that table, so we changed some configuration (I don’t remember exactly which right now) around the VACUUM strategy for that table and its associated TOAST table, roughly along the lines of the sketch below.
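
These are per-table autovacuum settings with placeholder values, not the exact configuration we run:

-- Vacuum history_node more aggressively than the global defaults
ALTER TABLE history_node SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.02
);

-- The same knob can be set for the associated TOAST table via the toast.* prefix
ALTER TABLE history_node SET (
    toast.autovacuum_vacuum_scale_factor = 0.02
);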

The table does not seem to be growing so far, VACUUM seems to be triggering normally, and the percentage of dead tuples seems under control, but I am monitoring it closely.

The team opened an issue here for this to be addressed for SQL persistence.

Which version should I upgrade to if I need this scavenger feature? Can I use tctl adm db clean to clean up the DB without upgrading the Temporal version (we are at 1.17.0 with the MySQL engine)?

@tihomir Or is there any manual cleanup SQL I could use to clean up the history_node table? Then I could upgrade the Temporal cluster later.