We are running Temporal in our production environment, but we occasionally experience significant CPU spikes in our PostgreSQL database (specifically, temporal_visibility). Given the high number of DELETE queries, we suspect this is due to the retention cleanup process.
Is there a way to optimize this process or restrict its execution to avoid peak hours? At times, it runs during our busiest periods, which impacts performance.
Do you have server metrics and can share your persistence requests graph during your peak times when you think retention deletion kicked in?
sum by(operation) (rate(persistence_requests[1m]))
you can try increasing dynamic config history.retentionTimerJitterDuration (default 30mins)
to spread out deletion of event histories for workflow executions that completed when their retention period is reached. ty setting it maybe to 3 hours and see if that helps
DELETE FROM executions_visibility WHERE namespace_id = $1 AND run_id = $2
We had to offload our workflows from temporal, as you can see it produced an spike in our database. And looks like this DELETE command is causing the issue.
We recommend elastic search for visibility for high load environments. If ES is not an option you can use a separate database instance just for visibility.