We have been using Temporal in production for a couple of months now. Our production environment is set up as follows:
We are using AWS RDS for PostgreSQL.
We run a Temporal cluster with 3000 shards distributed across 5 history instances (2 cores, 8 GB), 3 matching instances (2 cores, 4 GB), 3 frontend instances (2 cores, 4 GB), and 2 worker instances (0.5 core, 1 GB).
The Temporal cluster is hosted in Kubernetes with 30-day retention for all namespaces.
RDS backup is disabled for the moment.
We are seeing the following issue:
There are unexpected intermittent spikes in RDS read/write IOPS that do not correlate with the persistence request graph plotted using `sum(rate(persistence_requests[5m]))`.
Our initial instinct was to turn off archival for all namespaces to see if that helped with the spikes, but even after turning archival off we still see similar spikes. Has anyone seen this behaviour?
The sync match rate does spike in that interval, but it mostly correlates with the persistence request plot I posted above; it does not seem to correlate with the IOPS.
Your sync match rate graph should stay at 0 (it should not include the spikes). A value > 0 is typically the result of not having enough workers provisioned: the matching service was not able to deliver tasks to your pollers synchronously, so it had to write them to persistence and retrieve them again later once pollers became available.
Also check `sum(rate(persistence_requests{operation="CreateTask"}[1m]))`, which gives you the number of tasks that had to be written to the database because no pollers were available.
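For reference, a couple of related queries, assuming the standard Temporal server metrics (`poll_success_sync` / `poll_success` for sync match rate; metric names can vary by server version):

```promql
# Fraction of tasks matched synchronously (ideally close to 1);
# the complement had to be written to persistence first.
sum(rate(poll_success_sync[1m])) / sum(rate(poll_success[1m]))

# Persistence request rate broken down by operation, to see which
# operations (e.g. CreateTask) line up with the IOPS spikes.
sum by (operation) (rate(persistence_requests[1m]))
```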
Hey @tihomir, apologies for the late response. It's been the holiday season in India.
Thanks for sharing the insight. I wanted to understand how to bring the sync match rate down to 0.
Based on the metrics, the workers’ slots seem to be available.
On the SDK metrics, check whether you have elevated workflow_task_schedule_to_start_latency / activity_schedule_to_start_latency.
If those are elevated but your worker pods show low resource utilization and no issues with task slot availability, try increasing your poller count from the default of 5 to 8 or 16 and see if that helps.
Increase WorkflowTaskPollers or ActivityTaskPollers depending on which schedule_to_start latency is elevated.
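In the Go SDK, the poller counts are set via `worker.Options`. A minimal sketch (the task queue name and frontend address are placeholders; workflow/activity registration omitted):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal frontend (address is a placeholder).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise the poller counts from the default of 5 when the
	// corresponding schedule_to_start latency is elevated while
	// worker CPU and task slots are idle.
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers: 16, // for workflow_task_schedule_to_start_latency
		MaxConcurrentActivityTaskPollers: 16, // for activity_schedule_to_start_latency
	})

	// Register workflows and activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

Raise the counts gradually; more pollers only help when the bottleneck is task delivery, not task execution.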
Also see this forum thread on poller count that might help.