Unexpected spikes in Postgres DB IOPS

Hi all,

We have been using Temporal in production for a couple of months now. We have the following setup in our production env:

  • We are using AWS RDS for PostgreSQL.
  • We have a temporal cluster with 3000 shards distributed across 5 history instances (2 core 8GB), 3 matching instances (2 core 4 GB), 3 frontend instances (2 core 4GB) and 2 worker instances (0.5 core 1 GB).
  • The Temporal cluster is hosted in k8s with 30d retention for all the namespaces.
  • RDS backup is disabled for the moment.

We are seeing unexpected intermittent spikes in RDS read/write IOPS which do not correlate with the persistence request graph plotted using:

    sum(rate(persistence_requests[5m]))

Our initial instinct was to turn off archival for all the namespaces to see if it would help with the spikes, but even after turning off archival, we still see similar spikes. Has anyone seen this behaviour?


Temporal Server Persistence Request Plot

Can you show the persistence_requests by operation?
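Something along these lines should give the per-operation breakdown (this assumes your setup exposes the `operation` label on `persistence_requests`, as in the latency query below):

    sum(rate(persistence_requests{}[1m])) by (operation)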

Also check persistence latencies:

    histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and see if the sync match rate goes up at the same intervals as your spikes:

    sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m]))

Thanks for your response @tihomir,

Sharing the metrics as requested.

Persistence request per operation

Matching Persistence by Operation

History Persistence by Operation

Persistence Operation Latency is mostly sub-millisecond

The sync match rate does go up in those intervals, but it mostly correlates with the persistence request plot I shared above. It doesn't seem to correlate with the IOPS.

I couldn’t find any correlation with the spike plot yet.

Your sync match rate graph should stay at 0 (it should not include the spikes). When it is > 0, that is typically the result of not having enough workers provisioned: the matching service was not able to deliver tasks to your pollers synchronously and had to write them to persistence, retrieving them again later once pollers became available.

Also check the count of tasks that had to be written to the db because no pollers were available.

Hey, @tihomir apologies for the late response. It’s been holiday season in India :smile:

Thanks for sharing the insight. I wanted to understand how to go about bringing the sync match rate down to 0.
Based on the metrics, the workers’ slots seem to be available.

Only 10-15 slots are occupied at any time and I am running with the below config in prod.

    MaxConcurrentActivityExecutionSize: 100
    WorkerActivitiesPerSecond: 200.0
    MaxConcurrentLocalActivityExecutionSize: 100
    WorkerLocalActivitiesPerSecond: 200.0
    TaskQueueActivitiesPerSecond: 600.0
    MaxConcurrentActivityTaskPollers: 5
    MaxConcurrentWorkflowTaskExecutionSize: 300
    MaxConcurrentWorkflowTaskPollers: 5

We are running the worker pod with a 1 core / 2 GB resource config, and even the average CPU utilisation is only around 4-5%.
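(For context, the slot numbers above come from the SDK worker metrics; assuming the default Prometheus naming, the gauge looks something like this — the metric name and `worker_type` value may differ in your setup:)

    temporal_worker_task_slots_available{worker_type="ActivityWorker"}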

On SDK metrics, check if you have elevated workflow_task_schedule_to_start_latency / activity_schedule_to_start_latency.
If those are elevated but on worker pods you have low resource utilization and don’t have issues with task slots availability try
increasing your poller count from 5 to 8 or 16 and see if that helps.
Increase MaxConcurrentWorkflowTaskPollers or MaxConcurrentActivityTaskPollers depending on which schedule_to_start latency is elevated.
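The schedule-to-start latencies can be graphed along these lines (the metric names here are assumptions based on the Go SDK's default `temporal_` Prometheus prefix; exact names and suffixes depend on your metrics reporter configuration):

    histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[1m])) by (le))
    histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[1m])) by (le))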

Also see this forum thread on poller count that might help.

Thanks, I will increase the pollers and tune the workers as per the above-mentioned threads and let you know in this thread.