Unexpected spikes in Postgres DB IOPS

Hi all,

We have been using Temporal in production for a couple of months now. We have the following setup in our production env:

  • We are using AWS RDS for PostgreSQL.
  • We have a temporal cluster with 3000 shards distributed across 5 history instances (2 core 8GB), 3 matching instances (2 core 4 GB), 3 frontend instances (2 core 4GB) and 2 worker instances (0.5 core 1 GB).
  • The Temporal cluster is hosted in k8s with 30d retention for all the namespaces.
  • RDS backup is disabled for the moment.

We are seeing unexpected intermittent spikes in RDS read/write IOPS which do not correlate with the persistence request graph plotted using:

    sum(rate(persistence_requests[5m]))

Our initial instinct was to turn off archival for all the namespaces to see if it would help with the spikes, but even after turning off archival, we still see similar spikes. Has anyone seen this behaviour?


Temporal Server Persistence Request Plot

Can you show the persistence_requests by operation?
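Something along these lines should give the per-operation breakdown (this assumes your setup exposes the `operation` label on `persistence_requests`, as in the latency query below):

    sum(rate(persistence_requests{}[1m])) by (operation)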

Also check persistence latencies:

    histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and see if the sync match rate goes up at the same intervals as your spikes:

    sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m]))

Thanks for your response @tihomir,

Sharing the metrics as requested.

Persistence request per operation

Matching Persistence by Operation

History Persistence by Operation

Persistence Operation Latency is mostly sub-millisecond

The sync match rate does go up in those intervals, but it mostly correlates with the persistence request plot I shared above. It doesn't seem to correlate with the IOPS.

I couldn’t find any correlation with the spike plot yet.

Your sync match rate graph should stay at 0 (it should not include the spikes). When it is > 0, that is typically the result of not having enough workers provisioned: the matching service was not able to deliver tasks to your pollers synchronously and had to write them to persistence, retrieving them again later once pollers became available.

Also check the count of tasks that had to be written to the db because no pollers were available.

Hey, @tihomir apologies for the late response. It’s been holiday season in India :smile:

Thanks for sharing the insight. I wanted to understand how to go about bringing the sync match rate down to 0.
Based on the metrics, the workers’ slots seem to be available.

Only 10-15 slots are occupied at any time and I am running with the below config in prod.

    MaxConcurrentActivityExecutionSize: 100
    WorkerActivitiesPerSecond: 200.0
    MaxConcurrentLocalActivityExecutionSize: 100
    WorkerLocalActivitiesPerSecond: 200.0
    TaskQueueActivitiesPerSecond: 600.0
    MaxConcurrentActivityTaskPollers: 5
    MaxConcurrentWorkflowTaskExecutionSize: 300
    MaxConcurrentWorkflowTaskPollers: 5

We are running the worker pod with a 1 core / 2 GB resource config, and even the average CPU utilisation is only around 4-5%.
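(For context, the slot numbers above come from the SDK worker metrics; assuming the default Prometheus naming, the gauge looks something like this — the metric name and `worker_type` value may differ in your setup:)

    temporal_worker_task_slots_available{worker_type="ActivityWorker"}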

On SDK metrics, check if you have elevated workflow_task_schedule_to_start_latency / activity_schedule_to_start_latency.
If those are elevated but on worker pods you have low resource utilization and don’t have issues with task slots availability try
increasing your poller count from 5 to 8 or 16 and see if that helps.
Increase MaxConcurrentWorkflowTaskPollers or MaxConcurrentActivityTaskPollers depending on which schedule_to_start latency is elevated.
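The schedule-to-start latencies can be graphed along these lines (the metric names here are assumptions based on the Go SDK's default `temporal_` Prometheus prefix; exact names and suffixes depend on your metrics reporter configuration):

    histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[1m])) by (le))
    histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[1m])) by (le))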

Also see this forum thread on poller count that might help.

Thanks, I will increase the pollers and tune the workers as per the above-mentioned threads and let you know in this thread.