PostgreSQL (AWS RDS) - Scale up

Hi, we have temporal.io running on AWS RDS PostgreSQL 13.7, and we are facing bottlenecks with the history_node table that are driving up our CPU cost drastically.

If possible, could you give us some insight into solving this problem?

Below are some metrics collected during peak load.

SELECTs: 6345/min
INSERTs: 15685/min
DELETEs: 758/min (14589 rows/min)

Observation: our project is going to grow to many times this workload.

Row Octet Size

MIN: 246 (246 B)
AVG: 5613 (5.48 KB)
MAX: 73538 (71.81 KB)
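For reference, one way to measure these per-row payload sizes is a query like the sketch below. It assumes history_node stores its payload in a bytea column named data (as in Temporal's default PostgreSQL schema); adjust if your schema differs.

-- Approximate payload sizes per row; octet_length() reports the uncompressed
-- size, pg_column_size() would report the size as actually stored (compressed/TOASTed).
SELECT min(octet_length(data))        AS min_bytes,
       round(avg(octet_length(data))) AS avg_bytes,
       max(octet_length(data))        AS max_bytes
FROM history_node;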

We noticed that this is generating a lot of TOAST data.
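To confirm how much of history_node actually lives in its TOAST table, a standard catalog query (nothing Temporal-specific) can be used, for example:

-- Heap size vs. TOAST size for history_node.
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'history_node';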

If we isolate the wait events for these INSERTs in AWS Performance Insights, we see the following values.

LWLock:WALWrite: 58%
IO:WALSync: 24%
CPU: 9%
Client:ClientRead: 9%
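Since the waits are dominated by WAL writes and syncs, it may also be worth checking the WAL-related settings on the instance. The query below is just a generic way to list them; all of these are standard PostgreSQL parameters (for example, wal_compression can reduce the volume of full-page writes at the cost of some CPU).

-- Current WAL-related settings on this instance.
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_compression', 'wal_buffers', 'full_page_writes',
               'synchronous_commit', 'commit_delay', 'max_wal_size',
               'checkpoint_timeout');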

Can you help us remove this bottleneck and scale up the Temporal application?


Can you show your persistence requests:

sum by (operation) (rate(persistence_requests[1m]))

and check for “UpdateWorkflowExecution” operation.

Also, can you share the DB instance type, the number of CPUs, and the memory size of your DB?

Hi, @tihomir.
Thank you for your answer.

Now my instance is an AWS RDS m5.2xlarge with 8 vCPU and 32 GB of RAM.

Below is a screenshot showing persistence_requests by operation.

Below is a screenshot filtered to just UpdateWorkflowExecution.

Hello @natan.araujo, I'm seeing something similar with the same RDS PostgreSQL setup. Did you manage to find the bottleneck?

In my case we sometimes have more than 40 KB per row in history_node, so PostgreSQL uses TOAST to store the data, and that is bad in some scenarios.

Past roughly 2000 bytes per tuple, PostgreSQL compresses the oversized values and moves them out to the TOAST table (the relevant storage knobs are sketched after the list of cons below).

Cons:
Every time a TOAST page is touched, PostgreSQL may have to write the full page to the WAL (a full-page image after each checkpoint), even if the change concerns a chunk of an unrelated tuple.
As I understand it, if a page holds 2 chunks of id1 and 2 chunks of id2 and you update only id1, PostgreSQL writes the id2 chunks to the WAL as well, which generates a lot of IO:WALWrite and LWLock:WALWrite.
The data can't be read without being decompressed, not even partially (at least not on PG 13, which is what I'm on).
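The per-column TOAST behaviour can be tuned. The statements below are only a sketch of the available knobs, assuming the large column is the bytea data column; they are not a recommendation for changing Temporal's schema, and any change like this should be tested on a copy first.

-- Raise the size at which out-of-line TOAST storage kicks in
-- (default is ~2040 bytes; table-level option since PostgreSQL 11).
ALTER TABLE history_node SET (toast_tuple_target = 4080);

-- Store the column out of line but uncompressed (STORAGE EXTERNAL),
-- which avoids pglz compression/decompression on reads and writes.
ALTER TABLE history_node ALTER COLUMN data SET STORAGE EXTERNAL;

Note that SET STORAGE only affects newly written rows, and EXTERNAL trades larger on-disk and WAL volume for less CPU, so it can make the WAL pressure worse rather than better.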

Hey @natan.araujo, did you find any solution to this?

I see this issue as well. The pattern I notice between problematic and non-problematic clusters in our setup is that in the problematic clusters the pg_statistic and pg_toast tables have too many dead tuples; this doesn't happen in the other clusters.
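A quick way to see which TOAST tables are carrying dead tuples is the standard statistics view; the query below is generic and not specific to Temporal.

-- Dead vs. live tuples for TOAST tables, worst offenders first.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'pg_toast'
ORDER BY n_dead_tup DESC
LIMIT 10;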