[Urgent] Upgraded to 1.11.2 and making 11k calls/sec to Postgres

We just upgraded our clients, worker, and workflow to version 1.11.2 in Node.js.

For some reason, it’s barraging our RDS database (Postgres) with a ridiculous 11k+ calls per second while reading from the tasks table. From the looks of it, it’s probably some kind of infinite loop somewhere, but I do not have enough knowledge of Temporal to figure it out.

Offending SQL query:

SELECT task_id, data, data_encoding FROM tasks WHERE range_hash = $1 AND task_queue_id=$2 AND task_id >= $3 AND task_id < $4 ORDER BY task_id LIMIT $5

We’ve added the following to our worker but it did not help:

maxConcurrentActivityTaskExecutions: 20,
maxConcurrentActivityTaskPolls: 10,
maxConcurrentWorkflowTaskPolls: 10,

I can’t attach another image, but our workflows are clean and none are hanging. It’s just polling that seems to be the issue.

Any ideas?

What Temporal service version are you running? Did this start immediately after the worker restart? So it's not preventing your executions from completing?

The query itself is the matching query that reads tasks out of the backlog (tasks are written to the db if they cannot be dispatched to your worker pollers within a short period of time). It seems something has created a large backlog of tasks and it's now being drained.
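
If you want to sanity-check that, a rough backlog count per task queue could look something like this (column names are taken from the query above; the exact tasks schema can differ between server versions):

SELECT task_queue_id, range_hash, COUNT(*) AS backlog,
       MIN(task_id) AS min_task_id, MAX(task_id) AS max_task_id
FROM tasks
GROUP BY task_queue_id, range_hash
ORDER BY backlog DESC;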

If your worker CPU and memory utilization is normal, I think you should try increasing the executor slots and the number of pollers rather than restricting them, to drain the backlog of tasks faster. Or consider adding more worker pods to help.
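
For example, something along these lines on the worker side (the values are purely illustrative, and the task queue name and module paths are placeholders; tune the numbers to your CPU/memory headroom):

import { Worker } from '@temporalio/worker';
import * as activities from './activities'; // placeholder: your activities module

async function run() {
  const worker = await Worker.create({
    taskQueue: 'my-task-queue', // placeholder: your actual task queue
    workflowsPath: require.resolve('./workflows'), // placeholder: your workflows entry point
    activities,
    // More executor slots and pollers help drain a backlog faster,
    // as long as the worker host has spare CPU and memory.
    maxConcurrentActivityTaskExecutions: 100,
    maxConcurrentWorkflowTaskExecutions: 100,
    maxConcurrentActivityTaskPolls: 10,
    maxConcurrentWorkflowTaskPolls: 10,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});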

one more question, do you use worker versioning?

Thanks for the reply @tihomir.

Temporal service version: temporalio/auto-setup:1.25.1 (prev. temporalio/auto-setup:1.20.1)
It started a few hours after the upgrade 2 days ago and hasn’t ceased. CPU and RAM usage in the temporal service grows over time and only resets at restart (kind of like a memory leak).

The screenshot above shows the CPU and memory usage before and after the upgrade (which landed on Sat 12 at ~3am).

We don’t have worker versioning or useVersioning enabled.

Workflow executions are unaffected, but our database performance is severely degraded.

Is there a way to quickly drain the task queue of all tasks? Not seeing it in tctl.

Fwiw, there are only 6 rows in the tasks table. Also, we have a staging environment with the same settings. It’s not happening there.

Sorry, so did this start happening after the server upgrade or the SDK upgrade, or were both performed at the same time? Can you share the steps you took to upgrade the server version and db schemas? Just FYI, using the auto-setup image for a prod env is generally not advised. Were there any other changes, like dynamic config, or did you change numHistoryShards in the static config? What are the diffs between your two envs, if any?

@tihomir There’s no difference between the two envs besides load. Staging is not receiving many tasks; it’s at 1k workflows in total, while production has run 6k workflows in the past 3 days.

We did the SDK upgrade and server upgrade at the same time.

The server version was upgraded solely by changing the auto-setup version to the one I mentioned above.

Auto-setup in prod had worked for us until this happened… What’s the suggested alternative?

No changes to dynamic config or anything else.

Here it is in its entirety:

limit.maxIDLength:
  - value: 255
    constraints: {}

The entire upgrade was done after we ran a backfill and were unable to stop workflows from the UI due to bugs in the UI (the version was quite old).

Before the upgrade, I had manually removed the namespace and cleared all associated tables from Postgres and let auto-setup recreate the default namespace. Something residual must have been left over.

@tihomir any way to escalate this? This is happening in our production environment.

Just looking to terminate whatever process is hitting the following query 11k times a second :confused:

SELECT task_id, data, data_encoding FROM tasks WHERE range_hash = $1 AND task_queue_id=$2 AND task_id >= $3 AND task_id < $4 ORDER BY task_id LIMIT $5

Mitigated by clearing the tasks and task_queues tables, but I’m worried this is going to rear its ugly head again in the future.

I’m not sure exactly what happened during your deploy, as you keep adding more info with each post :slight_smile: That’s fine, I understand it’s a frustrating situation. Why did you clear your db after the upgrade? You should run the upgrade scripts, imo, and follow the sequential upgrade process as described in the docs. Maybe it would help if you could compile the exact steps you performed, and we can try to find time this week to go over it on Zoom. Also note that we cannot provide production-level support in the community forum. For that, please consider using Temporal Cloud.
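
For reference, the sequential schema upgrade is normally done with temporal-sql-tool, applying each version's schema changes in order, roughly like this (connection parameters, plugin name, and schema paths are illustrative and vary by setup and tool version, so please check the upgrade docs for each version you pass through):

temporal-sql-tool --plugin postgres --ep $SQL_HOST -p 5432 -u $SQL_USER --pw $SQL_PASSWORD --db temporal update-schema -d ./schema/postgresql/v96/temporal/versioned

with an equivalent run against the visibility database. The server itself should then be stepped through each minor version (1.21, 1.22, …) rather than jumping straight from 1.20 to 1.25.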

@tihomir Following up here.

The issue still persists even after we clear the tasks and task_queues tables. We even ended up moving the visibility store to a fresh RDS database instance.

We suspect the root cause is the new schedules we added around the same time as the upgrade. The schedules create new tasks to be executed in the future, and the server polls for these tasks more and more times per second. As of now, it’s back up to 3,395.94 polls/sec.

Our schedules were created using your tutorial. We have one that runs every minute, and removing this schedule actually flatlines the growth in calls/sec, so I believe it’s something on your end that’s incessantly creating more and more tasks to be polled.
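
For reference, the schedule is set up roughly like this (simplified; the schedule ID, workflow type, and task queue below are placeholders rather than our real names):

import { Client, Connection } from '@temporalio/client';

async function createSchedule() {
  const connection = await Connection.connect(); // assumes the default server address
  const client = new Client({ connection });

  // Fires every minute, i.e. the * * * * * cron expression.
  await client.schedule.create({
    scheduleId: 'every-minute-schedule', // placeholder
    spec: { cronExpressions: ['* * * * *'] },
    action: {
      type: 'startWorkflow',
      workflowType: 'myScheduledWorkflow', // placeholder
      taskQueue: 'my-task-queue',          // placeholder
    },
  });
}

createSchedule().catch((err) => {
  console.error(err);
  process.exit(1);
});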

Any ideas?

Schedules are implemented as Temporal workflows, which rely on long polling, which is pretty efficient.

@maxim I agree that it’s efficient, but my concern is why the polling is taking up 80% of the Postgres database’s CPU and happening 3k times a second. Have you seen this before? My schedule cron is defined as * * * * * (every minute).

Any updates here, and can your team please look into it? Thanks!

AFAIU, you mentioned that you upgraded 5 minor versions in one go. That’s not something we support; you’re supposed to go through every minor version.

Other than that, it’s a bit hard to understand what happened from the limited information you’ve given. Is the situation still happening? Can you provide a count of open workflows and schedules? Anything apparent in the tasks table?
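
If it helps, something like this should give those counts (this assumes the newer temporal CLI and a visibility store that supports count queries; the equivalent tctl commands differ slightly):

temporal workflow count --query 'ExecutionStatus="Running"'
temporal schedule list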