[Urgent] Upgraded to 1.11.2 and making 11k calls/sec to Postgres

We just upgraded our clients, worker, and workflow to version 1.11.2 in Node.js.

For some reason, it’s barraging our RDS database (Postgres) with a ridiculous 11k+ calls per second while reading from the tasks table. From the looks of it, it’s probably some kind of infinite loop somewhere, but I do not have enough knowledge of Temporal to figure it out.

Offending SQL query:

SELECT task_id, data, data_encoding FROM tasks WHERE range_hash = $1 AND task_queue_id=$2 AND task_id >= $3 AND task_id < $4 ORDER BY task_id LIMIT $5

We’ve added the following to our worker but it did not help:

maxConcurrentActivityTaskExecutions: 20,
maxConcurrentActivityTaskPolls: 10,
maxConcurrentWorkflowTaskPolls: 10,

I can’t attach another image, but our workflows are clean and none are hanging. It’s just polling that seems to be the issue.

Any ideas?

What Temporal service version are you running? Did this start immediately after the worker restart? So it's not preventing your executions from completing?

The query itself is the matching query that reads tasks out of the backlog (tasks are written to the db if they cannot be dispatched to your worker pollers within a short period of time). It seems something has created a large backlog of tasks and it's now being drained.
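
If you want to sanity-check that, a rough backlog count per task queue could look something like this (column names are taken from the query above; the exact tasks schema can differ between server versions):

SELECT task_queue_id, range_hash, COUNT(*) AS backlog,
       MIN(task_id) AS min_task_id, MAX(task_id) AS max_task_id
FROM tasks
GROUP BY task_queue_id, range_hash
ORDER BY backlog DESC;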

If your worker CPU and memory utilization is normal, I think you should try increasing the executor slots and the number of pollers rather than restricting them, to drain the backlog of tasks faster. Or consider adding more worker pods to help.
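
For example, something along these lines on the worker side (the values are purely illustrative, and the task queue name and module paths are placeholders; tune the numbers to your CPU/memory headroom):

import { Worker } from '@temporalio/worker';
import * as activities from './activities'; // placeholder: your activities module

async function run() {
  const worker = await Worker.create({
    taskQueue: 'my-task-queue', // placeholder: your actual task queue
    workflowsPath: require.resolve('./workflows'), // placeholder: your workflows entry point
    activities,
    // More executor slots and pollers help drain a backlog faster,
    // as long as the worker host has spare CPU and memory.
    maxConcurrentActivityTaskExecutions: 100,
    maxConcurrentWorkflowTaskExecutions: 100,
    maxConcurrentActivityTaskPolls: 10,
    maxConcurrentWorkflowTaskPolls: 10,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});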

one more question, do you use worker versioning?

Thanks for the reply @tihomir.

Temporal service version: temporalio/auto-setup:1.25.1 (prev. temporalio/auto-setup:1.20.1)
It started a few hours after the upgrade 2 days ago and hasn’t ceased. CPU and RAM usage in the temporal service grows over time and only resets at restart (kind of like a memory leak).

The screenshot above shows the CPU and memory usage before and after the upgrade (which landed on Sat 12 at ~3am).

We don’t have worker versioning or useVersioning enabled.

Workflow executions are unaffected, but our database performance is severely degraded.

Is there a way to quickly drain the task queue of all tasks? Not seeing it in tctl.

Fwiw, there are only 6 rows in the tasks table. Also, we have a staging environment with the same settings. It’s not happening there.

Sorry, so did this start happening after the server upgrade or the SDK upgrade, or were both performed at the same time? Can you share the steps you took to upgrade the server version and db schemas? Just FYI, using the auto-setup image for a prod env is generally not advised. Were there any other changes, like dynamic config, or did you change numHistoryShards in the static config? What are the diffs between your two envs, if any?

@tihomir There’s no difference between the two envs besides load. Staging is not receiving many tasks; it’s at 1k workflows in total, while production has run 6k workflows in the past 3 days.

We did the SDK upgrade and server upgrade at the same time.

The server version was upgraded solely by changing the auto-setup version to the one I mentioned above.

Auto-setup in prod had worked for us until this happened… What’s the suggested alternative?

No changes to dynamic config or anything else.

Here it is in its entirety:

limit.maxIDLength:
  - value: 255
    constraints: {}

The entire upgrade was done after we ran a backfill and were unable to stop workflows from the UI due to bugs in the UI (the version was quite old).

Before the upgrade, I had manually removed the namespace and cleared all associated tables from Postgres and let auto-setup recreate the default namespace. Something residual must have been left over.

@tihomir any way to escalate this? This is happening in our production environment.

Just looking to terminate whatever process is hitting the following query 11k times a second :confused:

SELECT task_id, data, data_encoding FROM tasks WHERE range_hash = $1 AND task_queue_id=$2 AND task_id >= $3 AND task_id < $4 ORDER BY task_id LIMIT $5

Mitigated by clearing the tasks and task_queues tables, but I’m worried this is going to rear its ugly head again in the future.

I’m not sure exactly what happened during your deploy, as you keep adding more info with each post :slight_smile: That’s fine, I understand it’s a frustrating situation. Why did you clear your db after the upgrade? You should run the upgrade scripts, imo, and follow the sequential upgrade process as described in the docs. Maybe it would help if you could compile the exact steps you performed, and we can try to find time this week to go over it on Zoom. Also note that we cannot provide production-level support in the community forum. For that, please consider using Temporal Cloud.
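
For reference, the sequential schema upgrade is normally done with temporal-sql-tool, applying each version's schema changes in order, roughly like this (connection parameters, plugin name, and schema paths are illustrative and vary by setup and tool version, so please check the upgrade docs for each version you pass through):

temporal-sql-tool --plugin postgres --ep $SQL_HOST -p 5432 -u $SQL_USER --pw $SQL_PASSWORD --db temporal update-schema -d ./schema/postgresql/v96/temporal/versioned

with an equivalent run against the visibility database. The server itself should then be stepped through each minor version (1.21, 1.22, …) rather than jumping straight from 1.20 to 1.25.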

@tihomir Following up here.

The issue still persists even after we clear the tasks and task_queues tables. We even ended up moving the visibility store to a fresh RDS database instance.

We suspect the root cause is the new schedules we added around the same time as the upgrade. The schedules create new tasks to be executed in the future, and the server polls for these tasks more and more times per second. As of now, it’s back up to 3,395.94 polls/sec.

Our schedules were created using your tutorial. We have one that runs every minute, and removing this schedule actually flatlines the growth in calls/sec, so I believe it’s something on your end that’s incessantly creating more and more tasks to be polled.
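
For reference, the schedule is set up roughly like this (simplified; the schedule ID, workflow type, and task queue below are placeholders rather than our real names):

import { Client, Connection } from '@temporalio/client';

async function createSchedule() {
  const connection = await Connection.connect(); // assumes the default server address
  const client = new Client({ connection });

  // Fires every minute, i.e. the * * * * * cron expression.
  await client.schedule.create({
    scheduleId: 'every-minute-schedule', // placeholder
    spec: { cronExpressions: ['* * * * *'] },
    action: {
      type: 'startWorkflow',
      workflowType: 'myScheduledWorkflow', // placeholder
      taskQueue: 'my-task-queue',          // placeholder
    },
  });
}

createSchedule().catch((err) => {
  console.error(err);
  process.exit(1);
});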

Any ideas?

Schedules are implemented as Temporal workflows, which rely on long polling, which is pretty efficient.

@maxim I agree that it’s efficient, but my concern is why the polling is taking up 80% of the Postgres database’s CPU and happening 3k times a second. Have you seen this before? My schedule cron is defined as * * * * * (every minute).

Any updates here, and can your team please look into it? Thanks!

AFAIU, you mentioned that you upgraded 5 minor versions in one go. That’s not something we support; you’re supposed to go through every minor version.

Other than that, it’s a bit hard to understand what happened from the limited information you’ve given. Is the situation still happening? Can you provide a count of open workflows and schedules? Anything apparent in the tasks table?
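
If it helps, something like this should give those counts (this assumes the newer temporal CLI and a visibility store that supports count queries; the equivalent tctl commands differ slightly):

temporal workflow count --query 'ExecutionStatus="Running"'
temporal schedule list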