Extremely high rate of database calls

We have set up Temporal with a MySQL database. Currently we have around 2000 workflows in the open state, and we are observing an extremely high rate of SQL calls to the database, in requests per minute (rpm):

1
INSERT INTO buffered_events(shard_id, namespace_id, workflow_id, run_id, data, data_encoding)
VALUES (?, ?, ?, ?, ?, ?)
63292 rpm

2
SELECT
shard_id, namespace_id, workflow_id, run_id, create_request_id, state, status, start_version, last_write_version
FROM current_executions WHERE shard_id = ? AND namespace_id = ? AND workflow_id = ?
72181 rpm

3
SELECT next_event_id FROM executions
WHERE shard_id = ? AND namespace_id = ? AND workflow_id = ? AND run_id = ? FOR UPDATE
77756 rpm

4
UPDATE executions SET
next_event_id = ?, last_write_version = ?, data = ?, data_encoding = ?, state = ?, state_encoding = ?
WHERE shard_id = ? AND namespace_id = ? AND workflow_id = ? AND run_id = ?
77756 rpm

5
SELECT
shard_id, namespace_id, workflow_id, run_id, create_request_id, state, status, start_version, last_write_version
FROM current_executions WHERE shard_id = ? AND namespace_id = ? AND workflow_id = ? FOR UPDATE
77757 rpm

6
UPDATE current_executions SET
run_id = ?,
create_request_id = ?,
state = ?,
status = ?,
start_version = ?,
last_write_version = ?
WHERE
shard_id = ? AND
namespace_id = ? AND
workflow_id = ?
77788 rpm

7
SELECT range_id FROM shards WHERE shard_id = ? LOCK IN SHARE MODE
77824 rpm

What tuning should we look into?


What is the rate at which your application is doing state transitions? Looks like you are doing updates at a very high rate. Can you describe what you are doing within your workflows?

We have 4 workflows that are causing all the update queries.

The workflow is going into an infinite loop of signaling, and even though we have terminated the workflow, it gets started again.

Can you look at the WorkflowExecutionStarted event to see who is starting the workflow execution? It has an identity field on it.
Signals are coming from outside the workflow implementation, so you need to chase who is sending all those signals.
But this would explain the high database load.

Signals also have an identity field identifying the process sending them.
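
If it helps, here is a minimal sketch using the Go SDK (untested, with a hypothetical workflow ID) that walks the receiver workflow's history and prints the identity recorded on the WorkflowExecutionStarted and WorkflowExecutionSignaled events:

package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Assumes a locally reachable frontend (default localhost:7233).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client:", err)
	}
	defer c.Close()

	// "receiver-workflow-id" is a placeholder; an empty run ID targets the latest run.
	iter := c.GetWorkflowHistory(context.Background(), "receiver-workflow-id", "",
		false, enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
	for iter.HasNext() {
		event, err := iter.Next()
		if err != nil {
			log.Fatalln(err)
		}
		// Who started this run.
		if a := event.GetWorkflowExecutionStartedEventAttributes(); a != nil {
			fmt.Println("started by identity:", a.GetIdentity())
		}
		// Who sent each signal.
		if a := event.GetWorkflowExecutionSignaledEventAttributes(); a != nil {
			fmt.Println("signal", a.GetSignalName(), "sent by identity:", a.GetIdentity())
		}
	}
}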

We terminated the sender workflow, which was generating the events. However, the sender workflow generated signals in the millions using the SignalWithStartWorkflow API, and this is causing the receiver workflow to start again (even if we terminate it manually). Is there a way to clear the pending signals to the receiver workflow? The source workflow is not starting any more on its own.

This is the receiver workflow history event count in the current execution

[Continued]

We terminated the receiver workflow manually many times.

Since you are using the SignalWithStartWorkflow API, it will start a new workflow execution if one is not running. So terminating the receiver workflow execution will not have any effect, as the next call to SignalWithStart will spin up a new workflow execution. You need to chase down the source which is sending all those signals and clear it. Can you provide more information about the source that is generating those signals?
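
For reference, this is roughly the shape of a SignalWithStart call in the Go SDK (all names below are hypothetical). Each call either delivers the signal to the open run or starts a fresh run first, which is why terminating the receiver only helps until the next call arrives:

package sender

import (
	"context"

	"go.temporal.io/sdk/client"
)

// signalOrStartReceiver shows what the sender side effectively does on every event.
// If "receiver-workflow-id" has no open run, a new run is started before the
// signal is delivered, so a manual terminate is undone by the very next call.
func signalOrStartReceiver(ctx context.Context, c client.Client, payload string) error {
	_, err := c.SignalWithStartWorkflow(
		ctx,
		"receiver-workflow-id", // hypothetical workflow ID
		"receiver-signal",      // hypothetical signal name
		payload,                // signal argument
		client.StartWorkflowOptions{TaskQueue: "receiver-task-queue"},
		"ReceiverWorkflow", // workflow type to start if no run is open
	)
	return err
}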

The source has a bug that was generating extra events; we found the issue and fixed it. But the existing workflows running on the system are still making a lot of DB calls, causing SLA degradation in the Temporal service (it is delaying the execution of other workflows).
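
With the source bug fixed, one possible cleanup step (a minimal Go SDK sketch, with a hypothetical workflow ID and reason) is to terminate the current receiver run once more; since no further SignalWithStart calls should arrive, it should stay terminated:

package example

import (
	"context"

	"go.temporal.io/sdk/client"
)

// terminateReceiver terminates the currently open run of the receiver workflow.
// An empty run ID targets the latest run.
func terminateReceiver(ctx context.Context, c client.Client) error {
	return c.TerminateWorkflow(ctx, "receiver-workflow-id", "",
		"terminating runaway signal loop after fixing the sender")
}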