Temporal in idle state generating huge read/write traffic to Cassandra

Temporal, while idle, is generating a huge amount of read/write traffic to Cassandra. I'm not sure exactly what triggers it, but it showed up after running for a couple of days at 20K to 24K/sec with local quorum. About 3% of requests failed during that run.
Even after incoming requests are stopped, the traffic continues at 10K reads/sec, almost an hour after the stop. When I start the traffic again, the total traffic on the DB goes back to 34K/sec.
Only the startToClose timeout is set, to 1 second. Retry options are set with max attempts at 3.

There are 2 cases:

  • timeout tasks: these are loaded from the DB and processed even if the workflow / activity is already gone
  • background task maintenance, e.g. periodically scanning for new tasks in case a write to the DB timed out

How can I reduce the timeout tasks? The application does not need a task to be retried if it did not finish successfully within the first 5 seconds. So if I set scheduleToClose and scheduleToStart to, say, 5 seconds, will this 10K/sec case be mitigated?
We also want to make sure the DB can handle any such background traffic. Will it always be 10K/sec, and how can we throttle it?

By not setting a timeout. A workflow has no timeout if its timeout is set to 0.
An activity must have at least one timeout, e.g. start-to-close.

It depends on your specific workflow logic. A timer (sleep) or timeout will always be persisted to the DB and loaded for processing.

Yes, I have the startToClose timeout set to 1 second. My logic is to just try to perform the local* activity within 1 second and otherwise discard it, since the calling client system, which comes in through an API, takes the error and goes away. There is no reason for my application to retry.
I am not sure which timer you are referring to here, but I have the retry interval at 1 second and total retry attempts at max 3.
How can I avoid these heavy local quorum and serial reads and writes? They are choking the Cassandra cluster.
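
For reference, the settings described above put a rough upper bound on how long one activity can stay pending. This is a minimal sketch, assuming every attempt runs to the full start-to-close timeout and that the retry interval grows by a backoff coefficient (2.0 is Temporal's documented default; the thread does not say which value is in use). The function and its parameters are hypothetical helpers, not part of any Temporal SDK.

```python
def worst_case_seconds(start_to_close, max_attempts, initial_interval,
                       backoff=2.0, max_interval=None):
    """Upper bound on wall-clock time spent on one activity, in seconds.

    Assumes each attempt times out at start_to_close and ignores
    scheduling and queueing latency.
    """
    total = start_to_close * max_attempts  # time spent inside attempts
    interval = initial_interval
    for _ in range(max_attempts - 1):      # waits between attempts
        wait = interval if max_interval is None else min(interval, max_interval)
        total += wait
        interval *= backoff
    return total


# Settings from this thread: 1 s start-to-close, 1 s retry interval, 3 attempts.
print(worst_case_seconds(1, 3, 1))               # with backoff coefficient 2.0
print(worst_case_seconds(1, 3, 1, backoff=1.0))  # with a constant retry interval
```

If nothing should be retried at all, the same arithmetic with max attempts set to 1 reduces to the start-to-close timeout alone.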

I have attached a table-level screenshot (right panel). It shows 20K because it is measured at table level rather than coordinator level, so the number doubles; the actual rate is 10K/sec of local quorum reads, primarily on the executions and history_tree tables.
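
The doubling is consistent with a quorum read touching two replicas. A minimal sketch of the conversion, assuming a replication factor of 3 (not stated in this thread) and ignoring read repair and speculative retries; the helper names are hypothetical:

```python
def quorum(replication_factor):
    """Replicas contacted by a quorum read: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def coordinator_read_rate(table_level_rate, replication_factor=3):
    """Estimate coordinator-level reads/sec from a table-level counter."""
    return table_level_rate / quorum(replication_factor)


# 20K reads/sec at table level, RF = 3 (assumed): 2 replicas per read.
print(coordinator_read_rate(20_000))
```

The same conversion does not apply to writes, which are sent to every replica regardless of the consistency level.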

Here is the write throughput, also at table level rather than coordinator level. The activity is mostly on the executions, history_node and history_tree tables.

Lots of reads from the history_tree table does not seem right, unless you have already run a load test and the finished workflows are being deleted.

Lots of writes to the history_tree table seems to suggest you are still trying to start lots of new workflows.

Any mutation of a workflow (create & update) is translated to an LWT (lightweight transaction) in Cassandra.
Delivering new workflow events will cause lots of reads.

Can you also provide these 2 metrics?

  • poll_success
  • poll_success_sync

poll_success drops to 0 while the traffic is stopped. The window between 23:00 and 00:00, and the other periods with no activity, are when the traffic was paused.

Here’s poll_success_sync, which looks similar to the one above.

One more thing I’d like to add: we have observed this behavior before, and even when I left the system idle for the next 12 hours, the constant 10K/sec continued and did not come down. If need be, I can leave the system idle until tomorrow to prove it.

These 2 metrics indicate how efficient your setup is, i.e. whether there are enough pollers. According to the metrics provided, the poll_success_sync/poll_success ratio is high enough.
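
As a rough illustration of the ratio mentioned above, here is a minimal sketch that computes it from two rate samples. The function name and the sample values are hypothetical; no threshold for "high enough" is given in this thread, so none is encoded here.

```python
def sync_match_ratio(poll_success_sync, poll_success):
    """Fraction of successful polls that were matched synchronously.

    Returns None when there were no successful polls, e.g. while
    incoming traffic is stopped and poll_success is 0.
    """
    if poll_success == 0:
        return None
    return poll_success_sync / poll_success


# Hypothetical per-second rates read off the two graphs.
print(sync_match_ratio(950.0, 1000.0))
print(sync_match_ratio(0.0, 0.0))
```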

The 2 queries below show the rate of async task processing (tasks / second), e.g. processing of activity timers, user timers, workflow close, etc.
Can you also provide these?

sum by (operation) (rate(task_requests{operation=~"TransferActive.*"}[1m]))
sum by (operation) (rate(task_requests{operation=~"TimerActive.*"}[1m]))
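
If it is easier to pull these numbers outside the dashboard, the two queries can be sent to the Prometheus HTTP API (`/api/v1/query`). A minimal sketch that only builds the request URLs; the base address is an assumption, so substitute the address of your own Prometheus server:

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # assumption: address of your Prometheus

QUERIES = {
    "transfer": 'sum by (operation) (rate(task_requests{operation=~"TransferActive.*"}[1m]))',
    "timer": 'sum by (operation) (rate(task_requests{operation=~"TimerActive.*"}[1m]))',
}


def instant_query_url(base, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


for name, promql in QUERIES.items():
    print(name, instant_query_url(PROMETHEUS, promql))
```

Fetching each URL (for example with `urllib.request.urlopen`) returns a JSON document with one series per `operation` label.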

Here are the two graphs for the past 2 days. The 1.6K requests/sec is the actual incoming traffic. We stopped incoming traffic yesterday after my comment, to see whether the DB traffic would go down. It has now been about 17 hours since the traffic was stopped, and the lingering reads/writes are still happening at the same rate. A couple of times the DB traffic went down for 30 minutes, but then it came back.

DB read/write patterns in the last 20 hours:

After almost 27 hours the activity stopped. During all this time there was no incoming traffic.

Latest time active graph. cc: @Wenquan_Xing

Sorry about the delay, I was sick for the past few days.

This graph shows that even after a workflow has finished, there are other timeout-related tasks still in the DB pending processing, e.g. the workflow timeout task, the workflow cleanup task (which deletes the finished workflow after retention), etc.