Temporal in idle state generating huge read/write traffic to Cassandra

Temporal in an idle state is generating huge read/write traffic to Cassandra. I'm not sure under what circumstances this starts happening, but it appeared after running for a couple of days at 20K to 24K requests/sec with LOCAL_QUORUM consistency, with about 3% errors during that period.
Even when the incoming requests are stopped, the traffic keeps going at roughly 10K reads/sec for almost an hour afterwards. And when I start the traffic again, the total traffic on the DB goes back up to 34K/sec.
Only the startToClose timeout is set, to 1 second, and the retry options are set with a maximum of 3 attempts.
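For reference, here is roughly what those options look like, assuming the Go SDK (the real application code may differ; ProcessRequest is just a placeholder for the actual local activity):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// RequestWorkflow runs the local activity with a 1-second startToClose timeout
// and at most 3 attempts, matching the settings described above.
func RequestWorkflow(ctx workflow.Context, input string) error {
	lao := workflow.LocalActivityOptions{
		StartToCloseTimeout: time.Second, // each attempt must finish within 1s
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 3, // at most 3 attempts in total
		},
	}
	ctx = workflow.WithLocalActivityOptions(ctx, lao)
	return workflow.ExecuteLocalActivity(ctx, ProcessRequest, input).Get(ctx, nil)
}

// ProcessRequest is a placeholder for the real local activity.
func ProcessRequest(ctx context.Context, input string) error {
	return nil
}
```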

There are 2 cases:

  • Timeout tasks: these tasks will be loaded from the DB and processed even if the workflow/activity is already gone.
  • Background task maintenance, e.g. periodically scanning for new tasks in case a write to the DB times out.

How can I reduce the timeout tasks? The application does not need a task to be retried if it did not finish successfully within the first 5 seconds. If I set scheduleToClose and scheduleToStart to, say, 5 seconds, will this 10K/sec case be mitigated?
Also, we want to make sure the DB can handle any such background traffic: will it always be 10K/sec, and/or how can we throttle it?
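For concreteness, this is the kind of change I mean, sketched with the Go SDK (DoWork is a placeholder for the real local activity); my understanding is that scheduleToClose caps the whole schedule/retry/execute window, so nothing would run or retry past 5 seconds:

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// CappedRetryWorkflow bounds the total time spent on the local activity,
// including retries, to 5 seconds.
func CappedRetryWorkflow(ctx workflow.Context, input string) error {
	lao := workflow.LocalActivityOptions{
		StartToCloseTimeout:    time.Second,     // per-attempt limit
		ScheduleToCloseTimeout: 5 * time.Second, // overall cap across all attempts
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Second, // 1s between attempts
			MaximumAttempts: 3,
		},
	}
	ctx = workflow.WithLocalActivityOptions(ctx, lao)
	return workflow.ExecuteLocalActivity(ctx, DoWork, input).Get(ctx, nil)
}

// DoWork stands in for the real local activity.
func DoWork(ctx context.Context, input string) error {
	return nil
}
```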

By not setting a timeout (i.e. leaving it at 0), a workflow can have no timeout at all.
An activity, however, must have at least one timeout, e.g. startToClose.
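For example, a minimal sketch assuming the Go SDK (the workflow ID, task queue, and function names are placeholders):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// Start a workflow with no workflow-level timeout: leaving the timeout
// fields unset (zero) means "no timeout".
func startWithoutWorkflowTimeout(ctx context.Context, c client.Client) error {
	opts := client.StartWorkflowOptions{
		ID:        "no-timeout-workflow", // placeholder ID
		TaskQueue: "my-task-queue",       // placeholder task queue
		// WorkflowExecutionTimeout / WorkflowRunTimeout intentionally left at 0.
	}
	_, err := c.ExecuteWorkflow(ctx, opts, SomeWorkflow)
	return err
}

// An activity, by contrast, must carry at least one timeout, e.g. StartToCloseTimeout.
func SomeWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second, // at least one timeout is required
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, SomeActivity).Get(ctx, nil)
}

func SomeActivity(ctx context.Context) error {
	return nil
}
```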


It depends on your specific workflow (logic); timers (sleeps) and timeouts will always be persisted to the DB and later loaded for processing.
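For example, both the explicit sleep and the activity timeout below end up as durable timer tasks in the persistence store, which the server later loads and fires (a sketch assuming the Go SDK; names are placeholders):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// TimerWorkflow shows two sources of persisted timers: an explicit
// workflow.Sleep and an activity StartToCloseTimeout.
func TimerWorkflow(ctx workflow.Context) error {
	// Durable timer: persisted to the DB and fired by the server.
	if err := workflow.Sleep(ctx, time.Minute); err != nil {
		return err
	}

	ao := workflow.ActivityOptions{
		// This timeout is also backed by a persisted timer task.
		StartToCloseTimeout: time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, NoopActivity).Get(ctx, nil)
}

// NoopActivity stands in for a real activity.
func NoopActivity(ctx context.Context) error {
	return nil
}
```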


Yes, I have the startToClose timeout set to 1 second. My logic is to just try to perform the local activity within 1 second, else discard it, as the calling client system (through an API) takes the error and goes away. There is no purpose for my application to retry.
I am not sure about the timer you are referring to here, but I have the retry interval set to 1 second and the total retry attempts capped at 3.
How can I avoid these huge LOCAL_QUORUM and LOCAL_SERIAL reads and writes, as they are choking the Cassandra cluster?
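Since we never want a retry, I assume we could also disable retries entirely so that no retry timers are created at all (sketched with the Go SDK; HandleRequest is a placeholder for the real local activity):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// NoRetryWorkflow runs the local activity exactly once: MaximumAttempts of 1
// means a failed or timed-out attempt is not retried.
func NoRetryWorkflow(ctx workflow.Context, input string) error {
	lao := workflow.LocalActivityOptions{
		StartToCloseTimeout: time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 1, // 1 = run once, never retry
		},
	}
	ctx = workflow.WithLocalActivityOptions(ctx, lao)
	return workflow.ExecuteLocalActivity(ctx, HandleRequest, input).Get(ctx, nil)
}

// HandleRequest stands in for the real local activity.
func HandleRequest(ctx context.Context, input string) error {
	return nil
}
```

The trade-off is that any transient failure is surfaced to the caller immediately instead of being retried, which matches the behavior described above.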


On the right panel I have attached the screenshot at table level; it shows 20K because it is not at coordinator level, so the numbers double, but it is 10K/sec LOCAL_QUORUM reads happening primarily on the executions and history_tree tables.


Here is the write throughput, which is also at table level and not coordinator level. The activity is mostly on the executions, history_node, and history_tree tables.

Lots of reads from the history_tree table does not seem right, unless you have already run a load test and finished workflows are being deleted.

Lots of writes to the history_tree table seems to suggest you are still trying to start lots of new workflows.

Any mutation of a workflow (create & update) will be translated into an LWT (lightweight transaction) against Cassandra.
Delivering new workflow events will cause lots of reads.

Can you also provide these 2 metrics?

  • poll_success
  • poll_success_sync


poll_success seems to be 0 when the traffic was stopped. The time between 23:00 and 00:00, and the other periods with no activity, are when the traffic was paused.


Here’s poll_success_sync, which looks similar to the one above.

One more thing I’d like to add is that we have observed this behavior before: even if I leave the system idle for the next 12 hours, the constant 10K/sec continues and does not come down. If need be, I can leave the system idle until tomorrow to prove it.

These 2 metrics indicate how efficient your setup is, i.e. whether or not there are enough pollers. According to the metrics provided, your poll_success_sync/poll_success ratio is high enough.

The 2 metrics below show async task processing in tasks/second, e.g. processing of activity timers, user timers, workflow close, etc.
Can you also provide these?

sum by (operation) (rate(task_requests{operation=~"TransferActive.*"}[1m]))
sum by (operation) (rate(task_requests{operation=~"TimerActive.*"}[1m]))

Here are the two graphs for the past 2 days. The 1.6K requests/sec is the actual incoming traffic. We stopped the incoming traffic yesterday after my comment, as we wanted to see if the DB traffic would go down; it has now been about 17 hours since the traffic was stopped, and yet there are lingering reads/writes happening at the same rate. A couple of times the DB traffic went down for 30 minutes, but it's back at it again.

DB read/write patterns in the last 20 hours:


After almost 27 hours the activity stopped; during all this time there was no incoming traffic.


Latest time active graph. cc: @Wenquan_Xing

Sorry about the delay, I was sick for the past few days.

This graph shows that even after a workflow is finished, there will be other timeout-related tasks still in the DB pending processing, e.g. the workflow timeout task, the workflow cleanup task (which deletes the finished workflow after the retention period), etc.