Start and cancellation of workflows timing out


We are seeing StartWorkflowExecution and RequestCancelWorkflowExecution calls intermittently fail with request timeouts:

org.apache.thrift.TException: Rpc error:<ErrorResponse id=12 errorType=Timeout message=Request timeout after 1006ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.requestCancelWorkflowExecution(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RequestCancelWorkflowExecution$21(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$measureRemoteProc$1(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteProc(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RequestCancelWorkflowExecution(
at $anonfun$cancelHealthCheckWorkflow$1(HealthCheckService.scala:31)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.cloudera.cdp.proxy.handlers.RpcHandlerImpl.$anonfun$removeConfig$4(RpcHandler.scala:124)
at slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContextInline$1(BasicBackend.scala:171)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(
… (remaining frames truncated in the original log)

org.apache.thrift.TException: Rpc error:<ErrorResponse id=5 errorType=Timeout message=Request timeout after 1005ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.startWorkflowExecution(
… (remaining frames truncated in the original log)

The workflow client retries starting the workflow more than 30 times but does not succeed.

We have two domains registered, each running no more than 100 workflows; one of the domains runs a long-running activity that heartbeats every 5 seconds.

Infra details (K8S):

  • MySQL 8 RDS; Cadence configured with 1000 history shards
  • 3 frontend pods; 6 pods each for history, matching, and worker

We are not seeing a CPU usage spike on the DB host when this happens; CPU usage stays below 15%. Write IOPS max out at 400, with latency around 1 ms.
Resource usage on the K8s service pods is also not high.

Seeing the following errors in cadence history service logs:

“Failed to lock shard with ID: 965. Error: Error 1205: Lock wait timeout exceeded; try restarting transaction”

Could you please help?


I just learnt that MySQL 8 is not supported. We are using a MySQL 8.0.11 RDS instance. Could this be causing these issues?
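For reference, the two MySQL 8 incompatibilities most commonly reported with Go-based clients of that era are the removal of the `tx_isolation` session variable (renamed `transaction_isolation` in MySQL 8.0.3) and the new default `caching_sha2_password` authentication plugin, which older `go-sql-driver/mysql` versions cannot negotiate. A hedged sketch of a Cadence SQL persistence config working around the first, assuming your Cadence version supports `connectAttributes` (host and credentials here are placeholders; exact key names and quoting may differ by version):

```yaml
persistence:
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "cadence"
        connectAddr: "my-rds-host:3306"   # hypothetical endpoint
        user: "cadence"
        password: "..."
        maxConns: 20
        maxIdleConns: 20
        maxConnLifetime: "1h"
        # MySQL 8.0.3+ removed tx_isolation in favor of transaction_isolation;
        # pass the session variable name the server actually understands.
        connectAttributes:
          transaction_isolation: "'READ-COMMITTED'"
```

On the auth side, creating the Cadence DB user with the `mysql_native_password` plugin on the server is the usual workaround for older drivers, though whether either issue applies depends on the driver version your Cadence build ships.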



Could you please help with the following questions:

  1. When Thrift RPC calls are made to the frontend service to start/cancel workflows, the call sometimes times out. The client in the worker retries a few times and in some cases succeeds on a retry. Why would this timeout happen? The client is clearly able to connect to the frontend service. Is the frontend timing out while calling the history service? We are running a couple hundred workflows against 6 history pods with ample resource requests (though some workflows issue a lot of heartbeat requests), and the history pods do not appear to be under load either. Could you please explain this behaviour?

  2. Would you strongly discourage using MySQL 8 with Cadence? Do you think MySQL 8 could cause issues under load?

Many thanks,

Any pointers, please? Could the timeouts happen when a workflow is assigned to a shard on which DB updates fail because of a transaction lock wait timeout? Has anybody faced this issue with MySQL?
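On the shard question: Cadence routes every workflow to one of its history shards by hashing the workflow ID, so retries of the same workflow ID always land on the same shard. A minimal sketch of that routing idea (using an md5-based stand-in for determinism; Cadence itself uses a farm fingerprint hash, and the workflow ID below is hypothetical):

```python
import hashlib

NUM_HISTORY_SHARDS = 1000  # matches the cluster config in this thread

def shard_for(workflow_id: str, num_shards: int = NUM_HISTORY_SHARDS) -> int:
    """Illustrative shard routing: stable hash of the workflow ID mod shard count.

    md5 here only keeps the sketch dependency-free and deterministic across
    runs; it is not the hash Cadence actually uses.
    """
    digest = hashlib.md5(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same workflow ID always maps to the same shard, so a lock-wait
# timeout on that one shard can fail repeated retries too.
print(shard_for("health-check-workflow") == shard_for("health-check-workflow"))  # True
```

If the shard failing to acquire its lock (ID 965 in the history-service log above) is the one a retried workflow maps to, every retry keeps hitting the same contended shard, which would be consistent with the "over 30 retries without success" symptom.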

Hey @Guruprasad_Venkatesh,

Do you see any logs in the frontend service itself indicating timeouts? Do you see any other errors elsewhere? Also do you have any metrics that you can share?

I can’t comment on Cadence support of MySQL 8, but let me verify behavior of Temporal on 8.0 in the coming days to give you a better reference.

Thanks @shawn! Please find the stack trace of the client error below:

Caused by: org.apache.thrift.TException: Rpc error:<ErrorResponse id=38 errorType=Timeout message=Request timeout after 1004ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.startWorkflowExecution(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$StartWorkflowExecution$7(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.StartWorkflowExecution(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.lambda$startWorkflowInternal$0(
at com.uber.cadence.internal.common.Retryer.retryWithResult(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.startWorkflowInternal(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.startWorkflow(
at com.uber.cadence.internal.sync.WorkflowStubImpl.startWithOptions(
… 32 more

In the frontend service I do not see any logs for the request. It seems like the request is timing out before reaching the frontend.

I am not sure which metrics to share, so I am listing the important ones below. Please let me know if you would like any specific metrics:
  • AddActivityTask operation requests: max 700
  • AddDecisionTask operation requests: max 250
  • CancelOutstandingPoll operation requests: max 25
  • DescribeWorkflowExecution operation requests: max 250
  • GetWorkflowExecutionHistory operation requests: max 9
  • PollForActivityTask operation requests: max 90
  • PollForDecisionTask operation requests: max 125
  • PollMutableState operation requests: max 10
  • RecordActivityTaskHeartbeat operation requests: max 919940 (we run long-running workflows that heartbeat every 5 seconds)
  • RecordActivityTaskStarted operation requests: max 778

There are other metrics as well, but I am not seeing very high values for them, so they may not be relevant.

cc @Guruprasad_Venkatesh

@Shawn, any idea what might cause the above client timeouts?

Hey Aditya, apologies for my delayed reply here. What are your connection pool settings for MySQL? Also, what is the timeframe for the metrics listed above?

Thanks @Shawn ! Please find below the connection pool settings -

maxConns: 20
maxIdleConns: 20
maxConnLifetime: "1h"
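A back-of-the-envelope check on those numbers (my own arithmetic, not official Cadence sizing guidance): with 1000 shards spread over 6 history pods, each pod owns roughly 167 shards but holds at most 20 DB connections, so bursts of shard operations can queue behind the pool and surface upstream as lock-wait or RPC timeouts:

```python
# Rough pool-pressure estimate; the numbers come from this thread, the
# interpretation is a hedged sketch, not an official sizing rule.
num_shards = 1000
history_pods = 6
max_conns_per_pod = 20

shards_per_pod = -(-num_shards // history_pods)  # ceiling division
shards_per_connection = shards_per_pod / max_conns_per_pod

print(shards_per_pod)         # 167 shards served by one history pod
print(shards_per_connection)  # 8.35 shards contending per DB connection
```

If pool waits turn out to be the bottleneck, raising maxConns (within what the RDS instance allows) is a cheaper experiment than resharding.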

The shared metrics cover a 1-day timeframe; I shared the maximum values within that window. We use heartbeats for long-running activities and are running approximately 200 workflows for now, but we want to scale to at least 10,000 workflows. We send a heartbeat every 5 seconds, and the heartbeat timeout is 30 seconds. The Cadence service is also shared with other projects, so we want to make sure our service alone does not consume more than 20-25% of Cadence's capacity.
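One more hedged check before scaling to 10,000 workflows: heartbeat RPC volume grows linearly with workflow count. A quick estimate, assuming one heartbeating activity per workflow every 5 seconds as described above:

```python
# Heartbeat load estimate; assumes exactly one long-running activity per
# workflow, heartbeating every 5 seconds, per the description in this thread.
heartbeat_interval_s = 5

def heartbeats_per_second(num_workflows: int) -> float:
    return num_workflows / heartbeat_interval_s

print(heartbeats_per_second(200))     # 40.0 requests/sec at current scale
print(heartbeats_per_second(10_000))  # 2000.0 requests/sec at target scale
```

Note that the Cadence Java client throttles heartbeat RPCs internally relative to the heartbeat timeout, so the rate the frontend actually sees may be lower than this naive estimate, but it is still worth budgeting for when the service is shared across projects.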