Start and cancellation of workflows timing out


We are seeing StartWorkflowExecution and RequestCancelWorkflowExecution calls intermittently fail with request timeouts:

org.apache.thrift.TException: Rpc error:<ErrorResponse id=12 errorType=Timeout message=Request timeout after 1006ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.requestCancelWorkflowExecution(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RequestCancelWorkflowExecution$21(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$measureRemoteProc$1(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteProc(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RequestCancelWorkflowExecution(
at $anonfun$cancelHealthCheckWorkflow$1(HealthCheckService.scala:31)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.cloudera.cdp.proxy.handlers.RpcHandlerImpl.$anonfun$removeConfig$4(RpcHandler.scala:124)
at slick.basic.BasicBackend$DatabaseDef.$anonfun$runInContextInline$1(BasicBackend.scala:171)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(
… (remaining frames truncated in the original log)

org.apache.thrift.TException: Rpc error:<ErrorResponse id=5 errorType=Timeout message=Request timeout after 1005ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.startWorkflowExecution(
… (remaining frames truncated in the original log)

The workflow client retries starting the workflow more than 30 times but does not succeed.

We have two domains registered, each running no more than 100 workflows; one of the domains runs a long-running activity that heartbeats every 5 seconds.

Infra details (K8S):

  • MySQL 8 RDS; Cadence configured with 1000 history shards
  • 3 frontend pods; 6 pods each for history, matching, and worker

We are not seeing a CPU usage spike on the DB host when this happens; CPU usage stays below 15%. Write IOPS max out at 400, with latency around 1 ms.
Resource usage on the K8s service pods is also not high.

Seeing the following errors in cadence history service logs:

“Failed to lock shard with ID: 965. Error: Error 1205: Lock wait timeout exceeded; try restarting transaction”

Could you please help?


I just learnt that MySQL 8 is not supported. We are using a MySQL 8.0.11 RDS instance. Could this be causing these issues?
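For reference, the two MySQL 8 incompatibilities most commonly reported with Go-based clients of that era are the removal of the `tx_isolation` session variable (renamed `transaction_isolation` in MySQL 8.0.3) and the new default `caching_sha2_password` authentication plugin, which older `go-sql-driver/mysql` versions cannot negotiate. A hedged sketch of a Cadence SQL persistence config working around the first, assuming your Cadence version supports `connectAttributes` (host and credentials here are placeholders; exact key names and quoting may differ by version):

```yaml
persistence:
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "cadence"
        connectAddr: "my-rds-host:3306"   # hypothetical endpoint
        user: "cadence"
        password: "..."
        maxConns: 20
        maxIdleConns: 20
        maxConnLifetime: "1h"
        # MySQL 8.0.3+ removed tx_isolation in favor of transaction_isolation;
        # pass the session variable name the server actually understands.
        connectAttributes:
          transaction_isolation: "'READ-COMMITTED'"
```

On the auth side, creating the Cadence DB user with the `mysql_native_password` plugin on the server is the usual workaround for older drivers, though whether either issue applies depends on the driver version your Cadence build ships.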



Could you please help with the following questions:

  1. When Thrift RPC calls are made to the frontend service to start/cancel workflows, the call sometimes times out. The client in the worker retries a few times and in some cases succeeds on a retry. Why would this timeout happen? The client is clearly able to connect to the frontend service. Is the frontend timing out while calling the history service? We are running a couple hundred workflows against 6 history pods with ample resource requests (though some workflows issue a lot of heartbeat requests), and the history pods do not appear to be under load either. Could you please explain this behaviour?

  2. Would you strongly discourage using MySQL 8 with Cadence? Do you think MySQL 8 could cause issues under load?

Many thanks,

Any pointers, please? Could the timeouts happen when a workflow is assigned to a shard on which DB updates fail because of a transaction lock wait timeout? Has anybody faced this issue with MySQL?
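On the shard question: Cadence routes every workflow to one of its history shards by hashing the workflow ID, so retries of the same workflow ID always land on the same shard. A minimal sketch of that routing idea (using an md5-based stand-in for determinism; Cadence itself uses a farm fingerprint hash, and the workflow ID below is hypothetical):

```python
import hashlib

NUM_HISTORY_SHARDS = 1000  # matches the cluster config in this thread

def shard_for(workflow_id: str, num_shards: int = NUM_HISTORY_SHARDS) -> int:
    """Illustrative shard routing: stable hash of the workflow ID mod shard count.

    md5 here only keeps the sketch dependency-free and deterministic across
    runs; it is not the hash Cadence actually uses.
    """
    digest = hashlib.md5(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same workflow ID always maps to the same shard, so a lock-wait
# timeout on that one shard can fail repeated retries too.
print(shard_for("health-check-workflow") == shard_for("health-check-workflow"))  # True
```

If the shard failing to acquire its lock (ID 965 in the history-service log above) is the one a retried workflow maps to, every retry keeps hitting the same contended shard, which would be consistent with the "over 30 retries without success" symptom.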

Hey @Guruprasad_Venkatesh,

Do you see any logs in the frontend service itself indicating timeouts? Do you see any other errors elsewhere? Also do you have any metrics that you can share?

I can’t comment on Cadence support of MySQL 8, but let me verify behavior of Temporal on 8.0 in the coming days to give you a better reference.

Thanks @shawn! Please find the stack trace of the client error below:

Caused by: org.apache.thrift.TException: Rpc error:<ErrorResponse id=38 errorType=Timeout message=Request timeout after 1004ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.startWorkflowExecution(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$StartWorkflowExecution$7(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.StartWorkflowExecution(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.lambda$startWorkflowInternal$0(
at com.uber.cadence.internal.common.Retryer.retryWithResult(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.startWorkflowInternal(
at com.uber.cadence.internal.external.GenericWorkflowClientExternalImpl.startWorkflow(
at com.uber.cadence.internal.sync.WorkflowStubImpl.startWithOptions(
… 32 more

In the frontend service I do not see any logs for the request. It seems like the request is timing out before reaching the frontend.

I am not sure which metrics to share, so I am listing the important ones below. Please let me know if you would like any specific metrics:
  • AddActivityTask operation requests: max 700
  • AddDecisionTask operation requests: max 250
  • CancelOutstandingPoll operation requests: max 25
  • DescribeWorkflowExecution operation requests: max 250
  • GetWorkflowExecutionHistory operation requests: max 9
  • PollForActivityTask operation requests: max 90
  • PollForDecisionTask operation requests: max 125
  • PollMutableState operation requests: max 10
  • RecordActivityTaskHeartbeat operation requests: max 919940 (we run long-running workflows that heartbeat every 5 seconds)
  • RecordActivityTaskStarted operation requests: max 778

There are other metrics as well, but I am not seeing very high values for them, so they may not be relevant.

cc @Guruprasad_Venkatesh

@Shawn, any idea what might cause the above client timeouts?

Hey Aditya, apologies for my delayed reply here. What are your connection pool settings for MySQL? Also, what is the timeframe for the metrics listed above?

Thanks @Shawn ! Please find below the connection pool settings -

maxConns: 20
maxIdleConns: 20
maxConnLifetime: "1h"
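A back-of-the-envelope check on those numbers (my own arithmetic, not official Cadence sizing guidance): with 1000 shards spread over 6 history pods, each pod owns roughly 167 shards but holds at most 20 DB connections, so bursts of shard operations can queue behind the pool and surface upstream as lock-wait or RPC timeouts:

```python
# Rough pool-pressure estimate; the numbers come from this thread, the
# interpretation is a hedged sketch, not an official sizing rule.
num_shards = 1000
history_pods = 6
max_conns_per_pod = 20

shards_per_pod = -(-num_shards // history_pods)  # ceiling division
shards_per_connection = shards_per_pod / max_conns_per_pod

print(shards_per_pod)         # 167 shards served by one history pod
print(shards_per_connection)  # 8.35 shards contending per DB connection
```

If pool waits turn out to be the bottleneck, raising maxConns (within what the RDS instance allows) is a cheaper experiment than resharding.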

The shared metrics cover a 1-day timeframe; I shared the maximum values within that window. We use heartbeats for long-running activities and are running approximately 200 workflows for now, but we want to scale to at least 10,000 workflows. We send a heartbeat every 5 seconds, and the heartbeat timeout is 30 seconds. The Cadence service is also shared with other projects, so we want to make sure our service alone does not consume more than 20-25% of Cadence's capacity.
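One more hedged check before scaling to 10,000 workflows: heartbeat RPC volume grows linearly with workflow count. A quick estimate, assuming one heartbeating activity per workflow every 5 seconds as described above:

```python
# Heartbeat load estimate; assumes exactly one long-running activity per
# workflow, heartbeating every 5 seconds, per the description in this thread.
heartbeat_interval_s = 5

def heartbeats_per_second(num_workflows: int) -> float:
    return num_workflows / heartbeat_interval_s

print(heartbeats_per_second(200))     # 40.0 requests/sec at current scale
print(heartbeats_per_second(10_000))  # 2000.0 requests/sec at target scale
```

Note that the Cadence Java client throttles heartbeat RPCs internally relative to the heartbeat timeout, so the rate the frontend actually sees may be lower than this naive estimate, but it is still worth budgeting for when the service is shared across projects.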