CPU Spike for Misconfigured Workflows (e.g. BadScheduleActivityAttributes)

Hello Temporal community!

We recently shipped a bug in one of our workflows where the activity attributes were misconfigured. Specifically, the error we ran into was "BadScheduleActivityAttributes: MaximumInterval cannot be less than InitialInterval on retry policy."

We had a few thousand workflows start with this bad configuration before we caught it and rolled out a fix. However, since the activity attributes are part of the workflow state, this isn’t something that can auto-recover and self-correct. We had to manually terminate the affected workflows and restart them.

While Temporal was attempting to process the bad workflow(s), CPU usage spiked to 90-100% and stayed there. Temporal seemed to be aggressively retrying the next step in the workflow, even though it would never succeed.

This leads us to three questions:

  1. Is the aggressive retry behavior intentional, and is there any way we could configure Temporal to abandon and fail a workflow if it is in a configuration state that will never succeed?

  2. Related to (1), the CPU spike to 100% is concerning. We noticed similar spikes even if just a single workflow was misconfigured. Is it intended that a single misconfigured workflow will consume so many resources? For example, when there was one misconfigured workflow, here’s a CPU utilization graph (range 0 to 100%) showing k8s kubernetes.cpu.usage.total over kubernetes.cpu.limits during the time that workflow was “running”.

[Screenshot: CPU utilization graph, 2021-10-30]

  3. Is there a way to auto-recover from such situations? Are there best practices around refreshing activity stub configuration so that if we start workflows with improper configuration, we can push out a change and have the workflows just pick it up? That would be significantly simpler than query/terminate/restart.

Thank you!


Which SDK version do you use?

Java SDK, version 1.10. Currently running Temporal server version 1.10.5.

Interesting. I can see that this can increase the load on the system. But 100% looks excessive unless the system is running close to its capacity.

Adding a bit of related context on what we believe to be the root cause: I attempted to reproduce this issue locally with the Java test SDK 1.4. When an incorrect configuration like the following is used,

                RetryOptions.newBuilder()
                        .setInitialInterval(Duration.ofSeconds(2))
                        .setMaximumInterval(Duration.ofSeconds(1))
                        .setDoNotRetry(IllegalArgumentException.class.getName())
                        .build()

the internal test runner hangs but does not fail. I reproduced this on

samples-java/HelloActivityRetryTest.java at master · temporalio/samples-java · GitHub, changing the retry options here: samples-java/HelloActivityRetry.java at master · temporalio/samples-java · GitHub.

The test output looks like:

21:53:01.039 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=2, pollThreadNamePrefix='Workflow Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.044 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=1, pollThreadNamePrefix='Local Activity Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.047 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=5, pollThreadNamePrefix='Activity Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.049 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=5, pollThreadNamePrefix='Host Local Workflow Poller'}, identity=7f61a229-9a0f-49ae-82e0-09f5c2af125a}

In another example, the test eventually failed with a duration timeout.

The behavior of retrying the workflow task on failure is by design. So your reproduction does validate that the test framework works according to the design.

@maxim But can Temporal be configured to, or handle this case smartly out of the box so that it does not retry when the reason for failure is BadScheduleActivityAttributes?

Why do you want to not retry BadScheduleActivityAttributes? This is a clear programmer error that can be fixed without failing workflows. It should also be caught by a single execution, so any test would catch it.

This is a clear programmer error that can be fixed without failing workflows.

But workflows which have started execution with incorrect attributes cannot auto-recover. In the above case we had to terminate them and rerun the workflows after fixing the attributes.

But workflows which have started execution with incorrect attributes cannot auto-recover.

This is not correct. You could deploy the fix without terminating workflows and they would continue execution.


Hi Maxim,

Thank you for indicating that we can deploy the fix. I tested this locally and confirmed that it worked for statically bound activity stubs, such as:

    private final HelloActivity helloActivity = Workflow.newActivityStub(
        HelloActivity.class,
        ActivityOptions.newBuilder()
                       .setStartToCloseTimeout(Duration.ofDays(10))
                       .setRetryOptions(RetryOptions.newBuilder()
                                                    .setInitialInterval(Duration.ofMinutes(2))
                                                    .setMaximumInterval(Duration.ofMinutes(3))
                                                    .setBackoffCoefficient(1.0)
                                                    .build())
                       .build()
    );

In our production case, we dynamically constructed the activity stubs at runtime using configuration values fetched by a local activity. The rationale was to enable different retry/activity options per environment. I believe that in this case it would not be possible to deploy a fix to existing workflows, because they had already passed the dynamic construction of the activity stubs at runtime.

Is this a bad practice? If so, do you have any recommendations for maintaining different configurations?

For activity options you can consider setting them through WorkflowImplementationOptions.setActivityOptions.
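For illustration, here is a minimal sketch of that approach. The task queue name, activity type name, workflow implementation class, and option values are all assumptions for the example, not from this thread:

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.client.WorkflowClient;
import io.temporal.common.RetryOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkflowImplementationOptions;

import java.time.Duration;
import java.util.Map;

public class WorkerSetup {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("HELLO_TASK_QUEUE"); // hypothetical task queue

    // Options are supplied at worker registration time, keyed by activity type name.
    // Because they are not baked into workflow code, redeploying the worker with
    // corrected values applies to already-running workflows as well.
    WorkflowImplementationOptions implOptions =
        WorkflowImplementationOptions.newBuilder()
            .setActivityOptions(
                Map.of(
                    "ComposeGreeting", // activity type name (assumed)
                    ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofMinutes(5))
                        .setRetryOptions(
                            RetryOptions.newBuilder()
                                .setInitialInterval(Duration.ofMinutes(2))
                                .setMaximumInterval(Duration.ofMinutes(3)) // must be >= initial
                                .build())
                        .build()))
            .build();

    worker.registerWorkflowImplementationTypes(implOptions, GreetingWorkflowImpl.class);
    factory.start();
  }
}
```

In this sketch `GreetingWorkflowImpl` stands in for whatever workflow implementation class you register.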

What is the difference between fetching the values from application.properties vs. statically binding the values in the code?

It depends on where these are read. You should never read application.properties from workflow code as it can break determinism. But using properties to set WorkflowImplementationOptions or worker options is fine.
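A sketch of the safe pattern, for illustration (the property names and defaults here are assumptions): read the properties once in worker bootstrap code, before any workflow task runs, and hand the resulting options to the worker rather than reading files from workflow code.

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.worker.WorkflowImplementationOptions;

import java.io.IOException;
import java.io.InputStream;
import java.time.Duration;
import java.util.Properties;

public class OptionsFromProperties {
  // This runs in worker startup code, not in workflow code, so determinism is
  // preserved: the values are fixed into the options before any workflow task executes.
  static WorkflowImplementationOptions load() throws IOException {
    Properties props = new Properties();
    try (InputStream in =
        OptionsFromProperties.class.getResourceAsStream("/application.properties")) {
      props.load(in);
    }
    long initialSec = Long.parseLong(props.getProperty("retry.initialSeconds", "2"));
    long maxSec = Long.parseLong(props.getProperty("retry.maxSeconds", "120"));
    return WorkflowImplementationOptions.newBuilder()
        .setDefaultActivityOptions(
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofMinutes(10))
                .setRetryOptions(
                    RetryOptions.newBuilder()
                        .setInitialInterval(Duration.ofSeconds(initialSec))
                        .setMaximumInterval(Duration.ofSeconds(maxSec))
                        .build())
                .build())
        .build();
  }
}
```

The key point is the boundary: anything nondeterministic (file reads, environment lookups) happens before worker registration, and workflow code only ever sees the already-built options.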
