CPU Spike for Misconfigured Workflows (e.g. BadScheduleActivityAttributes)

Hello Temporal community!

We recently shipped a bug in one of our workflows where the activity attributes were misconfigured. Specifically, the error we ran into was "BadScheduleActivityAttributes: MaximumInterval cannot be less than InitialInterval on retry policy."

We had a few thousand workflows start with this bad configuration before we caught it and rolled out a fix. However, since the activity attributes are part of the workflow state, this isn’t something that can auto-recover and self-correct. We had to manually terminate the affected workflows and restart them.

While Temporal was attempting to process the bad workflow(s), CPU usage spiked to 90-100% and stayed there. Temporal seemed to be aggressively retrying the next step in the workflow, even though it would never succeed.

This leads us to three questions:

  1. Is the aggressive retry behavior intentional, and is there any way we could configure Temporal to abandon and fail a workflow if it is in a configuration state that will never succeed?

  2. Related to (1), the CPU spike to 100% is concerning. We noticed similar spikes even if just a single workflow was misconfigured. Is it intended that a single misconfigured workflow will consume so many resources? For example, when there was one misconfigured workflow, here’s a CPU utilization graph (range 0 to 100%) showing k8s kubernetes.cpu.usage.total over kubernetes.cpu.limits during the time that workflow was “running”.

[Screenshot: CPU utilization graph, 2021-10-30]

  3. Is there a way to auto-recover from such situations? Are there best practices around refreshing activity stub configuration so that if we start workflows with improper configuration, we can push out a change and have the workflows just pick it up? That would be significantly simpler than query/terminate/restart.

Thank you!


Which SDK version do you use?

Java SDK, version 1.10. Currently running Temporal server version 1.10.5.

Interesting. I can see that this can increase the load on the system. But 100% looks excessive unless the system is running close to its capacity.

Adding a bit of related context on what we believe to be the root cause: I attempted to reproduce this issue locally with the Java test SDK 1.4. When an incorrect configuration like the following is used,

                RetryOptions.newBuilder()
                        .setInitialInterval(Duration.ofSeconds(2))
                        .setMaximumInterval(Duration.ofSeconds(1))
                        .setDoNotRetry(IllegalArgumentException.class.getName())
                        .build()

the internal test runner hangs but does not fail. I reproduced this on

samples-java/HelloActivityRetryTest.java at master · temporalio/samples-java · GitHub, changing the retry options here: samples-java/HelloActivityRetry.java at master · temporalio/samples-java · GitHub.

The test output looks like:

21:53:01.039 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=2, pollThreadNamePrefix='Workflow Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.044 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=1, pollThreadNamePrefix='Local Activity Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.047 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=5, pollThreadNamePrefix='Activity Poller taskQueue="WorkflowTest-testActivityImpl-22faf369-da3f-4711-b21e-1ca190f04c34", namespace="UnitTest"'}, identity=57150@SOFI-C02FJ40WMD6R}
21:53:01.049 [Test worker] INFO  io.temporal.internal.worker.Poller - start(): Poller{options=PollerOptions{maximumPollRateIntervalMilliseconds=1000, maximumPollRatePerSecond=0.0, pollBackoffCoefficient=2.0, pollBackoffInitialInterval=PT0.1S, pollBackoffMaximumInterval=PT1M, pollThreadCount=5, pollThreadNamePrefix='Host Local Workflow Poller'}, identity=7f61a229-9a0f-49ae-82e0-09f5c2af125a}

In another example, the test eventually failed with a duration timeout.

The behavior of retrying the workflow task on failure is by design. So your reproduction does validate that the test framework works according to the design.

@maxim But can Temporal be configured to, or handle this case smartly out of the box so that it does not retry when the reason for failure is BadScheduleActivityAttributes?

Why do you want to not retry BadScheduleActivityAttributes? This is a clear programmer error that can be fixed without failing workflows. It should also be caught by a single execution, so any test would catch it.

This is a clear programmer error that can be fixed without failing workflows.

But workflows which have started execution with incorrect attributes cannot auto-recover. In the above case we had to terminate them and rerun the workflows after fixing the attributes.

But workflows which have started execution with incorrect attributes cannot auto-recover.

This is not correct. You could deploy the fix without terminating workflows and they would continue execution.


Hi Maxim,

Thank you for indicating that we can deploy the fix. I tested this locally and confirmed that it worked for statically bound activity stubs, such as:

    private final HelloActivity helloActivity = Workflow.newActivityStub(
        HelloActivity.class,
        ActivityOptions.newBuilder()
                       .setStartToCloseTimeout(Duration.ofDays(10))
                       .setRetryOptions(RetryOptions.newBuilder()
                                                    .setInitialInterval(Duration.ofMinutes(2))
                                                    .setMaximumInterval(Duration.ofMinutes(3))
                                                    .setBackoffCoefficient(1.0)
                                                    .build())
                       .build()
    );

In our production case, we dynamically constructed the activity stubs at runtime using configuration values fetched by a local activity. The rationale was to enable different retry/activity options per environment. I believe that in this case it would not be possible to deploy a fix to existing workflows, because they had already passed the dynamic construction of the activity stubs at runtime.

Is this a bad practice? If so, do you have any recommendations for maintaining different configurations?

For activity options you can consider setting them through WorkflowImplementationOptions.setActivityOptions.
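For illustration, here is a minimal sketch of that approach. The task queue name, activity type name, workflow implementation class, and option values are all assumptions for the example, not from this thread:

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.client.WorkflowClient;
import io.temporal.common.RetryOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkflowImplementationOptions;

import java.time.Duration;
import java.util.Map;

public class WorkerSetup {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("HELLO_TASK_QUEUE"); // hypothetical task queue

    // Options are supplied at worker registration time, keyed by activity type name.
    // Because they are not baked into workflow code, redeploying the worker with
    // corrected values applies to already-running workflows as well.
    WorkflowImplementationOptions implOptions =
        WorkflowImplementationOptions.newBuilder()
            .setActivityOptions(
                Map.of(
                    "ComposeGreeting", // activity type name (assumed)
                    ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofMinutes(5))
                        .setRetryOptions(
                            RetryOptions.newBuilder()
                                .setInitialInterval(Duration.ofMinutes(2))
                                .setMaximumInterval(Duration.ofMinutes(3)) // must be >= initial
                                .build())
                        .build()))
            .build();

    worker.registerWorkflowImplementationTypes(implOptions, GreetingWorkflowImpl.class);
    factory.start();
  }
}
```

In this sketch `GreetingWorkflowImpl` stands in for whatever workflow implementation class you register.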

What is the difference between fetching the values from application.properties vs. statically binding the values in the code?

It depends on where these are read. You should never read application.properties from workflow code as it can break determinism. But using properties to set WorkflowImplementationOptions or worker options is fine.
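A sketch of the safe pattern, for illustration (the property names and defaults here are assumptions): read the properties once in worker bootstrap code, before any workflow task runs, and hand the resulting options to the worker rather than reading files from workflow code.

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.worker.WorkflowImplementationOptions;

import java.io.IOException;
import java.io.InputStream;
import java.time.Duration;
import java.util.Properties;

public class OptionsFromProperties {
  // This runs in worker startup code, not in workflow code, so determinism is
  // preserved: the values are fixed into the options before any workflow task executes.
  static WorkflowImplementationOptions load() throws IOException {
    Properties props = new Properties();
    try (InputStream in =
        OptionsFromProperties.class.getResourceAsStream("/application.properties")) {
      props.load(in);
    }
    long initialSec = Long.parseLong(props.getProperty("retry.initialSeconds", "2"));
    long maxSec = Long.parseLong(props.getProperty("retry.maxSeconds", "120"));
    return WorkflowImplementationOptions.newBuilder()
        .setDefaultActivityOptions(
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofMinutes(10))
                .setRetryOptions(
                    RetryOptions.newBuilder()
                        .setInitialInterval(Duration.ofSeconds(initialSec))
                        .setMaximumInterval(Duration.ofSeconds(maxSec))
                        .build())
                .build())
        .build();
  }
}
```

The key point is the boundary: anything nondeterministic (file reads, environment lookups) happens before worker registration, and workflow code only ever sees the already-built options.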
