Hello Temporal community!
We recently shipped a bug in one of our workflows where the activity attributes were misconfigured. Specifically, the error we ran into was
"BadScheduleActivityAttributes: MaximumInterval cannot be less than InitialInterval on retry policy.".
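For context, the invariant the server is enforcing here is simply that the retry policy's maximum interval must be at least its initial interval. A minimal standalone sketch of that check (plain Python, not the Temporal SDK; the function name and interval values are ours, for illustration only):

```python
from datetime import timedelta

def validate_retry_policy(initial_interval: timedelta,
                          maximum_interval: timedelta) -> None:
    """Standalone mirror of the server-side check that rejected our attributes."""
    if maximum_interval < initial_interval:
        raise ValueError(
            "BadScheduleActivityAttributes: MaximumInterval cannot be less than "
            "InitialInterval on retry policy."
        )

# A configuration like the one we shipped (hypothetical values): rejected.
try:
    validate_retry_policy(initial_interval=timedelta(seconds=10),
                          maximum_interval=timedelta(seconds=1))
except ValueError as e:
    print(e)

# A corrected configuration (maximum >= initial) passes validation.
validate_retry_policy(initial_interval=timedelta(seconds=10),
                      maximum_interval=timedelta(minutes=5))
```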
A few thousand workflows started with this bad configuration before we caught it and rolled out a fix. However, since the activity attributes are part of the workflow state, the fix doesn't apply retroactively and the workflows can't self-correct. We had to manually terminate the affected workflows and restart them.
While Temporal was attempting to process the bad workflow(s), CPU usage spiked to 90-100% and stayed there. Temporal seemed to be aggressively retrying the next step in the workflow, even though it could never succeed.
This leads us to three questions:
1. Is the aggressive retry behavior intentional, and is there a way to configure Temporal to abandon and fail a workflow once it is in a configuration state that can never succeed?
2. Related to (1), the CPU spike to 100% is concerning. We saw similar spikes even when just a single workflow was misconfigured. Is it intended that a single misconfigured workflow consumes this many resources? For reference, here's a CPU utilization graph (range 0 to 100%) of kubernetes.cpu.usage.total over kubernetes.cpu.limits from the period when that one misconfigured workflow was "running".
3. Is there a way to auto-recover from situations like this? Are there best practices for refreshing activity stub configuration, so that if we start workflows with an improper configuration, we can push out a change and have the running workflows pick it up? That would be significantly simpler than query/terminate/restart.