Hi everyone, I am currently architecting a long-running tracking workflow that monitors real-time testing suites on distributed client machines. The architecture relies on an orchestrator activity that keeps a persistent channel open to collect runtime logs and execution variables from a client-side sandbox environment. I am pushing these intermediate states back via activity heartbeats to monitor health, but I am running into major state corruption and non-determinism issues during worker failovers.
The core problem is that the client-side testing tools produce dense, unstructured log output with shifting schemas and irregular symbols. When these strings pass through the payload converter and are recorded in the workflow's event history, the history grows rapidly. More critically, if a worker crashes and the workflow is replayed, minor variations in the raw text output from the sandbox engine trigger a non-determinism error that kills the workflow instance. I am trying to determine whether it is bad practice to store fluid, third-party log data inside a stateful orchestration loop, or whether there is a recommended way to isolate the workflow history from highly unstable external string payloads.
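To make "isolate" concrete, this is roughly the shape I have in mind: the activity would reduce each raw chunk to a small, fixed-schema record before anything is returned or heartbeated. It is only a sketch under my own assumptions. `ChunkSummary` and `summarize_chunk` are hypothetical names I made up, and nothing here calls the workflow SDK.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkSummary:
    """Fixed-shape stand-in for one raw log chunk (hypothetical type)."""
    line_count: int
    byte_len: int
    sha256: str


def summarize_chunk(raw: str) -> ChunkSummary:
    """Reduce an arbitrary log string to a small record with a stable schema."""
    # errors="replace" keeps odd symbols (e.g. lone surrogates) from raising.
    data = raw.encode("utf-8", errors="replace")
    return ChunkSummary(
        line_count=len(raw.splitlines()),
        byte_len=len(data),
        sha256=hashlib.sha256(data).hexdigest(),
    )


if __name__ == "__main__":
    summary = summarize_chunk("step 1 ok\nvalue = {[a*b]}\n")
    # Only this dict would travel toward the workflow, never the raw text.
    print(asdict(summary))
```

The record has the same three fields regardless of what the tools emit, so the raw text itself would never need to pass through the converter.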
I have been testing a custom serialization template (a free download) that pre-escapes mathematical operators and clears nested JSON brackets before the string reaches the client wrapper, to see whether that stabilizes things. The replay analyzer still flags data mismatches if a recovery event occurs mid-stream. I am trying to decide between two options: remove the dense logging data from the activity heartbeats entirely and offload it to an external database such as MongoDB via a side effect, or fix this natively by overriding the payload data converter to drop unverified text fields before they are committed to history.
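For the offloading option, the pattern I am picturing looks roughly like the sketch below: the raw chunk goes to the external store and the heartbeat carries only a small reference. Everything here is an assumption for illustration. `InMemoryLogStore` is a hypothetical stand-in for MongoDB, `offload_and_heartbeat` is a name I invented, and `heartbeat` is just a callable passed in rather than a real SDK call.

```python
from typing import Callable, Dict, Tuple


class InMemoryLogStore:
    """Hypothetical stand-in for an external store such as MongoDB."""

    def __init__(self) -> None:
        self._chunks: Dict[Tuple[str, int], str] = {}

    def put(self, run_id: str, seq: int, raw: str) -> None:
        # Keyed by (run_id, seq): a retried write overwrites, it does not duplicate.
        self._chunks[(run_id, seq)] = raw

    def get(self, run_id: str, seq: int) -> str:
        return self._chunks[(run_id, seq)]


def offload_and_heartbeat(
    store: InMemoryLogStore,
    run_id: str,
    seq: int,
    raw: str,
    heartbeat: Callable[[dict], None],
) -> dict:
    """Persist the raw chunk externally, then heartbeat only a reference to it."""
    store.put(run_id, seq, raw)
    ref = {"run_id": run_id, "seq": seq}
    heartbeat(ref)  # assumption: the real heartbeat call would be wrapped here
    return ref


if __name__ == "__main__":
    store = InMemoryLogStore()
    offload_and_heartbeat(store, "run-1", 0, "raw {[noisy]} output", print)
```

The idea is that after a failover the activity could resume from the last `seq` it heartbeated, and the workflow side would only ever see the reference.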
If anyone else here has designed durable workflows that manage volatile script execution runtimes or unpredictable external log output without violating the determinism rules for workflow history, how do you handle your data persistence boundaries? I would be very grateful for any advice on separating raw diagnostic streams from core workflow state.