Querying active workflows by their history size

Hi Temporal team!

We are running some Temporal workflows in production, driven by certain user-generated events. Those workflows were not supposed to be long-running, and we expected them to have at most several hundred events in their history before they terminate.

However, by analyzing the workflows that have already closed, I’ve noticed that some of them (probably generated by “bot”-like users) have several thousand events in their histories. That made me worry that some of the active workflows might have tens of thousands of events or more, and wonder whether we should urgently deploy a fix that calls ContinueAsNew after some time or incoming event count. To better understand the priority of that fix: is there a way to query active workflows by their history size, in either events or bytes?

(Workflow and activity signals in that workflow are just ~100 bytes each, so I assume we still have some time left until history growth becomes a problem, but I’d like to be sure that some active workflows are not significantly worse than the closed ones.)
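A rough way to put numbers on "some time left" is sketched below. It only divides a size budget by a fixed per-event size. The 200-byte per-event overhead and the 10 MiB budget are hypothetical values chosen for illustration, not measurements from a real history, and `events_until_budget` is a made-up helper name.

```python
def events_until_budget(budget_bytes: int, payload_bytes: int, overhead_bytes: int) -> int:
    """Return how many events fit in budget_bytes at a fixed per-event size."""
    per_event = payload_bytes + overhead_bytes
    if per_event <= 0:
        raise ValueError("per-event size must be positive")
    return budget_bytes // per_event


# ~100-byte payloads come from the post above; the 200-byte overhead and the
# 10 MiB budget are assumptions made only for this example.
estimate = events_until_budget(10 * 1024 * 1024, payload_bytes=100, overhead_bytes=200)
print(f"roughly {estimate} events fit in the budget")
```

Real history events also carry metadata whose size varies by event type, so an estimate like this is only an order-of-magnitude check.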

Hey @GreyTeardrop,
We do not have an index that lets you list workflow executions by history size through our visibility API. This might be an interesting feature to add, though. Currently we have the following two configuration knobs:

  • HistorySizeLimitError: Defaults to 50MB. The workflow execution is force-terminated if its history size exceeds this limit.
  • HistorySizeLimitWarn: Defaults to 10MB. The server emits the log "history size exceeds warn limit." when an execution breaches this threshold. The log is tagged with Namespace, WorkflowID, RunID, etc., so you can search for large workflow executions if your logs are indexed.

We also emit metrics on execution stats, which might be useful for getting visibility into history size/count and for setting up alerts.
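If the server logs are available as JSON lines, the warn log described above could be picked out with something like the following. This is a minimal sketch, not an official tool: the field names `msg`, `wf-namespace`, `wf-id`, and `wf-run-id` are assumptions about the log format, so check them against your own log output before relying on them.

```python
import json
from typing import Iterable, Iterator

# Message text quoted in the reply above (trailing period dropped on purpose).
WARN_MESSAGE = "history size exceeds warn limit"


def find_large_history_warnings(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON log records whose message contains the warn text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        # "msg" is an assumed field name for the log message.
        if WARN_MESSAGE in str(record.get("msg", "")):
            yield record


# Made-up sample lines, for illustration only.
sample_lines = [
    '{"level":"warn","msg":"history size exceeds warn limit.",'
    '"wf-namespace":"default","wf-id":"order-1","wf-run-id":"r1"}',
    "plain text line",
    '{"level":"info","msg":"something else","wf-id":"order-2"}',
]
for rec in find_large_history_warnings(sample_lines):
    print(rec.get("wf-namespace"), rec.get("wf-id"), rec.get("wf-run-id"))
```

In practice the same filter would usually be expressed as a query in whatever system indexes your logs.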

Thank you, Samar! HistorySizeLimitWarn and HistoryCountLimitWarn look like exactly what I was looking for.

Do I understand it right, that if we ever hit the history size/count error limit, the way to fix that would be to update the workflow to ContinueAsNew, and then temporarily increase the limit slightly so the workflow can commit the ContinueAsNew command?

Having the workflow execution call ContinueAsNew is a good way to make sure the history of an execution does not grow beyond a certain size.
If any of your workflow executions hits either the size or count error limit, it will automatically be force-terminated by the system. So one potential way to recover from this situation is to call ResetWorkflowExecution after increasing the size/count limits on the server.
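To illustrate the ContinueAsNew idea, here is a plain-Python simulation that does not use the Temporal SDK. Each "run" processes signals until a per-run event budget is spent, then hands its state and the unprocessed signals to a fresh run. The function names, the budget, and the assumption of two history events per signal are all hypothetical; in a real workflow the event count would come from the SDK and the hand-off would be an actual ContinueAsNew call.

```python
from typing import List, Tuple


def run_until_budget(
    signals: List[int], total: int, max_events: int, events_per_signal: int = 2
) -> Tuple[int, List[int]]:
    """Simulate one run: consume signals until the event budget is used up.

    Returns the updated state and the signals left for the next run.
    """
    events = 0
    remaining = list(signals)
    while remaining and events + events_per_signal <= max_events:
        total += remaining.pop(0)   # "handle" the signal by updating state
        events += events_per_signal  # assumed history cost of one signal
    return total, remaining


def run_with_continue_as_new(
    signals: List[int], max_events: int, events_per_signal: int = 2
) -> Tuple[int, int]:
    """Chain simulated runs until all signals are processed.

    Returns (final state, number of runs).
    """
    if max_events < events_per_signal:
        raise ValueError("budget too small to process even one signal")
    total, runs = 0, 0
    remaining = list(signals)
    while True:
        runs += 1
        total, remaining = run_until_budget(
            remaining, total, max_events, events_per_signal
        )
        if not remaining:
            return total, runs
        # Here a real workflow would continue as new, passing `total` along.


final_total, run_count = run_with_continue_as_new([1, 2, 3, 4, 5], max_events=4)
print(final_total, run_count)
```

The point of the pattern is that no single run's history grows past the chosen budget, while the accumulated state survives across runs.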
