Batch Processing Best Practices

We are looking to migrate some existing processes across to temporal. At a very high level these processes are made up of multiple microservices where the data at each stage is progressively processed and eventually saved to a database. Each data element is passed to the next stage via a queue.

Based on what I have read in different posts, I’d like to confirm that what I have gathered is correct:

  1. Avoid passing the data directly between each stage as this will bloat the history. Instead, an activity should process the data and save it to a file that could then be referenced by the next stage. (Similar to the file processing example)
  2. For long-running Activities, use the heartbeat to report and record the current progress. If the Activity fails, this can then be used to identify where to resume processing. (See the sketch after this list.)
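
A minimal sketch of that heartbeat/resume pattern, assuming the Go SDK; the line-based file format, the placeholder “processing” step, and the name `ProcessFile` are assumptions made for illustration, not part of the question:

```go
package batch

import (
	"bytes"
	"context"
	"os"

	"go.temporal.io/sdk/activity"
)

// ProcessFile is a hypothetical long-running activity. It reads records
// (lines) from inputPath, "processes" them, and appends results to outputPath.
// It heartbeats the index of the next unprocessed record so a retry can
// resume rather than start over, and it returns only the output path (a
// reference) so the workflow history stays small.
func ProcessFile(ctx context.Context, inputPath, outputPath string) (string, error) {
	data, err := os.ReadFile(inputPath)
	if err != nil {
		return "", err
	}
	records := bytes.Split(data, []byte("\n"))

	// If this is a retry, resume from the last heartbeated position.
	start := 0
	if activity.HasHeartbeatDetails(ctx) {
		_ = activity.GetHeartbeatDetails(ctx, &start)
	}

	out, err := os.OpenFile(outputPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return "", err
	}
	defer out.Close()

	for i := start; i < len(records); i++ {
		// Placeholder "processing": upper-case the record.
		if _, err := out.Write(append(bytes.ToUpper(records[i]), '\n')); err != nil {
			return "", err
		}
		// Record progress; these details become available to the next
		// attempt via GetHeartbeatDetails.
		activity.RecordHeartbeat(ctx, i+1)
	}

	return outputPath, nil
}
```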

The questions I would like to clarify:
In a long-running Activity with a heartbeat, I’d also like to write the current progress to a DB table. This DB call could fail, so it should ideally be in its own Activity. How could you structure this so that:
a. You get the benefit of Temporal automatically retrying the DB call if it fails, but also
b. You don’t force the long-running Activity to retry until the DB call has been retried ‘x’ times.

I assume that the long-running Activity should not call another Activity from within it?

Avoid passing the data directly between each stage as this will bloat the history.

This is true if the data is potentially large (over hundreds of kilobytes).

In a long running Activity with the heartbeat, I’d also like to write the current progress to a db table.

What is the reason for writing it to the DB table as well? You can always query the latest activity heartbeat directly from Temporal using the DescribeWorkflowExecution API. If you really want to do it, you’ll have to do it from the same activity implementation and deal with errors yourself.
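
For example, a minimal sketch of reading the latest heartbeat details through the Go SDK client; the workflow ID, the local server address implied by the default options, and the `int` progress type are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/converter"
)

func main() {
	// Connects to localhost:7233 by default (recent Go SDK versions).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// "batch-workflow-id" is a placeholder for a running batch workflow;
	// an empty run ID means the latest run.
	resp, err := c.DescribeWorkflowExecution(context.Background(), "batch-workflow-id", "")
	if err != nil {
		log.Fatal(err)
	}

	// Each pending activity carries the details from its last heartbeat.
	for _, pa := range resp.PendingActivities {
		if pa.HeartbeatDetails == nil {
			continue
		}
		var progress int
		if err := converter.GetDefaultDataConverter().FromPayloads(pa.HeartbeatDetails, &progress); err != nil {
			log.Printf("decode heartbeat for %s: %v", pa.ActivityId, err)
			continue
		}
		fmt.Printf("activity %s progress: %d\n", pa.ActivityId, progress)
	}
}
```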

Thanks maxim for the response.

The table provides the status of all current and historical jobs that have been run. If the jobs in Temporal are archived after a period of time, then we would lose this information.

I’ll have a look at the API you have suggested. I guess we would need to see how quickly the history grows and tailor an archival policy around that. Is there an API that allows you to query things like the size of the history logs, etc., or is it just a case of monitoring how quickly the storage fills and setting a cleanup policy around that?

Such a table makes perfect sense, but updating it while an activity is running is overkill IMHO. Update it when the workflow is completed (as a last activity) and query the workflow directly while it is running.
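
A minimal sketch of that shape, assuming the Go SDK: the processing activity runs first, and a separate, hypothetical `RecordJobStatus` activity writes the final status with its own retry policy, so a failing DB write retries on its own rather than forcing the long-running activity to rerun. `ProcessFile` is the activity sketched earlier; the timeouts and retry values are illustrative only.

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// BatchWorkflow runs the long-running processing activity and then, as its
// last step, records the final job status in the status table.
func BatchWorkflow(ctx workflow.Context, inputPath string) error {
	processOpts := workflow.ActivityOptions{
		StartToCloseTimeout: 4 * time.Hour,
		HeartbeatTimeout:    time.Minute, // detect a stuck worker quickly
	}
	var outputPath string
	err := workflow.ExecuteActivity(
		workflow.WithActivityOptions(ctx, processOpts),
		ProcessFile, inputPath, "/tmp/output.dat",
	).Get(ctx, &outputPath)

	status := "COMPLETED"
	if err != nil {
		status = "FAILED"
	}

	// The DB write gets its own, tighter retry policy; if it fails it retries
	// on its own without affecting the processing activity above.
	dbOpts := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Second,
			MaximumAttempts: 10,
		},
	}
	// "RecordJobStatus" is a hypothetical registered activity that performs
	// the DB write of (workflow ID, final status).
	if dbErr := workflow.ExecuteActivity(
		workflow.WithActivityOptions(ctx, dbOpts),
		"RecordJobStatus", workflow.GetInfo(ctx).WorkflowExecution.ID, status,
	).Get(ctx, nil); dbErr != nil {
		return dbErr
	}
	return err
}
```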

Is there an API that allows you to query things like the size of the history logs, etc., or is it just a case of monitoring how quickly the storage fills and setting a cleanup policy around that?

I don’t think such an API exists at this point.

Thanks maxim for the response.

That’s a good suggestion: use Temporal for querying ‘active’ workflows and the table for all historical information.