Batch Processing Best Practices

We are looking to migrate some existing processes across to temporal. At a very high level these processes are made up of multiple microservices where the data at each stage is progressively processed and eventually saved to a database. Each data element is passed to the next stage via a queue.

Based on what I have read in different posts, I’d like to confirm that what I have gathered is correct:

  1. Avoid passing the data directly between each stage as this will bloat the history. Instead, an activity should process the data and save it to a file that could then be referenced by the next stage. (Similar to the file processing example)
  2. For long-running Activities, use the heartbeat to report and record the current progress. If the Activity fails, this can then be used to identify where to resume processing. (See the sketch after this list.)
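
A minimal sketch of that heartbeat/resume pattern, assuming the Go SDK; the line-based file format, the placeholder “processing” step, and the name `ProcessFile` are assumptions made for illustration, not part of the question:

```go
package batch

import (
	"bytes"
	"context"
	"os"

	"go.temporal.io/sdk/activity"
)

// ProcessFile is a hypothetical long-running activity. It reads records
// (lines) from inputPath, "processes" them, and appends results to outputPath.
// It heartbeats the index of the next unprocessed record so a retry can
// resume rather than start over, and it returns only the output path (a
// reference) so the workflow history stays small.
func ProcessFile(ctx context.Context, inputPath, outputPath string) (string, error) {
	data, err := os.ReadFile(inputPath)
	if err != nil {
		return "", err
	}
	records := bytes.Split(data, []byte("\n"))

	// If this is a retry, resume from the last heartbeated position.
	start := 0
	if activity.HasHeartbeatDetails(ctx) {
		_ = activity.GetHeartbeatDetails(ctx, &start)
	}

	out, err := os.OpenFile(outputPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return "", err
	}
	defer out.Close()

	for i := start; i < len(records); i++ {
		// Placeholder "processing": upper-case the record.
		if _, err := out.Write(append(bytes.ToUpper(records[i]), '\n')); err != nil {
			return "", err
		}
		// Record progress; these details become available to the next
		// attempt via GetHeartbeatDetails.
		activity.RecordHeartbeat(ctx, i+1)
	}

	return outputPath, nil
}
```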

The questions I would like to clarify:
In a long-running Activity with a heartbeat, I’d also like to write the current progress to a DB table. This DB call could fail, so it should ideally be in its own Activity. How could you structure this so that:
a. You get the benefit of Temporal automatically retrying the DB call if it fails, but also
b. You don’t force the long-running Activity to retry until the DB call has been retried ‘x’ times.

I assume that the long-running Activity should not call another Activity from within it?

Avoid passing the data directly between each stage as this will bloat the history.

This is true if the data is potentially large (over hundreds of kilobytes).

In a long running Activity with the heartbeat, I’d also like to write the current progress to a db table.

What is the reason for writing it to the DB table as well? You can always query the latest activity heartbeat directly from Temporal using the DescribeWorkflowExecution API. If you really want to do it, you’ll have to do it from the same activity implementation and deal with errors yourself.
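
For example, a minimal sketch of reading the latest heartbeat details through the Go SDK client; the workflow ID, the local server address implied by the default options, and the `int` progress type are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/converter"
)

func main() {
	// Connects to localhost:7233 by default (recent Go SDK versions).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// "batch-workflow-id" is a placeholder for a running batch workflow;
	// an empty run ID means the latest run.
	resp, err := c.DescribeWorkflowExecution(context.Background(), "batch-workflow-id", "")
	if err != nil {
		log.Fatal(err)
	}

	// Each pending activity carries the details from its last heartbeat.
	for _, pa := range resp.PendingActivities {
		if pa.HeartbeatDetails == nil {
			continue
		}
		var progress int
		if err := converter.GetDefaultDataConverter().FromPayloads(pa.HeartbeatDetails, &progress); err != nil {
			log.Printf("decode heartbeat for %s: %v", pa.ActivityId, err)
			continue
		}
		fmt.Printf("activity %s progress: %d\n", pa.ActivityId, progress)
	}
}
```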

Thanks maxim for the response.

The table provides the status of all current and historical jobs that have been run. If the jobs in Temporal are archived after a period of time, then we would lose this information.

I’ll have a look at the API you have suggested. I guess we would need to see how quickly the history grows and tailor an archival policy around that. Is there an API that allows you to query things like the size of the history logs, etc., or is it just a case of monitoring how quickly the storage fills and setting a cleanup policy around that?

Such a table makes perfect sense, but updating it while an activity is running is overkill IMHO. Update it when the workflow is completed (as a last activity) and query the workflow directly while it is running.
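
A minimal sketch of that shape, assuming the Go SDK: the processing activity runs first, and a separate, hypothetical `RecordJobStatus` activity writes the final status with its own retry policy, so a failing DB write retries on its own rather than forcing the long-running activity to rerun. `ProcessFile` is the activity sketched earlier; the timeouts and retry values are illustrative only.

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// BatchWorkflow runs the long-running processing activity and then, as its
// last step, records the final job status in the status table.
func BatchWorkflow(ctx workflow.Context, inputPath string) error {
	processOpts := workflow.ActivityOptions{
		StartToCloseTimeout: 4 * time.Hour,
		HeartbeatTimeout:    time.Minute, // detect a stuck worker quickly
	}
	var outputPath string
	err := workflow.ExecuteActivity(
		workflow.WithActivityOptions(ctx, processOpts),
		ProcessFile, inputPath, "/tmp/output.dat",
	).Get(ctx, &outputPath)

	status := "COMPLETED"
	if err != nil {
		status = "FAILED"
	}

	// The DB write gets its own, tighter retry policy; if it fails it retries
	// on its own without affecting the processing activity above.
	dbOpts := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Second,
			MaximumAttempts: 10,
		},
	}
	// "RecordJobStatus" is a hypothetical registered activity that performs
	// the DB write of (workflow ID, final status).
	if dbErr := workflow.ExecuteActivity(
		workflow.WithActivityOptions(ctx, dbOpts),
		"RecordJobStatus", workflow.GetInfo(ctx).WorkflowExecution.ID, status,
	).Get(ctx, nil); dbErr != nil {
		return dbErr
	}
	return err
}
```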

Is there an API that allows you to query things like the size of the history logs, etc., or is it just a case of monitoring how quickly the storage fills and setting a cleanup policy around that?

I don’t think such an API exists at this point.

Thanks maxim for the response.

That’s a good suggestion: use Temporal for querying ‘active’ workflows and the table for all historical information.