Managing long-running, polling transactions

Hello, I am using Temporal to orchestrate the provisioning of IT infrastructure via a workflow that has at least one long-running transaction involving a “man in the loop” task.

The long-running activity can take anywhere from one hour to a maximum of ten business days to complete. The Temporal workflow interacts with the man in the loop by opening a ticket via an API, which returns a ticket ID that can be used to poll for completion.

The current implementation is a single activity with its start-to-close timeout set to 10 business days (calculated at workflow start); the activity “recoverably fails” until the ticket is completed.
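For readers wondering how such a timeout might be derived, here is a minimal sketch of the business-day arithmetic. The helper name `add_business_days` is hypothetical, it only skips Saturdays and Sundays (public holidays are ignored), and inside a workflow the start time would presumably come from `workflow.now()` rather than the wall clock.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Return `start` advanced by `days` weekdays (Mon-Fri), skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


# Example: a workflow started on Monday 2024-01-01 at 09:00.
start = datetime(2024, 1, 1, 9, 0)
deadline = add_business_days(start, 10)
timeout = deadline - start  # the timedelta to use as the activity timeout
```

Ten business days starting on a Monday span two weekends, so the resulting timeout is 14 calendar days.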

This actually works, but I wonder whether there is a better, more “Temporal” implementation, such as:

  • having the activity not “fail” but just loop and sleep, emitting a heartbeat at each poll.
  • managing the sleep and the retry in the workflow, not in the activity, so the activity would always complete in a short time and a heartbeat would not be needed.
  • other patterns/features I am not aware of

Thanks,
Roberto

Your current approach has one serious issue. Temporal detects activity worker failures only through timeouts. So if your activity has a 10-day timeout, any intermittent failure, like the activity worker crashing before opening the ticket, would go undetected for up to 10 days. I don’t think this is acceptable.

The best solution would be to have an activity with a short timeout create the ticket, and then have the ticketing system notify the workflow through a signal when the ticket is closed.

If notification is not possible, you can have a separate activity that polls for the result. Here are some options for polling.


Thanks. About the issue of failures not being detected: could it be handled by also setting a short heartbeat timeout (say, one minute or so)?
For example:

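import time
from datetime import timedelta
from temporalio import activity, workflow

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        await workflow.execute_activity(
            "open_then_poll_ticket_activity",
            start_to_close_timeout=timedelta(days=10),
            heartbeat_timeout=timedelta(seconds=60))

@activity.defn
def open_then_poll_ticket_activity() -> None:
    # open_ticket() and poll_ticket() are the ticketing API helpers (not shown)
    ticket = open_ticket()
    poll_retries = 0
    while True:
        status = poll_ticket(ticket)
        if status == "completed":
            break
        activity.heartbeat(poll_retries)  # must arrive within the heartbeat timeout
        poll_retries += 1
        time.sleep(30)
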

By the way, I understand that opening a ticket and then polling is best implemented as two separate activities (open and poll, given that signaling is not possible), as they have different semantics and timings.

About the poll, option 1 from the linked post: which start-to-close timeout should be configured for the poll activity? The maximum elapsed time of 10 days (I guess not, as it would have the same problem of failures going undetected), or a short timeout of a few seconds, letting the workflow handle the 10-day limit with a loop? Something like this, perhaps?

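import asyncio
from datetime import timedelta
from temporalio import workflow
@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        ticket = await workflow.execute_activity("open_ticket", start_to_close_timeout=timedelta(seconds=10))
        expire_time = workflow.now() + timedelta(days=10)
        while workflow.now() < expire_time:
            result = await workflow.execute_activity("poll_ticket", ticket, start_to_close_timeout=timedelta(seconds=10))
            if result == "ok":
                break
            await asyncio.sleep(30)  # durable timer, so the loop does not poll back-to-back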

Thanks
Roberto

If you poll from a heartbeating activity, you set a short HeartbeatTimeout (1 minute, for example) and then a 10-day ScheduleToClose timeout, which limits the duration of activity execution, including all retries.