How to handle backend service failure with non-retryable activities

Hi, I just started using Temporal for a new Quarkus project where we are going to have four backend REST services that need to be called in sequence.

This service will provide a self-service web UI portal where a user can specify some criteria and one or more gift cards associated with that user to purchase licenses.

The services we will be calling on the backend via HTTP/REST are:

  • Support Site - Verify the user exists in our DB
  • Factory - Verify the requested serial numbers are in our production DB
  • GiftCard - Handles the gift card transactions (DEBIT/CREDIT/Compensation)
  • License - Generates the licenses to be returned to the user
  • License Gateway - Entrance into the workflow process called by REST

Right now, I have each service call in its own Activity, which uses the MicroProfile REST Client.

In my License Gateway, I have a Workflow implementation that calls each Activity in sequence and checks for errors and validations before calling the next step in the process.

When it comes to charging the Gift Cards, we allow the user to specify a list of cards so I have to loop through each one and call the “gift card” activity to send the request to the server.

If there are any issues reserving the credits from the card, I fail the whole transaction and trigger the Saga compensation and then fail the workflow.

Our process is all-or-nothing: if any validation or system error occurs, the whole process is cancelled and the user sees a friendly error message.

This all seems to be working OK, and the compensations work nicely: if one card is charged but the second or third one fails, I see the previously charged cards compensated.

One issue I ran into during ad-hoc testing came up when I killed the gift card service AFTER a successful transaction happened for one card, so the second card in the list couldn’t be charged because the gift card service was dead.

My workflow is set to time out after 30 secs, so because my gift card service was still down the whole workflow failed, BUT the compensation for the first successful gift card transaction never happened because the gift card service was still down.

Also, I don’t let my activities retry on any business exception we define. Anything like an input validation failure is considered “unrecoverable” and we fail the activity with a non-retryable exception.

Anything other than our business exceptions is retried up to the 30 second workflow timeout.
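
For illustration, the retry behaviour described above could be expressed roughly as follows with the Temporal Java SDK. This is only a sketch under assumptions: the class name GiftCardRetryConfig, the failure type name "BusinessValidationError", and the helper invalidInput are made up for the example, and the 30 second budget is applied here as an activity-level ScheduleToClose timeout rather than a workflow timeout.

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.failure.ApplicationFailure;
import java.time.Duration;

// Hypothetical holder class; none of these names come from the original project.
public class GiftCardRetryConfig {

    // Assumed failure type name used for business/validation errors.
    static final String BUSINESS_ERROR = "BusinessValidationError";

    // Anything that is not a business error is retried with exponential backoff
    // until 30 seconds have passed since the activity was scheduled.
    static final ActivityOptions OPTIONS = ActivityOptions.newBuilder()
            .setScheduleToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(RetryOptions.newBuilder()
                    .setBackoffCoefficient(2)
                    // Failures of this type are not retried by the retry policy.
                    .setDoNotRetry(BUSINESS_ERROR)
                    .build())
            .build();

    // Inside an activity: build a failure that is non-retryable regardless of the policy.
    static ApplicationFailure invalidInput(String message) {
        return ApplicationFailure.newNonRetryableFailure(message, BUSINESS_ERROR);
    }
}
```

The two mechanisms overlap on purpose here: either the activity marks the failure as non-retryable when it throws, or the caller lists the failure type in the retry options. One of them is usually enough.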

My questions are:

  1. Is this system design suitable for our architecture? I do have access to the code for each service since I’m the one writing them. Some are in production but I could update them if needed. They will need to be changed anyway to add the new REST endpoints we need in our workflow.

  2. I think I saw in the forum a suggestion to make each backend service run as an Activity Worker so they can listen to the Task Queues directly and we don’t need to directly call/retry the backend services?

If #2, then what happens when the “Gift Card” service is down, for example? How will my workflow know this and let the user know so we can fail the process like we do now?

We would like a “synchronous” approach so the user can see “real-time” results and either get a list of licenses or an error and we don’t want to tell them to “check later for results”, etc.

I suppose we could do that if it makes things much more robust. I’m still wrapping my head around the best approach to design all these moving parts we need to call.

Any help/ideas appreciated…

Don’t use the workflow timeout for business-level timeouts. The workflow timeout is essentially a terminate (like kill -9), which by design doesn’t allow any cleanup code to execute.

I recommend using an update to wait for the workflow to reach a certain stage in its execution. I would block the update call on

Workflow.await(Duration.ofSeconds(30), () -> /* whatever condition that unlocks the wait */)

Hi @maxim,

Thanks for the response, but I’m afraid I’m not sure what you mean and how this would help my scenario.

I misspoke: I don’t have any timeouts set on my Workflow, only on my Activities. I let the activities retry up to 30 seconds before failing the activity.

private final static ActivityOptions options = ActivityOptions.newBuilder()
            .setScheduleToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(RetryOptions.newBuilder()
                    .setBackoffCoefficient(2)
                    .build())
            .build();

I understand the part about not setting a Workflow timeout, but I’m unclear how the “update” would help me in the case where my backend service dies before the workflow calls the compensation routine to reverse my gift card charge.

My workflow calls my backend service via a “GiftCardActivity” which makes a REST call to a microservice running on another machine.

I am thinking about running each microservice as an Activity worker on each machine since it sounds like that might be better for resiliency?

If I understand correctly, if the Workflow worker tries to run the GiftCardActivity and that service is down, it will be completed when the remote worker service comes back online due to Temporal putting the activity request in the Task Queue?

If so, are there examples of running Activities in their own Worker process and the Workflow in another machine? Or do they both have to be run together? In which case that means I need to globally share the Workflow implementation across projects?

I’m a little confused if I can run Activities on one machine and the Workflow implementation on another.

What I’d ultimately like to do (if possible) is:

  • User calls REST endpoint to kick off workflow
  • Workflow starts
    • Calls SupportSiteActivity - OK response
    • Calls BCFactoryActivity - OK response
    • Calls VerifyGiftCardModelsActivity - OK response
    • Loops through each gift card by calling GiftCardActivity - Error after first card is successful because backend service is dead
    • Wait for 30 seconds for response from GiftCardActivity
    • Timeout activity since it’s dead

Now, the part I’m not sure of: I want the workflow to stay running rather than be marked as failed, so it can try the gift card rollback when the service comes up, AND return a response back to the user.

I can’t seem to have a workflow keep running AND return a response back to the user.

Also, when I restart my giftcard service the workflow doesn’t try to run the compensations it couldn’t run previously due to the giftcard service being down.

I did manage to get my giftcard service running as a separate worker on another machine with a separate queue name, but I am still seeing the same issues as with the activity running locally.

Am I thinking about this the wrong way? It is sort of a sync/async paradigm where we want synchronous method calls to the services BUT also want the workflow to pick up where any service errors happened (if they do) AND be able to return a response to the user so we can notify them that the process failed and will continue cleanup when the services come back online.

Below is my Workflow which is kicked off from a REST call (Gateway):

Gateway

@Path("")
public class LicensingGateway {

    @ConfigProperty(name = "temporal.purchase.license.task.queue")
    String taskQueue;

    @Inject
    Logger log;

    @Inject
    WorkflowApplicationObserver observer;

    @Path("/purchase")
    @Timed
    @POST
    @AddingSpanAttributes
    public Response purchaseLicenses(@Valid @SpanAttribute(value = "http.payload") PurchaseLicenseRequest request) {
        log.infof("Attempting to purchase licenses with request [%s]", request);

        
        UUID requestId = UUID.randomUUID();
        PurchaseLicenseWorkflow workflow = observer.getClient().newWorkflowStub(
                PurchaseLicenseWorkflow.class, WorkflowOptions.newBuilder()
                        .setWorkflowId("PurchaseLicenseRequest-" + requestId.toString())
                        .setTaskQueue(taskQueue).build()
        );

        PurchaseLicenseResponse output = workflow.purchaseLicenses(requestId, request);
        return Response.ok(output).build();

    }
}

WorkFlow

public class PurchaseLicenseWorkflowImpl implements PurchaseLicenseWorkflow {

    private static final Logger log = Workflow.getLogger(PurchaseLicenseWorkflowImpl.class);
    
    SupportSiteActivities supportSiteActivity = ActivityStubsProvider.getSupportSiteActivities();
    GiftCardActivities giftCardActivity = ActivityStubsProvider.getGiftCardActivities();
    LicenseServiceActivities licenseActivity = ActivityStubsProvider.getLicenseActivities();
    BCFactoryActivities bcfactoryActivity = ActivityStubsProvider.getBCFactoryActivities();
    VerificationActivities verificationActivity = ActivityStubsProvider.getVerificationActivities();

    @Override
    public PurchaseLicenseResponse purchaseLicenses(UUID requestId, PurchaseLicenseRequest request) {
        
        Objects.requireNonNull(requestId, "A unique request id is required");
        Objects.requireNonNull(request, "PurchaseLicenseRequest is required");
        
        // Controls the rollbacks
        Saga saga = new Saga(new Saga.Options.Builder().build());

        try {

            // verify user
            // Will throw exception if user not found
            log.debug("Calling Support Site to verify user {}", request.getOwnerEmail());
            supportSiteActivity.verifyUser(request.getOwnerEmail());

            // verify the serials exist
            // and group them by model AG1->SERIALS, etc
            // will throw exception if any serials not found
            log.debug("Calling BCFactory to group serial numbers by model");
            GroupSerialsByModelResponse groupedSerials = bcfactoryActivity.groupSerialsByModel(request.getSerials());

            // Verify that we have the proper serial models to gift card types
            // in the request
            // will throw exception if any mismatches
            log.debug("Verifying gift card and serial model matches");
            verificationActivity.verifySerialModelsToGiftCardTypes(groupedSerials, request.getGiftCards());
            
            // Debit the cards
            // If one fails, they all fail and will
            // be rolled back
            log.debug("Calling Gift Card service to debit {} gift cards", request.getGiftCards().size());
            debitGiftCards(request, saga, requestId);

            // Generate the licenses
            // if this fails, will cause a rollback of gift card transactions
            log.debug("Calling license service to generate licenses");
            Map<String, String> licenses = licenseActivity.generateLicenses(request).getLicenses();
            
            // The response
            PurchaseLicenseResponse lr = new PurchaseLicenseResponse();
            lr.setOwnerEmail(request.getOwnerEmail());
            lr.setExpirationDate(request.getExpirationDate());
            lr.setRequestedSerials(request.getSerials());
            lr.setLicenses(licenses);

            log.debug("Process complete..Returning response - {}", lr);
            return lr;
            
        } catch (Exception e) {
            saga.compensate();
            throw ApplicationFailure.newFailureWithCause(e.getMessage(), "RUNTIME ERROR!", e);
        }
    }

    /**
     * Process the gift cards specified
     *
     * @param request
     * @param saga
     * @param requestId
     */
    private void debitGiftCards(PurchaseLicenseRequest request, Saga saga, UUID requestId) {

        // Charge each card
        for (GiftCard giftCard : request.getGiftCards()) {
            
            // Create new request
            DebitAccountRequest r = new DebitAccountRequest();
            r.setCardId(giftCard.getId());
            r.setCredits(giftCard.getCredits());
            r.setOwnerEmail(request.getOwnerEmail());
            r.setInitiatedBy(request.getOwnerEmail());
            r.setCardType(giftCard.getType());

            // Perform operation
            giftCardActivity.debitGiftCard(requestId, r);
            
            // Setup the compensations in case of failures
            saga.addCompensation(() -> giftCardActivity.reverseGiftCardTransaction(requestId, giftCard.getId()));
        }
    }
}

We are working on an “UpdateWithStart” feature that will provide this functionality out of the box. The current workaround is for the client to:

  1. StartWorkflow
  2. Call Update
  3. When the update returns, return its result to the user

Workflow implementation:

  1. Block update handler on a Promise that is a field of the workflow.
  2. Workflow executes the sequence of activities.
  3. In the exception handler that surrounds the sequence of activities, resolve the promise with the failure. This unblocks the update handler and notifies the client that the transaction failed.
  4. Run the compensations as long as needed.
  5. If the sequence of activities didn’t fail, resolve the promise with an appropriate result that unblocks the client.
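
To make the ordering in the two lists above concrete, here is a plain-Java analogy of the handoff. It is deliberately not Temporal SDK code, so it can be run on its own: the CompletableFuture stands in for the workflow Promise field (Workflow.newPromise() in the Java SDK) that an @UpdateMethod handler would block on, and the repeated compensate() calls stand in for the activity retries Temporal would perform. All class, field, and method names are made up for the sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Plain-Java analogy of the update/promise handoff; not Temporal SDK code. */
class PurchaseFlowSketch {

    /** Stand-in for the gift card service; it goes down after a set number of calls. */
    static class GiftCardService {
        int callsUntilOutage = Integer.MAX_VALUE;
        final List<String> debited = new ArrayList<>();

        private void call() {
            if (callsUntilOutage == 0) {
                throw new IllegalStateException("gift card service is down");
            }
            callsUntilOutage--;
        }

        void debit(String cardId) { call(); debited.add(cardId); }
        void reverse(String cardId) { call(); debited.remove(cardId); }
        void restart() { callsUntilOutage = Integer.MAX_VALUE; }
    }

    // Stand-in for the workflow field that the update handler blocks on.
    final CompletableFuture<String> clientResult = new CompletableFuture<>();
    // Compensations still owed, most recent first.
    final Deque<String> pendingReversals = new ArrayDeque<>();
    final GiftCardService giftCards;

    PurchaseFlowSketch(GiftCardService giftCards) {
        this.giftCards = giftCards;
    }

    /** Run the sequence, then resolve the caller-visible result either way. */
    void run(List<String> cardIds) {
        try {
            for (String cardId : cardIds) {
                giftCards.debit(cardId);
                pendingReversals.push(cardId);
            }
            clientResult.complete("licenses issued");
        } catch (RuntimeException e) {
            // Unblock the caller first; cleanup continues afterwards.
            clientResult.completeExceptionally(e);
            compensate();
        }
    }

    /** One compensation pass; returns true once nothing is owed any more. */
    boolean compensate() {
        while (!pendingReversals.isEmpty()) {
            try {
                giftCards.reverse(pendingReversals.peek());
                pendingReversals.pop();
            } catch (RuntimeException stillDown) {
                return false; // service still down, try again later
            }
        }
        return true;
    }

    public static void main(String[] args) {
        GiftCardService service = new GiftCardService();
        service.callsUntilOutage = 1; // the service dies right after the first debit
        PurchaseFlowSketch flow = new PurchaseFlowSketch(service);
        flow.run(List.of("card-1", "card-2"));
        System.out.println("caller saw failure: " + flow.clientResult.isCompletedExceptionally());
        System.out.println("cleanup done while down: " + flow.compensate());
        service.restart();
        System.out.println("cleanup done after restart: " + flow.compensate());
    }
}
```

In a real workflow the retry loop would not be hand-written. The compensation activity would presumably get its own ActivityOptions without the 30 second ScheduleToClose cap, so that Temporal keeps retrying it until the gift card service is back, while the caller has already received the failure through the update.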

“I did manage to get my giftcard service running as a separate worker on another machine with a separate queue name, but I am still seeing the same issues as with the activity running locally.”

Where to run workflows and activities is orthogonal to the workflow logic. This is purely a deployment decision that doesn’t really affect the core design.
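
As a rough illustration of that deployment choice, a process that hosts only the gift card activities could look like the sketch below. The task queue name "gift-card-activities", the simplified GiftCardActivities signatures, and the class names are assumptions for the example; the calls themselves are the standard Temporal Java SDK worker setup, and newLocalServiceStubs() assumes a Temporal service reachable on localhost.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import java.time.Duration;

// Hypothetical, simplified activity contract. Only this interface has to be
// shared with the project that contains the workflow implementation.
@ActivityInterface
interface GiftCardActivities {
    void debitGiftCard(String requestId, String cardId, int credits);
    void reverseGiftCardTransaction(String requestId, String cardId);
}

// Placeholder implementation living inside the gift card service.
class GiftCardActivitiesImpl implements GiftCardActivities {
    @Override
    public void debitGiftCard(String requestId, String cardId, int credits) {
        // call the gift card service's own business logic here
    }

    @Override
    public void reverseGiftCardTransaction(String requestId, String cardId) {
        // reverse the earlier debit here
    }
}

public class GiftCardWorkerMain {

    // Assumed task queue name, polled only by the gift card service.
    static final String GIFT_CARD_TASK_QUEUE = "gift-card-activities";

    // Workflow side: options that route the activity to the queue above.
    // The 10 second StartToClose timeout is an arbitrary example value.
    static ActivityOptions giftCardOptions() {
        return ActivityOptions.newBuilder()
                .setTaskQueue(GIFT_CARD_TASK_QUEUE)
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .build();
    }

    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);

        // This worker registers activities only; no workflow types are registered.
        Worker worker = factory.newWorker(GIFT_CARD_TASK_QUEUE);
        worker.registerActivitiesImplementations(new GiftCardActivitiesImpl());
        factory.start();
    }
}
```

The workflow worker can run on a different machine and poll a different task queue. The workflow code stays the same apart from passing options like giftCardOptions() when it creates the activity stub.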