Alternatives: Verify Workflow status before calling Workflow.getResult

I have a flow where the client initiates a new workflow but doesn’t wait for the result inline. Instead it does some other work and comes back later. For example:

WorkflowClient workflowClient = ...;
WorkflowOptions options = ...;
var workflowInstance =
        workflowClient.newWorkflowStub(MyWorkflow.class, options);
WorkflowExecution workflowExecution = WorkflowClient.start(workflowInstance::execute, request);

When that moment arrives, it needs to check the workflow status and, based on that, either get the result or continue doing other work (the idea is not to park the thread). I am using version 1.14.0 of temporal.io/sdk-java, and I came across this snippet, which I am now using:

WorkflowClient workflowClient = ...;
WorkflowExecution workflowExecution = ...;
var stub = workflowClient.newUntypedWorkflowStub(
        workflowExecution.getWorkflowId(), Optional.of(workflowExecution.getRunId()), Optional.empty());
Result result = stub.getResult(1L, TimeUnit.SECONDS, Result.class);
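
For reference, that getResult overload declares a checked TimeoutException, so the "not finished yet" case can be handled explicitly. A minimal sketch of how I handle it (handleResult and tryAgainLater are just placeholders for my own code):

try {
    Result result = stub.getResult(1L, TimeUnit.SECONDS, Result.class);
    handleResult(result);      // workflow finished, use the result
} catch (TimeoutException e) {
    tryAgainLater();           // still running, come back later without parking the thread
}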

The thing is that I see some DEADLINE_EXCEEDED and UNAVAILABLE errors when that 1-second timeout is reached, since the workflow execution is still in progress.

Looking at this response, I can:

  1. Use the query feature to retrieve custom workflow internal state and, based on that state, call getResult.
  2. Continue as I am doing, discarding the DEADLINE_EXCEEDED and UNAVAILABLE errors and treating them as just part of the business logic.
  3. Use the DescribeWorkflowExecution API instead of the custom query from option 1 and, based on the workflow status, call getResult.

I find option 1 a bit unnecessary, since I can use the DescribeWorkflowExecution API to achieve the same thing without adding a new value to store in Temporal.
Option 2 seems risky: I might end up treating unwanted errors as expected ones, and my metrics would become dirty.
Option 3 seems like the best approach. I will probably describe an in-progress workflow execution 5 to 10 times before it actually ends and I can get the result. And since this is a heavily used API, I might end up with more than 1000 requests per minute at peak.
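
For option 3, a rough sketch of what I have in mind with the Java SDK (reusing workflowClient, workflowExecution and stub from the snippets above; as far as I understand, DescribeWorkflowExecution is exposed through the client’s gRPC blocking stub, so correct me if this is not the intended way):

import io.temporal.api.enums.v1.WorkflowExecutionStatus;
import io.temporal.api.workflowservice.v1.DescribeWorkflowExecutionRequest;
import io.temporal.api.workflowservice.v1.DescribeWorkflowExecutionResponse;

// Ask the frontend for the current status of the execution
DescribeWorkflowExecutionRequest describeRequest =
        DescribeWorkflowExecutionRequest.newBuilder()
                .setNamespace(workflowClient.getOptions().getNamespace())
                .setExecution(workflowExecution)
                .build();
DescribeWorkflowExecutionResponse description =
        workflowClient.getWorkflowServiceStubs().blockingStub()
                .describeWorkflowExecution(describeRequest);

WorkflowExecutionStatus status = description.getWorkflowExecutionInfo().getStatus();
if (status != WorkflowExecutionStatus.WORKFLOW_EXECUTION_STATUS_RUNNING) {
    // Execution reached a terminal state, so fetching the result should return immediately
    Result result = stub.getResult(Result.class);
}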

What is the overhead of these solutions?

Thanks!

I see some DEADLINE_EXCEEDED and UNAVAILABLE errors when that 1 second is reached

Can you show details on these errors? Could you also check frontend service logs when this happens?

Do you see any service errors via metrics? Sample Grafana query:
sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)
Latencies (by operation):
histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

Maybe another option could be to use WorkflowStub.getResultAsync? It returns a CompletableFuture which you can handle at a later point as well.
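
Roughly (Result and stub taken from your snippets, handleResult is a placeholder):

// Completes when the workflow finishes; no thread is parked while waiting
CompletableFuture<Result> future = stub.getResultAsync(Result.class);
future.thenAccept(result -> handleResult(result));
// ...or keep the future around and just check future.isDone() on the next poll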

Hard to check, I’m using the docker-compose setup :frowning: I did see those DEADLINE_EXCEEDED and some UNAVAILABLE errors in the client metrics and also in the logs. I didn’t find a way to check the server logs.

I’m quite certain that the DEADLINE_EXCEEDED errors matched the moments when the timeout triggered; about the others I can’t tell.

I did try WorkflowStub.getResultAsync, since I’m using spring-webflux in the application client code. Nevertheless, the issue persists, because I’m using getResultAsync as a way to check whether the workflow has finished or not; if it hasn’t, my app tells the REST client to wait for about 1 second and fire another REST call. It’s essentially a polling mechanism to check whether the workflow has finished and get the corresponding result.
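
A simplified sketch of that poll endpoint (pollResult, the 202 Accepted "still running" convention and the reuse of workflowClient are illustrative, not my exact code):

// Returns the workflow result if it completes within ~1 second,
// otherwise 202 Accepted so the REST client polls again later
public Mono<ResponseEntity<Result>> pollResult(String workflowId) {
    WorkflowStub stub =
            workflowClient.newUntypedWorkflowStub(workflowId, Optional.empty(), Optional.empty());
    return Mono.fromFuture(stub.getResultAsync(Result.class))
            .map(ResponseEntity::ok)
            .timeout(Duration.ofSeconds(1), Mono.just(ResponseEntity.accepted().<Result>build()));
}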

I don’t know Temporal’s internals well enough to tell whether, for this purpose, it’s better to first call DescribeWorkflowExecution just to see whether the workflow has completed and, based on that, fetch the result, instead of waiting for the result with a very short timeout.

Following up on this topic: I finally added some logging to my application to analyze the errors related to the UNAVAILABLE status… What I found was that the Status instance attached to the exception was

Status{code=UNAVAILABLE, description=upstream request timeout, cause=null}

Those seem to happen less frequently than the DEADLINE_EXCEEDED status errors.

All these errors appear on the metric temporal_long_request_failure for the operation GetWorkflowExecutionHistory (from the client’s perspective).

When I go to the dashboard for the FrontEnd Services, I see:

The service_errors_context_cancelled errors seem to match the DEADLINE_EXCEEDED ones, and the UNAVAILABLE errors seem to match service_errors_context_timeout.

I’m still wondering whether calling WorkflowStub.getResult with a low timeout value (around 1 second) is better than calling DescribeWorkflowExecution to check whether the workflow is done before calling WorkflowStub.getResult. I still need to test that change.

The normal duration of a Workflow execution is between 1 and 30 seconds.

Any ideas or opinions?