QueryWorkflow performance is not good

We currently have many workflows running, and we created a QueryHandler in each of them to show which step they are at.
We query them from our frontend service. However, each query takes about 1 second, which is way too slow (and because we have many workflows that we need to query, we try to use goroutines but it’s not great).
Is there anything we can do to improve QueryWorkflow performance?
The docs mention an eventual consistency mode that may be faster, but it seems it was removed from the code some time ago.

This latency is way higher than expected. But I’m confused by the following statement:

and because we have many workflows that we need to query, we try to use goroutines but it’s not great

Usually, in a UI you query a single workflow. Why do you need to query many workflows simultaneously from the UI? Querying each workflow in a list of workflows is rarely a good idea.

Hi, we faced the same problem,
but we don’t query workflows in a list; we query them one at a time by workflow ID.
QueryWorkflow takes a long time and sometimes times out, returning no data for workflows that are still running or terminated.
Is there any advice on how to improve the performance of QueryWorkflow?
Thank you

Are you able to query these workflows via tctl, for example:

tctl wf query -w <workflow_id> --qt <query_name>

@Meli_Lee Do you use local activities? They can delay query responses.


Querying from tctl does return a result,
but we also built an API that uses QueryWorkflow, and it sometimes takes more than 5s.

In the failed activity itself we only make an HTTP REST call:
err := workflow.ExecuteActivity(ctx, activity1, activityReq).Get(ctx, &activity1Res)
func activity1(activityReq string) {
    res, err := // call http service.client.Post
}

From the workflow sequence, does it show that we use local activity?
We are still trying to figure out whether we use local activities.
Does a local activity mean that we execute an activity inside an activity function?
For example:
func activity1(activityReq string) {
    err := workflow.ExecuteActivity(…).Get(ctx, …)
}

We build our customized retry handling.
When the retries of an activity reach the max attempts, the workflow waits for a retry signal:
ch := workflow.GetSignalChannel(ctx, RetryWorkflow)

Will this impact the performance of QueryWorkflow?
Additional info:
when an activity fails (timeout error, etc.) and we direct the workflow to wait for the signal,
a QueryWorkflow call made to check on it takes 5s. The next immediate call seems to get a cached response and is faster, around 0.2s.
But when we wait about 60s and query again, it takes 5s again.

From the workflow sequence, does it show that we use local activity?

It seems you are not using local activities; here is a sample in the go-samples repo. With local activities you would see a MarkerRecorded event in your history with the markerName property set to “LocalActivity”.
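
As a rough illustration of that check, the sketch below scans a list of history events for such a marker. `HistoryEvent` is a hypothetical, heavily simplified struct (real history events carry many more fields and a different shape), so treat it as a description of the rule rather than code against the actual history API.

```go
package main

import "fmt"

// HistoryEvent is a simplified, hypothetical view of one workflow history
// event. Only the two fields needed for the check are modelled.
type HistoryEvent struct {
	EventType  string
	MarkerName string // only meaningful for MarkerRecorded events
}

// UsesLocalActivities reports whether any event is a MarkerRecorded event
// whose marker name is "LocalActivity".
func UsesLocalActivities(events []HistoryEvent) bool {
	for _, e := range events {
		if e.EventType == "MarkerRecorded" && e.MarkerName == "LocalActivity" {
			return true
		}
	}
	return false
}

func main() {
	// A history with only regular activities, as in the workflow above.
	history := []HistoryEvent{
		{EventType: "WorkflowExecutionStarted"},
		{EventType: "ActivityTaskScheduled"},
		{EventType: "ActivityTaskCompleted"},
	}
	fmt.Println(UsesLocalActivities(history)) // prints false
}
```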


We build our customized retry handling

Could you share your workflow code (and the GetRetrySignal function)? I think it would help to see what could be going on (you can DM me if you don’t want to share it publicly, that’s fine).

IMHO, latency of the query method is really due to the way it works.

  • It needs to pull the execution history from the DB (if not already in memory).

  • Worker needs to replay these events.

  • Finally it returns the requested variable.
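
The three steps above can be pictured with a toy model. This is not how the SDK is implemented; `Event`, `WorkflowState`, `Replay` and `Query` are hypothetical names used only to show that replay consumes stored results instead of running activities again, and that its cost grows with the length of the history.

```go
package main

import "fmt"

// Event is a hypothetical, simplified history event: the recorded result of
// one completed activity.
type Event struct {
	Activity string
	Result   string
}

// WorkflowState is the in-memory state that a query handler would read.
type WorkflowState struct {
	CurrentStep string
	Completed   int
}

// Replay rebuilds workflow state by applying recorded events in order.
// No activity is executed again; only the stored history is consumed.
func Replay(history []Event) WorkflowState {
	var s WorkflowState
	for _, e := range history {
		s.CurrentStep = e.Activity
		s.Completed++
	}
	return s
}

// Query models the three steps: load the history, replay it, return one
// variable from the rebuilt state.
func Query(loadHistory func() []Event) string {
	history := loadHistory() // step 1: fetch history if state is not in memory
	state := Replay(history) // step 2: replay the events
	return state.CurrentStep // step 3: return the requested variable
}

func main() {
	load := func() []Event {
		return []Event{
			{Activity: "validate-order", Result: "ok"},
			{Activity: "charge-card", Result: "ok"},
		}
	}
	fmt.Println(Query(load)) // prints charge-card
}
```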

Also, as mentioned above, you need an active worker to have your query fulfilled. If too many workflows are executing, your workers will be busy, so the delay in the query response depends on how free the workers are.

I’d recommend storing the workflow step information in your own database and reading it from there.

IMHO, Temporal query methods are best used when the data you are reading is frequently updated and infrequently accessed. In that case writing it to a DB would be overkill, and query methods are a good fit.


Hi @Vikas_NS ,

What do you mean by
Worker needs to replay these events

Does it mean the worker actually re-executes those events? Or does Temporal store the result of every activity and just replay using the stored output of each one?

Thanks a lot.

Temporal stores the results of activities as Workflow Execution History and uses them to replay.

Say the information you are interested in is a String of length 10. If you use your own database, you will be writing a String of length 10 and reading a String of length 10.

But if you use Temporal’s query method,
Temporal needs to bring in the whole execution history (the results of activities). This is unnecessary overhead: the amount of data pulled from the Temporal DB depends on the size of your workflow.
It then replays the history and returns your String of length 10.


Thanks a lot, Vikas.

Btw @Vikas_NS ,

What if the workflow logic is changed a little bit? How does the replay process deal with this?

The reason I ask is because

I have a workflow and expose a query handler to check a variable in the Workflow. (indicating workflow status like processing/completed/failed)
Since the response time when using query handlers is not good, I plan to persist this variable to DB as suggested.

To do so, I have to either

    1. Create a new activity to persist this variable to DB after each activity is completed.
    2. Or update current activities to persist this variable.

However, during the transition, I need to support both kinds of workflows (old and new ones). The old ones still use the workflow variable to return the workflow status, and the new ones use the variable from the DB.

What could be a good way for us to support this?
Thanks a lot.

Temporal has the concept of versioning.
Here’s an excellent video on that - Move Fast WITHOUT Breaking Anything - Workflow Versioning with Temporal - YouTube

Your workflow code will house the versioning logic.
If the workflow version is old, it will use the old code.
Otherwise, it will use the new code.
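
To make the branching concrete, here is a toy model of the idea. It is not the SDK: in the Go SDK this role is played by workflow.GetVersion, which also takes a minimum supported version that the model below leaves out. `Execution`, `GetVersion` and `StatusSource` are hypothetical names, and the change ID "persist-status-to-db" is made up for the example.

```go
package main

import "fmt"

// DefaultVersion marks code paths that ran before any version was recorded.
const DefaultVersion = -1

// Execution is a toy model of one workflow execution and the version
// markers stored in its history.
type Execution struct {
	Replaying bool           // true when rebuilding state from an existing history
	Markers   map[string]int // change ID -> recorded version
}

// GetVersion returns the recorded version for a change. An execution that is
// replaying a history without a marker gets DefaultVersion (old code path).
// An execution reaching this point for the first time records maxSupported.
func (e *Execution) GetVersion(changeID string, maxSupported int) int {
	if v, ok := e.Markers[changeID]; ok {
		return v
	}
	if e.Replaying {
		return DefaultVersion
	}
	if e.Markers == nil {
		e.Markers = map[string]int{}
	}
	e.Markers[changeID] = maxSupported
	return maxSupported
}

// StatusSource shows the branch the workflow code would contain.
func StatusSource(e *Execution) string {
	if e.GetVersion("persist-status-to-db", 1) == DefaultVersion {
		return "workflow variable (query handler)"
	}
	return "database row written by an activity"
}

func main() {
	oldRun := &Execution{Replaying: true} // started before the change
	newRun := &Execution{}                // started after the change
	fmt.Println("old:", StatusSource(oldRun))
	fmt.Println("new:", StatusSource(newRun))
}
```

The point of the marker is that a new execution keeps taking the new branch when it is replayed later, while executions that started on the old code never switch mid-flight.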


That’s great. Thanks a lot @Vikas_NS

Just wanted to add that Temporal workers cache workflow executions. Fetching the full history is only needed when the cached state is lost, either because the worker process restarted or because the worker pushed the workflow execution out of its cache in LRU order. See here for more info and tuning.
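
The effect of that cache can be seen in a toy LRU model. `WorkflowCache` and `Touch` are hypothetical names and this is not the SDK's implementation; in the Go SDK the real cache size is, as far as I know, set with worker.SetStickyWorkflowCacheSize. A miss in the model corresponds to the slow path, where the history has to be fetched and replayed.

```go
package main

import (
	"container/list"
	"fmt"
)

// WorkflowCache is a toy LRU cache of workflow executions held by one worker.
type WorkflowCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // workflow ID -> element in order
}

// NewWorkflowCache creates a cache; capacities below 1 are raised to 1.
func NewWorkflowCache(capacity int) *WorkflowCache {
	if capacity < 1 {
		capacity = 1
	}
	return &WorkflowCache{
		capacity: capacity,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

// Touch records that a workflow was needed for a task or a query. It returns
// true on a cache hit. On a miss the workflow is added, evicting the least
// recently used entry when the cache is full.
func (c *WorkflowCache) Touch(workflowID string) bool {
	if el, ok := c.items[workflowID]; ok {
		c.order.MoveToFront(el)
		return true
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
	c.items[workflowID] = c.order.PushFront(workflowID)
	return false
}

func main() {
	cache := NewWorkflowCache(2)
	for _, id := range []string{"wf-1", "wf-1", "wf-2", "wf-3", "wf-1"} {
		fmt.Println(id, "hit:", cache.Touch(id))
	}
}
```

In the demo the last query for wf-1 misses because wf-2 and wf-3 pushed it out, which is the same pattern as a query that is fast right after another one and slow again after the worker has moved on to other workflows.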


thanks a lot @tihomir @Vikas_NS for the explanation
will look more into it