Hi,
I have a project very similar to the one described here: we have a single workflow with two activities. The first activity is responsible for sending an HTTP request to a downstream service (~100 ms response time), and the second is responsible for publishing the HTTP response status to NATS.
Currently, on a local M1 machine, we get 16 workflows/sec with Cassandra, but for production we need something around 5,000 workflows/sec.
We need to execute around 5 million HTTP requests in 15 minutes.
I have a few questions:
Is Temporal a good fit for such a project with these performance requirements?
If Temporal is a good fit, what are your recommendations for the production resources we would need?
Are local activities a better option for us (to improve performance)?
In terms of performance, do you recommend using two separate activities for sending the HTTP request and publishing the status code to NATS, or would it be better to use a single activity for the whole functionality?
> We need to execute around 5 million HTTP requests in 15 minutes.
Temporal is capable of this type of load (~5.5K activities/second, ~22-25K state transitions/second). However, it would come at the cost of having to set up a pretty large Cassandra cluster, in my opinion.
For simple “fire and forget” short HTTP calls, you might want to consider using local activities, which would definitely reduce this cost.
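For illustration, here is a minimal sketch of what the HTTP call could look like as a local activity with the Temporal Go SDK. The names (`CallDownstream`, `SampleWorkflow`) and the timeout/retry values are hypothetical, not from this thread:

```go
package sample

import (
	"context"
	"net/http"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// CallDownstream is a hypothetical activity function that performs the
// HTTP request and returns the response status code.
func CallDownstream(ctx context.Context, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err // network errors are retried per the retry policy
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// SampleWorkflow runs the HTTP call as a local activity, which executes
// inside the worker process and avoids the extra server round trips
// (and state transitions) of a regular activity.
func SampleWorkflow(ctx workflow.Context, url string) error {
	ctx = workflow.WithLocalActivityOptions(ctx, workflow.LocalActivityOptions{
		ScheduleToCloseTimeout: 5 * time.Second, // illustrative value
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: 100 * time.Millisecond,
			MaximumAttempts: 3,
		},
	})

	var status int
	if err := workflow.ExecuteLocalActivity(ctx, CallDownstream, url).Get(ctx, &status); err != nil {
		return err
	}
	// The NATS publish could be a second (local) activity here.
	return nil
}
```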
I think the question for you is: what is your use case? Do you care about the durability of these calls (which is one of the main reasons to use Temporal here, in my opinion)? What should happen if the first HTTP call fails: do you need retries, or can you just forget that it failed?
I'm trying to understand your use case and needs to be able to tell whether Temporal would be a good fit for it.
Thanks for your response.
Each HTTP call is a single job; in case of failure (network error) we need to retry it, and for each response we need to send the response status to NATS. So each workflow has two activities: sending the HTTP call and publishing the status to NATS. We prefer to be resilient against any failure, especially for publishing the status to NATS, but for the HTTP request we only need to retry on network failure. The durability of the call is important, but with reasonable resource usage; if it costs us a heavy cluster, we would prefer to handle the failures in another service (the publisher of the jobs).
I have another question: what happens if we aggregate 100 thousand payloads and send them as a single bulk request? For example, instead of calling an endpoint 5 million times, we could create 50 workflows, each holding 100 thousand records. Is it okay for Temporal to have workflows with huge state (100 thousand records)?
What is the source of data for these HTTP requests? Have you considered making these requests from an activity that directly reads the source and makes both the HTTP and NATS requests in a loop? This activity could record its progress in a heartbeat. On failure, the data from the last heartbeat can be accessed when the activity is retried.
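A hedged sketch of what such a heartbeating activity could look like in the Go SDK; the `publishStatus` placeholder and taking the URL list as an argument (rather than reading the source directly) are assumptions for illustration:

```go
package sample

import (
	"context"
	"log"
	"net/http"

	"go.temporal.io/sdk/activity"
)

// publishStatus is a placeholder for the real NATS publish call.
func publishStatus(ctx context.Context, url string, status int) error {
	log.Printf("%s -> %d", url, status)
	return nil
}

// ProcessRecords makes the HTTP and NATS calls for a batch of records in a
// loop, heartbeating the index so a retried attempt can resume mid-batch.
func ProcessRecords(ctx context.Context, urls []string) error {
	startIndex := 0
	// On retry, pick up where the last attempt left off.
	if activity.HasHeartbeatDetails(ctx) {
		if err := activity.GetHeartbeatDetails(ctx, &startIndex); err != nil {
			return err
		}
	}

	for i := startIndex; i < len(urls); i++ {
		resp, err := http.Get(urls[i])
		if err != nil {
			return err // network error: the activity retry policy kicks in
		}
		resp.Body.Close()
		if err := publishStatus(ctx, urls[i], resp.StatusCode); err != nil {
			return err
		}
		// Record progress so a retry resumes at i+1 instead of 0.
		activity.RecordHeartbeat(ctx, i+1)
	}
	return nil
}
```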
We have another service that reads the records from the DB and creates a new workflow for each record. This flow is too slow (creating 100 thousand workflows takes around 3 minutes), so maybe your solution helps us with this issue.
But as I said, our main concern is the cluster resources we would need to handle this heavy workflow load.
Okay, cool, but what about concurrent requests? Because of the HTTP response time, doing all the HTTP requests in a loop makes it too slow. Does Temporal support goroutines or other concurrency paradigms, so I can make concurrent requests in the loop without having to wait for the last request to finish?
An activity doesn’t have any limitation on the type of code it can run, so you can use multiple goroutines to implement it. But I would recommend starting multiple such activities from a workflow in parallel.
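For example, a workflow can fan out by starting all activity futures before waiting on any of them. This sketch reuses the hypothetical `ProcessRecords` activity from above; the chunk size and timeouts are illustrative, and `HeartbeatTimeout` must be set for the heartbeat-based resume to work:

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// FanOutWorkflow splits the URLs into chunks and runs one ProcessRecords
// activity per chunk, all in parallel.
func FanOutWorkflow(ctx workflow.Context, urls []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		HeartbeatTimeout:    30 * time.Second, // needed for heartbeats to be enforced
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 3},
	})

	const chunkSize = 1000 // illustrative
	var futures []workflow.Future
	for start := 0; start < len(urls); start += chunkSize {
		end := start + chunkSize
		if end > len(urls) {
			end = len(urls)
		}
		// ExecuteActivity returns immediately, so all chunks run in parallel.
		futures = append(futures, workflow.ExecuteActivity(ctx, ProcessRecords, urls[start:end]))
	}

	// Wait for every chunk to finish.
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```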