Our use case: we need to schedule and run many one-off jobs/activities.
Each activity goes out, hits an API, and reports some metrics about that call. Fairly simple, but we may have many millions of these jobs waiting to happen (scheduled) or executing.
i.e. I schedule something to run at 6:00 pm tomorrow and this activity will never run again. It's not a recurring cron-style job, just a one-off, and it might be scheduled months in advance.
EDIT: For additional information, the requests are more or less "the same" (they differ in the URL, but otherwise not by much). So we could theoretically drive them from a database that the activity pulls the details from at execution time (reducing code for activities).
I would execute the workflow with the schedule date/time as an argument. In the workflow implementation, calculate the difference between the current date/time and the schedule date/time; that difference is the time to sleep:
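Here's a minimal Go sketch of that (the workflow/activity names and the one-minute timeout are just illustrative):

```go
package app

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// OneOffWorkflow sleeps until runAt and then performs the API call exactly
// once. "CallAPIActivity" is a placeholder name for whatever activity makes
// the actual request.
func OneOffWorkflow(ctx workflow.Context, runAt time.Time, url string) error {
	// workflow.Now is the deterministic replacement for time.Now in workflow code.
	if delay := runAt.Sub(workflow.Now(ctx)); delay > 0 {
		if err := workflow.Sleep(ctx, delay); err != nil {
			return err
		}
	}
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	return workflow.ExecuteActivity(ctx, "CallAPIActivity", url).Get(ctx, nil)
}
```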
Okay, I think I understand based on the Workflow lifecycle. For these API requests, would I also be able to pass the URL, headers, etc. in as arguments?
My only sticking point is that we also need to do a handshake with another microservice and send an Authorization Bearer token across (the token expires every 12 hours, so we'd want to retrieve a fresh one about every 6 hours). In an ideal world, we'd have an activity to refresh and cache that token, and it would be distributed to the other workers so that whatever goes out at execution time carries the newest token.
On the first part: yes, pass in all the input you need during the execution as arguments.
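As an illustration, the per-call details could be bundled into a single struct that the activity unpacks at execution time; this is just a sketch with assumed field names:

```go
package app

import (
	"context"
	"net/http"
)

// APIRequest carries everything the activity needs at execution time.
// The field names are assumptions; include whatever your calls vary on.
type APIRequest struct {
	URL     string
	Method  string
	Headers map[string]string
}

// CallAPIActivity performs the one-off HTTP call described by its input.
func CallAPIActivity(ctx context.Context, req APIRequest) error {
	httpReq, err := http.NewRequestWithContext(ctx, req.Method, req.URL, nil)
	if err != nil {
		return err
	}
	for k, v := range req.Headers {
		httpReq.Header.Set(k, v)
	}
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Record whatever metrics you need about the call here
	// (status code, latency, etc.).
	return nil
}
```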
For the tokens, it depends a bit on which service you are accessing. For instance, my worker runs on Google Kubernetes Engine (GKE). I use Workload Identity to run my worker, so it always has valid access to the required services through the Google Cloud SDKs.
For us, it's internal microservices that all use Okta for authentication. So we have to include the Authorization Bearer token in every call at execution time (it cannot be obtained at scheduling time). Getting a new authorization token on every call would bring down all our services, so that's not an option.
So I guess what I'm asking is: can an activity get the token from a workflow?
I suppose I’d want to use something like a local activity to get this information (if I understand right).
I think it is more common to keep the token not per workflow, but per activity worker process, which is shared by multiple workflows. The process can then manage the token refresh on its own internal schedule.
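As a rough Go sketch of that pattern (the TokenProvider type, the fetch callback standing in for your Okta handshake, and the 6-hour interval are all assumptions for illustration):

```go
package app

import (
	"context"
	"sync"
	"time"
)

// TokenProvider caches a bearer token for the whole worker process and
// refreshes it on a fixed interval. The fetch callback stands in for the
// Okta handshake; its shape is an assumption.
type TokenProvider struct {
	mu    sync.RWMutex
	token string
	fetch func(context.Context) (string, error)
}

// NewTokenProvider fetches an initial token and starts a background
// refresher that runs until ctx is cancelled.
func NewTokenProvider(ctx context.Context, fetch func(context.Context) (string, error)) (*TokenProvider, error) {
	p := &TokenProvider{fetch: fetch}
	if err := p.refresh(ctx); err != nil {
		return nil, err
	}
	go func() {
		// Refresh every 6 hours, well inside the 12-hour expiry window.
		ticker := time.NewTicker(6 * time.Hour)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				_ = p.refresh(ctx) // add retries/alerting as appropriate
			}
		}
	}()
	return p, nil
}

func (p *TokenProvider) refresh(ctx context.Context) error {
	t, err := p.fetch(ctx)
	if err != nil {
		return err
	}
	p.mu.Lock()
	p.token = t
	p.mu.Unlock()
	return nil
}

// Token returns the current cached token; activities call this at
// execution time to build the Authorization header.
func (p *TokenProvider) Token() string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.token
}
```

One way to expose this to activities in the Go SDK is to register the activities as methods on a struct that holds the provider (worker.RegisterActivity accepts a struct instance), so every activity invocation on that worker reads the same cached token.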
Hello! We are looking at this use case as well. We will sometimes want to schedule a workflow to run far into the future (perhaps months ahead of time). Does calling workflow.Sleep tie up resources on the workflow executor? We expect to have millions of these one-off events at any given time, so this might not scale well.
If it does tie up resources, is there any other way to schedule a one-off run? From some quick experimentation with Temporal, it looks like cron jobs schedule their next run recursively, so the underlying mechanism does seem to be “run once at timestamp”. It seems weird that this is not exposed to users…
| Does calling workflow.Sleep tie up resources on the workflow executor?
Sleep would not tie up worker resources. Workers keep an in-memory cache of workflow executions and can choose to evict executions that are waiting on unblocking conditions / timers, to allow other executions to make progress.
| We expect to have millions of these one-off events at any given time
You would need to test conditions where a large number of timers for these executions fire at the same time: make sure you provision your workers accordingly, set up the right number of task queue partitions, and watch latencies via metrics to confirm that, when this happens, processing completes within the expected times.
@tihomir Thanks for your quick reply! Great to hear that Sleep is efficient, and we will definitely test these large-scale scenarios before going to production. We have considered the case where lots of timers fire at the same time, and plan to pre-scale the worker pool to absorb that load with acceptable latencies.
Regarding task queue partitions, is there any guidance on how to pick the right value? I haven't found a page in the docs that explicitly discusses it. We currently set them to a relatively high, arbitrary value (512), since it seems the value can't easily be changed afterwards, but we'd love to have a deeper understanding. Thanks!
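For reference, here is roughly what our current dynamic config looks like (we believe the relevant server keys are matching.numTaskqueueReadPartitions / matching.numTaskqueueWritePartitions, but please correct us if that's wrong; 512 is our arbitrary pick, not a recommendation):

```yaml
# Temporal server dynamic config (sketch of our current, arbitrary setting)
matching.numTaskqueueReadPartitions:
  - value: 512
matching.numTaskqueueWritePartitions:
  - value: 512
```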