We have been exploring different patterns for integrating Temporal workflows into our existing microservices, which prompted a pattern question that I’m hoping you can clarify.
Currently, we are thinking of keeping our downstream microservices Temporal-agnostic and calling them over gRPC or HTTP. Within our activities, we would make client requests to the downstream services to perform the actual work.
As an example:
// ACTIVITY
func SomeActivity(ctx context.Context, value string) error {
	// Heartbeat in the background while the downstream request is in flight.
	cancelc := make(chan struct{})
	defer close(cancelc)
	go func() {
		ticker := time.NewTicker(1 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-cancelc:
				return
			case <-ticker.C:
				activity.RecordHeartbeat(ctx)
			}
		}
	}()

	// client is our Temporal-agnostic gRPC/HTTP client for the downstream service.
	_, err := client.SomeMethod(ctx, value)
	return err
}
Questions:
Is this an anti-pattern?
What features, if any, would we lose if we took this approach?
Are there areas to look out for that would become more difficult?
Why should this be an anti-pattern? Temporal activities are free to do anything they want. Having said that, this could really become tricky in many cases:
E.g. there is a workflow which invokes an activity that adds or deletes a resource (say a row in a table), and let’s assume this is done through an HTTP PUT or DELETE. Let’s also assume the upstream was too slow and the HTTP/gRPC request timed out, but in reality the resource creation/deletion actually happened in the upstream.
In such scenarios the activity code may have to take care of
a) a set of rules to correctly identify such errors and configure an appropriate retry policy, or
b) some mechanism to ensure that the activity invocation remains largely idempotent (e.g. if the activity is retried and the resource was already created/deleted, the appropriate error codes such as 409/404 may have to be handled). This could get a bit tricky, though.
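As a minimal sketch of point (b), assuming a hypothetical downstream REST API where a retried PUT returns 409 Conflict when the resource already exists, the activity can map that status to success so a first attempt that timed out but actually succeeded does not fail the retry:

package activities

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// CreateResourceActivity is a hypothetical activity that creates a resource
// via HTTP PUT and tolerates being retried after a request timeout.
func CreateResourceActivity(ctx context.Context, url string, body []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Transient transport error: return it and let the activity retry policy retry.
		return err
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return nil
	case resp.StatusCode == http.StatusConflict:
		// 409: a previous (timed-out) attempt already created the resource,
		// so treat this retry as a success to keep the activity idempotent.
		return nil
	default:
		return fmt.Errorf("create resource failed: %s", resp.Status)
	}
}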
Yes, it is an anti-pattern if you own the downstream services.
Obviously, if you are calling into services that belong to another company or organization, then you have no choice and you’ll have to forgo all the benefits outlined below.
Here are some reasons to implement downstream services as activities directly:
Flow Control
If an activity worker is down, it does not consume activity tasks from the associated task queue, so no load is generated on the service and no error logs are produced.
An activity worker allows specifying a per-instance rate limit.
An activity worker allows specifying a per-instance limit on the number of activities executing in parallel.
An activity worker allows specifying a per-task-queue rate limit that is enforced by the service across any number of workers (see the worker options sketch after the comparison below).
If there is a request spike and activities are requested faster than workers can process them (or than the configured rate limits allow), the requests are backlogged in the task queue and processed as soon as workers have spare capacity.
Compare it to the proposed downstream gRPC service approach:
If the gRPC service is down, activities still execute and keep making requests to the service, possibly overwhelming it.
If the gRPC service is overloaded, it has no way to push back on the request rate.
The failing activities put additional load on the Temporal service and the activity workers as they cycle through retries.
There is no support for absorbing traffic spikes without overloading the downstream service.
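A minimal sketch of how the per-worker and per-task-queue limits mentioned above are configured via Go SDK worker options (the task queue name and numbers are illustrative; on older SDK versions client.Dial may be client.NewClient):

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "downstream-service-task-queue", worker.Options{
		// Per-worker cap on parallel activity executions.
		MaxConcurrentActivityExecutionSize: 100,
		// Per-worker rate limit on activity starts per second.
		WorkerActivitiesPerSecond: 50,
		// Task-queue-wide rate limit, enforced by the Temporal service
		// across all workers polling this task queue.
		TaskQueueActivitiesPerSecond: 200,
	})

	w.RegisterActivity(SomeActivity)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}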
Routing and Load Balancing
You have to maintain a completely separate routing and load-balancing layer for RPC.
This layer is not needed (beyond the ability for Temporal workers to reach the Temporal frontends) when using Temporal activities directly.
Temporal supports routing requests to specific workers when needed, as sketched below. This can be achieved through RPC as well, but it might be nontrivial.
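One common way to do this routing, as a sketch: point the activity options at a task queue that only the target worker polls. The host-specific task queue name here is a hypothetical value, e.g. reported by an earlier activity running on that host.

package workflows

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// RouteToHostWorkflow is a hypothetical workflow that routes an activity
// to a worker-specific task queue.
func RouteToHostWorkflow(ctx workflow.Context, hostTaskQueue string, value string) error {
	ao := workflow.ActivityOptions{
		TaskQueue:           hostTaskQueue, // e.g. "worker-host-42", polled by one worker only
		StartToCloseTimeout: time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, SomeActivity, value).Get(ctx, nil)
}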
Long-Running Operations
RPC services don’t support long-running operations directly.
Temporal activities can have unlimited duration.
Temporal activities support heartbeating, which enables fast detection of worker failures.
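Extending the routing sketch above, the activity options for a long-running downstream call might look like this (timeout values are illustrative):

	ao := workflow.ActivityOptions{
		TaskQueue: hostTaskQueue,
		// The activity itself is allowed to run for hours.
		StartToCloseTimeout: 4 * time.Hour,
		// If the worker crashes and heartbeats stop (see SomeActivity at the
		// top of this thread), the attempt times out in seconds rather than
		// hours and can be retried on another worker.
		HeartbeatTimeout: 10 * time.Second,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)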