Hi there,
We are using the temporal-large-payload-codec to work around the size limits on Temporal inputs and outputs. This works great, except that we have hit non-determinism errors when the connection to the underlying bucket is interrupted or delayed.
With the current design of the PayloadCodec interface, Encode and Decode each return an error:
```go
Encode([]*commonpb.Payload) ([]*commonpb.Payload, error)
Decode([]*commonpb.Payload) ([]*commonpb.Payload, error)
```
At the moment, when there is no network connection, the implementation of these functions hangs and then, once the client connection times out (in our case a very generous 60s), returns an error.
The problem shows up in the workflow implementation, where we wait on a future:

```go
err := fut.Get(ctx, &result)
```
It can happen that the activity itself executed successfully and is recorded in the history, but on the Get call the retrieval of its result via the codec fails, so Get returns an error. Our code does not distinguish between a workflow error and a codec error; it simply takes compensating actions based on the error. When the workflow is replayed later, the codec may succeed in decoding the payload, no error is returned, the code takes a different path, and we end up with a non-determinism error.
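To make the failure mode concrete, here is a self-contained sketch of the divergence (the helpers `decode` and `workflowStep` are hypothetical stand-ins for the codec's Decode and our branch after fut.Get, not code from the codec or the SDK):

```go
package main

import (
	"errors"
	"fmt"
)

// decode stands in for the codec's Decode call; networkDown simulates a
// lost connection to the underlying bucket.
func decode(payload string, networkDown bool) (string, error) {
	if networkDown {
		return "", errors.New("codec: connection to bucket timed out")
	}
	return payload, nil
}

// workflowStep mimics the branch our workflow takes after fut.Get: on
// error it runs the compensation path, otherwise the happy path.
func workflowStep(networkDown bool) string {
	if _, err := decode("activity-result", networkDown); err != nil {
		return "compensate" // taken on the original run
	}
	return "continue" // taken on replay, once the network is back
}

func main() {
	original := workflowStep(true) // bucket unreachable on first execution
	replay := workflowStep(false)  // bucket reachable during replay
	fmt.Println(original, replay)
	// Same history, different code path: the source of the
	// non-determinism error.
	fmt.Println("diverged:", original != replay)
}
```

The activity result in the history never changed; only the codec's ability to fetch it did, yet the workflow code branches differently.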
What we are asking ourselves is whether it would be better to panic in the codec on error. The panic should then be handled by the workflow's panic handler/policy, and a non-determinism error should not occur.
Or would it be better to wrap errors in ErrUnableToEncode/ErrUnableToDecode (see errors.go) and check the error type on future.Get?
Any ideas?
Thanks,
Hardy