Log errors when enabling history.enableReplicationStream

Hi,

I am wondering what to make of the error “StreamSender exit recv loop” (stream_sender.go:155) when enabling the dynamic configuration option history.enableReplicationStream:

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamSenderImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_sender.go:155
go.temporal.io/server/service/history/replication.(*StreamSenderImpl).Start.func2
	external/io_temporal_go_server/service/history/replication/stream_sender.go:106

For context, we are running Temporal v1.23.1 and use the default Temporal replication for that version. We want to test replication streaming, which, as far as we understand, is faster and more reliable. To do so we set history.enableReplicationStream to true. As far as we can tell, replication streaming gets enabled and replication continues. However, the above error keeps appearing in the logs. Initially we thought it might be temporary, but it keeps being logged (even if we switch history.enableReplicationStream back to false).

Is it safe to toggle history.enableReplicationStream in the dynamic config, or do the clusters need to be disconnected, or at least have no replication ongoing, before enabling this feature?

In Temporal v1.24 replication streaming will become the default. Will switching history.enableReplicationStream from false to true still cause the same errors?

Any help appreciated,
Hardy

StreamSender exit recv loop indicates that the sender (source side) stopped sending replication tasks. You should be able to find the reason for that in a different log entry from stream_sender. In a following release (1.26), we have improved the logging.

Ok, I’ll turn history.enableReplicationStream back on and see whether I can get more out of the logs.

Generally though, is it ok to enable history.enableReplicationStream at runtime, one cluster at a time, in a given multi-cluster replication setup?

Ok, there are actually three errors in the logs, all appearing at roughly the same rate (we are talking ~100K log lines per hour):

StreamReceiver recv stream encountered unexpected err

closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

stream_receiver.go:237

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).processMessages
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:237
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:184

BiDirectionStream encountered unexpected error, closing: *serviceerror.Unavailable closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

bi_direction_stream.go:188

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*BiDirectionStreamImpl[...]).recvLoop
	external/io_temporal_go_server/service/history/replication/bi_direction_stream.go:188

and

StreamReceiver exit recv loop

closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:185

The errors carry different shard IDs as well.

Looking at the origin of the error in bi_direction_stream.go, and the fact that the message mentions EOF, I am wondering whether the io.EOF case is not working as intended:

func (s *BiDirectionStreamImpl[Req, Resp]) recvLoop() {
	defer close(s.channel)
	defer s.Close()

	for {
		resp, err := s.streamingClient.Recv()
		switch err {
		case nil:
			s.channel <- StreamResp[Resp]{
				Resp: resp,
				Err:  nil,
			}
		case io.EOF:
			return
		default:
			s.logger.Error(fmt.Sprintf(
				"BiDirectionStream encountered unexpected error, closing: %T %s",
				err, err,
			))
			var errResp Resp
			s.channel <- StreamResp[Resp]{
				Resp: errResp,
				Err:  err,
			}
			return
		}
	}
}
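
To illustrate my suspicion: the "EOF" appears inside the transport error's message, while the error value that Recv returns in this situation seems to be a gRPC Unavailable status rather than a bare io.EOF, so the case io.EOF branch never matches. Below is a minimal sketch (my own code, not Temporal's; classifyRecvErr is a hypothetical helper) of how one could tell the two apart:

package main

import (
	"errors"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyRecvErr distinguishes a clean end of stream from a transport reset.
// A GOAWAY-triggered closure ("max_age") typically surfaces from Recv as a
// status error with codes.Unavailable whose text mentions EOF, not as io.EOF.
func classifyRecvErr(err error) string {
	switch {
	case err == nil:
		return "message received"
	case errors.Is(err, io.EOF):
		return "clean end of stream"
	case status.Code(err) == codes.Unavailable:
		return "transport closed (e.g. GOAWAY max_age); reconnect expected"
	default:
		return "unexpected error"
	}
}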

@xwduan, wdyt?

The replication stream connection is shard to shard. That means that if you have two 1K-shard clusters connected, there are 2K bidirectional stream connections between the two clusters. The error you saw (max_age) can be tuned via the KeepAliveMaxConnectionAge config. By default it is 5 minutes, so every 5 minutes about 2K connections are reset; each reset causes an error on both the sender side and the receiver side, and both emit a log entry. That’s why you are seeing so many error logs. BTW, if your cluster is fairly idle, you may also want to tune KeepAliveMaxConnectionIdle to reduce the chance of connection resets.
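
For illustration, here is roughly the gRPC server-side keepalive mechanism behind those settings (a standalone sketch, not Temporal's actual wiring; the durations are only examples):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServerWithKeepalive shows the gRPC keepalive enforcement that produces
// the GOAWAY with debug data "max_age" seen in the logs above. This is a
// sketch; Temporal derives the real values from its KeepAlive* configuration.
func newServerWithKeepalive() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Once a connection reaches this age, the server sends GOAWAY
		// ("max_age") and closes it; every shard-to-shard replication stream
		// on that connection is torn down and re-established.
		MaxConnectionAge: 5 * time.Minute,
		// Connections with no active streams for this long are also closed,
		// which is why the idle limit matters for fairly idle clusters.
		MaxConnectionIdle: 2 * time.Minute,
	}))
}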

Yes. It is okay to roll it out one cluster at a time.


Hi,

Yes, that makes sense and the numbers roughly add up. I guess the confusing part is that these are error-level logs. Given that this is expected behavior, I’d have thought these would be debug- or info-level logs. Maybe that changes in later versions of Temporal, once replication streaming becomes the default.

Looking at KeepAliveMaxConnectionAge and KeepAliveMaxConnectionIdle, they seem to be frontend/gRPC-related properties. Would increasing these limits have an impact on other Temporal APIs, not only replication streaming? And are there any other negative side effects of increasing these values?

Thanks for all the help,
Hardy