Log errors when enabling history.enableReplicationStream

Hi,

I am wondering what to make of the error “StreamSender exit recv loop” (stream_sender.go:155) when enabling the dynamic configuration option history.enableReplicationStream:

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamSenderImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_sender.go:155
go.temporal.io/server/service/history/replication.(*StreamSenderImpl).Start.func2
	external/io_temporal_go_server/service/history/replication/stream_sender.go:106

For context, we are running Temporal v1.23.1 and use the default Temporal replication for that version. We want to test replication streaming, which, as far as we understand, is faster and more reliable. To do so we set history.enableReplicationStream to true. As far as we can tell, replication streaming gets enabled and replication continues. However, the above error keeps appearing in the logs. Initially we thought it might be temporary, but it keeps being logged (even if we switch history.enableReplicationStream back to false).

Is it safe to toggle history.enableReplicationStream in the dynamic config, or do the clusters need to be disconnected, or at least have no replication ongoing, before enabling this feature?

In Temporal v1.24 replication streaming will become the default. Will switching history.enableReplicationStream from false to true still cause the same errors?

Any help appreciated,
Hardy

StreamSender exit recv loop indicates that the sender (source side) stopped sending replication tasks. You should be able to find the reason for that in a different log entry from stream_sender. In a following release (1.26), we have improved the logging.

Ok, I’ll turn history.enableReplicationStream back on and see whether I can get more out of the logs.

Generally though, is it ok to enable history.enableReplicationStream at runtime, one cluster at a time, in a given multi-cluster replication setup?

Ok, there are actually three errors in the logs, all appearing at roughly the same rate (we are talking ~100K log lines per hour):

StreamReceiver recv stream encountered unexpected err

closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

stream_receiver.go:237

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).processMessages
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:237
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:184

BiDirectionStream encountered unexpected error, closing: *serviceerror.Unavailable closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

bi_direction_stream.go:188

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*BiDirectionStreamImpl[...]).recvLoop
	external/io_temporal_go_server/service/history/replication/bi_direction_stream.go:188

and

StreamReceiver exit recv loop

closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

go.temporal.io/server/common/log.(*zapLogger).Error
	external/io_temporal_go_server/common/log/zap_logger.go:156
go.temporal.io/server/service/history/replication.(*StreamReceiverImpl).recvEventLoop
	external/io_temporal_go_server/service/history/replication/stream_receiver.go:185

The errors carry different shard IDs as well.

Looking at the origin of the error in bi_direction_stream.go, and the fact that the message mentions EOF, I am wondering whether the io.EOF case is not working as intended:

func (s *BiDirectionStreamImpl[Req, Resp]) recvLoop() {
	defer close(s.channel)
	defer s.Close()

	for {
		resp, err := s.streamingClient.Recv()
		switch err {
		case nil:
			s.channel <- StreamResp[Resp]{
				Resp: resp,
				Err:  nil,
			}
		case io.EOF:
			return
		default:
			s.logger.Error(fmt.Sprintf(
				"BiDirectionStream encountered unexpected error, closing: %T %s",
				err, err,
			))
			var errResp Resp
			s.channel <- StreamResp[Resp]{
				Resp: errResp,
				Err:  err,
			}
			return
		}
	}
}
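
To illustrate my suspicion: the "EOF" appears inside the transport error's message, while the error value that Recv returns in this situation seems to be a gRPC Unavailable status rather than a bare io.EOF, so the case io.EOF branch never matches. Below is a minimal sketch (my own code, not Temporal's; classifyRecvErr is a hypothetical helper) of how one could tell the two apart:

package main

import (
	"errors"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyRecvErr distinguishes a clean end of stream from a transport reset.
// A GOAWAY-triggered closure ("max_age") typically surfaces from Recv as a
// status error with codes.Unavailable whose text mentions EOF, not as io.EOF.
func classifyRecvErr(err error) string {
	switch {
	case err == nil:
		return "message received"
	case errors.Is(err, io.EOF):
		return "clean end of stream"
	case status.Code(err) == codes.Unavailable:
		return "transport closed (e.g. GOAWAY max_age); reconnect expected"
	default:
		return "unexpected error"
	}
}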

@xwduan, wdyt?

The replication stream connection is shard to shard. That means that if you have two 1K-shard clusters connected, there are 2K bidirectional stream connections between the two clusters. The error you saw (max_age) can be tuned via the KeepAliveMaxConnectionAge config. By default it is 5 minutes, so every 5 minutes about 2K connections are reset; each reset causes an error on both the sender side and the receiver side, and both emit a log entry. That’s why you are seeing so many error logs. BTW, if your cluster is fairly idle, you may also want to tune KeepAliveMaxConnectionIdle to reduce the chance of connection resets.
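
For illustration, here is roughly the gRPC server-side keepalive mechanism behind those settings (a standalone sketch, not Temporal's actual wiring; the durations are only examples):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServerWithKeepalive shows the gRPC keepalive enforcement that produces
// the GOAWAY with debug data "max_age" seen in the logs above. This is a
// sketch; Temporal derives the real values from its KeepAlive* configuration.
func newServerWithKeepalive() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Once a connection reaches this age, the server sends GOAWAY
		// ("max_age") and closes it; every shard-to-shard replication stream
		// on that connection is torn down and re-established.
		MaxConnectionAge: 5 * time.Minute,
		// Connections with no active streams for this long are also closed,
		// which is why the idle limit matters for fairly idle clusters.
		MaxConnectionIdle: 2 * time.Minute,
	}))
}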

Yes. It is okay to roll it out one cluster at a time.


Hi,

Yes, that makes sense and the numbers roughly add up. I guess the confusing part is that these are error-level logs. Given that this is expected behavior, I’d have thought these would be debug- or info-level logs. Maybe that changes in later versions of Temporal, once replication streaming becomes the default.

Looking at KeepAliveMaxConnectionAge and KeepAliveMaxConnectionIdle, they seem to be frontend/gRPC-related properties. Would increasing these limits have an impact on other Temporal APIs, not only replication streaming? And are there any other negative side effects of increasing these values?

Thanks for all the help,
Hardy