Archival - 100% Failure of Signal Archives


We enabled archiving earlier this week on a high-volume namespace. We initially experienced a high rate of failures on inline archiving, which we were able to resolve by configuring timeouts per this post: Weird Behaviour with Archival - Requires HttpGet before archiving.

However, we are still seeing 100% failure of all the signal-based archiving, all logging “failed to send signal to archival system workflow” with an error of “context deadline exceeded”.

I found this topic: SignalWithStartWorkflow behaviour with context.WithTimeout - #2 by tihomir, where this bug was opened (Archival attempted for already archived workflow · Issue #2464 · temporalio/temporal · GitHub), which this PR (Handle history not found error when archiving history by yycptt · Pull Request #2465 · temporalio/temporal · GitHub) presumably resolves. However, that PR doesn’t actually address the timeout issue by making the timeout configurable; it only changes how attempts to subsequently delete an already archived workflow are handled.

I have established that connectivity is possible between the history and frontend services, so I don’t think it is networking. Is there any other explanation for why we might get 100% context deadline exceeded when trying to archive workflows?

We are on Temporal version 1.13.3.

Thank you,

	"content": {
		"timestamp": "2022-03-26T00:07:36.821Z",
		"host": "i-06416f2c0019966d2",
		"service": "history.temporal.infra",
		"attributes": {
			"service": "history",
			"level": "error",
			"service_name": "history.temporal.infra",
			"wf-id": "temporal-archival-867",
			"meta": {
				"log_processor": "global"
			"msg": "failed to send signal to archival system workflow",
			"stacktrace": "*zapLogger).Error\n\t/go/pkg/mod/\*client).sendArchiveSignal\n\t/go/pkg/mod/\*client).Archive\n\t/go/pkg/mod/\*timerQueueTaskExecutorBase).archiveWorkflow\n\t/go/pkg/mod/\*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask\n\t/go/pkg/mod/\*timerQueueActiveTaskExecutor).execute\n\t/go/pkg/mod/\*timerQueueActiveProcessorImpl).process\n\t/go/pkg/mod/\*taskProcessor).processTaskOnce\n\t/go/pkg/mod/\*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/\*taskProcessor).taskWorker\n\t/go/pkg/mod/",
			"archival-request-namespace-id": "f15ed5e9-b7e6-49ea-9b11-707424096c00",
			"archival-request-run-id": "e6c1fb4f-0ab9-45cc-866e-711bffacb352",
			"archival-request-namespace": "payments_prod",
			"address": "",
			"logging-call-at": "client.go:297",
			"shard-id": 4085,
			"ts": "2022-03-26T00:07:36.819Z",
			"shard-item": "0xc006982680",
			"error": "context deadline exceeded",
			"archival-request-workflow-id": "money-allocation-alloc_od_6497773539130607",
			"archival-archive-attempted-inline": false,
			"archival-caller-service-name": "history"


Do you want to increase the hardcoded signalTimeout and test?
We did that and things improved.

Here’s the const you need to update (line 114): temporal/client.go at e660403953b48ad2f24a50a39fd7a4be71a62667 · temporalio/temporal · GitHub

We set it to 30s and were able to resolve the issue.

Whenever there is a huge load on the system, signals are slow, and the default value is very small.
Give 30s a try.

Also, are your inline archivals working fine?


Hi @Vikas_NS, thank you for the reply - I read your other post thoroughly. We can try deploying a custom binary with an increased hardcoded timeout.

All inline archivals are working fine; it is just the signal/workflow archivals that are failing 100% of the time.

We also tried scaling up the frontend and worker services to see if that would improve things. Load (CPU, memory) on the frontend is very low, so it seems odd that the system would act as if it were under huge load. Did you find any other way to resolve this on the receiving side of the signal, rather than just raising the timeout?


Opened an issue for this here.


We didn’t try any other approach apart from that.

But one thing to note: when we faced this issue, at least one or two signals would still go through within that 300ms limit. While most of them failed, it was not all of them.

In your case, given that it’s a 100% failure,
maybe you want to try increasing the timeout first just to narrow down the issue.

If it still fails even after increasing it, it might be a different issue altogether.
