Archival - 100% Failure of Signal Archives


We enabled archiving earlier this week on a high-volume namespace. We initially experienced a high rate of failures on inline archiving, which we were able to resolve by configuring timeouts per this post: Weird Behaviour with Archival - Requires HttpGet before archiving.

However, we are still seeing 100% failure of all the signal-based archiving, all logging “failed to send signal to archival system workflow” with an error of “context deadline exceeded”.

I found this topic: SignalWithStartWorkflow behaviour with context.WithTimeout - #2 by tihomir, where this bug was opened (Archival attempted for already archived workflow · Issue #2464 · temporalio/temporal · GitHub), which this PR (Handle history not found error when archiving history by yycptt · Pull Request #2465 · temporalio/temporal · GitHub) presumably resolves. However, that PR doesn’t actually address the timeout issue by making the timeout configurable; it only changes how attempts to subsequently delete an already archived workflow are handled.

I have established that connectivity is possible between the history and frontend services, so I don’t think it is networking. Is there any other explanation for why we might get 100% context deadline exceeded when trying to archive workflows?

We are on Temporal version 1.13.3.

Thank you,

	"content": {
		"timestamp": "2022-03-26T00:07:36.821Z",
		"host": "i-06416f2c0019966d2",
		"service": "history.temporal.infra",
		"attributes": {
			"service": "history",
			"level": "error",
			"service_name": "history.temporal.infra",
			"wf-id": "temporal-archival-867",
			"meta": {
				"log_processor": "global"
			"msg": "failed to send signal to archival system workflow",
			"stacktrace": "*zapLogger).Error\n\t/go/pkg/mod/\*client).sendArchiveSignal\n\t/go/pkg/mod/\*client).Archive\n\t/go/pkg/mod/\*timerQueueTaskExecutorBase).archiveWorkflow\n\t/go/pkg/mod/\*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask\n\t/go/pkg/mod/\*timerQueueActiveTaskExecutor).execute\n\t/go/pkg/mod/\*timerQueueActiveProcessorImpl).process\n\t/go/pkg/mod/\*taskProcessor).processTaskOnce\n\t/go/pkg/mod/\*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\\n\t/go/pkg/mod/\*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/\*taskProcessor).taskWorker\n\t/go/pkg/mod/",
			"archival-request-namespace-id": "f15ed5e9-b7e6-49ea-9b11-707424096c00",
			"archival-request-run-id": "e6c1fb4f-0ab9-45cc-866e-711bffacb352",
			"archival-request-namespace": "payments_prod",
			"address": "",
			"logging-call-at": "client.go:297",
			"shard-id": 4085,
			"ts": "2022-03-26T00:07:36.819Z",
			"shard-item": "0xc006982680",
			"error": "context deadline exceeded",
			"archival-request-workflow-id": "money-allocation-alloc_od_6497773539130607",
			"archival-archive-attempted-inline": false,
			"archival-caller-service-name": "history"


Do you want to increase the hardcoded signalTimeout and test?
We did that and things improved.

Here’s the const you need to update (line 114): temporal/client.go at e660403953b48ad2f24a50a39fd7a4be71a62667 · temporalio/temporal · GitHub

We set it to 30s and were able to resolve the issue.

Whenever there is a huge load on the system, signals are slow, and the default value is very small.
Give 30s a try.

Also, are your inline archivals working fine?


Hi @Vikas_NS, thank you for the reply - I read your other post thoroughly. We can try deploying a custom binary with an increased hardcoded timeout.

All inline archivals are working fine; it is just the signal/workflow archivals that are failing 100% of the time.

We also tried scaling up the frontend and worker services to see if that would improve things. Load (CPU, memory) on the frontend is very low, so it seems odd that the system would act as if it were under huge load. Did you find any other way to resolve this on the receiving side of the signal, rather than just raising the timeout?


Opened an issue for this here.


We didn’t try any other approach apart from that.

But one thing to note: when we faced this issue, at least one or two signals would still go through within that 300ms limit. While most of them failed, it was not all of them.

In your case, given that it’s a 100% failure,
maybe you want to try increasing the timeout first just to narrow down the issue.

If it still fails even after increasing it, it might be a different issue altogether.
