Weird Behaviour with Archival - Requires HttpGet before archiving

I need to add an http.Get call (shown further down) as the first line of the Archive method in visibilityArchiver.go & historyArchiver.go under the s3store package.
If I don't, archival doesn't work.
Not sure why.


Temporal is running on kubernetes.

I had a hard time getting archival to work.
I added the above code to check whether egress was set up properly, and right then archival started to work.

Not sure what http.Get("MY-S3-URL") does to make archival work magically.

FYI - the S3 I’m using is a flavour of AWS S3.

Could it be that you are using presigned S3 URLs? Uploading objects using presigned URLs - Amazon Simple Storage Service

I don’t believe that’s the case.

I have listed the errors I see below.

In most cases the error is: context deadline exceeded.

I want to understand what causes this error.

    "level": "error",
    "ts": "2022-01-12T09:20:19.346Z",
    "msg": "Archive method encountered an non-retryable error.",
    "service": "history",
    "archival-request-workflow-id": "s-tst-1",
    "archival-request-workflow-type": "GreetingWorkflow",
    "archival-request-close-timestamp": "2022-01-12T09:20:17.453Z",
    "archival-request-status": "Completed",
    "archival-URI": "s3://s3-archival",
    "archival-archive-fail-reason": "failed to write history to s3",
    "error": "RequestCanceled: request context canceled\ncaused by: context deadline exceeded",
    "logging-call-at": "visibilityArchiver.go:122",
    "stacktrace": "*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\*visibilityArchiver).Archive.func1\n\t/temporal/common/archiver/s3store/visibilityArchiver.go:122\*visibilityArchiver).Archive\n\t/temporal/common/archiver/s3store/visibilityArchiver.go:151\*client).archiveVisibilityInline\n\t/temporal/service/worker/archiver/client.go:260"
    "level": "error",
    "ts": "2022-01-12T09:27:00.163Z",
    "msg": "failed to send signal to archival system workflow",
    "service": "history",
    "shard-id": 3,
    "address": "",
    "shard-item": "0xc000111100",
    "archival-caller-service-name": "history",
    "archival-archive-attempted-inline": true,
    "archival-request-workflow-id": "s-tst-2",
    "wf-id": "temporal-archival-818",
    "error": "context deadline exceeded",
    "logging-call-at": "client.go:297",
    "stacktrace": "*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\*client).sendArchiveSignal\n\t/temporal/service/worker/archiver/client.go:297\*client).Archive\n\t/temporal/service/worker/archiver/client.go:190\*transferQueueTaskExecutorBase).recordWorkflowClosed\n\t/temporal/service/history/transferQueueTaskExecutorBase.go:196\*transferQueueActiveTaskExecutor).processCloseExecution\n\t/temporal/service/history/transferQueueActiveTaskExecutor.go:313\*transferQueueActiveTaskExecutor).execute\n\t/temporal/service/history/transferQueueActiveTaskExecutor.go:116\*transferQueueActiveProcessorImpl).process\n\t/temporal/service/history/transferQueueActiveProcessor.go:331\*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\\n\t/temporal/common/backoff/retry.go:103\*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"
    "level": "info",
    "ts": "2022-01-12T07:17:50.855Z",
    "msg": "failed to perform visibility archival inline",
    "service": "history",
    "shard-id": 4,
    "address": "",
    "shard-item": "0xc00058f900",
    "archival-caller-service-name": "history",
    "archival-archive-attempted-inline": true,
    "archival-request-workflow-id": "s3-test-10",
    "archival-URI": "s3://s3-archival",
    "error": "RequestCanceled: request context canceled\ncaused by: context deadline exceeded",
    "logging-call-at": "client.go:237"

Looks like the context deadline errors are due to the small default values of the archival time limits.

TimerProcessorArchivalTimeLimit → 1s
TransferProcessorVisibilityArchivalTimeLimit → 200ms
VisibilityProcessorVisibilityArchivalTimeLimit → 200ms

I’m trying to set these values in dynamic config as shown below, but it doesn’t seem to have an impact.

history.TimerProcessorArchivalTimeLimit: 2m
history.TransferProcessorVisibilityArchivalTimeLimit: 2m
history.VisibilityProcessorVisibilityArchivalTimeLimit: 2m

What am I doing wrong?

Tried adding double quotes “2m” as well.

line 60: cannot unmarshal !!str 2m into *dynamicconfig.constrainedValue\n

I believe for time.Duration properties the values have to be int64 and expressed in nanoseconds (the underlying unit of time.Duration),
so for example for two minutes the value should be 120000000000.

It’s not working.

I tried with and without quotes.

history.TimerProcessorArchivalTimeLimit: 120000000000

{"level":"info","ts":"2022-01-17T17:39:23.210Z","msg":"Unable to create file based dynamic config client, use no-op config client instead.","error":"failed to decode dynamic config yaml: unmarshal errors:\n line 60: cannot unmarshal !!int 1200000… into []*dynamicconfig.constrainedValue\n line 61: cannot unmarshal !!int 1200000… into []*dynamicconfig.constrainedValue\n line 62: cannot unmarshal !!int 1200000… into []*dynamicconfig.constrainedValue","logging-call-at":"main.go:136"}

history.TimerProcessorArchivalTimeLimit: “120000000000”

{"level":"info","ts":"2022-01-17T17:45:17.521Z","msg":"Unable to create file based dynamic config client, use no-op config client instead.","error":"failed to decode dynamic config yaml: unmarshal errors:\n line 60: cannot unmarshal !!str 1200000… into []*dynamicconfig.constrainedValue\n line 61: cannot unmarshal !!str 1200000… into []*dynamicconfig.constrainedValue\n line 62: cannot unmarshal !!str 1200000… into []*dynamicconfig.constrainedValue","logging-call-at":"main.go:136"}

I think this should be:

- value: 120000000000
  constraints: {}

See how the other constrained values are defined in dynamic config, for example: temporal/development_es.yaml at master · temporalio/temporal · GitHub
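Spelled out, the suggestion is to use the constrained-value list form for each key. A sketch of what one full entry might look like (the key casing here follows the config name printed in the server log further down in this thread; treat it as an assumption, not verified against the linked file):

```yaml
history.timerProcessorArchivalTimeLimit:
  - value: 120000000000
    constraints: {}
```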

It worked, I didn’t see any errors in the history pod.

However, setting these values doesn’t seem to be overriding the default values set in the Temporal server code.

I set the configs to 4 minutes as shown below.

  - value: 240000000000
    constraints: {}
  - value: 240000000000
    constraints: {}
  - value: 240000000000
    constraints: {}

In the logs I only found TransferProcessorVisibilityArchivalTimeLimit, not the other two.
Also, as per my config, shouldn’t the value be 240?

{"level":"info","ts":"2022-01-17T18:31:15.186Z","msg":"Get dynamic config","name":"history.transferProcessorVisibilityArchivalTimeLimit","value":120,"default-value":120,"logging-call-at":"config.go:79"}

FYI: I changed the default value to 2m directly in the server code:
TransferProcessorVisibilityArchivalTimeLimit: dc.GetDurationProperty(dynamicconfig.TransferProcessorVisibilityArchivalTimeLimit, 2*time.Minute),

And the context deadline still seems to be 2 minutes (the hardcoded default); it isn’t being picked up from dynamic config.

2022/01/17 18:31:15 deadline time : 2022-01-17 18:33:15.186328856 +0000 UTC m=+207.845076539

FYI: this is the place where I’m logging the line above.

It should have set the deadline on context here: temporal/timerQueueTaskExecutorBase.go at master · temporalio/temporal · GitHub

@Chad_Retz am I seeing this wrong?

The representation below works:

  - value: "5m"
    constraints: {}
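For reference, a full dynamic config file in this working string-duration form might look like the following. This is an assumed reconstruction: the casing of the transferProcessor key matches the config name printed in the server log earlier in the thread, and the other two key names are inferred by analogy.

```yaml
history.timerProcessorArchivalTimeLimit:
  - value: "5m"
    constraints: {}
history.transferProcessorVisibilityArchivalTimeLimit:
  - value: "5m"
    constraints: {}
history.visibilityProcessorVisibilityArchivalTimeLimit:
  - value: "5m"
    constraints: {}
```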

To summarize, there were two problems:
1. Default values for archival time limits are very small

  • Since the default archival timeouts range from 200ms to 1s, whenever the end system (S3) was slow to respond, archival would time out and fail with a context deadline exceeded error.

  • I solved this by increasing the timeouts to 1m. The parameters I tweaked were the three archival time limits listed earlier.


2. Faulty egress configuration & DNS resolution

  • This problem is not really related to Temporal.

  • The egress was configured based on the DNS URL of the S3 system, and the DNS resolution was a bit faulty. Doing http.Get("MY-S3-URL") prior to the API call would do something to the DNS resolution and get it to work.

  • But the real impediment to identifying the issue was, again, the small timeout limits. They meant I could not see the real errors, only context deadline exceeded. Only after I increased the time limits was I able to see the underlying failures, and that’s how I learned it was related to egress.

@tihomir I feel that these time limits are very small. Either we can increase them, or we should mention these parameters in the documentation so that developers are aware of how to get around these issues.