We found a case where a worker is definitely working on an activity, but when it tried to report the activity as complete, the service responded with not_exists. Inspecting the workflow execution history showed an ActivityTaskScheduled event but no ActivityTaskStarted event. How can the worker be working on the activity if the history contains no started event?
Temporal has various optimizations to avoid unnecessary history growth. When an activity has an associated retry policy (and Temporal assigns one to every activity by default), its started and completed events are not written to the history until the activity either completes successfully or runs out of retry attempts. This way an activity can stay in a retry loop for a very long time without causing any history growth. When the ActivityTaskStarted event is finally written, its attempt field records the number of actual activity executions.
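To make this concrete, here is a minimal Go SDK sketch; the activity name and the retry intervals are illustrative, not defaults:

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// FlakyActivity can observe which execution it is on via the activity info.
// This is the same counter that ends up in the attempt field of the
// ActivityTaskStarted event once that event is finally written.
func FlakyActivity(ctx context.Context) error {
	info := activity.GetInfo(ctx)
	activity.GetLogger(ctx).Info("executing", "attempt", info.Attempt)
	return nil
}

func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		// Even if RetryPolicy is omitted, the server applies a default one,
		// which is why the history optimization described above applies to
		// every activity unless retries are explicitly disabled.
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    0, // 0 means retry indefinitely
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, FlakyActivity).Get(ctx, nil)
}
```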
You can see all currently executing activities in the pending activities section of the workflow's summary page in the UI, or you can use the CLI workflow describe command to see the same data. Each pending activity entry includes the number of attempts so far as well as information about the last failure.
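The same data is also available programmatically. A minimal sketch using the Go SDK client (the workflow ID is a placeholder, and the empty client options assume a local server):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Default options connect to localhost:7233.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// An empty run ID means "the latest run" of this workflow ID.
	resp, err := c.DescribeWorkflowExecution(context.Background(), "my-workflow-id", "")
	if err != nil {
		log.Fatal(err)
	}

	// Each pending activity carries its attempt count and last failure,
	// even though nothing has been written to the history yet.
	for _, pa := range resp.PendingActivities {
		fmt.Printf("activity %s: attempt %d, last failure: %s\n",
			pa.GetActivityId(), pa.GetAttempt(), pa.GetLastFailure().GetMessage())
	}
}
```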
We understand that the current UI experience is not the most intuitive. We will try to make it cleaner in the future.
Thanks. That does explain why the workflow execution history looks the way it does.
But it does not explain why the service responded with not_exists when the worker tried to complete the activity. This happened well before any configured timeout could fire (the shortest was a ScheduleToStart timeout of 5 minutes).
I suspect a timeout. Could you post the ActivityTaskScheduled event? The input payload is not needed.
This doesn’t look right. What persistence do you use?
Postgres
Then my guess is that it is a PostgreSQL integration bug.
I think I found the HistoryService log entries that correspond to the failed RespondActivityTaskCompleted request:
It doesn’t tell me much.
The PostgreSQL integration has multiple reported bugs. We don’t recommend running it in production at this point.
The only specific bug I have heard mentioned is the one with child workflows, which we don’t use. Is there a list of the other bugs somewhere?
I’m not aware of such a list. But since we never ran any real tests against the integration, it is hard to say how many bugs there are.
Why is there so little information in the logs? It says only “application_error”. But when I look in the historyEngine code, there are about five specific errors that can be returned from RespondActivityTaskCompleted. Are those not logged anywhere?
This looks like an issue with the RPC layer used by Cadence, which might be swallowing the actual error returned by the service.
Then where is the activity’s state stored? Is it just in memory on a History node? What if that node dies?
It is stored in the DB in the executions table.
This shows two poll responses for the same activity within seconds of each other. This should not happen, right? The activity timeout is 2 minutes. After the first poll response, the worker tried to complete the activity and the service reported ActivityNotFound. Any suggestions for how to debug this?
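In case it helps to narrow things down: the not-found is detectable on the worker side. A minimal sketch, assuming the Go SDK and an async-completion path (the client setup and task-token handling here are illustrative, not our actual code):

```go
package sample

import (
	"context"
	"errors"
	"log"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// completeActivity reports an activity result using a task token that the
// activity recorded earlier (async completion), and distinguishes the
// "activity no longer exists" case from other failures.
func completeActivity(c client.Client, taskToken []byte, result string) error {
	err := c.CompleteActivity(context.Background(), taskToken, result, nil)

	var notFound *serviceerror.NotFound
	if errors.As(err, &notFound) {
		// The service no longer knows about this activity: it may have
		// timed out, been completed already, or, as suspected in this
		// thread, been lost by the persistence layer.
		log.Printf("activity not found on completion: %v", err)
	}
	return err
}
```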
Are you able to reproduce the problem using a different DB? I believe these are manifestations of bugs in the PostgreSQL integration.
I haven’t been able to repro on my desktop (where I could try a different DB). If I manage to, I will try that.
But assuming it is Postgres, do you have any tips for how I can debug and maybe fix it?