Activity started without corresponding event in execution history

chf75 · August 19, 2020, 8:50pm

We found a case where a worker is definitely working on an activity, but when it went to report it as complete, the service responded with not_exists. Inspecting the workflow execution history showed that there was an event for ActivityScheduled, but no event for ActivityStarted. How can the worker be working on it if the history contains no event for ActivityStarted?

maxim · August 19, 2020, 10:40pm

Temporal has various optimizations to avoid unnecessary history growth. When an activity has an associated retry policy (and Temporal assigns one to every activity by default), its start and completion events are not written to the history until the activity completes successfully or no more retries will be attempted. This way an activity can stay in a retry loop for a very long time without causing any history growth. The ActivityTaskStartedEvent contains the number of actual activity executions in the attempt field.

You can see all currently executing activities in the pending activities section of the workflow's summary page in the UI. Or you can use the CLI workflow describe command to see the same data. Each pending activity's information includes the number of attempts so far as well as information about the last failure.
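The behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not Temporal's implementation: `ToyWorkflow`, `PendingActivity`, and all method names are hypothetical. It shows that picking up an activity changes only the pending-activity record, and that the started event (carrying the attempt count) is appended to history together with the closing event.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingActivity:
    """Hypothetical record for an activity that is scheduled but not yet closed."""
    activity_id: str
    max_attempts: int
    attempt: int = 0
    last_failure: Optional[str] = None


@dataclass
class ToyWorkflow:
    """Toy model: history is append-only, pending activities live outside it."""
    history: List[str] = field(default_factory=list)
    pending: Dict[str, PendingActivity] = field(default_factory=dict)

    def schedule(self, activity_id: str, max_attempts: int) -> None:
        # Scheduling is recorded in history immediately.
        self.history.append("ActivityTaskScheduled")
        self.pending[activity_id] = PendingActivity(activity_id, max_attempts)

    def start(self, activity_id: str) -> None:
        # A worker picked up the task: only the pending record changes.
        self.pending[activity_id].attempt += 1

    def fail(self, activity_id: str, reason: str) -> None:
        p = self.pending[activity_id]
        p.last_failure = reason
        # History grows only once no more retries will be attempted.
        if p.attempt >= p.max_attempts:
            self._close(p, "ActivityTaskFailed")

    def complete(self, activity_id: str) -> None:
        self._close(self.pending[activity_id], "ActivityTaskCompleted")

    def _close(self, p: PendingActivity, closing_event: str) -> None:
        # The started event is written together with the closing event.
        self.history.append(f"ActivityTaskStarted(attempt={p.attempt})")
        self.history.append(closing_event)
        del self.pending[p.activity_id]


wf = ToyWorkflow()
wf.schedule("a1", max_attempts=3)
wf.start("a1")
wf.fail("a1", "boom")                # retry still allowed, history unchanged
wf.start("a1")
history_while_running = list(wf.history)
attempts_while_running = wf.pending["a1"].attempt
wf.complete("a1")
```

In this model, `history_while_running` holds only the scheduled event even though a worker is on its second attempt, which matches the situation in the original question; the attempt count is visible only through the pending record until the activity closes.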

We understand that the current UI experience is not the most intuitive one. We will try to make it cleaner in the future.

chf75 · August 19, 2020, 10:48pm

Thanks. That does explain why the workflow execution history looks the way it does.

But it does not explain why the service responded with not_exists when the worker tried to complete the activity. This happened well before any configured timeout (the first timeout was scheduledToStart at 5 minutes).

maxim · August 20, 2020, 1:25am

I suspect a timeout. Could you post the ActivityTaskScheduled event? The input payload is not needed.

chf75 · August 20, 2020, 2:10am

chf75 · August 20, 2020, 2:41am

And here is the failed activity completion (not exists) from the worker.

maxim · August 20, 2020, 2:48am

This doesn’t look right. What persistence do you use?

chf75 · August 20, 2020, 2:59am

Postgres

maxim · August 20, 2020, 3:29am

Then my guess is that it is a PostgreSQL integration bug.

chf75 · August 20, 2020, 3:29am

I think I found the HistoryService log entries that correspond to the failed RespondActivityTaskCompleted request:

It doesn’t tell me much.

maxim · August 20, 2020, 3:30am

The PostgreSQL integration has multiple reported bugs. We don’t recommend running it in production at this point.

chf75 · August 20, 2020, 3:33am

The only specific bug I have heard mentioned is the child workflows, which we don’t use. Is there a list of other bugs somewhere?

maxim · August 20, 2020, 3:35am

I’m not aware of such a list. But as we never ran any real tests of the integration, it is hard to say how many bugs there are.

chf75 · August 20, 2020, 3:58am

Why is there so little information in the logs? It says only, “application_error”. But when I look in the historyEngine code, there are like 5 specific errors that can get returned from RespondActivityTaskCompleted. Are those not logged anywhere?

samar · August 20, 2020, 6:58am

This looks like an issue with the rpc layer used by Cadence which might be eating up the actual error returned by the service.

chf75 · August 20, 2020, 10:34pm

Then where is it stored? Is it just in memory on a History node? What if that node dies?

maxim · August 20, 2020, 10:51pm

It is stored in the DB in the executions table.

chf75 · August 24, 2020, 1:49am

This shows two poll responses for the same activity within seconds of each other. This should not happen, right? The activity timeout is 2 minutes. After the first poll response, the worker tried to complete the activity and the service reported ActivityNotFound. Any suggestions for how to debug this?

maxim · August 24, 2020, 4:13am

Are you able to reproduce this problem using a different DB? I believe these are manifestations of bugs in the PostgreSQL integration.

chf75 · August 24, 2020, 7:21am

I haven’t been able to repro on my desktop (where I can try a different DB). If I can, I will try it.

But assuming it is Postgres, do you have any tips for how I can debug and maybe fix?
