Temporal Activity gets stuck without performing any action

We are using Temporal to copy files from a source to a destination. Both the source and destination mapped drives are mounted on the worker service, which contains the logic to copy files from the source mapped drive to the destination drive. The issue we are facing: when a mapped drive disconnects momentarily (the source or destination goes down) or there is a network glitch, the copy activity gets stuck without performing any action and runs until it times out (we have a 5-hour timeout per activity), even after the mapped drive reconnects some time later.

We need help figuring out how to debug this.

Hi @Avadoot

Have you considered using activity heartbeats? See "Detecting Activity failures | Temporal Platform Documentation".

Antonio

Hi @antonio.perez ,

Yes, we do have an activity heartbeat, which we send every 30 seconds. The activity timeout is 5 hours, since these activities are meant for long-running tasks.

Also, we run 10 activities in parallel. This issue happens only when we run multiple jobs, each running its own 10 parallel activities.
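One thing worth checking in a setup like this is where the heartbeat is sent from. If it comes from a timer or background thread that is independent of the copy loop, heartbeats keep arriving while the I/O is hung, so the heartbeat timeout never fires and the activity only ends at its overall timeout. A minimal sketch of the alternative, heartbeating from inside the copy loop, is below. It is an assumption about how the copy could be structured, not the poster's code: `copy_with_heartbeat` and its `heartbeat` parameter are hypothetical names, and in a real worker the callback would be the SDK's activity heartbeat call.

```python
import os
import tempfile
from typing import Callable


def copy_with_heartbeat(
    src: str,
    dst: str,
    heartbeat: Callable[[int], None],
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks, reporting progress after every chunk.

    `heartbeat` is a placeholder (assumption) for the SDK's activity
    heartbeat call. It is only invoked after a chunk has been written,
    so a read or write that hangs on a stale drive also stops the
    heartbeats.
    """
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            heartbeat(copied)  # bytes copied so far double as progress details
    return copied


if __name__ == "__main__":
    # Local demo: temporary files stand in for the mapped drives.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.bin")
        dst = os.path.join(tmp, "dest.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 2500)
        total = copy_with_heartbeat(src, dst, print, chunk_size=1000)
        print("copied", total, "bytes")
```

With this shape, a hung read means no further heartbeats, so the server can time out the attempt and retry it according to the retry policy instead of waiting for the full activity timeout.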

What is the value of the activity heartbeat timeout?

Hi @maxim ,

It's 1 minute. We have observed that when we run with lower activity concurrency (3-4 instead of the current 10 in parallel at any one time, multiplied by the number of running jobs), the jobs complete. Any pointers on this behaviour would be helpful.

Let me know if you need any other information.

Look at the ActivityTaskScheduled event to see if there was more than one attempt.

I wonder if this has to do with constrained resources. Maybe something is blocking the event loop (a CPU-intensive synchronous operation or an infinite loop). Can you double-check the CPU utilization?
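The thread does not say which SDK is in use, so the following is only an illustration of the event-loop point, assuming a Python asyncio worker. A synchronous call made directly inside a coroutine stalls everything else on the loop. Handing it to a thread with `asyncio.to_thread` keeps the loop responsive. `blocking_copy` is a hypothetical stand-in for the real copy, and the ticker stands in for other work on the loop, such as sending heartbeats.

```python
import asyncio
import contextlib
import time
from typing import Tuple


def blocking_copy(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous file copy; it just sleeps."""
    time.sleep(seconds)
    return "done"


async def run_without_blocking(seconds: float) -> Tuple[str, int]:
    """Run the blocking call in a thread while a ticker keeps running."""
    ticks = 0

    async def ticker() -> None:
        # Represents other work on the event loop, e.g. heartbeats.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    try:
        # The event loop stays free while the copy runs in a worker thread.
        result = await asyncio.to_thread(blocking_copy, seconds)
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
    return result, ticks


if __name__ == "__main__":
    result, ticks = asyncio.run(run_without_blocking(0.2))
    print(result, "with", ticks, "ticks while copying")
```

If `blocking_copy` were called directly instead of through `asyncio.to_thread`, the ticker would not advance at all during the copy.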

We debugged it. It was a stale connection issue with the mapped drive, which caused the activities to get stuck. We changed the mounting commands to save the credentials in Windows Credential Manager.
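For anyone hitting the same symptom, a quick way to tell a stale mount from a slow one is to probe the drive with a time limit before starting the copy. The sketch below is an assumption, not what was deployed here: `path_responds` is a hypothetical helper that lists the directory in a daemon thread and gives up after `timeout` seconds. A hung probe thread is abandoned rather than killed, so this only detects the problem and does not repair the mount.

```python
import os
import tempfile
import threading
import time
from typing import Callable, Dict


def path_responds(
    path: str,
    timeout: float,
    probe: Callable[[str], object] = os.listdir,
) -> bool:
    """Return True if probe(path) finishes without an OSError within timeout seconds."""
    outcome: Dict[str, bool] = {}

    def worker() -> None:
        try:
            probe(path)
            outcome["ok"] = True
        except OSError:
            outcome["ok"] = False

    # Daemon thread: a probe stuck on a stale drive will not block process exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # No entry means the probe is still hanging, which counts as a failure.
    return outcome.get("ok", False)


if __name__ == "__main__":
    # The temp directory stands in for a healthy mapped drive.
    print(path_responds(tempfile.gettempdir(), timeout=2.0))
    # A probe that sleeps past the timeout simulates a stale connection.
    print(path_responds("Z:\\", timeout=0.05, probe=lambda p: time.sleep(0.5)))
```

Running the probe at the start of each activity attempt would let the attempt fail fast with a clear error when the drive is unreachable, instead of hanging inside the copy.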