We are using Temporal to copy files from a source to a destination. Both the source and destination mapped drives are mounted on the worker service, which contains the logic to copy files from the source mapped drive to the destination drive. The issue we are facing: when a mapped drive is momentarily disconnected (the source or destination is down) or there is a network glitch, the copy activity gets stuck without performing any action and runs until it times out (we have a 5-hour timeout per activity), even if the mapped drive reconnects after some time.
Yes, we do have an activity heartbeat, which we send every 30 seconds. The activity timeout is 5 hours, since these activities are meant for long-running tasks.
Also, we run 10 activities in parallel. This issue happens only when we run multiple jobs, each job running its own 10 parallel activities.
It's 1 min. We have observed that when we run with lower activity concurrency (3-4; currently it is 10 in parallel at any point in time, multiplied by the number of running jobs), the jobs complete. Any pointers on this behaviour would be helpful.
I wonder if this has to do with constrained resources. Maybe something is blocking the event loop (a CPU-intensive synchronous operation or an infinite loop). Can you double-check the CPU utilization?
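One way to check the event-loop theory directly, in addition to looking at CPU utilization, is to measure event loop delay inside the worker process. The following is a minimal sketch, assuming the worker runs on Node.js (the thread does not state which SDK is in use). `measureMaxLoopDelay` and `nsToMs` are hypothetical helper names; `monitorEventLoopDelay` is Node's built-in `perf_hooks` API, which reports values in nanoseconds.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Convert a nanosecond reading from the histogram into milliseconds.
function nsToMs(ns: number): number {
  return ns / 1e6;
}

// Hypothetical helper: sample event loop delay for `windowMs` milliseconds
// and resolve with the worst delay observed in that window, in milliseconds.
function measureMaxLoopDelay(windowMs: number): Promise<number> {
  // Sample every 10 ms; the resolution is an arbitrary choice for this sketch.
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  return new Promise<number>((resolve) => {
    setTimeout(() => {
      histogram.disable();
      resolve(nsToMs(histogram.max));
    }, windowMs);
  });
}

// Example: log the worst delay seen over a 5-second window.
// measureMaxLoopDelay(5000).then((ms) => console.log(`max loop delay: ${ms} ms`));
```

If the reported delay stays small while activities are stuck, the event loop is probably not the bottleneck and the hang is more likely in the underlying I/O.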
We debugged it. It was a stale connection issue with the mapped drive, because of which the activities were getting stuck. We changed the mount commands to save the credentials in Windows Credential Manager.
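As a complementary safeguard for anyone hitting something similar, a per-file deadline can make a copy on a stale mount fail fast so the activity can be retried instead of sitting until the 5-hour timeout. This is only a sketch, assuming a Node.js worker; `withTimeout` and `copyWithDeadline` are hypothetical helpers, not part of the Temporal SDK or of our actual code. Note that the timeout only rejects the promise; it does not cancel the underlying OS call, which may stay blocked in the background.

```typescript
import { copyFile } from "node:fs/promises";

// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined = undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical helper: copy one file, failing if it takes longer than `ms`.
// The caller (for example, the activity body) can let the error propagate
// so that the activity fails and is retried according to its retry policy.
async function copyWithDeadline(src: string, dest: string, ms: number): Promise<void> {
  await withTimeout(copyFile(src, dest), ms, `copy ${src}`);
}
```

It may also be worth sending the heartbeat only after each file or chunk is copied rather than from an independent timer, so that a hung copy stops heartbeating and the heartbeat timeout can detect it.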