Timeouts on local host specific task queue

I am seeing timeouts happening regularly on tasks that are scheduled on the local host task queue (not the shared task queue).

This happens every time a task is scheduled to the local task queue. After the timeout, the task is scheduled onto the shared task queue and completes successfully. Any idea why this is happening? Is there anything I need to configure so that the workers pick up tasks from this local task queue?

The timeout type is StartToClose, which means it happened after the workflow task was picked up by a worker. It looks like the worker fails to process the task. Do you see any errors in the worker logs?
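
For reference, this is the per-workflow-task timeout that is set (or defaulted) on the client side when the workflow is started. A minimal sketch, assuming the Java SDK; the queue name and duration below are placeholders, not values from your setup:

import java.time.Duration;
import io.temporal.client.WorkflowOptions;

// The workflow task timeout is the StartToClose limit for a single workflow task.
// "SAMPLE_TASK_QUEUE" and the 10-second value are illustrative only.
WorkflowOptions options = WorkflowOptions.newBuilder()
        .setTaskQueue("SAMPLE_TASK_QUEUE")
        .setWorkflowTaskTimeout(Duration.ofSeconds(10))
        .build();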

I do see the following errors in the logs, any idea what could be causing them? I am running temporal-sdk 1.0.0.

[2020-10-26T19:04:22,152] ERROR [Host Local Workflow Poller: 5] io.temporal.internal.worker.Poller: Failure in thread Host Local Workflow Poller: 5 []
    java.lang.NullPointerException: null
    	at io.temporal.internal.worker.WorkflowWorker.apply(WorkflowWorker.java:272) ~[service-1.0.jar:?]
    	at io.temporal.internal.sync.SyncWorkflowWorker.apply(SyncWorkflowWorker.java:215) ~[service-1.0.jar:?]
    	at io.temporal.internal.sync.SyncWorkflowWorker.apply(SyncWorkflowWorker.java:51) ~[service-1.0.jar:?]
    	at io.temporal.internal.worker.PollWorkflowTaskDispatcher.process(PollWorkflowTaskDispatcher.java:75) ~[service-1.0.jar:?]
    	at io.temporal.internal.worker.PollWorkflowTaskDispatcher.process(PollWorkflowTaskDispatcher.java:40) ~[service-1.0.jar:?]
    	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:277) ~[service-1.0.jar:?]
    	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:242) [service-1.0.jar:?]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    	at java.lang.Thread.run(Thread.java:834) [?:?]

@maxim any idea what could be causing this?
Looking at the code where it throws the NPE, it seems like pollTaskExecutor is null in this case; not sure why.

This exception is certainly not expected. Do you see it reported constantly or only on startup? I also don’t think that it is related to the original problem of workflow task timeout.

I was not able to find any other error logs besides this. I do see the following request failure metric, which I am guessing is related.

@maxim any other suggestions on how to debug this?

After some more debugging, it looks like this was caused because I had separate workers for activity and workflow tasks both listening to the same task queue. I updated the workers to register for both workflow and activity tasks (with a single queue) and I don't see this error anymore.
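
For reference, a minimal sketch of the combined registration I ended up with (identifiers here are illustrative, not my exact code):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
WorkflowClient client = WorkflowClient.newInstance(service);
WorkerFactory factory = WorkerFactory.newInstance(client);

// Single worker on the queue, registering both workflow and activity implementations
Worker worker = factory.newWorker("FEED_TASK_QUEUE");
worker.registerWorkflowImplementationTypes(FeedWorkflowImpl.class);
worker.registerActivitiesImplementations(new FeedActivitiesImpl()); // FeedActivitiesImpl is a stand-in for my activity implementation

factory.start();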

@maxim If we have separate workers for activity and workflow tasks, should they not be sharing the same task queue?

Interesting. Let us look into this. It should work fine.

There are two completely independent task queue types for workflow and activity tasks. So it is OK to have them with the same name.

A Worker internally contains two task queue consumers, one for activities and another for workflows. If no workflows or activities are registered, the corresponding consumer is not created.
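
To illustrate, the same task queue name can be used both when starting workflows and when scheduling activities; the two still map to independent queues of the workflow and activity types. A rough sketch (queue name and timeout are placeholders):

import java.time.Duration;
import io.temporal.activity.ActivityOptions;
import io.temporal.client.WorkflowOptions;

// Client side: start the workflow on the queue.
WorkflowOptions workflowOptions = WorkflowOptions.newBuilder()
        .setTaskQueue("SAMPLE_TASK_QUEUE")
        .build();

// Workflow side: schedule activities on a queue with the same name.
// This is an activity task queue, independent of the workflow task queue above.
ActivityOptions activityOptions = ActivityOptions.newBuilder()
        .setTaskQueue("SAMPLE_TASK_QUEUE")
        .setStartToCloseTimeout(Duration.ofMinutes(5))
        .build();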

As you can see, after I made the change to combine the workers, the start_to_close timeouts went away. I am still seeing a few schedule_to_start timeouts; not sure if this is related to the sticky queue, still looking into them.

I wasn’t able to reproduce this issue locally. I split one of the existing workers that had been handling both activity and workflow tasks into two separate ones, one that handles activities only and the other workflows only, and I didn’t see any exceptions.
There might be something special in your setup; can you share with us the way you set up the workers?

@Vitaly Here is a sample of my code

private void setupFeedUploadWorkflow() {
    final WorkflowConfiguration config = // get config;
    logger.info("Setting up feed workflow workers with following config : {}", config);

    // register worker for the main workflow
    setupWorkflowWorker(FeedWorkflowImpl.class,
                        workflowTaskQueueName,
                        WorkerOptions.newBuilder()
                                .setMaxConcurrentWorkflowTaskExecutionSize(config.getWorkflowMaxConcurrentExecutionSize())
                                .setWorkflowPollThreadCount(config.getWorkflowPollThreadCount())
                                .build());

    // register worker for general feed activities
    setupActivityWorker(activities,
                        workflowTaskQueueName, // TODO consider moving activities to new task queue
                        WorkerOptions.newBuilder()
                                .setMaxConcurrentActivityExecutionSize(config.getActivityMaxConcurrentExecutionSize())
                                .setActivityPollThreadCount(config.getActivityPollThreadCount())
                                .build());
}

/**
 * Workflows are stateful. So you need a type to create instances.
 */
private void setupWorkflowWorker(final Class<?> workflow,
                                 final String taskQueueName,
                                 final WorkerOptions options) {
    final Worker worker = workerFactory.newWorker(taskQueueName, options);
    worker.registerWorkflowImplementationTypes(workflow);
}

/**
 * Activities are stateless and thread safe. So a shared instance is used.
 */
private void setupActivityWorker(final Object activity,
                                 final String taskQueueName,
                                 final WorkerOptions options) {
    final Worker worker = workerFactory.newWorker(taskQueueName, options);
    worker.registerActivitiesImplementations(activity);
}

Thanks for your example, I was able to reproduce it now.

The main difference between your code and what I’ve been doing is that you run both workers in the same Java process, creating them with the same factory, while I ran them in separate processes.

I think I have identified the source of the problem, and it is related to the way we set up the sticky task poller and register workflow workers with the dispatcher.

Here is my attempt to address the issue:

It should soon be available when we roll out the next SDK release.

@Vitaly Looks like there has not been another release after 1.0.3; could you let me know when a version with this fix will be released?

Correct, we plan to release 1.0.4 within a week or two.

Sounds good. In the interim, is there any way I can get around this? You mentioned you tried running the workers in separate processes, would you have an example?

Right now I am using the same task queue for both workflow and activity workers; I am planning on making some changes to have a little more control over my task queues.

If you use two different main functions and run two Java processes with one worker in each, it should work fine. You could also try using two different factories and creating each worker with a separate factory, but I haven’t tested whether that works with 1.0.3. Alternatively, if this is not for production use, you could try using 1.0.4-SNAPSHOT, which should have this bug fixed.
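
Something along these lines for the two-factory variant; this is only a sketch with illustrative names, and as mentioned I haven’t verified it against 1.0.3:

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
WorkflowClient client = WorkflowClient.newInstance(service);

// Factory dedicated to the workflow worker
WorkerFactory workflowFactory = WorkerFactory.newInstance(client);
Worker workflowWorker = workflowFactory.newWorker("FEED_TASK_QUEUE");
workflowWorker.registerWorkflowImplementationTypes(FeedWorkflowImpl.class);

// A separate factory dedicated to the activity worker, polling the queue of the same name
WorkerFactory activityFactory = WorkerFactory.newInstance(client);
Worker activityWorker = activityFactory.newWorker("FEED_TASK_QUEUE");
activityWorker.registerActivitiesImplementations(new FeedActivitiesImpl()); // stand-in for the activity implementation

workflowFactory.start();
activityFactory.start();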