Temporal Server errors: workflow failures and all requests to history client failed

Hi Temporal Community,

We observed a burst of workflows, each with a large number of activities. The total number of workflows at any given data point was not very high (~39), but the total number of activities at that point was ~79,200.

As a result, we started observing a lot of errors on our server. All requests to the history client failed at this point.

To me, the problematic errors are:

error type 1:

{"level":"info","ts":"2023-04-28T23:13:58.725Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:90"}

error type 2:

{"level":"info","ts":"2023-04-28T23:14:32.542Z","msg":"Activity task not found","service":"matching","component":"matching-engine","wf-namespace-id":"a13de8cb-9392-4914-bb77-656159403a7f","wf-id":"83ec3b13-8ce5-4a5c-8486-9fbeb4d45976","wf-run-id":"fd1e3d61-b610-4581-a23a-fe2ecc52ff56","wf-task-queue-name":"/_sys/TestCallTreeCreation/3","queue-task-id":2950492,"queue-task-visibility-timestamp":"2023-04-28T23:14:30.947Z","wf-history-event-id":2161,"error":"invalid activityID or activity already timed out or invoking workflow is completed","logging-call-at":"matchingEngine.go:502"}

error type 3:

{"level":"info","ts":"2023-04-28T23:15:39.214Z","msg":"Workflow task not found","service":"matching","component":"matching-engine","wf-task-queue-name":"90@ltx1-app4600.prod.linkedin.com:99b943d2-9aa7-49f9-9b64-c15a3eb48bbc","wf-namespace-id":"a13de8cb-9392-4914-bb77-656159403a7f","wf-id":"83ec3b13-8ce5-4a5c-8486-9fbeb4d45976","wf-run-id":"fd1e3d61-b610-4581-a23a-fe2ecc52ff56","wf-task-queue-name":"90@ltx1-app4600.prod.linkedin.com:99b943d2-9aa7-49f9-9b64-c15a3eb48bbc","queue-task-id":-137,"queue-task-visibility-timestamp":"2023-04-28T23:15:39.209Z","wf-history-event-id":12014,"error":"Workflow task not found.","logging-call-at":"matchingEngine.go:423"}

There were also errors of the following type, but these look benign and were also present in workflows that passed:

{"level":"warn","ts":"2023-04-28T23:12:01.439Z","msg":"history size exceeds warn limit.","shard-id":453,"address":"10.154.98.59:7234","component":"history-cache","wf-namespace-id":"a13de8cb-9392-4914-bb77-656159403a7f","wf-id":"34dbae70-d99a-44f3-863e-7773dbd29934","wf-run-id":"b951ff2b-4481-40b9-b82b-7b15b63062c2","wf-history-size":3043550,"wf-event-count":17197,"logging-call-at":"context.go:915"}

Also, to add: the workflows that fail do so with timeouts:

{
  "message": "Activity task timed out",
  "source": "",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": {
    "message": "activity ScheduleToClose timeout",
    "source": "Server",
    "stackTrace": "",
    "encodedAttributes": null,
    "cause": null,
    "timeoutFailureInfo": {
      "timeoutType": "ScheduleToClose",
      "lastHeartbeatDetails": null
    }
  },
  "activityFailureInfo": {
    "scheduledEventId": "532",
    "startedEventId": "17873",
    "identity": "",
    "activityType": {
      "name": "Create"
    },
    "activityId": "5a60d429-c2fd-3595-ba06-695480709fa9",
    "retryState": "NonRetryableFailure"
  }
}
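
For context on this failure mode: ScheduleToClose bounds the total time an activity may take, including queue time and all retry attempts, so a backlogged cluster can exhaust it before any attempt completes. Below is a sketch (Java SDK; class name and values are purely illustrative, not a recommendation) of activity options that also set a per-attempt StartToClose timeout and an explicit retry policy, which tends to make this kind of timeout easier to reason about:

import java.time.Duration;

import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;

public class ActivityOptionsSketch {
  static final ActivityOptions ACTIVITY_OPTIONS = ActivityOptions.newBuilder()
      // Total budget from scheduling to completion, across queue time and all retries.
      .setScheduleToCloseTimeout(Duration.ofHours(1))
      // Per-attempt budget; lets the server time out and retry an attempt stuck on a lost worker.
      .setStartToCloseTimeout(Duration.ofMinutes(5))
      // Explicit retry policy (illustrative values) for individual attempt failures.
      .setRetryOptions(RetryOptions.newBuilder()
          .setInitialInterval(Duration.ofSeconds(1))
          .setBackoffCoefficient(2.0)
          .setMaximumAttempts(5)
          .build())
      .build();
}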

I'm having the same issues.

@Anshumali_Prasad, check whether you are doing a lot of wf.sleep or setTimeout calls in a tight loop. These are expensive, and I just patched a bunch of such calls in a few tight workflows, replacing them with a much more efficient condition() timeout. It is way too early to say for sure, and I'll monitor for a few more days, but it seems to be helping me.

Thank you. I did check the client-side code, and it looks like a sleep of 5 minutes is being passed for the variable delayToStart:

public class TestCallTreeCreationWorkflowImpl implements TestCallTreeCreationWorkflow {
  private final static ActivityOptions ACTIVITY_OPTIONS = ActivityOptions
      .newBuilder()
      .setScheduleToCloseTimeout(Duration.ofHours(1))
      .build();

  // ActivityStubs enable calls to Activities as if they are local methods, but actually perform an RPC.
  private final CreateTestCallTreesActivity _createTestCallTreesActivity = Workflow.newActivityStub(
      CreateTestCallTreesActivity.class,
      ACTIVITY_OPTIONS);

  /**
   * Construct {@link TestCallTree} entities for the given list of CallTreeIDs and persist in database.
   *
   * @param testContextID          a {@link Long} denotes the Test Context ID to stamp on TestCallTree entities.
   * @param isFetchingCallTreeData a boolean flag denotes whether to fetch Call Tree Data against Call Tree App or not.
   * @param callTreeIDs            a {@link List} of {@link String} representation of Call Tree IDs.
   * @param delayToStart           a {@link Duration} denotes the delay-to-start the workflow. This is a {@link Nullable}
   *                               parameter.
   */
  @Override
  public void create(
      Long testContextID,
      boolean isFetchingCallTreeData,
      List<String> callTreeIDs,
      @Nullable Duration delayToStart) {
    // If the delayToStart is provided through the TestCallTreeCreationWorkflowClient, trigger the workflow execution
    // in the future time
    if (delayToStart != null) {
      Workflow.sleep(delayToStart);
    }

    // Process and persist test call tree entities in parallel
    List<Promise<String>> processPromises = StreamEx.of(callTreeIDs)
        .map(callTreeId -> Async.function(
            _createTestCallTreesActivity::create,
            testContextID,
            callTreeId,
            isFetchingCallTreeData))
        .toImmutableList();

    // Trigger promise invocations in parallel and wait for all promises to complete.
    Promise.allOf(processPromises).get();
  }
}
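
With ~79,200 activities across ~39 workflows, each execution here fans out on the order of two thousand activities at once, which also lines up with the "history size exceeds warn limit" warnings above. One way to smooth that load is to fan out the Async.function calls in bounded batches rather than all at once. A sketch of a method that could be added to the implementation above, using the same activity stub; BATCH_SIZE and the method name are assumed, illustrative choices only:

  // Sketch: schedule the activities in bounded batches instead of all at once.
  // BATCH_SIZE is an assumed tuning knob, not a recommendation.
  private static final int BATCH_SIZE = 100;

  private void createInBatches(Long testContextID, boolean isFetchingCallTreeData, List<String> callTreeIDs) {
    for (int start = 0; start < callTreeIDs.size(); start += BATCH_SIZE) {
      List<String> batch = callTreeIDs.subList(start, Math.min(start + BATCH_SIZE, callTreeIDs.size()));

      // Fan out one batch of activities in parallel.
      List<Promise<String>> batchPromises = StreamEx.of(batch)
          .map(callTreeId -> Async.function(
              _createTestCallTreesActivity::create,
              testContextID,
              callTreeId,
              isFetchingCallTreeData))
          .toImmutableList();

      // Wait for the batch to complete before scheduling the next one, so the number
      // of concurrently pending activity tasks stays bounded.
      Promise.allOf(batchPromises).get();
    }
  }

For very large fan-outs, splitting the work across child workflows or using continue-as-new is the usual way to keep any single history from growing too large.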

Hi @Anshumali_Prasad,

It is still highly recommended to use a condition() sleep rather than wf.sleep() where possible, because wf.sleep() is expensive: as I understand it, wf.sleep() changes workflow state, creates timers, and writes updates to persistence storage, while condition() is evaluated locally on the worker. For example, here is how I patched my JavaScript code:

if (patched('heartbeats-on-conditions')) {
  await wf.condition(() => false, '60s'); // wait until we acquire lock and finish execution
}
else {
  await wf.sleep(60 * 1000); // 60 seconds
}
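
For the Java SDK used earlier in the thread, the rough equivalent of that pattern would be Workflow.await with a timeout. A sketch only (class and method names are illustrative), shown for completeness:

import java.time.Duration;

import io.temporal.workflow.Workflow;

public class AwaitPatchSketch {
  // Must run inside workflow code. Mirrors the JavaScript patch above: blocks until
  // the supplied condition becomes true or 60 seconds elapse, whichever comes first.
  static void waitForConditionOrTimeout() {
    // The condition here never becomes true, so this waits the full 60 seconds.
    Workflow.await(Duration.ofSeconds(60), () -> false);
  }
}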

That said, I ended up installing Prometheus + Grafana, and after analyzing all the metrics from the Temporal cluster and from the workers/SDK, I will be increasing the number of history shards to allow for better parallel workflow/task execution.

I would recommend doing both of these steps, but that's just my personal experience/thoughts/learnings :slight_smile:

I don’t think you need condition() in your case if it’s a one-off wf.sleep()