Improve Performance on production stuff

Hi all,

With Temporal we run a workflow split into 3 phases (Java SDK):

  1. Loading phase: extracts all the data to process from the Oracle source. I save it in a microservice with MongoDB
  2. Engine phase: reads the populated data from MongoDB and processes it (read/write operations, huge dataset)
  3. Extractor phase: reads the result and copies it back to Oracle, to close the loop

From a technical point of view, the Engine phase starts only when the Loading phase has finished. The Engine phase is the part we are investigating for performance.
At the moment we have a Helm installation of Temporal (default values, just 1 replica) deployed on k8s (frontend, web, worker, history and matching).
The Temporal params are:

  • maxConcurrentWorkflowTaskExecutionSize: 1000
  • maxConcurrentActivityExecutionSize: 1000
  • maxConcurrentActivityTaskPollers: 10
  • maxWorkflowThreadCount: 180000
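For readers following along, here is a minimal sketch of where these four settings live in the Java SDK. It assumes the Temporal Java SDK is on the classpath. The class name `EngineWorkerConfig` and its two method names are hypothetical; only the builder setters come from the SDK. The first three values belong to `WorkerOptions` (per worker), while `maxWorkflowThreadCount` belongs to `WorkerFactoryOptions` (shared by all workers created from one factory).

```java
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

// Hypothetical holder class; the values are copied from the list above.
class EngineWorkerConfig {

    // Per-worker limits: concurrent workflow tasks, concurrent activities, activity pollers.
    static WorkerOptions workerOptions() {
        return WorkerOptions.newBuilder()
            .setMaxConcurrentWorkflowTaskExecutionSize(1000)
            .setMaxConcurrentActivityExecutionSize(1000)
            .setMaxConcurrentActivityTaskPollers(10)
            .build();
    }

    // Factory-wide limit on the threads used to run workflow code.
    static WorkerFactoryOptions factoryOptions() {
        return WorkerFactoryOptions.newBuilder()
            .setMaxWorkflowThreadCount(180000)
            .build();
    }
}
```

These objects would then be passed to `WorkerFactory.newInstance(client, factoryOptions)` and `factory.newWorker(taskQueue, workerOptions)`.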

The Engine works in parallel: it processes grouped data and reads its configuration directly from a yml file. We run a lot of activities in parallel this way (Async, 100 in parallel; this needs a lot of memory).
When changing the params (for example, 500 in parallel instead of 100), the problem I hit is context deadline exceeded.
I have read many forum topics containing your answers, but I didn't find a solution for my case.
Another important thing is that I don't see any correct values in Grafana. I followed the guide on the blog, but I couldn't get that working either.
I need technical help improving Temporal's performance so that we can use it for every product in our company.
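To make the "100 in parallel vs 500 in parallel" setting concrete, here is a plain-Java analogy of capping the number of in-flight tasks with a semaphore. This is only an illustration of the idea, not our workflow code: inside a Temporal workflow you must not use thread pools or `Semaphore`, and the equivalent would be built on the SDK's `Async` and `Promise` utilities. The class name `BoundedParallelism`, the method `runAll` and the 5 ms sleep standing in for an activity are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedParallelism {

    // Runs totalTasks dummy tasks while never letting more than maxInFlight
    // of them execute at once. Returns the highest concurrency observed.
    static int runAll(int totalTasks, int maxInFlight) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < totalTasks; i++) {
                permits.acquire(); // blocks the submitter once the cap is reached
                futures.add(pool.submit(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stand-in for the real activity work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // frees a slot for the next task
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak in-flight tasks: " + runAll(500, 100));
    }
}
```

Raising the cap from 100 to 500 raises the peak number of tasks holding memory at the same time by the same factor, which is the trade-off described above.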

Thanks a lot, guys.

For the context deadline exceeded error, are you getting that in your worker code, client, maybe somewhere else? Can you show the full error?

What persistence store are you using? Are you using the default 512 numHistoryShards in values.yaml? Are you changing any default configs in the helm chart?

I’m not sure that having a single replica, especially of the history and matching services, would give you a great setup for performance testing. I would probably go with 5 history, 3 frontend, 3 matching and 2 frontend (would need to configure ingress if it's >1) and go up from there. It would also be good to know the resources you are setting for the pods.

I think to start looking at improving performance you need to set up SDK and server metrics. Can you give more info on the Grafana issue you are having? What guide did you follow?