Out of memory error

Hi,
We are getting this exception:

    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:801)

We restarted the machine, but the exception keeps occurring. We only have around 20 workflows running.

Output of the top command:
top - 17:44:38 up 31 min,  1 user,  load average: 0.31, 0.48, 1.95
Tasks: 293 total,   1 running, 291 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.3 us,  0.2 sy,  0.0 ni, 90.8 id,  0.4 wa,  0.0 hi,  0.0 si,  1.2 st
KiB Mem : 32505292 total, 11266404 free, 18471948 used,  2766940 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13531228 avail Mem 



  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                               
 5635 root      20   0 1031496 344496  22540 S   2.3  1.1   5:55.94 temporal-server  

Threads:

ps -T -p 5635
  PID  SPID TTY          TIME CMD
 5635  5635 ?        00:00:00 temporal-server
 5635  6030 ?        00:00:50 temporal-server
 5635  6032 ?        00:00:00 temporal-server
 5635  6038 ?        00:00:35 temporal-server
 5635  6041 ?        00:00:32 temporal-server
 5635  6043 ?        00:00:33 temporal-server
 5635  6044 ?        00:00:34 temporal-server
 5635  6046 ?        00:00:00 temporal-server
 5635  6047 ?        00:00:33 temporal-server
 5635  6055 ?        00:00:24 temporal-server
 5635  6056 ?        00:00:35 temporal-server
 5635  6057 ?        00:00:32 temporal-server
 5635  6092 ?        00:00:34 temporal-server
 5635 10829 ?        00:00:08 temporal-server

    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:2658)
    at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:77)
    at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:37)
    at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:273)
    at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:242)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:801)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:939)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1356)
    at io.grpc.internal.ManagedChannelImpl$3.execute(ManagedChannelImpl.java:629)
    at io.grpc.internal.DnsNameResolver.resolve(DnsNameResolver.java:389)
    at io.grpc.internal.DnsNameResolver.refresh(DnsNameResolver.java:212)
    at io.grpc.internal.ManagedChannelImpl.refreshNameResolution(ManagedChannelImpl.java:468)
    at io.grpc.internal.ManagedChannelImpl.refreshAndResetNameResolution(ManagedChannelImpl.java:462)
    at io.grpc.internal.ManagedChannelImpl.handleInternalSubchannelState(ManagedChannelImpl.java:1113)
    at io.grpc.internal.ManagedChannelImpl.access$5800(ManagedChannelImpl.java:111)
    at io.grpc.internal.ManagedChannelImpl$SubchannelImpl$1ManagedInternalSubchannelCallback.onStateChange(ManagedChannelImpl.java:1782)
    at io.grpc.internal.InternalSubchannel.gotoState(InternalSubchannel.java:333)
    at io.grpc.internal.InternalSubchannel.gotoNonErrorState(InternalSubchannel.java:323)
    at io.grpc.internal.InternalSubchannel.access$300(InternalSubchannel.java:65)
    at io.grpc.internal.InternalSubchannel$TransportListener$2.run(InternalSubchannel.java:583)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
    at io.grpc.internal.InternalSubchannel$TransportListener.transportShutdown(InternalSubchannel.java:574)
    at io.grpc.netty.shaded.io.grpc.netty.ClientTransportLifecycleManager.notifyGracefulShutdown(ClientTransportLifecycleManager.java:55)
    at io.grpc.netty.shaded.io.grpc.netty.ClientTransportLifecycleManager.notifyShutdown(ClientTransportLifecycleManager.java:59)
    at io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler.onConnectionError(NettyClientHandler.java:499)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:641)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:380)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    … 1 common frames omitted
" thread="Workflow Poller taskQueue="HELLO_WORLD_CHILD_TASK_QUEUE", namespace="default": 1" message="Failure in thread Workflow Poller taskQueue="HELLO_WORLD_CHILD_TASK_QUEUE", namespace="default": 1" TAG="JIFFY.workhorse"


Log Labels:
All Jiffy Logs
Source JIFFY.workhorse
fluentd_worker 0
Parsed Fields:
TAG "JIFFY.workhorse"
level "ERROR"
logger "io.temporal.internal.worker.Poller"
message "Failure in thread Workflow Poller taskQueue="
namespace "default"
thread "Workflow Poller taskQueue="
ts 2021-03-24T17:37:41.811Z
tsNs 1616607461811000000

2021-03-24 23:07:41 level="ERROR" logger="io.temporal.internal.worker.Poller" throwable="io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:2658)
    at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:77)
    at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:37)
    at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:273)
    at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:242)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:801)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:939)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1356)
    at io.grpc.internal.ManagedChannelImpl$3.execute(ManagedChannelImpl.java:629)
    at io.grpc.internal.DnsNameResolver.resolve(DnsNameResolver.java:389)

Not sure what could be wrong


Per-thread top output for the temporal-server process:

top - 17:48:37 up 35 min, 1 user, load average: 0.23, 0.37, 1.57
Threads: 14 total, 2 running, 12 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.2 sy, 0.0 ni, 98.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.5 st
KiB Mem : 32505292 total, 11244388 free, 18492724 used, 2768180 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13510404 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6030 root 20 0 1031496 343896 22560 S 1.0 1.1 0:52.77 temporal-server
6041 root 20 0 1031496 343896 22560 S 0.7 1.1 0:33.38 temporal-server
6044 root 20 0 1031496 343896 22560 S 0.7 1.1 0:35.45 temporal-server
6055 root 20 0 1031496 343896 22560 R 0.7 1.1 0:26.21 temporal-server
6057 root 20 0 1031496 343896 22560 S 0.7 1.1 0:34.08 temporal-server
10829 root 20 0 1031496 343896 22560 S 0.7 1.1 0:10.48 temporal-server
6047 root 20 0 1031496 343896 22560 S 0.3 1.1 0:35.19 temporal-server
6092 root 20 0 1031496 343896 22560 S 0.3 1.1 0:36.38 temporal-server
5635 root 20 0 1031496 343896 22560 S 0.0 1.1 0:00.10 temporal-server
6032 root 20 0 1031496 343896 22560 S 0.0 1.1 0:00.00 temporal-server
6038 root 20 0 1031496 343896 22560 S 0.0 1.1 0:36.39 temporal-server
6043 root 20 0 1031496 343896 22560 S 0.0 1.1 0:35.02 temporal-server
6046 root 20 0 1031496 343896 22560 S 0.0 1.1 0:00.00 temporal-server
6056 root 20 0 1031496 343896 22560 R 0.0 1.1 0:35.99 temporal-server



Thanks

It looks like your client process is running out of threads. Have you looked at a thread dump to see what consumes them?
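
A thread dump can be captured with the JDK's jstack <pid> tool. The grouping can also be done in-process with the standard ThreadMXBean API; here is a rough sketch (the class name and the name-grouping heuristic are illustrative, not from this thread):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

// Counts live JVM threads grouped by name prefix, to spot a pool that leaks
// threads (e.g. thousands of entries under a single prefix).
public class ThreadCounter {
  public static void main(String[] args) {
    ThreadInfo[] threads =
        ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    Map<String, Integer> counts = new TreeMap<>();
    for (ThreadInfo t : threads) {
      // Collapse trailing pool indices so "ConnectionBackoffResetter-thread-0",
      // "-1", ... all count under "ConnectionBackoffResetter-thread".
      String prefix = t.getThreadName().replaceAll("-?\\d+$", "");
      counts.merge(prefix, 1, Integer::sum);
    }
    counts.forEach((name, n) -> System.out.println(n + "\t" + name));
  }
}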

Thanks @maxim. We took a thread dump, and the following is what we found: there are nearly 2,500 threads in this state.

2468 matches for "ConnectionBackoffResetter-thread" in buffer: tdumpmaster.txt
1032:"ConnectionBackoffResetter-thread-0" #342 daemon prio=5 os_prio=0 cpu=156.56ms elapsed=35336.53s tid=0x00007fb75c4ad560 nid=0x19c8 waiting on condition [0x00007fb814bae000]
1389:"ConnectionBackoffResetter-thread-0" #399 daemon prio=5 os_prio=0 cpu=152.51ms elapsed=35328.85s …

"ConnectionBackoffResetter-thread-0" #454 daemon prio=5 os_prio=0 cpu=165.93ms elapsed=35325.02s tid=0x00007fb7d011b990 nid=0x1b2a waiting on condition [0x00007fb7328f6000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
- parking to wait for <0x00000006148576b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@15.0.1/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@15.0.1/AbstractQueuedSynchronizer.java:1661)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@15.0.1/ScheduledThreadPoolExecutor.java:1182)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@15.0.1/ScheduledThreadPoolExecutor.java:899)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@15.0.1/ThreadPoolExecutor.java:1056)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@15.0.1/ThreadPoolExecutor.java:1116)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@15.0.1/ThreadPoolExecutor.java:630)
at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

We looked at the Temporal code and found that ConnectionBackoffResetter is created inside WorkflowServiceStubs. We were creating the stubs with

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();

and we were doing this for each run of a workflow, even though it is supposed to be used as a singleton. We removed this code and changed our code to create only one instance.
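
For illustration, the per-run pattern described above would look roughly like the sketch below. Only the WorkflowServiceStubs.newInstance() line is quoted from this thread; the workflow interface, the task queue name (taken from the log above), and the starter method are hypothetical stand-ins.

import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class PerRunStarter {

  // Hypothetical workflow interface, only here to make the sketch self-contained.
  @WorkflowInterface
  public interface HelloWorldWorkflow {
    @WorkflowMethod
    void run();
  }

  // Anti-pattern: every call builds a fresh WorkflowServiceStubs, which opens
  // its own gRPC channel and background thread pools (including the
  // ConnectionBackoffResetter threads seen in the dump) and is never shut
  // down, so the native thread count grows until Thread.start() fails.
  public void startWorkflow(String workflowId) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);
    HelloWorldWorkflow stub = client.newWorkflowStub(
        HelloWorldWorkflow.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("HELLO_WORLD_CHILD_TASK_QUEUE")
            .setWorkflowId(workflowId)
            .build());
    WorkflowClient.start(stub::run);
  }
}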

Just wanted to confirm whether this was the main reason
Thanks

I see. WorkflowServiceStubs is a heavyweight object and should be created once per process and shared. So switching from an instance per request to a shared one is the right solution.
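
A minimal sketch of that shared-instance setup, assuming the same newInstance() factory used above (the class and method names here are illustrative):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;

// One WorkflowServiceStubs (one gRPC channel plus its background threads) and
// one WorkflowClient for the whole process; both are thread-safe and are
// designed to be shared across all workflow starts.
public final class TemporalService {

  private static final WorkflowServiceStubs SERVICE =
      WorkflowServiceStubs.newInstance();
  private static final WorkflowClient CLIENT =
      WorkflowClient.newInstance(SERVICE);

  private TemporalService() {}

  public static WorkflowClient client() {
    return CLIENT;
  }
}

Callers then obtain stubs from TemporalService.client() on every request instead of constructing new service stubs.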