
Application Context: We are running an application based on spring-boot:3.2.0 that exposes REST APIs to clients via spring-boot-starter-web, integrates with our datastore via org.springframework.boot:spring-boot-starter-data-elasticsearch, uses an in-memory cache based on com.github.ben-manes.caffeine:caffeine, and handles messaging with org.springframework.kafka:spring-kafka.

Issue: Every night at exactly 00:00, the thread count of our application spikes, and only a few of those threads are terminated afterwards. This leaves the application with an incrementally growing thread count every day and sometimes results in pod restarts (we suspect OOM, since memory usage grows alongside the thread count).


Observation: During this time, there is little deviation in the throughput pattern for either the REST requests or the message-consuming pipelines.


Attempt: We analyzed thread dumps of the application and compared dumps taken on two different days. The most visible difference was a significant number of threads stuck in the TIMED_WAITING state.
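(For reference, a quick way to produce a comparable per-state summary is the standard ThreadMXBean API; the class below is only an illustrative sketch, not something from our codebase.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not part of our application): summarises live JVM
// threads by state, similar to what comparing two thread dumps surfaces.
public class ThreadStateSummary {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(lockedMonitors, lockedSynchronizers) - only the states are needed here
        ThreadInfo[] infos = threads.dumpAllThreads(false, false);

        Map<Thread.State, Integer> countsByState = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            countsByState.merge(info.getThreadState(), 1, Integer::sum);
        }

        // e.g. RUNNABLE=12, TIMED_WAITING=180, WAITING=20 ...
        countsByState.forEach((state, count) ->
                System.out.println(state + "=" + count));
    }
}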


The stack trace for these threads looks like:

"http-nio-8080-exec-61" #37660 [37664] daemon prio=5 os_prio=0 cpu=38462.67ms elapsed=46889.40s tid=0x00007f2e0c0557b0 nid=37664 waiting on condition [0x00007f2ddf2fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21/Native Method) - parking to wait for <0x0000000684f26058> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21/LockSupport.java:269) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21/AbstractQueuedSynchronizer.java:1758) at java.util.concurrent.LinkedBlockingQueue.poll(java.base@21/LinkedBlockingQueue.java:460) at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:99) at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33) at org.apache.tomcat.util.threads.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1113) at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176) at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.runWith(java.base@21/Thread.java:1596) at java.lang.Thread.run(java.base@21/Thread.java:1583) 

Help Wanted: From here, we are unsure which direction to look into. We are hoping for some hints from the community on:

  1. Could one of the libraries mentioned have a default configuration that performs some task internal to the JVM exactly at 00:00?
  2. Based on the stack traces in the dump, how do we figure out which process created the object that so many threads are waiting for? (additional info)

Note - I can share the thread dump if that helps, as well as further information as required. As mentioned, this happens every night with our application.


Update: The web and Tomcat properties overridden in the application's setup are:

spring.servlet.multipart.max-file-size=10MB
spring.servlet.multipart.max-request-size=10MB
server.servlet.context-path=/places/api
server.tomcat.accesslog.enabled=true
server.tomcat.accesslog.suffix=.log
server.tomcat.accesslog.prefix=access_log
server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.pattern=%h %l %t "%r" %s %b "%{Referer}i" "%{User-Agent}i" %D
server.tomcat.accesslog.rotate=true
server.tomcat.basedir=/var/log
server.tomcat.accesslog.directory=places
server.tomcat.accesslog.max-days=7
  • What are the names of the threads that keep piling up? Are they all like http-nio-8080-exec? Does the application have any non-default configuration regarding spring-web or Tomcat? And in which timezone is 00:00? Is it the JVM's user timezone or UTC? Commented Mar 8, 2024 at 12:24
  • @SpaceTrucker Thanks for your response. Yes, most of them are Tomcat-initiated ones like http-nio-8080-exec. The timezone is IST [UTC+05:30]. There are very few non-default configurations; I've added them to the question. Commented Mar 8, 2024 at 12:39
  • @SpaceTrucker Now that I've taken a closer look at the configuration explicitly used for the application, could it be the log rotation? Is there a way to confirm that? Commented Mar 8, 2024 at 12:46
  • It looks to me like there's a sudden burst of HTTP traffic that causes Tomcat to spin up extra request handling threads. These threads will stop if they're left sitting idle for too long and Tomcat's thread pool will drop back down to its core pool size. Can you look at your app's access logs to test this theory? Commented Mar 8, 2024 at 13:11
  • @Bohemian No, it doesn't, and I have figured out the cause: it's the access log rotation! I will post an update with my findings soon, but I am searching Tomcat's bug tracker for any existing references to this. Interestingly, there is a related effect when virtual threads are enabled that impacts the application's performance, which is what keeps me occupied with a proof of concept and further findings for now. Commented Mar 18, 2024 at 2:56

1 Answer


First Phase:

Thanks to the comment that led me to investigate the non-default configurations the application had. In our case, the relevant ones were:

server.tomcat.accesslog.file-date-format=.yyyy-MM-dd
server.tomcat.accesslog.rotate=true

which meant that every day, exactly at 00:00, our access log files were rotated to a new filename. This rotation was leaving multiple Tomcat threads in the TIMED_WAITING state.

After changing the rotation interval from a day to a month, the impact was clearly visible: the thread count no longer spiked at midnight. Still, some threads ended up in the TIMED_WAITING state at odd intervals.
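For reference, the change amounts to widening the date pattern used for the rotated file name. A monthly pattern such as the one below is a sketch of what we switched to (the exact value in your setup may differ):

# rotate access logs per month instead of per day
server.tomcat.accesslog.file-date-format=.yyyy-MM
server.tomcat.accesslog.rotate=true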


Second Phase:

Upon further inspection of the application's thread dumps, we still suspected logging. We found that we were deploying the application across multiple pods (three in our case) on a Kubernetes cluster over GCP, with each pod writing to the same access log file on a mounted disk. I suspect this contention is what pushed a significant number of threads into a waiting state and made the issue visible.

I validated this by adding a pod-based prefix to the log files, covering the application logs, access logs, and GC logs. Since then, the application has been running normally over multiple days.
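A minimal sketch of the idea, assuming the pod name is exposed to the container via the HOSTNAME environment variable (the property values below are illustrative, not our exact configuration):

# give each pod its own access log file instead of sharing one on the mounted disk
server.tomcat.accesslog.prefix=access_log_${HOSTNAME}
server.tomcat.accesslog.directory=places

Spring Boot resolves the ${HOSTNAME} placeholder from the environment, and on Kubernetes HOSTNAME defaults to the pod name, so each pod ends up with its own log file.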


Note:

(1) I have filed a bug on the Tomcat bug tracker, since searching did not yield any existing report of this issue or any reference stating that such a setup should be avoided.

(2) I also suspect that the impact of this misconfiguration could have been far more degrading had the application been using virtual threads.
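For completeness, this note refers to the scenario where virtual threads are opted into; in Spring Boot 3.2 that is controlled by a single property (shown here only to indicate which setting is meant):

# Spring Boot 3.2+: run supported executors (including Tomcat request handling) on virtual threads
spring.threads.virtual.enabled=true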

