I've run into an issue related to performance of Spring Boot 3 application with virtual threads enabled.
TL;DR reproduction steps:
- Checkout https://github.com/stsypanov/concurrency-demo
- Run
DependencyApplicationandConcurrencyDemoApplication - When both apps are up run
StuckApplicationTest - It will take about 1-2 minutes for the test to complete
- Now go to
demo-service/application.ymland setspring.threads.virtual.enabled: true(by default it'sfalse). - Restart
ConcurrencyDemoApplication - Run
StuckApplicationTestagain - Run
YourKitprofiler and connect toConcurrencyDemoApplication, almost immediately you'll see a warning message about potential deadlock and the application itself hangs as all threads of itsForkJoinPoolhaveWaitingstatus.
More detailed description:
General set-up:
- Windows 11
- 13th Gen Intel(R) Core(TM) i7-1370P
- Liberica JDK 21.0.4
- Spring Boot 3.3.2
org.springframework.cloud:spring-cloud-dependencies:2023.0.3io.github.openfeign:feign-httpclient
To reproduce the issue you need a test doing not that many concurrent calls to a Spring Boot -based microservice, e.g. calling /actuator/health:
@Test void run() { var restTemplate = new RestTemplate(); var latch = new CountDownLatch(1); var executor = Executors.newVirtualThreadPerTaskExecutor(); try (executor) { for (int i = 0; i < 100; i++) { executor.submit(() -> { waitOn(latch); assertNotNull(restTemplate.getForEntity("http://localhost:8081/actuator/health", ResponseEntity.class)); }); } latch.countDown(); } assertTrue(executor.isTerminated()); } private static void waitOn(CountDownLatch latch) { try { latch.await(); } catch (InterruptedException e) { throw new RuntimeException(e); } } This test sends GET requests to, let's say, Service A. In Service A I have a pretty simple instance of HealthIndicator and Feign client:
@Component @RequiredArgsConstructor public class DownstreamServiceHealthIndicator implements HealthIndicator { private final HealthClient healthClient; @Override public Health health() { var response = healthClient.checkHealth(); if (response.getStatusCode().is2xxSuccessful()) { return new Health.Builder().up().build(); } return new Health.Builder().down().withDetails(Map.of("response", response)).build(); } } @FeignClient(name = "healthClient", url = "http://localhost:8087/actuator/health", configuration = InternalFeignConfiguration.class) public interface HealthClient { @GetMapping ResponseEntity<String> checkHealth(); } public class InternalFeignConfiguration { @Bean public Client client() { return new ApacheHttpClient(HttpClients.createDefault()); } } Again, the test concurrently calls /actuator/health of Service A via RestTemplate and Service A calls /actuator/health of Service B via Feign client. Service B consists of the application main class and application.yml (see the code in the repository specified above), declaring health endpoint.
When you run the system with default settings it's OK. It takes ~1,5 second for the test to complete, but everything else is fine.
However, having virtual threads on, the Service A gets stuck and if you connect to it with e.g. YourKit profiler you'll get a warning message about potential deadlock with this stack trace:
+-----------------------------------------------------------------------------------------------------------------------------+ | Name | +-----------------------------------------------------------------------------------------------------------------------------+ | +---Read-Updater Frozen for at least 10s <Ignore a false positive> | | | +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native) | | | +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371 | | | +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458 | | | +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613 | | | +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257 | | | +---sun.nio.ch.Poller.updateLoop() Poller.java:286 | | | +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run() | | | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 | | | +---java.lang.Thread.run() Thread.java:1583 | | | +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186 | | | | | +---spring.cloud.inetutils Frozen for at least 10s <Ignore a false positive> | | | +---java.net.Inet6AddressImpl.getHostByAddr(byte[]) Inet6AddressImpl.java (native) | | | +---java.net.InetAddress$PlatformResolver.lookupByAddress(byte[]) InetAddress.java:1225 | | | +---java.net.InetAddress.getHostFromNameService(InetAddress, boolean) InetAddress.java:840 | | | +---java.net.InetAddress.getHostName(boolean) InetAddress.java:782 | | | +---java.net.InetAddress.getHostName() InetAddress.java:754 | | | +---org.springframework.cloud.commons.util.InetUtils$$Lambda.0x0000024081187240.call() | | | +---java.util.concurrent.FutureTask.run() FutureTask.java:317 | | | +---java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1144 | | | +---java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:642 | | | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 | | | +---java.lang.Thread.run() Thread.java:1583 | | | | | +---Write-Updater Frozen for at least 10s <Ignore a false positive> | | +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native) | | +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371 | | +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458 | | +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613 | | +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257 | | +---sun.nio.ch.Poller.updateLoop() Poller.java:286 | | +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run() | | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 | | +---java.lang.Thread.run() Thread.java:1583 | | +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186 | +-----------------------------------------------------------------------------------------------------------------------------+ As it appears from the above, the application's bottleneck is in this method:
// class org.springframework.cloud.commons.util.InetUtils public HostInfo convertAddress(final InetAddress address) { HostInfo hostInfo = new HostInfo(); Future<String> result = this.executorService.submit(address::getHostName); // <--- String hostname; try { hostname = result.get(this.properties.getTimeoutSeconds(), TimeUnit.SECONDS); // <--- } catch (Exception e) { this.log.info("Cannot determine local hostname"); hostname = "localhost"; } hostInfo.setHostname(hostname); hostInfo.setIpAddress(address.getHostAddress()); return hostInfo; } and the root cause is likely to be SingleThreadExecutor used to detect host name when doing a remote call:
public InetUtils(final InetUtilsProperties properties) { this.properties = properties; this.executorService = Executors.newSingleThreadExecutor(r -> { Thread thread = new Thread(r); thread.setName(InetUtilsProperties.PREFIX); thread.setDaemon(true); return thread; }); } In practice that means only one task can be executed at a moment, i.e. when parallel threads are doing simultaneous calls they resolve the host name sequentially!
What puzzles me is the fact that if I turn the virtual threads off there'll be no hung in the application.
Can anyone explain whether my theory about the root cause of the issue is correct and why the application counter-intuitively survives on platform (heavyweight) threads and dies on virtual (i.e. lightweight) ones?
CountDownLatchbut only calllatch.countDown();once. So 99 threads will wait forever?CountDownLatchhere is a kind of barrier: it is instantiated with count = 1 and when I dolatch.countDown()all the threads blocked atlatch.await()are released and start sending requests simultaneously.