The EpollSocketChannel object is too large to be reclaimed by the jvm #3416

userlaojie · 2024-08-31T09:00:01Z

Hello, we are rebuilding our system on spring-webflux. After the service was started in a Linux environment, memory usage kept growing and was never reclaimed by the JVM. After pulling a heap dump from the service, we suspect that cross-references within the WebClient connection pool prevent the EpollSocketChannel objects from being reclaimed.
Please help check whether there is a problem with our WebClient configuration, or suggest other aspects we should look into. Thank you.

This is the MAT analysis of the heap dump; a single object retains over 80 MB.

This is the JVM memory usage from monitoring.

Steps to Reproduce

The webclient configuration parameters are as follows:

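import com.fasterxml.jackson.annotation.JsonInclude;
import io.netty.channel.ChannelOption;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.util.InsecureTrustManagerFactory;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import javax.net.ssl.SSLException;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.MediaType;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.http.codec.json.Jackson2JsonEncoder;
import org.springframework.web.reactive.function.client.WebClient;
import org.springframework.web.util.DefaultUriBuilderFactory;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;
// JsonUtils is a project-specific helper; its import is not shown in the report.

@Slf4j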
@Configuration
public class WebClientConfig {
    @Data
    @ConfigurationProperties(prefix = "business.webclient")
    @Configuration
    static class WebClientConnectionConfig {
        private int pendingAcquireTimeout = 50;
        private int maxConnections = 32;
        private int pendingAcquireMaxCount = 1000;
        private long maxIdleTime = 10000;
        private long maxLifeTime = -1;
        private int connectTimeout = 2000;
        private long responseTimeout = 3000;
        private long writeTimeout = 10000;
        private long evictionIntervalTime = 120000;

        @Override
        public String toString() {
            return "WebClientConnectionConfig{" +
                    "pendingAcquireTimeout=" + pendingAcquireTimeout +
                    ", maxConnections=" + maxConnections +
                    ", pendingAcquireMaxCount=" + pendingAcquireMaxCount +
                    ", maxIdleTime=" + maxIdleTime +
                    ", maxLifeTime=" + maxLifeTime +
                    ", connectTimeout=" + connectTimeout +
                    ", responseTimeout=" + responseTimeout +
                    ", writeTimeout=" + writeTimeout +
                    ", evictionIntervalTime=" + evictionIntervalTime +
                    '}';
        }
    }

    @Bean
    public HttpClient httpClient(@Qualifier("webClientConfig.WebClientConnectionConfig") final WebClientConnectionConfig config) throws SSLException {
        log.info("webClientConfig.WebClientConnectionConfig:{}", config.toString());
        ConnectionProvider provider = ConnectionProvider.builder("biz-http-client")
                .pendingAcquireTimeout(Duration.ofMillis(config.getPendingAcquireTimeout()))
                .maxConnections(config.getMaxConnections())
                .maxIdleTime(Duration.ofMillis(config.getMaxIdleTime()))
                .maxLifeTime(Duration.ofMillis(config.getMaxLifeTime()))
                .pendingAcquireMaxCount(config.getPendingAcquireMaxCount())
                .evictInBackground(Duration.ofMillis(config.getEvictionIntervalTime()))
                .build();

        SslContext context = SslContextBuilder.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE).build();

        return HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, config.getConnectTimeout())
                .responseTimeout(Duration.ofMillis(config.getResponseTimeout()))
                .doOnConnected(conn -> conn.addHandlerLast(new WriteTimeoutHandler(config.getWriteTimeout(), TimeUnit.MILLISECONDS)))
                .secure(t -> t.sslContext(context));
    }

    @Bean
    public WebClient webClient(@Qualifier("httpClient") final HttpClient httpClient) {
        WebClient.Builder builder = WebClient.builder();
        DefaultUriBuilderFactory factory = new DefaultUriBuilderFactory();
        factory.setEncodingMode(DefaultUriBuilderFactory.EncodingMode.NONE);

        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .codecs(configurer -> configurer.defaultCodecs()
                        .jackson2JsonEncoder(new Jackson2JsonEncoder(JsonUtils.getMapper()
                                .setSerializationInclusion(JsonInclude.Include.NON_NULL), MediaType.APPLICATION_JSON)))
                .defaultHeader("Content-Type", "application/json; charset=UTF-8")
                .uriBuilderFactory(factory)
                .build();
    }
}

Possible Solution

I have two hypotheses. The first is that ByteBuf references are not released when the epoll transport is used; the second is that the WebClient connection pool is misconfigured.
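
One way to test the first hypothesis is Netty's built-in leak detector, which reports ByteBuf instances that were garbage-collected without being released. The sketch below is a minimal illustration, not part of the original report: the class name `LeakDetectionBootstrap` is hypothetical, and it assumes the `io.netty.leakDetection.level` system property is set before any Netty class is loaded. The `paranoid` level tracks every buffer and adds noticeable overhead, so it is better suited to a test environment than to production.

```java
public class LeakDetectionBootstrap {

    // Netty reads this system property once, when its leak detector class is initialized.
    static final String LEAK_LEVEL_PROPERTY = "io.netty.leakDetection.level";

    /** Requests the most thorough leak detection level; call before any Netty class is loaded. */
    static void enableParanoidLeakDetection() {
        System.setProperty(LEAK_LEVEL_PROPERTY, "paranoid");
    }

    public static void main(String[] args) {
        enableParanoidLeakDetection();
        // In a real service, the application startup (for example SpringApplication.run) would follow here.
        System.out.println(LEAK_LEVEL_PROPERTY + "=" + System.getProperty(LEAK_LEVEL_PROPERTY));
    }
}
```

The same setting can be passed on the command line as `-Dio.netty.leakDetection.level=paranoid`. If no leak reports appear in the logs under load, the ByteBuf hypothesis becomes less likely and attention can shift to the connection pool.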

Your Environment

Reactor version(s) used: 1.1.20
Other relevant libraries versions (e.g. netty, ...):

<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-webflux</artifactId>
	<version>3.2.7</version>
</dependency>

JVM version (java -version): OpenJDK
System version (e.g. uname -a): CentOS Linux 7 (Core)


userlaojie · 2024-08-31T09:09:41Z

These are the jstat GC statistics; the CGC and YGC counts are almost the same.
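
For anyone who cannot run jstat inside the container, similar counters can be read in-process through the standard `java.lang.management` API. The following is a minimal sketch under that assumption; the class name `GcCounters` is hypothetical, and the collector names it prints depend on the garbage collector in use, so they will not map one-to-one onto the jstat columns.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class GcCounters {

    /** Returns the cumulative collection count per collector, keyed by collector name. */
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the count is undefined for a collector.
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        collectionCounts().forEach((name, count) ->
                System.out.println(name + " collections=" + count));
    }
}
```

Logging these values periodically makes it possible to compare young and concurrent collection counts over time without attaching an external tool.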

kzander91 · 2024-09-09T08:22:37Z

I believe we have the same issue.
reactor-netty 1.1.22
Netty 4.1.112
Spring Boot 3.3.3
uname -a: Linux batch-service-794ddfb76-bqnlb 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
java -version:

openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)

EpollSocketChannel objects are not being garbage-collected:

I looked into some of these instances and they all seem to be referenced by invalidated pooled connections:

Have recent releases changed anything w.r.t. pool entry invalidation? Note that we have not configured the connection pool in any way (reactor.netty.pool.maxIdleTime and friends), so all the defaults should apply.
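
To confirm that the pool really runs on defaults, it can help to check at startup whether any pool-related system property is set. The sketch below is illustrative only: the class name `PoolPropertyCheck` is hypothetical, and apart from `reactor.netty.pool.maxIdleTime`, which is mentioned above, the property names are assumptions taken from the Reactor Netty reference documentation and should be verified against the version in use.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PoolPropertyCheck {

    // Only maxIdleTime is named in this thread; the other names are assumptions to verify.
    static final List<String> POOL_PROPERTIES = List.of(
            "reactor.netty.pool.maxConnections",
            "reactor.netty.pool.acquireTimeout",
            "reactor.netty.pool.maxIdleTime",
            "reactor.netty.pool.maxLifeTime",
            "reactor.netty.pool.leasingStrategy");

    /** Returns the pool properties that are explicitly set, in declaration order. */
    static Map<String, String> overrides() {
        Map<String, String> found = new LinkedHashMap<>();
        for (String name : POOL_PROPERTIES) {
            String value = System.getProperty(name);
            if (value != null) {
                found.put(name, value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> found = overrides();
        if (found.isEmpty()) {
            System.out.println("No pool system properties set; library defaults apply.");
        } else {
            found.forEach((name, value) -> System.out.println(name + "=" + value));
        }
    }
}
```

This only covers system properties; a pool configured programmatically through `ConnectionProvider.builder(...)` would not show up here.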

lfs1985 · 2024-09-10T06:41:45Z

We have the same issue.
Reactor version(s) used: 1.1.20
Spring Boot: 3.2.7
JVM version (java -version): OpenJDK 17
EpollSocketChannel instances retain more than 1.27 GB but are not being garbage-collected.

violetagg · 2024-09-10T13:12:16Z

All, please try to provide a reproducible example.

userlaojie · 2024-09-12T02:07:35Z

OK, we will try to reproduce the scenario locally by load-testing the endpoint with JMeter, which will take a day.

kzander91 · 2024-09-17T05:50:24Z

@userlaojie any luck so far? I myself have been unable to reliably reproduce it.
The tricky thing is that even in my production application, the leak doesn't always happen.
Sometimes it starts leaking until a crash, then, after the reboot, everything is fine for many days.

Considering that in my heap dump the pool refs are all in STATE_INVALIDATED, maybe it's related to connections being closed abnormally?

vitjouda · 2024-09-17T15:01:35Z

Hi @userlaojie, I strongly believe you came across the same problem as I did, please check my issue if you observe the same behavior. I spent a lot of time trying to simulate it locally, but never managed to produce a reliable reproducer.

userlaojie · 2024-09-24T02:49:10Z

Sorry, once again we could not reproduce this locally. Our latest step is to remove as many factors as possible that could keep HTTP connections from being released, such as eliminating Micrometer usage and not using a custom MeterRegistry.
The following is the latest monitoring data; memory is still too high on some pods:
channel-qrcode-pay-7686b6d777-5pjgj

channel-qrcode-pay-6959cf9bb4-b5zst

vitjouda · 2024-09-24T09:13:00Z

Hi, I managed to replicate part of the problem and am currently discussing it on Gitter. If you have the same problem, there are two ways to mitigate it at the moment: either replace reactor-netty with a different WebClient-supported library (we used Apache HttpClient 5, which works well), or, if you can afford it, disable connection keep-alive. In our case, both options eliminate the leak. Of course, disabling keep-alive is not a great long-term solution, but you can at least verify whether it's the same problem. The performance hit will depend on your use case.
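
The two mitigations could look roughly like the configuration fragment below. This is a sketch under assumptions, not the setup used by anyone in this thread: the class and bean names are hypothetical, it assumes Spring's `ReactorClientHttpConnector` and `HttpComponentsClientHttpConnector` plus Reactor Netty's `HttpClient.keepAlive(boolean)`, and the Apache variant additionally needs the httpclient5 and httpcore5-reactive dependencies on the classpath. In practice only one of the two beans would be defined.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.HttpComponentsClientHttpConnector;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

@Configuration
public class LeakMitigationConfig { // hypothetical class name

    // Option A: keep Reactor Netty but disable keep-alive, so connections
    // are closed after each exchange instead of being reused from the pool.
    @Bean
    public WebClient noKeepAliveWebClient() {
        HttpClient httpClient = HttpClient.create().keepAlive(false);
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }

    // Option B: swap the underlying HTTP library for Apache HttpClient 5,
    // using its default client settings.
    @Bean
    public WebClient apacheWebClient() {
        return WebClient.builder()
                .clientConnector(new HttpComponentsClientHttpConnector())
                .build();
    }
}
```

Option A is mainly useful as a diagnostic: if memory stops growing with keep-alive disabled, the leak is likely tied to pooled connections, at the cost of a new connection (and TLS handshake) per request.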

userlaojie added the status/need-triage and type/bug labels Aug 31, 2024

violetagg removed the status/need-triage label Sep 10, 2024

violetagg self-assigned this Sep 10, 2024

violetagg added the for/user-attention label Sep 10, 2024
