The EpollSocketChannel object is too large to be reclaimed by the jvm #3416

Open
userlaojie opened this issue Aug 31, 2024 · 9 comments
Labels
for/user-attention This issue needs user attention (feedback, rework, etc...) type/bug A general bug

Comments

@userlaojie

Hello, we are revamping our system with spring-webflux. After starting the service in a Linux environment, we found that memory usage kept increasing and was never reclaimed by the JVM. After pulling a heap dump from the service, we suspect that the WebClient connection pool holds cross-references that prevent the EpollSocketChannel objects from being reclaimed.
Please help check whether there is a problem with our WebClient configuration, or suggest other aspects to investigate. Thank you.

This is the MAT analysis, which shows a single object retaining more than 80 MB.
image
image

This is the JVM memory usage from monitoring.
image
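
For readers who want to capture a comparable heap dump for MAT without attaching an external tool, here is a minimal sketch. It assumes a HotSpot-based JVM (the `com.sun.management` API is not part of the Java SE standard), and the class name `HeapDumper` is hypothetical, not something from the reporter's project.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the reporter's code base.
public final class HeapDumper {

    private HeapDumper() {
    }

    /**
     * Writes a heap dump of the current JVM to the given file.
     * Only reachable ("live") objects are included, which is usually what
     * matters when looking for objects that are retained unexpectedly.
     * The target file name should end in ".hprof".
     */
    public static Path dumpLiveObjects(Path target) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the file already exists, so remove any stale dump first.
        Files.deleteIfExists(target);
        bean.dumpHeap(target.toAbsolutePath().toString(), true);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path out = dumpLiveObjects(Path.of("service-heap.hprof"));
        System.out.println("Heap dump written to " + out.toAbsolutePath());
    }
}
```

The resulting file can be opened in MAT to inspect the retained size of the `EpollSocketChannel` instances and the paths that keep them reachable.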

Steps to Reproduce

The webclient configuration parameters are as follows:

@Slf4j
@Configuration
public class WebClientConfig {
    @Data
    @ConfigurationProperties(prefix = "business.webclient")
    @Configuration
    static class WebClientConnectionConfig {
        private int pendingAcquireTimeout = 50;
        private int maxConnections = 32;
        private int pendingAcquireMaxCount = 1000;
        private long maxIdleTime = 10000;
        private long maxLifeTime = -1;
        private int connectTimeout = 2000;
        private long responseTimeout = 3000;
        private long writeTimeout = 10000;
        private long evictionIntervalTime = 120000;

        @Override
        public String toString() {
            return "WebClientConnectionConfig{" +
                    "pendingAcquireTimeout=" + pendingAcquireTimeout +
                    ", maxConnections=" + maxConnections +
                    ", pendingAcquireMaxCount=" + pendingAcquireMaxCount +
                    ", maxIdleTime=" + maxIdleTime +
                    ", maxLifeTime=" + maxLifeTime +
                    ", connectTimeout=" + connectTimeout +
                    ", responseTimeout=" + responseTimeout +
                    ", writeTimeout=" + writeTimeout +
                    ", evictionIntervalTime=" + evictionIntervalTime +
                    '}';
        }
    }

    @Bean
    public HttpClient httpClient(@Qualifier("webClientConfig.WebClientConnectionConfig")final WebClientConnectionConfig config) throws SSLException {
        log.info("webClientConfig.WebClientConnectionConfig:{}", config.toString());
        ConnectionProvider provider = ConnectionProvider.builder("biz-http-client")
                .pendingAcquireTimeout(Duration.ofMillis(config.getPendingAcquireTimeout()))
                .maxConnections(config.getMaxConnections())
                .maxIdleTime(Duration.ofMillis(config.getMaxIdleTime()))
                .maxLifeTime(Duration.ofMillis(config.getMaxLifeTime()))
                .pendingAcquireMaxCount(config.getPendingAcquireMaxCount())
                .evictInBackground(Duration.ofMillis(config.getEvictionIntervalTime()))
                .build();

        SslContext context = SslContextBuilder.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE).build();

        return HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, config.getConnectTimeout())
                .responseTimeout(Duration.ofMillis(config.getResponseTimeout()))
                .doOnConnected(conn -> conn.addHandlerLast(new WriteTimeoutHandler(config.getWriteTimeout(), TimeUnit.MILLISECONDS)))
                .secure(t->t.sslContext(context));
    }

    @Bean
    public WebClient webClient(@Qualifier("httpClient")final HttpClient httpClient) {
        WebClient.Builder builder = WebClient.builder();
        DefaultUriBuilderFactory factory = new DefaultUriBuilderFactory();
        factory.setEncodingMode(DefaultUriBuilderFactory.EncodingMode.NONE);

        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .codecs(configurer -> {configurer.defaultCodecs()
                        .jackson2JsonEncoder(new Jackson2JsonEncoder(JsonUtils.getMapper()
                                .setSerializationInclusion(JsonInclude.Include.NON_NULL), MediaType.APPLICATION_JSON));
                })
                .defaultHeader("Content-Type", "application/json; charset=UTF-8")
                .uriBuilderFactory(factory)
                .build();
    }
}

Possible Solution

I have two hypotheses. The first is that ByteBuf references are not released under the epoll transport; the second is that the WebClient connection pool is misconfigured.
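
One way to check the first possibility raised above (unreleased ByteBuf references) is to raise Netty's leak detection level. The sketch below sets the `io.netty.leakDetection.level` system property programmatically, which should be equivalent to passing `-Dio.netty.leakDetection.level=paranoid` on the command line. The class name `NettyLeakDetection` is hypothetical, and the property only takes effect if it is set before Netty initializes its leak detector.

```java
// Hypothetical helper; call it at the very start of main(), before any Netty class is used.
public final class NettyLeakDetection {

    // System property read by Netty when its leak detector is initialized.
    static final String LEVEL_PROPERTY = "io.netty.leakDetection.level";

    private NettyLeakDetection() {
    }

    /**
     * Sets the leak detection level (disabled, simple, advanced or paranoid)
     * and returns the value now stored in the system property.
     */
    public static String enable(String level) {
        System.setProperty(LEVEL_PROPERTY, level);
        return System.getProperty(LEVEL_PROPERTY);
    }

    public static void main(String[] args) {
        // "paranoid" tracks every buffer and is expensive, so it is meant for test environments.
        System.out.println(LEVEL_PROPERTY + "=" + enable("paranoid"));
        // Start the application here, e.g. SpringApplication.run(...).
    }
}
```

Note that this only reports reference-counted buffers that were garbage-collected without being released. It would not flag channels that are still strongly referenced, for example by a connection pool.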

Your Environment

  • Reactor version(s) used: 1.1.20
  • Other relevant libraries versions (eg. netty, ...):
<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-webflux</artifactId>
	<version>3.2.7</version>
</dependency>
  • JVM version (java -version): OpenJDK
  • system version (eg. uname -a): CentOS Linux 7 (Core)
@userlaojie userlaojie added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Aug 31, 2024
@userlaojie
Author

These are the jstat GC statistics; the CGC and YGC counts are almost the same.
image
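
The collection counters that jstat reports can also be read from inside the JVM, which may help when comparing pods over time. The following is a minimal sketch using the standard `java.lang.management` API; the class name `GcCounters` is hypothetical, and the collector names it prints depend on the garbage collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for logging GC activity alongside application metrics.
public final class GcCounters {

    private GcCounters() {
    }

    /**
     * Returns collector name -> cumulative number of collections since JVM start.
     * A value of -1 means the collector does not expose a count.
     */
    public static Map<String, Long> snapshot() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, count) ->
                System.out.println(name + " collections=" + count));
    }
}
```

Logging this snapshot periodically shows how the counters for each collector grow relative to one another.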

@kzander91

kzander91 commented Sep 9, 2024

I believe we have the same issue.
reactor-netty 1.1.22
Netty 4.1.112
Spring Boot 3.3.3
uname -a: Linux batch-service-794ddfb76-bqnlb 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
java -version:

openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)

EpollSocketChannel objects are not being garbage-collected:
image

I looked into some of these instances and they all seem to be referenced by invalidated pooled connections:
image

Have recent releases changed anything w.r.t. pool entry invalidation? Note that we have not configured the connection pool in any way (reactor.netty.pool.maxIdleTime and friends), so all the defaults should apply.

@lfs1985

lfs1985 commented Sep 10, 2024

We have the same issue.
Reactor version(s) used: 1.1.20
Spring Boot: 3.2.7
JVM version (java -version): OpenJDK 17
EpollSocketChannel takes up more than 1.27 GB but is not being garbage-collected.

@violetagg violetagg removed the status/need-triage A new issue that still need to be evaluated as a whole label Sep 10, 2024
@violetagg violetagg self-assigned this Sep 10, 2024
@violetagg
Member

All, please try to provide a reproducible example.

@violetagg violetagg added the for/user-attention This issue needs user attention (feedback, rework, etc...) label Sep 10, 2024
@userlaojie
Author

OK, we will try to reproduce the scenario locally by load-testing the interface with JMeter, which will take a day.

@kzander91

@userlaojie any luck so far? I myself have been unable to reliably reproduce it.
The tricky thing is that even in my production application, the leak doesn't always happen.
Sometimes it starts leaking until a crash, then, after the reboot, everything is fine for many days.

Considering that in my heap dump, the pool refs are all in STATE_INVALIDATED, maybe it's related to connections being closed abnormally?

@vitjouda

vitjouda commented Sep 17, 2024

Hi @userlaojie, I strongly believe you came across the same problem as I did, please check my issue if you observe the same behavior. I spent a lot of time trying to simulate it locally, but never managed to produce a reliable reproducer.

@userlaojie
Author

userlaojie commented Sep 24, 2024

Sorry, we still cannot reproduce this locally. Our latest step has been to remove as many factors as possible that could prevent HTTP connections from being released, such as eliminating Micrometer usage and not using a custom MeterRegistry.
The following is the latest monitoring data; memory on some pods is still too high:
channel-qrcode-pay-7686b6d777-5pjgj
image
channel-qrcode-pay-6959cf9bb4-b5zst
image

@vitjouda

Hi, I managed to replicate part of the problem and am currently discussing it on Gitter. If you have the same problem, there are currently two ways to mitigate it: either replace reactor-netty with a different WebClient-supported library (we used Apache HttpClient 5, which works well), or, if you can tolerate it, disable connection keep-alive. In our case, both options eliminate the leak. Of course, disabling keep-alive is not great and not a long-term solution, but you can at least verify whether it's the same problem. The performance hit will depend on your use case.


5 participants