Memory leak - version 3.9.0 #695

Open
Jin9628 opened this issue Dec 20, 2023 · 10 comments

Comments

Jin9628 commented Dec 20, 2023

I ran into a problem with version 2.70.0: instances of the following classes take up almost 5 GB of memory:

com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory$TimeoutContext
net.sourceforge.htmlunit.corejs.javascript.Interpreter$CallFrame
net.sourceforge.htmlunit.corejs.javascript.ConsString

When I set webClient.getOptions().setJavaScriptEnabled(false), the problem does not occur. I also tried the latest version, 3.9.0, but the problem persists.

rbri (Member) commented Dec 20, 2023

Do you have a minimal sample that lets me reproduce this?

Jin9628 (Author) commented Dec 22, 2023

> Do you have a minimal sample that lets me reproduce this?

This is the demo that I used:

public static String getTextByHtmlUrl(String htmlUrl) {
    if (StringUtils.isBlank(htmlUrl)) {
        return StringUtils.EMPTY;
    }
    WebClient webClient = createWebClient();
    HtmlPage page = null;
    String text = "";
    try {
        page = webClient.getPage(htmlUrl);
        webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
        String pageXml = page.asXml();
        Document document = Jsoup.parse(pageXml);
        Elements body = document.select(BODY);
        text = body.get(0).text();
        Elements iframes = document.select(IFRAME);
        if (CollectionUtils.isEmpty(iframes)) {
            return text;
        }
        return getTextByInnerInframe(text, htmlUrl, iframes, webClient);
    } catch (Throwable e) {
        LoggerUtil.warn(LOGGER, "HtmlAnalysisUtil->getTextByHtmlUrl error");
        return StringUtils.EMPTY;
    } finally {
        webClient.close();
    }
}

private static String getTextByInnerInframe(String sourceText, String sourceUrl, Elements iframes, WebClient webClient) {
    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.append(sourceText);
    iframes.stream().forEach(iframe -> {
        String iframeUrl = iframe.attr(SRC);
        if (StringUtils.isNotBlank(iframeUrl)) {
            iframeUrl = buildValidUrl(sourceUrl, iframeUrl);
            try {
                HtmlPage innerPage = webClient.getPage(iframeUrl);
                webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
                String innerPageXml = innerPage.asXml();
                Document innerDocument = Jsoup.parse(innerPageXml);
                Elements innerBody = innerDocument.select(BODY);
                stringBuilder.append(innerBody.get(0).data());
            } catch (Throwable e) {
                LoggerUtil.error(LOGGER, e, "HtmlAnalysisUtil->getTextByInnerInframe error");
            }
        }
    });
    return stringBuilder.toString();
}

private static WebClient createWebClient() {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setRedirectEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    return webClient;
}

rbri (Member) commented Dec 22, 2023

@Jin9628 And are you seeing this memory leak after webClient.close()?

rbri (Member) commented Dec 22, 2023

@Jin9628

String pageXml = page.asXml();
Document document = Jsoup.parse(pageXml);

Seeing code like this makes me sad ;-)
I think everything you get from Jsoup is also available in HtmlUnit.

Maybe you can improve your code (or give me a hint about what is missing). Serializing the page back to XML and then parsing it again is not very efficient.

Jin9628 (Author) commented Dec 26, 2023

> @Jin9628 And are you seeing this memory leak after webClient.close()?

Yes. I will improve my code, but I don't think that will solve the problem.

rbri (Member) commented Dec 27, 2023

@Jin9628 Do you also have a URL for your sample so that I can debug this here?

Jin9628 (Author) commented Jan 12, 2024

> @Jin9628 Do you also have a URL for your sample so that I can debug this here?

Here is an example I found that you can use for debugging:
url: https://webs.csjywlkj.cn/privacy-tcyx?a=1

Jin9628 (Author) commented Jan 23, 2024

@rbri Sorry, have you made any progress?

rbri (Member) commented Jan 24, 2024

@Jin9628 Sorry, I wrote a small test program that fetches the page in a loop and ran it for 20 minutes, but I can't see any memory leak; the profiler shows no growth in memory usage.

Maybe you can provide a small test program?
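
A small test program of the kind requested here might look like the sketch below. It is not the program rbri ran. It is a JDK-only harness with hypothetical names (HeapGrowthCheck, sample, usedHeapAfterGc) that runs a task repeatedly and records the used heap after a GC hint. The placeholder task in main only allocates a scratch buffer; swapping in a call such as getTextByHtmlUrl(url) from the demo above is an assumption about how the reporter's code would be wired in. Because System.gc() is only a hint, the numbers are indicative, and a profiler or heap dump remains the more reliable tool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness: runs a task in a loop and samples used heap. */
final class HeapGrowthCheck {

    /** Used heap in bytes, read after asking the JVM to collect garbage. */
    static long usedHeapAfterGc() {
        Runtime runtime = Runtime.getRuntime();
        System.gc(); // only a hint; the JVM may ignore it
        return runtime.totalMemory() - runtime.freeMemory();
    }

    /**
     * Runs the task `iterations` times. Records one baseline sample before
     * the loop and one more sample after every `sampleEvery` iterations.
     */
    static List<Long> sample(Runnable task, int iterations, int sampleEvery) {
        if (sampleEvery <= 0) {
            throw new IllegalArgumentException("sampleEvery must be positive");
        }
        List<Long> samples = new ArrayList<>();
        samples.add(usedHeapAfterGc()); // baseline
        for (int i = 1; i <= iterations; i++) {
            task.run();
            if (i % sampleEvery == 0) {
                samples.add(usedHeapAfterGc());
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace with the code under
        // test, for example a call to getTextByHtmlUrl(url).
        Runnable task = () -> {
            byte[] scratch = new byte[1024 * 1024];
            scratch[0] = 1;
        };
        List<Long> samples = sample(task, 50, 10);
        for (int i = 0; i < samples.size(); i++) {
            System.out.printf("sample %d: %d KiB used%n", i, samples.get(i) / 1024);
        }
    }
}
```

A series that keeps rising from sample to sample would point to retained objects, while a flat series would be consistent with what rbri observed.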

rbri (Member) commented Apr 28, 2024

@Jin9628 Can you please try the latest release and report your results?

rbri self-assigned this Apr 28, 2024
rbri added the "waiting for feedback" label (waiting for feedback from the reporter) Apr 28, 2024