-
-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[automation] memory leak caused by bad rule in UI #2200
Comments
This issue has been mentioned on openHAB Community. There might be relevant details there: |
Thanks for opening this. Reference my notes here. I always thought rules files containing syntax errors were ignored by the rule engine. But I see even in 2.5 that my simple rule executes every second. Of course, because 2.5 only has text-based DSL rules, it doesn't cause a memory consumption issue. That only happens with the UI version of DSL rules. |
@mhilbush Was able to load the test jar I posted and had some interesting results. Apparently restarting of org.openhab.core.model.script.runtime caused the memory to free up. This at least gives us a path to look at for where the issue could be located. |
@kaikreuzer This can now be easily reproduced, so hopefully we can get to the root cause. |
This makes complete sense to me as the resources associated with the bundle were freed up when the original bundle was uninstalled and the new one installed. |
Seems to me that something is different in how resources are managed after a failed rule execution between the text DSL rules and the UI DSL rules. |
So the offending line is definitely Line 126 in 0292806
The call to scriptEngine.newScriptFromString() is failing and throwing an exception when the bad script is in place. The question that I have is why we aren't handling that properly (or what's not releasing). This is already in a try/catch so Line 135 in 0292806
should be throwing the proper exception back to ScriptActionHandler (which it looks like it's doing based on the error above). I'm wondering if the leak is in ScriptActionHandler and for some reason it's not releasing the memory for something when it gets an exception. |
Also, I would assume the issue has nothing to do with the change to cache the parsed script, since that commit was excluded from your test jar. Correct? |
Correct. That said, I believe this would have happened before that as well since we've always called newScriptFromString. |
Just to keep some notes as I go, I threw some debugs in and the exception is definitely being thrown by the line below. It make sense because it's a validation error. Line 150 in 0292806
So the question at this point is, should we be freeing up anything in the else on L150 before we throw the error in the function below: Lines 125 to 156 in 0292806
|
Interesting. But there's one thing I don't understand... The exception message that's logged doesn't say what's in the |
I was just looking at that. The actual error log message is fired at: Line 65 in 0292806
I believe that is because of the way https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core.model.script/src/org/openhab/core/model/script/engine/ScriptParsingException.java is created |
Yeah, I see. So is |
From what I can tell yes. When I grep for it I only get: bundles/org.openhab.core.model.script.runtime/src/org/openhab/core/model/script/runtime/internal/engine/ScriptEngineImpl.java:111: return newScriptFromXExpression(parseScriptIntoXTextEObject(scriptAsString)); |
Ok so more info. I got to looking at this to understand the logic of why we're doing a whole bunch of stuff to one but not the other. Line 96 in 0292806
I added logger.debug("DSLScriptEngine {}",script.stripLeading()); right before that line For a text based rules file we get: 2021-02-15 19:40:12.645 [DEBUG] [time.internal.engine.DSLScriptEngine] - DSLScriptEngine // context: test2-1 For a UI based rule we get: 2021-02-15 19:40:13.090 [DEBUG] [time.internal.engine.DSLScriptEngine] - DSLScriptEngine xxxxxxx |
Apparently I should have grepped for this instead... bundles/org.openhab.core.model.script.runtime/src/org/openhab/core/model/script/runtime/internal/engine/ScriptEngineImpl.java:111: return newScriptFromXExpression(parseScriptIntoXTextEObject(scriptAsString)); |
Text versus UI based evalutation and errors thrown UI:
Text:
|
And the reason for that is because... Lines 108 to 118 in 0292806
Text based rules call newScriptFromXExpression() where UI calls newScriptFromString() |
So I think I see the problem, I'm just not sure how to fix it. When you create a rule in the UI no context is created for it. I'm going to keep going through the code to see how contexts are created when text rules are added to see if I can add something to create a context if one doesn't exist. I "think" that the fact that no context exists is causing the parsing attempt to be stored somewhere and every time the rule fires it attempts to parse and it just spins out of control eventually. Option B is to figure out how to dispose of it properly when it fails. |
And I've made it to the end of the issue... maybe. The offending line... Line 70 in 0292806
Every time it kicks, it creates a new context. Because the rules UI doesn't create a context, and because the generation fails, these go into oblivion for some reason. I've got a jar running right now to see if this is going to spin out of control but right now the system is just hovering inside of a few MB. Basically the fix was to set context to null at the top and then fire new SimpleScriptContext(); if context is null in two places. I also, probably overkill, set context = null before the script throws an exception just in the event something created it and it needs to be destroyed. If this stays stable, I'll make a clean jar without all my debug chaos in it and publish. |
So there appears to be two leaks. Notice below there are periods where the memory usage stays basically stagnant. Then randomly it will jump up 5-10MB. Then calm. Then jump.
|
@mhilbush Please test the jar at: https://github.com/morph166955/openhab-core/releases/tag/memoryleak2200-1 I've not pushed the code up to my branch yet. Looking to see if this helps the situation. To note, the second memory leak seems to kick a few seconds after the following are triggered: 2021-02-15 23:35:42.012 [DEBUG] [core.karaf.internal.FeatureInstaller] - Running scheduled sync job Then for another reason I haven't figured out yet, about 15 minutes in, something else kicks and memory starts to build again. |
@morph166955 I haven't had a chance to try your jar yet. However, I believe the problem is in the call to It appears to me that the resource set will grow on every call to @kaikreuzer WDYT? |
Signed-off-by: morph166955 <[email protected]>
Interesting. I hadn't seen that and I would agree that could be causing a leak as well. There definitely is a second leak that I see kick on after the 15 minute mark. I would still appreciate if you can test the jar to see if it does anything on your system. My system stays very stable until that 15 minute point, and then it starts to spin off to oblivion. I suppose we could fix the leak you have identified simply by setting this.resourceSet = null; or just running deactivate(); before the exception is thrown? If you agree that this would resolve that issue I can add that to my branch and generate up a new test jar. I pushed my changes to https://github.com/morph166955/openhab-core/tree/memoryleak2200 for review as well. |
I made a simple change to prevent the resource set from being reused (see link to code comparison below). This probably isn't a proper fix, as it has pretty significant CPU impact due to the resource set being recreated on each invocation of an invalid rule. However, I wanted to see if it fixed the memory leak, which it did. OTOH, my change has a severe effect on CPU only when you have an invalid UI DSL rule that executes every second. 😉 I'll defer to @kaikreuzer on what to do as the proper long term fix. main...mhilbush:ui-dsl-fix-mem-leak Edit: Been running stable for over 12 hours. |
Interesting. Then why did me modifying the context code also have an impact on the memory usage? Are they somehow tied together? |
Not sure. I'm guessing that it might've reduced the amount of memory added to the resource set on each invalid rule execution. So, maybe it just took longer for the problem to show up. But that's just a guess. |
This issue has been mentioned on openHAB Community. There might be relevant details there: |
Perhaps an option that would have less performance impact would be to call I'm running a version now that reuses the resource set and calls |
I have no issues with that. Perhaps we ultimately do both? |
After looking at the code an obvious problem I see is that the ScriptEngineImpl will create a new resource for every script the comes in, even if such a script is identical to something that has already been loaded. My suggestion to solve that memory leak issue is to only create a new resource if it is in fact a new script. |
According to the documentation |
I guess |
Yeah, I saw that too. I like what you did in your approach. Not knowing too much about how this stuff works, I was steering clear of makeing too many changes...
Yes, it seems to work, and the overhead doesn't look too bad. Remember, the overhead is only incurred when |
Thanks for your feedback, BTW! |
So in several places I added a call to a main...mhilbush:ui-dsl-mem-fix2 Which resulted in the following CPU and memory profile. My test case of an invalid rule executing once per second is pretty severe. If you have that situation going on in your system, you probably have bigger problems. 😉 That's why I modeled once per 15 seconds, which also seems pretty severe, but the results are much more reasonable. |
I'll defer to y'all here on the cause of the issue. I'll still submit a PR for the context memort issue "because it was found and there's no reason not to fix it". |
I actually think the context issue is just a side-effect of the core issue. Probably because parts of the context are stored/referenced as part of the resource in the resource set. |
I think there's still a memory leak even for valid rules. Every time a rule is saved or disabled/enabled, I believe memory will be lost. |
Hey guys, sorry for not chiming in any earlier, but I have been sick this week and am only slowly recovering. For the "invalid rule from UI" as per description of this issue, the fix main...mhilbush:ui-dsl-mem-fix2 seems to be just the right thing to do. Clearly the created resources leak and must be deleted. I have tested it locally and it definitely looks fixed by that change. So @mhilbush, not to steal you your credit for the fix, may I ask you to create a PR with that? If you both feel that there are other leaks for valid rules, I'd suggest to open a new issue with a new description for that. |
Perhaps it's the novice understanding of how things are done in the memory here, but why are the spikes and dips so drastically apart for one rule? It looks like you go from ~45MB to ~125MB up and down. That just seems like a huge sawtooth for one rule no? No disagreement that this seems to fix the issue, and it looks like the context issue is definitely solved with this, I'm just looking at that diagram and then the one a few posts up by @mhilbush going from ~175MB to ~425MB and saying "wow, that seems like a huge amount of memory carved off just for a rule." |
I'll submit the PR later today. And I'll open a separate issue for the leak from a valid rule. |
The number of rules doesn't matter here - the whole Xtext parsing is started every second and that simply involves many objects being instantiated, which can then later be garbage collected. I'd consider this as a quite normal behavior, since a lot of processing is indeed happening. For rules that are valid, we cache the parsed result, so in this case the parsing does not happen anymore - and that should be the normal situation. |
@mhilbush any link for the issue with memory leak for the valid dsl rule? |
Issue for UI DSL valid rule memory leak is here. |
This issue has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/memory-leak-using-jdbc-mysql-in-rules/119370/4 |
Tested against S2207 but reported on https://community.openhab.org/t/openhab-3-runs-out-of-memory-java-heap-space-errors-cpu-100-after-a-few-hours/110924/105 as happening between 3.0 and 3.0.1
As identified by @mhilbush a rule similar to the rule below, but created on the new rules UI. This is 100% reproducible. I created an Ubuntu 20.04 VM (on ESXi) with 4GB ram. I only installed the systeminfo binding to monitor system use over time. I removed the rrd4j persistence as it was thought to be causing the problem at once point (it wasn't). There is nothing on this system other than a single thing created as per the config below.
Causes the following:
Over time, this failure seems to cause a memory leak leading to an eventual system crash due to OutOfMemory
systeminfo thing and items created as
I attempted to fix this in #2182 thinking that this was because of the threadpools but that was not successful. I also attempted to back out #2057 and #1952 to see if either of those were the problem. This was also fruitless. I am at a loss on where to move further and can not identify where the leak is.
The text was updated successfully, but these errors were encountered: