-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
788 closing mem leaks #791
Conversation
Don't know if that matmul accuracy test crashing has something to do with these changes. On my laptop there were two failing tests ("cartesian, success" and "where, success") and on my scale up box there were no failing tests 🤷♂️
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this fix, @corepointer. Closing these memory leaks is very important. Thanks for structuring the PR into four meaningful commits that fix distinct memory leaks. The first three commits (on closeFile()
, castCol()
, and startDAPHNE()
) look totally fine to me.
I only have some comments on the fourth commit on the memory leaks of string scalars. Overall, the solution has the right ingredients and at first glance it looks very good. However, at a closer investigation, it doesn’t behave as expected internally. To illustrate the problems, I pushed a commit with some more prints (the logging facility didn't want to work for me...) and two script-level test cases (strings_1.daphne
and strings_2.daphne
). I had to comment out the fall-back freeing in ~DaphneContext()
; otherwise, the two new test cases print rubbish data.
-
test/api/cli/controlflow/strings_1.daphne
def foo(s:str) -> str { print(s); return s + "d"; } s = "abc"; # string literal, no kernel call s = foo(s); print(s);
bin/daphne --explain obj_ref_mgnt test/api/cli/controlflow/strings_1.daphne
prints:IR after managing object references: module { func.func @"foo-1-1"(%arg0: !daphne.String) -> !daphne.String { %0 = "daphne.constant"() {value = 123 : si64} : () -> si64 %1 = "daphne.constant"() {value = "abc"} : () -> !daphne.String %2 = "daphne.constant"() {value = 96365382536864 : ui64} : () -> ui64 %3 = "daphne.constant"() {value = 96365382536736 : ui64} : () -> ui64 %4 = "daphne.constant"() {value = 140731117186760 : ui64} : () -> ui64 %5 = "daphne.createDaphneContext"(%4, %3, %2) : (ui64, ui64, ui64) -> !daphne.DaphneContext "daphne.decRef"(%arg0) : (!daphne.String) -> () %6 = "daphne.cast"(%0) : (si64) -> !daphne.String %7 = "daphne.concat"(%1, %6) : (!daphne.String, !daphne.String) -> !daphne.String "daphne.decRef"(%6) : (!daphne.String) -> () "daphne.decRef"(%1) : (!daphne.String) -> () "daphne.destroyDaphneContext"() : () -> () "daphne.return"(%7) : (!daphne.String) -> () } func.func @main() { %0 = "daphne.constant"() {value = "abc"} : () -> !daphne.String %1 = "daphne.constant"() {value = true} : () -> i1 %2 = "daphne.constant"() {value = false} : () -> i1 %3 = "daphne.constant"() {value = 96365382536864 : ui64} : () -> ui64 %4 = "daphne.constant"() {value = 96365382536736 : ui64} : () -> ui64 %5 = "daphne.constant"() {value = 140731117186760 : ui64} : () -> ui64 %6 = "daphne.createDaphneContext"(%5, %4, %3) : (ui64, ui64, ui64) -> !daphne.DaphneContext "daphne.incRef"(%0) : (!daphne.String) -> () %7 = "daphne.generic_call"(%0) {callee = "foo-1-1"} : (!daphne.String) -> !daphne.String "daphne.decRef"(%0) : (!daphne.String) -> () "daphne.print"(%7, %1, %2) : (!daphne.String, i1, i1) -> () "daphne.decRef"(%7) : (!daphne.String) -> () "daphne.destroyDaphneContext"() : () -> () "daphne.return"() : () -> () } } IncRef: 140731117181116 (not found) DecRef: 140731117181116 (not found) DecRef: 96365410280384 (found) DecRef: 140731117181028 (not found) Deleting 1 remaining string refs in ~DaphneContext() DecRef: 140731117181116 (not found) abc123 DecRef: 96365410261984 (not found)
The
incRef
/decRef
are in the right places. Note that the value"abc"
is constant-propagated into the UDF.Problems: From the very beginning, the string is never in the
stringRefCount
. ThusìncRef
anddecRef
always ignore it and the fall-back tidy-up code in~DaphneContext()
isn't aware of the string. Effectively, the string is not garbage collected, the memory leak still exists. -
test/api/cli/controlflow/strings_2.daphne
def foo(s:str) -> str { return s + " is my favorite number"; } s = as.str(123); # kernel call (but could be constant folded in the future) s = foo(s); print(s);
bin/daphne --explain obj_ref_mgnt test/api/cli/controlflow/strings_2.daphne
prints:IR after managing object references: module { func.func @"foo-1"(%arg0: !daphne.String) -> !daphne.String { %0 = "daphne.constant"() {value = " is my favorite number"} : () -> !daphne.String %1 = "daphne.constant"() {value = 97987952754336 : ui64} : () -> ui64 %2 = "daphne.constant"() {value = 97987952754208 : ui64} : () -> ui64 %3 = "daphne.constant"() {value = 140722023857560 : ui64} : () -> ui64 %4 = "daphne.createDaphneContext"(%3, %2, %1) : (ui64, ui64, ui64) -> !daphne.DaphneContext %5 = "daphne.concat"(%arg0, %0) : (!daphne.String, !daphne.String) -> !daphne.String "daphne.decRef"(%0) : (!daphne.String) -> () "daphne.decRef"(%arg0) : (!daphne.String) -> () "daphne.destroyDaphneContext"() : () -> () "daphne.return"(%5) : (!daphne.String) -> () } func.func @main() { %0 = "daphne.constant"() {value = 123 : si64} : () -> si64 %1 = "daphne.constant"() {value = true} : () -> i1 %2 = "daphne.constant"() {value = false} : () -> i1 %3 = "daphne.constant"() {value = 97987952754336 : ui64} : () -> ui64 %4 = "daphne.constant"() {value = 97987952754208 : ui64} : () -> ui64 %5 = "daphne.constant"() {value = 140722023857560 : ui64} : () -> ui64 %6 = "daphne.createDaphneContext"(%5, %4, %3) : (ui64, ui64, ui64) -> !daphne.DaphneContext %7 = "daphne.cast"(%0) : (si64) -> !daphne.String "daphne.incRef"(%7) : (!daphne.String) -> () %8 = "daphne.generic_call"(%7) {callee = "foo-1"} : (!daphne.String) -> !daphne.String "daphne.decRef"(%7) : (!daphne.String) -> () "daphne.print"(%8, %1, %2) : (!daphne.String, i1, i1) -> () "daphne.decRef"(%8) : (!daphne.String) -> () "daphne.destroyDaphneContext"() : () -> () "daphne.return"() : () -> () } } IncRef: 97987987429184 (found) DecRef: 140722023851833 (not found) DecRef: 97987987429184 (not found) Deleting 1 remaining string refs in ~DaphneContext() DecRef: 97987987429184 (found) 123 is my favorite number DecRef: 97987987364976 (not found) Deleting 1 remaining string refs in ~DaphneContext()
Again, the
incRef
/decRef
are in the right places.Problems: The first
DecRef: ... (not found)
is because of aConstantOp
, the second/thirdDecRef: ... (not found)
are because the UDF has its ownDaphneContext
with a freshstringRefCount
. For the same reason, theDeleting ... remaining string refs
happens even though it never should.
Fixing these problems is required before we can merge the PR.
I recommend the following steps:
-
We need one global instance of the
stringRefCount
map that is shared by all instances ofDaphneContext
. The solution can be done similar toDaphneContext::config
,DaphneContext::dispatchMapping
, andDaphneContext::stats
, i.e., a single instance of thestringRefCount
should be created indaphne_internal.cpp
, the pointer to this single instance should be given to thecreateDaphneContext
-kernel, and theDaphneContext
should have a reference to the instance. The same holds for themtxStrRefCnt
mutex (maybe we could bundle the map and the mutex in a little helper struct). -
We should not register a newly created string scalar in the
stringRefCount
(with a reference counter of 1). Instead, we should assume that achar *
that is not in thestringRefCount
implicitly has a reference counter of 1. This approach has the following advantages:- We cannot forget registering a string scalar in
stringRefCount
. Currently, each kernel that creates a string scalar (there will be more such kernels in the future) needs to do that explicitly; a developer can easily forget that. It’s different from data objects, where the reference counter is automatically initialized to 1. - We correctly reflect the reference counter of string scalars that are not created by a C++ kernel. This applies to string constants (
ConstantOp
), strings evaluated through constant folding at compile-time, and may be the case for code-generated kernels in the future. - We reduce the number of accesses to the
stringRefCounter
, thereby spending less time in the critical section that is protected by themtxStrRefCnt
mutex.
Consequently, we would not need the function
DaphneContext::addStrRef()
anymore. Furthermore the calls to this function in the individual kernels (castSca
,concat
) can be removed. - We cannot forget registering a string scalar in
-
The fall-back code (
if(!stringRefCount.empty())
) in~DaphneContext()
should be removed. As you point out in a comment, this situation should never happen. If it happens, this means sth went wrong and it might not even be safe to free the remaining string scalars. (The fact that I needed to comment out that code to even make the new test cases run supports that.) However, we could still log that it happened. -
The access to the
stringRefCount
in theIncRef<char>
andDecRef<char>
-kernels must be synchronized using thestrRefCountMtx
. While they read the map, the map could be edited by another thread (adding/removed entries). WhileDecRef<char>
edits the map, other threads may read or modify the map.
Optionally, more script-level test cases would be great.
Feel free to let me know your thoughts.
Thx for the detailed review and the suggestions @pdamme
|
79ac932
to
f0a0881
Compare
I implemented the requested changes. But it seems that 2) won't work and explains why I went that addStrRef() route the other night: We can only delete dynamically allocated memory of course. What we get when deleting constants that the compiler emits is:
So if we really want to delete what is allocated at compile time, we need a way to distinguish that from the run-time data and destruct correctly. MatrixConstants, besides strings, would also be something where this is needed. |
c9d0d45
to
0508865
Compare
Thanks for continuing with this PR, @corepointer. You are absolutely right, we cannot free the string literals due to the way we allocate them in My idea was to always increase the reference counter of a string literal right after its creation. The effect is that the original object is never freed by DAPHNE's garbage collection. However, the memory consumption clearly indicates that MLIR frees it automatically, but I didn't double-check that in the IR. All strings created by kernels (e.g., concat) are properly freed by DAPHNE's garbage collection. At the same time, string literals defined inside UDFs or loop bodies can be used by many function calls or loop iterations (this would have been a challenge with an alternative solution, see below). To my mind, this is a simple trick and we could even apply it to matrix literals later (currently, we unnecessarily copy those for every use). I've implemented this idea on top of your code and pushed a commit. So essentially, the memory leak is fixed, as far as I can tell. I manually tested it with the following DaphneDSL script: def foo(i:si64) {
print("veeery long string" + i); # I used a 10 KB string
}
for(i in 1:100000) {
foo(i);
} During the execution of Before that, I considered another alternative: We could easily ensure that string literals are backed by strings that were allocated dynamically in C++ at compile-time and can be freed. To this end, we just need to adapt if(auto strAttr = op.getValue().dyn_cast<StringAttr>()) {
StringRef sr = strAttr.getValue();
#if 0
// MLIR does not have direct support for strings. Thus, if this is
// ...
#elif 1
Type i8PtrType = LLVM::LLVMPointerType::get(
IntegerType::get(rewriter.getContext(), 8)
);
char * newString = new char[sr.size() + 1];
memcpy(newString, sr.str().c_str(), sr.size() + 1);
uint64_t ptr = reinterpret_cast<uint64_t>(reinterpret_cast<const void *>(newString));
rewriter.replaceOpWithNewOp<LLVM::IntToPtrOp>(
op.getOperation(),
i8PtrType,
rewriter.create<arith::ConstantOp>(
loc,
rewriter.getI64IntegerAttr(ptr)
)
);
#elif 0
// ...
#endif
} That works, but then we get the same problem we used to have with matrix literals: When a string literal is in a UDF or a loop's body, it gets freed after the first function call or loop iteration and is not there anymore for the second time, leading to a crash. For matrix literals, we solved this by always copying the (small) matrix and leaving the original untouched. I think a similar approach can be done for strings, too. However, such artificial copying is not nice. |
0508865
to
5207395
Compare
This minimal change free's the File struct in closeFile() that has previously been malloc'ed in open*() in our C based File I/O code.
A dynamically allocated temporary data structure is now deleted.
Valgrind reported a memory leak before adding this explicitn moduleOp->destroy()
This commit extends the ManageObjectRefs compiler pass and its runtime kernels to track usage of dynamically allocated string data and delete it after its last use. So far this has been found necessary with string concatenation and casting operations. Strings allocated by the compiler as constants will have their reference counter increased upon creation, which should prevent them from being garbage collected. For a detailed description see issue daphne-eu#788 and discussion in pull request daphne-eu#791. Fixes daphne-eu#788, Closes daphne-eu#791 Co-authored-by: Patrick Damme <[email protected]>
5207395
to
2211a32
Compare
I did some clean ups and ran the tests again. The failing sql tests I mentioned above are not related to this PR as they also occurred on main when running the test suite in debug mode.
Thx again. Deleting the branch after successful merge 👍 |
The code in this PR solves the memory leak documented in #788.
In addition three other minor memory leaks are closed - see the corresponding commits.
Assigning @pdamme as requested =)