Prevent OOMs during heap snapshot: Change to streaming out the snapshot data. #51518

NHDaly · 2023-09-29T19:35:49Z

The solution we came up with here is stream out the heap snapshot, to avoid OOMing while recording it, and then do the downsampling via post-processing, to satisfy the Chrome devtools.

This allows you to record a heap snapshot from a running julia process, even (or especially) when it's current memory usage is close to the limit, without the snapshotter pushing it over the edge.

Unfortunately, this currently represents a change in the API: we now need to write out four files instead of one, and we can no longer support the function that takes an IOBuffer.

Linked here is the current version of our reassembly code, which could probably stand to be cleaned up a bit (thanks @Drvi):

https://gist.github.com/Drvi/37726df2a00d385717e79181539e33bc

I'd like to solicit opinions on whether this kind of breaking change is okay or not for a debugging tool like this.

If we don't want to break this API, I think we can add an option, like streaming=false. Then, to support the legacy non-streaming mode, I think we have two options:

We could include the reassembly code in the Profiler stdlib, which maybe we'd want to do anyway, so that people don't have to install another package like HeapSnapshotTools.jl just to use the heap snapshot. Then for the legacy mode, we simply reconstruct the file and write it to the destination.
We could keep both the new and the old C++ code, and toggle between them. This seems annoyingly wasteful and messy though.

This PR currently takes approach 1.

Co-Authored-By: @Drvi

This should prevent the engine from OOMing while recording the snapshot! Now we just need to sample the files, either online, before downloading, or offline after downloading :) If we're gonna do it offline, we'll want to gzip the files before downloading them.

NHDaly · 2023-09-29T19:36:46Z

@apaz-cli and @gbaraldi: Can I get your review?

On reflection, I think I much prefer option 1, so i'm going to push up a commit with that for now.

NHDaly · 2023-09-29T20:18:17Z

Okay I have pushed up another commit to support approach 1. The API options are now:

julia> let io = IOBuffer()
           Profile.take_heap_snapshot(io)   # maybe we want to consider this one deprecated though?
           String(take!(io)[1:100])
       end
"{\"snapshot\":{\"meta\":{\"node_fields\":[\"type\",\"name\",\"id\",\"self_size\",\"edge_count\",\"trace_node_id\",\"det"

julia> Profile.take_heap_snapshot("/tmp/2.heapsnapshot") # streaming=false by default
Recorded heap snapshot: /tmp/2.heapsnapshot
"/tmp/2.heapsnapshot"

julia> Profile.take_heap_snapshot("/tmp/2.heapsnapshot", streaming=true) # the new API
Finished streaming heap snapshot parts to prefix: /tmp/2.heapsnapshot
"/tmp/2.heapsnapshot"

src/gc-heap-snapshot.cpp

vilterp · 2023-09-29T20:22:49Z

What format are the streamed-out files in?

stdlib/Profile/src/heapsnapshot_reassemble.jl

NHDaly · 2023-09-29T20:49:27Z

stdlib/Profile/src/heapsnapshot_reassemble.jl

+
+    _digits_buf = zeros(UInt8, ndigits(typemax(UInt)))
+    println(io, @view(preamble[1:end-2]), ",") # remove trailing "}\n", we don't end the snapshot here
+    println(io, "\"nodes\":[")


The nodes and edges files currently aren't valid json, since we're not writing out the leading and trailing [], and we aren't writing a trailing comma after each line.

I figured it was easier to process this way, which seems to bear out with the code you wrote, @Drvi. But now I'm kind of thinking we may as well have every file we output be valid JSON...?

On the other hand, that's an extra character per node and per edge, of which there can be billions, so this could add a whole GiB right there?

If this is adding a GB, then the whole file is probably 10s of GB, since that one character will only be a small fraction of each line. But maybe we want to consider output these as BSON instead, to get faster encoding performance for them.

In retrospect, i think these are actually truly csvs! I'm going to rename the output files to be .csv. I think that at least makes them self-documenting in their format.

Regarding the streaming time and file size... I agree, writing them out as binary files would be even better! 🤔 They're literally just 2 giant matrices of numbers like:

$ cat 58178_136122584059625.heapsnapshot.edges | head -n 3 0,2,0,1 0,3,0,2 0,2,0,3 $ cat 58178_136122584059625.heapsnapshot.nodes | head -n 3 0,0,0,0,0,0,0 1,1,4482646032,384,0,0,0 1,1,15281513072,384,0,0,0

So we should definitely consider some kind of binary format instead. 👍

Is there something even simpler than BSON for this? Like, i think we could literally just output an array of int64 binary data, and then read it in like that, no?

Heh, so it turns out that writing these as binary data does make it faster, but the files are actually slightly bigger, since i guess a lot of the indexes were smallish numbers (so only a few bytes) whereas their binary format is the full 8 bytes.

But still, the speed probably makes this worth it 👍

A couple of thoughts on the binary format. Since the schema is known, it wouldn't be hard to handle each column separately in the snapshot assembler.

For nodes we know that type is an index into a small array of node_types so that could be a single byte (or less), name and self_size won't need 8 bytes either, and would benefit from varint (aka vbyte) encoding like protobuf does (this eliminates the leading zeros and is relatively easy to implement). edge_count, trace_node_id and detachedness are always zero so they can be omitted altogether. id is an interesting case as currently it's the pointer to the object, but once written to a file, we can just enumerate the nodes to get a unique id so we don't need to write the pointer out (but maybe having all the pointers in your program could be useful for some analysis?).

For edges, again type could be a single byte (or less). name_or_index (the index into strings for some edge types) is interesting as we could use varint encoding but I think we are using typemax(UInt64) for edge types that don't have a corresponding string which wouldn't compress in varint (it would get bigger in fact) so maybe for these edge types we shouldn't encode any name_or_index value at all or have a special value for those. to_node and from_node are bounded by the size of nodes so varint should help again.

One issue with the current approach is that in order to reassemble the snapshot, one needs to update the edge_count for each node, which means that you need to have all nodes in memory and you need to iterate the edges twice (once to update nodes, and then to write them to the assembled snapshot). How about we produce another file, edge_counts, which would basically be an array of edge counts for each node that we accumulate and write at the end (for 10M nodes that would be 40MB)?

You don't just need the edge_counts; the edges need to be grouped and ordered by the nodes. So you read the file by iterating the nodes, seeing how many edges it has, and then those first N edges are coming out of the first node, then you move to the second node, and the next M edges are coming from that node, etc.

I dunno if you can build that file without having all the nodes in memory while we iterate the edges? Maybe I'm not following what you were explaining?

Ah my bad, I thought the edges were written out already ordered, yeah then the edge_count idea doesn't apply.

Mmm yeah makes sense. Yeah it's too bad. :(

Accounting for that was the last bug that I fixed while you were out on vacation. The file format is really gnarly and i think we didn't get to this detail when we talked through it way back at the start. 😊

NHDaly · 2023-09-30T02:44:57Z

What format are the streamed-out files in?

@vilterp Oh, i missed this message.

I commented on it here: #51518 (comment)

Right now, the .strings and .json files are valid json, and the .nodes and .edges are essentially csvs (they're newline-separated rows of values separated by commas).

I'm very open to changing that; this was just the fastest possible change i could make, since we were rushing for the OOM investigation at work. Open to any suggestions!

vilterp · 2023-09-30T02:46:59Z

Sounds good; was just wondering.

…

On Fri, Sep 29, 2023 at 10:45 PM Nathan Daly ***@***.***> wrote: What format are the streamed-out files in? @vilterp <https://github.com/vilterp> Oh, i missed this message. I commented on it here: #51518 (comment) <#51518 (comment)> Right now, the .strings and .json files are valid json, and the .nodes and .edges are essentially csvs (they're newline-separated rows of values separated by commas). I'm very open to changing that; this was just the fastest possible change i could make, since we were rushing for the OOM investigation at work. Open to any suggestions! — Reply to this email directly, view it on GitHub <#51518 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAABZLKUQLQXWG6VS5JDZ6TX46BTHANCNFSM6AAAAAA5M5CATI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

vtjnash · 2023-09-30T16:00:47Z

src/gc-heap-snapshot.cpp

@@ -106,15 +106,22 @@ struct StringTable {
 };

 struct HeapSnapshot {
-    vector<Node> nodes;
-    // edges are stored on each from_node
-
    StringTable names;
    StringTable node_types;
    StringTable edge_types;
    DenseMap<void *, size_t> node_ptr_to_index_map;


I think you can also avoid creating this table in the serializer process by making it an identity map instead (when used for edges) and streaming out the value of the original pointer also on each node. Then in the later process, you just need to build this map from pointer->number to satisfy converting it into the Javascript format. This dict is also used to serialize the representation/name of the object exactly once, but that can instead be satisfied by knowing the GC will visit each object exactly once as a "from" node when marking all of the out-refs.

Yeah, great points!
You're right, i think it should be a pattern of N edges from node 1 then M edges from node 2, etc. 👍 So we should be able to account for that in the second half of your comment.

I hadn't noticed this map yet in my rush job to stream the data here. I think you're right that we should fix this too 👍 👍

Thanks!

We considered this, but didn't address it in the new PR #52854 since it requires reprocessing of the nodes/edges and some complex logic to merge duplicate nodes and outgoing edges and it could only be an issue for a huge snapshot. We like to leave it for future work.

vtjnash · 2023-09-30T16:05:12Z

src/gc-heap-snapshot.cpp

@@ -106,15 +106,22 @@ struct StringTable {
 };

 struct HeapSnapshot {
-    vector<Node> nodes;
-    // edges are stored on each from_node
-
    StringTable names;


You also might want to consider streaming this field out into the file per-node also, at least in some or all cases, since I think this is often unique? The JSON can probably deal with being a Union{Int,String} field, depending on whether the content was probably unique (or long, like a String) or probably common like a name.

Yeah, i think a nice thing here could be to keep some kind of like a bloom filter or something, and in the nodes file, we write out either an index into this table, or we write out the string itself.

But thinking more about it, i think that it's maybe even fine to just duplicate the strings over and over. The priorities here are:
A) be fast
B) don't OOM
I don't think that file size reduction is nearly as important as those two.

So if we can intern some of the strings, like you suggested, and write out the rest, that's probably good enough, yeah. Keep common and big strings, and write them out uniquely otherwise.

This is the last major cleanup that I think we should do, otherwise this PR looks good to go.
It seems to be working in its current state in order to avoid OOMs in our production setup.

This is addressed in the new PR #52854 by streaming out the string table directly. At the same time, we keep a very limited amount of known strings in memory to reduce duplicates in the string table as much as possible.

stdlib/Profile/src/heapsnapshot_reassemble.jl

gbaraldi · 2023-10-03T18:22:58Z

Btw, when this code was developed it was kind of reverse engineered, but microsoft released some documentation on it if you need a proper reference https://learn.microsoft.com/en-us/microsoft-edge/devtools-guide-chromium/memory-problems/heap-snapshot-schema

stdlib/Profile/src/heapsnapshot_reassemble.jl

This way you can always recover from an OOM

node order. That's the whole reason this is tricky. But i'm not sure now whether the SoAs approach is actually an optimization.... It seems like we should probably prefer to inline the Edges right into the vector, rather than having to do another random lookup into the edges table?

JianFangAtRai · 2024-01-11T04:48:01Z

created a new PR #52854 to continue the work on this PR

NHDaly · 2024-01-11T17:01:13Z

Closing in favor of #52854.

JianFangAtRai · 2024-01-24T19:34:52Z

Hi, the new PR (#52854) is ready for review now. The new PR fixed multiple minor issues in the original PR such as doc, safepoint, alloc type, and so on. The main improvement in the new PR is to stream out the string table during the snapshotting process instead of holding them in memory first and writing them out into a file at the end. To reduce duplicate strings as much as possible, we hold some known strings in memory for deduping purpose to reduce string table size.

We didn't address DenseMap<void *, size_t> node_ptr_to_index_map in the new PR since it requires duplicate nodes and the logic to merge duplicate nodes and outgoing edges, which could be very involving. We leave that for future work since it could only be an issue for a huge heap snapshot. We tested the PR with about 40GB heap snapshot and it didn't crash the process.

This PR is to continue the work on the following PR: Prevent OOMs during heap snapshot: Change to streaming out the snapshot data (#51518 ) Here are the commit history: ``` * Streaming the heap snapshot! This should prevent the engine from OOMing while recording the snapshot! Now we just need to sample the files, either online, before downloading, or offline after downloading :) If we're gonna do it offline, we'll want to gzip the files before downloading them. * Allow custom filename; use original API * Support legacy heap snapshot interface. Add reassembly function. * Add tests * Apply suggestions from code review * Update src/gc-heap-snapshot.cpp * Change to always save the parts in the same directory This way you can always recover from an OOM * Fix bug in reassembler: from_node and to_node were in the wrong order * Fix correctness mistake: The edges have to be reordered according to the node order. That's the whole reason this is tricky. But i'm not sure now whether the SoAs approach is actually an optimization.... It seems like we should probably prefer to inline the Edges right into the vector, rather than having to do another random lookup into the edges table? * Debugging messed up edge array idxs * Disable log message * Write the .nodes and .edges as binary data * Remove unnecessary logging * fix merge issues * attempt to add back the orphan node checking logic ``` --------- Co-authored-by: Nathan Daly <[email protected]> Co-authored-by: Nathan Daly <[email protected]>

This PR is to continue the work on the following PR: Prevent OOMs during heap snapshot: Change to streaming out the snapshot data (JuliaLang#51518 ) Here are the commit history: ``` * Streaming the heap snapshot! This should prevent the engine from OOMing while recording the snapshot! Now we just need to sample the files, either online, before downloading, or offline after downloading :) If we're gonna do it offline, we'll want to gzip the files before downloading them. * Allow custom filename; use original API * Support legacy heap snapshot interface. Add reassembly function. * Add tests * Apply suggestions from code review * Update src/gc-heap-snapshot.cpp * Change to always save the parts in the same directory This way you can always recover from an OOM * Fix bug in reassembler: from_node and to_node were in the wrong order * Fix correctness mistake: The edges have to be reordered according to the node order. That's the whole reason this is tricky. But i'm not sure now whether the SoAs approach is actually an optimization.... It seems like we should probably prefer to inline the Edges right into the vector, rather than having to do another random lookup into the edges table? * Debugging messed up edge array idxs * Disable log message * Write the .nodes and .edges as binary data * Remove unnecessary logging * fix merge issues * attempt to add back the orphan node checking logic ``` --------- Co-authored-by: Nathan Daly <[email protected]> Co-authored-by: Nathan Daly <[email protected]>

NHDaly added 2 commits September 29, 2023 13:12

Allow custom filename; use original API

7b86d3b

NHDaly requested a review from gbaraldi September 29, 2023 19:35

NHDaly mentioned this pull request Sep 29, 2023

Add ability to sample the heapsnapshot #51381

Open

Support legacy heap snapshot interface. Add reassembly function.

726f9e1