
[WIP] Memory Profiler #31534

Closed
wants to merge 5 commits into from

Conversation

@staticfloat commented Mar 29, 2019

yes-but-why-is-the-ram-gone

Overview

This branch adds a Memory Profiler to the Profile standard library. Accordingly, it re-arranges things such that the time-profiler code is stored within Profile.Time (but @profile and friends are still exported, so it should be a mostly non-breaking change).

Memory Profiler internals

The memory profiler works much the same as the time profiler: it creates large buffers to store event backtraces as well as metadata about each allocation and deallocation. These buffers are later examined from Julia code, and timelines are reconstructed to pair allocations with deallocations, determining object lifetimes, which objects are left "live" at the end of the instrumented region, and so on.
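In rough pseudocode, the pairing step looks like the following sketch (illustrative only, not the actual implementation; the event tuple layout is assumed):

function pair_events(events)
    # events: (address, time, size, is_dealloc) tuples in chronological order (assumed layout)
    open_chunks   = Dict{Ptr{Cvoid},Tuple{Float64,Csize_t}}()      # currently-live chunks
    closed_chunks = Tuple{Ptr{Cvoid},Float64,Float64,Csize_t}[]    # (addr, t_alloc, t_free, size)
    ghost_chunks  = Ptr{Cvoid}[]                                   # freed but never seen allocated
    for (addr, t, sz, is_dealloc) in events
        if !is_dealloc
            open_chunks[addr] = (t, sz)
        elseif haskey(open_chunks, addr)
            t0, sz0 = pop!(open_chunks, addr)
            push!(closed_chunks, (addr, t0, t, sz0))
        else
            push!(ghost_chunks, addr)
        end
    end
    return open_chunks, closed_chunks, ghost_chunks
end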

Allocations are coarsely classified by their allocation domain (CPU, GPU, External, with room for this to grow) and their allocator (big, std or pool, corresponding to the three allocation code paths within Julia). Because a user may be primarily interested in certain allocations and not at all interested in others (and because Julia may generate many, many allocations over the course of a computation), a rudimentary filtering system exists: the user may set up a bitmask for domain and allocator, allowing only matching events to be collected. Note that at the time of writing, only the CPU domain is tested.

The recorded allocation information (represented in Julia below):

struct allocation_info_t
    address::Ptr{Cvoid}
    T::Ptr{Cvoid}
    time::Float64
    allocsz::Csize_t
    tag::UInt16
end
  • address represents the pointer to the object that was allocated. If it remains live at the end of the computation, the user can actually inspect the object. If it is no longer live, accessing that memory is a fun way to randomly crash Julia.

  • T represents the type of the object that was allocated. This value (unless it is C_NULL) is always valid to dereference, as types are never destroyed. Type information may be unavailable; however, this should only be the case when an external program is directly calling Julia's GC functions and is unaware of methods such as jl_memprofile_set_typeof(addr, T).

  • time is the wallclock time at which this event occurred.

  • allocsz is the number of bytes this allocation was made for.

  • tag is the bitwise-OR combination of the domain, the allocator, and a flag recording whether this event was an allocation or a deallocation.
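For illustration only, decoding a tag could look like the sketch below; the real bit values are the JL_MEMPROF_TAG_* constants in julia_internal.h and are not reproduced here, so the masks are placeholders:

# Placeholder masks; the actual JL_MEMPROF_TAG_* layout lives in julia_internal.h.
const TAG_DOMAIN_MASK    = 0x000f   # assumed: low bits select the memory domain
const TAG_ALLOCATOR_MASK = 0x00f0   # assumed: next bits select the allocator
const TAG_DEALLOC_BIT    = 0x8000   # assumed: set when the event is a deallocation

is_dealloc(tag::UInt16) = (tag & TAG_DEALLOC_BIT) != 0
domain(tag::UInt16)     = tag & TAG_DOMAIN_MASK
allocator(tag::UInt16)  = tag & TAG_ALLOCATOR_MASK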

Julia-facing API

The Julia side of things revolves around the macro @memprofile ex, which runs gc(), turns on memory profiling, then runs the enclosed expression, runs gc() again, and finally turns off memory profiling. Once this has been done, the user can analyze the buffers by running Profile.Memory.read_and_coalesce_memprofile_data(), which returns three values: open_chunks, closed_chunks and ghost_chunks, where an "open" chunk is one that was allocated but never freed, a "closed" chunk is one that was allocated and freed, and a "ghost" chunk is one that was freed but never allocated.
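A minimal usage sketch (the module paths and the workload are illustrative; the exact API may still change):

using Profile

Profile.Memory.@memprofile begin
    data = [rand(10_000) for _ in 1:100]    # some allocation-heavy workload
    sum(sum, data)
end

open_chunks, closed_chunks, ghost_chunks =
    Profile.Memory.read_and_coalesce_memprofile_data()

println(length(open_chunks),   " chunks still live at the end of the region")
println(length(closed_chunks), " chunks allocated and freed within the region")
println(length(ghost_chunks),  " chunks freed but never seen allocated")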

These outputs are all "flat" in that there is no collation of results across backtrace locations, or anything like that. There are a host of interesting visualizations and data output formats that could be built on top of this information; I just do not have time to do it right now.

TODO

  • Integrate flat -> tree tools for both Time and Memory profilers. The time profiler already contains some tools to convert the "flat" data arrays into a "tree" format that is substantially more readable and useful. We should be able to abstract that out a bit and create a useful "flat" -> "tree" collator that can deal with both of these kinds of data.

  • Add NEWS.md once the user-facing API is stabilized.

@ViralBShah

Must do a blog post to announce this as well!

@timholy commented Mar 29, 2019

Does this replace the --track-allocation=all functionality (and the corresponding tools in https://github.com/JuliaCI/Coverage.jl#memory-allocation), or are there reasons for both?

@StefanKarpinski added the "minor change" (Marginal behavior change acceptable for a minor release) label Mar 29, 2019
@StefanKarpinski

Great writeup! Could just be the text of the blog post. Since this is a diagnostic tool, I think there's a little more leeway for API breakage for what it's worth.

stdlib/Profile/src/memory.jl (outdated)
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
"""
function open_AI(a::allocation_info_t, alloc_stacktrace)
Member: Why not just extend Base.open/close for this?

Author: Because I don't really like that kind of overloading. It feels like a type pun; there's nothing about this kind of "opening" and "closing" that is similar to Base.open or Base.close, and so there's no reason for things like docstrings, method tables, etc. to be shared between them.

base/error.jl (resolved)
@staticfloat

> Does this replace the --track-allocation=all functionality (and the corresponding tools in JuliaCI/Coverage.jl#memory-allocation), or are there reasons for both?

There is definitely some overlap, but I don't fully grok what kinds of memory --track-allocation keeps track of; for instance, I don't think it will track pooled allocations (and it certainly won't tell you when the memory was freed).

I'm not sure what the use cases are for --track-allocation, so I think I'll need someone else to speak up and tell me what it's for to know whether this kind of system could "replace" it.

Some of the drawbacks to this system are that you need to pre-allocate a chunk of memory, and if you run out, you lose the tail end of your stats. (Luckily @time gives you an estimate of the number of allocations you need to hold, so this isn't such a difficult game to win at, although figuring out the proper mapping of num_allocations -> backtrace buffer length can be a bit of a guessing game.) Another drawback is that when the allocator is tracking everything, some code slows to a crawl, especially compilation (because compiling creates lots of little objects). I also don't have a separation between "user" code and "system" code; that kind of separation could be very useful, especially if we can integrate it into the C code so that system allocations don't even get recorded.
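For example, one rough way to size the buffers is to diff Julia's own allocation counters around the workload (a sketch; the fields are Base's internal GC_Num counters, and the workload function is hypothetical):

before = Base.gc_num()
my_workload()                      # hypothetical function you intend to profile
after  = Base.gc_num()
nallocs = (after.poolalloc - before.poolalloc) +
          (after.bigalloc  - before.bigalloc)  +
          (after.malloc    - before.malloc)
# nallocs gives a ballpark for how many events the profiler buffers must hold.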

staticfloat and others added 5 commits April 1, 2019 18:32
Also adds `bt_overflow` flag instead of spitting out messages in the
middle of profiling, to be used by client profiling code.

This change allows for better checking of stack frames that could be
incomplete due to insufficient backtrace buffer space.  Realistically, a
single truncated stack trace in the case of a sampling time profiler is
unlikely to create large problems.  However, when taking backtraces for
things such as a memory profiler, it is critical that all backtraces be
accurate, and so we allow client code to be somewhat stricter here.
This adds C support for a memory profiler within the GC, tracking
locations of allocations, deallocations, etc...  It operates in a
similar manner to the time profiler, with single large buffers set up
beforehand through an initialization function, reducing the need for
expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations that the GC
statistics themselves are being modified (e.g. `gc_num.allocd` and
`gc_num.freed`) by introducing new helper functions
`jl_gc_count_{allocd,freed,reallocd}()`.  Those utility functions call
the `jl_memprofile_track_{de,}alloc()` method to register an address,
a size and a tag with the memory profiler.  We also track type
information as this can be critically helpful when debugging, and to do
so without breaking API guarantees we insert methods to set the type of
a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at
profile time, between e.g. pooled allocations and the "big" allocator.
It also allows the memory allocator to support tracking multiple "memory
domains", e.g. a GPU support package could manually call
`jl_memprofile_track_alloc()` any time a chunk of memory is allocated on
the GPU so as to use the same system (see the sketch after these commit
messages).  By default, all values are tracked; however, one can set a
`memprof_tag_filter` value to track only the events of most interest
(e.g. only CPU-domain big allocations).

To disambiguate the memory and time profilers, we split them out into
separate modules.
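As referenced above, an external package could feed its own allocations into the same machinery along these lines (a sketch; the entry point name comes from this branch, but its argument order, types, and the tag value are assumed here):

# Assumed signature: address, size, tag. The tag value is a placeholder.
function track_gpu_alloc(ptr::Ptr{Cvoid}, sz::Integer, tag::UInt16)
    ccall(:jl_memprofile_track_alloc, Cvoid, (Ptr{Cvoid}, Csize_t, UInt16),
          ptr, sz, tag)
end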
@timholy left a comment


I really like the re-use of the time-profiling infrastructure here, very nice design.

It's been ages since I poked around the gc, and some time since I thought about profiling, so I am not perfectly suited to give feedback here. But overall I think the API needs to be a bit more layered and designed with "readout performance" in mind. I haven't played with this, but I wouldn't be surprised if just "collecting" memprofile data can take a long time; IIUC (and I may not) I think some relatively simple steps could drastically shorten that time.

EDIT: to expand on that, if you have m allocations that occur with the same backtrace of length n instruction pointers, this implementation is O(m*n) in both memory and time, and the construction of the stacktraces (with all the lookup) will be the slow step. If instead you aggregate into sets of unique backtraces (using just the instruction pointers, prior to the lookup) then you get something that's O(m+n) in the slow step. You can modify your AllocationInfo to allow list(s) of event times and allocation sizes so as to aggregate efficiently.
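Concretely, something like this sketch (the event layout is assumed purely for illustration):

function aggregate_by_backtrace(events)
    # events: iterable of (info, ips) pairs, where ips is the raw backtrace as a
    # Vector{UInt} of instruction pointers (assumed layout).
    groups = Dict{Vector{UInt},Vector{Any}}()
    for (info, ips) in events
        push!(get!(groups, ips, Any[]), info)
    end
    # Symbolicate each unique backtrace exactly once: O(m + n) instead of O(m*n)
    # for m events that share a backtrace of n frames.
    return Dict(
        [Base.StackTraces.lookup(Ptr{Cvoid}(ip)) for ip in ips] => infos
        for (ips, infos) in groups)
end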

// Tags applied to memory allocations to specify which domain the memory is
// stored on, and also which "kind" of memory allocator was used.
// When filtering, a filter tag value of `0xffff` means "accept everything".
// We support the "CPU", "GPU" and "External" (e.g. "other") domains.
Member: Where does mmap fall in this scheme?

}

// Next up; store allocation/type information
memprof_alloc_data[memprof_alloc_data_size].memory_location = v;
Member: Maybe hoist the indexing operation for slightly faster performance?

@@ -0,0 +1,9 @@
# Compatibility shims for old users who aren't used to the `Time` and `Memory` sub-modules
function init(args...; kwargs...)
@warn("Profile.init() is deprecated, use Profile.Time.init() or Profile.Memory.init() directly")
Member: I worry this much scoping will be irritating to users. How about transitioning to an all-keyword interface?

There's also some concern about deprecations, I have the impression we're trying to avoid those. If so, it might be necessary to preserve the current interface as-is and just add to it.


Build a memory profile tag filter that will capture only events matching the given
filter parameters. This method is meant to mirror the `#define`'d `JL_MEMPROF_TAG_*`
values within `julia_internal.h`. Valid memory domains are `:cpu`, `:gpu`, `:external`
Member: Better not to encourage users to have to open up julia_internal.h, better just to give the valid values here (and then you can add the more detailed info about correspondence as a comment somewhere).

Create an `AllocationInfo` that represents an allocated (and not yet deallocated) chunk
of memory. This `AllocationInfo` will later be "closed" to represent the full lifecycle
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
Member: Cool, a chemistry reference 😉 https://en.wikipedia.org/wiki/DOTA_(chelator)

Less frivolously, I think this design encourages serious performance bottlenecks, see comments below in read_and_coalesce_memprofile_data.

"""
read_and_coalesce_memprofile_data()

Load in the backtrace and allocation data buffers from the C side of Julia-land, returns two
Member: I count 3. Spell out the meaning of ghost_chunks too.

@info("Opening $(chunk_id) $(a.T) $(a.allocsz)")

# Assert that we're not inserting an identical chunk
@assert !(chunk_id in keys(open_chunks_map)) "Doubly-opened memory chunk!"
Member: Initially I thought, "can't this happen just from re-use of pooled memory?" Needs a comment re the deletion below.


# Any leftover "open" chunks are just plain old leaked memory.
open_chunks = sort(collect(values(open_chunks_map)); by = x -> x.alloc_time)
closed_chunks = sort(closed_chunks, by = x -> x.alloc_time)
Member: sort!? (for both)


# If this `a` represents a free(), let's see if we're closing a previously opened chunk.
if !(chunk_id in keys(open_chunks_map))
push!(ghost_chunks, (a, bt))
Member: These aren't ghosts, because ghosts were once alive 😄 More like "lost souls"

get_memprofile_tag_filter() = UInt16(ccall(:jl_memprofile_tag_filter, Cint, ()))

"""
read_and_coalesce_memprofile_data()
Member: As an API this feels awkward. The name suggests too much is being crammed into one function. If some users will only want detail on the "open" allocations, then the time spent constructing stacktraces for all the closed events will be a huge unnecessary load. Might be best to (1) parcel into types, (2) look for duplicates in the backtraces, and (3) look up only unique backtraces.

void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
{
#ifdef JULIA_ENABLE_THREADING
jl_atomic_fetch_add(&gc_num.allocd, sz);
Contributor: Please don't. This kills performance. The TODO in the code was wrong and this should be made thread-local instead. The function that wants to read these values is responsible for adding everything up. The only thread synchronization needed for the allocation thread should be to use relaxed atomic load and store and a normal increment.

@clarkevans commented Aug 12, 2020

For my application, I needed to know the resident-memory high-water mark and processing time in order to do resource/workload planning. @btime got me timing and allocations; however, it didn't provide the resident-memory high-water mark. The hack below helped me get this memory-usage detail. I'm sharing this because it might be useful for understanding a particular benchmarking need. Would the proposal above provide a similar metric?

using BenchmarkTools  # for @btime

sizings = tempname()
memfile = "/proc/$(Base.Libc.getpid())/statm"
bashcmd = `/bin/bash -c "while :; do cat $(memfile); sleep 1; done"`

# /proc/<pid>/statm reports sizes in 4096-byte pages; convert the first two
# fields (total program size, resident set size) to MB.
parse_sizing(line::String)::Tuple{Float64, Float64} =
    tuple([round(parse(Int, size) * 4096 / 1000 / 1000, digits=2)
           for size in split(strip(line), " ")[1:2]]...)

GC.gc()
(baseline_virtual, baseline_resident) = parse_sizing(read(memfile, String))
timer = run(pipeline(bashcmd, stdout=sizings), wait=false)
@btime my_long_running_process() samples=5
kill(timer)
samples = [parse_sizing(string(line)) for line in
           split(strip(read(sizings, String)), "\n")]
rm(sizings)
max_virtual  = max([a - baseline_virtual  for (a, b) in samples]...)
max_resident = max([b - baseline_resident for (a, b) in samples]...)

What I discovered using the hack above was helpful to me. On my 32GB box, performance is quick until I hit about 5GB, at which point the GC kicks in to keep my memory usage down (but increases processing time, accounting for 50-80% of the total). I could use this hack to see just how much CPU and resident memory I need per unit of work, so that I could do resource planning. Moreover, I was surprised to see just how high GC costs are for my workloads. This also let me speculate on just how important vectored representations of fixed-size data could be with regard to garbage collection costs; e.g. Vector{SubString} as pointers into a huge memory-mapped JSON file rather than thousands of small strings. As a side note, I didn't see how I could open a process and have its output buffered; hence, I used a temporary file. Moreover, I didn't see how I could easily fork() the process to have one process monitor memory and the other run the computation; hence, I used bash.

@DilumAluthge deleted the sf/memprofiler branch March 25, 2021 21:58
@vilterp mentioned this pull request Sep 22, 2021