Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocation Profiler: Types for all allocations #50337

Merged
merged 13 commits into from
Jul 31, 2023
90 changes: 83 additions & 7 deletions doc/src/manual/profile.md
Original file line number Diff line number Diff line change
Expand Up @@ -338,15 +338,91 @@ argument can be passed to speed it up by making it skip some allocations.
Passing `sample_rate=1.0` will make it record everything (which is slow);
`sample_rate=0.1` will record only 10% of the allocations (faster), etc.

!!! note
!!! compat "Julia 1.11"

Older versions of Julia could not capture types in all cases. In older versions of
Julia, if you see an allocation of type `Profile.Allocs.UnknownType`, it means that
the profiler doesn't know what type of object was allocated. This mainly happened when
the allocation was coming from generated code produced by the compiler. See
[issue #43688](https://github.com/JuliaLang/julia/issues/43688) for more info.

Since Julia 1.11, all allocations should have a type reported.

For more details on how to use this tool, please see the following talk from JuliaCon 2022:
https://www.youtube.com/watch?v=BFvpwC8hEWQ

##### Allocation Profiler Example

The current implementation of the Allocations Profiler _does not
capture types for all allocations._ Allocations for which the profiler
could not capture the type are represented as having type
`Profile.Allocs.UnknownType`.
In this simple example, we use PProf to visualize the alloc profile. You could use another
visualization tool instead. We collect the profile (specifying a sample rate), then we visualize it.
```julia
using Profile, PProf
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.0001 my_function()
PProf.Allocs.pprof()
```

Here is a more in-depth example, showing how we can tune the sample rate. A
good number of samples to aim for is around 1 - 10 thousand. Too many, and the
profile visualizer can get overwhelmed, and profiling will be slow. Too few,
and you don't have a representative sample.

You can read more about the missing types and the plan to improve this, here:
[issue #43688](https://github.com/JuliaLang/julia/issues/43688).

```julia-repl
julia> import Profile

julia> @time my_function() # Estimate allocations from a (second-run) of the function
0.110018 seconds (1.50 M allocations: 58.725 MiB, 17.17% gc time)
500000

julia> Profile.Allocs.clear()

julia> Profile.Allocs.@profile sample_rate=0.001 begin # 1.5 M * 0.001 = ~1.5K allocs.
my_function()
end
500000

julia> prof = Profile.Allocs.fetch(); # If you want, you can also manually inspect the results.

julia> length(prof.allocs) # Confirm we have expected number of allocations.
1515

julia> using PProf # Now, visualize with an external tool, like PProf or ProfileCanvas.

julia> PProf.Allocs.pprof(prof; from_c=false) # You can optionally pass in a previously fetched profile result.
Analyzing 1515 allocation samples... 100%|████████████████████████████████| Time: 0:00:00
Main binary filename not available.
Serving web UI on http://localhost:62261
"alloc-profile.pb.gz"
```
Then you can view the profile by navigating to http://localhost:62261, and the profile is saved to disk.
See PProf package for more options.

##### Allocation Profiling Tips

As stated above, aim for around 1-10 thousand samples in your profile.

Note that we are uniformly sampling in the space of _all allocations_, and are not weighting
our samples by the size of the allocation. So a given allocation profile may not give a
representative profile of where most bytes are allocated in your program, unless you had set
`sample_rate=1`.

Allocations can come from users directly constructing objects, but can also come from inside
the runtime or be inserted into compiled code to handle type instability. Looking at the
"source code" view can be helpful to isolate them, and then other external tools such as
[`Cthulhu.jl`](https://github.com/JuliaDebug/Cthulhu.jl) can be useful for identifying the
cause of the allocation.

##### Allocation Profile Visualization Tools

There are several profiling visualization tools now that can all display Allocation
Profiles. Here is a small list of some of the main ones we know about:
- [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
- [ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl)
- VSCode's built-in profile visualizer (`@profview_allocs`) [docs needed]
- Viewing the results directly in the REPL
- You can inspect the results in the REPL via [`Profile.Allocs.fetch()`](@ref), to view
the stacktrace and type of each allocation.

#### Line-by-Line Allocation Tracking

Expand Down
1 change: 1 addition & 0 deletions src/gc-alloc-profiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t

extern int g_alloc_profile_enabled;

// This should only be used from _deprecated_ code paths. We shouldn't see UNKNOWN anymore.
#define jl_gc_unknown_type_tag ((jl_datatype_t*)0xdeadaa03)

static inline void maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t *typ) JL_NOTSAFEPOINT {
Expand Down
24 changes: 21 additions & 3 deletions src/gc.c
Original file line number Diff line number Diff line change
Expand Up @@ -1040,13 +1040,20 @@ STATIC_INLINE jl_value_t *jl_gc_big_alloc_inner(jl_ptls_t ptls, size_t sz)
return jl_valueof(&v->header);
}

// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc(jl_ptls_t ptls, size_t sz)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
maybe_record_alloc_to_profile(val, sz, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc_instrumented(jl_ptls_t ptls, size_t sz, jl_value_t *type)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
maybe_record_alloc_to_profile(val, sz, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_big_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_big_alloc_inner`
Expand Down Expand Up @@ -1357,14 +1364,22 @@ STATIC_INLINE jl_value_t *jl_gc_pool_alloc_inner(jl_ptls_t ptls, int pool_offset
return jl_valueof(v);
}

// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
int osize)
{
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
maybe_record_alloc_to_profile(val, osize, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc_instrumented(jl_ptls_t ptls, int pool_offset,
int osize, jl_value_t* type)
{
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
maybe_record_alloc_to_profile(val, osize, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_pool_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_pool_alloc_inner`
Expand Down Expand Up @@ -3836,7 +3851,10 @@ static void *gc_managed_realloc_(jl_ptls_t ptls, void *d, size_t sz, size_t olds
SetLastError(last_error);
#endif
errno = last_errno;
maybe_record_alloc_to_profile((jl_value_t*)b, sz, jl_gc_unknown_type_tag);
// gc_managed_realloc_ is currently used exclusively for resizing array buffers.
if (allocsz > oldsz) {
maybe_record_alloc_to_profile((jl_value_t*)b, allocsz - oldsz, (jl_datatype_t*)jl_buff_tag);
}
return b;
}

Expand Down
2 changes: 2 additions & 0 deletions src/jl_exported_funcs.inc
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,7 @@
XX(jl_gc_alloc_3w) \
XX(jl_gc_alloc_typed) \
XX(jl_gc_big_alloc) \
XX(jl_gc_big_alloc_instrumented) \
XX(jl_gc_collect) \
XX(jl_gc_conservative_gc_support_enabled) \
XX(jl_gc_counted_calloc) \
Expand Down Expand Up @@ -185,6 +186,7 @@
XX(jl_gc_new_weakref_th) \
XX(jl_gc_num) \
XX(jl_gc_pool_alloc) \
XX(jl_gc_pool_alloc_instrumented) \
XX(jl_gc_queue_multiroot) \
XX(jl_gc_queue_root) \
XX(jl_gc_safepoint) \
Expand Down
9 changes: 5 additions & 4 deletions src/llvm-final-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -187,12 +187,13 @@ Value *FinalLowerGC::lowerSafepoint(CallInst *target, Function &F)
Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
{
++GCAllocBytesCount;
assert(target->arg_size() == 2);
assert(target->arg_size() == 3);
CallInst *newI;

IRBuilder<> builder(target);
builder.SetCurrentDebugLocation(target->getDebugLoc());
auto ptls = target->getArgOperand(0);
auto type = target->getArgOperand(2);
Attribute derefAttr;

if (auto CI = dyn_cast<ConstantInt>(target->getArgOperand(1))) {
Expand All @@ -203,19 +204,19 @@ Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
if (offset < 0) {
newI = builder.CreateCall(
bigAllocFunc,
{ ptls, ConstantInt::get(T_size, sz + sizeof(void*)) });
{ ptls, ConstantInt::get(T_size, sz + sizeof(void*)), type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sz + sizeof(void*));
}
else {
auto pool_offs = ConstantInt::get(Type::getInt32Ty(F.getContext()), offset);
auto pool_osize = ConstantInt::get(Type::getInt32Ty(F.getContext()), osize);
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize });
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), osize);
}
} else {
auto size = builder.CreateZExtOrTrunc(target->getArgOperand(1), T_size);
size = builder.CreateAdd(size, ConstantInt::get(T_size, sizeof(void*)));
newI = builder.CreateCall(allocTypedFunc, { ptls, size, ConstantPointerNull::get(Type::getInt8PtrTy(F.getContext())) });
newI = builder.CreateCall(allocTypedFunc, { ptls, size, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sizeof(void*));
}
newI->setAttributes(newI->getCalledFunction()->getAttributes());
Expand Down
46 changes: 29 additions & 17 deletions src/llvm-late-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2349,22 +2349,6 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
IRBuilder<> builder(CI);
builder.SetCurrentDebugLocation(CI->getDebugLoc());

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it doesn't set the tag.
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, T_size, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false)
});
newI->takeName(CI);

// LLVM alignment/bit check is not happy about addrspacecast and refuse
// to remove write barrier because of it.
// We pretty much only load using `T_size` so try our best to strip
Expand Down Expand Up @@ -2403,7 +2387,35 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
builder.CreateAlignmentAssumption(DL, tag, 16);
}
}
// Set the tag.

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it specializes the call based on the constant
// size of the object to allocate, to save one indirection, and doesn't set
// the type tag. (Note that if the size is not a constant, it will call
// gc_alloc_obj, and will redundantly set the tag.)
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, T_size, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false),
builder.CreatePtrToInt(tag, T_size),
});
newI->takeName(CI);

// Now, finally, set the tag. We do this in IR instead of in the C alloc
// function, to provide possible optimization opportunities. (I think? TBH
// the most recent editor of this code is not entirely clear on why we
// prefer to set the tag in the generated code. Providing optimziation
// opportunities is the most likely reason; the tradeoff is slightly
// larger code size and increased compilation time, compiling this
// instruction at every allocation site, rather than once in the C alloc
// function.)
auto &M = *builder.GetInsertBlock()->getModule();
StoreInst *store = builder.CreateAlignedStore(
tag, EmitTagPtr(builder, tag_type, T_size, newI), M.getDataLayout().getPointerABIAlignment(0));
Expand Down
14 changes: 8 additions & 6 deletions src/llvm-pass-helpers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -151,7 +151,9 @@ namespace jl_intrinsics {
auto intrinsic = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), T_size },
{ Type::getInt8PtrTy(ctx),
T_size,
T_size }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_BYTES_NAME);
Expand Down Expand Up @@ -236,8 +238,8 @@ namespace jl_intrinsics {
}

namespace jl_well_known {
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc);
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc_instrumented);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc_instrumented);
static const char *GC_QUEUE_ROOT_NAME = XSTR(jl_gc_queue_root);
static const char *GC_ALLOC_TYPED_NAME = XSTR(jl_gc_alloc_typed);

Expand All @@ -251,7 +253,7 @@ namespace jl_well_known {
auto bigAllocFunc = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), T_size },
{ Type::getInt8PtrTy(ctx), T_size , T_size},
false),
Function::ExternalLinkage,
GC_BIG_ALLOC_NAME);
Expand All @@ -267,7 +269,7 @@ namespace jl_well_known {
auto poolAllocFunc = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), Type::getInt32Ty(ctx), Type::getInt32Ty(ctx) },
{ Type::getInt8PtrTy(ctx), Type::getInt32Ty(ctx), Type::getInt32Ty(ctx), T_size },
false),
Function::ExternalLinkage,
GC_POOL_ALLOC_NAME);
Expand Down Expand Up @@ -301,7 +303,7 @@ namespace jl_well_known {
T_prjlvalue,
{ Type::getInt8PtrTy(ctx),
T_size,
Type::getInt8PtrTy(ctx) },
T_size }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_TYPED_NAME);
Expand Down
31 changes: 31 additions & 0 deletions stdlib/Profile/test/allocs.jl
Original file line number Diff line number Diff line change
Expand Up @@ -121,3 +121,34 @@ end
@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == String]) >= 1
end

@testset "alloc profiler catches allocs from codegen" begin
@eval begin
struct MyType x::Int; y::Int end
Base.:(+)(n::Number, x::MyType) = n + x.x + x.y
foo(a, x) = a[1] + x
wrapper(a) = foo(a, MyType(0,1))
end
a = Any[1,2,3]
# warmup
wrapper(a)

@eval Allocs.@profile sample_rate=1 wrapper($a)

prof = Allocs.fetch()
Allocs.clear()

@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == MyType]) >= 1
end

@testset "alloc profiler catches allocs from buffer resize" begin
a = Int[]
Allocs.@profile sample_rate=1 for _ in 1:100; push!(a, 1); end

prof = Allocs.fetch()
Allocs.clear()

@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == Profile.Allocs.BufferType]) >= 1
end
Loading