[lldb] LRUCache for Swift type system mangling/demangling #9191
base: stable/20230725
Conversation
@augusto2112 please take a look. Not sure if I'm targeting the right branch, though.
This looks really nice! I'm impressed by the performance gains you mentioned.

I think a great use case for this would be an LRU cache from name -> node pointer inside the typeref type system, replacing all the calls to `dem.demangleSymbol` with lookups into the cache.
```cpp
  auto it = map_.find(key);
  if (it == map_.end())
    return std::nullopt;
  list_.splice(list_.begin(), list_, it->second);
```
As I understand it, this is moving the requested element to the front of the list, right? I think it'd be worthwhile adding a comment here explaining that.
Moved it into a dedicated function.
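A minimal sketch of what such a dedicated helper might look like, assuming the `list_`/`map_` members quoted above (the function name is hypothetical):

```cpp
// Hypothetical helper; assumes the List/list_ members from the fragment above.
// std::list::splice relinks the node in O(1) and does not invalidate
// iterators, so the iterators stored in map_ stay valid.
void MarkAsMostRecentlyUsed(typename List::iterator node_it) {
  list_.splice(list_.begin(), list_, node_it);
}
```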
```cpp
  using Node = std::pair<Key, Value>;
  using List = std::list<Node>;
  List list_;
  std::unordered_map<Key, typename List::iterator> map_;
```
`llvm::DenseMap` is almost always more efficient than `std::unordered_map`.
Is `StringMap` OK? I now use it to avoid a double allocation for the keys.
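For readers following the thread, here is a self-contained sketch of the list-plus-map scheme the quoted fragments come from; the capacity handling and exact method shapes are my reconstruction, not necessarily the patch's code:

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Sketch of an LRU cache: the list keeps key/value pairs in recency order,
// the map gives O(1) access to list nodes by key.
template <typename Key, typename Value> class LRUCache {
public:
  explicit LRUCache(size_t capacity) : capacity_(capacity) {}

  std::optional<Value> Get(const Key &key) {
    auto it = map_.find(key);
    if (it == map_.end())
      return std::nullopt;
    // Move the accessed node to the front: it is now the most recently used.
    list_.splice(list_.begin(), list_, it->second);
    return it->second->second;
  }

  void Put(Key key, Value value) {
    if (auto it = map_.find(key); it != map_.end()) {
      it->second->second = std::move(value);
      list_.splice(list_.begin(), list_, it->second);
      return;
    }
    if (list_.size() == capacity_) {
      // Evict the least recently used entry from the back of the list.
      map_.erase(list_.back().first);
      list_.pop_back();
    }
    list_.emplace_front(std::move(key), std::move(value));
    // Note: with Key = std::string the key is stored twice (list and map);
    // this is the double allocation the StringMap-based version avoids.
    map_[list_.front().first] = list_.begin();
  }

private:
  using Node = std::pair<Key, Value>;
  using List = std::list<Node>;
  size_t capacity_;
  List list_;
  std::unordered_map<Key, typename List::iterator> map_;
};
```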
```diff
@@ -505,6 +506,8 @@ class TypeSystemSwiftTypeRef : public TypeSystemSwift {

   /// All lldb::Type pointers produced by DWARFASTParser Swift go here.
   ThreadSafeDenseMap<const char *, lldb::TypeSP> m_swift_type_map;
+  swift_demangle::LRUCache<std::string, CompilerType> m_canonical_types_cache;
```
We'll probably need to make sure these data structures are thread safe.
Yes, good point. I thought about this but then forgot. I'll add a mutex.
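A sketch of what that locking could look like, assuming a `GetOrCreate`-style entry point like the one this PR uses later (the factory shape is illustrative, not the patch's exact API):

```cpp
#include <mutex>
#include <optional>

// Illustrative member of the cache (or of a thread-safe wrapper around it):
// a single mutex serializes lookup and insertion. Running the factory under
// the lock is simple, but it also serializes the expensive demangling work.
// Assumes: std::optional<Value> Get(const Key &), void Put(Key, Value),
// and a std::mutex m_mutex member alongside the list/map.
template <typename Factory>
Value GetOrCreate(const Key &key, Factory factory) {
  std::lock_guard<std::mutex> lock(m_mutex);
  if (std::optional<Value> cached = Get(key))
    return *cached;
  Value value = factory();
  Put(key, value);
  return value;
}
```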
I wonder if these specialized per-function caches are scalable, though, as we'd need one cache per function. I suppose we could only create caches for functions we know are very hot.
Yes, I'm going to collect more detailed logs to see how many times a function is called with the same mangled name, so we don't add caches for functions that are called only once or twice.
Force-pushed from e7e18d2 to cb6c653.
I reimplemented the `LRUCache` on top of `llvm::StringMap`, so only the map is responsible for owning the key strings.
I collected some statistics about calls to `demangleSymbol` and `mangleNode`. The object that is printed looks like this:

```swift
// (Assumed definition, implied by the usage below.)
protocol HasDefault {
  static var defaultValue: Self { get }
}

struct SomeRandomValueTagged<Tag>: HasDefault {
  private var value: SomeClass
  private init(value: SomeClass) {
    self.value = value
  }
  static var defaultValue: Self {
    Self(value: SomeClass())
  }
}

private final class SomeClass {
  init() {}
}

enum Tag0 {}
enum Tag1 {}
enum Tag2 {}
enum Tag3 {}
enum Tag4 {}
enum Tag5 {}
enum Tag6 {}
enum Tag7 {}

final class L0<Tag> {
  var v0 = SomeRandomValueTagged<(Tag0, Tag)>.defaultValue
  var v1 = SomeRandomValueTagged<(Tag1, Tag)>.defaultValue
  var v2 = SomeRandomValueTagged<(Tag2, Tag)>.defaultValue
  var v3 = SomeRandomValueTagged<(Tag3, Tag)>.defaultValue
  var v4 = SomeRandomValueTagged<(Tag4, Tag)>.defaultValue
  var v5 = SomeRandomValueTagged<(Tag5, Tag)>.defaultValue
  var v6 = SomeRandomValueTagged<(Tag6, Tag)>.defaultValue
  var v7 = SomeRandomValueTagged<(Tag7, Tag)>.defaultValue
}

// ...
let var0 = L0<Void>()
// (lldb) v var0
```

Calls to `demangleSymbol`:

Symbol | no cache | cache
---|---|---
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag0O_yttGD | 124 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag1O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag2O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag3O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag4O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag5O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag6O_yttGD | 122 | 72 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag7O_yttGD | 122 | 72 |
$s13lldb_perf_dbg2L0Cys4VoidaGD | 126 | 53 |
$s13lldb_perf_dbg2L0CyytGD | 336 | 221 |
$s13lldb_perf_dbg3FooCACycfc | 4 | 4 |
$s13lldb_perf_dbg3FooCACycfc$s13lldb_perf_dbg3FooCACycfc | 0 | 0 |
$s13lldb_perf_dbg9SomeClass33_D6034200B16897271DB50112FD04A664LLC | 18 | 18 |
$s13lldb_perf_dbg9SomeClass33_D6034200B16897271DB50112FD04A664LLCD | 1190 | 585 |
Total | 2652 | 1457 |

Calls to `mangleNode`:
Symbol | no cache | cache |
---|---|---|
$s13lldb_perf_dbg21SomeRandomValueTaggedVD | 72 | 32 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag0O_yttGD | 50 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag1O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag2O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag3O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag4O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag5O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag6O_yttGD | 49 | 9 |
$s13lldb_perf_dbg21SomeRandomValueTaggedVyAA4Tag7O_yttGD | 49 | 9 |
$s13lldb_perf_dbg2L0CD | 35 | 11 |
$s13lldb_perf_dbg2L0Cys4VoidaGD | 39 | 6 |
$s13lldb_perf_dbg2L0CyytGD | 113 | 33 |
$s13lldb_perf_dbg9SomeClass33_D6034200B16897271DB50112FD04A664LLCD | 543 | 84 |
$ss4VoidaD | 39 | 6 |
Total | 1234 | 244 |
First: thanks for looking into this! I believe that this kind of caching would be okay, since no clients ever modify existing nodes; AFAIK, they only ever add new nodes that point to existing ones. However, I wonder if sharing the demangle context between different threads could be an issue? It might need to be guarded by a lock (I haven't checked whether you do this already), which might eat some of the performance benefits.

I was at first unsure about using shared pointers to store the demangle context and wondered if maybe the typesystem should own a global context, but I'm starting to think of this as a compression scheme where we unpack the demangle trees when we need to do work on them, and from that point of view something like what you have makes more sense.

Did you pick this function because it is the most expensive leaf function inside of LLDB? For example, I would imagine that we can get even more out of it by caching the canonical demangle tree, which involves Clang type lookups, but getting the cache key right for that is also going to be more tricky.

Does the cache have to be owned by TypeSystemSwiftTypeRef (of which there are 100s on Darwin), or would there be a benefit to it being a static singleton?
For now, these shared demanglers don't participate in any operations at all; I just need them to stay alive as long as the associated nodes are alive. So far I've implemented only one operation on the node.

The main concern with a global context is uncontrollable memory consumption.

Not really. I mean, these functions were pretty hot, but actually I just noticed heavy mangling and demangling calls in the inverted call tree and then added caches to the calling functions that were easiest to change.

Sorry, but I don't understand what you mean by "the canonical demangle tree". Do you mean the result of `GetCanonicalDemangleTree`?

I think it doesn't really matter whether it's global or type-system-local. For example, the cache for `GetCachedDemangledSymbol` is global.
```diff
@@ -2991,6 +2997,12 @@ TypeSystemSwiftTypeRef::GetCanonicalType(opaque_compiler_type_t type) {
     ConstString mangled(mangling.result());
     return GetTypeFromMangledTypename(mangled);
   };
+  auto impl = [&] {
+    auto mangled_name = StringRef(AsMangledName(type));
+    return m_canonical_types_cache.GetOrCreate(
```
I don't think this is necessarily correct, since constructing the canonical demangle tree involves DWARF lookups, but they are not part of the cache key here.
In what cases might a DWARF lookup return different results? When we add a new module to the module list?
Yes, that would be one such case.
Generally, a good way to be sure that we understand all the implicit dependencies would be to make the function we want to cache a static function first; then all dependencies become explicit, and we know they need to go into the cache key (or invalidate the cache).
You might not see it on the command line but LLDB is also a library.
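One illustrative way to make such a dependency explicit is to fold it into the key. The sketch below assumes (this is not from the patch) that the type system can observe a counter that is bumped whenever the module list changes:

```cpp
#include <cstdint>

// Hypothetical composite cache key: the mangled name plus a module-list
// generation counter. Once the counter is bumped, old keys can never match
// again, so entries that depended on earlier DWARF lookups are effectively
// invalidated.
struct CanonicalTypeKey {
  const char *mangled_name;   // pooled ConstString pointer
  uint64_t module_generation; // incremented whenever modules are added/removed

  bool operator==(const CanonicalTypeKey &other) const {
    return mangled_name == other.mangled_name &&
           module_generation == other.module_generation;
  }
};
```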
One concern: I think Adrian brings up a great point, in that we need to be very careful in understanding what the cache key has to be in each of these cases. To move this patch forward, I propose we focus on caching the results of demangling first.
```cpp
  explicit operator bool() const { return IsValid(); }

  /// Gets a raw pointer to the Node.
  swift::Demangle::NodePointer GetRawNode() const & { return m_node; }
```
I wonder if this function is necessary if we already implement the `->` operator.
Interestingly, …

Hmm, it looks like … Your cached version of …

When a type isn't found, or is found in the negative cache, …

That'd be a way to do it, but if the type is already cached in …
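For reference, a self-contained sketch of the handle type this thread is discussing; everything beyond the two members quoted above is my assumption about its shape:

```cpp
#include <memory>
#include "swift/Demangling/Demangler.h"

// Sketch: a NodePointer paired with a shared_ptr that keeps the owning
// Demangler, and therefore the whole node tree, alive.
class SharedDemangledNode {
public:
  SharedDemangledNode() = default;
  SharedDemangledNode(std::shared_ptr<swift::Demangle::Demangler> dem,
                      swift::Demangle::NodePointer node)
      : m_dem(std::move(dem)), m_node(node) {}

  bool IsValid() const { return m_node != nullptr; }
  explicit operator bool() const { return IsValid(); }

  // With operator-> available, most callers would not need GetRawNode.
  swift::Demangle::NodePointer operator->() const { return m_node; }

  /// Gets a raw pointer to the Node.
  swift::Demangle::NodePointer GetRawNode() const & { return m_node; }

private:
  std::shared_ptr<swift::Demangle::Demangler> m_dem;
  swift::Demangle::NodePointer m_node = nullptr;
};
```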
Force-pushed from cb6c653 to e49a36b.
@augusto2112 I replaced some of the calls to `demangleSymbol` with the cached version.
```cpp
namespace swift_demangle {

template <typename Value>
class LRUCache {
```
Can you add short doxygen comments for the class and the public methods?
```cpp
  }

private:
  using List = std::list<llvm::StringRef>;
```
Would a vector<> and a moving "lru" iterator work and be faster?
Can you elaborate, please? I'm not sure I understand how it could be faster. Both `Get` and `Put` are O(1) now; with a vector, moving an element to the front would be O(N).
I was thinking of storing the elements of the list in a vector and having an `lru` iterator that points to the newest entry. When inserting a new value, we overwrite the value at `std::next(lru)` and increment `lru`, letting the iterator wrap around as in a ring buffer. This would make scanning through the cache faster because we don't need to chase pointers, and it would also use less memory (not that that matters for 10 elements). You are right that inserting an element would be more expensive because we'd need to copy up to N-1 elements. We could instead always move the `lru` iterator to the right and overwrite whatever is there on a cache miss. I think that would still give us a set of the last N used entries; we just wouldn't know their relative order any more. But that might not matter.
Got it now, thanks. Yeah, this can work, but we have to sacrifice reordering in `Get`. I think that's OK: it looks like most accesses to a cache item are sequential, and in that case pretty much any replacement strategy would work (even random replacement). I doubt it should be called "LRU" then, as we'd effectively discard the least recently inserted items. I'll prototype this variant.
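A rough sketch of that ring-buffer variant as I understand Adrian's suggestion (all names hypothetical; assumes a non-zero capacity):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>
#include "llvm/ADT/StringRef.h"

// Fixed-size buffer scanned linearly on Get; on a miss the caller Puts the
// new entry, overwriting the slot after the most recently inserted one.
// Recency is not tracked on Get, so eviction is "least recently inserted",
// not true LRU.
template <typename Value> class RingCache {
public:
  explicit RingCache(size_t capacity) : entries_(capacity) {}

  std::optional<Value> Get(llvm::StringRef key) const {
    for (const auto &e : entries_)
      if (e && key == e->first)
        return e->second;
    return std::nullopt;
  }

  void Put(llvm::StringRef key, Value value) {
    entries_[next_] = std::make_pair(key.str(), std::move(value));
    next_ = (next_ + 1) % entries_.size();
  }

private:
  std::vector<std::optional<std::pair<std::string, Value>>> entries_;
  size_t next_ = 0;
};
```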
Oh, that's a good point, though. I didn't think of this being a global cache. We might have a lot of long-lived types that we need to demangle over and over again. We might benefit from a much larger cache, but then we'd probably need a DenseMap for fast lookups and maybe track the LRU list separately. I don't want to get hung up on the implementation details here; the nice thing is that we can change and fine-tune it after an initial version has landed. So from that point of view we can also keep your list-based implementation.
Is DenseMap faster for lookups than StringMap? I chose StringMap because it's used in ConstString's pools, so I thought it was the map to go for.
`StringMap` copies the string key into the map; if you want to take advantage of the `ConstString` pool, `DenseMap` is probably better.
> Is DenseMap faster for lookups than StringMap?

No, it's not. I think your current implementation is actually fine.
But what Augusto said is also true: the mangled names in the CompilerTypes are guaranteed to be ConstStrings, so a `DenseMap<const char *, ...>` is more efficient. We don't need to compare the strings at all, just the pointers.
Yeah, you're probably right. The conversion `opaque_compiler_type_t` -> `ConstString` will still do a lookup in a StringMap (one of ConstString's pools), so it costs a `strlen` plus 2-3 hash operations plus the lookup itself. But we would store just pointers and not copy whole strings into the cache, so in the end it will probably be the same performance with a smaller memory footprint.
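A small sketch of the pointer-keyed approach being described, assuming the keys always come from the ConstString pool (the wrapper type is hypothetical):

```cpp
#include "lldb/Symbol/CompilerType.h"
#include "llvm/ADT/DenseMap.h"

// Because ConstString interns its strings, two equal mangled names yield the
// same const char *, so pointer identity stands in for string equality and
// the DenseMap never hashes or compares the characters themselves.
struct PointerKeyedTypeCache {
  llvm::DenseMap<const char *, lldb_private::CompilerType> map;

  // `mangled` must be a pooled ConstString pointer, e.g. AsMangledName(type).
  const lldb_private::CompilerType *Lookup(const char *mangled) const {
    auto it = map.find(mangled);
    return it == map.end() ? nullptr : &it->second;
  }
};
```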
```diff
@@ -0,0 +1,28 @@
+//===-- SwiftDemangle.h ---------------------------------------*- C++ -*-===//
```
nit: .cpp
```diff
@@ -1074,6 +1074,12 @@ TypeSystemSwiftTypeRef::GetCanonicalNode(swift::Demangle::Demangler &dem,

+/// Return the demangle tree representation of this type's canonical
+/// (type aliases resolved) type.
+std::pair<swift::Demangle::NodePointer, SharedDemangledNode>
+TypeSystemSwiftTypeRef::GetCanonicalDemangleTreeWithCache(
```
Would we ever want the non-cached version?
I'd say no, but it's used in `DWARFASTParserSwift`, and it's tricky to refactor it to use the cached version. Before doing so, I wanted us to discuss whether we are happy with the API for `GetCanonicalDemangleTreeWithCache`; see my concerns above in #9191 (comment). That's why I temporarily left the non-cached version.
I can see three ways of making this API cleaner:

1. Make `Transform` clone the node returned by `fn` using the new demangler; this way all transformed nodes are guaranteed to be attached to the new demangler. The obvious downside is the unnecessary cloning of nodes.
2. Change `SharedDemangledNode` to keep a `SmallVector` of shared_ptrs to demanglers (or perhaps only two demanglers, one being optional?) to ensure the entire demangle tree survives, since parts of the demangle tree are owned by different demanglers. The downside is that reasoning about `SharedDemangledNode` becomes more complex, since two `SharedDemangledNode`s can now share a demangler.
3. Create a new class that owns multiple `SharedDemangledNode`s (to keep their demanglers alive) but only exposes one `NodePointer`; see the sketch below. I think this might be the best approach, but now we have two new classes representing demangled nodes instead of one.
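A sketch of option 3, reusing the `SharedDemangledNode` shape sketched earlier in the thread (the class name and member layout are hypothetical):

```cpp
#include <memory>
#include "llvm/ADT/SmallVector.h"

// A handle that exposes a single NodePointer but keeps every contributing
// demangler alive by owning the SharedDemangledNodes whose trees the result
// points into.
class CombinedDemangledNode {
public:
  CombinedDemangledNode(swift::Demangle::NodePointer node,
                        llvm::SmallVector<SharedDemangledNode, 2> parts)
      : m_node(node), m_parts(std::move(parts)) {}

  swift::Demangle::NodePointer GetRawNode() const { return m_node; }

private:
  swift::Demangle::NodePointer m_node = nullptr;
  // Keeps the demanglers that own pieces of m_node's tree alive.
  llvm::SmallVector<SharedDemangledNode, 2> m_parts;
};
```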
This is looking great! A few small changes and we can merge it in!
@swift-ci test
Force-pushed from e49a36b to 553cd0e.
Recent changes: …

Also, I did quick measurements before and after this commit. The results are much less impressive now but still significant. I need to implement a cache for …
@adrian-prantl Can you help me with the logic of this code and stack trace? If I don't patch it here, we are mutating a node in the cache. What should the correct behavior be?

PS: it fails on this test.
```cpp
  {
    std::lock_guard lock{m_mutex};
    if (auto map_it = m_map.find(key); map_it != m_map.end()) {
      const auto &map_value = map_it->second;
```
Just thought about this: it'd be nice if in debug mode we assert that the cached value and the one created from the factory are the same (sorry for giving you more work!)
Sure. The `Node` doesn't seem to provide an equality function, but it has `isSimilarTo`, which in combination with a simple DFS should be enough to implement one in `SharedDemangledNode`.

I have a couple of questions then:

- Do you want this check to be part of the `LRUCache` class or in `GetCachedDemangledSymbol`? I'm asking because in `LRUCache` we generally don't need the `Value` type to be equatable, but with this check we would require that.
- Under what flag should I add this check?
  - `!defined(NDEBUG)` works in `ReleaseAssert` builds.
  - `defined(LLDB_CONFIGURATION_DEBUG)` is defined only in debug builds (AFAICS).
  - Make a new macro like `LLDB_LRUCACHE_VALIDATE` that defaults to `NDEBUG` or `LLDB_CONFIGURATION_DEBUG`:

    ```cpp
    #if !defined(LLDB_LRUCACHE_VALIDATE) && !defined(NDEBUG)
    #define LLDB_LRUCACHE_VALIDATE
    #endif
    ```

    This is more flexible, but I doubt anyone will ever want to specify it separately.
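A sketch of that equality check via `isSimilarTo` plus a DFS (the exact `isSimilarTo` signature is assumed; `getNumChildren`/`getChild` are the standard node accessors):

```cpp
#include <cstddef>
#include "swift/Demangling/Demangle.h"

// isSimilarTo compares the kind/payload of a single node; the recursion
// extends that comparison to whole demangle trees.
static bool NodesEqual(swift::Demangle::NodePointer lhs,
                       swift::Demangle::NodePointer rhs) {
  if (lhs == rhs)
    return true;
  if (!lhs || !rhs || !lhs->isSimilarTo(rhs))
    return false;
  if (lhs->getNumChildren() != rhs->getNumChildren())
    return false;
  for (size_t i = 0, e = lhs->getNumChildren(); i != e; ++i)
    if (!NodesEqual(lhs->getChild(i), rhs->getChild(i)))
      return false;
  return true;
}
```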
`GetCachedDemangledSymbol` would be OK. `#ifndef NDEBUG` would be the one I'd pick; we already have similar checks (such as `Verify`) guarded under it.
Do you have a definition for …?
@augusto2112 I made an equality function and caught an edge case where it fails: two consecutive calls to the demangler produce different results when a node of the kind … is involved.

Demangling trees: …

This is somewhat bad. We could make …
@DmT021 looks like you fixed the …

Wow, great job catching this! I wasn't aware of …

We could probably make our own `Verify` function that is only compiled in assert builds.

Do you mean just …? This is definitely an issue, but this issue already exists with the current code, right? It looks like …

Maybe instead of having … What do you think?
It's a fix, but I still don't know if it's correct. It surely makes the tests pass, but I don't know why the original implementation was written that way and what problems it solves.

There was a copy routine in …

Right.

I think it makes sense to have a function like …
@augusto2112 can you take a look once more, please? I added the asserts as you asked.
This is more of a proof of concept than a final solution, but it shows a relatively good performance improvement for `ValueObjectPrinter::PrintValueObject`.

Tested using `v x`, where `x` is a struct with about 35,000+ nested unique types. Measured the total time of the `lldb_private::ValueObjectPrinter::PrintValueObject()` call using Instruments (Time Profiler, default resolution):

Before: 24.7 s
After: 17.4 s

There's potential for more performance gains using local `LRUCache`s and the global one (`GetCachedDemangledSymbol`). The next obvious candidate for optimization is `TypeSystemSwiftTypeRef::DemangleCanonicalType`, but it's trickier: it takes a `Demangler` as an argument, so I'd have to change the signature and all the references.