Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Index] Convert to hashing to HashBuilder with BLAKE3 #9140

Open
wants to merge 1 commit into
base: stable/20240723
Choose a base branch
from

Conversation

bnbarham
Copy link

Hashing.h is non-deterministic between runs. Update the index hashing to use xxhash for the unit hash and BLAKE3 for the record hash. Ideally we'd use xxhash for the record hash as well, but there's no easy HashBuilder option for it today.

This also removes the hash caching logic from the record hasher, as it turns out to be slower than just hashing everything for BLAKE3 and greatly simplifies the implementation with its removal.

Numbers for indexing Foundation and Cocoa textual includes on an M2 Pro over 10 runs with 3 warmup are as follows.

Build with full re-index (ie. index removed between each run) -

      Current: 688ms +- 8ms
       BLAKE3: 691ms +- 4ms
BLAKE3 cached: 711ms +- 8ms
   No-op hash: 620ms +- 4ms

Same but with an existing index (which would hash but then not write any output) -

      Current: 396ms +- 4ms
       BLAKE3: 394ms +- 4ms
BLAKE3 cached: 419ms +- 3ms
   No-op hash: 382ms +- 5ms

The no-op hash is a little misleading in the full re-index since it will be writing out fewer records. But the existing index case is interesting, showing that hashing is only a small part of the entire build and index.

Also worth noting that there was some fairly significant run-to-run variance of around 30ms, but the above was a generally typical pattern (ie. current about the same as BLAKE3, which is faster than BLAKE3 cached, and no-op is the fastest). The main take away is that this isn't a noticable performance regression.

@bnbarham
Copy link
Author

@swift-ci please test llvm

@bnbarham
Copy link
Author

@swift-ci please test

clang/lib/Index/IndexUnitWriter.cpp Outdated Show resolved Hide resolved
@@ -76,7 +75,9 @@ bool ClangIndexRecordWriter::writeRecord(StringRef Filename,
std::string &Error,
std::string *OutRecordFile) {

auto RecordHash = Hasher.hashRecord(IdxRecord);
std::array<uint8_t, 8> RecordHashArr = index::hashRecord(IdxRecord, Ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn’t it be easier if index::hashRecord returned uint64_t instead of std::array<uint8_t, 8>?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here. This will have to be done either way, it's just a matter of where. I actually implemented it that way first, but ended up changing it since std::array is what the HashBuilder API returns. Happy to change if you'd prefer it.

IndexRecordHasher &Hasher;

public:
DeclHashVisitor(IndexRecordHasher &Hasher) : Hasher(Hasher) {}

hash_code VisitDecl(const Decl *D) {
return VisitDeclContext(D->getDeclContext());
void VisitDecl(const Decl *decl) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the current LLVM naming convention regarding uppercase variables names? I know that we’re trying to move to lowercase variable names in the Swift compiler but I’m not sure about LLVM.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure honestly. https://discourse.llvm.org/t/top-level-clang-tidy-options-and-variablename-suggestion-on-codingstandards/58783 was the latest discussion I could find and anecdotally I see a bunch of camelCase in clang now.

clang/lib/Index/IndexRecordHasher.cpp Show resolved Hide resolved
clang/lib/Index/IndexRecordHasher.cpp Outdated Show resolved Hide resolved
@bnbarham
Copy link
Author

@swift-ci please test llvm

@bnbarham
Copy link
Author

@swift-ci please test

`Hashing.h` is non-deterministic between runs. Update the index hashing
to use xxhash for the unit hash and BLAKE3 for the record hash. Ideally
we'd use xxhash for the record hash as well, but there's no easy
`HashBuilder` option for it today.

This also removes the hash caching logic from the record hasher, as it
turns out to be slower than just hashing everything for BLAKE3 and
greatly simplifies the implementation with its removal.

Numbers for indexing `Foundation` and `Cocoa` textual includes on an M2
Pro over 10 runs with 3 warmup are as follows.

Build with full re-index (ie. index removed between each run) -
```
      Current: 688ms +- 8ms
       BLAKE3: 691ms +- 4ms
BLAKE3 cached: 711ms +- 8ms
   No-op hash: 620ms +- 4ms
```

Same but with an existing index (which would hash but then not write any
output) -
```
      Current: 396ms +- 4ms
       BLAKE3: 394ms +- 4ms
BLAKE3 cached: 419ms +- 3ms
   No-op hash: 382ms +- 5ms
```

The no-op hash is a little misleading in the full re-index since it will
be writing out fewer records. But the existing index case is
interesting, showing that hashing is only a small part of the entire
build and index.

Also worth noting that there was some fairly significant run-to-run
variance of around 30ms, but the above was a generally typical pattern
(ie. current about the same as BLAKE3, which is faster than BLAKE3
cached, and no-op is the fastest). The main take away is that this isn't
a noticable performance regression.
@bnbarham
Copy link
Author

@swift-ci please test llvm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants