Fast hash/equality for Model objects #129

Open
jakemac53 opened this issue Oct 31, 2024 · 15 comments

@jakemac53
Contributor

As part of the performance work, I have been looking into generating hash functions for Model objects (see my WIP branch). It isn't too bad to make a basic implementation, but it is very slow, taking almost an entire second, cumulatively, computing hashes for each edit in the large JSON benchmark.

My first approach here is to generate "identityHash" functions which do lookups on the node object for each known property, recursively calling "identityHash" on all the nested objects. For example, Interface.identityHash looks like this:

  /// Hash code for comparing instances of this extension type.
  int get identityHash => Object.hash(
        Object.hashAll((node['members'] as Map?)
                ?.cast<String, Member>()
                .entries
                .map((entry) =>
                    Object.hash(entry.key, entry.value.identityHash)) ??
            const []),
        (node['thisType'] as NamedTypeDesc?)?.identityHash ?? 0,
        Object.hashAll((node['metadataAnnotations'] as List?)
                ?.cast<MetadataAnnotation>()
                .map((entry) => entry.identityHash) ??
            const []),
        (node['properties'] as Properties?)?.identityHash ?? 0,
      );

Ultimately the result of this is that even cached macro phases take an unacceptable amount of time (multiple milliseconds), so we will need to come up with something faster and evaluate exactly what is making this so slow.

@jakemac53
Contributor Author

jakemac53 commented Oct 31, 2024

Some ideas:

  • It is probably faster to iterate the entries, although we would have to switch on the keys to cast things to the correct extension types (see the sketch after this list).
  • We could also potentially try just using the deep hashing functionality from package:collection and work at the raw map level.
  • We could potentially build up a hash in the JsonBufferBuilder as we build objects, but unless we also serialize that hash on the wire it would require reading the entire payload whenever we first receive a buffer. We also only want the hash of the Model field, not the entire buffer, which complicates things.
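
To illustrate the first idea, here is a minimal sketch (not the actual generated code) of an entry-iterating hash for Interface, reusing the extension types from the generated identityHash above; unknown keys just fall back to the raw value's hashCode, and null handling is elided for brevity:

// Sketch only: walk the raw entries once and switch on the known keys,
// instead of doing a separate map lookup per property. Member, NamedTypeDesc,
// MetadataAnnotation and Properties are the extension types used by the
// generated identityHash above.
int interfaceIdentityHash(Map<String, Object?> node) {
  var hash = 0;
  for (final MapEntry(:key, :value) in node.entries) {
    final valueHash = switch (key) {
      'members' => Object.hashAll([
          for (final member in (value as Map).cast<String, Member>().entries)
            Object.hash(member.key, member.value.identityHash),
        ]),
      'thisType' => (value as NamedTypeDesc).identityHash,
      'metadataAnnotations' => Object.hashAll([
          for (final annotation in (value as List).cast<MetadataAnnotation>())
            annotation.identityHash,
        ]),
      'properties' => (value as Properties).identityHash,
      _ => value.hashCode,
    };
    hash = Object.hash(hash, key, valueHash);
  }
  return hash;
}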

@davidmorgan
Contributor

Hmmmm when the query finishes on the host, the Model should be the only thing in the buffer, e.g. after the initial query here:

model: await _queryTarget(target.qualifiedName))));

the host could then hash the whole buffer and send the result. It'd be nice to have it as a field on Model; we can only do that today by having a Map field, since those can be added to after creation.

That's assuming that hashing the buffer as raw bytes turns out to be fast; I hope so :)

Hashing raw bytes will only do what we want if the bytes are sufficiently deterministic; I don't know if that will be the case, but it's a lot easier than hashing objects, so I guess it's worth a try :)
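
A minimal sketch of what that could look like on the host, assuming the finished buffer's serialized form is available as a Uint8List (the accessor and function names here are hypothetical, not the real JsonBufferBuilder API):

import 'dart:typed_data';

import 'package:crypto/crypto.dart' show Digest, sha256;

// Sketch only: hash the finished buffer's raw bytes once on the host, after
// the query result has been written, and send the digest along with the
// model. This only works if the serialized bytes are deterministic.
Digest hashModelBuffer(Uint8List bufferBytes) => sha256.convert(bufferBytes);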

@jakemac53
Contributor Author

  • We could also potentially try just using the deep hashing functionality from package:collection and work at the raw map level.

I did give this a shot and it was much worse, fwiw (about 3x worse).

@jakemac53
Contributor Author

  • It is probably faster to iterate the entries, although we would have to switch on the keys to cast things to the correct extension types.

I tried this as well and it seems a bit faster but not enough to meaningfully change the result.

@jakemac53
Contributor Author

jakemac53 commented Oct 31, 2024

It looks like just hacking something in to grab the raw bytes and hash those is definitely faster: 150ms or so per edit for the large JSON example, cumulatively spent doing hash computations. I also edited my branch to save the hash instead of the Model objects for the cached responses, and compare against that, so we do about half as many total hashes. I haven't played around with the different hash algorithms much to see which is fastest. I tried sha1, md5, and sha256; sha1 and md5 were close, but sha256 was about twice as slow.

Still a lot of time to spend hashing, and I don't love that it is dependent on the specific order in which the buffer is built up (technically, my other approach was too, but it would have been relatively trivial to make it order-independent).

Given that this now only hashes a given Model object once, I don't believe it is worth trying to compute the hash on the fly and store it in the buffer.
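
For reference, a minimal sketch of the caching scheme described above, with hypothetical names (the real branch stores the hash with the cached responses, keyed however the macro applications are tracked):

import 'dart:typed_data';

import 'package:crypto/crypto.dart' show Digest, sha1;

// Sketch only: keep just the digest of the last query result per macro
// application, instead of the Model itself, and reuse the cached macro
// output when the digest is unchanged.
final _lastResultDigest = <String, Digest>{};

bool canReuseCachedOutput(String applicationId, Uint8List queryResultBytes) {
  final digest = sha1.convert(queryResultBytes);
  final previous = _lastResultDigest[applicationId];
  _lastResultDigest[applicationId] = digest;
  return previous == digest;
}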

@davidmorgan
Contributor

I read a bit about what the google3 build does; there is a public doc about a specialized hash, PSHA2. It's based on SHA256 with tweaks to add parallelization so it can run using SIMD instructions, i.e. it parallelizes within a single CPU. The parallel part of psha256 kicks in at message sizes of 1024 bytes, so it looks like it won't be too hard to hit it.

Comparing the command line versions of md5sum, sha256sum, sha1sum and psha2sum on my machine using 1GB of input: md5 hits 0.54GB/s, sha256 is 1.08GB/s, sha1 is 1.23GB/s and psha256 is 1.56GB/s.

Using the package:crypto example script, AOT-compiled, the numbers are: md5 0.17GB/s, sha1 0.10GB/s, sha256 0.08GB/s.

Based on these numbers, if you were using the package:crypto implementations (the fastest of which is md5 at 0.17GB/s), it looks like there would be about a 9x speedup with native-performance PSHA2, 7x with native-performance SHA1, and 3x with native-performance md5sum.

I wonder how much of that native-performance boost we could get with ffi? But actually these are so important, I could see an argument for directly supporting them in the platform, e.g. put them in dart:io.
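
For anyone who wants to reproduce the pure-Dart numbers, a rough throughput micro-benchmark along these lines should do (a sketch, not the exact script used above):

import 'dart:convert' show ChunkedConversionSink;
import 'dart:typed_data';

import 'package:crypto/crypto.dart' show Digest, Hash, md5, sha1, sha256;

// Sketch only: push 1 GiB of zero bytes through each pure-Dart hash in 1 MiB
// chunks and print throughput, for comparison with the native command-line
// tools mentioned above.
void main() {
  const chunkSize = 1024 * 1024; // 1 MiB
  const chunkCount = 1024; // 1 GiB total
  final chunk = Uint8List(chunkSize);
  final algorithms = <String, Hash>{'md5': md5, 'sha1': sha1, 'sha256': sha256};
  algorithms.forEach((name, algorithm) {
    final stopwatch = Stopwatch()..start();
    final input = algorithm
        .startChunkedConversion(ChunkedConversionSink<Digest>.withCallback((_) {}));
    for (var i = 0; i < chunkCount; i++) {
      input.add(chunk);
    }
    input.close();
    final seconds = stopwatch.elapsedMicroseconds / 1e6;
    print('$name: ${(chunkCount * chunkSize / (1 << 30) / seconds).toStringAsFixed(2)} GB/s');
  });
}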

@davidmorgan
Contributor

Some bazel discussion (bazelbuild/bazel#22011) mentions blake3, which looks even faster: b3sum hits 7.87GB/s, mostly because it also parallelizes across multiple cores. With 1 thread (b3sum --num-threads 1) it's 1.83GB/s, just a bit faster than PSHA2, and it gets considerably faster as I add 1-3 more threads.

@mraleph
Member

mraleph commented Nov 1, 2024

Do you really need to compute a hash to compare things? Can't you just (for example) do a simple byte-by-byte comparison of the incoming data? It seems you are just using it as a proxy for equality anyway.
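
(For context, a byte-by-byte comparison would look roughly like this sketch, assuming the previous and incoming query results are both kept around as byte buffers; keeping the previous bytes around is the cost that hashing avoids.)

import 'dart:typed_data';

// Sketch only: direct equality check of the previous and incoming query
// result bytes, as an alternative to comparing digests.
bool sameBytes(Uint8List previous, Uint8List incoming) {
  if (previous.length != incoming.length) return false;
  for (var i = 0; i < previous.length; i++) {
    if (previous[i] != incoming[i]) return false;
  }
  return true;
}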

That being said: I'm puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to shortcut the what-has-changed computation. The CFE is supposed to have fine-grained incremental recompilation of dependencies; the analyzer does not, but should eventually implement it anyway. It seems like we are reinventing the same calculation with macro-specific twists.

@davidmorgan
Contributor

Do you really need to compute a hash to compare things? Can't you just (for example) do a simple byte-by-byte comparison of the incoming data? It seems you are just using it as a proxy for equality anyway.

If a match means there is no more work to do, then it's nice to get a match based on hashes, because then you can get a match without having to store all possible matches as full data.

That being said: I'm puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to shortcut the what-has-changed computation. The CFE is supposed to have fine-grained incremental recompilation of dependencies; the analyzer does not, but should eventually implement it anyway. It seems like we are reinventing the same calculation with macro-specific twists.

Neither the analyzer nor the CFE has a data model that's immediately useful to macros: their models are private to the analyzer and CFE, which means you can't code against them, and they change. So macros have their own data model that is public and stable (the JSON representation and corresponding binary format and extension types).

Macros describe the data they need as a query; the host (analyzer or CFE) converts its own data model to the macro model and sends it in response.

Macros usually only care about a part of the code, for example fields in classes with a particular annotation and their types, so what each macro receives is significantly cut down from the full host model. This also means that it should be very common that when a file changes the macro does not have to rerun: something changed but it wasn't what the macro cares about.

This investigation is about noticing that the data being sent to a macro is the same as last time, and so the output from last time can be reused. "The same as last time" is easy to check by keeping a hash from last time and comparing.

It's true that we could perhaps optimize further by pushing some part of the "same as last time" check before the conversion to the macro data model; for example, the CFE could compare what changed against the macro query before it even starts the conversion. But this would be a lot more work to do, and it's possible that convert-then-compare gets us most of the performance, so we obviously check that first.

@jakemac53
Contributor Author

That being said: I'm puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to shortcut the what-has-changed computation.

  • We have to implement it twice that way, which is extra work as well as a larger potential for bugs. The question of "what does this macro depend on" is fairly challenging to answer, whereas the query-result hash model is quite trivial and reliable.
  • It is probably a lot more expensive, especially in terms of memory, to maintain back-links across the program to the macro applications which depend on those nodes. The current solution we are exploring just needs to store a single hash per query that the macro runs. We absolutely do trade CPU cost for that, since we have to re-run the queries and hash the results; we are exploring the actual cost of that now.
  • The hashing approach could also be more compatible with re-using macro results in a modular compiler, although we probably won't actually do this.

@jakemac53
Contributor Author

I hooked up the CPU profiler, from head but with #134 applied.

Here are some noteworthy things, from a profile spanning a single incremental edit:

[profiler screenshot]

  • about 27% of the total time is still spent computing hashes. We definitely need to improve that.

[profiler screenshot]

  • I don't know exactly what all of this is doing, but it seems pretty expensive, and if we were more deeply integrated we could possibly avoid re-doing this work when it is cached (cc @scheglov).

[profiler screenshots]

  • We are spending >10% of the time just re-running the queries to see what has changed. It looks like almost half of that time is spent just creating typed maps (almost all the objects created here are a result of queries), which doesn't even account for all the work spent building up these objects.
  • This indicates we need to optimize the creation of these objects.

[profiler screenshot]

  • This isn't too unexpected; we do expect a lot of the time to go to parsing, since every library file gets re-parsed, and all augmentation code is parsed twice, I believe.

@jakemac53
Contributor Author

Fwiw, this is my launch_config.json, which assumes you have already generated a benchmark to run (dart tool/benchmark_generator/bin/main.dart large macro 64 for example):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "benchmark_debug",
            "request": "launch",
            "type": "dart",
            "program": "pkgs/_macro_tool/bin/main.dart",
            "args": [
                "--workspace=goldens/foo",
                "--packageConfig=.dart_tool/package_config.json",
                "--script=goldens/foo/lib/generated/large/a0.dart",
                "--host=analyzer",
                "--watch"
            ],
            "cwd": "${workspaceFolder}"
        }
    ]
}

@scheglov

scheglov commented Nov 6, 2024

One of the previously recorded performance operation trees in dart-lang/sdk#55784 (comment) provides details about what we do in Linker._mergeMacroAugmentations:

  1. Merge the separate macro results into a single piece of code.
  2. Parse.
  3. Build unlinked data. Mostly hashing.

We see similar data internally.

@scheglov

scheglov commented Nov 6, 2024

See also my previous benchmarks for hashing, Dart vs. Rust.
Or, maybe more fairly, different algorithms, some of which can be optimized for modern CPUs.

@jakemac53
Contributor Author

See also my previous benchmarks for hashing, Dart vs. Rust.

Fwiw, the actual hashing is not the problem in this particular case; it is the work to pull out the interesting bits of the objects that we want to hash that is expensive.

In my PR I am just using Object.hash and Object.hashAll, at least initially, and combined they consume only about 300ms, which isn't nothing, to be sure, but it's not the primary issue by any means.

[profiler screenshot]
