
Track macro dependencies at the library level for improved invalidation #55784

Open
jakemac53 opened this issue May 20, 2024 · 54 comments
Labels: area-analyzer, feature-macros, P2, type-enhancement

@jakemac53 (Contributor) commented May 20, 2024

The Problem

Today, whenever a library changes (its API, at least), all macros in all libraries that depend on that library (transitively) get re-run.

In pathological cases, this can result in incremental analysis times that are O(# affected libraries * average time to run macros per library). We can demonstrate today that if the number of affected macros reaches into the hundreds, this is noticeably slow (multiple seconds).

The need is most pressing here for the analyzer, since it will have to deal with this situation any time a character is typed in any file. Eventually, the CFE will also want to implement this for the incremental compiler, to improve hot reload performance.

Solution

We should track the dependencies of macro applications and only re-execute them if those dependencies change.

To start, "dependency" here should just be tracked at the library granularity. This should be sufficient to get what we want, without adding a huge amount of memory overhead. If desired we can later investigate more fine grained dependency tracking.

Implementation Details

For most API requests this should be relatively easy to implement - all Declaration objects sent to macros already have a Library associated with them. We should track all of those libraries in a set (potentially, this could even be done in shared code). Once a macro is done, that set of libraries should be saved in such a way that it can be linked to the specific macro application, via an expando or map, etc.
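
A minimal sketch of that bookkeeping, in Dart (all names here are hypothetical, not real analyzer or macro APIs):

// Hypothetical sketch of library-level dependency tracking for one
// macro application; none of these names are real analyzer/macro APIs.
class MacroDependencyRecorder {
  // Every library whose declarations this macro application has seen.
  final Set<Uri> libraries = {};

  // Call this whenever a Declaration is handed to the macro.
  void recordDeclaration(Uri declaringLibrary) {
    libraries.add(declaringLibrary);
  }
}

// One recorder per macro application, e.g. keyed by an application
// object via a map (or an Expando).
final recorders = <Object, MacroDependencyRecorder>{};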

Whenever a library is invalidated by the API signature of a dependency changing, it should re-use the macro results it got previously, except for the results that depended on the library which caused the invalidation. Macros in the current library will always re-run, because they are always provided some declaration from the current library, and thus always depend on it.

The end goal is that in most cases only the macros in the file that was edited are re-run, or maybe a handful of others, but not all macros that might have been able to introspect on the changed library.

Harder cases

StaticType

Macros can ask for a StaticType, which encapsulates knowledge about the entire type hierarchy. We will need to record the libraries for all the types in that hierarchy when returning static types.

If we want the library tracking to be handled in shared code, we would need to add an Iterable<Library> field to the StaticTypeImpl so that we know which libraries a static type depends on.
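
As a sketch of that shape (hypothetical types; Library stands in for the macro API type, and the real StaticTypeImpl surely differs):

class Library {
  final Uri uri;
  Library(this.uri);
}

class StaticTypeImpl {
  // Every library declaring a type in this type's hierarchy, so that
  // shared code can fold them into an application's dependency set.
  final Iterable<Library> libraries;
  StaticTypeImpl(this.libraries);
}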

Constant evaluation

If/when we enable constant evaluation, we will also need to track the library dependencies of that, for any const identifiers that were read.

Identifiers in Code objects

Identifiers themselves could also change which library they come from, which would affect the imports in the generated macro augmentations.

For all calls to resolveDeclaration and resolveIdentifier in the augmentation library code, we should track the Uris of the resolved identifiers as well, and add those to the set of dependencies.

OmittedTypeAnnotation

When types are omitted, they might be inferred. We handle that in the macro APIs through the special OmittedTypeAnnotation object, which we later allow you to infer through a separate call.

When you do that inference, we need to add the proper dependencies for which libraries that inference depended on, which could get complicated.

New types being introduced in immediate imports

If a new type is added to the import scope, it could change how existing type annotations resolve, without a change to the libraries those types came from.

This would usually result in an error that has to be resolved anyway, but for core types it would not, and there may be other situations like that, so we need to think carefully about it.

We could say that everything in the import scope is an implicit dependency, as an example (so all immediate imports and their transitive exports).
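
As a sketch (hypothetical types, not analyzer API), that implicit dependency set could be computed as the immediate imports plus everything they transitively export:

class LibraryInfo {
  final Uri uri;
  final List<LibraryInfo> imports;
  final List<LibraryInfo> exports;
  LibraryInfo(this.uri, {this.imports = const [], this.exports = const []});
}

// Immediate imports, plus the transitive closure of their exports.
Set<Uri> implicitDependencies(LibraryInfo library) {
  final result = <Uri>{};
  void addWithExports(LibraryInfo lib) {
    if (!result.add(lib.uri)) return;
    lib.exports.forEach(addWithExports);
  }
  library.imports.forEach(addWithExports);
  return result;
}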

@jakemac53 added the area-analyzer and feature-macros labels May 20, 2024
@jakemac53 (Contributor, Author)

cc @davidmorgan @scheglov

@scheglov (Contributor)

There are a few complications, at least in the analyzer.

  1. When we ask
      var methods = await builder.methodsOf(superclassDeclaration);
      methods[0].returnType;

...the method could have its returnType inferred from a superclass declared in another library. We could probably say that if a method has some omitted types, it depends on the libraries that declare the overridden methods.

  2. When we resolve an identifier, e.g. await builder.typeDeclarationOf(superclass.identifier), we depend not only on the library where this identifier was found. We also depend on the fact that it was not found in other imported libraries. Otherwise it would be a conflict. Not sure yet how to solve this.

  3. In the analyzer we store not separate library summaries, but whole library cycles. And we store these summaries into the file system cache using the API signature of all dependencies as the key. It looks like we would need to store macro application results separately, using a more relaxed external key, e.g. the combination of just the URIs of the library cycle (as opposed to API signatures); internally we would have a map from a per-library key to all macro results in the library (practically the test.macro.dart code, and maybe diagnostics - more complicated).

  4. To make it explicit: in the analyzer we will still relink the whole library cycle, using all dependencies. Making this portion of the analyzer incremental is a separate interesting project that might take non-trivial time. We will only try to reuse macro application results when we think it is safe to do so.

@jakemac53 (Contributor, Author)

> ...the method could have its returnType inferred from a superclass declared in another library. We could probably say that if a method has some omitted types, it depends on the libraries that declare the overridden methods.

This should be an OmittedTypeAnnotation - we do have an API to infer those types but it is explicit, and we could add in the dependency on the root library at that point? This can be added to the harder cases list :).

> 2. We also depend on the fact that it was not found in other imported libraries. Otherwise it would be a conflict. Not sure yet how to solve this.

Interesting point, this could be a challenge, but it would result in an error in the user code anyway, right? So maybe we don't invalidate the macro, but the user does have to do something to resolve it either way. So it might be OK.

> 3. In the analyzer we store not separate library summaries, but whole library cycles.

It does sound like you would need a separate cache for macro execution results (which happen to be serializable already).

To simplify things, you could even group the dependencies of all the macros that run on a single file into one cached result. Then you either re-use the entire macro augmentation file or you don't, and the contents of the cache could be the literal resulting augmentation file, which would also avoid doing the merging and optimizing/formatting.

The downside being if any macro in the file has to re-run, you have to re-run all of them. But, I think this would still have a huge positive effect overall and quite possibly simplify things further. It might even be faster on the whole, if you can cache the entire resulting merged and optimized augmentation file.
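
One possible shape for such a per-file cache entry, as a sketch (names and structure are hypothetical, not the analyzer's actual cache):

// Hypothetical cache entry: one record per library's merged macro output.
class CachedMacroAugmentation {
  // API signatures of the libraries this output depended on.
  final Map<Uri, String> dependencySignatures;

  // The literal merged augmentation file, reusable verbatim when no
  // dependency signature changed.
  final String augmentationSource;

  CachedMacroAugmentation(this.dependencySignatures, this.augmentationSource);

  bool isStillValid(String Function(Uri) currentSignatureOf) =>
      dependencySignatures.entries
          .every((e) => currentSignatureOf(e.key) == e.value);
}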

> 4. We will only try to reuse macro application results when we think it is safe to do so.

Yes, more incremental invalidation in general is a task for another time. I just want to get to a state where macros don't severely regress performance for incremental edits to a single file, when lots of libraries with macros depend on that file.

@jakemac53 (Contributor, Author)

To clarify I also want this optimization to work for large library cycles - we should still have a good invalidation story that is more granular than library cycles, ideally. There is a concern that lots of applications are just one big library cycle today.

@jakemac53 (Contributor, Author)

I updated the section above to capture the OmittedTypeAnnotation case as well as the new types being introduced case.

@srawlins added the type-enhancement label May 24, 2024
@scheglov self-assigned this May 24, 2024
@scheglov (Contributor)

Nothing super interesting yet, but some base to build upon.
https://dart-review.googlesource.com/c/sdk/+/368160

For now my thinking is that I need to get back from the macro framework the list of Declarations and Identifiers that the macro saw, maybe as part of MacroExecutionResult. There are some caches outside of the analyzer, so the analyzer cannot know for sure that, if some Declaration was not accessed through the API, the macro actually did not access it. Correct me if I'm wrong about caching, when we start with an empty cache, etc.

@jakemac53 (Contributor, Author)

> Correct me if I'm wrong about caching, when we start with an empty cache, etc.

The caching today is always per macro application (per phase also). So it should be safe to do it on your end. Although the macro executor code could also do it, which might help with the eventual CFE implementation, if this is already done.

@scheglov (Contributor) commented May 26, 2024

I think the implementation is not as easy as described. Or at least there is complexity behind "except for the results that depended on the library which caused the invalidation". Or I don't understand the proposed solution, and think that there is implied simplicity where simplicity is not implied; or the opposite - I see complexity and don't understand that the solution is actually simple.

I'm looking from the perspective of loading libraries from the file system cache, after a restart, with zero or more files possibly changed since the file system cache was filled. Not live updates of currently loaded libraries after a single file change; that does not fit well with the current analyzer architecture.

Let's consider two APIs.

We are running macros in the library L1.

  1. Future<macro.Identifier> resolveIdentifier(Uri library, String name). It either returns an identifier that is exported from library, or throws an exception. Let's say that the library with Uri library is L2.

There are 3 variants for library.

First, the library can be a library outside of the library cycles being linked. In this case it is easy to record a dependency - check if the export scope of library has the element named name with the same URI as during the macro application. This will probably cover references to dart:core, package:flutter, etc. When we load cached macro results, we can access the LibraryElements of these outside libraries and check that they still export (or not - an exception is also a result) the same elements (practically just the URI of the declaring library for the requested elements).

Second, L2 is L1 - the library with the macro application. Most often there is no dependency at all. We already use macro results only if the library code is not changed. The identifier is either there already, or is added by other macros by the time we do this request (do we support this use case?) - and for those macros we either reuse the whole library's macro results or nothing, so these results will be the same anyway. Complication: exports (the requested element could be exported from a different library, e.g. we moved the class and export both libraries), but this is probably rare, so we can ignore them initially and rerun macros if there are exports.

Third, L2 is a library in the same library cycle that we are linking. Then the logic is mostly the same as in (second). But now we require that the results of running macros in L2 be reused (so the macros produced the same results, so the requested element is the same). So, we reuse cached L1 only if we can reuse L2. Complication: exports, again.

  2. Future<macro.TypeDeclaration> typeDeclarationOf(macro.Identifier identifier), a more complicated API.

How do we get the identifier? There are several possibilities, but for now let's consider that it is a superclass, in c1.dart.

import 'a.dart';
import 'c2.dart';
import 'c3.dart';
@Macro1()
class X extends A {}

If A is already declared in c1.dart, there is again no dependency.

Otherwise, we will use A exported from a.dart (an external library), c2.dart, or c3.dart. We know the exact Element at the time of the macro execution.

Here, the fact that a.dart is an external library does not help us when we are checking if macro results can be reused. Yes, we can know which A it exports (or not). But this does not mean that we will use this A, because another A could be exported from c2.dart. Yes, this is an error, I read this. But I would not like to ignore this; a change is a change, and it would be better to be precise. So, we need both c2.dart and c3.dart to export (or not) the same A. And even if they don't explicitly declare A, any macro could generate one. So, we can reuse macro results of c1.dart only if we can reuse macro results of c2.dart and c3.dart.

And this does not even necessarily require a compile-time error. I can imagine:

// c2.dart
import 'y.dart';
@Macro2() // generate A if Y extends Z
class C2 extends Y {}
// c3.dart
import 'y.dart';
@Macro3() // generate A if Y does not extend Z
class C3 extends Y {}

So that changes to Y in y.dart can result in generating A either in c2.dart or in c3.dart.

So, it looks to me that almost always we can reuse either all macro results in the library cycle, or none.

We will still mostly reuse macro results, and a change to a deep dependency will almost never invalidate every transitive consumer. But if we run macros, we will run them for the whole library cycle.

Obviously, let me know if there are holes in my logic.

@jakemac53 (Contributor, Author)

> But this does not mean that we will use this A, because another A could be exported from c2.dart. Yes, this is an error, I read this. But I would not like to ignore this; a change is a change, and it would be better to be precise. So, we need both c2.dart and c3.dart to export (or not) the same A. And even if they don't explicitly declare A, any macro could generate one. So, we can reuse macro results of c1.dart only if we can reuse macro results of c2.dart and c3.dart.

Importantly, this does not add a dependency on all the files in the library cycle, only the immediate imports.

Yes, it would add more dependencies, but as long as we are considering the final library API from those dependencies (after macro execution), we should usually still get incremental invalidation even within library cycles, unless one of the immediate dependencies' macro outputs was affected by the change.

This does somewhat break down if using "convenience libraries" which export tons of other libraries, etc. But it should scale based on the immediate imports + their exports still.

> So that changes to Y in y.dart can result in generating A either in c2.dart or in c3.dart.

The dependency on the library defining Y only needs to exist if the macro actually looks at the class declaration for Y, or gets a StaticType for Y.

> We will still mostly reuse macro results, and a change to a deep dependency will almost never invalidate every transitive consumer. But if we run macros, we will run them for the whole library cycle.

What specifically makes the library cycles different here? I am not really understanding what concretely makes them different from other import structures. In other words, let's say A depends on B which depends on C. If we also add a dependency from C to A, that shouldn't affect the invalidation semantics when C is changed.

@scheglov (Contributor) commented May 28, 2024

> > But this does not mean that we will use this A, because another A could be exported from c2.dart. Yes, this is an error, I read this. But I would not like to ignore this; a change is a change, and it would be better to be precise. So, we need both c2.dart and c3.dart to export (or not) the same A. And even if they don't explicitly declare A, any macro could generate one. So, we can reuse macro results of c1.dart only if we can reuse macro results of c2.dart and c3.dart.

> Importantly, this does not add a dependency on all the files in the library cycle, only the immediate imports.

This is correct; strictly speaking we depend only on a.dart, c2.dart, and c3.dart, not every library in the cycle.

The implementation would be simpler if we redo macro results for the whole library cycle.

But I guess the main question here is whether we can afford it, or whether we think that there are large library cycles with many libraries that have macros applied. Although in this case, because we depend on immediate imports, we will still with high probability redo all macros, because there is a high probability that one of the immediate imports does use macros. So, if there are a few libraries with macros in the cycle, it does not matter much if we redo them all; and if there are many such libraries, we will redo them all anyway.

> Yes, it would add more dependencies, but as long as we are considering the final library API from those dependencies (after macro execution), we should usually still get incremental invalidation even within library cycles, unless one of the immediate dependencies' macro outputs was affected by the change.

I'm a bit confused by "considering the final library API from those dependencies". When dependencies of a library are inside the same library cycle, at the moment when we are about to start building the elements, there is no final API yet. We can know top-level declarations from user code, but we don't know what macros will generate this time.

> This does somewhat break down if using "convenience libraries" which export tons of other libraries, etc. But it should scale based on the immediate imports + their exports still.

Yes, with library-level dependencies, exports are an additional complication.
There are no such issues with cycle-level dependencies.

> > So that changes to Y in y.dart can result in generating A either in c2.dart or in c3.dart.

> The dependency on the library defining Y only needs to exist if the macro actually looks at the class declaration for Y, or gets a StaticType for Y.

Yes, I supposed that macros in c2.dart and c3.dart do look at the declaration of Y, and at the identifier of its superclass.

But my point was that external changes might cause changes to macro outputs, and so c1.dart depends on c2.dart and c3.dart.

> > We will still mostly reuse macro results, and a change to a deep dependency will almost never invalidate every transitive consumer. But if we run macros, we will run them for the whole library cycle.

> What specifically makes the library cycles different here? I am not really understanding what concretely makes them different from other import structures. In other words, let's say A depends on B which depends on C. If we also add a dependency from C to A, that shouldn't affect the invalidation semantics when C is changed.

Well, in this example, it is true that if A applies macros, it depends only on B.
But if C does not have macros, it does not matter if we redo macros for the whole cycle; C has nothing to do.
And if C does have macros, then we will redo them if we redo them for A, the same as doing them for the whole cycle.

We can do per-library dependencies; I just wondered if this will make a significant difference in performance, while adding complication to the implementation.

@jakemac53 (Contributor, Author)

There is an assumption that large library cycles do exist in practice, and that we need to support them well. This is not an assumption based on evidence as far as I know, but more just based on an educated guess. It is at least plausible to assume these large library cycles do exist, given the fact that we do not tell users to split up their applications into several packages, and so it would be very easy to create large library cycles in apps without a lot of discipline in sub-package library structure.

> I'm a bit confused by "considering the final library API from those dependencies". When dependencies of a library are inside the same library cycle, at the moment when we are about to start building the elements, there is no final API yet. We can know top-level declarations from user code, but we don't know what macros will generate this time.

Ah, this is the piece that I was missing. I think we actually require for library cycles, that you run each phase across all libraries in the cycle, before continuing to the next phase in any of them. They are treated essentially as if they are one large library, from this perspective.

Because of that, you cannot run all the macros on just one library in the cycle (which was invalidated), in order to see if other things should be invalidated based on that result changing.

Is that correct?

@scheglov (Contributor)

> There is an assumption that large library cycles do exist in practice, and that we need to support them well. This is not an assumption based on evidence as far as I know, but more just based on an educated guess. It is at least plausible to assume these large library cycles do exist, given the fact that we do not tell users to split up their applications into several packages, and so it would be very easy to create large library cycles in apps without a lot of discipline in sub-package library structure.

Yes, I can see that it might happen.
I also have not seen data that supports or disproves the existence of large library cycles.

> > I'm a bit confused by "considering the final library API from those dependencies". When dependencies of a library are inside the same library cycle, at the moment when we are about to start building the elements, there is no final API yet. We can know top-level declarations from user code, but we don't know what macros will generate this time.
>
> Ah, this is the piece that I was missing. I think we actually require for library cycles, that you run each phase across all libraries in the cycle, before continuing to the next phase in any of them. They are treated essentially as if they are one large library, from this perspective.

Correct, we run each phase across all libraries in the cycle, before continuing to the next phase in any of them.

> Because of that, you cannot run all the macros on just one library in the cycle (which was invalidated), in order to see if other things should be invalidated based on that result changing.
>
> Is that correct?

Yes.

@jakemac53 (Contributor, Author)

Given that, it does sound like a more complicated problem than I anticipated.

@davidmorgan has some related ideas to share at the sync up today, so we can try to evaluate whether it solves this problem any better or not.

@davidmorgan (Contributor)

Thanks Konstantin, thanks Jake.

Per our chat yesterday: I think we should reframe this discussion in terms of data, which is simpler than dependencies. So the question is: what data can the macro host provide; what data does the macro need; how does it change as we run through the phases; when and how can input to one macro change due to output of another macro.

Right now I don't understand enough about the execution model inside the analyzer or the CFE (@johnniwinther), so I am surely saying things that don't make sense :) ... one thing I'd like to get to is an "implementation doc" that describes what a macro host does in sufficient detail that it covers what is happening both in the analyzer and the CFE, including any performance-relevant differences. For example, if the analyzer and CFE necessarily take different approaches with different performance characteristics, then that might need considering for the design as a whole.

I realize that I have been unhelpfully working in docs instead of on GitHub, let me try to pivot to using GitHub more :)

Suggestion, maybe we start a series of "implementation docs" living next to the feature spec and expanding on what is specified with implementation detail/design, would that be a good way to make progress?

https://github.com/dart-lang/language/blob/main/working/macros/feature-specification.md

I can go ahead and create some wrong outlines to be filled in with the actual details if that's a good way to start ;)

Or, any other approach someone would like to suggest.

Thanks.

@scheglov (Contributor)

I shared a small document with you.

Discussing dependencies is the same as discussing data. You get a dependency because you asked for a piece of data. Anything that could affect the provided answer, e.g. a ClassDeclaration, or the generated resulting code, is now your dependency. The protocol, and how much to answer at once, does not matter IMHO. This is just optimization.

So, we still need to decide what to do with A if it could be macro generated either in c2.dart or c3.dart.

@davidmorgan (Contributor)

Thanks @scheglov, much appreciated :) I sent a PR adding your notes next to the macros spec dart-lang/language#3852 so we can refer to it + evolve into a more general doc including CFE. @johnniwinther a description like this for what the CFE does would be great please, so we can start to understand the differences :)

Re: data or deps, you are right that the difference does not matter for correctness. However the optimization from thinking about data is likely to be very large :) ... it is in the nature of macros that they consume only a small fraction of the information from their dependencies, so almost all reruns triggered by changed dependencies are unnecessary.

As an extreme example, most macros only consume a small fraction of the library they are applied in: probably a single class, when the library might have tens or hundreds of classes. So the % of same-library changes that actually need a rerun is already probably <10% even before looking at imports.

Correctness questions are of course also important :)

I'm going to keep notes on investigation around data-oriented macros on language #3706, I have started to check in exploratory code there that we can use for experiments around performance and correctness.

@davidmorgan (Contributor)

https://github.com/dart-lang/language/tree/main/working/macros/dart_model now has a runnable example comparing the current macro implementation vs the data model approach on a huge library cycle.

The trivial macros in the example only want the list of field names, so the data model approach can easily know that the list of field names did not change, and only rerun for one file.

There is a pretty big gap in complexity between this and something like the JSON macro, I will see next how far I can push it and what useful results I can get along the way :)

@scheglov (Contributor) commented May 30, 2024

> it is in the nature of macros that they consume only a small fraction of the information from their dependencies

Maybe you think about dependencies as whole libraries or imports.
I try to think about them as information that you saw and might have used.
For the same library, if you asked for nothing but were implicitly given the target class declaration, then your dependency is that class and its properties. Not even the methods declared in there, just the header.

For the purpose of fine-grained invalidation, the problem is not the dependencies, but the outputs of macros.
As above: where did A come from, which macro could have generated it?
Currently I think there is no way to know; macros don't declare what their output is.

@jakemac53 (Contributor, Author) commented May 30, 2024

I have put some more thought into what I briefly mentioned in the macro implementation sync up. In order to work around the library cycle issue, it seems to me like we need to cache the macro outputs from each phase.

In the situation mentioned above, with deps like A -> B -> C -> A, assuming macro applications in all 3 libraries, and an API change to C, we could do essentially the following:

  • Immediately invalidate B and all its macros (C could be shadowing things from B now).
  • Run all phase 1 macro applications on B.
  • Compare the result with the previous phase 1 result (ideally this would be a hash of the API signature of the declarations only, but it could be a hash of the entire augmentation file).
  • If the result is different, invalidate A and all its macros (identifiers could have been shadowed), and run phase 1 macros on A (this could in turn invalidate other libraries if the result changes from the previous result).
  • Run all phase 2 macro applications on B (and possibly A if it was previously invalidated).
  • Compare the result with the previous phase 2 result.
  • If the result is different, invalidate macro applications from phases 2 and 3 in A (it is also OK to re-run phase 1, but I don't believe it should be required). Run phase 2 macros in A (which could again invalidate other libraries).
  • Run phase 3 macros across all invalidated libraries. This shouldn't be able to cause further invalidation.

@scheglov does that make sense? Does it sound feasible?
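
A rough sketch of that cascading, phase-by-phase loop (LibraryNode, runMacroPhase, etc. are hypothetical placeholders, and a real implementation would need to be more careful about a library whose inputs change again within a phase):

class LibraryNode {
  final String uri;
  final List<LibraryNode> importers = []; // importers within the cycle
  final Map<int, int> cachedPhaseHash = {}; // phase -> hash of last output
  LibraryNode(this.uri);
}

// Placeholder for running all macro applications of one phase in a
// library and hashing the resulting augmentation code.
int runMacroPhase(LibraryNode lib, int phase) => 0;

void rerunInvalidated(Set<LibraryNode> invalidated) {
  for (var phase = 1; phase <= 3; phase++) {
    final queue = [...invalidated];
    final ran = <LibraryNode>{};
    while (queue.isNotEmpty) {
      final lib = queue.removeLast();
      if (!ran.add(lib)) continue; // run each library once per phase
      final hash = runMacroPhase(lib, phase);
      if (hash != lib.cachedPhaseHash[phase]) {
        lib.cachedPhaseHash[phase] = hash;
        // The output changed, so importers may now resolve identifiers
        // differently; invalidate them for this and all later phases.
        for (final importer in lib.importers) {
          if (invalidated.add(importer)) queue.add(importer);
        }
      }
    }
  }
}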

@scheglov (Contributor) commented May 30, 2024

We used to run the declarations phase for all libraries of the cycle at once, IIRC because we can introspect other types.
Is it safe now to run the declarations phase on individual libraries?

During the types phase, I think we cannot reference any local identifier, right?

/*macro*/ class MyMacro implements ClassTypesMacro {
  const MyMacro();

  @override
  FutureOr<void> buildTypesForClass(declaration, ClassTypeBuilder builder) {
    builder.declareType(
      'X',
      DeclarationCode.fromParts([
        'class X extends ',
        declaration.superclass!.code,
        ' {}',
      ]),
    );
  }
}
@MyMacro()
class B extends A {}

This macro throws an exception:

  Invalid argument(s): Unresolved identifier: A
  #0      DeclarationBuilder.resolveIdentifier (package:analyzer/src/summary2/macro_declarations.dart:308:9)
  #1      _Builder._buildIdentifier (package:_macros/src/executor/augmentation_library.dart:73:53)
  #2      _Builder._buildCode (package:_macros/src/executor/augmentation_library.dart:133:9)
  #3      _Builder._buildCode (package:_macros/src/executor/augmentation_library.dart:131:9)
  #4      _Builder.build (package:_macros/src/executor/augmentation_library.dart:156:9)
  #5      AugmentationLibraryBuilder.buildAugmentationLibrary (package:_macros/src/executor/augmentation_library.dart:23:10)

We cannot resolve A because we are still generating types, and cannot possibly know how it will be resolved.
See the document that I shared - types are resolved after the types phase.

Theoretically, we could do:

    var c = declaration.library.uri.resolve('c.dart');
    builder.declareType(
      'X',
      DeclarationCode.fromParts([
        'class X extends ',
        await builder.resolveIdentifier(c, 'C'),
        ' {}',
      ]),
    );

But I think it should not be allowed to resolve identifiers from the same library cycle, because there is no defined order in which the types phase macros are applied across libraries. Maybe C, then B; but maybe B and then C. And then we will not be able to resolve package:test/c.dart@C. Right?

So, it seems to me that the types phase in B cannot depend on the types phase in C.

@davidmorgan (Contributor)

For questions of correctness related to ordering, I think we should consider moving to a model with no ordering #3858 ... we can now compare using test cases, hopefully this can be a way to push to a convincing decision one way or the other.

Regarding performance, I am now 90% convinced that evolving the current execution model into an optimal or near-optimal one is not possible. Fine-grained requests and tracking dependencies dynamically based on them cannot be as fast in a serialization-based execution model as broad requests and data structures that can be diffed / recombined.

So I think it would be good to refocus effort on poking holes in the proposed new approach: if there are reasons it won't work, let's find them as quickly as possible; if it will work, let's switch to it. I am ready to add functionality/experiments/benchmarks as needed :) and welcome suggestions.

What convinced me is the benchmark in my exploratory code, documented in the README, in which I apply three trivial macros to a 67k LOC library cycle with 64 files and 640 macro applications.

In my exploratory code, the "macros" introspect by making three RPCs and receiving three responses. Not three responses per macro: three responses for the whole codebase. The data is then chopped up and passed to a generate method that is as nice to write as any other generate method:

  String generateFor(Interface clazz) {
    final result = StringBuffer();

    result.writeln('augment class ${clazz.name} {');
    result.writeln('  bool operator==(Object other) =>');
    result.writeln('      other is ${clazz.name}');
    for (final field in clazz.members.values.where((m) => m.isField)) {
      result.writeln('    && other.${field.name} == this.${field.name}');
    }
    result.writeln(';}');

    return result.toString();
  }

then the macros make another RPC each with the results, for a total of nine messages; this is a big example, so they are big results, about 3.5 MB of data:

dart bin/main.dart /tmp/dart_model_benchmark/dartModel/package_under_test
~~~ setup
Launching analyzer on: /tmp/dart_model_benchmark/dartModel/package_under_test
Listening on localhost:26199.
~~~ hosting
Incoming connection.
  EqualsMacro augments 64 uri(s), 1891591 char(s).
  HashCodeMacro augments 64 uri(s), 980681 char(s).
  ToStringMacro augments 64 uri(s), 726593 char(s)

so the current implementation's number of messages on startup is 640 * 3 * 2 * (# introspection calls) + 640 * 3 for the results - many thousands - when it's possible to cut this down to nine.

Then when you go and change one file, the current implementation has to rerun everything because it's a library cycle, while the exploratory implementation just reruns one application; six messages sent, three updates to macro input and three (much smaller this time) updates to macro output.

  EqualsMacro augments 1 uri(s), 29976 char(s).
  HashCodeMacro augments 1 uri(s), 15539 char(s).
  ToStringMacro augments 1 uri(s), 11560 char(s).
Write: /tmp/dart_model_benchmark/dartModel/package_under_test/lib/a0.a.dart

So again it's thousands of requests vs just a few, and the analyzer only has to do work on the output a0.a.dart instead of 64 sets of augmentations.

A complex macro could need hundreds of introspection calls in realistic cases, so there will be large codebases where the current implementation will need the best part of a million(!) messages passed, redescribing most of the package from scratch, and the exploratory implementation still needs six, describing only what changed.

As I said, this feels pretty conclusive to me--looking forward to hearing what others think. Thanks!

@jakemac53 (Contributor, Author)

I still don't actually fully understand what exactly is being proposed with the data model approach enough to be able to really provide good feedback at this time.

I think we need some much more specific details about what is actually being proposed to understand what the potential pitfalls might be?

@davidmorgan (Contributor)

@jakemac53 That's fair :) I'm short on time before the weekend starts here, let me summarize quickly and I can do further writeup as needed on Monday.

Summary added to the query-like API issue since it belongs better there :) feel free to throw any questions/observations there, too. Thanks!

@jakemac53 (Contributor, Author)

@scheglov This identifier resolution is only happening in the augmentation library creation phase right?

It seems likely that we will start implicitly including the "parent" library import scope into augmentation files (with the move to the "part" file unification). This might allow us to not do these resolveIdentifier calls at all, or simply provide a fallback where if it can't be resolved, we just use the raw name + original prefix with no new import. The "original prefix" might be a bit tricky, not sure.

We could even more generally rely on this kind of resolution for identifiers which came from the original library.

Would that help here? I think the specific issue is the import might change in the augmentation?

@scheglov (Contributor)

Ah, interesting.

import 'c2.dart';
import 'c3.dart';
@Macro1()
class X extends A {}

Well, when we talk about reusing macro generated augmentations, we want to make sure that the reused code is the same as we would create if we re-ran macros in the library. It can mean something different (e.g. when A moves from c2.dart to c3.dart). Remember that we relink element models, so the meaning does not matter. When we reuse the macro generated code, it goes through all the phases of resolution - build elements, resolve references, top-level types, etc. So, if the code that we reuse does not have a specific import like import 'c2.dart' as prefix1, so that we generate prefix1.A somewhere, then we can reuse this code. These are kind of "as-is" identifiers.

Yeah, we can still have a conflict; here the method foo shadows the import prefix.

import 'dart:math' as foo;

class A {
  final pi = foo.pi;
  
  void foo() {}
}

So, this probably will solve the issue with types.
Time to think about the declarations phase, and introspections there :-)

@scheglov (Contributor)

No, I don't think this is OK for UX.
We don't force users to put each class into a separate library.
We should not do this with macros.
If the user wants to separate classes, e.g. model and UI, they will put them into separate libraries.

copybara-service bot pushed a commit that referenced this issue Jun 22, 2024
This saves another 120 ms.

Bug: #55784
Change-Id: I4b195f977b7e49cc5b07dcecd4836b3f25399428
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/372820
Commit-Queue: Konstantin Shcheglov <[email protected]>
Reviewed-by: Brian Wilkerson <[email protected]>
@davidmorgan (Contributor)

I'm not quite sure I follow.

If there are two unrelated macros running on the same library, why is it clearer to merge the output into one file?

We see this today with built_value output, it has multiple generators: value types and serialization, and the output is merged to one file. Okay, it works. But usually when I go to the generated code I'm either looking for value type code or serializer code, not both, because they do completely different things. So as a user I would not be sad if the output is split to value type generated code and serializer generated code. It would be clearer.

Here is some example large output that is merged :)

Back to the question of performance, I'm not 100% clear on how these numbers relate to incremental performance. If we know a macro needs rerunning for just one library, which parts can we avoid rerunning for the whole library cycle? Which parts are forced to rerun for the whole library cycle?

Thanks.

@scheglov (Contributor)

> I'm not quite sure I follow.
>
> If there are two unrelated macros running on the same library, why is it clearer to merge the output into one file?

They are related by being applied to the same classes.
So, we can see the whole shape of the generated class augmentation.
Or a few classes generated for the same target class.

> We see this today with built_value output, it has multiple generators: value types and serialization, and the output is merged to one file. Okay, it works. But usually when I go to the generated code I'm either looking for value type code or serializer code, not both, because they do completely different things. So as a user I would not be sad if the output is split to value type generated code and serializer generated code. It would be clearer.

If _$SimpleValueSerializer, _$SimpleValue, and SimpleValueBuilder were next to each other, it would be fine.
Scrolling is fast and cheap.

> Here is some example large output that is merged :)

Do you want to scare me with 7469 lines of generated code?
In this regard, it makes the same impression as if it were just half of this.
A bit too much, but it's either this, or as much manually written code.

> Back to the question of performance, I'm not 100% clear on how these numbers relate to incremental performance. If we know a macro needs rerunning for just one library, which parts can we avoid rerunning for the whole library cycle? Which parts are forced to rerun for the whole library cycle?

Here are new numbers, which get closer to the previously outlined theoretical ones; we skip some extra parsing.

/tmp/dart_model_benchmark/json-macro/package_under_test
[time: 2170 ms]
[ ](name: <scheduler>, count: 0, elapsed: 0:00:00.000000, elapsedSelf: -0:00:02.127258)
[ ]  (name: getUnitElement, count: 21, elapsed: 0:00:02.127258, elapsedSelf: 0:00:00.002497)
[ ]    (name: libraryContext, count: 21, elapsed: 0:00:02.124761, elapsedSelf: 0:00:00.126992)(bytesPut: 7485399, cycleCount: 5, libraryCount: 30)
[ ]      (name: link, count: 5, elapsed: 0:00:01.819301, elapsedSelf: 0:00:00.287922)
[ ]        (name: computeLibraryScopes, count: 5, elapsed: 0:00:00.121435, elapsedSelf: 0:00:00.050705)
[ ]          (name: buildMacroApplier, count: 5, elapsed: 0:00:00.070664, elapsedSelf: 0:00:00.070664)
[ ]          (name: executeMacroTypesPhase, count: 5, elapsed: 0:00:00.000066, elapsedSelf: 0:00:00.000028)
[ ]            (name: macroApplier.executeTypesPhase, count: 21, elapsed: 0:00:00.000038, elapsedSelf: 0:00:00.000038)
[ ]        (name: executeMacroDeclarationsPhase, count: 5, elapsed: 0:00:00.190471, elapsedSelf: 0:00:00.000429)
[X]          (name: macroApplier.executeDeclarationsPhase, count: 221, elapsed: 0:00:00.087969, elapsedSelf: 0:00:00.087969)(constructorsOf: 200, methodsOf: 200)
[ ]          (name: addMacroResults, count: 200, elapsed: 0:00:00.102073, elapsedSelf: 0:00:00.001275)
[X]            (name: buildAugmentationLibraryCode, count: 200, elapsed: 0:00:00.006715, elapsedSelf: 0:00:00.006715)
[ ]            (name: kind.addMacroAugmentation, count: 200, elapsed: 0:00:00.036314, elapsedSelf: 0:00:00.002775)(length: 49900)
[ ]              (name: getFileForUri, count: 200, elapsed: 0:00:00.033539, elapsedSelf: 0:00:00.010162)
[ ]                (name: fileState.refresh, count: 200, elapsed: 0:00:00.023377, elapsedSelf: 0:00:00.001193)
[ ]                  (name: getUnlinkedUnit, count: 200, elapsed: 0:00:00.022184, elapsedSelf: 0:00:00.013982)
[ ]                    (name: parseCode, count: 200, elapsed: 0:00:00.008202, elapsedSelf: 0:00:00.008202)(length: 49900)
[ ]            (name: _addMacroAugmentation, count: 200, elapsed: 0:00:00.001160, elapsedSelf: 0:00:00.001160)
[ ]            (name: elements + types, count: 200, elapsed: 0:00:00.056609, elapsedSelf: 0:00:00.056609)
[ ]        (name: executeMacroDefinitionsPhase, count: 5, elapsed: 0:00:00.713325, elapsedSelf: 0:00:00.000396)
[X]          (name: macroApplier.executeDefinitionsPhase, count: 221, elapsed: 0:00:00.286778, elapsedSelf: 0:00:00.286778)(constructorsOf: 200, methodsOf: 200, resolve: 600, typeDeclarationOf: 200)
[ ]          (name: addMacroResults, count: 200, elapsed: 0:00:00.426151, elapsedSelf: 0:00:00.002061)
[X]            (name: buildAugmentationLibraryCode, count: 200, elapsed: 0:00:00.077331, elapsedSelf: 0:00:00.077331)
[ ]            (name: kind.addMacroAugmentation, count: 200, elapsed: 0:00:00.274253, elapsedSelf: 0:00:00.027695)(length: 2314900)
[ ]              (name: getFileForUri, count: 200, elapsed: 0:00:00.246558, elapsedSelf: 0:00:00.014776)
[ ]                (name: fileState.refresh, count: 200, elapsed: 0:00:00.231782, elapsedSelf: 0:00:00.001628)
[ ]                  (name: getUnlinkedUnit, count: 200, elapsed: 0:00:00.230154, elapsedSelf: 0:00:00.092190)
[ ]                    (name: parseCode, count: 200, elapsed: 0:00:00.137964, elapsedSelf: 0:00:00.137964)(length: 2314900)
[ ]            (name: _addMacroAugmentation, count: 200, elapsed: 0:00:00.002012, elapsedSelf: 0:00:00.002012)
[ ]            (name: elements + types, count: 200, elapsed: 0:00:00.070494, elapsedSelf: 0:00:00.070494)
[ ]        (name: mergeMacroAugmentations, count: 5, elapsed: 0:00:00.506148, elapsedSelf: 0:00:00.015948)
[X]          (name: buildAugmentationLibraryCode, count: 30, elapsed: 0:00:00.079454, elapsedSelf: 0:00:00.079454)
[ ]          (name: kind.addMacroAugmentation, count: 20, elapsed: 0:00:00.245704, elapsedSelf: 0:00:00.034075)
[ ]            (name: getFileForUri, count: 20, elapsed: 0:00:00.211629, elapsedSelf: 0:00:00.002483)
[ ]              (name: fileState.refresh, count: 20, elapsed: 0:00:00.209146, elapsedSelf: 0:00:00.000272)
[ ]                (name: getUnlinkedUnit, count: 20, elapsed: 0:00:00.208874, elapsedSelf: 0:00:00.070154)
[ ]                  (name: parseCode, count: 20, elapsed: 0:00:00.138720, elapsedSelf: 0:00:00.138720)(length: 2327350)
[X]          (name: importedFile.parse(), count: 20, elapsed: 0:00:00.165042, elapsedSelf: 0:00:00.000102)(length: 2327350)
[ ]            (name: parseCode, count: 20, elapsed: 0:00:00.164940, elapsedSelf: 0:00:00.164940)(length: 2327350)
[S]      (name: macroCompileKernel, count: 1, elapsed: 0:00:00.178468, elapsedSelf: 0:00:00.178468)

These numbers are for the case when we don't have any cached macro generated results.
From this we can try to subtract the operations that we don't have to do if we have a cached macro generated result.

Here, X is almost completely excluded: if there is just one macro out of 200, then we pay 1/200 of this, approximately zero. And S - the compilation of the macro code itself - is skipped completely.

Which gives 0.087969 + 0.006715 + 0.286778 + 0.077331 + 0.079454 + 0.178468 = 0.716715.
So, we get 2170 - 716 = 1454 ms.

We need to run such operations as kind.addMacroAugmentation and elements + types to merge even reused macro generated code into the library element models that we are linking.

@scheglov (Contributor)

Another thought that occurred to me is that it might not be worth attempting to reuse macro results. Here we can save at most 0.087969 + 0.006715 + 0.286778 + 0.077331 = 458 ms, or about 20% of the total time. And this does not yet include the cost of computing the information needed to track dependencies, which might be significant.

Combined with the complexity of the implementation, this makes the idea less palatable.

OTOH, we already improved performance by avoiding parsing code more often than necessary.

@davidmorgan (Contributor)

> If _$SimpleValueSerializer, _$SimpleValue, and SimpleValueBuilder were next to each other, it would be fine. Scrolling is fast and cheap.
>
> > Here is some example large output that is merged :)
>
> Do you want to scare me with 7469 lines of generated code? In this regard, it makes the same impression as if it were just half of this. A bit too much, but it's either this, or as much manually written code.

It's a specific example where I do end up looking at generated code and would prefer it split by generator, not merged. It's true the augmentations (or parts, in this case) are always related to the same underlying library, but they're much less closely related to each other.

Anyway, I don't think splitting up is a great feature :) ... except if it turns out to bring a big performance improvement, then it might be worth considering. We could always come back to it later if so. It's odd to think of the IDE UX affecting the compile / incremental compile so much, it is a bit more strongly coupled than I'd like. But then, source code is exactly on that boundary between tooling and human use :) thanks.

@jakemac53 (Contributor, Author)

I do agree with Morgan that if merging the files is costing a lot, we should re-evaluate it. We could for example present a merged file to the user still, but not have it be used by the CFE/analyzer, possibly with a source map back to the real files.

@jakemac53 (Contributor, Author)

> Another thought that occurred to me is that it might not be worth attempting to reuse macro results. Here we can save at most 0.087969 + 0.006715 + 0.286778 + 0.077331 = 458 ms, or about 20% of the total time. And this does not yet include the cost of computing the information needed to track dependencies, which might be significant.

I don't know which macro this is or how many requests it is making. But, for other macros it would likely be much more worthwhile, so we need to keep that in mind.

Another question - it seems like in the non-macro case you were able to avoid any re-parsing? Is that accurate? If so, it seems like to be competitive, we would also need to get to that same state. Could we cache merged+parsed augmentation results, in the case that all the results that would be merged together are identical?

@scheglov (Contributor)

> Another thought that occurred to me is that it might not be worth attempting to reuse macro results. Here we can save at most 0.087969 + 0.006715 + 0.286778 + 0.077331 = 458 ms, or about 20% of the total time. And this does not yet include the cost of computing the information needed to track dependencies, which might be significant.

Another thought that makes it even less palatable: because I wanted to see the impact of my changes to macro processing (no extra parsing, maybe caching), I used getUnitElement. For any practical purpose, in the analysis server, if we build the element model, we will also resolve the corresponding files. This will dilute the percentage of the macro time even more.

> I don't know which macro this is or how many requests it is making. But, for other macros it would likely be much more worthwhile, so we need to keep that in mind.

This is for JsonCodable from here.

> Another question - it seems like in the non-macro case you were able to avoid any re-parsing? Is that accurate?

Here is what we currently pay for merging.
The item marked with X probably can be avoided, which leaves us with 340 ms.

[ ]        (name: mergeMacroAugmentations, count: 5, elapsed: 0:00:00.506148, elapsedSelf: 0:00:00.015948)
[ ]          (name: buildAugmentationLibraryCode, count: 30, elapsed: 0:00:00.079454, elapsedSelf: 0:00:00.079454)
[ ]          (name: kind.addMacroAugmentation, count: 20, elapsed: 0:00:00.245704, elapsedSelf: 0:00:00.034075)
[ ]            (name: getFileForUri, count: 20, elapsed: 0:00:00.211629, elapsedSelf: 0:00:00.002483)
[ ]              (name: fileState.refresh, count: 20, elapsed: 0:00:00.209146, elapsedSelf: 0:00:00.000272)
[ ]                (name: getUnlinkedUnit, count: 20, elapsed: 0:00:00.208874, elapsedSelf: 0:00:00.070154)
[ ]                  (name: parseCode, count: 20, elapsed: 0:00:00.138720, elapsedSelf: 0:00:00.138720)(length: 2327350)
[X]          (name: importedFile.parse(), count: 20, elapsed: 0:00:00.165042, elapsedSelf: 0:00:00.000102)(length: 2327350)
[ ]            (name: parseCode, count: 20, elapsed: 0:00:00.164940, elapsedSelf: 0:00:00.164940)(length: 2327350)

Yes, for the non-macro case we don't have to process the toJson() and fromJson() functions twice - the first time when generated for a single class, and the second time when merged into a single library augmentation.

> If so, it seems like to be competitive, we would also need to get to that same state. Could we cache merged+parsed augmentation results, in the case that all the results that would be merged together are identical?

Yes, I planned on caching merged code too, for the case when all partial macro-generated results were reused. That's why I initially subtracted the corresponding mergeMacroAugmentations / buildAugmentationLibraryCode timing.

You say "cache merged+parsed augmentation results". I had not thought about serializing the parsed AST recently. This might be faster than parsing. Or not. I don't know; I need to try implementing it to see the numbers.

> I do agree with Morgan that if merging the files is costing a lot, we should re-evaluate it. We could for example present a merged file to the user still, but not have it be used by the CFE/analyzer, possibly with a source map back to the real files.

There will always be some additional cost to macros.
Merging files costs the analysis server less than 15%.

@jakemac53 (Contributor, Author)

> There will always be some additional cost to macros.

For a non-cached result absolutely there will have to be a cost, but I am less concerned about that case. The big question here is around incremental changes, how cheap can we make them, and can it ultimately be equivalent to user written code if the macro doesn't re-run?

I am a bit confused about the answer to that question currently, are there extra costs when macros are involved, even in the cached macro result case? It seems to me that the answer is yes, but it isn't clear to me where that extra time is going.

@scheglov (Contributor)

Yes, there are going to be extra costs even if we reuse cached results.

  1. It will cost something to compute dependency information, and to check it for every cached result. This is not a fully developed area, but potentially this can be quite expensive and eat into any savings that we could get.
  2. If there is even a single macro application for which we did not reuse the result, we have to do mergeMacroAugmentations: generate code, parse, etc.
  3. Macros generate more code: 2.6 MB vs 1.8 MB.
  4. Merging augmentations into the element model costs something.

copybara-service bot pushed a commit that referenced this issue Jun 24, 2024
This saves another 160 ms.

Bug: #55784
Change-Id: If626d616999b85edaaa4028cbe5fc70fbc1bf86a
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/372960
Commit-Queue: Konstantin Shcheglov <[email protected]>
Reviewed-by: Brian Wilkerson <[email protected]>
@jakemac53 (Contributor, Author)

> 1. It will cost something to compute dependency information, and to check it for every cached result. This is not a fully developed area, but potentially this can be quite expensive and eat into any savings that we could get.

Hopefully this isn't bad, but it definitely would depend on how sophisticated it was.

> 2. If there is even a single macro application for which we did not reuse the result, we have to do mergeMacroAugmentations: generate code, parse, etc.

Yes, but I think that is fine. Especially if we can say "for which the result changed" - this is a potentially important difference. We should be able to re-use the merged result if the macro output was the same, even if it was re-run.

> 3. Macros generate more code: 2.6 MB vs 1.8 MB.

Is this because of different logic or overhead from the augmentation declarations? Are we comparing identical code (functionally)?

> 4. Merging augmentations into the element model costs something.

Is this more expensive for macro code than user code?

@scheglov (Contributor)

> > 1. It will cost something to compute dependency information, and to check it for every cached result. This is not a fully developed area, but potentially this can be quite expensive and eat into any savings that we could get.
>
> Hopefully this isn't bad, but it definitely would depend on how sophisticated it was.

Hopefully. To know it with more precision I will need to implement it.

One of the costs that I know about right now is that when the macro asks for methods, for example, the returned declarations are "syntactic", and depend on the tokens of all these methods. My plan is to combine these tokens into a single signature; usually we use MD5 for this, but it is somewhat expensive to compute. I need to see if we can micro-optimize it.
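
For example, hypothetically, with package:crypto (tokenSignature and the lexeme input are illustrative, not the analyzer's actual code):

import 'dart:convert' show utf8;
import 'package:crypto/crypto.dart' show md5;

// Fold a declaration's token lexemes into one MD5 signature so a
// cached macro result can be validated cheaply later.
String tokenSignature(Iterable<String> lexemes) =>
    md5.convert(utf8.encode(lexemes.join('\x00'))).toString();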

  1. If there is even a single macro application for which we did not reuse the result, we have to do mergeMacroAugmentations: generate code, parse, etc.

Yes, but I think that is fine. Especially if we can say "for which the result changed" - this is a potentially important difference. We should be able to reuse the merged result if the macro output was the same, even if the macro was re-run.

True, if the result is the same after the re-run, we can still reuse the merged code.

  3. Macros generate more code: 2.6 MB vs 1.8 MB.

Is this because of different logic or overhead from the augmentation declarations? Are we comparing identical code (functionally)?

I presume that this is the same functionality. @davidmorgan can confirm or refute this.

It looks like there are a few differences. Macro-generated code has indentation, and uses explicit this. and the prefix0. import prefix.
There are also external stubs, but they are small and unlikely to make a significant impact.

augment library 'package:package_under_test/a16.dart';

import 'dart:core' as prefix0;

augment class A0 {
  external A0.fromJson(prefix0.Map<prefix0.String, prefix0.Object?> json);
  external prefix0.Map<prefix0.String, prefix0.Object?> toJson();
  augment A0.fromJson(prefix0.Map<prefix0.String, prefix0.Object?> json, )
      : this.a0 = json['a0'] as prefix0.int?,
        this.a1 = json['a1'] as prefix0.int?,

Manual code is un-indented and not prefixed.

if (a97 != null) result['a97'] = a97;
if (a98 != null) result['a98'] = a98;
if (a99 != null) result['a99'] = a99;
return result;
}
factory A0.fromJson(Map<String, Object?> json) {
return A0._(
a0: json['a0'] as int,
a1: json['a1'] as int,
a2: json['a2'] as int,
  4. Merging augmentations into the element model costs something.

Is this more expensive for macro code than user code?

Manual code does not use augmentations; it has all the code for toJson() and fromJson() generated inside the classes. So it does not pay for the augmentation logic.

@jakemac53
Copy link
Contributor Author

Manual code does not use augmentations; it has all the code for toJson() and fromJson() generated inside the classes. So it does not pay for the augmentation logic.

Yeah, I was wondering whether it is more expensive to put code in an augmentation versus directly in the main library.

copybara-service bot pushed a commit that referenced this issue Jun 24, 2024
…to declarations.

This saves us about 120 ms out of 2000.

Change-Id: Ib352a78b39706b07fc2ee0d5b2c3e6e77d98119c
Bug: #55784
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/373000
Commit-Queue: Konstantin Shcheglov <[email protected]>
Reviewed-by: Brian Wilkerson <[email protected]>
@scheglov
Copy link
Contributor

Manual code does not use augmentations; it has all the code for toJson() and fromJson() generated inside the classes. So it does not pay for the augmentation logic.

Yeah, I was wondering whether it is more expensive to put code in an augmentation versus directly in the main library.

It is hard to quantify. My best guess was to measure the operations that merge members from augmentations into the AugmentedXyzImpl structures. It does not look expensive: something under 20 ms, or < 1%.

@scheglov
Copy link
Contributor

Re: #55784 (comment)

Decided to see if there is anything better than MD5.
On Dart, even the best alternative is still quite slow :-(
https://pub.dev/packages/hashlib#benchmarks

MD5 | 168.39 MB/s
XXH128 | 119.32 MB/s

For comparison, on Rust:

MD5 104857600 bytes * 1 iterations: 179 ms, Speed: 557.3 MiB/s
SHA1 104857600 bytes * 1 iterations: 108 ms, Speed: 922.0 MiB/s
SHA256 104857600 bytes * 1 iterations: 290 ms, Speed: 344.0 MiB/s
xxHash64 104857600 bytes * 1 iterations: 3 ms, Speed: 29021.6 MiB/s
xxHash128 104857600 bytes * 1 iterations: 3 ms, Speed: 29704.1 MiB/s
MurmurHash3 104857600 bytes * 1 iterations: 27 ms, Speed: 3653.0 MiB/s
MD5 5242880 bytes * 10 iterations: 86 ms, Speed: 575.7 MiB/s
SHA1 5242880 bytes * 10 iterations: 52 ms, Speed: 947.0 MiB/s
SHA256 5242880 bytes * 10 iterations: 145 ms, Speed: 344.5 MiB/s
xxHash64 5242880 bytes * 10 iterations: 1 ms, Speed: 32948.0 MiB/s
xxHash128 5242880 bytes * 10 iterations: 1 ms, Speed: 32645.1 MiB/s
MurmurHash3 5242880 bytes * 10 iterations: 13 ms, Speed: 3734.6 MiB/s
MD5 1024 bytes * 5000 iterations: 9 ms, Speed: 530.7 MiB/s
SHA1 1024 bytes * 5000 iterations: 5 ms, Speed: 852.7 MiB/s
SHA256 1024 bytes * 5000 iterations: 15 ms, Speed: 315.6 MiB/s
xxHash64 1024 bytes * 5000 iterations: 0 ms, Speed: 30717.6 MiB/s
xxHash128 1024 bytes * 5000 iterations: 0 ms, Speed: 29533.2 MiB/s
MurmurHash3 1024 bytes * 5000 iterations: 1 ms, Speed: 3319.9 MiB/s
MD5 10240 bytes * 500 iterations: 8 ms, Speed: 565.2 MiB/s
SHA1 10240 bytes * 500 iterations: 5 ms, Speed: 944.3 MiB/s
SHA256 10240 bytes * 500 iterations: 14 ms, Speed: 336.6 MiB/s
xxHash64 10240 bytes * 500 iterations: 0 ms, Speed: 33178.9 MiB/s
xxHash128 10240 bytes * 500 iterations: 0 ms, Speed: 32247.5 MiB/s
MurmurHash3 10240 bytes * 500 iterations: 1 ms, Speed: 3714.1 MiB/s
MD5 10 bytes * 100000 iterations: 11 ms, Speed: 85.6 MiB/s
SHA1 10 bytes * 100000 iterations: 7 ms, Speed: 127.7 MiB/s
SHA256 10 bytes * 100000 iterations: 18 ms, Speed: 52.2 MiB/s
xxHash64 10 bytes * 100000 iterations: 0 ms, Speed: 4630.4 MiB/s
xxHash128 10 bytes * 100000 iterations: 0 ms, Speed: 2886.6 MiB/s
MurmurHash3 10 bytes * 100000 iterations: 1 ms, Speed: 730.6 MiB/s

It looks like theoretically it could be 29704.1 / 168.39 ≈ 176 times faster!
One tiny problem... that speed cannot be reached in Dart.
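
For reference, the Dart-side MD5 number is easy to reproduce with a rough micro-benchmark on package:crypto (buffer size is arbitrary; throughput will vary by machine):

import 'dart:typed_data';

import 'package:crypto/crypto.dart' show md5;

void main() {
  final data = Uint8List(64 * 1024 * 1024); // 64 MB of zeros
  final sw = Stopwatch()..start();
  md5.convert(data);
  sw.stop();
  print('MD5: ${(64 * 1000 / sw.elapsedMilliseconds).toStringAsFixed(1)} MB/s');
}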

@jakemac53
Copy link
Contributor Author

Given that the Dart and Rust MD5 implementations are much closer in speed, it seems like this library could possibly just be optimized. I also noticed that the 64-bit and 32-bit versions are much faster, and would probably be fine to use here.

@scheglov
Copy link
Contributor

I don't understand what you mean by 64 and 32 bit versions.
Do you mean XXH32 and XXH64?
These indeed run on Dart at the same speed as MD5 on Rust.
But they have fewer bits, so a higher collision probability.
Gemini recommended that I use xxHash128, which has the same number of bits as MD5.
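
To put "fewer bits" in perspective, here is a back-of-the-envelope birthday bound, p ≈ n^2 / 2^(b+1) for n items and b hash bits; the million-item count is an arbitrary example:

import 'dart:math' as math;

/// Approximate probability of at least one collision among [n]
/// uniformly random [bits]-bit hashes (birthday bound).
double collisionProbability(int n, int bits) =>
    (n * n) / math.pow(2.0, bits + 1);

void main() {
  const n = 1000000;
  print(collisionProbability(n, 64)); // ~2.7e-8
  print(collisionProbability(n, 128)); // ~1.5e-27
}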

And if we compare Dart's version of XXH64 vs. Rust's, it is still 527 MB/s vs. 33178 MB/s, about 60 times slower!

I suspect that xxHash128 is much faster because it makes better use of modern CPUs,
while MD5 was designed much earlier (1991), without being able to anticipate what CPUs would provide today.

@scheglov
Copy link
Contributor

OTOH, I might be wrong to blame MD5 per se.
Measuring it separately gives not-so-bad timings for name: md5:

[time: 2104 ms]

(name: md5-root, count: 0, elapsed: 0:00:00.000000, elapsedSelf: -0:00:00.141439)
  (name: md5, count: 1684, elapsed: 0:00:00.056389, elapsedSelf: 0:00:00.056389)
  (name: utf8Encoder, count: 668633, elapsed: 0:00:00.055708, elapsedSelf: 0:00:00.055708)
  (name: addBytes, count: 669909, elapsed: 0:00:00.029342, elapsedSelf: 0:00:00.029342)

The rest are other operations that feed data into it.
See [!] marked lines below.

        (name: executeMacroDefinitionsPhase, count: 5, elapsed: 0:00:00.731235, elapsedSelf: 0:00:00.000685)
          (name: macroApplier.executeDefinitionsPhase, count: 221, elapsed: 0:00:00.317962, elapsedSelf: 0:00:00.317962)(constructorsOf: 200, methodsOf: 200, resolve: 600, typeDeclarationOf: 200)
          (name: addMacroResults, count: 200, elapsed: 0:00:00.412588, elapsedSelf: 0:00:00.001859)
            (name: buildAugmentationLibraryCode, count: 200, elapsed: 0:00:00.075351, elapsedSelf: 0:00:00.075351)
            (name: kind.addMacroAugmentation, count: 200, elapsed: 0:00:00.307607, elapsedSelf: 0:00:00.021806)(length: 2314900)
              (name: getFileForUri, count: 200, elapsed: 0:00:00.285801, elapsedSelf: 0:00:00.012769)
                (name: fileState.refresh, count: 200, elapsed: 0:00:00.273032, elapsedSelf: 0:00:00.001780)
                    (name: getUnlinkedUnit, count: 200, elapsed: 0:00:00.271252, elapsedSelf: 0:00:00.000479)
                      (name: parseCode, count: 200, elapsed: 0:00:00.131042, elapsedSelf: 0:00:00.131042)(length: 2314900)
[!]                    (name: compute, count: 200, elapsed: 0:00:00.139731, elapsedSelf: 0:00:00.020553)
[!]                      (name: serializeAstUnlinked2, count: 200, elapsed: 0:00:00.119178, elapsedSelf: 0:00:00.014701)
[!]                        (name: apiSignature, count: 200, elapsed: 0:00:00.104477, elapsedSelf: 0:00:00.104477)
            (name: _addMacroAugmentation, count: 200, elapsed: 0:00:00.001762, elapsedSelf: 0:00:00.001762)
            (name: elements + types, count: 200, elapsed: 0:00:00.026009, elapsedSelf: 0:00:00.026009)
        (name: mergeMacroAugmentations, count: 5, elapsed: 0:00:00.448202, elapsedSelf: 0:00:00.014301)
          (name: buildAugmentationLibraryCode, count: 30, elapsed: 0:00:00.086018, elapsedSelf: 0:00:00.086018)
          (name: kind.addMacroAugmentation, count: 20, elapsed: 0:00:00.347883, elapsedSelf: 0:00:00.020151)
            (name: getFileForUri, count: 20, elapsed: 0:00:00.327732, elapsedSelf: 0:00:00.002500)
              (name: fileState.refresh, count: 20, elapsed: 0:00:00.325232, elapsedSelf: 0:00:00.000312)
                  (name: getUnlinkedUnit, count: 20, elapsed: 0:00:00.324920, elapsedSelf: 0:00:00.000116)
                    (name: parseCode, count: 20, elapsed: 0:00:00.155987, elapsedSelf: 0:00:00.155987)(length: 2327350)
[!]                  (name: compute, count: 20, elapsed: 0:00:00.168817, elapsedSelf: 0:00:00.014220)
[!]                    (name: serializeAstUnlinked2, count: 20, elapsed: 0:00:00.154597, elapsedSelf: 0:00:00.013818)
[!]                      (name: apiSignature, count: 20, elapsed: 0:00:00.140779, elapsedSelf: 0:00:00.140779)

Some of it is addString() for tokens, which does UTF-8 conversion.
Some of it is looping through AST nodes and tokens.

@scheglov
Copy link
Contributor

However, interestingly, here we might be able to do less work when using macros :-)

Any generated code is already based on the user-written code and the signatures of the libraries with macros. So we don't need these apiSignatures for *.macroX.dart files. Looking at the timings above, this might save us 240 ms. I need to measure (just for fun) how much we spend when analyzing the non-macro version of this package.
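
A heavily hedged sketch of that shortcut (the naming check and the surrounding function are made up for illustration, not the analyzer's real code):

/// Macro-generated augmentation files follow the *.macroX.dart convention
/// mentioned above.
bool isMacroAugmentation(Uri uri) =>
    RegExp(r'\.macro\d+\.dart$').hasMatch(uri.path);

/// Skip the apiSignature computation for generated files: their content is
/// already determined by the user-written code and the signatures of the
/// libraries their macros depend on.
List<int>? apiSignatureFor(Uri uri, List<int> Function() compute) =>
    isMacroAugmentation(uri) ? null : compute();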

@scheglov
Copy link
Contributor

It takes about 110 ms for the non-macro version.
This is not the whole truth, because it also includes the SDK.

/Users/scheglov/dart/macro_benchmark/generated/json-manual/package_under_test
[time: 909 ms]
(name: <scheduler>, count: 0, elapsed: 0:00:00.000000, elapsedSelf: -0:00:00.855737)
  (name: getUnitElement, count: 20, elapsed: 0:00:00.855737, elapsedSelf: 0:00:00.011614)
    (name: libraryContext, count: 20, elapsed: 0:00:00.844123, elapsedSelf: 0:00:00.000517)(bytesPut: 1792775, cycleCount: 2, libraryCount: 27)
      (name: libraryCycle, count: 20, elapsed: 0:00:00.398978, elapsedSelf: 0:00:00.007818)
        (name: fileState.refresh, count: 74, elapsed: 0:00:00.391160, elapsedSelf: 0:00:00.027638)
          (name: getUnlinkedUnit, count: 74, elapsed: 0:00:00.363522, elapsedSelf: 0:00:00.000824)
            (name: parseCode, count: 74, elapsed: 0:00:00.161280, elapsedSelf: 0:00:00.161280)(length: 2608204)
            (name: compute, count: 74, elapsed: 0:00:00.201418, elapsedSelf: 0:00:00.038182)
              (name: serializeAstUnlinked2, count: 74, elapsed: 0:00:00.163236, elapsedSelf: 0:00:00.046481)
                (name: apiSignature, count: 74, elapsed: 0:00:00.116755, elapsedSelf: 0:00:00.116755)
      (name: link, count: 2, elapsed: 0:00:00.444628, elapsedSelf: 0:00:00.000084)
        (name: LibraryBuilder.build, count: 2, elapsed: 0:00:00.192845, elapsedSelf: 0:00:00.085109)
          (name: libraryFile, count: 27, elapsed: 0:00:00.107736, elapsedSelf: 0:00:00.000045)
            (name: parseCode, count: 27, elapsed: 0:00:00.107691, elapsedSelf: 0:00:00.107691)(length: 2055608)
        (name: buildOutlines, count: 2, elapsed: 0:00:00.226574, elapsedSelf: 0:00:00.155050)
          (name: computeLibraryScopes, count: 2, elapsed: 0:00:00.071383, elapsedSelf: 0:00:00.060396)
            (name: buildMacroApplier, count: 2, elapsed: 0:00:00.010972, elapsedSelf: 0:00:00.010972)
            (name: executeMacroTypesPhase, count: 2, elapsed: 0:00:00.000015, elapsedSelf: 0:00:00.000015)
          (name: executeMacroDeclarationsPhase, count: 2, elapsed: 0:00:00.000024, elapsedSelf: 0:00:00.000024)
          (name: executeMacroDefinitionsPhase, count: 2, elapsed: 0:00:00.000061, elapsedSelf: 0:00:00.000061)
          (name: mergeMacroAugmentations, count: 2, elapsed: 0:00:00.000056, elapsedSelf: 0:00:00.000040)
            (name: buildAugmentationLibraryCode, count: 27, elapsed: 0:00:00.000016, elapsedSelf: 0:00:00.000016)
        (name: writeLibraries, count: 2, elapsed: 0:00:00.025125, elapsedSelf: 0:00:00.025125)

(name: md5-root, count: 0, elapsed: 0:00:00.000000, elapsedSelf: -0:00:00.050875)
  (name: md5, count: 365, elapsed: 0:00:00.020232, elapsedSelf: 0:00:00.020232)
  (name: utf8Encoder, count: 213990, elapsed: 0:00:00.020505, elapsedSelf: 0:00:00.020505)
  (name: addBytes, count: 214350, elapsed: 0:00:00.010138, elapsedSelf: 0:00:00.010138)

Hm... I probably over-estimated the savings for the macro version.
It is not 240 ms, but about the same 110 ms.
The reason is that measuring the MD5 timings was itself slowing us down.

@jakemac53
Copy link
Contributor Author

jakemac53 commented Jun 25, 2024

Do you mean XXH32 and XXH64?

Yes

These indeed run on Dart at the same speed as MD5 on Rust.
And if we compare Dart's version of XXH64 vs. Rust's, it is still 527 MB/s vs. 33178 MB/s, about 60 times slower!

Right, my point was that if you compare MD5 on Dart & Rust, it is much closer (<4x). You could be right that the algorithm is just more amenable - but it seems worth checking the Dart implementation to see if it is doing something slow (if we want to dig into optimizing this).

@scheglov
Copy link
Contributor

scheglov commented Jun 25, 2024

These indeed run on Dart at the same speed as MD5 on Rust.
And if we compare Dart's version of XXH64 vs. Rust's, it is still 527 MB/s vs. 33178 MB/s, about 60 times slower!

Right, my point was that if you compare MD5 on Dart & Rust, it is much closer (<4x). You could be right that the algorithm is just more amenable - but it seems worth checking the Dart implementation to see if it is doing something slow (if we want to dig into optimizing this).

Sure, MD5 on Rust is not that much faster than MD5 on Dart.
But that just means MD5 is the wrong algorithm; there is no point in optimizing it.
Yes, there might be some win, but nothing big.
We could theoretically get 60x better by switching to another algorithm and FFI.

Which, given the measurement above, (name: md5, count: 365, elapsed: 0:00:00.020232), is not worth doing either. I'm surprised, but it indicates that we spend only about 20 ms doing MD5 itself. So maybe, after all, I should not worry that tracking dependencies will cost too much.
