Evaluate and investigate performance #108

Open · 4 tasks · jakemac53 opened this issue Oct 18, 2024 · 17 comments

@jakemac53 (Contributor)

There are many aspects to this; I figure we can collect info in this issue so as not to duplicate too much work.

Some specific tasks are:

  • Establish a benchmarking strategy/suite; may need help from the implementation teams, but TBD
  • Improve invalidation based on query results (will probably need some work by @scheglov and @johnniwinther)
  • Improve granularity of queries to get more granular invalidation
  • Make improvements once problem areas are identified
@jakemac53 (Contributor, Author)

There should be some very low-hanging fruit here: I am seeing almost 100ms to run a phase that does nothing at all (the types phase, for json_codable).

@jakemac53 (Contributor, Author) commented Oct 18, 2024

Ok, so I narrowed this down to the initial query for the target, which is taking >80ms.

2024-10-18 16:12:10.951339: Running target query
2024-10-18 16:12:11.039218: Completed target query

I actually suspect this is just us needing to change the TCP connection settings, but will continue to dig. The actual query on the server side is <1ms to compute.

This query is going away soon, but in general queries should not be this slow 🤣

Update: confirmed we weren't setting tcpNoDelay; fixed in #109. Still more work to do, though: we are now at about 200ms in total, which is about 20x faster end to end overall.
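
For reference, the kind of change involved is a one-liner on the client socket; a minimal sketch using dart:io (function name is a placeholder, not the actual patch from #109):

```dart
import 'dart:io';

// Minimal sketch: disable Nagle's algorithm so small request/response
// messages between the macro client and server are sent immediately
// instead of being batched.
Future<Socket> connectToMacroServer(String host, int port) async {
  final socket = await Socket.connect(host, port);
  socket.setOption(SocketOption.tcpNoDelay, true);
  return socket;
}
```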

@jakemac53 (Contributor, Author)

#104 should also improve things here.

@scheglov

We previously discussed invalidation in dart-lang/sdk#55784, and it seems to me that the timings provided there will still be the same, even with the new queries. The analyzer invalidates and links whole library cycles, and has to parse all files of the library cycle. We cannot currently do this with higher granularity. For UX reasons we still want to have a single macro-generated file per library. We can avoid running individual macros, or individual phases, but that will only save us so much.

@jakemac53 (Contributor, Author) commented Oct 23, 2024

We previously discussed invalidation in dart-lang/sdk#55784, and it seems to me that the timings provided there will still be the same, even with the new queries.

This is why I was saying yesterday that I didn't think it would be sufficient to just do the "easy" thing and make an implementation of the macro executor that can return cached results. It sounded like more work would be needed, specific to each implementation, to cache the fully merged result.

Essentially, you need to be able to ask whether any macro application in the entire library was invalidated, and, if not, re-use the parsed AST for the fully merged result.

So, in general yes I agree, except that the new queries can simplify (and unify) the cache invalidation story, and they can also allow for more granular invalidation.

@scheglov

Yes, I agree that queries might simplify the invalidation checks. Could you walk me through a scenario?

Let's say we have the library L with two macro applications, M1 and M2, where M1, during the types phase, introspects class C1 and generates class C2, and also configures things so that M2 runs on C1 during the declarations phase. We can remember the query Q1 used by M1, and so during the next linking of L, instead of running M1, we can compute the answer to Q1, see that it is the same as before, and know that we can reuse the previously generated result R1.

However, we cannot use the previously remembered query Q2 until we apply R1 to L, because Q2 might (and does) introspect the results of previously run macros, in this case C2 from M1. So we have to parse the results of M1 and integrate those partial augmentations into L, and only then move on to M2.

It looks to me like we cannot just run all queries from all macros up front and decide whether we can reuse the merged result of running the macros. We still have to do this macro by macro.
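
To make the ordering constraint concrete, a purely illustrative sketch (M1 and M2 are stand-ins, not real macros):

```dart
// Stand-in annotations; real macros would be applied with macro syntax.
class M1 { const M1(); }
class M2 { const M2(); }

@M1() // Types phase: M1 runs Q1 against C1 and generates C2 (result R1).
      // It also arranges for @M2() to run on C1 in the declarations phase.
class C1 {}

// R1, as it looks once applied to L:
class C2 {}

// M2's declarations-phase query Q2 may introspect C2, so a cached answer to
// Q2 can only be validated after R1 has been parsed and merged into L.
```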

@jakemac53 (Contributor, Author)

cc @davidmorgan

Good points, I had forgotten this extra complication of the query results also being affected by macros. So, you need to somehow be able to return query results as of a specific point in time, which is not only per phase but can even be affected by individual macros within a phase.

One possible idea is to track, somewhere on each declaration, information about when it was introduced (either in the element model or the AST, not sure which makes the most sense). That sounds like potentially a lot of work, though, and the best way to track what happens mid-phase isn't immediately obvious, although I am sure we could work something out.
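
For concreteness, one hypothetical shape for that bookkeeping (nothing like this exists in the element model or AST today):

```dart
// Hypothetical types; not part of the analyzer's element model or AST.
class MacroOrigin {
  final int phase;            // 1 = types, 2 = declarations, 3 = definitions
  final int applicationIndex; // position of the application within that phase
  const MacroOrigin(this.phase, this.applicationIndex);
}

class DeclarationInfo {
  final String name;
  final MacroOrigin? introducedBy; // null for hand-written declarations
  const DeclarationInfo(this.name, {this.introducedBy});
}

/// A query evaluated "as of" a given point in time would filter out
/// declarations introduced at or after that point.
bool visibleAt(DeclarationInfo d, MacroOrigin point) {
  final origin = d.introducedBy;
  if (origin == null) return true; // hand-written code is always visible
  return origin.phase < point.phase ||
      (origin.phase == point.phase &&
          origin.applicationIndex < point.applicationIndex);
}
```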

Does it sound like I am understanding the issue properly?

@jakemac53 (Contributor, Author)

I am also curious how much of this is a result of the merged augmentation file.

If we decided to not merge augmentations, but instead created one per macro application (and phase), it might simplify/improve this situation.

Then you could save each of the parsed augmentation results separately, and go back through the merging process from the beginning, but with cached results for most of the augmentations?

@scheglov

My thoughts about many macro generated files are still the same as in dart-lang/sdk#55784 (comment).

@jakemac53 (Contributor, Author) commented Oct 24, 2024

My thoughts about many macro generated files are still the same as in dart-lang/sdk#55784 (comment).

I agree from a usability standpoint that one file is better, but I think performance is the more important usability concern. If not merging the files means we can avoid re-parsing all downstream macro generated code, then I think that is probably worth doing. We can evaluate other options for improving the user experience too.

@scheglov

Hm... Actually, we merge the macro-generated results into a single file after running all macros.
So, I don't understand: how does it affect the way we compute query results?

@jakemac53 (Contributor, Author)

So, I don't understand: how does it affect the way we compute query results?

Well, I think the difference is that we would have to cache all of the parsed but pre-merged augmentations, as well as the final parsed merged augmentation. If none of the pre-merged ones had to be regenerated, then we could re-use the final result as well. So, double the AST nodes cached for macro files.

But, if we did that, we might be able to do something reasonable here? Essentially something like this at a very high level:

  • Remove all macro augmentations from the element model
  • For each macro application, in application order
    • Run the queries of the macro, compare against previous results.
    • If no change
      • re-use previous parsed AST for that macro result
      • merge it into the element model
    • Else
      • re-run the macro, get a new parsed AST
      • merge it into the element model
      • set some flag that the merged AST is dirty
  • If the merged AST is dirty
    • regenerate it, parse it
  • Update the element model nodes to point at the new ones in the merged model

At a very high level, does that sound correct/feasible? I worry a bit about the last step: it always has to happen, no matter what. And in general, we are always repeating the work of merging the augmentations into the element model. Is this more work than if the file were hand-written, or would we always be re-creating the entire element model from scratch anyway?
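
Restated as code (every type and function here is hypothetical, not an analyzer or CFE API), the loop above might look roughly like this:

```dart
// Hypothetical stand-ins; none of these names exist in the real implementations.
class MacroApplication {
  final String id;
  MacroApplication(this.id);
}

class CacheEntry {
  final Object queryAnswers; // previous answers to this application's queries
  final String parsedResult; // stands in for the previously parsed AST
  CacheEntry(this.queryAnswers, this.parsedResult);
}

abstract class Host {
  Object runQueries(MacroApplication m);     // against the current element model
  String runMacro(MacroApplication m);       // re-run the macro, get new output
  void mergeIntoElementModel(String result); // merge one result into the model
  String regenerateAndParseMergedFile(List<String> results);
  void repointElementModel(String mergedAst); // point nodes at the merged AST
}

void relink(Host host, List<MacroApplication> applications,
    Map<String, CacheEntry> cache, String previousMergedAst) {
  // Start from an element model with no macro augmentations in it and rebuild
  // macro output application by application, in application order.
  var mergedAstIsDirty = false;
  final results = <String>[];
  for (final m in applications) {
    final cached = cache[m.id];
    final answers = host.runQueries(m); // compare against previous results
    String result;
    if (cached != null && answers == cached.queryAnswers) {
      result = cached.parsedResult;     // no change: reuse previous parsed AST
    } else {
      result = host.runMacro(m);        // change: re-run the macro
      cache[m.id] = CacheEntry(answers, result);
      mergedAstIsDirty = true;          // merged AST must be regenerated
    }
    host.mergeIntoElementModel(result);
    results.add(result);
  }
  var mergedAst = previousMergedAst;
  if (mergedAstIsDirty) {
    mergedAst = host.regenerateAndParseMergedFile(results);
  }
  host.repointElementModel(mergedAst);  // the step that always has to happen
}
```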

@johnniwinther is this something that would be feasible in the CFE? I don't want to overfit to the analyzer here either.

@scheglov

We probably have different conceptions of how the element model is handled in the analyzer. When there is any change to a file in the library cycle itself, or in any of its dependencies, we discard the whole element model for that library cycle and re-link it from scratch. This starts with parsing all files of the library cycle.

So the step "remove all macro augmentations from the element model" does not make sense: if there was a change, there is already nothing left.

Also, "re-use the previous parsed AST" might mean different things. We do not, and probably should not, keep the parsed ASTs that were used for linking element models after linking is done; it would consume more memory than necessary. Theoretically we could serialize parsed ASTs, and reading them back from some binary format might be faster than parsing from scratch. That is probably not hard to do, but it is quite a lot of work, and the performance benefits are unknown until we try.

@davidmorgan (Contributor)

I think it would help to have concrete goals for performance to motivate the work, particularly to guide how far we need to go here in considering different designs and approaches.

Here are my thoughts as a starting point :)

Goals

Better Than Codegen

Macros need to be a better user experience than package:build codegen.

  • Benchmarking showed macros v1 is approximately a constant factor of 2x faster than codegen. Already a nice improvement.
  • But, for incremental analysis this is traded off against the fact that macros run more often than codegen, in a way that is more blocking for the UI, can't be paused, and doesn't allow manual editing/substitution of the output.
  • And, for initial analysis codegen has a much better story, because prebuilt source can be published, while macros require building all transitive macros and then applying all transitive applications.

There are two paths to improvement here:

  • Make macros faster: ultimately, if macros are fast enough that users never notice an interactive delay, this problem is solved.
  • And/or, add ways to move macro runtime "off the critical path" so that users can work despite macro delay, as they can with generators.

We will certainly work on making macros faster; the question is how far we can go: can macros be fast enough that we just say to users "you have no choices here because it's always fast enough"? I think there's a reasonable chance the answer is "no", so now seems a good time to ask what options we have and how much work they would be to build.

Scalable

We've talked about macros being "scalable", and I know what I think that means; let's see if we have consensus :)

For me "scalable" means that we've checked how performance scales with a set of measures of input size, and are satisfied with the results.

Example measures of input size: the total number of different macros in the program, the total number of macro applications in the program, the number of different macros in a library cycle, and the number of macro applications in a library cycle.

"Satisfied with the results" is tricky because macro runtime is always part of a larger operation: it's part of a larger analysis or compile. We know for example that incremental analysis on a library cycle has to do work on the whole library cycle, so analysis time scales with the library cycle size.

That could tempt us to say that it's okay for macro runtime to also scale with library cycle size. What worries me about that outcome is that if we design macros so that O(library cycle size) is the best possible performance, then when some later analyzer optimization work reduces analysis time for large library cycles, macros will become the bottleneck.

So I think it's on us at design time to justify the scalability as much as possible from basic principles: by arguing that the work done is precisely what is necessary and no more.

With this in mind, an important challenge to macro scalability is the generality of the feature: for reasons of usability, a lot of things "just work", sharp edges are minimized. But the common cases, and so the most important cases for performance, will tend to be simple. For example, if a library cycle uses ten different macros, it's extremely unlikely that they all interact, in the sense of output from one macro being input to another. It is much more likely that they are mostly independent: acting on different declarations or within different parts of the namespace when they do touch the same declarations. So we need that to be fast, with little or no cost paid for generality that goes unused.

Braindump

Various performance-related thoughts.

Lazy Macros

The analyzer doesn't actually need full macro output unless the user clicks through to an augmentation file; before that it only needs sufficient output to analyze other files, which could for example mean skipping phase 3.

More fine-grained optimizations are possible. For example, Swift macros can specify in advance the names they will produce. The compiler can then avoid running the macro at all if the possible names are not interesting for the currently-running compiler task. Our macros are much more complicated, of course :)
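
A purely hypothetical Dart analogue of that idea, just to illustrate the shape (no such API exists in the macros packages):

```dart
// Hypothetical interface: a macro declares, up front, the names it can add.
abstract class DeclaresNames {
  Set<String> declaredNamesFor(String targetName);
}

/// The host could skip running a macro entirely when none of the names it
/// can introduce matter for the task at hand (e.g. resolving one identifier).
bool mustRun(DeclaresNames macro, String target, Set<String> namesOfInterest) {
  return macro.declaredNamesFor(target).intersection(namesOfInterest).isNotEmpty;
}
```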

Optimistic Macros

If macros generally don't interact with each other, then "run quickly assuming no interactions, start a slower run in parallel" might give good results.

Analyzer UX / Parallelism

Most of what the analyzer does is faster than running macros, and the results are wanted with low latency: so it might be an improvement to delay running macros until the faster results are reported, or to push macro evaluation to another process so it doesn't block the faster results.

Just generally, we should look for any opportunities to use multiple processes, as this looks like free additional resources on most dev machines.
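
As a toy illustration of keeping the fast results off the slow path (using an isolate rather than a separate process; all function names are placeholders, not analyzer APIs):

```dart
import 'dart:isolate';

// Placeholders standing in for real analyzer work.
void reportFastResults() => print('fast (macro-unaware) results ready');
String runAllMacros() => 'macro output';
void reportFullResults(String r) => print('full results: $r');

Future<void> analyzeWithoutBlockingOnMacros() async {
  reportFastResults();                            // report low-latency findings first
  final output = await Isolate.run(runAllMacros); // slow work off the main isolate
  reportFullResults(output);                      // then refine once macros have run
}
```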

Simple Hacks for Simple Things

With @JsonCodable, when the caret is at a field name and you type new characters, what happens in the output is trivial: the corresponding field name, which is repeated ~four times in the output, gets the same new characters. Could we optimize for such cases by guessing the right output in most cases and running the correct+slow analysis in the background?
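
For a sense of scale, roughly the shape of @JsonCodable output, hand-written here for illustration (the real generated augmentation differs in detail):

```dart
class User {
  final String userName; // renaming this field in the source...

  // ...touches the same identifier roughly four times in the generated code:
  User.fromJson(Map<String, Object?> json)
      : userName = json['userName'] as String;

  Map<String, Object?> toJson() => <String, Object?>{'userName': userName};
}
```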

If we decide "correct macro analysis after every keypress" is a hard requirement, it's likely the hardest case to optimize for so we should be prepared to tackle it directly.

Pause Button

If we can't make macro execution lazy / off the critical path, another option is to have a way to choose when to run it.

Disk Support

Just generally, a lot of the "escape hatches" for codegen performance relate to codegen working with files on disk that can be reused, prebuilt, and manually hacked on. Maybe that's something we should look at fleshing out.

Share Analyzer Output with CFE

Users of incremental CFE compile are likely also running the analyzer; the CFE could directly use augmentations output by the analyzer in this case.

Publish to Pub

For faster initial analysis / initial compile.

Versioning

Now that we plan to support breaking changes to macros, we are at least not fully stuck if we discover that macros are slow in an important+unfixable way: we can add an alternative faster implementation and deprecate (but not remove) the slow one.

Next Steps

I guess, let's chat in tomorrow's meeting :)

Thanks.

@jakemac53 (Contributor, Author) commented Oct 28, 2024

Most of what the analyzer does is faster than running macros

This is actually not true, based on dart-lang/sdk#55784 (comment) and dart-lang/sdk#55784 (comment). Macros themselves actually account for less than half of the time, even after some optimizations to other pieces were applied.

This is why I have been trying to address things from the parsing/merging-of-results angle; that appears to be a fundamental issue which making macros faster, or even more incremental, will not solve (unless the incrementality also fixes those other issues).

@davidmorgan (Contributor)

Most of what the analyzer does is faster than running macros

This is actually not true, based on dart-lang/sdk#55784 (comment) and dart-lang/sdk#55784 (comment). Macros themselves actually account for less than half of the time, even after some optimizations to other pieces were applied.

This is why I have been trying to address things from the parsing/merging-of-results angle; that appears to be a fundamental issue which making macros faster, or even more incremental, will not solve (unless the incrementality also fixes those other issues).

Sorry, what I meant was: the findings the analyzer could report by totally ignoring macros, or by pretending their output didn't change, versus the findings after fully evaluating macros.

But, yes, and one related thought: it might be useful to talk about the performance of augmentations as a separate thing; it's not clear to me how much of that incremental cost is specific to augmentations versus what we would see with the same code just written inline.

@jakemac53 (Contributor, Author)

But, yes, and one related thought: it might be useful to talk about the performance of augmentations as a separate thing; it's not clear to me how much of that incremental cost is specific to augmentations versus what we would see with the same code just written inline.

@scheglov should comment, but my understanding is that this isn't an issue with augmentations themselves. It is an issue with us re-creating new augmentation files every time macros are applied to a library, as well as the duplicate parsing of all macro-produced results. We end up regenerating and re-parsing the macro-generated augmentation files whenever macro expansion re-runs for a library, even if we don't actually re-run the macros themselves.
