Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change referenced assemblies processing to be on demand in MarkStep #1164

Closed
marek-safar opened this issue May 4, 2020 · 30 comments
Closed
Assignees
Labels
Milestone

Comments

@marek-safar
Copy link
Contributor

For historical reasons ILLinker assembly resolution logic was very explicit and done as a single step in the pipeline LoadReferencesStep this was easy to implement but it does not work well with the more advanced features which are have been and will keep be adding to the linker.

There are several drawbacks which would be addressed if the linker was able to resolve the assembly reference on demand during MarkStep (support for on-demand assembly resolution in any step is not required at this point) as well as potential speed improvement for loading and analyzing fewer assemblies. It'd also get ILLinker closer to csc behaviour and require only necessary references for processing to be specified and not all possible references as it's required today.

Related issues which would be addressed by this

@marek-safar
Copy link
Contributor Author

/cc @vitek-karas

@sbomer
Copy link
Member

sbomer commented Oct 22, 2020

I plan to break the fix up into the following phases:

@vitek-karas
Copy link
Member

Looks good!

(But not substitutions - as those could potentially modify already-marked IL)

Since we basically just remove code it should be fine. We can theoretically overmark in this case, but it's very very unlikely. Also - there should be basically no dependencies from the already marked code into the newly loaded assembly, since otherwise we would have loaded the dependency earlier.

@marek-safar marek-safar added this to the .NET 6.0 milestone Oct 23, 2020
@marek-safar
Copy link
Contributor Author

Few topics/questions which I think are worth thinking before diving into implementation

  • How are we going to handle missing requested assemblies in the code to the developers?
  • What will be processed for descriptors/attributes/substitutions from already loaded assemblies which point to delay loaded assemblies?
  • How will the custom steps pipeline be triggered for delay loaded assemblies?
  • How do we ensure no assembly is loaded after MarkStep or is the expectation to trigger the re-processing anytime?
  • Will be facades handled differently considering they will always trigger another load?

@sbomer
Copy link
Member

sbomer commented Oct 23, 2020

  • How are we going to handle missing requested assemblies in the code to the developers?

Do you mean unresolved refs that are part of the static assembly reference closure? I think these should be handled the same as they are now - either skipped, or treated as an error (depending on --skip-unresolved).

  • What will be processed for descriptors/attributes/substitutions from already loaded assemblies which point to delay loaded assemblies?

Nothing would change in Phase 1 - such XML would be processed up-front, and it could load new assemblies, but those will not have their own XML processed.

In Phase 2 delay loaded assemblies would have their dependencies XML processed recursively (if marked). This XML can pull in new delay loaded assemblies, etc. It will only support XML that adds new dependencies - not substitutions which would modify already loaded assemblies, same goes for attributes XML. In this phase I suggest not to process delay loaded substitutions/attributes. Substitutions from the static reference closure are still processed as before.

In Phase 3 we could add partial support for delay loaded substitutions/attributes. We could limit them to modifying the containing assembly, or we could do as Vitek suggests and allow modifying already loaded assemblies, but without re-processing them - so it could lead to over-marking. I would prefer limiting the modifications since I find this easier to reason about.

  • How will the custom steps pipeline be triggered for delay loaded assemblies?

I think we will need a similar extensibility mechanism for the delay-loaded per-assembly logic. I am imagining that there is a list of per-assembly steps where custom steps can be added. Maybe we could add something like:

interface IAssemblyStep {
  void ProcessAssembly (AssemblyDefinition assembly);
}

Alternatively, maybe we could use ISubStep for this. This part needs more thought - I think it will be easier to design a good API once the implementation is a bit further along and we know exactly how these steps need to interact with the rest of the linker.

  • How do we ensure no assembly is loaded after MarkStep or is the expectation to trigger the re-processing anytime?

I think the latter - we want MarkStep to be able to add assemblies, for example when they are referenced by reflection, and have most of the processing logic run for these delay loaded assemblies (but not logic that would require re-analyzing already loaded assemblies).

  • Will be facades handled differently considering they will always trigger another load?

I don't think they would need special considerations except that some steps need to traverse exported types. Do you foresee issues here?

@vitek-karas
Copy link
Member

How do we ensure no assembly is loaded after MarkStep or is the expectation to trigger the re-processing anytime?

I think this is an interesting question - where exactly in the pipeline is not that interesting, but there will HAVE to be a point after which modifications are not allowed. Maybe a better way to put this would be "Once Sweep step starts, that's it". It's not that we could not handle it on the Cecil side, but we need to be able to freeze the annotations, so that sweep and similar steps can reason about them as a whole. For example the "Sealer" step is like that - we would only want to run it once "Everything" is annotated.

That kind of partially answers the problem with custom steps - we probably need two kinds of custom steps:

  • Per-assembly, can modify annotations, ...
  • Global-analysis, runs on everything - and after certain point can't modify annotations

@MichalStrehovsky
Copy link
Member

We could limit them to modifying the containing assembly, or we could do as Vitek suggests and allow modifying already loaded assemblies, but without re-processing them - so it could lead to over-marking

I would vote for only allowing substitutions within the assembly that defines the thing being substituted. Otherwise we're opening ourselves interesting problems (linker will do substitutions based on value X returned from a method and eliminate branches at all callsites, then assembly Foo walks in and changes X to Y, invalidating linker's original assumptions). It can be dealt with I guess, but do we really need to?

@vitek-karas
Copy link
Member

I agree that substitutions should only work on the same assembly (tricky for those on command line, but those will "preload" all the necessary code, so it should be doable). The potential problem is that linker then does constant propagation across methods (and assemblies) - my original point was about the fact that constant propagation will basically never go "back" into the already processed (loaded) code. So if we limit substitutions like Michal suggests, it should be OK to also limit the constant propagation to only the newly loaded assembly and any future loaded assemblies, but no need to revisit already loaded assemblies.

I think we should write down the flow - the order of things which should happen - ignore the "step" design, just write down the actions we know need to happen. Something like:

  • Request to load assembly -> load it's XMLs and process them -> apply substitutions -> apply constant propagation -> marking -> ....

And then put together a couple of evil cases (where new assemblies are requested in the middle of the above) and how the system should handle those to get the correct result. THEN we can figure out how to actually implement it (steps, or something else, and so on).

@MichalStrehovsky
Copy link
Member

So if we limit substitutions like Michal suggests, it should be OK to also limit the constant propagation to only the newly loaded assembly and any future loaded assemblies, but no need to revisit already loaded assemblies.

I don't have a good mental model of how the constant propagation works - but I though it will also propagate actual simple bodies (i.e. no XML involved, but if the body is return true, it will eliminate what's basically a if (false) branch at the callsite). In that case, we would need to load the called assembly before processing the callsite anyway and at that point we should be looking at the substitutions in that assembly. It doesn't look like we would have a need to revisit loaded assemblies.

@sbomer
Copy link
Member

sbomer commented Oct 23, 2020

(But not substitutions - as those could potentially modify already-marked IL)

Since we basically just remove code it should be fine. We can theoretically overmark in this case

I mistakenly read this as "processing lazy substitutions is fine because they will only remove code, so by not revisiting we are at worst overmarking", but now I see that you meant "not processing lazy substitutions is fine". :) I think we're all on the same page - we should disallow substitutions that modify already loaded assemblies, no revisiting required.

@vitek-karas
Copy link
Member

One catch - this means that the processing pipeline needs to work "in reverse" - that is:

  • We're working on A - we see a reference to B
  • We load B
  • We process B almost fully (XMLs, substitutions, constant propagation)
    • This may need C
    • So we actually first fully process C
  • Then we get back to A

It's recursive - so we might have trouble with deep callstacks...
There's also potential for loops in dependencies - so we will need some mechanism to "stop" the recursion without breaking the algorithm.

In a way it would be better if we could front-load loading references - maybe this will work:

  • When loading A automatically load all direct assembly references recursively and build a list in the "Reverse dependency order" (somehow dealing with loops)
  • Process the assemblies from the list in the reverse order
  • If there are new references found while processing (DynamicDependency, XMLs, ...) - handle those "after"
    • One effect - if something has a dynamic dependency, its substitutions will be processed after the parent is done. And we will NOT revisit the parent. I don't think this will be a problem (we don't "inline" reflection calls, and that's basically the only way this would have impact).

@sbomer
Copy link
Member

sbomer commented Oct 23, 2020

I agree that we need to load the assemblyrefs up-front - doing otherwise would require us to scan method bodies to find callees before inlining. I don't think the recursion will be a huge issue. Dependency XMLs can't be processed immediately anyway - we need to wait until the assembly gets marked. Since we're only looking at static assemblyrefs, I don't think there will be loops (if the input is valid).

I wasn't planning to process those assemblies in reverse order though - I was going to more or less stick with the current design of RemoveUnreachableBlocksStep and iteratively do inlining over the new assemblies. I think the performance should be no worse than what we do today - we can separate performance from correctness for now. In a perfect world we would not only do bottom-up processing of assemblies, but also bottom-up processing of methods, to avoid re-scanning for new inline candidates.

@vitek-karas
Copy link
Member

OK - the bottom-up can potentially mean that we can run the constant propagation one assembly at a time - which will make it a LOT faster. But I agree that it's mostly a perf optimization, so we can look into that later.

@MichalStrehovsky
Copy link
Member

One thing to consider is that the "constant propagation step" currently happens upfront, and potentially runs on method bodies the linker is not even going to look at (they might be unreferenced).

An option to consider would be to ban linker code from accessing the .Body property directly and instead require these accesses to go through a separate component. This component would do the constant propagation and return an updated MethodBody if applicable (i.e. constant propagation would also happen lazily).

@marek-safar
Copy link
Contributor Author

Do you mean unresolved refs that are part of the static assembly reference closure? I think these should be handled the same as they are now - either skipped, or treated as an error (depending on --skip-unresolved).

It cannot be handled the same way. Today the errors are handled upfront with delay-loading it can happen anywhere so with skip-unresolved false we don't want to produce thousands of errors about wrong types but single error which assembly cannot be found and why.

What is I think even more problematic is that you could end up resolving and hence triggering error which should never happen. Consider an example where you resolve attribute but the attribute should not be resolved at all because one of the attribute arguments triggers assembly load which suppressed the attribute processing and its second attribute arguments point to unavailable assembly.

I don't think they would need special considerations except that some steps need to traverse exported types. Do you foresee issues here?

One thing to consider is that if there is assembly ref to the facade and we are in the mode of removing facades we should probably never resolve delay-loaded assembly reference to facade but that can be complicated if for example, facade assembly embeds XML descriptors.

where exactly in the pipeline is not that interesting, but there will HAVE to be a point after which modifications are not allowed. Maybe a better way to put this would be "Once Sweep step starts, that's it".

I think it's interesting if you consider that someone can inject a custom step and trigger assembly load from anywhere (e.g. last step in the pipeline).

In Phase 2 delay loaded assemblies would have their dependencies XML processed recursively (if marked). This XML can pull in new delay loaded assemblies, etc. It will only support XML that adds new dependencies - not substitutions which would modify already loaded assemblies, same goes for attributes XML. In this phase I suggest not to process delay loaded substitutions/attributes. Substitutions from the static reference closure are still processed as before.

Substitutions are not a problem at all. What is problematic is that existing XML descriptors can and they do in wild reference other assemblies, which with delay loading and your current proposal can be loaded anytime which means even after MarkStep. Second problematic areas can be attributes exclusion, how do you ensure that we don't mark custom attributes if they can have they attribute XML file which describes that they should be removed loaded anytime?

One thing to consider is that the "constant propagation step" currently happens upfront, and potentially runs on method bodies the linker is not even going to look at (they might be unreferenced).

This should change for multiple reasons to happen inside MarkStep and body evaluation would be triggered by marking. I tried to make that change a few months ago but didn't finish it. I could bring it back once I have time for coding.

@sbomer
Copy link
Member

sbomer commented Oct 26, 2020

with skip-unresolved false we don't want to produce thousands of errors about wrong types but single error which assembly cannot be found and why

With --skip-unresolved false, the linker will throw an exception on the first resolution failure - the only difference is that now the failure could happen later, during MarkStep. It should still only produce a single error.

one of the attribute arguments triggers assembly load which suppressed the attribute processing
how do you ensure that we don't mark custom attributes if they can have they attribute XML file which describes that they should be removed loaded anytime?

I think we should treat this the same as substitutions - it's another case where delay loaded XML is trying to modify an already loaded assembly, so it shouldn't be supported.

we should probably never resolve delay-loaded assembly reference to facade but that can be complicated if for example, facade assembly embeds XML descriptors.

Good point - I think today facades can have XML that's processed even if the facade is removed. This is also the behavior that would make sense to me for delay loaded facades - I think their XML should be processed if any typerefs that go through the facade are marked, even if the facade isn't kept. I added a note to Phase 2.1 for this.

there will HAVE to be a point after which modifications are not allowed
someone can inject a custom step and trigger assembly load from anywhere (e.g. last step in the pipeline).

I'm not sure we should be trying to enforce such ordering constraints - they feel very specific to what the step wants to do, and would probably require making parts of the API for custom steps immutable or restricting where custom steps can be inserted. Could you provide some examples of problematic behaviors we want to prevent?

The way I look at it is that custom steps will need to be designed to run correctly at a given point in the pipeline, with the understanding that previous steps won't run again. For example something which modifies an assembly in memory after OutputStep should not expect those changes to be written to disk. Similarly, a step which processes some XML after MarkStep should not expect newly marked code to be processed again by MarkStep.

with delay loading and your current proposal can be loaded anytime which means even after MarkStep

Like above, I'd say that if a step after MarkStep loads a new assembly, it needs to be aware that MarkStep won't re-run. I hope to design it so that calling Resolve won't automatically do the per-assembly logic (load references recursively, process XML, inlining, etc.) - instead there should be a separate API. That way any late step which does need to load a new assembly for some reason (hopefully rare) can call Resolve without triggering a bunch of undesired processing.

"constant propagation step" currently happens upfront
This should change for multiple reasons to happen inside MarkStep and body evaluation would be triggered by marking.

How would we reconcile bottom-up constant propagation with top-down marking? Are we ok with only propagating constants into direct callers? My current feeling is that this is orthogonal to the plan above, right?

@marek-safar
Copy link
Contributor Author

With --skip-unresolved false, the linker will throw an exception on the first resolution failure

As I wrote earlier that logic today has no information about which assembly is missing. Secondly, there are cases where no error is reported when Resolve returns null which presumably would now need some extra state to detect missing assembly case.

I think we should treat this the same as substitutions - it's another case where delay loaded XML is trying to modify an already loaded assembly, so it shouldn't be supported.

No, this is the exact opposite. Delay loaded assembly has information which influences already processed types.

I'm not sure we should be trying to enforce such ordering constraints - they feel very specific to what the step wants to do, and would probably require making parts of the API for custom steps immutable or restricting where custom steps can be inserted. Could you provide some examples of problematic behaviors we want to prevent?

What I was trying to steer you towards is having separate APIs for assembly/type resolving. One which can trigger assembly load and one which never will. I'm concern that without that we could end up with hard to trace bugs unless we ensure the whole pipeline can be reliably processed from any point which I don't think is doable. Imagine a situation where e.g. SweepStep triggers new assembly load, do you suggest we should re-processed everything?

My current feeling is that this is orthogonal to the plan above, right?

Correct, it's not related to this plan but as this topic was mentioned by other I shared what I think we should do there.

sbomer added a commit to sbomer/linker that referenced this issue Nov 4, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Nov 4, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Nov 4, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
@vitek-karas
Copy link
Member

I would love to restrict embedded XML to the same assembly only, but it may not be practical. For example currently it's easy to add embedded XML into the app - VS has UI to do that, you just need to know the name. Passing the same file to linker on command line is a much more complicated process for the user, involving modifying the project file. We may need to come up with a bit more relaxed rule which allows the app much more freedom in usage of the XMLs. But in general I agree we should try to limit the reach of the XMLs such that they don't create these issues.

@MichalStrehovsky
Copy link
Member

For example currently it's easy to add embedded XML into the app - VS has UI to do that, you just need to know the name. Passing the same file to linker on command line is a much more complicated process for the user, involving modifying the project file

IMHO if a user needs to make an XML, we already failed to provide a good user experience. Whether they need to do extra work to hook it up doesn't change the fact that we already took them to a place where they need to figure out strange file formats with odd name mangling rules that is disconnected from IDE and refactoring tools. I wouldn't look at the XML as a user scenario.

@vitek-karas
Copy link
Member

I absolutely agree it's not a mainline scenario, but in the foreseeable future, there will be NuGets and other components which are not trim friendly. Some people may want to use XMLs to annotate assemblies they don't have source code for (for example a NuGet) - but I guess for such case some more csproj magic doesn't matter that much.

@sbomer
Copy link
Member

sbomer commented Nov 7, 2020

What behavior do we want for a dynamically loaded assembly that's unused? For example, GetType("MissingType, MyAssembly") would load MyAssembly without requiring code from it. This won't run module initializers (at least on coreclr), but my reading of the spec is that implementations could run module initializers as soon as the module is loaded.

The current behavior allows the linker to remove MyAssembly. Assuming this is what we want, what should happen if the default action is copy? Normally, copy will keep unused statically referenced assemblies - but it's not clear to me what should happen for unused dynamic references.

@MichalStrehovsky
Copy link
Member

GetType("MissingType, MyAssembly", throwOnError: true) throws different exceptions based on whether the file, or the type doesn't exist. Arguably this is very niche, but I would just keep the assembly so that we get this variation right, unless it's too much work (more than 5 minutes :)) to do that (I assume it really doesn't need much work - if we have trouble keeping assemblies with nothing in it marked, we can just mark the "<Module>" type so that we have something marked for sure and keeping the assembly will probably just fall out).

@sbomer
Copy link
Member

sbomer commented Nov 12, 2020

The plan above was still controversial so I've created a doc with a new design that avoids loading references up-front in most cases: #1626 Hopefully iterating on the design doc will be a better way to get to an agreement.

sbomer added a commit to sbomer/linker that referenced this issue Dec 2, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Dec 3, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Dec 7, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Dec 9, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Dec 10, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Dec 14, 2020
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Jan 4, 2021
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Jan 15, 2021
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
sbomer added a commit to sbomer/linker that referenced this issue Jan 20, 2021
This introduces the ability to add assemblies to LinkContext lazily and have them
processed in MarkStep. For now, the only additional processing is to
load references and run the TypeMap logic. Later, embedded XML processing and
constant propagation will also happen for lazy assemblies.

This is Phase 1 outlined in dotnet#1164 (comment).

Fixes dotnet#943
Fixes dotnet#1079
@marek-safar
Copy link
Contributor Author

Closing as implemented in 6.0

tkapin pushed a commit to tkapin/runtime that referenced this issue Jan 31, 2023
…ker#1186)

This removes the easily removable parts of the TypeMap step. Contributes to dotnet/linker#1164.

Commit migrated from dotnet/linker@e9f0009
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants