Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic mechanism to codegen sources in V2 #9634

Merged
merged 14 commits into from
Apr 28, 2020

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented Apr 25, 2020

Goals of design

See https://docs.google.com/document/d/1tJ1SL3URSXUWlrN-GJ1fA1M4jm8zqcaodBWghBWrRWM/edit?ts=5ea310fd for more info.

tl;dr:

  1. Protocols now only have one generic target, like avro_library. This means that call sites must declare which language should be generated from that protocol.
    • Must be declarative.
  2. You can still get the original protocol sources, e.g. for ./pants filedeps.
  3. Must work with subclassing of fields.
  4. Must be extensible.
    • Example: Pants only implements Thrift -> Python. A plugin author should be able to add Thrift -> Java.

Implementation

Normally, to hydrate sources, we call await Get[HydratedSources](HydrateSourcesRequest(my_sources_field)). We always use the exact same rule to do this because all sources fields are hydrated identically.

Here, each codegen rule is unique. So, we need to use unions. This means that we also need a uniform product for each codegen rule for the union to work properly. This leads to:

await Get[GeneratedSources](GenerateSourcesRequest, GeneratePythonFromAvroRequest(..))
await Get[GeneratedSources](GenerateSourcesRequest, GenerateJavaFromThriftRequest(..))

Each GenerateSourcesRequest subclass gets registered as a union rule. This achieves goal #4 of extensibility.

--

To still work with subclassing of fields (goal #3), each GenerateSourcesRequest declares the input type and output type, which then allows us to use isinstance() to accommodate subclasses:

class GenerateFortranFromAvroRequest(GenerateSourcesRequest):
    input = AvroSources
    output = FortranSources

--

To achieve goals #1 and #2 of allowing call sites to declaratively either get the original protocol sources or generated sources, we hook up codegen to the hydrate_sources rule and HydrateSourcesRequest type:

protocol_sources = await Get[HydratedSources](HydrateSourcesRequest(avro_sources, for_sources_types=[FortranSources], codegen_enabled=True))

[ci skip-rust-tests]
[ci skip-jvm-tests]

# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Stu, Danny, and Benjy for your help this past week!

src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved

class GenerateFortranFromAvroRequest(GenerateSourcesRequest):
input = AvroSources
output = FortranSources
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: this is technically a lie. We are not returning FortranSources. Rather, we are returning GeneratedSources with a Snapshot of Fortran files.

I originally added a new type called CodegenTargetLanguage so that this would be output = Fortran. I got rid of it because that is still a lie, and it led to too much boilerplate.

# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Eric-Arellano : this looks great!

I expect that more change is likely to be necessary, because I expect that codegenerators are going to need to consume other fields of a Target in order to do their work (for example: it's common to have some generator-specific config on the input Target). But this is a great incremental step.

src/python/pants/engine/target.py Outdated Show resolved Hide resolved
class GeneratedSources:
snapshot: Snapshot


@rule
async def hydrate_sources(
Copy link
Member

@stuhood stuhood Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule probably warrants a docstring that incorporates some of the description from this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a lot of documentation now in the rule body and in the docstring of the relevant classes. I think adding docstring to the rule might be noisy.

src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
@@ -1444,5 +1536,6 @@ def rules():
find_valid_configurations,
hydrate_sources,
RootRule(TargetsToValidConfigurationsRequest),
RootRule(GenerateSourcesRequest),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably not be a RootRule: the conventions are nowhere near solid, but I think of them as being a bit like a public API, and in this case GenerateSourcesRequest is an implementation detail of HydrateSourcesRequest.

If a test wants to poke at those rules more directly, it can add a RootRule itself to do so. But I think that we should think of the roots that we expose in rulesets as their external inputs for general use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, my understanding of RootRules is that you should use them whenever you directly inject a type into the graph, rather than deriving that type from other rules in the graph? Meaning, almost every Request class should be a RootRule because they are almost always directly created through a Python constructor, rather than being derived from some other rule.

Is this mental model the wrong way of understanding RootRule?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mental model the wrong way of understanding RootRule?

Yes, that is the wrong way to think about RootRule. See https://github.com/pantsbuild/pants/blob/master/src/python/pants/engine/README.md#gets-and-rootrules

In short: a Param enters the graph either via a Get or via a RootRule. Things that enter as Gets do not need to be declared as roots. This is where the "root" in the name comes from: you only need to declare something a RootRule if it might come in at the "root" of a graph: ie, scheduler.product_request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, huh. We're declaring way too many RootRules then, I think, in part from bad advice I gave Benjy. I'll clean that up.

cc @benjyw we shouldn't be using RootRule as much as I thought.

Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because I expect that codegenerators are going to need to consume other fields of a Target in order to do their work

Benjy and I talked about this use case a couple of minutes ago. I realized this is indeed possible with the current design :) Every AsyncField, like Sources, has an address property. This means that it's possible to do:

await Get[WrappedTarget](Address, sources_field.address)

@@ -1444,5 +1536,6 @@ def rules():
find_valid_configurations,
hydrate_sources,
RootRule(TargetsToValidConfigurationsRequest),
RootRule(GenerateSourcesRequest),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, my understanding of RootRules is that you should use them whenever you directly inject a type into the graph, rather than deriving that type from other rules in the graph? Meaning, almost every Request class should be a RootRule because they are almost always directly created through a Python constructor, rather than being derived from some other rule.

Is this mental model the wrong way of understanding RootRule?

src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
[ci skip-rust-tests]
# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
[ci skip-rust-tests]
[ci skip-jvm-tests]
[ci skip-rust-tests]
[ci skip-jvm-tests]
Stu and Benjy pointed out that it's common for a target to have fields on it that influence how the compiler runs. So, it will be common for the codegen rules to access the original target.

While we could expect rules to call `await Get[WrappedTarget](Address)` directly in the rules, this makes life simpler for them.
# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
# Conflicts:
#	src/python/pants/engine/target.py

[ci skip-rust-tests]
[ci skip-jvm-tests]
We need this for run_setup_py.py
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
Eric-Arellano added a commit that referenced this pull request Apr 28, 2020
…pected (#9641)

Soon, we will add codegen. With this, we need a way to signal which language should be generated, if any. 

@stuhood proposed in #9634 (comment) that we can extend this setting to indicate more generally which Sources fields are valid, e.g. that we expect to work with `PythonSources` and `FilesSources` (or subclasses), but nothing else. All invalid fields would return an empty snapshot and indicate via a new `HydratedSources.output_type` field that it was an invalid sources field.

This means that call sites can still pre-filter sources fields like they typically do via `tgt.has_field()` (and configurations), but they can also use this new sugar. If they want to use codegen in the upcoming PR, they must use this new mechanism.

Further, when getting back the `HydratedSources`, call sites can switch on the type. Previously, they could do this by zipping the original `Sources` with the resulting `HydratedSources`, but this won't work once we have codegen, as the original `Sources` will be, for example, `ThriftSources`.

```python
if hydrated_sources.output_type == PythonSources:
   ...
elif hydrated_sources.output_type == FilesSources:
   ...
```
 
[ci skip-rust-tests]
[ci skip-jvm-tests]
…rate-sources

[ci skip-jvm-tests]
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
Less of a misnomer. This is not the actual output. It's rather a description of the sources.

This also fits in better with `HydrateSourcesRequest.for_sources_types`.
# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now ready for re-review.

Comment on lines +945 to +947
f"Multiple of the registered code generators can generate {output} from {input}. "
"It is ambiguous which implementation to use.\n\nPossible implementations:"
f"{bulleted_list_sep}{bulleted_list_sep.join(possible_generators)}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads:

Multiple of the registered code generators can generate FortranSources from AvroSources. It is ambiguous which implementation to use.

Possible implementations:
  * FortranGenerator1
  * FortranGenerator2

Comment on lines +959 to +965
super().__init__(
f"Multiple of the registered code generators can generate one of "
f"{possible_output_types} from {input}. It is ambiguous which implementation to "
f"use. This can happen when the call site requests too many different output types "
f"from the same original protocol sources.\n\nPossible implementations with their "
f"output type: {bulleted_list_sep}"
f"{bulleted_list_sep.join(possible_generators_with_output)}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads:

Multiple of the registered code generators can generate one of ['FortranSources', 'SmalltalkSources'] from AvroSources. It is ambiguous which implementation to use. This can happen when the call site requests too many different output types from the same original protocol sources.

Possible implementations with their output type:
  * FortranGenerator1 -> FortranSources
  * SmalltalkGenerator -> SmalltalkSources

Comment on lines 1317 to 1320
@final
@classmethod
def can_generate(cls, output_type: Type["Sources"], union_membership: UnionMembership) -> bool:
"""Can this Sources field be used to generate the output_type?"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this method is important. While we generally expect people to rely on HydratedSourcesRequest.for_sources_types to know if codegen is possible, there are some cases where we need to pre-filter without hydrating the field. For example, run_setup_py.py needs this.

This method allows us to fully support the pre-filtering idiom that we normally use with the Target API via tgt.has_field().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here could maybe indicate that most callers won't need to call this, and recommend HydratedSourcesRequest.

class GeneratedSources:
snapshot: Snapshot


@rule
async def hydrate_sources(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a lot of documentation now in the rule body and in the docstring of the relevant classes. I think adding docstring to the rule might be noisy.

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot Eric! This looks awesome.

Comment on lines 1317 to 1320
@final
@classmethod
def can_generate(cls, output_type: Type["Sources"], union_membership: UnionMembership) -> bool:
"""Can this Sources field be used to generate the output_type?"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here could maybe indicate that most callers won't need to call this, and recommend HydratedSourcesRequest.

# Delete this line to force CI to run Clippy and the Rust tests.
[ci skip-rust-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
@Eric-Arellano Eric-Arellano merged commit 9945df1 into pantsbuild:master Apr 28, 2020
@Eric-Arellano Eric-Arellano deleted the codegen-hydrate-sources branch April 28, 2020 23:01
Eric-Arellano added a commit that referenced this pull request Apr 29, 2020
This creates a generic `protobuf_library()` to be used across all languages. From this, if `pants.backend.codegen.protobuf.python` is activated, then V2 call sites which signal they want to use codegen (via #9634) will automatically convert any `ProtobufSources` into generated Python sources.

### Runtime Protobuf dependency

To function properly, generated Protobuf Python files must depend on the `protobuf` Python wheel. 

Normally, in V1, we would inject this dependency dynamically. For now, the user must explicitly specify the dependency in the BUILD file like they do for any normal Python requirement. 

### No `gen` goal

Users do not have a direct way to inspect the generated files. For now, they would need to have a `python_binary()` depend on the `protobuf_library`s, then inspect the built PEX after `./v2 binary`.

### How we handle source roots

Protoc needs to generate files with the source roots stripped. `src/protobuf/example/f.proto` becomes `example/f_pb2.py`. But, Protoc won't naively understand the input file `src/protobuf/example/f.proto` because it doesn't know how to naively handle source roots.

In V1, we used the flag `--proto_path` to teach Protoc how to understand source roots.

Instead, here, we take the approach of V2 Python of stripping off the source roots for these reasons:

1) We have lots of utility rules for stripping source roots already.
2) Those utilities have important performance optimizations.
     * Specifically, we only inspect 1 `sources` file for the target to determine the source root, rather than naively inspecting every `sources` file.
3) Parity with v2 Python implementation.

[ci skip-rust-tests]
[ci skip-jvm-tests]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants