Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support @generated marker to skip code formatting #4877

Merged
merged 1 commit into from
Sep 8, 2021

Conversation

stepancheg
Copy link
Contributor

@stepancheg stepancheg commented Jun 20, 2021

@generated marker is used by certain tools to understand that the
file is generated, so it should be treated differently than a file
written by a human:

  • linters should not be invoked on these files,
  • diffs in these files are less important,
  • and these files should not be reformatted.

This PR proposes builtin support for @generated marker.

I have not found a standard for a generated file marker, but:

My personal story is that rust-protobuf project which I maintain
was broken twice because of incompatibilities/bugs in rustfmt marker
handling: one, two.
(Also, rust-protobuf started generating @generated marker 6 years ago).

While rustfmt AST markers are useful to apply to a certain AST
elements, disable whole-file-at-once all-tools-at-once text level
marker might be easier to use and more reliable for generated code.

Copy link
Member

@calebcartwright calebcartwright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi 👋 thanks for opening the PR and the detailed data in the description!

This is actually a duplicate of #3958 with a different resolution. #3958 was resolved via #4296 but this was done against a different, experimental branch in the project for a potential v2 of rustfmt, and the support for @generated hasn't yet made into the rustfmt mainline and release train.

If you'd be interested in trying to grab those commits from #4296 (can be found on the rustfmt-2.0.0-rc2 branch https://github.com/rust-lang/rustfmt/tree/rustfmt-2.0.0-rc.2) then I'd be happy to get those changes incorporated into rustfmt. However, I don't want to move forward with the alternative implementation you've proposed here, as always running an up front full text scan of every file is too open ended and involved considering that generators can easily stick the marker at the top of file as #4296 supports

@stepancheg
Copy link
Contributor Author

stepancheg commented Jun 21, 2021

I did not get it, that branch is called RC, does it means it will be released soon?

There are two issues with that implementation:

  • only comment on the first line is supported, which is too restrictive. Not life and death issue, but just unnecessary restriction, it is not noticable for performance or anything to read more. Maybe scan five lines at least?
  • that implementation opens the file twice, once to understand the file is not generated and second one to actually parse it. I suspect on average parsing the whole file and then scan for the marker will be faster than opening the file twice. Simply because:
    • Majority of files are not generated, so default non-generated case should be optimized
    • Opening a file is several microseconds at least
    • Scanning for the @generated marker even in the whole file is faster than 1us

So I did a little benchmark. On my mac laptop the results are:

  • scan for "@generated" marker in a 10K file — 500ns
  • read a first line of the file — 15000ns

Of course, it is not life and death issue either.

What I would do:

  • take the mentioned patch as a base (because on average it is better quality than mine — at least it has a configuration option)
  • patch it to open/read the file once
  • scan for @generated in the whole file (ideally) or at least in the longer file prefix
  • keep the logic of avoiding parsing if there's a @generated marker

@calebcartwright
Copy link
Member

I did not get it, that branch is called RC, does it means it will be released soon?

No, and don't really want to get off topic here. The only point is that an implementation was already added, but on a different branch.

so default non-generated case should be optimized

sure, but keep in mind that the skip with @generated use case will be the least common use case by far so we shouldn't be optimizing for this at the expense of the use cases that are actually common (regardless of how big/small the penalty is)

I suspect on average parsing the whole file and then scan for the marker will be faster than opening the file twice.

I'm not sure I follow. In both cases, including the existing implementation as well as your initial proposal, the file is still parsed to produce an AST prior to checking any contents in the associated span.

only comment on the first line is supported, which is too restrictive. Not life and death issue, but just unnecessary restriction, it is not noticable for performance or anything to read more. Maybe scan five lines at least?

Can you elaborate on why this matters? The whole point of generated code in this context is that it's not human written and rustfmt shouldn't have to be concerned with the arbitrary nature of human-written code like we do everywhere else, since any generators leveraging this pattern should be able to easily conform to a standard pattern.

There's no valid use case I can see for generators needing to stick @generated at the end of the file, much less some random location within the file. Adding support for generated to rustfmt in such an open ended way would open up the risk of potential unintended false positives where users happen to have @generated somewhere within their real, user-written and user managed code since this is something that has historically never meant "rustfmt skip this file please", and they may not be aware of the meaning in other languages.

That's not a risk I'm willing to accept.

If a generator wants to use this alternative instead of the other, more commonly used options, then it seems entirely reasonable to stick something at (or near) the top of the file.

I'm open to improvements/alternatives to what's already been incorporated, but not with what's originally been proposed here. Additionally, the original implementation should be improvable now, as we already have the corresponding span for the module which will cover the entire file (subsequent upstream AST changes), and we can peek inside the corresponding snippet without have to (potentially) open the file again.

The only question really is: how much of that snippet do we really need to search? If you have some examples/benefits of expanding that beyond the first line that'd be helpful and will certainly consider, but the answer can't be "the entire snippet" 😄

@stepancheg
Copy link
Contributor Author

stepancheg commented Jun 21, 2021

I did not get it, that branch is called RC, does it means it will be released soon?
No, and don't really want to get off topic here. The only point is that an implementation was already added, but on a different branch.

Was just trying to understand how things work. In particular, against which branch I need to submit a PR. Now I got it, against master, and that 2-branch is not going anywhere soon. Just that -rc name confused me.

I'm not sure I follow. In both cases, including the existing implementation as well as your initial proposal, the file is still parsed to produce an AST prior to checking any contents in the associated span.

Let's skip this part of the discussion, because I think we are on the same page: implementation should be efficient.

Can you elaborate on why this matters? The whole point of generated code in this context is that it's not human written and rustfmt shouldn't have to be concerned with the arbitrary nature of human-written code like we do everywhere else, since any generators leveraging this pattern should be able to easily conform to a standard pattern.

  • In rust-protobuf I placed @generated marker on the third line.
  • I did some code search in some undisclosed codebase where generated markers are used often, the result is:
... a lot of markers on various lines, and the top by count on line number is:
    119 4
   6240 3
    193 2
   8104 1

so quite a lot of generated markers are not on the first line. A lot of markers are on the second line because it is python or shell scripts which require shebang on the first line. But sometimes people place it on the third line and further. I don't know why people do that. Probably from some aesthetic preferences. That's why I suggested using top 5 lines.

There's no valid use case I can see for generators needing to stick @generated at the end of the file

I think there are no practical reasons to put in in the end of the file, but there might be reasons to read it below 5th line, for example:

  • It might be more consistent across tooling: like if there's another tool which supports the marker, ideally rules should be the same across all the tools.
  • Legal department (or a linter provided by legal department) may require a license to be in the beginning of the file, so @generated marker need to be inserted after the license header

But I'm fine with allowing @generated only at the top 5, or even only on the first line.

This is a copy of rust-lang#4296 with these changes:
* file is not reopened again to find if the file is generated
* first five lines are scanned for `@generated` marker instead of one
* no attempt is made to only search for marker in comments

`@generated` marker is used by certain tools to understand that the
file is generated, so it should be treated differently than a file
written by a human:
* linters should not be invoked on these files,
* diffs in these files are less important,
* and these files should not be reformatted.

This PR proposes builtin support for `@generated` marker.

I have not found a standard for a generated file marker, but:
* Facebook [uses `@generated` marker](https://tinyurl.com/fb-generated)
* Phabricator tool which was spawned from Facebook internal tool
  [also understands `@generated` marker](https://git.io/JnVHa)
* Cargo inserts `@generated` marker into [generated Cargo.lock files](https://git.io/JnVHP)

My personal story is that rust-protobuf project which I maintain
was broken twice because of incompatibilities/bugs in rustfmt marker
handling: [one](stepancheg/rust-protobuf#493),
[two](stepancheg/rust-protobuf#551).
(Also, rust-protobuf started generating `@generated` marker
[6 years ago](https://git.io/JnV5h)).

While rustfmt AST markers are useful to apply to a certain AST
elements, disable whole-file-at-once all-tools-at-once text level
marker might be easier to use and more reliable for generated code.
@stepancheg
Copy link
Contributor Author

stepancheg commented Jun 22, 2021

OK, I picked #4296 and applied the following changes:

  • original commit author @topecongiro preserved
  • file is no longer reopened again to find is the file is generated
  • marker is searched in the first five lines instead of one
  • no attempt is made to search for a marker only in comments:
    • original patch allowed marked to be specified like const X: &str = "@generated"; // — this line contains a comment and a marker
    • proper extraction of marker only in comment would require more work

Good or something need to be adjusted?

@calebcartwright
Copy link
Member

Thanks @stepancheg. I didn't get a chance to look at the updates yet, but I like the idea of combining the best parts of both approaches, particularly since there's been some upstream changes in the AST around the mod spans we can now leverage.

Will try to take a look tomorrow

@stepancheg
Copy link
Contributor Author

GitHub super-linter now has opt-in supports for @generated.

Also, I have created a website to promote using @generated as standard.

@calebcartwright
Copy link
Member

Also, I have created a website to promote using @generated as standard.

Nice!

I forgot to post but I'm good with the updated approach. I'd like to add some more tests and want to experiment with some simplification based on the spans, but I'll handle that separately.

I have some PR juggling to do with a subtree sync so will be a bit longer before this makes it into nightly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants