
Batch invokes of linters (and other dynamically sized process invokes) #13462

Closed
ryanking opened this issue Nov 2, 2021 · 9 comments · Fixed by #14186

ryanking (Contributor) commented Nov 2, 2021

Pants does not limit the number of files passed to black and isort at the same time, which leads to errors like `Error launching process: Os { code: 7, kind: Other, message: "Argument list too long" }`.

Pants version
2.7.0

OS
macOS

Additional info
We have about 4500 Python files in our repo, so when I run lint across the whole repo at once I get something like "Run Black on 4434 files.".

@ryanking added the bug label Nov 2, 2021
@stuhood changed the title from "pants sends to many files to black and isort" to "pants sends too many files to black and isort" Nov 2, 2021
stuhood (Member) commented Nov 2, 2021

Sorry for the trouble! We really ought to be generating batches in this case for performance reasons anyway: we only support either "run for all" or "run per file" (https://www.pantsbuild.org/docs/reference-lint#advanced-options), and the performance sweet spot (not to mention the size that would avoid issues like this) is likely to be batches of a size somewhere in between.

stuhood (Member) commented Jan 5, 2022

Assuming that we do batching, it seems like there is also an opportunity to align with #9964, and to dynamically batch based on how much parallelism is available. But dynamically choosing batch sizes based on parallelism would require manipulating cache keys (because otherwise hosts with different numbers of cores would never get remote cache hits)... and that is a risky business, which also requires figuring out how to safely split the outputs of tools (or to require that batched processes opt-out of capturing outputs when they succeed, for example).

@stuhood changed the title from "pants sends too many files to black and isort" to "Batch invokes of linters" Jan 5, 2022
@stuhood changed the title from "Batch invokes of linters" to "Batch invokes of linters (and other dynamically sized process invokes)" Jan 5, 2022
@stuhood self-assigned this Jan 5, 2022
thejcannon (Member) commented Jan 7, 2022

So it turns out there is a "hack" that can be used in the meantime.

Setting the relevant parallelism options for each tool turns out to be a "poor man's batch": it is faster than not setting them (obviously), but also faster than --per-file-caching.

E.g., if I want to run fmt with isort and black on a 64-core machine, running with --black-args='-W 64' --isort-args='--jobs=64' is always the fastest option.
Whereas with lint (which runs the tools in parallel), you probably want to take the N cores and divide by the M checkers: --black-args='-W 16' --isort-args='--jobs=16' --flake8-args='--jobs=16' --pylint-args='--jobs=16'.

Depending on how long fixing the issue might take, it might be worth finding/writing clever little plugins to set the number of jobs smartly based on CPU count.
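
For illustration, here is a minimal sketch of that cores-divided-by-checkers arithmetic as a throwaway helper script. The script and its names are hypothetical; only the tool flags shown above are real:

import os

# Hypothetical mapping from tool name to the flag that controls its internal parallelism
# (these match the flags used in the command lines above).
JOB_FLAGS = {
    "black": "-W {n}",
    "isort": "--jobs={n}",
    "flake8": "--jobs={n}",
    "pylint": "--jobs={n}",
}


def pants_parallelism_flags(tools: list[str]) -> list[str]:
    # Divide the available cores evenly across the tools that `lint` runs in parallel.
    jobs = max(1, (os.cpu_count() or 1) // len(tools))
    return [f"--{tool}-args='{JOB_FLAGS[tool].format(n=jobs)}'" for tool in tools]


if __name__ == "__main__":
    # On a 64-core machine this prints, e.g.:
    #   --black-args='-W 16' --isort-args='--jobs=16' --flake8-args='--jobs=16' --pylint-args='--jobs=16'
    print(" ".join(pants_parallelism_flags(["black", "isort", "flake8", "pylint"])))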

stuhood (Member) commented Jan 7, 2022

@thejcannon : Thanks for the report: that's very useful information. I don't think that I realized that so many Python linters supported parallelization arguments, but it makes sense. Will post a summary soon.

thejcannon (Member) commented:

FWIW as well, I'd expect those linters to have better parallelization in-tool than out-of-tool because they know what can/can't be shared across processes and have the ability to play clever tricks.

That being said out-of-tool parallelization still beats no parallelization 😉

stuhood (Member) commented Jan 7, 2022

There are a few angles to this issue and to #9964:

In the long term, we want to lower the per-process overhead (from 1. sandbox creation, 2. process startup, 3. inability to reuse shared inputs) to the point where running per-file is clearly the right choice to maximize cache hit rates (by reducing/reusing sandboxes, and even reusing processes in sandboxes with nailgun).

But on the (long) road to zero per-process overhead, we should implement batching both for the purposes of:

  1. correctness in the case of very large argument lists (when processes don't have support for @argfiles... i.e., the original problem of this ticket)
  2. parallelism (when processes don't have native support for parallelizing themselves)

Because batching inherently affects the output of processes, it will require use of an explicit API. While it would be very interesting to be able to automatically batch all single-file processes, we cannot generically split outputs. It would also mean doing lots of redundant work in @rules, and potentially throwing away knowledge that files will be processed identically, only to rediscover it later.

AFAICT, there is no need for the rust code's cooperation in this, so I'm planning to implement this as a new @rule to ease iteration: probably something like:

from dataclasses import dataclass

from pants.engine.process import Process


@dataclass(frozen=True)
class BatchProcess:
    # A process representing all inputs, but with a partial `args` list which will be
    # filled in with (portions of) the `xargs` list.
    process_template: Process
    # A list of arguments which will be partitioned as needed and appended to the `args`.
    xargs: tuple[str, ...]

Future implementations could support templating into other positions in the args list. Additionally, how to handle process outputs can be configurable, but will likely start out as per-process stdout/stderr concatenated, and output digests merged.


This will get us a small amount of additional parallelism in some cases, but as @thejcannon pointed out: the vast majority of the processes we invoke support their own parallelism, and so BatchProcess should not partition the xargs for the purposes of parallelism (by default): only to avoid OS arg length limits (EDIT: having written up #9964, there might still be some advantage to lowering the batch size here, since it allows better behavior when we're overcommitted. Note to self: use stable hash prefix partitioning).
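
For illustration, a minimal sketch of partitioning the xargs purely to stay under an OS argument-length limit; the function name and the byte budget are assumptions, not the eventual implementation:

def partition_for_arg_limit(xargs: tuple[str, ...], max_bytes: int = 100_000) -> list[list[str]]:
    # Greedily fill batches until adding another arg would exceed the (conservatively
    # assumed) byte budget, keeping every batch well below typical ARG_MAX values.
    batches: list[list[str]] = [[]]
    used = 0
    for arg in xargs:
        cost = len(arg) + 1  # +1 for the separator / trailing NUL
        if batches[-1] and used + cost > max_bytes:
            batches.append([])
            used = 0
        batches[-1].append(arg)
        used += cost
    return batches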

To take advantage of that internal parallelism, we will likely also implement #9964: expect more there soon.

thejcannon (Member) commented:

I like the idea! Some thoughts (which also apply to #9964 so I'll repeat there):

  • For clarification, I assume caching won't be any different here than with --no-per-file-caching, right? This is essentially still an "all-in-one" run, just now parallel under-the-hood.
  • You were likely already going to do this, but it makes sense to error if both this and --per-file-caching are enabled.
  • It is worth some time to think about how this change affects the dynamic UI. Seeing Running pylint on XYZ files in parallel in one swimlane and having N other swimlanes idle is likely very confusing. I can't say I have a good idea for how to fix this (other than maybe prodash, see #13481 "Render dynamic UI with prodash"), but it warrants thought.

Additionally, for this issue (and not #9964), it might also be prudent to warn if this is enabled for a tool that is supported via #9964, since that will almost certainly result in faster runs.

thejcannon (Member) commented Jan 8, 2022

(Also duplicating this comment on #9964.)

As a single data point: on my 64-core machine, with formatters yapf and isort and linters pylint and flake8, and with a warm cache (all prereqs already handled via a warmup run, then touching N files with a timestamp comment and re-running), I see:

  Scenario                                 --per-file-caching   --no-per-file-caching   --no-per-file-caching + parallel --tool-args
  fmt, 5 files                             1.3 s                1.5 s                   0.974 s
  fmt, 100 files                           6.8 s                13.6 s                  3.1 s
  lint (pylint + flake8 only), 30 files    10.9 s               32.3 s                  13.8 s
  lint (pylint + flake8 only), 250 files   59.9 s               120.2 s                 30.7 s

So I expect this issue to land somewhat less performant than the last column in each case (which is what #9964 would provide), but not by much.

thejcannon (Member) commented:

With enough similar metrics you might be able to deprecate `--per-file-caching` 🤔

stuhood added a commit that referenced this issue Jan 13, 2022
To prepare to add batching for formatters as part of #13462, this change removes the need to implement per-backend `@rules` that pipeline `FmtRequest`s that apply to a particular language. Instead, targets are grouped by which `FmtRequest`s apply to them, and then those requests are run sequentially.

There will be further changes to the formatting API in support of #13462, so this API is not final.

[ci skip-rust]
[ci skip-build-wheels]
stuhood added a commit that referenced this issue Jan 19, 2022
As described in #13462, there are correctness concerns around not breaking large batches of files into smaller batches in `lint` and `fmt`. But there are other reasons to batch, including improving the performance of linters which don't support internal parallelism (by breaking them into multiple processes which _can_ be parallelized).

This change adds a function to sequentially partition a list of items into stable batches, and then uses it to create batches by default in `lint` and `fmt`. Sequential partitioning was chosen rather than bucketing by hash, because it was easier to reason about in the presence of minimum and maximum bucket sizes.

Additionally, this implementation is at the level of the `lint` and `fmt` goals themselves (rather than within individual `lint`/`fmt` `@rule` sets, as originally suggested on this ticket) because that reduces the effort of implementing a linter or formatter, and would likely ease doing further "automatic"/declarative partitioning in those goals (by `Field` values, for example).

`./pants --no-pantsd --no-local-cache --no-remote-cache-read fmt lint ::` runs about 4% faster than on main.

Fixes #13462.

[ci skip-build-wheels]
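
For illustration, a minimal sketch of what partitioning into stable, roughly size-targeted batches could look like. The name partition_into_stable_batches, the sort order, and the hash-based boundary rule are assumptions made for this sketch, not the actual helper added in #14186:

import hashlib
from typing import Iterable


def partition_into_stable_batches(items: Iterable[str], size_target: int) -> list[list[str]]:
    # Walk the items in sorted order and close a batch only at a "stable" boundary item
    # (chosen by hashing the item itself), so that adding or removing a few files shifts
    # the neighboring batches instead of reshuffling every batch and defeating caching.
    batches: list[list[str]] = [[]]
    for item in sorted(items):
        batches[-1].append(item)
        if len(batches[-1]) >= size_target:
            digest = hashlib.sha256(item.encode()).digest()
            if digest[0] < 64:  # roughly one in four items qualifies as a boundary
                batches.append([])
    if not batches[-1]:
        batches.pop()
    return batches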
stuhood added a commit that referenced this issue Jan 19, 2022
…ility. (#14210)

As a followup to #14186, this change improves the stability (and thus cache hit rates) of batching by removing the minimum bucket size. It also fixes an issue in the tests, and expands the range that they test.

As mentioned in the expanded comments: capping bucket sizes (in either the `min` or the `max` direction) can cause streaks of bucket changes: when a bucket hits a `min`/`max` threshold and ignores a boundary, it increases the chance that the next bucket will trip a threshold as well.

Although it would be most stable to remove the `max` threshold entirely, that threshold is needed to resolve the correctness issue of #13462. But we _can_ remove the `min` threshold, and so this change does that.

[ci skip-rust]
[ci skip-build-wheels]