Batching of lint and fmt invokes #14186

Merged
stuhood merged 5 commits into pantsbuild:main from stuhood/partition-fmt-and-lint on Jan 19, 2022

Conversation

@stuhood (Member) commented Jan 18, 2022

As described in #13462, there are correctness concerns around not breaking large batches of files into smaller batches in lint and fmt. But there are other reasons to batch, including improving the performance of linters which don't support internal parallelism (by breaking them into multiple processes which can be parallelized).

This change adds a function to sequentially partition a list of items into stable batches, and then uses it to create batches by default in lint and fmt. Sequential partitioning was chosen rather than bucketing by hash, because it was easier to reason about in the presence of minimum and maximum bucket sizes.
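To illustrate the idea, here is a minimal sketch (not the Pants implementation: the real function lives in pants.util.collections and uses the Rust-backed native_engine.hash_prefix_zero_bits; the helper and parameter names below are otherwise hypothetical). The items are sorted, and a batch is closed once it has reached the minimum size and the current item's hash has enough leading zero bits, or unconditionally at the maximum size:

import hashlib
from typing import Callable, Iterable, Iterator, TypeVar

_T = TypeVar("_T")


def _leading_zero_bits(key: str) -> int:
    # Stand-in for the Rust-backed native_engine.hash_prefix_zero_bits: count the
    # leading zero bits of a stable hash of the key.
    bits = 0
    for byte in hashlib.sha256(key.encode()).digest():
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def partition_sequentially_sketch(
    items: Iterable[_T], *, key: Callable[[_T], str], size_min: int, size_max: int
) -> Iterator[list[_T]]:
    # Close a batch once it holds at least `size_min` items _and_ the current item's
    # hash has ~log2(size_min) leading zero bits (a "boundary"), or unconditionally at
    # `size_max`. Boundaries depend only on item hashes, so most batch contents stay
    # stable when nearby items are added or removed.
    zero_prefix_threshold = max(1, size_min.bit_length() - 1)
    batch: list[_T] = []
    for item in sorted(items, key=key):
        batch.append(item)
        if len(batch) >= size_max or (
            len(batch) >= size_min
            and _leading_zero_bits(key(item)) >= zero_prefix_threshold
        ):
            yield batch
            batch = []
    if batch:
        yield batch

Called as, say, partition_sequentially_sketch(files, key=str, size_min=128, size_max=256), most batches keep their exact contents when a handful of files is added or removed, so the corresponding process invocations remain cache hits.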

Additionally, this implementation is at the level of the lint and fmt goals themselves (rather than within individual lint/fmt @rule sets, as originally suggested on the ticket) because that reduces the effort of implementing a linter or formatter, and will likely ease doing further declarative partitioning in those goals (by Field values, for example).

`./pants --no-pantsd --no-local-cache --no-remote-cache-read fmt lint ::` runs ~4% faster than on main.

Fixes #13462.

[ci skip-build-wheels]

[ci skip-rust]

@stuhood force-pushed the stuhood/partition-fmt-and-lint branch from 7531a57 to e28c35b on January 18, 2022 18:43
@stuhood marked this pull request as ready for review on January 18, 2022 18:57

# To stably partition the arguments into ranges of at least `size_min`, we sort them, and
# create a new batch sequentially once we have the minimum number of entries, _and_ we encounter
# an item hash prefixed with a threshold of zeros.
Contributor:

I'm definitely missing what the leading zeros check is about. You explain trying to accommodate adding items disturbing batches minimally, but I don't understand the mechanism of how this helps. Is it tied to characteristics of the Rust hash function used? Maybe a unit test of this function that shows how adding an item to an N bucket chain only results in ~1 bucket changing contents?

Contributor:

+1 to unit tests of this function.

Contributor:

I can sort of see what's going on here, but some comments explaining it would be really helpful. Especially justifying the selection of zero_prefix_threshold .

@stuhood (Member Author):

Expanded the comment and added a test.

@Eric-Arellano (Contributor) left a comment:

Neat, thanks.

"--batch-size",
advanced=True,
type=int,
default=128,
Contributor:

We technically shouldn't change this default after we release this. Thoughts on whether it's worth trying to benchmark what the optimal number is? I imagine that's hard to arrive at, not least because it depends on your machine's specs.

Member:

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Could be fun to benchmark though 😉

@stuhood (Member Author), Jan 19, 2022:

I did a bit of benchmarking to settle on 128 here on an integration branch with #9964 included: 128 was best by ~1%. Additional benchmarking and adjustment after both this and #9964 have landed will be good, since they interplay strongly with one another: I'll include more numbers over there.

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Yea, there are potentially a lot of dimensions here. But I think that from a complexity perspective, we're not going to want to expose per-tool batch size knobs without some quantitative justification (post landing).

Member:

OK, this one had me stumped until I tried it. My assumption was that the batches would fill all the threads, and so in-tool parallelization would only result in over-allocating your resources. What I see, though, is that depending on the number of threads available, the number of files in the target list, and the batch size, there are many points in time where you're running fewer batches than available threads.

Of course (and I hope to show this via data), using the poor-man's in-tool parallelization is likely not ideal, as it isn't dynamic and would result in over-allocation of resources in the "bursts" where more rules are running than there are threads.


@thejcannon (Member) left a comment:

The Python looks good to me! Tomorrow, time-permitting I can give you some timings on our monorepo (and compare them to the poor-man's in-tool parallelism as well).

Additionally, I'd be willing to spend a few cycles giving a data point or two on batch size.


It might be worth measuring the cache characteristics of this change over a few changes. Conceivably this change could help both performance and cache size (as adding new files would only invalidate fewer than N buckets).


And speaking of measuring over a few changes, this has got my wheels spinning on possible bucketing techniques. I'm curious to see how partition_sequentially works (shown via unit tests, ideally 😉 ).

@@ -39,7 +40,7 @@ def register_options(cls, register):


class TffmtRequest(FmtRequest):
-    pass
+    field_set_type = TerraformFieldSet
Member:

(Not specific to this PR)

I find it somewhat noteworthy that as Pants evolves its way of doing things, little idiosyncrasies come out of the woodwork (like this one). In #14182 I noticed GofmtRequest didn't inherit from LintRequest.

It seems to me potentially hazardous that code can "seemingly work" in one dimension or another, but then a change brings to light that it was incorrectly configured. I wonder when third-party plugins become more standard how we can be proactive about avoiding these possible idiosyncrasies.

@stuhood (Member Author), Jan 19, 2022:

This particular change is because mypy can't check some usages of ClassVar, unfortunately. This is an abstract ClassVar that it failed to check was implemented for this subclass of FmtRequest.

In #14182 I noticed GofmtRequest didn't inherit from LintRequest.

That is because @union doesn't actually require that a type extend any particular interface, although @kaos has prototyped changing that: #12577. There is rough consensus that we should change unions, but not exactly how.

"--batch-size",
advanced=True,
type=int,
default=128,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Could be fun to benchmark though 😉

@stuhood (Member Author) commented Jan 19, 2022:

Tomorrow, time-permitting I can give you some timings on our monorepo (and compare them to the poor-man's in-tool parallelism as well).

Thanks! As mentioned in the comments, #9964 will further affect this. My own benchmarking has shown that combining the two is a little bit faster than either alone.

batch.append(item)
if (
    len(batch) >= size_min
    and native_engine.hash_prefix_zero_bits(item_key) >= zero_prefix_threshold
Contributor:

I don't know what the overhead of calling through to Rust is, but are we sure that doing it a lot in a tight loop like this is worth it?

@stuhood (Member Author):

It should be very, very low... similar to calling into any builtin Python function. The Rust function takes a reference to the Python string, so it shouldn't even be copied at the boundary.

@stuhood (Member Author):

>>> timeit.timeit(lambda: native_engine.hash_prefix_zero_bits("example"), number=1000000)
0.19578884500000981
>>> timeit.timeit(lambda: hash("example"), number=1000000)
0.14594443799998658

batch: list[_T] = []

def emit_batch() -> list[_T]:
    assert batch
Contributor:

Not sure we idiomatically use assert in non-test code?

@stuhood (Member Author):

There are a few other non-test instances of assert in util... but this seems like a safe case: validating that a private function is called correctly.

Member:

I'm with @stuhood: using asserts in normal code is a good way to check assumptions, document what might otherwise not be obvious, and of course to coerce mypy.


@stuhood force-pushed the stuhood/partition-fmt-and-lint branch from b83c36c to d14468e on January 19, 2022 05:52
@stuhood (Member Author) commented Jan 19, 2022:

Applied all review feedback: please take another look.

# probability of a hash prefixed with Z zero bits is 1/2^Z, and so to break after N items on
# average, we look for `Z == log2(N)` zero bits.
#
# Breaking on these deterministic boundaries means that adding any single item will affect
Contributor:

Thanks. Ok, but this assumes ... the same CLI specs with file edits/adds in between? In other words, this whole scheme does ~nothing for ./pants lint bob:: followed by ./pants lint :: - those will likely have completely different buckets - no hits at all. If that's right and all this fanciness is to support repeatedly running the same Pants command - that's great, but it seems useful to call that out somewhere. I'm not sure where; here is not the place, since this would be about the larger picture and the call-after-call inputs to this function from a higher layer. If I've got this wrong and this always works - that's awesome, though I'm still mystified how. If I've got it right, it would be great to get a comment spelling out that we're optimizing for a particular use pattern.

@stuhood (Member Author), Jan 19, 2022:

Thanks. Ok, but this assumes ... the same CLI specs with file edits/adds in between? In other words, this whole scheme does ~nothing for ./pants lint bob:: followed by ./pants lint :: - those will likely have completely different buckets - no hits at all.

This is mostly about optimizing the re-running of a single command over time, yea. But because the inputs are sorted before hashing, bob:: should fall entirely into sequential buckets within the entire set of ::, and some of those might be able to hit in the larger set.

Unfortunately, it's hard to make promises about that in the presence of the min/max thresholds. If we were breaking purely along bucket boundaries (without additionally setting min/max sizes), we could make better guarantees (because the bucket boundaries within bob:: would be the same when the rest of :: was added).

Thanks for raising this point though... it does actually seem like a desirable enough use case to try and strengthen this further.

Contributor:

I'm not sure if strengthening this further, or even using the current trickery, belongs in this PR. IIUC, batching is purely a win over the status quo today, even if the batches are simply formed from fixed-size buckets over the sorted input and nothing more. If that's true, then this PR is only muddied by the fiddly bits here, and maybe adding the fiddly bits en masse could be done as a follow-up that gives magic speed-ups for free.

@stuhood (Member Author), Jan 19, 2022:

The reason I think that doing it here might be useful is that it would adjust the meaning of the --batch-size flag (from "minimum size" to "target size")... it should be a quick change, I think. Smaller batches are actually desirable because they improve cache hit rates (which is why the per-file-caching flag existed in the first place).

Member:

@Eric-Arellano brought up a good point above: changing the batch size or implementation could have negative repercussions (due to the cache being invalidated). However, as @jsirois points out, any batching is a win.

IMO the solution with #9964 is the "baseline" and any improvement over that is gravy. If Pants finds better solutions/numbers over time, I'm OK with a one-time cache bust that results in speedier runs going forward.

Because of the amount of variance here with tools/machines/repos, I think the best approach is probably a data-driven one. You might try building a performance test suite to test various numbers/algorithms and asking the community (and/or Toolchain's customers) to run it overnight (or during a dead time) on a machine or two. I know I'd happily volunteer some CPU cycles to help find the optimal config.

Contributor:

The reason I think that doing it here might be useful is that it would adjust the meaning of the --batch-size flag (from "minimum size" to "target size")

IMO it's fine to change the behavior up until 2.10.0rc0. This seems worth landing as-is.

@stuhood (Member Author):

...it was not, in fact, a "quick change", because removing the min threshold and attempting to use only the hash to choose boundaries requires increasing the threshold to the point where our sweet spot of bucket size (~128) is too likely to never encounter a good boundary.

Since the code as-is already snaps to boundaries, I feel fairly confident that we'll already get some matches in the bob:: vs :: case, but can defer looking at that to a followup.

@stuhood (Member Author):

I couldn't help myself. #14210.

"--batch-size",
advanced=True,
type=int,
default=128,
Member:

Does this need to be a power of two?

@stuhood (Member Author):

No.

@Eric-Arellano (Contributor):

I want to check my mental model with this and #9964. We avoid too much parallelism thanks to the engine:

  • it will only schedule processes when there are open semaphores
  • it will adjust internal parallelism (--jobs) appropriately throughout the session
  • You have to make sure your global options are configured appropriately, which is how Pants determines the upper threshold of concurrency.

return set(tuple(p) for p in partition_sequentially(items, key=str, size_min=size_min))

# We start with base items containing every other element from a sorted sequence.
all_items = sorted((f"item{i}" for i in range(0, 64)))
Contributor:

Please do not burn a tree for this, but FYI you already had a generator expression and the extra () is superfluous:

Suggested change:
- all_items = sorted((f"item{i}" for i in range(0, 64)))
+ all_items = sorted(f"item{i}" for i in range(0, 64))

Member:

@Eric-Arellano I think there is a flake8 plugin that detects those... maybe we should add it to this repo.
https://pypi.org/project/flake8-comprehensions/

Contributor:

That'd be great!

Member:

OK, will work on this. PRs coming soon.

@stuhood merged commit 9c1eb9f into pantsbuild:main on Jan 19, 2022
@stuhood deleted the stuhood/partition-fmt-and-lint branch on January 19, 2022 18:28
stuhood added a commit that referenced this pull request Jan 19, 2022
…ility. (#14210)

As a followup to #14186, this change improves the stability (and thus cache hit rates) of batching by removing the minimum bucket size. It also fixes an issue in the tests, and expands the range that they test.

As mentioned in the expanded comments: capping bucket sizes (in either the `min` or the `max` direction) can cause streaks of bucket changes: when a bucket hits a `min`/`max` threshold and ignores a boundary, it increases the chance that the next bucket will trip a threshold as well.

Although it would be most-stable to remove the `max` threshold entirely, it is necessary to resolve the correctness issue of #13462. But we _can_ remove the `min` threshold, and so this change does that.

[ci skip-rust]
[ci skip-build-wheels]
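To make that stability claim concrete, here is a small illustrative snippet (hypothetical helpers, not the actual Pants code): when boundaries are chosen purely from item hashes, adding one item can only perturb the batch it lands in (possibly splitting it), leaving every other batch byte-for-byte identical.

import hashlib


def _is_boundary(item: str, zero_bits: int = 4) -> bool:
    # An item is a batch boundary if its (stable) hash starts with `zero_bits` zero bits.
    return hashlib.sha256(item.encode()).digest()[0] >> (8 - zero_bits) == 0


def batches(items: list[str]) -> list[tuple[str, ...]]:
    # Boundary-only bucketing: no min/max sizes, just deterministic hash boundaries.
    out: list[list[str]] = [[]]
    for item in sorted(items):
        out[-1].append(item)
        if _is_boundary(item):
            out.append([])
    return [tuple(b) for b in out if b]


base = [f"item{i}" for i in range(200)]
changed = set(batches(base)) ^ set(batches(base + ["item999"]))
# Only the batch that "item999" joins differs (plus its two halves if the new item is
# itself a boundary); every other batch is identical and would stay a cache hit.
assert len(changed) <= 3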
@thejcannon (Member):

I didn't see this sooner. Does check belong in this list too?

@stuhood (Member Author) commented Jan 20, 2022:

I didn't see this sooner. Does check belong in this list too?

Possibly. But the big difference with check is that we're expecting tools there to need transitive deps, and to thus benefit more from internal sharing of work than in lint/fmt.

go, java, scala and potentially python should probably all be getting "graph-shaped" parallelism (which allows invokes to consume the output of checked/compiled dependencies). See my most recent comment on the mypy ticket in particular: #10864 (comment)

stuhood added a commit that referenced this pull request Jan 24, 2022
…14184)

When tools support internal concurrency and cannot be partitioned (either because they don't support it, such as in the case of a PEX resolve, or because of overhead to partitioning as fine-grained as desired), Pants' own concurrency currently makes it ~impossible for them to set their concurrency settings correctly.

As sketched in #9964, this change adjusts Pants' local runner to dynamically choose concurrency values per process based on the current concurrency.
1. When acquiring a slot on the `bounded::CommandRunner`, a process takes as much concurrency as it a) is capable of, as configured by a new `Process.concurrency_available` field, and b) deserves for the purposes of fairness (i.e., half, for two processes). This results in some amount of over-commit.
2. Periodically, a balancing task runs and preempts/re-schedules processes which have been running for less than a very short threshold (`200ms` currently) and which are the largest contributors to over/under-commit. This fixes some over/under-commit, but not all of it, because if a process becomes over/under-committed after it has been running a while (because other processes started or finished), we will not preempt it.

Combined with #14186, this change results in an additional 2% speedup for `lint` and `fmt`. But it should also have a positive impact on PEX processes, which were the original motivation for #9964.

Fixes #9964. 

[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Jan 29, 2022
This adds generic support for `lint` implementations that do not deal with targets. That allows us to merge `validate` into `lint`, which is much cleaner.

## CLI specs

As before with the `validate` goal, it's not very intuitive how to get Pants to run on files not owned by targets, which you want for `validate`. `::` only matches files owned by targets, whereas `**` matches _all_ files regardless of targets.

So, users of `regex-lint` should typically use `./pants lint '**'` rather than `./pants lint ::`, which is not intuitive.

https://docs.google.com/document/d/1WWQM-X6kHoSCKwItqf61NiKFWNSlpnTC5QNu3ul9RDk/edit#heading=h.1h4j0d5mazhu proposes changing `::` to match all files, so you can simply use `./pants lint ::`. I don't think we need to block on this proposal? This is still forward progress, and also `validate`/`regex-lint` is not used very much fwict.

## Batching

We don't yet batch per #14186, although it would be trivial for us to hook up. I'm only waiting to do it till we can better reason about if it makes sense to apply here too.

## The `fmt` goal

Note that we need more design for `fmt` before we can apply this same change there. fmt is tricky because we run each formatter for a certain language sequentially so that they don't overwrite each other; but we run distinct languages in parallel. We would need some way to know which "language" target-less files are for.

## "Inferred targets"

A related technology would be inferred targets, where you don't need a BUILD file but we still have a target: #14074.

This is a complementary technology. The main difference here is that we can operate on files that will _never_ have an owning target, such as a BUILD file itself.

[ci skip-rust]
[ci skip-build-wheels]

Successfully merging this pull request may close these issues.

Batch invokes of linters (and other dynamically sized process invokes)