Batching of lint and fmt invokes #14186

Merged
stuhood merged 5 commits into pantsbuild:main from stuhood/partition-fmt-and-lint on Jan 19, 2022

Conversation

@stuhood (Member) commented Jan 18, 2022

As described in #13462, there are correctness concerns around not breaking large batches of files into smaller batches in lint and fmt. But there are other reasons to batch, including improving the performance of linters which don't support internal parallelism (by breaking them into multiple processes which can be parallelized).

This change adds a function to sequentially partition a list of items into stable batches, and then uses it to create batches by default in lint and fmt. Sequential partitioning was chosen rather than bucketing by hash, because it was easier to reason about in the presence of minimum and maximum bucket sizes.
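To illustrate the idea, here is a minimal sketch (not the Pants implementation: the real function lives in pants.util.collections and uses the Rust-backed native_engine.hash_prefix_zero_bits; the helper and parameter names below are otherwise hypothetical). The items are sorted, and a batch is closed once it has reached the minimum size and the current item's hash has enough leading zero bits, or unconditionally at the maximum size:

import hashlib
from typing import Callable, Iterable, Iterator, TypeVar

_T = TypeVar("_T")


def _leading_zero_bits(key: str) -> int:
    # Stand-in for the Rust-backed native_engine.hash_prefix_zero_bits: count the
    # leading zero bits of a stable hash of the key.
    bits = 0
    for byte in hashlib.sha256(key.encode()).digest():
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def partition_sequentially_sketch(
    items: Iterable[_T], *, key: Callable[[_T], str], size_min: int, size_max: int
) -> Iterator[list[_T]]:
    # Close a batch once it holds at least `size_min` items _and_ the current item's
    # hash has ~log2(size_min) leading zero bits (a "boundary"), or unconditionally at
    # `size_max`. Boundaries depend only on item hashes, so most batch contents stay
    # stable when nearby items are added or removed.
    zero_prefix_threshold = max(1, size_min.bit_length() - 1)
    batch: list[_T] = []
    for item in sorted(items, key=key):
        batch.append(item)
        if len(batch) >= size_max or (
            len(batch) >= size_min
            and _leading_zero_bits(key(item)) >= zero_prefix_threshold
        ):
            yield batch
            batch = []
    if batch:
        yield batch

Called as, say, partition_sequentially_sketch(files, key=str, size_min=128, size_max=256), most batches keep their exact contents when a handful of files is added or removed, so the corresponding process invocations remain cache hits.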

Additionally, this implementation is at the level of the lint and fmt goals themselves (rather than within individual lint/fmt @rule sets, as originally suggested on the ticket) because that reduces the effort of implementing a linter or formatter, and will likely ease doing further declarative partitioning in those goals (by Field values, for example).

`./pants --no-pantsd --no-local-cache --no-remote-cache-read fmt lint ::` runs ~4% faster than on main.

Fixes #13462.

[ci skip-build-wheels]

[ci skip-rust]

@stuhood force-pushed the stuhood/partition-fmt-and-lint branch from 7531a57 to e28c35b on January 18, 2022 18:43
@stuhood marked this pull request as ready for review on January 18, 2022 18:57

# To stably partition the arguments into ranges of at least `size_min`, we sort them, and
# create a new batch sequentially once we have the minimum number of entries, _and_ we encounter
# an item hash prefixed with a threshold of zeros.
Contributor:

I'm definitely missing what the leading zeros check is about. You explain trying to accommodate adding items disturbing batches minimally, but I don't understand the mechanism of how this helps. Is it tied to characteristics of the Rust hash function used? Maybe a unit test of this function that shows how adding an item to an N bucket chain only results in ~1 bucket changing contents?

Contributor:

+1 to unit tests of this function.

Contributor:

I can sort of see what's going on here, but some comments explaining it would be really helpful. Especially justifying the selection of zero_prefix_threshold .

@stuhood (Member Author):

Expanded the comment and added a test.

@Eric-Arellano (Contributor) left a comment:

Neat, thanks.

"--batch-size",
advanced=True,
type=int,
default=128,
Contributor:

We technically shouldn't change this default after we release this. Thoughts on whether it's worth trying to benchmark what the optimal number is? I imagine that's hard to arrive at, not least because it depends on your machine's specs.

Member:

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Could be fun to benchmark though 😉

@stuhood (Member Author), Jan 19, 2022:

I did a bit of benchmarking to settle on 128 here on an integration branch with #9964 included: 128 was best by ~1%. Additional benchmarking and adjustment after both this and #9964 have landed will be good, since they interplay strongly with one another: I'll include more numbers over there.

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Yea, there are potentially a lot of dimensions here. But I think that from a complexity perspective, we're not going to want to expose per-tool batch size knobs without some quantitative justification (post landing).

Member:

OK, this one had me stumped until I tried it. My assumption was that the batches would fill all the threads, and so in-tool parallelization would only result in over-allocating your resources. What I see, though, is that depending on the number of threads available, the number of files in the target list, and the batch size, there are many points in time where you're running fewer batches than available threads.

Of course (and I hope to show this via data), using the poor-man's in-tool parallelization is likely not ideal, as it isn't dynamic and would result in over-allocation of resources in the "bursts" where more rules are running than there are threads.


@thejcannon (Member) left a comment:

The Python looks good to me! Tomorrow, time-permitting I can give you some timings on our monorepo (and compare them to the poor-man's in-tool parallelism as well).

Additionally, I'd be willing to spend a few cycles giving a data point or two on batch size.


It might be worth measuring the cache characteristics of this change over a few changes. Conceivably this change could help both performance and cache size (as adding new files would only invalidate fewer than N buckets).


And speaking of measuring over a few changes, this has got my wheels spinning on possible bucketing techniques. I'm curious to see how partition_sequentially works (shown via unit tests, ideally 😉 ).

@@ -39,7 +40,7 @@ def register_options(cls, register):


class TffmtRequest(FmtRequest):
-    pass
+    field_set_type = TerraformFieldSet
Member:

(Not specific to this PR)

I find it somewhat noteworthy that as Pants evolves its way of doing things, little idiosyncrasies come out of the woodwork (like this one). In #14182 I noticed GofmtRequest didn't inherit from LintRequest.

It seems to me potentially hazardous that code can "seemingly work" in one dimension or another, but then a change brings to light that it was incorrectly configured. I wonder when third-party plugins become more standard how we can be proactive about avoiding these possible idiosyncrasies.

@stuhood (Member Author), Jan 19, 2022:

This particular change is because mypy can't check some usages of ClassVar, unfortunately. This is an abstract ClassVar that it failed to check was implemented for this subclass of FmtRequest.

In #14182 I noticed GofmtRequest didn't inherit from LintRequest.

That is because @union doesn't actually require that a type extend any particular interface, although @kaos has prototyped changing that: #12577. There is rough consensus that we should change unions, but not exactly how.

"--batch-size",
advanced=True,
type=int,
default=128,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔

Could be fun to benchmark though 😉

@stuhood (Member Author) commented Jan 19, 2022:

Tomorrow, time-permitting I can give you some timings on our monorepo (and compare them to the poor-man's in-tool parallelism as well).

Thanks! As mentioned in the comments, #9964 will further affect this. My own benchmarking has shown that combining the two is a little bit faster than either alone.

batch.append(item)
if (
    len(batch) >= size_min
    and native_engine.hash_prefix_zero_bits(item_key) >= zero_prefix_threshold
Contributor:

I don't know what the overhead of calling through to Rust is, but are we sure that doing it a lot in a tight loop like this is worth it?

@stuhood (Member Author):

It should be very, very low... similar to calling into any builtin Python function. The Rust function takes a reference to the Python string, so it shouldn't even be copied at the boundary.

@stuhood (Member Author):

>>> timeit.timeit(lambda: native_engine.hash_prefix_zero_bits("example"), number=1000000)
0.19578884500000981
>>> timeit.timeit(lambda: hash("example"), number=1000000)
0.14594443799998658

batch: list[_T] = []

def emit_batch() -> list[_T]:
    assert batch
Contributor:

Not sure we idiomatically use assert in non-test code?

@stuhood (Member Author):

There are a few other non-test instances of assert in util... but this seems like a safe case: validating that a private function is called correctly.

Member:

I'm with @stuhood: using asserts in normal code is a good way to check assumptions, document what might otherwise not be obvious, and of course to coerce mypy.


@stuhood force-pushed the stuhood/partition-fmt-and-lint branch from b83c36c to d14468e on January 19, 2022 05:52
@stuhood (Member Author) commented Jan 19, 2022:

Applied all review feedback: please take another look.

# probability of a hash prefixed with Z zero bits is 1/2^Z, and so to break after N items on
# average, we look for `Z == log2(N)` zero bits.
#
# Breaking on these deterministic boundaries means that adding any single item will affect
Contributor:

Thanks. Ok, but this assumes ... the same CLI specs with file edits/adds in between? In other words, this whole scheme does ~nothing for ./pants lint bob:: followed by ./pants lint :: - those will likely have completely different buckets - no hits at all. If that's right and all this fanciness is to support repeatedly running the same Pants command - that's great, but it seems useful to call that out somewhere. I'm not sure where; here is not the place, since this would be about the larger picture and the call-after-call inputs to this function from a higher layer. If I've got this wrong and this always works - that's awesome, though I'm still mystified how. If I've got it right, it would be great to get a comment spelling out that we're optimizing for a particular use pattern.

@stuhood (Member Author), Jan 19, 2022:

Thanks. Ok, but this assumes ... the same CLI specs with file edits/adds in between? In other words, this whole scheme does ~nothing for ./pants lint bob:: followed by ./pants lint :: - those will likely have completely different buckets - no hits at all.

This is mostly about optimizing the re-running of a single command over time, yea. But because the inputs are sorted before hashing, bob:: should fall entirely into sequential buckets within the entire set of ::, and some of those might be able to hit in the larger set.

Unfortunately, it's hard to make promises about that in the presence of the min/max thresholds. If we were breaking purely along bucket boundaries (without additionally setting min/max sizes), we could make better guarantees (because the bucket boundaries within bob:: would be the same when the rest of :: was added).

Thanks for raising this point though... it does actually seem like a desirable enough use case to try and strengthen this further.

Contributor:

I'm not sure if strengthening this further, or even using the current trickery, belongs in this PR. IIUC, batching is purely a win over the status quo today, even if the batches are simply formed from fixed-size buckets over the sorted input and nothing more. If that's true, then this PR is only muddied by the fiddly bits here, and maybe adding the fiddly bits en masse could be done as a follow-up that gives magic speed-ups for free.

@stuhood (Member Author), Jan 19, 2022:

The reason I think that doing it here might be useful is that it would adjust the meaning of the --batch-size flag (from "minimum size" to "target size")... it should be a quick change, I think. Smaller batches are actually desirable because they improve cache hit rates (which is why the per-file-caching flag existed in the first place).

Member:

@Eric-Arellano brought up a good point above: changing the batch size or implementation could have negative repercussions (due to the cache being invalidated). However, as @jsirois points out, any batching is a win.

IMO the solution with #9964 is the "baseline" and any improvement over that is gravy. If Pants finds better solutions/numbers over time, I'm OK with a one-time cache bust that results in speedier runs going forward.

Because of the amount of variance here with tools/machines/repos, I think the best approach is probably a data-driven one. You might try building a performance test suite to test various numbers/algorithms and asking the community (and/or Toolchain's customers) to run it overnight (or during a dead time) on a machine or two. I know I'd happily volunteer some CPU cycles to help find the optimal config.

Contributor:

The reason I think that doing it here might be useful is that it would adjust the meaning of the --batch-size flag (from "minimum size" to "target size")

IMO it's fine to change the behavior up until 2.10.0rc0. This seems worth landing as-is.

@stuhood (Member Author):

...it was not, in fact, a "quick change", because removing the min threshold and attempting to use only the hash to choose boundaries requires increasing the threshold to the point where our sweet spot of bucket size (~128) is too likely to never encounter a good boundary.

Since the code as-is already snaps to boundaries, I feel fairly confident that we'll already get some matches in the bob:: vs :: case, but can defer looking at that to a followup.

@stuhood (Member Author):

I couldn't help myself. #14210.

"--batch-size",
advanced=True,
type=int,
default=128,
Member:

Does this need to be a power of two?

@stuhood (Member Author):

No.

@Eric-Arellano (Contributor):

I want to check my mental model with this and #9964. We avoid too much parallelism thanks to the engine:

  • it will only schedule processes when there are open semaphores
  • it will adjust internal parallelism (--jobs) appropriately throughout the session
  • You have to make sure your global options are configured appropriately, which is how Pants determines the upper threshold of concurrency.

return set(tuple(p) for p in partition_sequentially(items, key=str, size_min=size_min))

# We start with base items containing every other element from a sorted sequence.
all_items = sorted((f"item{i}" for i in range(0, 64)))
Contributor:

Please do not burn a tree for this, but FYI you already had a generator expression and the extra () is superfluous:

Suggested change:
- all_items = sorted((f"item{i}" for i in range(0, 64)))
+ all_items = sorted(f"item{i}" for i in range(0, 64))

Member:

@Eric-Arellano I think there is a flake8 plugin that detects those... maybe we should add it to this repo.
https://pypi.org/project/flake8-comprehensions/

Contributor:

That'd be great!

Member:

OK, will work on this. PRs coming soon.

@stuhood merged commit 9c1eb9f into pantsbuild:main on Jan 19, 2022
@stuhood deleted the stuhood/partition-fmt-and-lint branch on January 19, 2022 18:28
stuhood added a commit that referenced this pull request Jan 19, 2022
…ility. (#14210)

As a followup to #14186, this change improves the stability (and thus cache hit rates) of batching by removing the minimum bucket size. It also fixes an issue in the tests, and expands the range that they test.

As mentioned in the expanded comments: capping bucket sizes (in either the `min` or the `max` direction) can cause streaks of bucket changes: when a bucket hits a `min`/`max` threshold and ignores a boundary, it increases the chance that the next bucket will trip a threshold as well.

Although it would be most-stable to remove the `max` threshold entirely, it is necessary to resolve the correctness issue of #13462. But we _can_ remove the `min` threshold, and so this change does that.

[ci skip-rust]
[ci skip-build-wheels]
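To make that stability claim concrete, here is a small illustrative snippet (hypothetical helpers, not the actual Pants code): when boundaries are chosen purely from item hashes, adding one item can only perturb the batch it lands in (possibly splitting it), leaving every other batch byte-for-byte identical.

import hashlib


def _is_boundary(item: str, zero_bits: int = 4) -> bool:
    # An item is a batch boundary if its (stable) hash starts with `zero_bits` zero bits.
    return hashlib.sha256(item.encode()).digest()[0] >> (8 - zero_bits) == 0


def batches(items: list[str]) -> list[tuple[str, ...]]:
    # Boundary-only bucketing: no min/max sizes, just deterministic hash boundaries.
    out: list[list[str]] = [[]]
    for item in sorted(items):
        out[-1].append(item)
        if _is_boundary(item):
            out.append([])
    return [tuple(b) for b in out if b]


base = [f"item{i}" for i in range(200)]
changed = set(batches(base)) ^ set(batches(base + ["item999"]))
# Only the batch that "item999" joins differs (plus its two halves if the new item is
# itself a boundary); every other batch is identical and would stay a cache hit.
assert len(changed) <= 3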
@thejcannon (Member):

I didn't see this sooner. Does check belong in this list too?

@stuhood (Member Author) commented Jan 20, 2022:

I didn't see this sooner. Does check belong in this list too?

Possibly. But the big difference with check is that we're expecting tools there to need transitive deps, and to thus benefit more from internal sharing of work than in lint/fmt.

go, java, scala and potentially python should probably all be getting "graph-shaped" parallelism (which allows invokes to consume the output of checked/compiled dependencies). See my most recent comment on the mypy ticket in particular: #10864 (comment)

stuhood added a commit that referenced this pull request Jan 24, 2022
…14184)

When tools support internal concurrency and cannot be partitioned (either because they don't support it, such as in the case of a PEX resolve, or because of overhead to partitioning as fine-grained as desired), Pants' own concurrency currently makes it ~impossible for them to set their concurrency settings correctly.

As sketched in #9964, this change adjusts Pants' local runner to dynamically choose concurrency values per process based on the current concurrency.
1. When acquiring a slot on the `bounded::CommandRunner`, a process takes as much concurrency as it a) is capable of, as configured by a new `Process.concurrency_available` field, and b) deserves for the purposes of fairness (i.e., half, for two processes). This results in some amount of over-commit.
2. Periodically, a balancing task runs and preempts/re-schedules processes which have been running for less than a very short threshold (`200ms` currently) and which are the largest contributors to over/under-commit. This fixes some over/under-commit, but not all of it, because if a process becomes over/under-committed after it has been running a while (because other processes started or finished), we will not preempt it.

Combined with #14186, this change results in an additional 2% speedup for `lint` and `fmt`. But it should also have a positive impact on PEX processes, which were the original motivation for #9964.

Fixes #9964. 

[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Jan 29, 2022
This adds generic support for `lint` implementations that do not deal with targets. That allows us to merge `validate` into `lint`, which is much cleaner.

## CLI specs

As before with the `validate` goal, it's not very intuitive how to get Pants to run on files not owned by targets, which you want for `validate`. `::` only matches files owned by targets, whereas `**` matches _all_ files regardless of targets.

So, users of `regex-lint` should typically use `./pants lint '**'` rather than `./pants lint ::`, which is not intuitive.

https://docs.google.com/document/d/1WWQM-X6kHoSCKwItqf61NiKFWNSlpnTC5QNu3ul9RDk/edit#heading=h.1h4j0d5mazhu proposes changing `::` to match all files, so you can simply use `./pants lint ::`. I don't think we need to block on this proposal? This is still forward progress, and also `validate`/`regex-lint` is not used very much fwict.

## Batching

We don't yet batch per #14186, although it would be trivial for us to hook up. I'm only waiting to do it till we can better reason about if it makes sense to apply here too.

## The `fmt` goal

Note that we need more design for `fmt` before we can apply this same change there. fmt is tricky because we run each formatter for a certain language sequentially so that they don't overwrite each other; but we run distinct languages in parallel. We would need some way to know which "language" target-less files are for.

## "Inferred targets"

A related technology would be inferred targets, where you don't need a BUILD file but we still have a target: #14074.

This is a complementary technology. The main difference here is that we can operate on files that will _never_ have an owning target, such as a BUILD file itself.

[ci skip-rust]
[ci skip-build-wheels]

Successfully merging this pull request may close these issues.

Batch invokes of linters (and other dynamically sized process invokes)