Batching of lint and fmt invokes #14186
@@ -8,7 +8,7 @@
 from dataclasses import dataclass
 from typing import TypeVar, cast

-from pants.core.goals.style_request import StyleRequest
+from pants.core.goals.style_request import StyleRequest, style_batch_size_help
 from pants.core.util_rules.source_files import SourceFiles, SourceFilesRequest
 from pants.engine.console import Console
 from pants.engine.engine_aware import EngineAwareReturnType
@@ -139,7 +139,11 @@ def register_options(cls, register) -> None:
     removal_version="2.11.0.dev0",
     removal_hint=(
         "Formatters are now broken into multiple batches by default using the "
-        "`--batch-size` argument."
+        "`--batch-size` argument.\n"
+        "\n"
+        "To keep (roughly) this option's behavior, set [fmt].batch_size = 1. However, "
+        "you'll likely get better performance by using a larger batch size because of "
+        "reduced overhead launching processes."
     ),
     help=(
         "Rather than formatting all files in a single batch, format each file as a "
@@ -156,20 +160,7 @@ def register_options(cls, register) -> None:
     advanced=True,
     type=int,
     default=128,

> Does this need to be a power of two?

> No.

-    help=(
-        "The target minimum number of files that will be included in each formatter batch.\n"
-        "\n"
-        "Formatter processes are batched for a few reasons:\n"
-        "\n"
-        "1. to avoid OS argument length limits (in processes which don't support argument "
-        "files)\n"
-        "2. to support more stable cache keys than would be possible if all files were "
-        "operated on in a single batch.\n"
-        "3. to allow for parallelism in formatter processes which don't have internal "
-        "parallelism, or -- if they do support internal parallelism -- to improve scheduling "
-        "behavior when multiple processes are competing for cores and so internal "
-        "parallelism cannot be used perfectly.\n"
-    ),
+    help=style_batch_size_help(uppercase="Formatter", lowercase="formatter"),
 )

 @property
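As a reference point, below is a minimal sketch of what the shared `style_batch_size_help` helper imported above could look like, reconstructed from the help text this hunk removes and from the call site `style_batch_size_help(uppercase="Formatter", lowercase="formatter")`. It is an assumption about the helper's shape, not the actual body in `pants.core.goals.style_request`, which may word or structure things differently.

```python
# Hypothetical reconstruction -- not the actual Pants source. It assumes the helper
# simply interpolates the tool-category name into the help text previously inlined
# in the fmt goal's options.
def style_batch_size_help(uppercase: str, lowercase: str) -> str:
    return (
        f"The target minimum number of files that will be included in each {lowercase} batch.\n"
        "\n"
        f"{uppercase} processes are batched for a few reasons:\n"
        "\n"
        "1. to avoid OS argument length limits (in processes which don't support argument "
        "files)\n"
        "2. to support more stable cache keys than would be possible if all files were "
        "operated on in a single batch.\n"
        "3. to allow for parallelism in processes which don't have internal parallelism, "
        "or -- if they do support internal parallelism -- to improve scheduling behavior "
        "when multiple processes are competing for cores and so internal parallelism "
        "cannot be used perfectly.\n"
    )
```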
@@ -6,7 +6,7 @@
 import collections
 import collections.abc
 import math
-from typing import Any, Callable, Iterable, Iterator, MutableMapping, Sequence, TypeVar
+from typing import Any, Callable, Iterable, Iterator, MutableMapping, TypeVar

 from pants.engine.internals import native_engine
@@ -77,7 +77,7 @@ def ensure_str_list(val: str | Iterable[str], *, allow_single_str: bool = False)


 def partition_sequentially(
-    items: Sequence[_T],
+    items: Iterable[_T],
     *,
     key: Callable[[_T], str],
     size_min: int,
@@ -95,7 +95,15 @@ def partition_sequentially(
     # To stably partition the arguments into ranges of at least `size_min`, we sort them, and
     # create a new batch sequentially once we have the minimum number of entries, _and_ we encounter
     # an item hash prefixed with a threshold of zeros.
> I'm definitely missing what the leading zeros check is about. You explain trying to accommodate adding items disturbing batches minimally, but I don't understand the mechanism of how this helps. Is it tied to characteristics of the Rust hash function used? Maybe a unit test of this function that shows how adding an item to an N bucket chain only results in ~1 bucket changing contents?

> +1 to unit tests of this function.

> I can sort of see what's going on here, but some comments explaining it would be really helpful. Especially justifying the selection of `zero_prefix_threshold`.

> Expanded the comment and added a test.
-    zero_prefix_threshold = math.log(size_min // 8, 2)
+    #
+    # The hashes act like a (deterministic) series of rolls of an evenly distributed die. The
+    # probability of a hash prefixed with Z zero bits is 1/2^Z, and so to break after N items on
+    # average, we look for `Z == log2(N)` zero bits.
+    #
+    # Breaking on these deterministic boundaries means that adding any single item will affect
> Thanks. Ok, but this assumes ... the same CLI specs with files edits / adds in between? In other words this whole scheme does ~nothing for …

> This is mostly about optimizing the re-running of a single command over time, yea. But because the inputs are sorted before hashing, …
>
> Unfortunately, it's hard to make promises about that in the presence of the …
>
> Thanks for raising this point though... it does actually seem like a desirable enough use case to try and strengthen this further.

> I'm not sure if strengthening this further, or even using the current trickery, belongs in this PR. IIUC batching is purely a win over the status quo today, even if the batches are simply formed from fixed-size buckets over the sorted input and nothing more. If that's true, then this PR is only muddied by the fiddly bits here, and maybe adding the fiddly bits en masse could be done as a follow-up that gives magic speed-ups for free.

> The reason I think that doing it here might be useful is that it would adjust the meaning of the …

> @Eric-Arellano brought up a good point above about the fact that changing the batch size or implementation could have negative repercussions (due to the cache being invalidated). However, as @jsirois points out, any batching is a win. IMO the solution with #9964 is the "baseline" and any improvement over that is gravy. If Pants finds better solutions/numbers over time, I'm OK with a one-time cache bust which results in speedier runs going forward.
>
> Because of the amount of variance here with tools/machine/repo, I think the best approach is probably a data-driven one. You might try building a performance test suite to test various numbers/algorithms and asking the community (and/or Toolchain's customers) to run it overnight (or during a dead time) on a machine or two. I know I'd happily volunteer some CPU cycles to help find the optimal config.

> Imo it's fine to change the behavior up till 2.10.0rc0. This seems worth landing as-is.

> ...it was not, in fact, a "quick change", because removing the min threshold and attempting to use only the hash to choose boundaries requires increasing the threshold to the point where our sweet spot of bucket size (~128) is too likely to never encounter a good boundary. Since the code as-is already snaps to boundaries, I feel fairly confident that we'll already get some matches in the …

> I couldn't help myself. #14210.
+    # either one bucket (if the item does not create a boundary) or two (if it does create a
+    # boundary).
+    zero_prefix_threshold = math.log(max(4, size_min) // 4, 2)
     size_max = size_min * 2 if size_max is None else size_max

     batch: list[_T] = []
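For readers who want to see the whole scheme end to end, here is a self-contained sketch of the hash-prefix partitioning described by the comments above. It is not the Pants implementation: it substitutes `hashlib.blake2b` for the engine's native hash and guesses at the surrounding loop, but it shows how the `zero_prefix_threshold` boundaries produce batches that stay stable when items are inserted elsewhere.

```python
from __future__ import annotations

import hashlib
import math
from typing import Callable, Iterable, Iterator, TypeVar

_T = TypeVar("_T")


def _leading_zero_bits(digest: bytes) -> int:
    # Number of leading zero bits in the digest, read as a big-endian integer.
    value = int.from_bytes(digest, "big")
    total_bits = len(digest) * 8
    return total_bits if value == 0 else total_bits - value.bit_length()


def partition_sequentially_sketch(
    items: Iterable[_T],
    *,
    key: Callable[[_T], str],
    size_min: int,
    size_max: int | None = None,
) -> Iterator[list[_T]]:
    # A "good" boundary is an item whose hash starts with at least this many zero bits;
    # with evenly distributed hashes that happens for roughly 1 in 2^threshold items.
    zero_prefix_threshold = math.log(max(4, size_min) // 4, 2)
    size_max = size_min * 2 if size_max is None else size_max

    batch: list[_T] = []
    for item in sorted(items, key=key):
        batch.append(item)
        if len(batch) >= size_max:
            # Hard cap: emit the batch even if no good boundary was found.
            yield batch
            batch = []
        elif len(batch) >= size_min:
            digest = hashlib.blake2b(key(item).encode()).digest()
            if _leading_zero_bits(digest) >= zero_prefix_threshold:
                # Deterministic, content-based boundary: inserting or removing items
                # elsewhere in the sorted sequence usually leaves this cut in place.
                yield batch
                batch = []
    if batch:
        yield batch
```

With `size_min=128` the threshold is `log2(32) == 5`, so once a batch has reached the minimum size roughly 1 in 32 subsequent items qualifies as a boundary; batches therefore stay close to `size_min` (bounded above by `size_max`) while still cutting at content-determined points.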
@@ -12,6 +12,7 @@
     assert_single_element,
     ensure_list,
     ensure_str_list,
+    partition_sequentially,
     recursively_update,
 )
@@ -85,3 +86,23 @@ def test_ensure_str_list() -> None:
         ensure_str_list(0)  # type: ignore[arg-type]
     with pytest.raises(ValueError):
         ensure_str_list([0, 1])  # type: ignore[list-item]
+
+
+@pytest.mark.parametrize("size_min", [0, 1, 16, 32, 64, 128])
+def test_partition_sequentially(size_min: int) -> None:
+    # Adding an item at any position in the input sequence should affect either 1 or 2 (if the added
+    # item becomes a boundary) buckets in the output.
+
+    def partitioned_buckets(items: list[str]) -> set[tuple[str, ...]]:
+        return set(tuple(p) for p in partition_sequentially(items, key=str, size_min=size_min))
+
+    # We start with base items containing every other element from a sorted sequence.
+    all_items = sorted((f"item{i}" for i in range(0, 64)))
> Please do not burn a tree for this, but FYI you already had a generator expression and the extra …
>
> Suggested change: …

> @Eric-Arellano I think there is a flake8 plugin that detects those... maybe we should add it to this repo.

> That'd be great!

> ok. will work on this. PRs coming soon.
+
+    base_items = [item for i, item in enumerate(all_items) if i % 2 == 0]
+    base_partitions = partitioned_buckets(base_items)
+
+    # Then test that adding any of the remaining items elements (which will be interspersed in the
+    # base items) only affects 1 or 2 buckets in the output.
+    for to_add in [item for i, item in enumerate(all_items) if i % 2 == 1]:
+        updated_partitions = partitioned_buckets([to_add, *base_items])
+        assert 1 <= len(base_partitions ^ updated_partitions) <= 2
> We technically shouldn't change this default after we release this. Thoughts on whether it's worth us trying to benchmark what the optimal number is? I imagine that's hard to arrive at, including depending on your machine's specs.

> Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔
>
> Could be fun to benchmark though 😉

> I did a bit of benchmarking to settle on `128` here on an integration branch with #9964 included: `128` was best by ~1%. Additional benchmarking and adjustment after both this and #9964 have landed will be good, since they interplay strongly with one another: I'll include more numbers over there.
>
> Yea, there are potentially a lot of dimensions here. But I think that from a complexity perspective, we're not going to want to expose per-tool batch size knobs without some quantitative justification (post landing).

> OK, this one had me stumped until I tried it. My assumption was that the batches would fill all the threads, and so in-tool parallelization would only result in over-allocating your resources. What I see though is that depending on the number of threads available, the number of files in the target list, and the batch size, there are many points in time where you're running fewer batches than available threads.
>
> Of course (and I hope to show this via data), using the poor-man's in-tool parallelization is likely not ideal as it isn't dynamic and would result in over-allocation of resources in the "bursts" where there are more-than-thread-count of rules.
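To make the "fewer batches than threads" point concrete, here is a rough back-of-the-envelope example; the file and core counts are invented for illustration, not measurements from this PR:

```python
# Invented numbers purely for illustration -- not benchmark data from this PR.
files = 1000       # files matched by the fmt/lint request
batch_size = 128   # the default target minimum batch size discussed above
cores = 16         # available worker threads

batches = -(-files // batch_size)  # ceiling division
print(f"{batches} batches competing for {cores} cores")  # -> "8 batches competing for 16 cores"
# With only 8 batches for 16 cores, half the cores sit idle unless a tool can
# parallelize internally -- the situation where in-tool parallelism still helps.
```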