
Per-process parallelism should be chosen dynamically, and possibly per platform/host #9964

Closed
stuhood opened this issue Jun 4, 2020 · 9 comments · Fixed by #10584 or #14184

@stuhood
Member

stuhood commented Jun 4, 2020

As described on #9253:

Toolchain encountered several failures in CI when resolving requirements PEXes with an error code of -9 and no stdout or stderr. An exit code of -9 means the process was killed by signal 9 (SIGKILL), which typically indicates an Out of Memory (OOM) kill.

We reasoned that this was likely from not configuring --python-setup-resolver-jobs. The default value is None, which means that we use Pex's default of the number of cores. This is a good default for direct invocations of Pex, but a bad one for Pants because of Pants' concurrency model.

Specifically, --python-setup-resolver-jobs is configured for each distinct process spawned by Pants. So the total parallelism is effectively --python-setup-resolver-jobs x --process-execution-local-parallelism.

We default --process-execution-local-parallelism to cpu_count x 2. This means that the default total parallelism for V2 users would be (cpu_count x 2) x cpu_count! For example, with a CPU count of 4 and all defaults, total parallelism would be 32.
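
As a quick, back-of-the-envelope sketch of that arithmetic (illustrative numbers only, not real Pants code):

# The default multiplication described above.
import os

cpu_count = os.cpu_count() or 4        # e.g. 4
resolver_jobs = cpu_count              # Pex's default: one job per core
local_parallelism = cpu_count * 2      # default --process-execution-local-parallelism
total = resolver_jobs * local_parallelism
print(total)                           # with 4 cores: 4 * 8 = 32 concurrent jobs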

This applies to more than just CI, though: in any context, whether on a laptop, in CI, or under remote execution, this type of value is challenging to set correctly.

@jsirois is working on splitting resolution from installation in PEX, which might help mitigate the PEX case in particular, but the problem will continue to apply to parallel tools generally.


In the context of remote execution we can probably assume that workers are opaque, and safely configure a value on a "per-host"/destination/toolchain/platform basis (in a Multi-Platform Speculation sense).

But locally, we should likely choose this kind of value dynamically based on current concurrency. As explained on #9253, hardcoding this value too low will hurt when you're waiting for only one resolve to complete. A sketch of native support for this might look like adding a _PANTS_CONCURRENCY environment variable that our executors would set dynamically (post cache-key calculation, in the case of a local environment) based on how many other processes were running.
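
A minimal sketch of that idea, assuming a hypothetical spawn helper on the executor side (the _PANTS_CONCURRENCY name comes from the sketch above; everything else is illustrative):

import os
import subprocess

def spawn_with_concurrency_hint(argv: list[str], free_slots: int) -> subprocess.Popen:
    # Set the hint dynamically, after cache-key calculation, so that the same
    # process keys identically in the cache regardless of the hint's value.
    env = {**os.environ, "_PANTS_CONCURRENCY": str(max(1, free_slots))}
    return subprocess.Popen(argv, env=env)

# A cooperating tool (or a wrapper around it) would then cap its own parallelism:
jobs = int(os.environ.get("_PANTS_CONCURRENCY", "1"))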

@stuhood stuhood added the engine label Jun 4, 2020
Eric-Arellano added a commit that referenced this issue Aug 11, 2020
See #9253 for the motivation. We didn't land the change then because we had to account for both v1 and v2 usages, and we didn't want to optimize for v2 at the expense of v1. Now we only have v2 to worry about.

This partially closes #9964, although in a much more naive way.

[ci skip-rust]
[ci skip-build-wheels]
@Eric-Arellano
Contributor

While this was naively fixed in #10584, I'm reopening because this wasn't fixed as described above.

@Eric-Arellano Eric-Arellano reopened this Aug 11, 2020
@Eric-Arellano
Contributor

Addressed by #11006.

@stuhood
Member Author

stuhood commented Mar 12, 2021

Reopened because I still think that we can fully automate this.

@stuhood stuhood changed the title Pex parallelism should be chosen dynamically, and possibly per platform/host Per-process parallelism should be chosen dynamically, and possibly per platform/host Apr 14, 2021
@stuhood
Member Author

stuhood commented Apr 14, 2021

This also applies to pylint: adjusted the title.

@stuhood
Member Author

stuhood commented Aug 9, 2021

This has become more relevant recently since there are now a variety of PEX shapes, and not all invocations are likely to run concurrently. In particular, when using a lockfile/constraints.txt, the run that builds the constraints PEX will generally run alone (and take a while), while lots of dependent PEX builds will run in parallel (and finish quickly). This makes it very challenging to set the --python-setup-resolver-jobs value correctly by hand.

@stuhood
Member Author

stuhood commented Jan 7, 2022

As mentioned on #13462, we should implement this ticket in order to take better advantage of internal parallelism of processes.

We should likely add the following field to Process (or perhaps to an optional dataclass-valued field):

class Process:
    ...
    # If non-zero, the amount of parallelism that this process is capable of given its inputs.
    # This value does not directly set the number of cores allocated to the process: that is
    # computed based on availability, and provided as a template value in the arguments of the
    # process.
    maximum_parallelism: int = 0
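
A hypothetical call site might then look like the following (the {pants_concurrency} placeholder and the exact Process constructor arguments are illustrative, not a settled API):

# Hypothetical: the engine computes the actual core count at schedule time and
# substitutes it into the argv template before spawning.
process = Process(
    argv=["pex", "--jobs", "{pants_concurrency}", "-r", "requirements.txt"],
    description="Resolve requirements",
    maximum_parallelism=8,  # this resolve could usefully occupy up to 8 cores
)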

This will require changes to the BoundedCommandRunner on the Rust side to support dynamically computing and providing this value based on the number of slots available. The computed value is also interesting: since we don't have a complete view of which processes will be started, we will need to choose concurrency limits based only on the current process and inbound processes (we could also optionally use history in the future, but that is finicky).

The initial strategy I'm thinking of is to overcommit by default: when a process enters the semaphore, it will record its own concurrency as min(parallelism_slots_available, maximum_parallelism) (or perhaps a progression like total_slots / #processes_already_running), but will still only take one semaphore slot. Even if no parallelism slots are available, we will still schedule the process (i.e., we'll overcommit).
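
A minimal Python sketch of that acquisition strategy (the real implementation would live in the Rust BoundedCommandRunner; all names here are illustrative):

import asyncio

class OvercommittingRunner:
    """Each process takes exactly one semaphore slot but is granted
    min(idle_slots, maximum_parallelism) cores of internal concurrency.
    Grants are not themselves tracked against the semaphore, so the sum of
    grants can exceed total_slots: that is the overcommit."""

    def __init__(self, total_slots: int) -> None:
        self._total_slots = total_slots
        self._running = 0
        self._semaphore = asyncio.Semaphore(total_slots)

    async def run(self, argv: list[str], maximum_parallelism: int) -> int:
        async with self._semaphore:
            idle = max(self._total_slots - self._running, 1)  # floor of 1: overcommit
            concurrency = min(idle, maximum_parallelism or 1)
            self._running += 1
            try:
                # Template the computed value into the process's arguments.
                final_argv = [arg.replace("{pants_concurrency}", str(concurrency)) for arg in argv]
                proc = await asyncio.create_subprocess_exec(*final_argv)
                return await proc.wait()
            finally:
                self._running -= 1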

Overcommitting has some downsides. So, optionally, in a followup (unsure whether it would be worth the complexity, or whether it would actually give a speedup): if the computed parallelism for a given process ends up not matching its maximum, a task could check while the process runs (maybe only once, a short fixed period after it has started: e.g., 250ms or so) to see whether it can cancel and restart it with a better parallelism value.
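
Continuing the sketch above, that follow-up check might look like this (the re-query of the runner's internals and the cancellation handle are both hypothetical):

import asyncio

async def maybe_rebalance(runner: OvercommittingRunner, argv: list[str],
                          maximum_parallelism: int, granted: int,
                          task: asyncio.Task, delay: float = 0.25) -> None:
    # Check once, a short fixed period after start: if a meaningfully better
    # grant is now available, cancel the running process and re-run it.
    await asyncio.sleep(delay)
    idle = max(runner._total_slots - runner._running, 1)
    better = min(idle, maximum_parallelism or 1)
    if better > granted:
        task.cancel()
        await runner.run(argv, maximum_parallelism)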

@thejcannon
Member

thejcannon commented Jan 8, 2022

I like the idea! Some thoughts (which also apply to #13462 so it's repeated there):

  • For clarification, I assume caching won't be any different here than with --no-per-file-caching, right? This is essentially still an "all-in-one" run, just now parallel under the hood.
  • You likely were already going to do this, but it likely makes sense to error if this and --per-file-caching are enabled.
  • It is worth some time to think about how this change affects the dynamic UI. Seeing Running pylint on XYZ files in parallel in one swimlane while N other swimlanes sit idle is likely very confusing. I can't say I have a good idea for how to fix this (other than maybe prodash, per #13481), but it warrants thought.

@thejcannon
Member

thejcannon commented Jan 8, 2022

(Also duplicating this on #13462.)

As a single datapoint: on my 64-core machine, with the formatters yapf and isort and the linters pylint and flake8, I see the following with a warm cache (all prerequisites already handled via a warmup run, then touching N files with a timestamp comment and re-running):

  • Running fmt on 5 files:
    • --per-file-caching: 1.3 seconds
    • --no-per-file-caching: 1.5 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 0.974 seconds
  • Running fmt on 100 files:
    • --per-file-caching: 6.8 seconds
    • --no-per-file-caching: 13.6 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 3.1 seconds
  • Running lint (with pylint + flake8 only) on 30 files:
    • --per-file-caching: 10.9 seconds
    • --no-per-file-caching: 32.3 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 13.8 seconds
  • Running lint (with pylint + flake8 only) on 250 files:
    • --per-file-caching: 59.9 seconds
    • --no-per-file-caching: 120.2 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 30.7 seconds

@thejcannon
Member

With enough similar metrics, you might be able to deprecate `--per-file-caching` 🤔

stuhood added a commit that referenced this issue Jan 24, 2022
…14184)

When tools support internal concurrency and cannot be partitioned (either because they don't support it, as in the case of a PEX resolve, or because of the overhead of partitioning as finely as desired), Pants' own concurrency currently makes it ~impossible to set their concurrency settings correctly.

As sketched in #9964, this change adjusts Pants' local runner to dynamically choose concurrency values per process based on the current concurrency.
1. When acquiring a slot on the `bounded::CommandRunner`, a process takes as much concurrency as it a) is capable of, as configured by a new `Process.concurrency_available` field, and b) deserves for fairness purposes (i.e., half, for two processes). This results in some amount of over-commit.
2. Periodically, a balancing task runs and preempts/re-schedules processes which have been running for less than a very short threshold (`200ms` currently) and which are the largest contributors to over/under-commit. This fixes some over/under-commit, but not all of it, because if a process becomes over/under-committed after it has been running a while (because other processes started or finished), we will not preempt it.

Combined with #14186, this change results in an additional 2% speedup for `lint` and `fmt`. But it should also have a positive impact on PEX processes, which were the original motivation for #9964.

Fixes #9964. 

[ci skip-build-wheels]