
Per-process parallelism should be chosen dynamically, and possibly per platform/host #9964

Closed
stuhood opened this issue Jun 4, 2020 · 9 comments · Fixed by #10584 or #14184

@stuhood
Member

stuhood commented Jun 4, 2020

As described on #9253:

Toolchain encountered several failures in CI when resolving requirements PEXes with an error code of -9 and no stdout or stderr. An exit code of -9 means the process was killed by signal 9 (SIGKILL), which typically indicates an Out of Memory (OOM) kill.

We reasoned that this was likely from not configuring --python-setup-resolver-jobs. The default value is None, which means that we use Pex's default of the number of cores. This is a good default for direct invocations of Pex, but a bad one for Pants because of Pants' concurrency model.

Specifically, --python-setup-resolver-jobs is configured for each distinct process spawned by Pants. So the total parallelism is effectively --python-setup-resolver-jobs x --process-execution-local-parallelism.

We default --process-execution-local-parallelism to cpu_count x 2. This means that the default total parallelism for V2 users would be (cpu_count x 2) x cpu_count! For example, with a CPU count of 4 and all defaults, total parallelism would be 32.
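
As a quick, back-of-the-envelope sketch of that arithmetic (illustrative numbers only, not real Pants code):

# The default multiplication described above.
import os

cpu_count = os.cpu_count() or 4        # e.g. 4
resolver_jobs = cpu_count              # Pex's default: one job per core
local_parallelism = cpu_count * 2      # default --process-execution-local-parallelism
total = resolver_jobs * local_parallelism
print(total)                           # with 4 cores: 4 * 8 = 32 concurrent jobs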

This applies to more than just CI, though: in any context, whether on a laptop, in CI, or under remote execution, this type of value is challenging to set correctly.

@jsirois is working on splitting resolution from installation in PEX, which might help mitigate the PEX case in particular, but the problem will continue to apply to parallel tools generally.


In the context of remote execution we can probably assume that workers are opaque, and safely configure a value on a "per-host"/destination/toolchain/platform basis (in a Multi-Platform Speculation sense).

But locally, we should likely choose this kind of value dynamically based on current concurrency. As explained on #9253, hardcoding this value too low will hurt when you're waiting for only one resolve to complete. A sketch of native support for this might look like adding a _PANTS_CONCURRENCY environment variable that our executors would set dynamically (post cache-key calculation, in the case of a local environment) based on how many other processes were running.
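
A minimal sketch of that idea, assuming a hypothetical spawn helper on the executor side (the _PANTS_CONCURRENCY name comes from the sketch above; everything else is illustrative):

import os
import subprocess

def spawn_with_concurrency_hint(argv: list[str], free_slots: int) -> subprocess.Popen:
    # Set the hint dynamically, after cache-key calculation, so that the same
    # process keys identically in the cache regardless of the hint's value.
    env = {**os.environ, "_PANTS_CONCURRENCY": str(max(1, free_slots))}
    return subprocess.Popen(argv, env=env)

# A cooperating tool (or a wrapper around it) would then cap its own parallelism:
jobs = int(os.environ.get("_PANTS_CONCURRENCY", "1"))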

@stuhood stuhood added the engine label Jun 4, 2020
Eric-Arellano added a commit that referenced this issue Aug 11, 2020
See #9253 for the motivation. We didn't land the change then because we had to account for both v1 and v2 usages, and we didn't want to optimize for v2 at the expense of v1. Now we only have v2 to worry about.

This partially closes #9964, although in a much more naive way.

[ci skip-rust]
[ci skip-build-wheels]
@Eric-Arellano
Contributor

While this was naively fixed in #10584, I'm reopening because this wasn't fixed as described above.

@Eric-Arellano Eric-Arellano reopened this Aug 11, 2020
@Eric-Arellano
Contributor

Addressed by #11006.

@stuhood
Member Author

stuhood commented Mar 12, 2021

Reopened because I still think that we can fully automate this.

@stuhood stuhood changed the title Pex parallelism should be chosen dynamically, and possibly per platform/host Per-process parallelism should be chosen dynamically, and possibly per platform/host Apr 14, 2021
@stuhood
Member Author

stuhood commented Apr 14, 2021

This also applies to pylint: adjusted the title.

@stuhood
Member Author

stuhood commented Aug 9, 2021

This has become more relevant recently since there are now a variety of PEX shapes, and not all invocations are likely to run concurrently. In particular, when using a lockfile/constraints.txt, the run that builds the constraints PEX will generally run alone (and take a while), while lots of dependent PEX builds will run in parallel (and finish quickly). This makes it very challenging to set the --python-setup-resolver-jobs value correctly by hand.

@stuhood
Member Author

stuhood commented Jan 7, 2022

As mentioned on #13462, we should implement this ticket in order to take better advantage of internal parallelism of processes.

We should likely add the following field to Process (or perhaps to an optional dataclass-valued field):

class Process:
    ...
    # If non-zero, the amount of parallelism that this process is capable of given its inputs.
    # This value does not directly set the number of cores allocated to the process: that is
    # computed based on availability, and provided as a template value in the arguments of the
    # process.
    maximum_parallelism: int = 0
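
A hypothetical call site might then look like the following (the {pants_concurrency} placeholder and the exact Process constructor arguments are illustrative, not a settled API):

# Hypothetical: the engine computes the actual core count at schedule time and
# substitutes it into the argv template before spawning.
process = Process(
    argv=["pex", "--jobs", "{pants_concurrency}", "-r", "requirements.txt"],
    description="Resolve requirements",
    maximum_parallelism=8,  # this resolve could usefully occupy up to 8 cores
)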

This will require changes to the BoundedCommandRunner on the Rust side to support dynamically computing and providing this value based on the number of slots available. The computed value is also interesting: since we don't have a complete view of which processes will be started, we will need to choose concurrency limits based only on the current process and inbound processes (we could also optionally use history in the future, but that is finicky).

The initial strategy I'm thinking of is to overcommit by default: when a process enters the semaphore, it will record its own concurrency as min(parallelism_slots_available, maximum_parallelism) (or perhaps a progression like total_slots / #processes_already_running), but will still only take one semaphore slot. Even if no parallelism slots are available, we will still schedule the process (i.e., we'll overcommit).
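
A minimal Python sketch of that acquisition strategy (the real implementation would live in the Rust BoundedCommandRunner; all names here are illustrative):

import asyncio

class OvercommittingRunner:
    """Each process takes exactly one semaphore slot but is granted
    min(idle_slots, maximum_parallelism) cores of internal concurrency.
    Grants are not themselves tracked against the semaphore, so the sum of
    grants can exceed total_slots: that is the overcommit."""

    def __init__(self, total_slots: int) -> None:
        self._total_slots = total_slots
        self._running = 0
        self._semaphore = asyncio.Semaphore(total_slots)

    async def run(self, argv: list[str], maximum_parallelism: int) -> int:
        async with self._semaphore:
            idle = max(self._total_slots - self._running, 1)  # floor of 1: overcommit
            concurrency = min(idle, maximum_parallelism or 1)
            self._running += 1
            try:
                # Template the computed value into the process's arguments.
                final_argv = [arg.replace("{pants_concurrency}", str(concurrency)) for arg in argv]
                proc = await asyncio.create_subprocess_exec(*final_argv)
                return await proc.wait()
            finally:
                self._running -= 1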

Overcommitting has some downsides. So, optionally, in a followup (unsure whether it would be worth the complexity, or whether it would actually give a speedup): if the computed parallelism for a given process ends up not matching its maximum, a task could check while the process runs (maybe only once, a short fixed period after it has started: e.g., 250ms or so) to see whether it can cancel and restart it with a better parallelism value.
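
Continuing the sketch above, that follow-up check might look like this (the re-query of the runner's internals and the cancellation handle are both hypothetical):

import asyncio

async def maybe_rebalance(runner: OvercommittingRunner, argv: list[str],
                          maximum_parallelism: int, granted: int,
                          task: asyncio.Task, delay: float = 0.25) -> None:
    # Check once, a short fixed period after start: if a meaningfully better
    # grant is now available, cancel the running process and re-run it.
    await asyncio.sleep(delay)
    idle = max(runner._total_slots - runner._running, 1)
    better = min(idle, maximum_parallelism or 1)
    if better > granted:
        task.cancel()
        await runner.run(argv, maximum_parallelism)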

@thejcannon
Member

thejcannon commented Jan 8, 2022

I like the idea! Some thoughts (which also apply to #13462 so it's repeated there):

  • For clarification, I assume caching won't be any different here than with --no-per-file-caching, right? This is essentially still an "all-in-one" run, just now parallel under the hood.
  • You likely were already going to do this, but it likely makes sense to error if this and --per-file-caching are enabled.
  • It is worth some time to think about how this change affects the dynamic UI. Seeing Running pylint on XYZ files in parallel in one swimlane while N other swimlanes sit idle is likely very confusing. I can't say I have a good idea for how to fix this (other than maybe prodash, per #13481), but it warrants thought.

@thejcannon
Member

thejcannon commented Jan 8, 2022

(Also duplicating this on #13462.)

As a single datapoint: on my 64-core machine, with the formatters yapf and isort and the linters pylint and flake8, I see the following with a warm cache (all prerequisites already handled via a warmup run, then touching N files with a timestamp comment and re-running):

  • Running fmt on 5 files:
    • --per-file-caching: 1.3 seconds
    • --no-per-file-caching: 1.5 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 0.974 seconds
  • Running fmt on 100 files:
    • --per-file-caching: 6.8 seconds
    • --no-per-file-caching: 13.6 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 3.1 seconds
  • Running lint (with pylint + flake8 only) on 30 files:
    • --per-file-caching: 10.9 seconds
    • --no-per-file-caching: 32.3 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 13.8 seconds
  • Running lint (with pylint + flake8 only) on 250 files:
    • --per-file-caching: 59.9 seconds
    • --no-per-file-caching: 120.2 seconds
    • --no-per-file-caching and running tools in parallel via --tool-args: 30.7 seconds

@thejcannon
Member

With enough similar metrics, you might be able to deprecate `--per-file-caching` 🤔

stuhood added a commit that referenced this issue Jan 24, 2022
…14184)

When tools support internal concurrency and cannot be partitioned (either because they don't support it, as in the case of a PEX resolve, or because of the overhead of partitioning as finely as desired), Pants' own concurrency currently makes it ~impossible to set their concurrency settings correctly.

As sketched in #9964, this change adjusts Pants' local runner to dynamically choose concurrency values per process based on the current concurrency.
1. When acquiring a slot on the `bounded::CommandRunner`, a process takes as much concurrency as it a) is capable of, as configured by a new `Process.concurrency_available` field, and b) deserves for fairness purposes (i.e., half, for two processes). This results in some amount of over-commit.
2. Periodically, a balancing task runs and preempts/re-schedules processes which have been running for less than a very short threshold (`200ms` currently) and which are the largest contributors to over/under-commit. This fixes some over/under-commit, but not all of it, because if a process becomes over/under-committed after it has been running a while (because other processes started or finished), we will not preempt it.

Combined with #14186, this change results in an additional 2% speedup for `lint` and `fmt`. But it should also have a positive impact on PEX processes, which were the original motivation for #9964.

Fixes #9964. 

[ci skip-build-wheels]