Develop/Document multi-level parallelism policy #644

Closed

rrnewton opened this issue Jul 21, 2015 · 35 comments
@rrnewton (Contributor)

I can see that stack test defaults to parallel builds. But that refers to parallelism between different test-suites, right? In the case of both builds and tests (with test-framework or tasty) there's the issue of "inner loop" parallelism within each build or test target. I suppose we can't parallelize the tests until we have some way for the build-tool to know it's a tasty/test-framework suite, not just exitcode-stdio-1.0, but that still leaves build-parallelism. Does Stack currently pass -j to GHC?

I haven't seen comprehensive benchmark numbers, but a small value like -j2 or -j4 for GHC offers some benefit. The current implementation is not scalable, however; for example, I haven't seen anything good come of -j32.

Nested parallelism of course raises the issue of N * N jobs being spawned for N cores. As long as it's quadratic oversubscription and not exponential, I think this is not that big a problem for CPU usage, but it can very often create problems with memory usage (or hitting ulimits / the maximum number of processes).

There are several related cabal issues, but it's a little hard to tell the current status with them spanning a lot of time and various merged and unmerged pull requests:

@rrnewton rrnewton changed the title Document multi-level parallelism policy Develop/Document multi-level parallelism policy Jul 21, 2015
@snoyberg (Contributor)

I'm seeing about three different things being raised here:

  • Asking GHC to build a single package in parallel
  • Telling a test suite to run its test cases in parallel
  • Running multiple test suites from a single package in parallel

stack does none of these right now. The only thing stack parallelizes is the processing of individual packages. Does that answer your question?

As for what stack should be doing... I don't see a downside to passing -j to GHC when available. I'd rather avoid running test suites in parallel, but there's really no strong reason for that. I don't see how stack can have any impact on the insides of the test suite itself, since that's entirely up to the test framework.

@borsboom (Contributor)

One tricky thing is deciding how many threads GHC should be running if stack is running multiple builds (i.e. if you have 8 CPUs and stack is running 8 builds, each GHC shouldn't be running 8 of its own threads). Simplest might be to only pass -j if stack is only building a single package.

@borsboom borsboom added this to the Later improvements milestone Jul 21, 2015
@rrnewton (Contributor, Author)

Given that neither stack nor cabal nor GHC has a notion of global resource management on the machine, I guess in the medium term what I'd like is enough knobs to experiment with this.

For example, it would be great to be able to "turn it up" to max parallelism -- parallel packages, parallel targets within a package, parallel modules within ghc --make plus parallel tests. And then do some benchmarking to see what speedups and memory usage look like.

We can already vary things pretty easily by generating .cabal/.stack files that pass the right arguments through to GHC and to the test-suites. I guess the critical thing on which to get help from stack itself is the third bullet @snoyberg mentioned -- multiple test-suites (plus profiling/non-profiling/documentation) in parallel within one package, which corresponds to haskell/cabal#2623.
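
For reference, the usual way to "pass the right arguments through to the test-suites" is the threaded-RTS idiom in the .cabal file. A sketch, assuming an exitcode-stdio-1.0 suite (the names spec and Spec.hs are placeholders) whose framework, e.g. tasty, can spread tests across the RTS capabilities:

test-suite spec
  type:          exitcode-stdio-1.0
  main-is:       Spec.hs
  build-depends: base
  ghc-options:   -threaded -rtsopts "-with-rtsopts=-N"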

By the way, as far as I know, no one's properly analyzed GHC builds from a parallel-algorithms perspective. I.e. we need to profile the dependence graph and do a work/span analysis to figure out what is limiting our scaling. (We're working on a research project where we may be hacking on fine-grained parallelism inside certain GHC phases, but it only makes sense if there's a coherent picture for parallelism at the coarser grain too.)

In lieu of a GHC server mode (which has been discussed on various issues), we can't directly implement an inter-process work-stealing policy that mimics the now-standard intra-process load balancing approach. But we can do what that Gentoo builder seems to be doing and simply delay some tasks to manage resource use. The more look-ahead or prior knowledge we have about the task DAG, the smarter we can be in prioritizing the critical path.

@snoyberg (Contributor)

I'm tempted to close this issue, and just add a link to my comment above to the FAQ. Any objection?

@snoyberg snoyberg self-assigned this Jul 31, 2015
@rrnewton (Contributor, Author)

That's fine. Fixing this so that stack does something smart would be a big project, not solved in a day, and the FAQ entry addresses the "document" part.

@snoyberg (Contributor)

Added here: https://github.com/commercialhaskell/stack/wiki/FAQ#how-does-stack-handle-parallel-builds-what-exactly-does-it-run-in-parallel

@alexanderkjeldaas (Contributor)

It would be great if this issue, since it's linked from the FAQ, explained how to build at least my own package in parallel using some GHC option, possibly specified in the .cabal file or as an option to stack.

Alternatively, it should say explicitly that this cannot be done.

@mgsloan (Contributor) commented Jun 7, 2016

@alexanderkjeldaas No such flag is necessary; stack will build your package in parallel with other things if it can.

Perhaps you mean having ghc build modules in parallel? Unfortunately, in my experience this doesn't speed things up as much as I'd hoped. You can do --ghc-options -j5 (or whatever number).
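
For example (a sketch; --ghc-options simply forwards the flag to the GHC invocations stack runs):

stack build --ghc-options -j4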

@alexanderkjeldaas (Contributor)

Yes, that option is interesting - ghc seems to be able to use 6x the CPU and finish in exactly the same time. Impressive!

@sjakobi (Member) commented Jun 8, 2016

Yes, that option is interesting - ghc seems to be able to use 6x the CPU and finish in exactly the same time.

Related GHC ticket: https://ghc.haskell.org/trac/ghc/ticket/9221

I did some simple timings on my machine (i3-2350M, 2 physical cores + hyperthreading) and always got the shortest build times with -j2 or -j3. The relative speedup varied a lot depending on the package, e.g. ~30% with vector-algorithms vs. ~10% with haskell-src-exts.

I was wondering how hard it would be to detect when stack isn't using all of its build threads and, in that case, pass -j2 or -j3 to the build jobs until all threads are used.
Build times would probably still be quite far from their optimum, but I don't believe this could result in build times worse than the status quo.

@Blaisorblade (Collaborator)

Related question (might need its own issue): how do I tell stack to set -j4 by default for itself, aside from ghc-options? I found nothing in http://docs.haskellstack.org/en/latest/yaml_configuration/#non-project-specific-config, http://docs.haskellstack.org/en/latest/faq/ or by googling.

Studying this issue suggests that builds are already parallel, and stack's source suggests it defaults to -j $(getNumProcessors) (already in 1.1.2), which sounds good (depending on the answers to the other questions), and that this can be tuned through e.g. jobs: 4 in config.yaml.
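
For example, in the non-project-specific config (jobs is the documented key; 4 is just an example value):

# ~/.stack/config.yaml: applies to all invocations
jobs: 4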

@mgsloan (Contributor) commented Jun 13, 2016

stack build -j2. It's among the options listed in stack --help

@Blaisorblade (Collaborator)

@mgsloan I'm asking for docs on setting that by default, for all invocations.

@alexanderkjeldaas (Contributor)

Also, a different setting for the current project only would make sense: dependencies might be built in parallel, but the current project won't be.


@runeksvendsen

Can anyone tell me how many parallel ghc processes stack will spawn? This is distinct from GHC's -j option, which sets the level of parallelism within each ghc process; I'm asking how many of these ghc processes stack (or is it cabal?) keeps running at the same time to build dependencies in parallel.

I'm benchmarking some code on a 32-core machine, and stack seems to spawn only 8 concurrent ghc instances (building 8 dependencies in parallel), resulting in at most 25% utilization (8 of the 32 cores).

Based on my testing, this figure should be at least the number of available CPU cores, and perhaps as much as five times that number, since ghc often seems to have a hard time fully using even a single core (doing IO, I presume). If we set it to 5*NUM_CORES, then each individual ghc process could use (on average) as little as 20% of one core and we'd still be using all cores fully.
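
A sketch of that experiment on Linux, using only stack's documented -j flag plus nproc:

# oversubscribe package-level builds by 5x, per the reasoning above
stack build -j"$((5 * $(nproc)))"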

An actual use case for this would be automatically spawning the build process onto high-CPU VMs, so we can build stuff in 2 minutes rather than half an hour.

@mgsloan (Contributor) commented Aug 9, 2016

@runeksvendsen By default, it will build NUM_CORES packages concurrently.

An actual use case for this would be automatically spawning the build process onto high-CPU VMs, so we can build stuff in 2 minutes rather than half an hour.

Often we can't actually build that many packages concurrently (the dependency graph imposes an order), so your CPUs remain unsaturated. You can pass -j2 and similar to ghc via --ghc-options -j2. Unfortunately, in my experience this hasn't helped build times as much as I'd hoped.

@alexanderkjeldaas (Contributor) commented Oct 29, 2016

@snoyberg the link above ("Added here: https://github.com/commercialhaskell/stack/wiki/FAQ#how-does-stack-handle-parallel-builds-what-exactly-does-it-run-in-parallel") is dead, and the new FAQ links back to this issue, making it circular: this issue was closed because the topic was supposedly documented there.

Reopen the issue?

@alexanderkjeldaas (Contributor)

Also, a separate issue that I'll mention here: a -j low-memory option would be good to have for CI machines with limited RAM. Pre-fetching in parallel is not a problem, and maybe configuring isn't either, but building is.

@Blaisorblade Blaisorblade reopened this Oct 29, 2016
@Blaisorblade (Collaborator)

Reopened as requested. After the issue was closed, it seems more than one question was asked and not addressed by the docs (sorry if I'm wrong).
Re -j low-memory: how low? Your request makes sense, but I ask because right now, at least on machines with 1G of RAM, GHC tends to segfault rather than report that allocation failed.

@alexanderkjeldaas (Contributor)

I actually don't know what stack does by default, but I just tried to set up CI on buddy.works, and stack build gives me the following out of the box:

thyme-0.3.5.5: copy/register

--  While building package JuicyPixels-3.2.8 using:
      /home/app/.stack/setup-exe-cache/x86_64-linux/setup-Simple-Cabal-1.24.0.0-ghc-8.0.1 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.24.0.0 build --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure (-9) (THIS MAY INDICATE OUT OF MEMORY)
    Logs have been written to: /mysecretproject/.stack-work/logs/JuicyPixels-3.2.8.log
    Configuring JuicyPixels-3.2.8...
    Building JuicyPixels-3.2.8...
    Preprocessing library JuicyPixels-3.2.8...

So what I need is some quick fix to make CI work out of the box.

@alexanderkjeldaas (Contributor)

For practical purposes, something like MAKEFLAGS for stack would be nice to have when stack is called from within some other build system. In that case it would be easy to slap a STACKBUILDFLAGS=-j1 <somebuildthing> in front to see if it solves the problem, instead of having to retrofit injecting stack build options through that other tool.

Not a big issue, but if someone is going to look at this, might as well add it.
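
In the meantime, a small wrapper placed earlier on PATH can approximate this; a sketch using the hypothetical STACKBUILDFLAGS variable from above:

#!/bin/sh
# hypothetical 'stack' shim: splice extra flags from the environment
# into every invocation, then delegate to the real binary
exec /usr/local/bin/stack ${STACKBUILDFLAGS} "$@"

(STACKBUILDFLAGS is deliberately left unquoted so that several flags can be passed at once.)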

@runeksvendsen

Re -j low memory: how low?

For what it's worth, I experienced this on a 600M RAM f1-micro Google Cloud instance:

runesvend@cloudstore-test:~/code/test-gcloud-datastore$ stack install
Downloaded nightly-2016-09-15 build plan.    
Updating package index Hackage (mirrored at https://github.com/commercialhaskell/all-cabal-hashes.gi

Fetched package index.    
Populated index cache.    
stack: out of memory (requested 1048576 bytes)
runesvend@cloudstore-test:~/code/gcloud-datastore$ free -h
             total       used       free     shared    buffers     cached
Mem:          594M       195M       398M       4.2M       2.8M       113M
-/+ buffers/cache:        80M       514M

@Blaisorblade (Collaborator)

  • For what it's worth, I don't recommend trying with less than 2G of RAM, plus enough swap enabled. Most failures otherwise stem from a single GHC instance, and stack can't do much about those; @runeksvendsen's log shows a failure in stack itself, but GHC requires far more memory, so I see little point in trying to fix that case.
  • By default, stack sets --jobs to the number of processors (a reasonable default). By default, stack does not tell GHC to build multiple modules of a package at once, unless explicitly configured otherwise in ~/.stack/config.yaml via ghc-options.
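
A sketch of that opt-in (the "*" key applied options to all packages in stack 1.x; worth checking against the current docs):

# ~/.stack/config.yaml
ghc-options:
  "*": -j2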

@alexanderkjeldaas You might want to run stack build -j1 (which forces building at most one package at a time) and see if that helps (it might: your trace looks like stack is running at least two builds at once).

@runeksvendsen

I can confirm that I had to completely give up trying to build my project on a 600M RAM machine. It worked OK to begin with, and it built GHC fine, but the closer it got to actually finishing, the faster all RAM was consumed.

I found a 1.7G RAM machine to be sufficient, however, although the build process sometimes requires a restart due to the occasional out-of-memory error (which, as mentioned, can be avoided, at the cost of capping concurrency/performance, by using e.g. -j1).

@metaleap

By default, it will build NUM_CORES packages concurrently.

Quick note for the maintainers of the Windows build, in case they're not already using it: NUMBER_OF_PROCESSORS will typically be set in a like manner (certainly on "pro" editions, server editions, and developer machines).

@Anrock commented May 21, 2018

@Blaisorblade

By default, stack uses --jobs to the number of processors (a reasonable default)

Is that actually the case on Windows? I believe I'm seeing a difference in build time between running stack build and stack build -j16 on my Win10 machine.

@Blaisorblade (Collaborator) commented May 25, 2018

@Anrock That should be correct for Windows too; to debug, please describe your machine (maybe in a new issue?). If you have hyperthreading, it's not obvious whether "number of processors" counts physical cores or logical threads, though it appears to count threads by default.

Just to double-check, please try calling by hand the underlying GHC API we use, GHC.Conc.getNumProcessors — example session below (my machine has 4 cores and 8 threads, but I have no Windows machine):

$ ghci
GHCi, version 8.4.2: http://www.haskell.org/ghc/  :? for help
Prelude> import GHC.Conc
Prelude GHC.Conc> getNumProcessors
8

Sources I consulted:

@Anrock commented May 25, 2018

@Blaisorblade false alarm, it works as expected. I did some benchmarking, and the results for builds with --jobs 8 and without it are the same.

As a note: I'm running Win10 Pro 1803 on an AMD FX8350 with 4 physical cores and hyperthreading, so 8 logical cores total. getNumProcessors returns 8.

@snoyberg (Contributor)

There are no clear steps to be taken here, closing. If people would like to see doc improvements in the FAQ, please consider sending a PR.

@ProofOfKeags commented Nov 25, 2020

Is it possible to get stack to build a single package with module-level parallelism? I find that building the Cabal library is often a bottleneck in the dependency graph of many projects, and I'd like to be able to force that one to build in parallel, since it has 234 modules as of Nov '20. I saw some discussion upthread about not wanting every package to run its own parallelism at the same level, since that might cause a lot of CPU thrash.

Ideally the entire build system would share a single work queue rather than forcing packages into a particular "lane". As someone with a 24-core dev machine, I would find this immensely useful, but I'm not sure how to go about thinking about it.

I suspect that the design of Cabal itself might be the limiting factor here, but I don't know enough to say one way or the other. Is there any workaround such that stack build [package] would use package-level parallelism for dependencies but module-level parallelism for the top-level target? If so, we could exert a bit more control over the whole process by running a series of commands that build "points of interest" in the graph to speed it up.
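
For what it's worth, something close to that split may already be available: by default, command-line ghc-options apply only to local packages (apply-ghc-options defaults to locals), not to snapshot dependencies. A sketch of the workaround, under that assumption:

# -j8 reaches only the local target; dependencies keep package-level parallelism
stack build mypackage --ghc-options=-j8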

@chadbrewbaker (Contributor)

Thumb twiddling on a multicore box. Would this option be hard to add, mostly for building cabal packages in parallel?

stack build --genmake; make -j

@nikita-volkov commented Feb 11, 2022

One tricky thing is deciding how many threads GHC should be running if stack is running multiple builds (i.e. if you have 8 CPUs and stack is running 8 builds, each GHC shouldn't be running 8 of its own threads). Simplest might be to only pass -j if stack is only building a single package.

This is solvable with a simple algorithm.

Stack knows how many parallel builds it's running at the end of each package build. It can also know how many CPUs are available, if it gets to control how many each build should use. So if it's at a stage where it can only build one package and has 4 CPUs available, it can start that build with -j4. If it has two packages to build, it can start both with -j2. If it has five, it leaves one package in the queue and starts 4 builds without -j.
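
A minimal sketch of that allocation rule in Haskell (hypothetical names; not stack's actual scheduler):

-- Split the available CPUs evenly among the packages that are ready now.
-- A Job pairs a package with the -jN value its GHC would receive.
data Job = Job { pkg :: String, ghcJobs :: Int } deriving Show

schedule :: Int -> [String] -> ([Job], [String])
schedule cpus ready
  | n == 0    = ([], [])
  | n >= cpus = -- more ready packages than CPUs: start 'cpus' of them,
                -- one CPU each (no -j), and leave the rest in the queue
      (map (`Job` 1) (take cpus ready), drop cpus ready)
  | otherwise = -- fewer packages than CPUs: give each an even share
      (map (`Job` (cpus `div` n)) ready, [])
  where n = length ready

-- schedule 4 ["a"]                  -> one build with -j4
-- schedule 4 ["a","b"]              -> two builds with -j2 each
-- schedule 4 ["a","b","c","d","e"]  -> four builds without -j, "e" queued

Note this decides -j only at the moment a build starts; CPUs freed a moment later stay idle until the next package begins, which is the adversarial case raised in the next comment.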

@rrnewton (Contributor, Author)

Yes, that sounds like a reasonable heuristic. Note, however, that an adversarial schedule can defeat that strategy: a bunch of jobs may finish and free up CPUs right after you commit to -j4. Kunal Agrawal at Washington University studies this two-level job-scheduling problem, and there are some meaningful results in the area.

But in GHC's case the problem is also made a bit easier by the fact that GHC's internal parallelism is not very scalable. So telling GHC to use 32 cores wouldn't make sense...

@hasufell (Contributor)

i.e. if you have 8 CPUs and stack is running 8 builds, each GHC shouldn't be running 8 of its own threads

This is already a wrong assumption carried over from the C world, where jobs = CPUs is a somewhat OK heuristic. In Haskell, a single package can blow through 16GB of RAM, and two such packages built in parallel can bring your entire system down. This happened frequently at one of my previous companies with pandoc+amazonka: we had to run stack with -j1 to avoid triggering the OOM killer or causing swapping that made the machine unresponsive for 15+ minutes.

@mistmist

There's a relatively simple solution for the two-level problem: implement the GNU make jobserver protocol at all levels; a sketch of the client side follows the links below.

See for example cargo + rustc:

rust-lang/cargo#1744
rust-lang/rust#42682
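
The protocol itself is small: the top-level process creates a pipe holding one token per allowed extra job, and every cooperating child reads a token before starting extra work and writes it back when done. A minimal sketch of the client side in Haskell, assuming the pipe endpoints were already obtained as Handles (e.g. parsed from MAKEFLAGS --jobserver-auth and converted with fdToHandle); this is not code that make, cargo, or stack actually ships:

import Control.Exception (bracket)
import System.IO (Handle, hGetChar, hPutChar, hFlush)

-- Acquire one jobserver token around an action: read a byte from the
-- shared pipe (blocks until another job releases one) and always write
-- it back afterwards, even if the action throws.
withToken :: Handle -> Handle -> IO a -> IO a
withToken rd wr act =
  bracket (hGetChar rd)
          (\tok -> hPutChar wr tok >> hFlush wr)
          (const act)

If stack handed such a pipe to every GHC it spawned, and each GHC -j worker held a token while compiling, the whole process tree would stay within one machine-wide budget, which is exactly how make -j keeps recursive builds bounded.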
