Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC of WIP: Concurrent Cargo #1764

Closed
wants to merge 12 commits into from
Closed

RFC of WIP: Concurrent Cargo #1764

wants to merge 12 commits into from

Conversation

Byron
Copy link
Member

@Byron Byron commented Jun 27, 2015

In the current state, using cargo concurrently on the same system results in undesired behaviour.

One way to alleviate this would be to use the local file-system's locking facilities to lock resources before trying to use or adjust them. Both Windows (LockFileEx(...)) and Unix (fcntl(...)) provide such facilities, which should allow a cross-platform implementation with equivalent behaviour.

There are two ways to deal with locks:

  • Abort gracefully when a lock could not be acquired without blocking (similar to try_lock()). This is equivalent to the current behaviour, except that the cargo process who managed to acquire the lock will not be disturbed by the competing process.
  • Wait for a lock, and try to avoid duplication of work when the lock was acquired (similar to lock()). This allows multiple concurrent cargo processes to work collaboratively.

The choice above should be configurable.

In the past days I have been adjusting the file-lock crate, which will eventually provide the required cross-platform locking functionality. Currently it is only implemented for Non-Windows though.

Caveats

  • file-system based locking may not be fully supported on all file-systems. Network filesystem can be problematic, even though both SMB and NFS provide locking support. If one cannot trust the file-system, cargo will unwillingly revert to the previous, somewhat undefined behaviour.
  • It's unsafe to delete a lock-file after creation, which means there will be additional (possibly hidden) files in the CARGO_HOME and target-dirs.

RFC

I am submitting the PR early to get a chance for early feedback. This shall help to get the code on track soon, instead of having a whole pile of unusable nonsense later. As these are my first steps within the cargo codebase (which is teaching me every day), code-quality will certainly need to improve in many ways. Therefore I will be happy to make adjustments as soon as a review is coming in.

This PR's text will be updated throughout its lifetime, based on comments and reviews.

Progress

  • configurable locking semantics
    • Currently using build.lock-kind, with support for wait and nowait.
  • build different packages without shared dependencies
    • unit-test
    • works thanks to lock on Registry::open() and Registry::do_update()
  • build same package
    • unit-test
  • fetch same package (more testing required)
    • unit-test
  • build different package with shared dependencies
    • unit-test
  • build documentation of various targets with shared dependencies into same target-dir
    • unit-test
    • It currently rebuilds even existing documentation, which might be a different issue.

Issues to fix

Issues mentioned here are the ones I am quite aware off, even though I have no clear idea on how to alleviate them. Input on these topics are particularly welcome.

  • cargo-rustc now allows invalid/half-finished files (see code)
  • CargoLock::lock(...) method is hacky (look here and here

Testing

Currently I test manually, which is accomplished as follows:

$ git clone https://github.com/Byron/google-apis-rs
$ cd google-apis-rs
# Build 4 targets with plenty of dependencies in parallel. Depending on the `build.jobs` variable 
# in `.cargo.config`, there is more or less intra-process contention.
$ CARGO_HOME=$PWD/cargo_home make -j4 groupsmigration1-cli-cargo discovery1-cli-cargo translate2-cli-cargo audit1-cli-cargo ARGS=build

The .cargo/config contained in the aforementioned repository is used to configure cargo. Currently it looks like this:

[build]
target-dir = "target"
lock-kind = "wait"

At the current time, intra-process contention seems to be the most troublesome, as I get deadlock avoided errors when trying to wait on a lock ... sometimes.

The lock type wraps the file_lock::FileLock into an interface suitable
for cargo's error handling.

It uses the cargo configuration to determine the locking semantics.

* build.lock-kind = "nowait" is non-blocking, which is the default
  to work similarly to what people have grown to except. After all,
  cargo fails ungracefully if there is any kind of concurrent access.
  In future, it would fail in a controlled fasion.
  Once we are sure about this, the default might change to 'wait'
* build.lock-kind = "wait" is blocking
Namely these are:

* `open(...)`
* `do_update(...)`
* `download_package(...)`

The general intention is to add an additional check for the originally
desired result and to verify it in case it already exists. That way,
duplication of work can be prevented (for example in case of
`download_package(...)`).
@rust-highfive
Copy link

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

Previously, a non-existant branch was set
by accident.
It might be easier to read as well, and doesn't incour unnecessary
overhead for converting the desired value to String, and parse it
back.

Also added latest version of Cargo.lock which contains the actual
branch we need in the `file_lock` crate.
It's primariily cool to see that there is concurrency going on, but
the user shouldn't have to care.

In most cases, there is no attempt concurrently download anything
as the crate will already be present.

Here is [an example](https://goo.gl/JqBrlI) on how it looked previously.
This will make the type easier to use, especially in concjunction with
utilities that don't necessarily have a `Config` instance available
at all times.

The adjustment will help implementing locking for the `Layout` type.
It's somewhat unlikely the directory creation fails, and I'd clearly
prefer to lock the entire build process on a higher level.
This is interesting when locks cannot be acquired.
It also shows that we probably want to provide the lock file path
as well in case a lock fails (for instance due to the `nowait` mode).

Added `debug!` information to indicate when we try to get a lock.
This is an intermediate version as it shows how locking can be done
on JobQueue level. However, as we usually run multiple targets for the
same package and stage pair at the same time, the FileLock
implementation may fail with `Deadlock avoided` errors. Apparently
it doesn't like multiple threads of the same process try to get an
exclusive lock. Admittedly, I don't understand why this is.

Nonetheless, I am confident that each thread should simply stay
on its own resource lock, which would require us to provide
additional information with each job. The `TargetKind` *should* do it.
The `lock` implementation will now indefinitely attempt to aquire a lock
unless the result has nothing to do with `deadlock`.

When there is high intra-process contention thanks to threads,
this issue occours even though I would hope that we don't try to
to have multiple threads obtain a lock on the same lock file.
However, apparently this does happen.
Even if it does I don't understand why this is a deadlock for him.
The good news is that building now works in my manual tests.

On the bright side, this made it work in its entirety, and serves
as proof-of-concept.
Just because this is not really an issue to handle.

[skip ci]
@alexcrichton
Copy link
Member

Nice work! Of the two strategies you've outlined (abort or handle locking) I personally favor aborting currently as it seems like there's not a great way to force Cargo to think about file locking just yet. I haven't given it an inordinate amount of thought of what needs to be locked and where, but whenever I've done so I've reached the conclusion that it's easiest to just have a global lock on all operations.

I think that it'd be a good idea to handle more fine-grained locking, but I'm worried about the ability to statically ensure that we handle locking at all the appropriate locations. For example this PR currently sprinkles around a few locks here and there, but I'd be much more comfortable if there was one central location to acquire a lock from and otherwise paths just weren't accessible at all.

For now I feel like it might be best to just try to acquire an exclusive lock on a global file for each Cargo invocation and just abort if there's another Cargo instance running elsewhere as it's what I feel is the most robust solution that can be added now. I also want to make sure that if Cargo dies unexpectedly (e.g. receives a signal) that the locks are cleaned up (e.g. released).

@Byron
Copy link
Member Author

Byron commented Jun 30, 2015

... but whenever I've done so I've reached the conclusion that it's easiest to just have a global lock on all operations.

I agree, and believe this would be a great starting point for cargo. It should also be something that could be working relatively soon.

I also want to make sure that if Cargo dies unexpectedly (e.g. receives a signal) that the locks are cleaned up (e.g. released).

Please note that the lock implementation is not alike the one used by Git. The latter opens a file with O_EXCL and thus can tell if a lock is already present. It will abort as only option. The implementation you see here relies on flock and friends, which works with existing files only but will also allow to wait on a lock. However, this locking style cannot safely remove lock file it created, as others might be waiting on it, and as it doesn't create them with O_EXCL.
This shouldn't be an issue though, it would just mean the said global lock would have to remain on disk.

For example this PR currently sprinkles around a few locks here and there [...]

That is true, and it's what I came up with while searching for the 'highest possible' lock locations. Maybe there are better ones especially for locks on the folder structure initialization. The lock in the JobQueue though seems to be particularly well working, and very fine-grained too.

For now I feel like it might be best to just try to acquire an exclusive lock on a global file for each Cargo invocation and just abort if there's another Cargo instance running elsewhere as it's what I feel is the most robust solution that can be added now.

I agree, and hope you are fine with me using the current lock implementation, i.e. wait is still supported, and could be provided as an option in cargo/config.
As this is quite a different route compared to the fine-grained locking I attempt here, I would certainly create a new PR for it.

However, I am not sure if I should abandon this one as it actually works for me except for some hick-ups that can surely be fixed. This PR is like a proof of concept, that would just need more work to work correctly. Having two PRs (fine-grained and global locking) with different times to completion seems acceptable to me. What do you think ?

@alexcrichton
Copy link
Member

Having two PRs (fine-grained and global locking) with different times to completion seems acceptable to me. What do you think ?

My worry about the approach taken in this PR is that there's very little static guarantee that the appropriate locks are held for various operations, and as Cargo grows to do more things in the future it seems inevitable that locks will be forgotten and/or mismanaged. I would prefer to see an approach that provides more of a static guarantee that a lock is held rather than sprinkling a few locks here and there throughout the codebase.

The case I'm trying to avoid is that we get a constant trickle of bug reports about concurrent invocations causing weird bugs (if they're expected to work) and we continue to patch up small areas where locks are needed and/or need to be cleaned up. If we instead just take a system-package-manager or git-like approach where we just abort if something else is running I feel like it's not necessarily the end of the world. We can guarantee that corruption and weird results won't happen, give nicer error messages, etc.

I suspect that running cargo concurrently is not used that widely, so I'm not sure if it's worth adding all the support to do so.

@Byron
Copy link
Member Author

Byron commented Jul 1, 2015

I agree, and am closing this PR in favour of a new one which will focus on a single, global cargo lock. As there are legitimate uses for non-blocking, wait-like functionality (see comment), I will keep the respective part from this PR alive, knowing that reviews will bring it into shape.

Thanks for your input, I found it very valuable.

@Byron Byron closed this Jul 1, 2015
@Byron Byron mentioned this pull request Jul 4, 2015
2 tasks
Byron added a commit to Byron/google-apis-rs that referenced this pull request Jul 15, 2015
This allows to build everything concurrently without failure provided
the latest cargo is used.

rust-lang/cargo#1764

It's still very early in development, but works for me nevertheless.

[skip ci]
@colin-kiegel
Copy link

Speaking of 'wait-like' functionality - I have written a tiny wrapper script for cargo on linux. Of course it's only a temporary workaround - not a final solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants