Allow arbitrary directories to be Snapshotted #5801

Merged

Conversation

illicitonion (Contributor) commented May 10, 2018

We don't cache anything intermediate here; we just dump the Snapshot into the Store and return it.

This is intended to be a short-term transition path to allow BinaryUtils
to be consumed in the v2 engine.

illicitonion requested a review from stuhood on May 10, 2018 16:46
stuhood (Member) commented May 10, 2018

As discussed in Slack, I think option (2) is the way to go, via a native method like: PyResult scheduler_capture_snapshots(Scheduler*, char*, Value*, uint64_t), which would take a base path and a list of Values wrapping PathGlobs, and return a value wrapping a list of Snapshots.

stuhood (Member) left a comment

Thanks!

Ok(PosixFS {
  root: Self::canonicalize(root)?,
  pool: self.pool.clone(),
  ignore: self.ignore.clone(),
stuhood (Member)

The ignore patterns make this a bit challenging I think... the patterns used here are relative to the root of the repository/buildroot, and so we construct the Gitignore instance with an empty root. But if you were going to be capturing from a subdirectory of the repository and you still wanted to apply the same patterns, you'd need to pass in the portion of the subdirectory relative to the buildroot (afaik).

This is to account for patterns like:

$ cat .gitignore
/please/ignore/this/long/specific/file.sh

I think that not applying ignore patterns to this type of capture would be reasonable, and would probably be fine for now. But if we're going to apply them, I think we'd need to re-construct them as relative to the buildroot (if the capture was under the buildroot), or ignore the patterns if the capture was outside the buildroot.
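
To make the re-rooting concrete, here is a minimal sketch (illustrative only, not pants code; the helper name and the prefix-based handling of anchored patterns are assumptions):

# Illustrative only: re-expressing a buildroot-anchored ignore pattern relative
# to a capture root that sits underneath the buildroot.
import os

def reroot_pattern(pattern, buildroot, capture_root):
  # Unanchored patterns (no leading '/') apply at any depth, so they can be reused as-is.
  if not pattern.startswith('/'):
    return pattern
  rel = os.path.relpath(capture_root, buildroot)
  anchored = pattern.lstrip('/')
  if anchored.startswith(rel + '/'):
    # Drop the capture root's prefix so the pattern is anchored at the capture root instead.
    return '/' + anchored[len(rel) + 1:]
  # The anchored pattern points outside the capture root, so it cannot match anything there.
  return None

# Capturing from <buildroot>/please/ignore, the buildroot-relative pattern becomes
# anchored at the capture root instead:
print(reroot_pattern('/please/ignore/this/long/specific/file.sh', '/repo', '/repo/please/ignore'))
# -> /this/long/specific/file.sh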

stuhood (Member)

This is still an issue I think.

illicitonion (Contributor, Author)

Oops! Done

@@ -216,6 +216,8 @@
PyResult execution_add_root_select(Scheduler*, ExecutionRequest*, Key, TypeConstraint);
RawNodes* execution_execute(Scheduler*, ExecutionRequest*);

PyResult capture_snapshot(Scheduler*, char*, Value);
stuhood (Member)

In general, for synchronous methods with concurrent/async implementations, I think it's a good idea to expose batch interfaces by default, as that sends the right signal about usage (that a caller should attempt to batch their calls wherever possible). Would recommend making this a batch method.
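
As a rough illustration of the batching point (hypothetical names, not the actual pants API): one batch entry point lets a concurrent implementation overlap the work, whereas N single-item synchronous calls serialize the waiting.

# Hypothetical sketch only: a batch wrapper over a single-item blocking call,
# nudging callers toward batching so the blocking calls can overlap.
from concurrent.futures import ThreadPoolExecutor

def capture_one(path_globs):
  # Stand-in for one synchronous call that blocks until a single snapshot is captured.
  ...

def capture_snapshots(all_path_globs):
  # One batch call in, one list of snapshots out; the pool overlaps the blocking calls.
  with ThreadPoolExecutor() as pool:
    return list(pool.map(capture_one, all_path_globs))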

illicitonion (Contributor, Author)

I understand the desire, but will push back a little for now for two reasons:

  1. I don't expect people to actually call this much at all; it's really just a short-term hack. If we do see people calling this, we need to design a better API.
  2. The overhead of taking a list of things seems like overkill; doing it nicely would, I think, mean creating a new intrinsic type for PathGlobsAndRoot, passing it through everywhere, taking a pointer to them with a separate length, and projecting out each field from the tuple. Happy to do this if we end up actually needing it, but again, if we find ourselves in that place we should probably design an actual API.

stuhood (Member)

I don't expect people to actually call this much at all; it's really just a short-term hack. If we do see people calling this, we need to design a better API.

It will be used until "all" of pants is ported to the v2 engine in order to transition back and forth between local files and snapshots. And we immediately have a need for it to be concurrent I think, in order to capture ~500 jars into 500 snapshots on every jvm_compile startup. See #5762 (comment)

The overhead of taking a list of things seems like overkill; doing it nicely would, I think, mean creating a new intrinsic type for PathGlobsAndRoot, passing it through everywhere, taking a pointer to them with a separate length, and projecting out each field from the tuple.

I don't think you need a new intrinsic type, although wrapper datatypes would be useful (a rough sketch of that idea follows after the list). It would look like:

  1. project_multi for a field that contains a list of PathGlobsAndRoots
  2. project str for the root
  3. use existing PathGlobs listing code for the rest
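
A rough sketch of the wrapper-datatype idea, using a plain namedtuple purely for illustration (the name PathGlobsAndRoot comes from the discussion above; pants may define it differently):

from collections import namedtuple

# Illustrative only: a plain value type pairing a PathGlobs value with the root
# directory to capture it under.
PathGlobsAndRoot = namedtuple('PathGlobsAndRoot', ['path_globs', 'root'])

# Per the steps above, the batch native method would take a list of these,
# project out the root string and the PathGlobs from each entry, and reuse the
# existing PathGlobs listing code to produce one Snapshot per entry.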

illicitonion (Contributor, Author)

Before jumping into discussion here: This is now done.

I don't expect people to actually call this much at all; it's really just a short-term hack. If we do see people calling this, we need to design a better API.

It will be used until "all" of pants is ported to the v2 engine in order to transition back and forth between local files and snapshots.

I was expecting this to be a stop-gap until BinUtils is migrated into v2, and somehow exposed to v1 tasks...

And we immediately have a need for it to be concurrent I think, in order to capture ~500 jars into 500 snapshots on every jvm_compile startup. See #5762 (comment)

I think I maybe don't understand #5762 then... What are those 500 jars? My understanding is that we already create snapshots for targets when we parse BUILD files for them (that's why ./pants list :: fills up ~/.cache/pants/lmdb_store), but that we throw away the references to the Snapshots when we create the v1 Target type. I thought that #5762 was basically a "don't throw away those references" task, plus some "clean up some legacy stuff which means we have some convoluted paths" task, rather than a "do all of those snapshots twice" kind of task...?

stuhood (Member)

I was expecting this to be a stop-gap until BinUtils is migrated into v2, and somehow exposed to v1 tasks...

There are lots of existing tasks that download things that will need to be ported/bootstrapped into the new model in order to be remoted... maybe this is the last integration that we will do starting from "mid graph", and all other task porting will be bottom up (and thus run resolvers in the datacenter with network access...?), but I have been assuming that ivy/coursier/npm/yarn/go/etc would need bootstrapping into Snapshots for now.

stuhood (Member) commented May 17, 2018

I think I maybe don't understand #5762 then... What are those 500 jars?

The jars fetched by ivy and coursier: the 3rdparty dependencies. The ivy/coursier tasks run during the resolve goal, rather than during build file parsing.

@@ -286,6 +286,14 @@ def run_and_return_roots(self, execution_request):
    self._native.lib.nodes_destroy(raw_roots)
    return roots

  def capture_snapshot(self, root_path, path_globs):
    result = self._native.lib.capture_snapshot(self._scheduler, root_path, self._to_value(path_globs))
    value = self._from_value(result.value)
stuhood (Member)

Worth lifting out a little _from_pyresult helper?
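
For illustration, such a helper might look roughly like the following sketch (result.value appears in the diff above; the is_throw flag is an assumption about PyResult's shape, not confirmed against the actual struct):

def _from_pyresult(self, result):
  # Unwrap a PyResult coming back over the FFI boundary: convert the wrapped
  # value into a Python object, and raise it if the native side flagged a throw.
  value = self._from_value(result.value)
  if result.is_throw:
    raise value
  return value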

illicitonion (Contributor, Author)

Done

})
}

fn capture_snapshot_inner(
stuhood (Member) commented May 13, 2018

Keeping this file slimmer and making this an impl Scheduler method would be preferable I think.

illicitonion (Contributor, Author)

Done. Slightly ugly because I can't use ? in a function returning PyResult, but it's probably a reasonable separation.

stuhood force-pushed the dwagnerhall/binutil/snapshotoutsideroot branch from e8c3279 to 874350f on May 13, 2018 18:18
stuhood (Member) commented May 13, 2018

(rebased this to get a useful travis run)

illicitonion force-pushed the dwagnerhall/binutil/snapshotoutsideroot branch from 874350f to f7ba084 on May 16, 2018 12:05
stuhood (Member) left a comment

Thanks!

stuhood merged commit 777e500 into pantsbuild:master on May 17, 2018
illicitonion deleted the dwagnerhall/binutil/snapshotoutsideroot branch on August 15, 2018 03:28