Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

./pants --query for unifying target selection mechanisms to improve performance #7346

Closed
7 tasks
cosmicexplorer opened this issue Mar 9, 2019 · 9 comments
Closed
7 tasks

Comments

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Mar 9, 2019

Starting with #6501 (comment), it seems like it would be both useful and fun to merge our target selection options (currently: --changed-parent, --owner-of, and the filter, paths?, and dependees goals) into a composable interface. One existing example (ish) of this is the bazel query language, although that tries to output graphs, and has its own command-line goal.

Rather than making it into its own goal, it seems like making an option --query which wraps our existing target selection logic, and then allowing goals such as list, compile, or test to use the selected targets the same way they currently use e.g. --changed-parent seems like a great way to incorporate our current mechanisms, while making the use of multiple selection mechanisms significantly more performant by allowing query composition instead of invoking multiple pants processes, as well as by use of the v2 engine.

Another specific and possibly important point from #6501 (comment) is that while composing expressions in the bazel query language is done by nesting function calls, we can consider what should be a strict extension to that language by incorporating the | operator à la jq. Other extensions to that language would involve incorporating our existing selection mechanisms such as --owner-of and --changed-parent into functions in the pants --query language.

For now, the "output" would be whatever you get from ./pants --query <...> list, which is lines of text, each specifying a target address. This seems like a very good interface to start with, because the result of --query can then be immediately used by pants tasks without invoking a separate process (which should at least reduce our internal CI time). There may be other outputs we might want, but it seems we can avoid needing a complex output format in favor of adding more of our existing target selection mechanisms to the pants --query language.

TODO

  • Convert --owner-of and --changed-parent to be the first supported (single) expressions in a --query global option.
  • Create a method of registering --query functions which can also be used in @rules (less important), as per ./pants dependees should be deprecated in favor of a "root selection" option #6501 (comment).
  • Convert --owner-of, --changed-parent, all of the arguments to filter, and whatever else comes to mind into --query functions.
  • Parse the --query expression to allow chaining function calls with | like jq.
  • Support any functions that the bazel query language has that we don't already have existing syntax for.
  • Replace all uses of existing target selection mechanisms with --query.
  • Add any further functions that sound neat.
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Mar 9, 2019

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

I think one thing that could also be true is that the command-line target specs (and their dependencies) would always bound the size of the graph used in --query, so queries wouldn't end up traversing the filesystem (this may be obvious).

@cosmicexplorer
Copy link
Contributor Author

Preliminary research suggests that TargetRootsCalculator.create() is where we would inject v2 rules satisfying --querys:

request = OwnersRequest(sources=tuple(changed_files),
include_dependees=str(changed_request.include_dependees))
changed_addresses, = session.product_request(BuildFileAddresses, [request])
logger.debug('changed addresses: %s', changed_addresses)
dependencies = tuple(SingleAddress(a.spec_path, a.target_name) for a in changed_addresses)
return TargetRoots([Specs(dependencies=dependencies, exclude_patterns=exclude_patterns, tags=tags)])

@cosmicexplorer
Copy link
Contributor Author

Another thought -- shlexing strings may avoid parsing now, but we'll want to support keyword arguments pretty soon. Using the python ast module to parse the inputs may be an easy, hygienic, and stable way to provide a more comfortable syntax and avoid having to do any parsing work ourselves.

@cosmicexplorer
Copy link
Contributor Author

From #engine, an example of how --query composition could look:

> ./pants --query='changes_since master' --query='filter type=junit_tests' --query='dependencies transitive=True' --query='filter type=jar_library' my-random-3rdparty-jar-task ::
cosmicexplorer [18:40]
this is a contrived example
but would take the target specs (in this case ::) as the initial input
then intersect that against the target with changes since master
then filter to junit test targets, then find the transitive dependencies of those targets which are jar_library()s
then run some task on those targets (in this case my-random-3rdparty-jar-task), because this is all still just target root selection

@illicitonion
Copy link
Contributor

This is very exciting, glad to see it moving forwards! :D

./pants --query='changes_since master' --query='filter type=junit_tests' --query='dependencies transitive=True' --query='filter type=jar_library' my-random-3rdparty-jar-task ::

I find that a lot harder to read (particularly as other arguments may be interspersed between query arguments) than:
./pants --query='changes_since master | filter type=junit_tests | dependencies transitive=True | filter type=jar_library
so would rather have a single query argument, than several. Also, I don't know of any other places that repeated arguments are order-dependent...

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

The place I think this falls down is that not all operators are single argument. There are a bunch of binary-or-more-argument operators that I think are pretty hard to model nicely in this kind of pipeline.

As an example, how would you write:
deps(rdeps(foo) - rdeps(bar)) - deps(baz + rdeps(bash))? Function calls map pretty nicely to composition in pretty flexible ways, whereas in a jq style this would I guess look like:

rdeps foo | filter subquery="rdeps bar" | deps | filter subquery="rdeps(bash) | union baz | deps"

necessitating a kind of funky subquery kind of argument, but generally seeming a lot less clear to me as someone who writes a lot of function calls day-to-day...

This may seem like a contrived example, but in my experience it isn't; I've written this kind of thing before quite a bit, and found it super useful...

Other extensions to that language would involve incorporating our existing selection mechanisms such as --owner-of

FWIW the bazel way of phrasing this is kind(rule, rdeps(//..., some/file, 1)) - the first argument to rdeps is the universe in which to search for dependees, and the third argument is depth; 1 is direct dependees, and a target directly depends on the files in it.

The buck way of phrasing this is owner(some/file).

We should work out whether in pants we view targets as depending on files (as bazel does), or we consider the relationship to be something different (as buck does). We could have owner-of in either case, but it would be a useful distinction to nail down the semantics of dependencies and dependees :)

This was actually specifically the reason we went with a separate argument for --owners-of instead of taking files on the command line, so it is pretty important for us to work out whether files and targets should be mixable in this world. In particular, should I be able to say:

./pants --query="rdeps transitive=false" some/file.scala some:target? If so, what should it return?

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

I think one thing that could also be true is that the command-line target specs (and their dependencies) would always bound the size of the graph used in --query, so queries wouldn't end up traversing the filesystem (this may be obvious).

One answer buck found for this was to effectively allow templating into queries; see the "Executing multiple queries at once" section of https://buckbuild.com/command/query.html

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Mar 12, 2019

As an example, how would you write: deps(rdeps(foo) - rdeps(bar)) - deps(baz + rdeps(bash))?

I hadn't considered this! I was mostly skimming over the bazel query language spec.

Function calls map pretty nicely to composition in pretty flexible ways, whereas in a jq style this would I guess look like:
rdeps foo | filter subquery="rdeps bar" | deps | filter subquery="rdeps(bash) | union baz | deps"

I was thinking that filter would be analogous to its current implementation, and accepting the same options. I was thinking of something atrocious like:

((//foo | rdeps) - (//bar | rdeps) | deps) - (//baz + (//bash | rdeps) | deps)

After implementing some of it in #7356, I'm less concerned that we need to use any specific sort of syntax. However, I really do not appreciate nested function calls, and I would strongly prefer that there is a function composition operator. This makes it more likely people will write --query methods which can be composed with others in fun and interesting ways, instead of having one or two with many many options.

%>% operator for composition

magrittr is a truly wonderful R package that changed me, and we could consider merely adding it on top of the existing bazel query-esque syntax:

(rdeps(//foo) - rdeps(//bar) %>% deps) 
  - (//baz + rdeps(//bash) %>% deps)

which might also then be written as (for clarity):

(rdeps(//foo) - rdeps(//bar) %>% deps(.)) 
  - (//baz + rdeps(//bash) %>% deps(.))

where . is the implicit argument. As in R, this function composition operator wouldn't stop you from using function calls, so the original formulation:

deps(rdeps(//foo) - rdeps(//bar)) - deps(//baz + rdeps(//bash))

would still be allowed. I think composing queries like:

(changes_since('master')
  %>% filter(type='junit_tests')
  %>% dependencies(transitive=True)
  %>% filter(type='jar_library'))
  + (owner_of('some/file.py') %>% dependencies)

is much more readable than the alternative, much more diff-friendly, and that it encourages composition significantly more.

We could keep | for composition, but since the bazel query language diverges so much from jq syntax via + and - and other binary operators (and shell scripts, for that matter), R might be a more interesting syntax to emulate. I also don't want it to be confused for some sort of union operation.

EDIT: and of course, magrittr uses %>% because % is how you define infix operators in R, but we could do merely > -- but %>% is part of a consistent existing syntax, and it's pretty visually striking and unambiguous, and these all seem like positives.

EDIT2:

One answer buck found for this was to effectively allow templating into queries; see the "Executing multiple queries at once" section of https://buckbuild.com/command/query.html

The above solution could also satisfy the example provided in that section -- buck's:

> buck query "testsof(deps( %s ))" target1 target2 target3

could become pants's:

> ./pants --query "deps \
    %>% rdeps(transitive=False) \
    %>% filter(type='python_tests')" \
  test \
  target1 target2 target3

We wouldn't be able to immediately replicate the buck example using files in format strings:

> buck query "owner( %s )" main.cpp myclass.cpp myclass.h

But I was actually thinking we would keep the --owner-of argument as is (see #7356), and if we really need to read multiple --owner-of arguments from files (which sounds like it would happen if we're invoking pants multiple times and not using --query enough!), we could always do it the old-fashioned way:

> ./pants --query "owner_of($(printf "'%s', " <input-files.txt))" list

so I don't think we would need to include that formatting in the build tool itself.

EDIT3:

We should work out whether in pants we view targets as depending on files (as bazel does), or we consider the relationship to be something different (as buck does). We could have owner-of in either case, but it would be a useful distinction to nail down the semantics of dependencies and dependees :)

I personally appreciate having the dual filedeps goal and --owner-of in pants, and it's not clear what kinds of queries would need to look at the files that a target owns, and then go back and look at the targets which own them, since that should be the same thing. Having rdeps on files and multiple universes seems like a neat parlor trick to me right now (who isn't familiar with any structure internal to bazel which would motivate these design decisions), and having files also act like targets in the query language, depending on whether the files are surrounded by the right arguments to e.g. rdeps, feels pretty confusing and difficult to read to me in comparison to consistently understanding that owner_of() is the way to turn file paths into target specs.

Also, deps or rdeps accepting a depth argument feels like a bit of a hack to me compared to a transitive=True kwarg. It feels like appropriate application of filter(type=..., tags=...) in a %>% pipeline might be able to satisfy the use case there, unless you're trying to do a graph algorithm.

I haven't thought about this for too long and would love to see a case where using rdeps() of files as well as targets enhances understanding of what the query is doing.

@stuhood
Copy link
Member

stuhood commented Apr 5, 2019

While we're on syntax, if it's not going to be | to pipe, consider ., which would slightly align with Haskell's function composition: https://stackoverflow.com/a/12526130

@thejcannon
Copy link
Member

Now that we have peek which outputs structured data (JSON, and since xargs and jq exist, I think users have everything they need to do this kind of thing on the CLI themselves instead of us making new ways to filter.

I'm reminded of the UNIX philosophy of doing one thing (and doing it well).

@Eric-Arellano
Copy link
Contributor

Now that we have peek which outputs structured data (JSON, and since xargs and jq exist, I think users have everything they need to do this kind of thing on the CLI themselves instead of us making new ways to filter.

Generally agreed. We need to document more recipes though: #14969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants