./pants --query for unifying target selection mechanisms to improve performance #7346

cosmicexplorer · 2019-03-09T20:01:31Z

Starting with #6501 (comment), it seems like it would be both useful and fun to merge our target selection options (currently: --changed-parent, --owner-of, and the filter, paths?, and dependees goals) into a composable interface. One existing example (ish) of this is the bazel query language, although that tries to output graphs, and has its own command-line goal.

Rather than making it into its own goal, it seems like making an option --query which wraps our existing target selection logic, and then allowing goals such as list, compile, or test to use the selected targets the same way they currently use e.g. --changed-parent seems like a great way to incorporate our current mechanisms, while making the use of multiple selection mechanisms significantly more performant by allowing query composition instead of invoking multiple pants processes, as well as by use of the v2 engine.

Another specific and possibly important point from #6501 (comment) is that while composing expressions in the bazel query language is done by nesting function calls, we can consider what should be a strict extension to that language by incorporating the | operator à la jq. Other extensions to that language would involve incorporating our existing selection mechanisms such as --owner-of and --changed-parent into functions in the pants --query language.

For now, the "output" would be whatever you get from ./pants --query <...> list, which is lines of text, each specifying a target address. This seems like a very good interface to start with, because the result of --query can then be immediately used by pants tasks without invoking a separate process (which should at least reduce our internal CI time). There may be other outputs we might want, but it seems we can avoid needing a complex output format in favor of adding more of our existing target selection mechanisms to the pants --query language.

TODO

Convert --owner-of and --changed-parent to be the first supported (single) expressions in a --query global option.
Create a method of registering --query functions which can also be used in @rules (less important), as per ./pants dependees should be deprecated in favor of a "root selection" option #6501 (comment).
Convert --owner-of, --changed-parent, all of the arguments to filter, and whatever else comes to mind into --query functions.
Parse the --query expression to allow chaining function calls with | like jq.
Support any functions that the bazel query language has that we don't already have existing syntax for.
Replace all uses of existing target selection mechanisms with --query.
Add any further functions that sound neat.

The text was updated successfully, but these errors were encountered:

cosmicexplorer · 2019-03-09T21:15:06Z

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

I think one thing that could also be true is that the command-line target specs (and their dependencies) would always bound the size of the graph used in --query, so queries wouldn't end up traversing the filesystem (this may be obvious).

cosmicexplorer · 2019-03-09T21:25:06Z

Preliminary research suggests that TargetRootsCalculator.create() is where we would inject v2 rules satisfying --querys:

pants/src/python/pants/init/target_roots_calculator.py

Lines 111 to 116 in 9873b14

    
           request = OwnersRequest(sources=tuple(changed_files), 
        
                                   include_dependees=str(changed_request.include_dependees)) 
        
           changed_addresses, = session.product_request(BuildFileAddresses, [request]) 
        
           logger.debug('changed addresses: %s', changed_addresses) 
        
           dependencies = tuple(SingleAddress(a.spec_path, a.target_name) for a in changed_addresses) 
        
           return TargetRoots([Specs(dependencies=dependencies, exclude_patterns=exclude_patterns, tags=tags)])

cosmicexplorer · 2019-03-10T04:32:57Z

Another thought -- shlexing strings may avoid parsing now, but we'll want to support keyword arguments pretty soon. Using the python ast module to parse the inputs may be an easy, hygienic, and stable way to provide a more comfortable syntax and avoid having to do any parsing work ourselves.

cosmicexplorer · 2019-03-10T05:14:16Z

From #engine, an example of how --query composition could look:

> ./pants --query='changes_since master' --query='filter type=junit_tests' --query='dependencies transitive=True' --query='filter type=jar_library' my-random-3rdparty-jar-task ::
cosmicexplorer [18:40]
this is a contrived example
but would take the target specs (in this case ::) as the initial input
then intersect that against the target with changes since master
then filter to junit test targets, then find the transitive dependencies of those targets which are jar_library()s
then run some task on those targets (in this case my-random-3rdparty-jar-task), because this is all still just target root selection

illicitonion · 2019-03-11T12:29:49Z

This is very exciting, glad to see it moving forwards! :D

./pants --query='changes_since master' --query='filter type=junit_tests' --query='dependencies transitive=True' --query='filter type=jar_library' my-random-3rdparty-jar-task ::

I find that a lot harder to read (particularly as other arguments may be interspersed between query arguments) than:
./pants --query='changes_since master | filter type=junit_tests | dependencies transitive=True | filter type=jar_library
so would rather have a single query argument, than several. Also, I don't know of any other places that repeated arguments are order-dependent...

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

The place I think this falls down is that not all operators are single argument. There are a bunch of binary-or-more-argument operators that I think are pretty hard to model nicely in this kind of pipeline.

As an example, how would you write:
deps(rdeps(foo) - rdeps(bar)) - deps(baz + rdeps(bash))? Function calls map pretty nicely to composition in pretty flexible ways, whereas in a jq style this would I guess look like:

necessitating a kind of funky subquery kind of argument, but generally seeming a lot less clear to me as someone who writes a lot of function calls day-to-day...

This may seem like a contrived example, but in my experience it isn't; I've written this kind of thing before quite a bit, and found it super useful...

Other extensions to that language would involve incorporating our existing selection mechanisms such as --owner-of

FWIW the bazel way of phrasing this is kind(rule, rdeps(//..., some/file, 1)) - the first argument to rdeps is the universe in which to search for dependees, and the third argument is depth; 1 is direct dependees, and a target directly depends on the files in it.

The buck way of phrasing this is owner(some/file).

We should work out whether in pants we view targets as depending on files (as bazel does), or we consider the relationship to be something different (as buck does). We could have owner-of in either case, but it would be a useful distinction to nail down the semantics of dependencies and dependees :)

This was actually specifically the reason we went with a separate argument for --owners-of instead of taking files on the command line, so it is pretty important for us to work out whether files and targets should be mixable in this world. In particular, should I be able to say:

./pants --query="rdeps transitive=false" some/file.scala some:target? If so, what should it return?

I think making --query an option actually pairs really nicely with using a jq-like | for composition, because there would then always be a single implicit input in the pipeline -- first the literal target specs from the command line, which are the input to the first component of the | pipeline, and each function and/or operator then ferries its output to the next. This is in contrast to the explicit target specs provided to function calls in the bazel query language.

I think one thing that could also be true is that the command-line target specs (and their dependencies) would always bound the size of the graph used in --query, so queries wouldn't end up traversing the filesystem (this may be obvious).

One answer buck found for this was to effectively allow templating into queries; see the "Executing multiple queries at once" section of https://buckbuild.com/command/query.html

cosmicexplorer · 2019-03-12T02:44:20Z

As an example, how would you write: deps(rdeps(foo) - rdeps(bar)) - deps(baz + rdeps(bash))?

I hadn't considered this! I was mostly skimming over the bazel query language spec.

Function calls map pretty nicely to composition in pretty flexible ways, whereas in a jq style this would I guess look like:
rdeps foo | filter subquery="rdeps bar" | deps | filter subquery="rdeps(bash) | union baz | deps"

I was thinking that filter would be analogous to its current implementation, and accepting the same options. I was thinking of something atrocious like:

After implementing some of it in #7356, I'm less concerned that we need to use any specific sort of syntax. However, I really do not appreciate nested function calls, and I would strongly prefer that there is a function composition operator. This makes it more likely people will write --query methods which can be composed with others in fun and interesting ways, instead of having one or two with many many options.

`%>%` operator for composition

magrittr is a truly wonderful R package that changed me, and we could consider merely adding it on top of the existing bazel query-esque syntax:

(rdeps(//foo) - rdeps(//bar) %>% deps) 
  - (//baz + rdeps(//bash) %>% deps)

which might also then be written as (for clarity):

(rdeps(//foo) - rdeps(//bar) %>% deps(.)) 
  - (//baz + rdeps(//bash) %>% deps(.))

where . is the implicit argument. As in R, this function composition operator wouldn't stop you from using function calls, so the original formulation:

deps(rdeps(//foo) - rdeps(//bar)) - deps(//baz + rdeps(//bash))

would still be allowed. I think composing queries like:

(changes_since('master')
  %>% filter(type='junit_tests')
  %>% dependencies(transitive=True)
  %>% filter(type='jar_library'))
  + (owner_of('some/file.py') %>% dependencies)

is much more readable than the alternative, much more diff-friendly, and that it encourages composition significantly more.

We could keep | for composition, but since the bazel query language diverges so much from jq syntax via + and - and other binary operators (and shell scripts, for that matter), R might be a more interesting syntax to emulate. I also don't want it to be confused for some sort of union operation.

EDIT: and of course, magrittr uses %>% because % is how you define infix operators in R, but we could do merely > -- but %>% is part of a consistent existing syntax, and it's pretty visually striking and unambiguous, and these all seem like positives.

EDIT2:

One answer buck found for this was to effectively allow templating into queries; see the "Executing multiple queries at once" section of https://buckbuild.com/command/query.html

The above solution could also satisfy the example provided in that section -- buck's:

> buck query "testsof(deps( %s ))" target1 target2 target3

could become pants's:

> ./pants --query "deps \
    %>% rdeps(transitive=False) \
    %>% filter(type='python_tests')" \
  test \
  target1 target2 target3

We wouldn't be able to immediately replicate the buck example using files in format strings:

> buck query "owner( %s )" main.cpp myclass.cpp myclass.h

But I was actually thinking we would keep the --owner-of argument as is (see #7356), and if we really need to read multiple --owner-of arguments from files (which sounds like it would happen if we're invoking pants multiple times and not using --query enough!), we could always do it the old-fashioned way:

> ./pants --query "owner_of($(printf "'%s', " <input-files.txt))" list

so I don't think we would need to include that formatting in the build tool itself.

EDIT3:

We should work out whether in pants we view targets as depending on files (as bazel does), or we consider the relationship to be something different (as buck does). We could have owner-of in either case, but it would be a useful distinction to nail down the semantics of dependencies and dependees :)

I personally appreciate having the dual filedeps goal and --owner-of in pants, and it's not clear what kinds of queries would need to look at the files that a target owns, and then go back and look at the targets which own them, since that should be the same thing. Having rdeps on files and multiple universes seems like a neat parlor trick to me right now (who isn't familiar with any structure internal to bazel which would motivate these design decisions), and having files also act like targets in the query language, depending on whether the files are surrounded by the right arguments to e.g. rdeps, feels pretty confusing and difficult to read to me in comparison to consistently understanding that owner_of() is the way to turn file paths into target specs.

Also, deps or rdeps accepting a depth argument feels like a bit of a hack to me compared to a transitive=True kwarg. It feels like appropriate application of filter(type=..., tags=...) in a %>% pipeline might be able to satisfy the use case there, unless you're trying to do a graph algorithm.

I haven't thought about this for too long and would love to see a case where using rdeps() of files as well as targets enhances understanding of what the query is doing.

stuhood · 2019-04-05T18:53:41Z

While we're on syntax, if it's not going to be | to pipe, consider ., which would slightly align with Haskell's function composition: https://stackoverflow.com/a/12526130

thejcannon · 2022-05-14T13:58:44Z

Now that we have peek which outputs structured data (JSON, and since xargs and jq exist, I think users have everything they need to do this kind of thing on the CLI themselves instead of us making new ways to filter.

I'm reminded of the UNIX philosophy of doing one thing (and doing it well).

Eric-Arellano · 2022-05-14T14:02:34Z

Now that we have peek which outputs structured data (JSON, and since xargs and jq exist, I think users have everything they need to do this kind of thing on the CLI themselves instead of us making new ways to filter.

Generally agreed. We need to document more recipes though: #14969

cosmicexplorer added the engine label Mar 9, 2019

cosmicexplorer mentioned this issue Mar 10, 2019

Add a --query option for unifying target selection #7350

Closed

cosmicexplorer mentioned this issue Mar 11, 2019

[WIP] --query pipelining #7356

Closed

cosmicexplorer mentioned this issue Apr 2, 2019

add some comments to the example CI script in orgs.md #7478

Merged

stuhood mentioned this issue Apr 5, 2019

./pants dependees should be deprecated in favor of a "root selection" option #6501

Closed

Eric-Arellano mentioned this issue May 17, 2020

Add a graph-query goal to run advanced queries over targets. #3153

Closed

stuhood mentioned this issue Nov 19, 2020

Runtime graph awareness for introspection #11206

Open

Eric-Arellano mentioned this issue Dec 21, 2020

Provide a peek goal to easily view BUILD metadata from command line #11347

Merged

benjyw removed the engine label Sep 9, 2021

thejcannon closed this as completed May 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

./pants --query for unifying target selection mechanisms to improve performance #7346

./pants --query for unifying target selection mechanisms to improve performance #7346

cosmicexplorer commented Mar 9, 2019 •

edited

Loading

cosmicexplorer commented Mar 9, 2019 •

edited

Loading

cosmicexplorer commented Mar 9, 2019

cosmicexplorer commented Mar 10, 2019

cosmicexplorer commented Mar 10, 2019

illicitonion commented Mar 11, 2019

cosmicexplorer commented Mar 12, 2019 •

edited

Loading

stuhood commented Apr 5, 2019 •

edited

Loading

thejcannon commented May 14, 2022

Eric-Arellano commented May 14, 2022

./pants --query for unifying target selection mechanisms to improve performance #7346

./pants --query for unifying target selection mechanisms to improve performance #7346

Comments

cosmicexplorer commented Mar 9, 2019 • edited Loading

TODO

cosmicexplorer commented Mar 9, 2019 • edited Loading

cosmicexplorer commented Mar 9, 2019

cosmicexplorer commented Mar 10, 2019

cosmicexplorer commented Mar 10, 2019

illicitonion commented Mar 11, 2019

cosmicexplorer commented Mar 12, 2019 • edited Loading

%>% operator for composition

stuhood commented Apr 5, 2019 • edited Loading

thejcannon commented May 14, 2022

Eric-Arellano commented May 14, 2022

cosmicexplorer commented Mar 9, 2019 •

edited

Loading

cosmicexplorer commented Mar 9, 2019 •

edited

Loading

cosmicexplorer commented Mar 12, 2019 •

edited

Loading

`%>%` operator for composition

stuhood commented Apr 5, 2019 •

edited

Loading