Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support filtering top-level resource based on embedded resource filter #1949

Merged
merged 2 commits into from
Oct 4, 2021

Conversation

steve-chavez
Copy link
Member

@steve-chavez steve-chavez commented Sep 15, 2021

Closes #1075. Allow filtering the top-level resource based on the embedded filter ** with the LATERAL/GROUP BY approach mentioned on #1075 (comment), like so:

GET /projects?select=id,clients!inner(id)&clients.id=eq.2

[{"id":3,"client":{"id":2}},
 {"id":4,"client":{"id":2}}]

# the usual left join would give
GET /projects?select=id,clients(id)&clients.id=eq.2
[{"id":1,"clients":null},
{"id":2,"clients":null},
{"id":3,"clients":{"id":2}},
{"id":4,"clients":{"id":2}},
{"id":5,"clients":null}]

This PR will only cover tables, views will need more work and smartness from our schema cache; mainly because when doing GROUP BY, pg is able to infer the pk on a table but not a view, more details here.

Edit: Thanks to Iced-Sun's query below(#1949 (comment)), this will now work for both tables and views.

This behavior can also be enabled by default with the following config option

db-embed-default-join="inner"

Which saves the need for specifying !inner on every request. In this case, you can go back to the previous behavior per request by specifying !left on the embedded resource, e.g /projects?select=*,clients!left(*)&clients.id=eq.12

Edit: The db-embed-default-join config was removed in #2034

Steps

  • many-to-one relationships
  • one-to-many relationships
  • many-to-many relationships
  • decide on syntax for the inner join
  • config for default embed mode
  • CHANGELOG

This PR won't cause a breaking change. Embedding will still use subqueries by default, only on !inner it will use LATERAL.

Hints used like `select=projects.client_id(*)` were already
deprecated.

'!' should be used from now on `select=projects!client_id(*)`
@wolfgangwalther
Copy link
Member

Sorry for not bringing this up earlier in the issue, but looking at the example ...

GET /projects?select=id,clients!inner(id)&clients.id=eq.2

... it suddenly strikes me as odd to have something in the select= part change the number of rows. Imho, the select= part should only change the shape of the output / each row - but should not filter anything.

Since we already have embedded filters as <alias>.<column>= - how about extending that to support a special operator on only <alias>=?

Maybe something like this could make it an inner join:

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=inner

@steve-chavez
Copy link
Member Author

steve-chavez commented Sep 16, 2021

... it suddenly strikes me as odd to have something in the select= part change the number of rows. Imho, the select= part should only change the shape of the output / each row - but should not filter anything.

@wolfgangwalther True. I was mostly going from what we discussed in #915 (comment). Mainly that the embedding was a function and that the syntax for parameters would be func!param1!param2 so embed!join_type!hint(..).

Since we already have embedded filters as .= - how about extending that to support a special operator on only =?

I like the direction. Maybe we could have have something like:

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=join.inner

In fact, the hints could also be expressed on the query param:

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=join(inner).hint1,hint2

(The join(inner)) syntax would be similar to our full text search with the lang)

That keeps the select param cleaner. It could also provide a path for doing the future function syntax:

GET ?select=avg&avg=args.amount

@steve-chavez
Copy link
Member Author

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=join(inner).hint1,hint2

I guess one drawback of that syntax is that it's more verbose(both for the REST interface and libraries probably).

@wolfgangwalther
Copy link
Member

Mainly that the embedding was a function and that the syntax for parameters would be func!param1!param2 so embed!join_type!hint(..).

I see. The "embedding as a function" pov only holds true for the SELECT part, though - at least if we think of "function" as a "PostgreSQL FUNCTION". This is because we can't replicate an inner join with a computed column function.

I like the direction. Maybe we could have have something like:

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=join.inner

Hm, I think it's enough to specify the inner only. That's assuming join is the only operator anyway. I can imagine we could support inner and anti (for an ANTI-JOIN) down the road. Not sure about whether right and/or full make any sense. But I don't see anything other than join, so we might as well omit it.

In fact, the hints could also be expressed on the query param:

GET /projects?select=id,clients(id)&clients.id=eq.2&clients=join(inner).hint1,hint2

(The join(inner)) syntax would be similar to our full text search with the lang)

That keeps the select param cleaner. It could also provide a path for doing the future function syntax:

GET ?select=avg&avg=args.amount

Assuming we'd keep the ability to put hints (and therefore the embedding specification) in only one place (either query param or select), this would be a breaking change.

Once we're at that stage, where we implement such a big breaking change to the query syntax, I'd have a few other points to raise to change in the query syntax. Two biggest things that come to mind are correct quoting / escaping and moving the operator to the left of the equal sign like e.g. column.eq=value. This would make things a fair bit easier for plain-html-forms, to support no javascript environments. Also the column+operator used tends to be quite static, while the value used for the query is not - separating the static part from the dynamic part via = (which is supported by all the client librarys for query params) seems the logical choice.

So, I suggest to just go with the non-breaking change for this PR - and then open another issue where we can discuss a "new query syntax" that could get a few things straight. We could implement both at the same time and have them configurable via config option, to be able to slowly deprecate the old syntax.

@steve-chavez
Copy link
Member Author

Hm, I think it's enough to specify the inner only. That's assuming join is the only operator anyway. I can imagine we could support inner and anti (for an ANTI-JOIN) down the road. But I don't see anything other than join, so we might as well omit it.

It would be too inconsistent at the parsers level, I'm not even sure if it can be done cleanly. All of our filters have a prefix., with the sole exception of function args, which where originally meant to be called with the arg. prefix(arg1=arg.value) but that was decided against because of verbosity. This is done by checking if the rpc route is being used(all filters without prefix are args).

and moving the operator to the left of the equal sign like e.g. column.eq=value. This would make things a fair bit easier for plain-html-forms, to support no javascript environments.

Hm, if this is meant for the eq filter, we could have a similar rule as the one mentioned above regarding the function arg: for table routes, default to eq for filters with no prefix.(so id=1 would work). This has been asked before on an issue IIRC, also on reddit. There would be no big breaking change for this.

Assuming we'd keep the ability to put hints (and therefore the embedding specification) in only one place (either query param or select), this would be a breaking change.

I was thinking to have both actually, so no breaking change. I'm still not set on either though. For now I'll continue with the select=..!inner, doing the queries is the hardest part about this PR.

@wolfgangwalther
Copy link
Member

Hm, if this is meant for the eq filter,

No, I propose to have all filters on the left side. id.lt=5, id.gt=5, ...

I was thinking to have both actually, so no breaking change.

Imho, that would make it quite complicated to parse. You could have conflicting targets/hints/... in both parts - this will be very hard to efficiently create a good query.

@Iced-Sun
Copy link

Iced-Sun commented Sep 21, 2021

The LATERAL/GROUP BY approach for resource embedding with inner join depends on the fact that pg could group by pk even when the group keys include other columns (group by pk, c1, c2, c3 => group by pk). But the inference doesn't happen for a view, nor a not null unique key, hence a potentially expensive full group-by must be applied.

I propose to implement resource embedding with correlated join, which is essentially the same as the current approach of correlated subquery, but with the extensibility to support left/inner/anti-joins.

Specifically, instead of translating the left-join resource embedding of GET /projects?select=id,clients(id)&clients.id=eq.2 to

with pg_source as (
  select projects.id, coalesce(
           (select json_agg(clients.*)
              from (select clients.id 
                      from clients
                     where clients.project_id = projects.id and clients.id = 2
              ) clients), 
         '[]') as clients
    from projects
)
select coalesce(json_agg(_postgrest_t), '[]')::character varying as body
from (select * from pg_source) _postgrest_t;

a correlated join version could be

with pg_source as (
     select projects.id, coalesce(nullif(clients_clients.clients::text, '[null]'), '[]')::json AS clients
       from projects
  left join lateral (select json_agg(clients) clients 
                       from (select clients.id 
                               from clients
                              where clients.project_id = projects.id and clients.id = 2
                            ) clients
                    ) clients_clients
         on clients_clients.clients is not null
)
select coalesce(json_agg(_postgrest_t), '[]')::character varying as body
from (select * from pg_source) _postgrest_t;

An inner-join resource embedding is then

with pg_source as (
     select projects.id, coalesce(nullif(clients_clients.clients::text, '[null]'), '[]')::json AS clients
       from projects
       join lateral (select json_agg(clients) clients 
                       from (select clients.id 
                               from clients
                              where clients.project_id = projects.id and clients.id = 2
                            ) clients
                    ) clients_clients
         on clients_clients.clients is not null
)
select coalesce(json_agg(_postgrest_t), '[]')::character varying as body
from (select * from pg_source) _postgrest_t;

And an anti-join resource embedding is

with pg_source as (
     select projects.id, coalesce(nullif(clients_clients.clients::text, '[null]'), '[]')::json AS clients
       from projects
       join lateral (select json_agg(clients) clients 
                       from (select clients.id 
                               from clients
                              where clients.project_id = projects.id and clients.id = 2
                            ) clients
                    ) clients_clients
         on clients_clients.clients is null
)
select coalesce(json_agg(_postgrest_t), '[]')::character varying as body
from (select * from pg_source) _postgrest_t;

The correlated-join approach doesn't rely on pk inference, works on view and not null unique key. And the performance is essentially the same as subquery (#1075 (comment)).

Does that make any sense?

@Iced-Sun
Copy link

BTW, the ability to do aggregation and flatten join seems to be a popular requested feature (#1233, #211, #1126, #915 and others). Do we have any plan to support explicit join/group by?

Say, something like

  1. GET /projects,projects_traits?select=projects.id,projects_traits.category
  2. GET /projects?select=count(*)&groupby=zone
  3. GET /projects,project_traits?select=count(*),projects_traits.category&groupby=projects_traits.category

Look like it will take a huge effort.

@steve-chavez
Copy link
Member Author

steve-chavez commented Sep 23, 2021

@Iced-Sun Awesome! Your approach is much better since it doesn't rely on GROUP BY and it also works on views 🎉 🎉

So to understand it better, taking my approach on #1075 (comment), the query for inner joins would now be

WITH pg_source AS (
   SELECT "test"."clients"."id",
          "test"."clients"."name",
          "projects_projects"."projects" AS "projects"
   FROM "test"."clients"
   INNER JOIN LATERAL (
     SELECT json_agg("projects") "projects" -- important change here
     FROM(
       SELECT "test"."projects"."id", "test"."projects"."name"
       FROM "test"."projects"
       WHERE "test"."projects"."client_id" = "test"."clients"."id") "projects"
   ) AS "projects_projects" ON "projects_projects"."projects" IS NOT NULL -- important change here
)
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

Where the main changes are the correlated join(which avoids the need for GROUP BY) plus the LATERAL ... ON <target> IS NOT NULL.

As you pointed out, this practically has the same cost that our correlated subquery has(cost=2229.33..2229.34). The correlated join cost(2261.08..2261.09) is higher than the GROUP BY(125.21..125.22) query but of course it doesn't have any of the downsides - which is definitely good enough.

(The above is just my simplistic view, I think you've determined that GROUP BY is actually worse in perf)

So I'll proceed with the correlated join approach for this feature.

Do we have any plan to support explicit join/group by?

Yes, but besides the syntax, we still need a way to protect against a potentially expensive group by, this was discussed on #915.

@Iced-Sun
Copy link

As you pointed out, this practically has the same cost that our correlated subquery has(cost=2229.33..2229.34). The correlated join cost(2261.08..2261.09) is higher than the GROUP BY(125.21..125.22) query but of course it doesn't have any of the downsides - which is definitely good enough.

(The above is just my simplistic view, I think you've determined that GROUP BY is actually worse in perf)

I did some simple performance tests in #1075 (comment). It suggests that the GROUP BY approach is much more performent in the absence of a proper index; while with create index on project (client_id), the correlated join (and correlated subquery) overtakes marginally (but not so marginal for a small result set). So I think group by may be preferable if we didn't have the pk problem to consider because it is more stable (on perf).

I have no solid facts on grouping by full-size keys (in the case that group-by-pk is not applicable), but I do have some not-so-good experiences when I have to group by multiple text columns to work around bad schema design.

As you said, correlated join doesn't break things or bring surprises, and it unifies the left/inner/anti-join resource embedding.

Thanks for the great piece of software.

@steve-chavez steve-chavez changed the title Support embedding with inner join(tables only) Support embedding with inner join Sep 24, 2021
@steve-chavez
Copy link
Member Author

.. it suddenly strikes me as odd to have something in the select= part change the number of rows. Imho, the select= part should only change the shape of the output / each row - but should not filter anything.

@wolfgangwalther Been thinking about this, why is it "odd" to change the shape of the output on select? Because when using count(and we'd have more aggregates later) this happens:

GET /projects?select=count

[{"count":5}]

And of course in SQL the same thing happens - not only WHERE modifies the rows, SELECT also does.

@wolfgangwalther
Copy link
Member

Been thinking about this, why is it "odd" to change the shape of the output on select?
[...]
[...] in SQL the same thing happens - not only WHERE modifies the rows, SELECT also does.

That's odd for me, already. For me, using an aggregate function without a GROUP BY basically adds an implicit GROUP BY with a constant value. I would have liked that to be more explicit in SQL, too. But that ship has sailed for a long time, I guess.

@steve-chavez
Copy link
Member Author

steve-chavez commented Sep 28, 2021

That's odd for me, already. For me, using an aggregate function without a GROUP BY basically adds an implicit GROUP BY with a constant value

@wolfgangwalther I see. I've also always find it odd that these queries work

select p.json_agg from test.projects p;
select p.count from projects p;
select p.array_agg from projects p;
-- same through postgrest

(I think we discussed this somewhere before)

Edit: Also, using a different !hint in select can change the number of rows.

I would have liked that to be more explicit in SQL, too. But that ship has sailed for a long time, I guess.

Yeah. So considering this, WDYT about just going with the select=*,embed!inner(*) syntax? Considering select is not consistent, having an embed=join.inner syntax only adds more verbosity.

@steve-chavez steve-chavez force-pushed the inner-join branch 3 times, most recently from 1adedef to b1a42c0 Compare September 29, 2021 23:23
@steve-chavez steve-chavez changed the title Support embedding with inner join Support filtering top-level resource based on embedded resource filter Sep 29, 2021
@steve-chavez steve-chavez marked this pull request as ready for review September 29, 2021 23:27
@steve-chavez
Copy link
Member Author

Alright, this is now ready for a final review. I've added tests for views(with m2o/o2m/m2m cases), rpc and embedding with hints.

I've also added a db-embed-default-join config option(mentioned above #1949 (comment)) for users that want this join type by default.

CHANGELOG.md Show resolved Hide resolved
test/io-tests/configs/expected/no-defaults-with-db.config Outdated Show resolved Hide resolved
test/io-tests/configs/expected/no-defaults.config Outdated Show resolved Hide resolved
src/PostgREST/Request/Types.hs Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
This is enabled by adding `!inner` to the embedded resource

/projects?select=*,clients!inner(*)&clients.id=eq.12

This behaviour can be enabled by default with the config option

db-embed-default-join='inner'

Which saves the need for specifying `!inner` on every request.
If this is enabled, the previous behavior can be restored
per request by specifying `!left`  on the embedded resource.

/projects?select=*,clients!left(*)&clients.id=eq.12`

Tested on M20/02M/M2M relationships, views, RPC.
@colearendt
Copy link
Contributor

colearendt commented Oct 13, 2021

How does this work with joins that are already using ! for the foreign key in use? Is it possible to specify a foreign key and an inner join?

i.e. if I have this:

select=*,alias:clients!fk_main_clients(*)

How should it switch to an inner join short of changing the default? Is this in a test case?

supplier_id int references suppliers(id),
trade_union_id int references trade_unions(id),
primary key (supplier_id, trade_union_id)
);
Copy link
Contributor

@colearendt colearendt Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a newline, FYI. Some editors struggle with opening such files, or so I have heard

@steve-chavez
Copy link
Member Author

How should it switch to an inner join short of changing the default? Is this in a test case?

@colearendt Yes, this should work like

it "works when using hints" $ do
get "/projects?select=id,clients!client!inner(id)&clients.id=eq.2" `shouldRespondWith`
[json| [{"id":3,"clients":{"id":2}}, {"id":4,"clients":{"id":2}}] |]
{ matchHeaders = [matchContentTypeJson] }

@colearendt
Copy link
Contributor

Thanks! That's perfect! I totally missed that example 🙈 Found a little bug in #1977, but this is really awesome!! Well done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

filtering on main table with embedded table criteria(inner join)
4 participants