Ability to generate queries with distinct/group by #915
This has certainly been a frequently requested feature. My only hesitation is making "distinct" another reserved word, meaning it would block any filters on a column that happened to be called "distinct". However, like "select", this one is a SQL reserved word too, so it makes sense that we reserve it as a special URL parameter. I hesitate to try messing with the URL parsing and query generation part of the code myself; that's more @steve-chavez's domain. Perhaps he can estimate the difficulty of this feature.
Not sure if this should be implemented; by allowing `distinct` we would make it easy for clients to run slow queries. I've bumped into a slow distinct query in PostgreSQL a while ago and solved it by using a recursive CTE: https://dba.stackexchange.com/questions/93158/how-to-speed-up-select-distinct As mentioned in one of the answers, you can mostly avoid doing `SELECT DISTINCT` by restructuring the query. So in general I think `distinct` is a sign the query should be reworked.
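For context, the technique the linked answers describe is a recursive CTE that emulates a "loose index scan". A sketch of that pattern (the `tags(name)` table and an index on `name` are assumptions for illustration, not from the comment above):

```sql
-- Emulates SELECT DISTINCT name FROM tags without scanning every row:
-- each recursive step jumps via the index to the next distinct value.
WITH RECURSIVE distinct_names AS (
  (SELECT name FROM tags ORDER BY name LIMIT 1)
  UNION ALL
  SELECT (SELECT t.name
          FROM tags t
          WHERE t.name > d.name
          ORDER BY t.name
          LIMIT 1)
  FROM distinct_names d
  WHERE d.name IS NOT NULL
)
SELECT name FROM distinct_names WHERE name IS NOT NULL;
```

With many duplicate values per distinct key, this runs in time proportional to the number of distinct values rather than the table size.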
@steve-chavez this is a good point against this. PostgREST should expose only safe/fast operations.
For what it's worth,
I have many situations where I've wished for postgrest to have a distinct operator.
The query would have to be behind a stored procedure/view; it's definitely necessary to use
When it's behind a view/stored procedure then you can no longer support the arbitrary filters that I love about postgrest.
In that case I think it's a requirement that we include the functionality.
If you are afraid of performance issues, then perhaps we could only allow distinct when a JWT claim (or other setting?) permits it?
You can create a stored procedure that
But the filtering needs to happen before the distinct operation.
There will always be a fight between the two opposing needs: expose more SQL features vs. protecting the database, and striking the balance is hard. In this particular thread I am with @steve-chavez for now. Now, if features like this http://www.databasesoup.com/2015/02/why-you-might-need-statementcostlimit.html make it into core, we might relax a bit and expose more of the SQL; in that situation I would go with implementing the group by (while at the same time having the default setting for the cost limit at a low value). Should we close this for now? I don't see it going further in the immediate future.
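For reference, the linked article describes a cost ceiling enforced at plan time. A sketch of how that could look, assuming the `pg_plan_filter` extension from the article and a hypothetical `web_anon` API role (neither is prescribed by this thread):

```sql
-- Hypothetical setup: load the planner hook, then cap the estimated cost
-- per role. Any statement whose planned cost exceeds the limit errors out
-- before execution instead of running slowly.
LOAD 'plan_filter';
ALTER ROLE web_anon SET plan_filter.statement_cost_limit = 100000;
```

With something like this in place, exposing `distinct`/`group by` would at worst produce an immediate error rather than a long-running query.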
This is probably my most wanted feature, and it certainly gets requested a lot in the chat room. Perhaps we could have a toggle in the config file.
@ruslantalpa I also think that the idea of the cost limit is a good one; even if it's not fully reliable, a max limit could be worked out. That is also a must for other features like multiple requests in a transaction. I made a related comment on "Customize query timeouts" #249; maybe we should continue the discussion there and see if we can implement that feature independently.

@daurnimator That toggle could be easily abused by the users; I don't think it should be added to the config. This phrase summarizes my reasoning: "A well-designed system makes it easy to do the right things and annoying (but not impossible) to do the wrong things." Users should think twice before exposing expensive operations.
@daurnimator Have you had a look at pREST? It exposes more SQL directly, at the cost of potential performance issues, so it may satisfy your use case.
Just chiming in to request this enhancement be implemented. Distinct and group by will make quite a few views unnecessary in our case.
If one's really careful and uses the format function, this can be done with a proc and dynamic SQL. Example for DISTINCT:

```sql
create function projects(dist name) returns setof projects as $_$
begin
  return query execute format($$
    select distinct on(%I) * from projects
  $$, dist);
end; $_$ language plpgsql;
```

Then you can use all of the pgrst filters normally:

```shell
# filtering
curl "localhost:3000/rpc/projects?id=eq.1&dist=name"
# selecting
curl "localhost:3000/rpc/projects?select=id,name&dist=name"
```

Still, exposing DISTINCT to clients should be thought about twice.
I've thought about this and came up with a solution. First you have a function like this one, and then you define the view like so.
This is the way 😁. The first parameter of validate can be any boolean expression, and you can make arbitrarily complex checks on the incoming requests; for example, on a time series table one could check that both start/end dates are provided and the diff is no more than a month. For this to work, the query string has to be made available to the function.
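The referenced snippets don't survive in this excerpt; a minimal sketch of the pattern being described (the `validate` signature, the `request.qs` variable, and the `events` view are my assumptions, not the original code):

```sql
-- A helper that raises unless a condition holds, so it can be embedded
-- in a view's WHERE clause to reject unrestricted requests.
CREATE FUNCTION validate(valid boolean, err text) RETURNS boolean AS $$
BEGIN
  IF NOT valid THEN
    RAISE EXCEPTION '%', err;
  END IF;
  RETURN true;
END;
$$ LANGUAGE plpgsql;

-- Hypothetical view over a big table: require an id filter by inspecting
-- the raw query string, assuming the server exposes it as a setting.
CREATE VIEW safe_events AS
SELECT *
FROM events
WHERE validate(
  current_setting('request.qs', true) LIKE '%id=eq.%',
  'an id filter is required'
);
```

Any request hitting `safe_events` without the required filter would then fail with the given error instead of scanning the whole table.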
So the general idea would be to protect an endpoint against unwanted, unlimited requests by throwing an error for those. I guess you could achieve that in a function called through the pre-request hook.
If that one refers to the URL query string, then it would be better as `request.qs`.
Support for aggregates looks cool. Personally I would prefer the required-filter approach. Come to think of it, the required-filter approach looks like an extended version of pg-safeupdate.
About the values: you are talking as if the programming env in pg is some crippled thing; it's trivial to write a function to split the query string.
How about checking a custom header for validity based on data in the DB? That's exactly what I am doing for a subdomain-based multi-tenancy setup. The proxy sets the subdomain as a header and redirects the request. If that client/subdomain doesn't exist, I need to throw. I wouldn't know how to do that without such a hook.
That's for sure a positive, I agree. Although you can turn that around as well: if you need multiple endpoints to be limited in a very similar way, you can reduce code duplication a lot by putting that in a shared function. But anyway: when setting the query string parameters as variables, both approaches can be taken.
I think it was just a different name instead of the one suggested above.
It will not be as trivial, though, once you start adding nested logic. I thought about adding the already parsed filters for a second, but I doubt that will be any less complex, because this would mean we had to pass the whole logic tree in. I'm not sure whether that would be worth it, tbh; but if we wanted to have the logic tree available, passing it in as a variable could work.

Given the complexity of implementing something that really works well not only for a couple of cases, but also in general, I wonder whether we should maybe take a completely different route to solve the original problem: what do you guys think about adding some kind of limit on the expected cost of the query? So if we were to do something roughly along the lines of:

```sql
BEGIN;
SET request.x = ...;           -- all kinds of variables, like we currently do (no need for .get. / .qs.)
EXPLAIN read_query;
SET pgrst.estimated_cost = <from the explain>;
SELECT pre_request_function(); -- can use the estimated cost here to throw if you'd like
read_query;                    -- or use the estimated cost in the query implicitly:
                               -- a VIEW definition throwing like Ruslan suggested
COMMIT;
```
We could even provide a sane default for that cost limit.
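Under that scheme, the user-side check could live in the pre-request function. A sketch, assuming the server really did set a `pgrst.estimated_cost` variable as in the pseudocode above (the function name and threshold are made up):

```sql
-- Hypothetical pre-request function: read the cost estimate stored for
-- this request and refuse to run queries above the ceiling.
CREATE FUNCTION check_cost() RETURNS void AS $$
DECLARE
  cost numeric := COALESCE(current_setting('pgrst.estimated_cost', true), '0')::numeric;
BEGIN
  IF cost > 10000 THEN
    RAISE EXCEPTION 'estimated query cost % exceeds the limit', cost;
  END IF;
END;
$$ LANGUAGE plpgsql;
```

This keeps the policy in the database, so each deployment can pick its own threshold (or make it role-dependent) without any new config options.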
@wolfgangwalther Yeah, I believe that approach would also be worth exploring. Check this related issue: #249 (comment) (some ideas around cost limits there).
This reminds me of this project's syntax (it got good feedback on HN), where they also do aggregates without a group by. So I agree, we don't necessarily have to expose full SQL, and if we can improve/restrict its syntax we should do it (the http QUERY method will also require a DSL from us).
Hm, isn't that a reserved character in a URL?

It's a reserved character, yes, but not all reserved characters need percent-encoding everywhere. See RFC 3986 2.2, which only requires percent-encoding when data would conflict with a reserved character's use as a delimiter, and RFC 3986 3.4, which defines which characters the query component itself allows.
In #2164 (comment) I mentioned that it would be helpful to think about a syntax to express whether filters apply before or after aggregation (i.e., WHERE vs. HAVING). There are probably a few ways to do that, but the one that I have in mind currently is like this:
Edit: One thing I am not sure about, yet, is what happens when we use aggregation inside embeddings. Example:
Ok, should work fine. Just wanted to write those down to see it. |
Looks amazingly consistent 🥇
So right now the syntax is:
One question: does it ever make sense to have an aggregate function call with two columns as input?

```sql
select aggregate(col1, col2) from tbl t
```

Would we need an additional syntax for that?
I'd say no. Those cases seem special enough that we don't need to support them out of the box with our http syntax. It's still possible to create custom aggregates on the whole row that use those two columns internally.
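As an illustration of that escape hatch, a custom aggregate over the whole row type can consume two columns internally while still looking like a single-argument call (the `orders` table and `price`/`qty` columns here are made up for the sketch):

```sql
-- Weighted average of price by qty, taking the whole orders row as input.
CREATE FUNCTION wavg_step(state numeric[], r orders) RETURNS numeric[] AS $$
  SELECT ARRAY[state[1] + r.price * r.qty, state[2] + r.qty];
$$ LANGUAGE sql;

CREATE FUNCTION wavg_final(state numeric[]) RETURNS numeric AS $$
  SELECT CASE WHEN state[2] = 0 THEN NULL ELSE state[1] / state[2] END;
$$ LANGUAGE sql;

CREATE AGGREGATE wavg(orders) (
  SFUNC    = wavg_step,
  STYPE    = numeric[],
  INITCOND = '{0,0}',
  FINALFUNC = wavg_final
);
```

Since `wavg` takes a single argument (the row), it fits a one-column http aggregate syntax without any special two-column support.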
Have you guys given up on group by? It is just as dangerous as any other filter. This is extremely important for basic use cases without the extra work of a View. J |
Why do you think so? We have 175 open issues. Anything we've given up on is closed. |
Because this is a necessary feature for basic use cases (like sorting a table by reactions, votes), and it has been 2 years with no updates. If you have to use VIEWS for basic use cases, the library is incomplete. I 100% disagree with the dangerous argument here. Even with that argument, there are safety things you could put into place. J |
That's a bold statement.
Not true. You could either say: it has not been implemented for more than 4 years since the issue was opened. Or you could say: there were no updates since February 16th of this year.
100% agree. There's so many ideas to make PostgREST better - I doubt we will be "complete" any time soon.
Yes, certainly. We discussed a few here and elsewhere already. Maybe you can help us put those in place. PR welcome! |
I don't see how. Any app that is past a todo app is going to need total counts.
Ok, there has been no action for more than 4 years, and there are controversial negative comments disputing the usefulness of this. Just because you can do anything with a view doesn't mean you should have to for basic use cases.
I think I meant complete with basic features. I wasn't trying to sound negative. It is a trigger for me personally when people turn down feature requests for one-sided reasons, and nothing gets done for years because of it. It is definitely fair to say PRs are welcome. That being said, I know nothing of Haskell. GraphQL handles a lot of these cases with aggregations, which works due to the nature of graphs. I just want to make sure this feature is not "turned down" more than anything. J |
An alternative syntax to the above:

```
GET /employees?select=total:$f.sum(salary)
```

This is related to the dollar operators proposal on #2125 (comment).
By any chance, can we do GROUP BY now with the PostgREST API?
I wanted to get the distinct tags for this schema:

```sql
CREATE TABLE tags (
  name text,
  post_id uuid REFERENCES posts(post_id) ON DELETE CASCADE,
  created_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (post_id, name)
);
```

So, using some computed fields hack, I found a use for it in filters:

```
GET /tags?distinct_tag=is.true
```

Choose your version:

**Compare post_id**

```sql
CREATE OR REPLACE FUNCTION distinct_tag(tags)
RETURNS boolean AS $$
  SELECT NOT EXISTS (
    SELECT 1
    FROM tags
    WHERE name = $1.name
    AND post_id < $1.post_id
  );
$$ LANGUAGE sql;
```

**Positive equivalent**

```sql
CREATE OR REPLACE FUNCTION distinct_tag(tags)
RETURNS boolean AS $$
  SELECT EXISTS (
    SELECT 1
    FROM tags t
    WHERE t.name = $1.name
    AND t.post_id <= $1.post_id
    EXCEPT
    SELECT 1
    FROM tags t
    WHERE t.name = $1.name
    AND t.post_id < $1.post_id
  );
$$ LANGUAGE sql;
```

**DISTINCT ON**

```sql
CREATE OR REPLACE FUNCTION distinct_tag(tags)
RETURNS boolean AS $$
  SELECT EXISTS (
    SELECT 1
    FROM (SELECT DISTINCT ON (name) post_id, name FROM tags) subquery
    WHERE name = $1.name
    AND post_id = $1.post_id
  );
$$ LANGUAGE sql;
```

**Window Functions**

```sql
CREATE OR REPLACE FUNCTION distinct_tag(tags)
RETURNS boolean AS $$
  SELECT EXISTS (
    SELECT 1
    FROM (
      SELECT name, post_id, ROW_NUMBER() OVER (PARTITION BY name ORDER BY post_id) AS rn
      FROM tags
    ) subquery
    WHERE rn = 1
    AND name = $1.name
    AND post_id = $1.post_id
  );
$$ LANGUAGE sql;
```

There might also be a way to use "in" and "isdistinct", but I'm not sure. But at what point is an RPC just simpler?

```sql
CREATE OR REPLACE FUNCTION distinct_tags()
RETURNS SETOF tags AS $$
  SELECT DISTINCT ON (name) * FROM tags
$$ LANGUAGE sql;
```

Of course in reality we would usually want to sort by the count too:

```sql
CREATE OR REPLACE FUNCTION tag_count(tags)
RETURNS bigint AS $$
  SELECT COUNT(*) FROM tags
  WHERE name = $1.name;
$$ LANGUAGE sql;
```

```
GET /tags?select=name,count:tag_count&distinct_tag=is.true&order=tag_count
```

I find these computed fields much more manageable than the longer alternatives. Of course, if it just worked like this:

```
GET /tags?distinct=name&order=distinct.asc
```

we maybe wouldn't need all this mess. Either way, cool to know some hacks.

NOTE: If you can simplify any of my examples, please let me know :)

J
With PostgREST 12's support for aggregate functions, this can now be done directly.
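For the tags example discussed above, something along these lines should work (I'm assuming PostgREST 12's documented aggregate syntax, where `count()` needs no associated column and grouping is implied by the other selected fields):

```
GET /tags?select=name,count()
```

which would correspond roughly to `SELECT name, count(*) FROM tags GROUP BY name`.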
Thanks for adding support for aggregate functions! I'll admit I'm a little lost on how this solves the group by issue, though. For example, here's a test_results table:

```typescript
{
  attachments: string[]
  attempts: number[]
  attributes: Database["public"]["Enums"]["test_attribute"][]
  created_at: string
  id: number
  notes: string
  result: number
  test_group_id: number
  test_type: Database["public"]["Enums"]["test_type"]
  user_id: string
}
```

Doing a `select *` will return
Could you:

This would make it much easier to understand what you're trying to achieve.
PostgREST should have a way to perform a `SELECT DISTINCT` query. I suggest a new reserved query word 'distinct':

- `select distinct`: e.g. `/mytable?distinct`
- `SELECT DISTINCT ON(....)`: e.g. `/mytable?distinct=name` would be `SELECT DISTINCT ON(name) * FROM mytable`

Note that the columns that you are distinct on will also need to have order applied.