Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the ambiguous check the same as in runtime #21

Open
wants to merge 2 commits into
base: null
Choose a base branch
from

Conversation

jongleb
Copy link

@jongleb jongleb commented Sep 6, 2024

Description

This PR adds the missing ambiguous checks for WHERE, ORDER BY, GROUP BY and HAVING for some cases which were not previously supported

One of these cases (that are reflected in tests) is:

DDL:

CREATE TABLE test21 (id INT, column_a TEXT, column_b BOOL);
CREATE TABLE test22 (id INT, column_d INT);

Queries

First
SELECT test21.id 
FROM test21 JOIN test22 ON test21.id = test22.id 
WHERE id > 2 ORDER BY id
Second
SELECT test21.id 
FROM test21 JOIN test22 ON test21.id = test22.id 
ORDER BY id

The first one fails unlike the second one, because ORDER BY, HAVING, GROUP BY allow have column without explicit referring to source if it's specified in SELECT when ambiguous fields are present.

Ambiguous Column Rule:

I have empirically tested all cases across three databases: MySQL, PostgreSQL and SQLite. Here are the main tests that are designed to be false positives (i.e., cases that should fail) along with their corresponding Playground links:

Testing on PostgreSQL
Testing on MySQL
Testing on SQLite

Comparison of versions

DDL
CREATE TABLE test21 (id INT, column_a TEXT, column_b BOOL);
CREATE TABLE test22 (id INT, column_d INT);
CREATE TABLE test2 (id INT, str TEXT);
CREATE TABLE test (id INT, str TEXT, name TEXT);

Ambiguous error - 🔴
No errors - 🟢
Warnings - 🟡
Syntax error - ❓

Query Before this PR Branch Since this PR Mysq Pgsql Sqlite Ambiguous reason/expression Playground
SELECT test21.id 
FROM test21 
JOIN test22 ON test21.id = test22.id 
WHERE id > 2;
🟢 🔴 🔴 🔴 🔴 WHERE expression Link
UPDATE test, (SELECT * FROM test2) AS x
SET str = x.str
WHERE test.id = x.id;
🟢 🔴 🔴 SET expression Link
SELECT *
FROM test21
JOIN test22 on test21.id = test22.id
GROUP BY id;
🟢 🟡 It throws
warning . But this
warning isn't
related with GROUP BY !
🔴 🔴 🔴 GROUP expression Link
SELECT *
FROM test21
JOIN test22 on test21.id = test22.id
ORDER BY id;
🟢 🟡 It throws
warning . But this
warning isn't
related with ORDER BY !
🔴 🔴 🔴 ORDER expression Link

@jongleb jongleb requested a review from rr0gi September 6, 2024 20:08
@jongleb jongleb marked this pull request as ready for review September 6, 2024 20:08
@jongleb jongleb self-assigned this Sep 11, 2024
@rr0gi
Copy link

rr0gi commented Sep 11, 2024

to clarify : this ambiguity rule is something from SQL standard (which?) or database specific (mysql?) or observed empirically?

(* use schema without aliases here *)
let p1 = get_params_of_columns env columns in
let env = { env with schema = Schema.Join.cross env.schema final_schema |> make_unique } in (* enrich schema in scope with aliases *)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enrich comment lost but it is useful

lib/syntax.ml Outdated
handle ~is_agg:false

let resolve_ambiguous_columns all_schema final_schema =
List.filter (fun s1 ->
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter only applies to all_schema - make it more clear

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was not addressed
i mean let l = List.filter ... all_schema in l @ final_schema

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought you meant to me to name it more clearly. Made more clear ( it to be not to looked that @ applies to final_shema and all_schema before filter applies to all_schema)

lib/syntax.ml Outdated
handle ~is_agg:false
handle ~is_agg:false

let resolve_ambiguous_columns all_schema final_schema =
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name resolve_ is confusing - it is not resolving anything, it is just adding alias columns to schema

@jongleb
Copy link
Author

jongleb commented Sep 11, 2024

to clarify : this ambiguity rule is something from SQL standard (which?) or database specific (mysql?) or observed empirically?

Added this part to the description

@rr0gi
Copy link

rr0gi commented Sep 25, 2024

i still hard time wrapping my head around the change
the links to db-fiddle showcase different queries than the ones in the table
the table is missing the column what is the behaviour of these queries in the databases
the second row in the table i don't see anything ambiguous

@jongleb
Copy link
Author

jongleb commented Sep 30, 2024

i still hard time wrapping my head around the change
the links to db-fiddle showcase different queries than the ones in the table

I also added the second table with queries from the db-fiddle example. The first table demonstrates examples from given from tests.

the table is missing the column what is the behaviour of these queries in the databases

I added it to the both tables.

the second row in the table i don't see anything ambiguous

Yes, and as we see it work since this PR

second_row

@rr0gi
Copy link

rr0gi commented Oct 3, 2024

Yes, and as we see it work since this PR

it works before this PR as well, there is no ambiguity there

[..]
  module List = struct
    let select_4 db  callback =
      let invoke_callback stmt =
        callback
          ~id:(T.get_column_Int_nullable stmt 0)
          ~id0:(T.get_column_Int_nullable stmt 1)
      in
      let r_acc = ref [] in
      IO.(>>=) (T.select db ("SELECT test21.id, test22.id\n\
FROM test21 \n\
JOIN test22 ON test21.id = test22.id") T.no_params (fun x -> r_acc := invoke_callback x :: !r_acc))
      (fun () -> IO.return (List.rev !r_acc))

  end (* module List *)
end (* module Sqlgg *)
Warning: this SQL statement will produce rowset with duplicate column names:
SELECT test21.id, test22.id
FROM test21 
JOIN test22 ON test21.id = test22.id

it gives the warning on duplicate column names which is valid, but it generates code alright

@rr0gi
Copy link

rr0gi commented Oct 3, 2024

SELECT id, id
FROM test21 
JOIN test22 ON test21.id = test22.id 
GROUP BY id;

this example is orthogonal to this PR, it is duplicate columns in the result set again, does not test ambiguity.

@rr0gi
Copy link

rr0gi commented Oct 3, 2024

so i mean PR is useful and i guess i understand which cases are handled, but the description in PR is still very confusing (or just plain wrong), need to fix because this is the only documentation that we have now.

@rr0gi
Copy link

rr0gi commented Oct 3, 2024

the queries in the test cases in the PR make sense, idu why the ones you pick for the documentation tables are different %)

tt "select test21.id from test21 join test22 on test21.id = test22.id order by id" [
attr' ~nullability:(Nullable) "id" Int;
] [];
wrong "select test21.id from test21 join test22 on test21.id = test22.id where id > 2 order by id";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put comment from the PR here why this one doesn't fail

Copy link
Author

@jongleb jongleb Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it fails. I call here the "wrong" function. I means assert fail. Added comment that says that wrong does assert fail

attr' ~nullability:(Nullable) "id" Int;
attr' ~nullability:(Nullable) "id" Int;
] [];
wrong "select id, id from test21 join test22 on test21.id = test22.id group by id";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wrong "select id, id from test21 join test22 on test21.id = test22.id group by id";
wrong "select id as id1, id as id2 from test21 join test22 on test21.id = test22.id group by id";

Copy link
Author

@jongleb jongleb Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept it

wrong "select id, id from test21 join test22 on test21.id = test22.id group by id";

Since this call "wrong" function, that does assert fail.
And added your example too

wrong "select id as id1, id as id2 from test21 join test22 on test21.id = test22.id group by id";

src/test.ml Outdated
wrong "select * from test21 join test22 on test21.id = test22.id group by id" ;
tt "CREATE TABLE test23 (id INT)" [] [];
tt "CREATE TABLE test24 (id INT)" [] [];
wrong "select * from foo join bar on foo.id = bar.id order by id";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is same test as above (line 390)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, deleted

src/test.ml Outdated
tt "CREATE TABLE test24 (id INT)" [] [];
wrong "select * from foo join bar on foo.id = bar.id order by id";
wrong "select * from foo join bar on foo.id";
tt "SELECT test21.id AS id, test22.id AS id FROM test21 JOIN test22 ON test21.id = test22.id" [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tt "SELECT test21.id AS id, test22.id AS id FROM test21 JOIN test22 ON test21.id = test22.id" [
tt "SELECT test21.id AS id1, test22.id AS id2 FROM test21 JOIN test22 ON test21.id = test22.id" [

nobody should write two AS id so lets not confuse ppl

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

lib/syntax.ml Outdated
Comment on lines 113 to 115
(List.length a2.Schema.Source.Attr.sources = 0
(* Check if columns are from the same table (source) *)
|| (List.length a1.sources = List.length a2.sources && List.for_all2 ( = ) a1.sources a2.sources))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(List.length a2.Schema.Source.Attr.sources = 0
(* Check if columns are from the same table (source) *)
|| (List.length a1.sources = List.length a2.sources && List.for_all2 ( = ) a1.sources a2.sources))
(* Check if columns are from the same table (source) *)
(a2.Schema.Source.Attr.sources = [] || a1.sources = a2.sources)

less confusing indentation and simplify

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is a2.sources=[] check is needed? it means if a2.sources is empty and a1.sources is not empty then result will be true, ie equality is not symmetric? 🤔

Copy link
Author

@jongleb jongleb Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is a2.sources=[] check is needed? it means if a2.sources is empty and a1.sources is not empty then result will be true, ie equality is not symmetric? 🤔

SELECT COUNT(column_a) as column_a FROM test21 WHERE column_a = @column_a

This is when we alias a column to the same name as a column in selected tables.
Maybe this is a good pattern to do that at all , but

  1. sql supports it
  2. we have this examples in our code

lib/syntax.ml Outdated
handle ~is_agg:false

let resolve_ambiguous_columns all_schema final_schema =
List.filter (fun s1 ->
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was not addressed
i mean let l = List.filter ... all_schema in l @ final_schema


let update_schema_with_aliases all_schema final_schema =
List.filter (fun s1 ->
List.for_all (fun s2 -> s2.Schema.Source.Attr.attr.name <> s1.Schema.Source.Attr.attr.name) final_schema
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about unnamed columns btw, "" <> "" is it important here?

Copy link
Author

@jongleb jongleb Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's function applied to Column expression, but unnamed. Like example above but without alias.

SELECT COUNT(column_a) FROM test21 WHERE column_a = @column_a

If we have more unnamed function applied to columns expressiono as columns,

SELECT COUNT(column_a), AVG(column_a) FROM test21 WHERE column_a = @column_a

the case of uniqueness will always be matched here.
we give names at a later stage already during code generation for and only for the arguments in the function for mapping results

@@ -107,10 +107,17 @@ let exists_grouping columns =
List.exists (function Expr (e,_) -> is_grouping e | All | AllOf _ -> false) columns

(* all columns from tables, without duplicates *)
(* FIXME check type of duplicates *)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixme disappeared, i think it is important, at some point we want to warn if there are columns with same name but different types

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll return it. I just thought my PR put a dot to the ambiguity check, and this is why I deleted it.

@jongleb
Copy link
Author

jongleb commented Oct 4, 2024

SELECT id, id
FROM test21 
JOIN test22 ON test21.id = test22.id 
GROUP BY id;

this example is orthogonal to this PR, it is duplicate columns in the result set again, does not test ambiguity.

Removed

@jongleb
Copy link
Author

jongleb commented Oct 4, 2024

it gives the warning on duplicate column names which is valid, but it generates code alright

First of all, you are right, sorry. I didn't see this warning when I wrote tests. Because this warning happens on the later stages, but tests are checked for the earlier stage (parsing + param type inferring). I would say since my PR it happens at the early stage (since in runtime it's runtime error, not warning) + GROUP BY, ORDER BY, WHERE, HAVING statements weren't cheked at all

and make this query valid

UPDATE test, (SELECT * FROM test2) AS x 
SET str = x.str 
WHERE test.id = x.id;

, since

          if not (Sql.Schema.is_unique stmt.schema) then
            Printf.eprintf "Warning: this SQL statement will produce rowset with duplicate column names:\n%s\n" (fst sql);

this warning checks only schema. I Added "Main distinguishing requests" chapter to the description of the PR. Or does it make senee to delete tables above and to make a new one ?

UPD: I updated table, warnings aren't related to WHERE/GROUP BY/ even SET and the rest any statements not from schema (code with warning checks schema)

@jongleb jongleb requested a review from rr0gi October 4, 2024 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants