-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/more string #203
Feature/more string #203
Conversation
delete: Delete _interval_definition and replace it by a simple string rule because it add nothing regarding syntax highlighting. feat: Add string syntax of repeated ''.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I find it always confusion when one commit is part of two PRs. Also Some of the suggested changes have nothing to do with scope of the PR.
@dmfay can you take another look at this PR please?
@@ -2427,52 +2444,10 @@ module.exports = grammar({ | |||
$._type, | |||
), | |||
|
|||
interval_definitions: $ => repeat1( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this been removed? I makes the parser too liberal in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make the parser more liberal and in the end I think you want the whole string highlighted as a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was only ever stringified in interval
which does make it feel like a little bit of a reach imo, I think I'm fine with ditching it
@@ -1109,6 +1123,12 @@ module.exports = grammar({ | |||
$.keyword_leakproof, | |||
), | |||
|
|||
function_security: $ => seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of this? Is this related to the string parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related but I come across it while making tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a separate commit for these kind of things in the future. Otherwise git history becomes meaningless.
// This is only used in create function statement, it is not needed to check | ||
// the start tag match the end one. The usage of this syntax in other | ||
// context is done by _dollar_string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use dollar quotes with strings and especially with format()
, even nested format
invocations with their own dollar quotes, inside or outside a function body. I'm having trouble thinking offhand of a case in which they aren't strictly hierarchical though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the only thing I said is $._dollar_string
and $.dollar_quote
don't interfere and there is no need to check the closing $.dollar_quote
to match.
And for format I don't understand, here is how it pase:
select format($$hello %s$$, format($$all the %s$$, $$world$$));
(program
(statement
(select (keyword_select)
(select_expression
(term
value:
(invocation
(object_reference name: (identifier))
(
parameter: (term value: (literal))
parameter: ,
(term
value:
(invocation
(object_reference name: (identifier))
(
parameter: (term value: (literal))
parameter: ,
(term value: (literal))
)))
))))))
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it what you meant:
CREATE FUNCTION public.update_time() RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
RETURN $$abc$$;
END;
$$;
It parse fine:
(program
(statement
alias:
(create_function (keyword_create) (keyword_function)
(object_reference schema: (identifier) . name: (identifier))
( ) (keyword_returns) (keyword_trigger)
(function_language (keyword_language) (identifier))
(function_body (keyword_as) (dollar_quote) (keyword_begin) (keyword_return)
(literal)
; (keyword_end) ; (dollar_quote))))
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it what you meant:
CREATE FUNCTION public.update_time() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN RETURN $$abc$$; END; $$;
right, ideally it shouldn't because the real parser sees the second $$ right after RETURN
, thinks the function body is complete, and doesn't know what to do with the rest of the input:
ERROR: syntax error at or near "abc$$"
LINE 5: RETURN $$abc$$;
your first example is fine because each of the strings is separate, there's no nesting. I'm thinking of something more like your second, or like this case:
create table if not exists mytable (id serial primary key, val text);
do $a$
begin
execute format(
$b$
-- note the escaped %% in the "c" format!
insert into %1$s (val)
select format($c$%%1$L$c$, 'alpha');
$b$,
'mytable'
);
end
$a$;
select * from mytable;
I think your implementation is a big improvement on what we've got! but it's more permissive than Postgres' actual parser in a way that is likely to be a trap for the unwary (and it's split between this dollar_quote and the external dollar_string, so even more complicated) so I think we've got to make that really clear. I'll add a more detailed explanation to scanner.c.
// This is only used in create function statement, it is not needed to check | |
// the start tag match the end one. The usage of this syntax in other | |
// context is done by _dollar_string. | |
// dollar_quote is only used in the `create function` statement. Postgres-style | |
// dollar quoting in other contexts is done by the external _dollar_string. See | |
// scanner.c, and especially the note about parsing of nested dollar quotes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My scanner was a simpler version of https://github.com/m-novikov/tree-sitter-sql/blob/main/src/scanner.cc , which produce 3 tokens , start tag, end tag and string content.
Maybe it is a better choice for nested dollar strings with the same name inside a body function.
And also for language injection https://tree-sitter.github.io/tree-sitter/syntax-highlighting#language-injection.
// TODO Maybe we should do different version of function_body_statement in | ||
// regard to the defined language to match either sql, plsql or | ||
// plpgsql. Currently the function_body_statement support only sql. And | ||
// maybe for other language the function_body should be a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sql functions without begin atomic
are strings already, but it's useful to parse them anyway. Other sql-derived dialects are also useful to parse (I'm leaning on that with pl/pgsql in pdot) but full support is a question mark atm, see also #141 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the inside of $.dollar_quote
... $.dollar_quote
will not parse the first time you put a (pg)plsql for loop for instance.
One easy solution is to parse the whole body as a string, but the highlight will be one of a string.
Currently it generate an error:
CREATE FUNCTION public.update_time() RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
NEW.updated_at = CURRENT_TIMESTAMP;
RETURN NEW;
END;
$$;
(program
(statement
alias:
(create_function (keyword_create) (keyword_function)
(object_reference schema: (identifier) . name: (identifier))
( ) (keyword_returns) (keyword_trigger)
(function_language (keyword_language) (identifier))
(function_body (keyword_as) (dollar_quote) (keyword_begin)
(ERROR . = (keyword_current_timestamp) ;)
(keyword_return)
(field name: (identifier))
; (keyword_end) ; (dollar_quote))))
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An other solution might be:
function_body: $ => choice(
seq(
$._function_return,
';'
),
seq(
$.keyword_begin,
$.keyword_atomic,
repeat1(
seq(
$._function_body_statement,
';',
),
),
$.keyword_end,
),
seq(
$.keyword_as,
$.dollar_quote,
optional(
seq(
$.keyword_declare,
repeat1(
$.function_declaration,
),
),
),
$.keyword_begin,
repeat($.junk)
$.keyword_end,
optional(';'),
$.dollar_quote,
),
seq(
$.keyword_as,
alias(
choice(
$._single_quote_string,
$._double_quote_string,
),
$.literal
),
),
),
junk: $ => choice(
$.all_keyword_expect_end,
$.object_reference,
$.all_operator,
$.all_punctuation,
$.literal,
wrapped_in_parenthesis($.junk),
wrapped_in_bracket($.junk),
wrapped_in_curly_bracket($.junk),
),
It match all without structure but give you base highlight capability (like keyword, number, operator...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually need the "best effort" parsing of pl/pgsql -- even with ERROR nodes it still recognizes regular queries and DML statements. Try it with longer function bodies. Here's a function with if x then (update statement); end if; functioncall()
:
(program [0, 0] - [11, 0]
(statement [0, 0] - [10, 10]
(create_function [0, 0] - [10, 10]
(keyword_create [0, 0] - [0, 6])
(keyword_or [0, 7] - [0, 9])
(keyword_replace [0, 10] - [0, 17])
(keyword_function [0, 18] - [0, 26])
(object_reference [0, 27] - [0, 47]
schema: (identifier [0, 27] - [0, 39])
name: (identifier [0, 40] - [0, 47]))
(function_arguments [0, 47] - [0, 49])
(keyword_returns [1, 1] - [1, 8])
(keyword_trigger [1, 9] - [1, 16])
(function_language [2, 1] - [2, 17]
(keyword_language [2, 1] - [2, 9])
(keyword_plpgsql [2, 10] - [2, 17]))
(function_body [3, 0] - [10, 10]
(keyword_as [3, 0] - [3, 2])
(dollar_quote [3, 3] - [3, 13])
(keyword_begin [3, 14] - [3, 19])
(ERROR [4, 2] - [4, 15]
(keyword_then [4, 11] - [4, 15]))
(statement [5, 4] - [5, 65]
(update [5, 4] - [5, 65]
(keyword_update [5, 4] - [5, 10])
(relation [5, 11] - [5, 32]
(object_reference [5, 11] - [5, 32]
schema: (identifier [5, 11] - [5, 23])
name: (identifier [5, 24] - [5, 32])))
(keyword_set [5, 33] - [5, 36])
(assignment [5, 37] - [5, 65]
left: (field [5, 37] - [5, 41]
name: (identifier [5, 37] - [5, 41]))
right: (literal [5, 44] - [5, 65]))))
(keyword_end [6, 2] - [6, 5])
(ERROR [6, 6] - [9, 3])
(dollar_quote [10, 0] - [10, 10])))))
the errors don't stop it from processing the rest and returning useful output, although in this case the select statement is not recognized after the end if
.
@@ -2427,52 +2444,10 @@ module.exports = grammar({ | |||
$._type, | |||
), | |||
|
|||
interval_definitions: $ => repeat1( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was only ever stringified in interval
which does make it feel like a little bit of a reach imo, I think I'm fine with ditching it
// regard to the defined language to match either sql, plsql or | ||
// plpgsql. Currently the function_body_statement support only sql. And | ||
// maybe for other language the function_body should be a string. | ||
$.identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
test/corpus/functions.txt
Outdated
from knn | ||
where knn.alpha = htable.alpha; | ||
|
||
return $a$test$a$; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return $a$test$a$; | |
return $b$test$b$; |
reusing a
errors:
ERROR: syntax error at or near "test$a$"
LINE 29: return $a$test$a$;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is make compile
to do before make test
, work fine for me.
select $a$A$b$B$b$A$a$ + $$A$$ + $b$B$b$;
(program
(statement
(select (keyword_select)
(select_expression
(term
value:
(binary_expression
left: (binary_expression left: (literal) operator: + right: (literal))
operator: + right: (literal))))))
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry it parses here, but if you copy the statement into psql it's a syntax error -- it's less confusing to have valid sql where possible
test/corpus/functions.txt
Outdated
from knn | ||
where knn.alpha = htable.alpha; | ||
|
||
return $a$test$a$; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry it parses here, but if you copy the statement into psql it's a syntax error -- it's less confusing to have valid sql where possible
// This is only used in create function statement, it is not needed to check | ||
// the start tag match the end one. The usage of this syntax in other | ||
// context is done by _dollar_string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it what you meant:
CREATE FUNCTION public.update_time() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN RETURN $$abc$$; END; $$;
right, ideally it shouldn't because the real parser sees the second $$ right after RETURN
, thinks the function body is complete, and doesn't know what to do with the rest of the input:
ERROR: syntax error at or near "abc$$"
LINE 5: RETURN $$abc$$;
your first example is fine because each of the strings is separate, there's no nesting. I'm thinking of something more like your second, or like this case:
create table if not exists mytable (id serial primary key, val text);
do $a$
begin
execute format(
$b$
-- note the escaped %% in the "c" format!
insert into %1$s (val)
select format($c$%%1$L$c$, 'alpha');
$b$,
'mytable'
);
end
$a$;
select * from mytable;
I think your implementation is a big improvement on what we've got! but it's more permissive than Postgres' actual parser in a way that is likely to be a trap for the unwary (and it's split between this dollar_quote and the external dollar_string, so even more complicated) so I think we've got to make that really clear. I'll add a more detailed explanation to scanner.c.
// This is only used in create function statement, it is not needed to check | |
// the start tag match the end one. The usage of this syntax in other | |
// context is done by _dollar_string. | |
// dollar_quote is only used in the `create function` statement. Postgres-style | |
// dollar quoting in other contexts is done by the external _dollar_string. See | |
// scanner.c, and especially the note about parsing of nested dollar quotes. |
#include "tree_sitter/parser.h" | ||
#include <stdlib.h> | ||
#include <string.h> | ||
#include <wctype.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <wctype.h> | |
#include <wctype.h> | |
// We do not currently guarantee that the start and end tags match. This is a major | |
// difference between tree-sitter and Postgres' own parser. For example, a function | |
// body declared with nested dollar quotes: | |
// | |
// $function$ | |
// select $$1$$; | |
// $function$ | |
// | |
// Postgres' LALR parser can only distinguish nested dollar-quoted strings by the | |
// "tag" between dollar signs ("function" and nothing here). However, tree-sitter's | |
// more powerful LR parser can track the nested dollar quotes with or without tags, | |
// so this function body: | |
// | |
// $$ | |
// select $$1$$; | |
// $$ | |
// | |
// parses successfully with tree-sitter but will cause a syntax error in Postgres, | |
// which will think you meant a string "\n select" with a 1 and some more garbage | |
// after it. |
// TODO Maybe we should do different version of function_body_statement in | ||
// regard to the defined language to match either sql, plsql or | ||
// plpgsql. Currently the function_body_statement support only sql. And | ||
// maybe for other language the function_body should be a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually need the "best effort" parsing of pl/pgsql -- even with ERROR nodes it still recognizes regular queries and DML statements. Try it with longer function bodies. Here's a function with if x then (update statement); end if; functioncall()
:
(program [0, 0] - [11, 0]
(statement [0, 0] - [10, 10]
(create_function [0, 0] - [10, 10]
(keyword_create [0, 0] - [0, 6])
(keyword_or [0, 7] - [0, 9])
(keyword_replace [0, 10] - [0, 17])
(keyword_function [0, 18] - [0, 26])
(object_reference [0, 27] - [0, 47]
schema: (identifier [0, 27] - [0, 39])
name: (identifier [0, 40] - [0, 47]))
(function_arguments [0, 47] - [0, 49])
(keyword_returns [1, 1] - [1, 8])
(keyword_trigger [1, 9] - [1, 16])
(function_language [2, 1] - [2, 17]
(keyword_language [2, 1] - [2, 9])
(keyword_plpgsql [2, 10] - [2, 17]))
(function_body [3, 0] - [10, 10]
(keyword_as [3, 0] - [3, 2])
(dollar_quote [3, 3] - [3, 13])
(keyword_begin [3, 14] - [3, 19])
(ERROR [4, 2] - [4, 15]
(keyword_then [4, 11] - [4, 15]))
(statement [5, 4] - [5, 65]
(update [5, 4] - [5, 65]
(keyword_update [5, 4] - [5, 10])
(relation [5, 11] - [5, 32]
(object_reference [5, 11] - [5, 32]
schema: (identifier [5, 11] - [5, 23])
name: (identifier [5, 24] - [5, 32])))
(keyword_set [5, 33] - [5, 36])
(assignment [5, 37] - [5, 65]
left: (field [5, 37] - [5, 41]
name: (identifier [5, 37] - [5, 41]))
right: (literal [5, 44] - [5, 65]))))
(keyword_end [6, 2] - [6, 5])
(ERROR [6, 6] - [9, 3])
(dollar_quote [10, 0] - [10, 10])))))
the errors don't stop it from processing the rest and returning useful output, although in this case the select statement is not recognized after the end if
.
…string_start_tag, dollar_quoted_string_end_tag and dollar_quoted_string_content_tag. Unfortunatly it doesn't resolve the problem when we want to parse the inner of the string we still can't ensure the start_tag and end_tag are balanced.
…_tag and dollar_quoted_string_end_tag which are balanced and are specialy used in $.create_function. And a token dollar_quoted_string which match a dollar string in one swoop, it is intented to be used in a literal context.
I have rework the scanner, now the $.dollar_quote rules are balanced. |
Also change one of the test to avoid (UNEXPECTED '\n') because it is platform dependant, windows will emit (UNEXPECTED '\r').
thank you! |
No description provided.