Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/more string #203

Merged
merged 6 commits into from
Oct 25, 2023
Merged

Feature/more string #203

merged 6 commits into from
Oct 25, 2023

Conversation

antoineB
Copy link
Contributor

No description provided.

delete: Delete _interval_definition and replace it by a simple string rule because it
add nothing regarding syntax highlighting.
feat: Add string syntax of repeated ''.
Copy link
Collaborator

@matthias-Q matthias-Q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I find it always confusion when one commit is part of two PRs. Also Some of the suggested changes have nothing to do with scope of the PR.

@dmfay can you take another look at this PR please?

@@ -2427,52 +2444,10 @@ module.exports = grammar({
$._type,
),

interval_definitions: $ => repeat1(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this been removed? I makes the parser too liberal in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It make the parser more liberal and in the end I think you want the whole string highlighted as a string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was only ever stringified in interval which does make it feel like a little bit of a reach imo, I think I'm fine with ditching it

@@ -1109,6 +1123,12 @@ module.exports = grammar({
$.keyword_leakproof,
),

function_security: $ => seq(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the purpose of this? Is this related to the string parsing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related but I come across it while making tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a separate commit for these kind of things in the future. Otherwise git history becomes meaningless.

@matthias-Q matthias-Q requested a review from dmfay October 15, 2023 19:27
Comment on lines +970 to +972
// This is only used in create function statement, it is not needed to check
// the start tag match the end one. The usage of this syntax in other
// context is done by _dollar_string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use dollar quotes with strings and especially with format(), even nested format invocations with their own dollar quotes, inside or outside a function body. I'm having trouble thinking offhand of a case in which they aren't strictly hierarchical though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the only thing I said is $._dollar_string and $.dollar_quote don't interfere and there is no need to check the closing $.dollar_quote to match.

And for format I don't understand, here is how it pase:

select format($$hello %s$$, format($$all the %s$$, $$world$$));
(program
 (statement
  (select (keyword_select)
   (select_expression
    (term
     value: 
      (invocation
       (object_reference name: (identifier))
       (
       parameter: (term value: (literal))
       parameter: ,
       (term
        value: 
         (invocation
          (object_reference name: (identifier))
          (
          parameter: (term value: (literal))
          parameter: ,
          (term value: (literal))
          )))
       ))))))
 ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it what you meant:

CREATE FUNCTION public.update_time() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  RETURN $$abc$$;
END;
$$;

It parse fine:

(program
 (statement
  alias: 
   (create_function (keyword_create) (keyword_function)
    (object_reference schema: (identifier) . name: (identifier))
    ( ) (keyword_returns) (keyword_trigger)
    (function_language (keyword_language) (identifier))
    (function_body (keyword_as) (dollar_quote) (keyword_begin) (keyword_return)
                   (literal)
     ; (keyword_end) ; (dollar_quote))))
 ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it what you meant:

CREATE FUNCTION public.update_time() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  RETURN $$abc$$;
END;
$$;

right, ideally it shouldn't because the real parser sees the second $$ right after RETURN, thinks the function body is complete, and doesn't know what to do with the rest of the input:

ERROR:  syntax error at or near "abc$$"
LINE 5:   RETURN $$abc$$;

your first example is fine because each of the strings is separate, there's no nesting. I'm thinking of something more like your second, or like this case:

create table if not exists mytable (id serial primary key, val text);
do $a$
begin
  execute format(
    $b$
    -- note the escaped %% in the "c" format!
    insert into %1$s (val)
    select format($c$%%1$L$c$, 'alpha');
    $b$,
    'mytable'
  );
end
$a$;
select * from mytable;

I think your implementation is a big improvement on what we've got! but it's more permissive than Postgres' actual parser in a way that is likely to be a trap for the unwary (and it's split between this dollar_quote and the external dollar_string, so even more complicated) so I think we've got to make that really clear. I'll add a more detailed explanation to scanner.c.

Suggested change
// This is only used in create function statement, it is not needed to check
// the start tag match the end one. The usage of this syntax in other
// context is done by _dollar_string.
// dollar_quote is only used in the `create function` statement. Postgres-style
// dollar quoting in other contexts is done by the external _dollar_string. See
// scanner.c, and especially the note about parsing of nested dollar quotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My scanner was a simpler version of https://github.com/m-novikov/tree-sitter-sql/blob/main/src/scanner.cc , which produce 3 tokens , start tag, end tag and string content.
Maybe it is a better choice for nested dollar strings with the same name inside a body function.
And also for language injection https://tree-sitter.github.io/tree-sitter/syntax-highlighting#language-injection.

Comment on lines +1108 to +1111
// TODO Maybe we should do different version of function_body_statement in
// regard to the defined language to match either sql, plsql or
// plpgsql. Currently the function_body_statement support only sql. And
// maybe for other language the function_body should be a string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sql functions without begin atomic are strings already, but it's useful to parse them anyway. Other sql-derived dialects are also useful to parse (I'm leaning on that with pl/pgsql in pdot) but full support is a question mark atm, see also #141 (comment)

Copy link
Contributor Author

@antoineB antoineB Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the inside of $.dollar_quote ... $.dollar_quote will not parse the first time you put a (pg)plsql for loop for instance.
One easy solution is to parse the whole body as a string, but the highlight will be one of a string.

Currently it generate an error:

CREATE FUNCTION public.update_time() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  NEW.updated_at = CURRENT_TIMESTAMP;
  RETURN NEW;
END;
$$;
(program
 (statement
  alias: 
   (create_function (keyword_create) (keyword_function)
    (object_reference schema: (identifier) . name: (identifier))
    ( ) (keyword_returns) (keyword_trigger)
    (function_language (keyword_language) (identifier))
    (function_body (keyword_as) (dollar_quote) (keyword_begin)
     (ERROR . = (keyword_current_timestamp) ;)
     (keyword_return)
     (field name: (identifier))
     ; (keyword_end) ; (dollar_quote))))
 ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An other solution might be:

    function_body: $ => choice(
      seq(
        $._function_return,
        ';'
      ),
      seq(
        $.keyword_begin,
        $.keyword_atomic,
        repeat1(
          seq(
            $._function_body_statement,
            ';',
          ),
        ),
        $.keyword_end,
      ),
      seq(
        $.keyword_as,
        $.dollar_quote,
        optional(
          seq(
            $.keyword_declare,
            repeat1(
              $.function_declaration,
            ),
          ),
        ),
        $.keyword_begin,
        repeat($.junk)
        $.keyword_end,
        optional(';'),
        $.dollar_quote,
      ),
      seq(
        $.keyword_as,
        alias(
          choice(
            $._single_quote_string,
            $._double_quote_string,
          ),
          $.literal
        ),
      ),
    ),

    junk: $ => choice(
      $.all_keyword_expect_end,
      $.object_reference,
      $.all_operator,
      $.all_punctuation,
      $.literal,
      wrapped_in_parenthesis($.junk),
      wrapped_in_bracket($.junk),
      wrapped_in_curly_bracket($.junk),
    ),

It match all without structure but give you base highlight capability (like keyword, number, operator...).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually need the "best effort" parsing of pl/pgsql -- even with ERROR nodes it still recognizes regular queries and DML statements. Try it with longer function bodies. Here's a function with if x then (update statement); end if; functioncall():

(program [0, 0] - [11, 0]
  (statement [0, 0] - [10, 10]
    (create_function [0, 0] - [10, 10]
      (keyword_create [0, 0] - [0, 6])
      (keyword_or [0, 7] - [0, 9])
      (keyword_replace [0, 10] - [0, 17])
      (keyword_function [0, 18] - [0, 26])
      (object_reference [0, 27] - [0, 47]
        schema: (identifier [0, 27] - [0, 39])
        name: (identifier [0, 40] - [0, 47]))
      (function_arguments [0, 47] - [0, 49])
      (keyword_returns [1, 1] - [1, 8])
      (keyword_trigger [1, 9] - [1, 16])
      (function_language [2, 1] - [2, 17]
        (keyword_language [2, 1] - [2, 9])
        (keyword_plpgsql [2, 10] - [2, 17]))
      (function_body [3, 0] - [10, 10]
        (keyword_as [3, 0] - [3, 2])
        (dollar_quote [3, 3] - [3, 13])
        (keyword_begin [3, 14] - [3, 19])
        (ERROR [4, 2] - [4, 15]
          (keyword_then [4, 11] - [4, 15]))
        (statement [5, 4] - [5, 65]
          (update [5, 4] - [5, 65]
            (keyword_update [5, 4] - [5, 10])
            (relation [5, 11] - [5, 32]
              (object_reference [5, 11] - [5, 32]
                schema: (identifier [5, 11] - [5, 23])
                name: (identifier [5, 24] - [5, 32])))
            (keyword_set [5, 33] - [5, 36])
            (assignment [5, 37] - [5, 65]
              left: (field [5, 37] - [5, 41]
                name: (identifier [5, 37] - [5, 41]))
              right: (literal [5, 44] - [5, 65]))))
        (keyword_end [6, 2] - [6, 5])
        (ERROR [6, 6] - [9, 3])
        (dollar_quote [10, 0] - [10, 10])))))

the errors don't stop it from processing the rest and returning useful output, although in this case the select statement is not recognized after the end if.

@@ -2427,52 +2444,10 @@ module.exports = grammar({
$._type,
),

interval_definitions: $ => repeat1(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was only ever stringified in interval which does make it feel like a little bit of a reach imo, I think I'm fine with ditching it

// regard to the defined language to match either sql, plsql or
// plpgsql. Currently the function_body_statement support only sql. And
// maybe for other language the function_body should be a string.
$.identifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

from knn
where knn.alpha = htable.alpha;

return $a$test$a$;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return $a$test$a$;
return $b$test$b$;

reusing a errors:

ERROR:  syntax error at or near "test$a$"
LINE 29:   return $a$test$a$;

Copy link
Contributor Author

@antoineB antoineB Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is make compile to do before make test, work fine for me.

select $a$A$b$B$b$A$a$ + $$A$$ + $b$B$b$;

(program
 (statement
  (select (keyword_select)
   (select_expression
    (term
     value: 
      (binary_expression
       left: (binary_expression left: (literal) operator: + right: (literal))
       operator: + right: (literal))))))
 ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry it parses here, but if you copy the statement into psql it's a syntax error -- it's less confusing to have valid sql where possible

from knn
where knn.alpha = htable.alpha;

return $a$test$a$;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry it parses here, but if you copy the statement into psql it's a syntax error -- it's less confusing to have valid sql where possible

Comment on lines +970 to +972
// This is only used in create function statement, it is not needed to check
// the start tag match the end one. The usage of this syntax in other
// context is done by _dollar_string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it what you meant:

CREATE FUNCTION public.update_time() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  RETURN $$abc$$;
END;
$$;

right, ideally it shouldn't because the real parser sees the second $$ right after RETURN, thinks the function body is complete, and doesn't know what to do with the rest of the input:

ERROR:  syntax error at or near "abc$$"
LINE 5:   RETURN $$abc$$;

your first example is fine because each of the strings is separate, there's no nesting. I'm thinking of something more like your second, or like this case:

create table if not exists mytable (id serial primary key, val text);
do $a$
begin
  execute format(
    $b$
    -- note the escaped %% in the "c" format!
    insert into %1$s (val)
    select format($c$%%1$L$c$, 'alpha');
    $b$,
    'mytable'
  );
end
$a$;
select * from mytable;

I think your implementation is a big improvement on what we've got! but it's more permissive than Postgres' actual parser in a way that is likely to be a trap for the unwary (and it's split between this dollar_quote and the external dollar_string, so even more complicated) so I think we've got to make that really clear. I'll add a more detailed explanation to scanner.c.

Suggested change
// This is only used in create function statement, it is not needed to check
// the start tag match the end one. The usage of this syntax in other
// context is done by _dollar_string.
// dollar_quote is only used in the `create function` statement. Postgres-style
// dollar quoting in other contexts is done by the external _dollar_string. See
// scanner.c, and especially the note about parsing of nested dollar quotes.

#include "tree_sitter/parser.h"
#include <stdlib.h>
#include <string.h>
#include <wctype.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <wctype.h>
#include <wctype.h>
// We do not currently guarantee that the start and end tags match. This is a major
// difference between tree-sitter and Postgres' own parser. For example, a function
// body declared with nested dollar quotes:
//
// $function$
// select $$1$$;
// $function$
//
// Postgres' LALR parser can only distinguish nested dollar-quoted strings by the
// "tag" between dollar signs ("function" and nothing here). However, tree-sitter's
// more powerful LR parser can track the nested dollar quotes with or without tags,
// so this function body:
//
// $$
// select $$1$$;
// $$
//
// parses successfully with tree-sitter but will cause a syntax error in Postgres,
// which will think you meant a string "\n select" with a 1 and some more garbage
// after it.

Comment on lines +1108 to +1111
// TODO Maybe we should do different version of function_body_statement in
// regard to the defined language to match either sql, plsql or
// plpgsql. Currently the function_body_statement support only sql. And
// maybe for other language the function_body should be a string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually need the "best effort" parsing of pl/pgsql -- even with ERROR nodes it still recognizes regular queries and DML statements. Try it with longer function bodies. Here's a function with if x then (update statement); end if; functioncall():

(program [0, 0] - [11, 0]
  (statement [0, 0] - [10, 10]
    (create_function [0, 0] - [10, 10]
      (keyword_create [0, 0] - [0, 6])
      (keyword_or [0, 7] - [0, 9])
      (keyword_replace [0, 10] - [0, 17])
      (keyword_function [0, 18] - [0, 26])
      (object_reference [0, 27] - [0, 47]
        schema: (identifier [0, 27] - [0, 39])
        name: (identifier [0, 40] - [0, 47]))
      (function_arguments [0, 47] - [0, 49])
      (keyword_returns [1, 1] - [1, 8])
      (keyword_trigger [1, 9] - [1, 16])
      (function_language [2, 1] - [2, 17]
        (keyword_language [2, 1] - [2, 9])
        (keyword_plpgsql [2, 10] - [2, 17]))
      (function_body [3, 0] - [10, 10]
        (keyword_as [3, 0] - [3, 2])
        (dollar_quote [3, 3] - [3, 13])
        (keyword_begin [3, 14] - [3, 19])
        (ERROR [4, 2] - [4, 15]
          (keyword_then [4, 11] - [4, 15]))
        (statement [5, 4] - [5, 65]
          (update [5, 4] - [5, 65]
            (keyword_update [5, 4] - [5, 10])
            (relation [5, 11] - [5, 32]
              (object_reference [5, 11] - [5, 32]
                schema: (identifier [5, 11] - [5, 23])
                name: (identifier [5, 24] - [5, 32])))
            (keyword_set [5, 33] - [5, 36])
            (assignment [5, 37] - [5, 65]
              left: (field [5, 37] - [5, 41]
                name: (identifier [5, 37] - [5, 41]))
              right: (literal [5, 44] - [5, 65]))))
        (keyword_end [6, 2] - [6, 5])
        (ERROR [6, 6] - [9, 3])
        (dollar_quote [10, 0] - [10, 10])))))

the errors don't stop it from processing the rest and returning useful output, although in this case the select statement is not recognized after the end if.

…string_start_tag, dollar_quoted_string_end_tag and

dollar_quoted_string_content_tag.
Unfortunatly it doesn't resolve the problem when we want to parse the inner of
the string we still can't ensure the start_tag and end_tag are balanced.
…_tag and

dollar_quoted_string_end_tag which are balanced and are specialy used in
$.create_function. And a token dollar_quoted_string which match a dollar string
in one swoop, it is intented to be used in a literal context.
@antoineB
Copy link
Contributor Author

I have rework the scanner, now the $.dollar_quote rules are balanced.

Also change one of the test to avoid (UNEXPECTED '\n') because it is platform
dependant, windows will emit (UNEXPECTED '\r').
@dmfay dmfay merged commit 087d5b4 into DerekStride:main Oct 25, 2023
4 checks passed
@dmfay
Copy link
Collaborator

dmfay commented Oct 25, 2023

thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants