Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_sql vs parse_plpgsql usage #88

Open
remingtonc opened this issue Aug 25, 2021 · 5 comments
Open

parse_sql vs parse_plpgsql usage #88

remingtonc opened this issue Aug 25, 2021 · 5 comments

Comments

@remingtonc
Copy link

Hello! Nice projection! Trying to determine if this library could be utilized to parse many SQL statements to form an AST of a schema - or at least have the per-statement AST parsed in useful ways to formulate things about the schema. I am uncertain of the difference between the parse_sql and parse_plpgsql functions and how to utilize them correctly. An example...

CREATE OR REPLACE FUNCTION test.test_parse (
    p_time_start timestamptz,
    p_time_end timestamptz,
    p_time_interval interval default NULL
) RETURNS TABLE (
    ts timestamptz,
    arbitrary_return bigint
) AS $$
BEGIN
    -- some comment
    -- some other comment

    IF p_time_interval IS NULL
        THEN p_time_interval := interval_from_start_end(p_time_start, p_time_end);
    END IF;
    RETURN QUERY
    SELECT
        bucket_function(p_time_interval, timestamp) AS ts,
        arbitrary_return
    FROM test.some_table
    WHERE
        start >= p_time_start
        AND end < p_time_end
    GROUP BY 1;
END; $$ LANGUAGE plpgsql SECURITY DEFINER PARALLEL UNSAFE;

Using parse_plpgsql fails:

>>> parse_plpgsql(raw_sql)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/.venv/lib/python3.9/site-packages/pglast/__init__.py", line 31, in parse_plpgsql
    return loads(parse_plpgsql_json(statement))
  File "pglast/parser.pyx", line 346, in pglast.parser.parse_plpgsql_json
pglast.parser.ParseError: "p_time_interval" is not a known variable

Using parse_sql here works, but yields a big blob unparsed:

>>> stmt_ast = parse_sql(raw_sql)
>>> stmt_ast[0].stmt.options[0].arg[0].val
'\nBEGIN\n    -- some comment\n    -- some other comment\n\n    IF p_time_interval IS NULL\n        THEN p_time_interval := interval_from_start_end(p_time_start, p_time_end);\n    END IF;\n    RETURN QUERY\n    SELECT\n        bucket_function(p_time_interval, timestamp) AS ts,\n        arbitrary_return\n    FROM test.some_table\n    WHERE\n        start >= p_time_start\n        AND end < p_time_end\n    GROUP BY 1;\nEND; '

Trying to parse that inner blob as plpgsql similarly does not work:

>>> parse_plpgsql(stmt_ast[0].stmt.options[0].arg[0].val)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/.venv/lib/python3.9/site-packages/pglast/__init__.py", line 31, in parse_plpgsql
    return loads(parse_plpgsql_json(statement))
  File "pglast/parser.pyx", line 346, in pglast.parser.parse_plpgsql_json
pglast.parser.ParseError: syntax error at or near "IF", at index 58

Any tips?

Further - any heuristic to not need to explicitly declare "this function should be parsed as SQL" versus "this function should be parsed as PL/pgSQL", or do I need to decide that per statement?

@lelit
Copy link
Owner

lelit commented Aug 26, 2021 via email

@lelit
Copy link
Owner

lelit commented Oct 9, 2021

As the related issue is now closed, I will try to get at this soon.

lelit added a commit that referenced this issue Oct 9, 2021
lelit added a commit that referenced this issue Oct 9, 2021
This fixes (one aspect of) issue #88.
@lelit
Copy link
Owner

lelit commented Oct 9, 2021

The parse error is now fixed in just released v3.6, but I'm leaving this open to remind me a possible clarification in the doc about other issues:

  • difference between the two main parse functions
  • make it more evident that the support for PLpgSQL is still vaporware

@VaibhaveS
Copy link

@lelit I am confused here as well, when to use either of the two. Can you please clarify.

@lelit
Copy link
Owner

lelit commented Sep 24, 2024

I will try to clarify this in the documentation.
In the meantime, expanding what I said in a previous comment above, the parse_plpgsql function is severely underpowered because while it properly execute the parsing of the statement as PostgreSQL would, what it returns is a a little more of a raw sequence of tokens, not an AST like parse_sql does.

Consider the following examples: they parse the same

from pprint import pprint
from pglast import parse_plpgsql

STMT = """\
CREATE FUNCTION add (a integer, b integer)
RETURNS integer AS $$
BEGIN 
  RETURN a + b;
END;
$$ LANGUAGE plpgsql
"""

as_plpgsql = parse_sql(STMT)
pprint(as_plpgsql)

This prints out

[{'PLpgSQL_function': {'action': {'PLpgSQL_stmt_block': {'body': [{'PLpgSQL_stmt_return': {'expr': {'PLpgSQL_expr': {'parseMode': 2,
                                                                                                                     'query': 'a '
                                                                                                                              '+ '
                                                                                                                              'b'}},
                                                                                           'lineno': 1}}],
                                                         'lineno': 1}},
                       'datums': [{'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'a'}},
                                  {'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'b'}},
                                  {'PLpgSQL_var': {'datatype': {'PLpgSQL_type': {'typname': 'UNKNOWN'}},
                                                   'refname': 'found'}}]}}]

that, as you can see, is just a list of plain Python dictionaries.

If you use parse_sql instead, you obtain a richer representation of the statement:

from pprint import pprint
from pglast import parse_sql

STMT = """\
CREATE FUNCTION add (a integer, b integer)
RETURNS integer AS $$
BEGIN 
  RETURN a + b;
END;
$$ LANGUAGE plpgsql
"""

as_sql = parse_sql(STMT)
pprint([stmt(skip_none=True) for stmt in as_sql])

This emits

[{'@': 'RawStmt',
  'stmt': {'@': 'CreateFunctionStmt',
           'funcname': ({'@': 'String', 'sval': 'add'},),
           'is_procedure': False,
           'options': ({'@': 'DefElem',
                        'arg': ({'@': 'String',
                                 'sval': '\nBEGIN \n  RETURN a + b;\nEND;\n'},),
                        'defaction': {'#': 'DefElemAction',
                                      'name': 'DEFELEM_UNSPEC',
                                      'value': 0},
                        'defname': 'as',
                        'location': 59},
                       {'@': 'DefElem',
                        'arg': {'@': 'String', 'sval': 'plpgsql'},
                        'defaction': {'#': 'DefElemAction',
                                      'name': 'DEFELEM_UNSPEC',
                                      'value': 0},
                        'defname': 'language',
                        'location': 96}),
           'parameters': ({'@': 'FunctionParameter',
                           'argType': {'@': 'TypeName',
                                       'location': 23,
                                       'names': ({'@': 'String',
                                                  'sval': 'pg_catalog'},
                                                 {'@': 'String',
                                                  'sval': 'int4'}),
                                       'pct_type': False,
                                       'setof': False,
                                       'typemod': -1},
                           'mode': {'#': 'FunctionParameterMode',
                                    'name': 'FUNC_PARAM_DEFAULT',
                                    'value': 'd'},
                           'name': 'a'},
                          {'@': 'FunctionParameter',
                           'argType': {'@': 'TypeName',
                                       'location': 34,
                                       'names': ({'@': 'String',
                                                  'sval': 'pg_catalog'},
                                                 {'@': 'String',
                                                  'sval': 'int4'}),
                                       'pct_type': False,
                                       'setof': False,
                                       'typemod': -1},
                           'mode': {'#': 'FunctionParameterMode',
                                    'name': 'FUNC_PARAM_DEFAULT',
                                    'value': 'd'},
                           'name': 'b'}),
           'replace': False,
           'returnType': {'@': 'TypeName',
                          'location': 51,
                          'names': ({'@': 'String', 'sval': 'pg_catalog'},
                                    {'@': 'String', 'sval': 'int4'}),
                          'pct_type': False,
                          'setof': False,
                          'typemod': -1}},
  'stmt_len': 0,
  'stmt_location': 0}]

The parse_plpgsql function is meant to parse more complex procedure language statements, containing PG extensions to the SQL language like loops, conditions and the like. But for now, it is of little utility, until someone implement a proper AST for it, either at the lower libpg_query level, or in pglast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants