add CodeQL grammar #3906

flyinox · 2023-12-21T10:04:30Z

add CodeQL grammar

CodeQL grammar built from the CodeQL Specification.

All test cases have been extracted from the QL Language Reference and slightly modified. It should be noted that some of these cases may contain semantic errors; however, the focus here is solely on verifying the correctness of the lexer and parser.

kaby76 · 2023-12-21T12:11:49Z

codeql/CodeQLParser.g4

+
+//ql ::= QL_DOC? moduleBody
+ql
+    : QL_DOC? moduleBody


@flyinox

The reason the build is failing is because the new Trash-based testers don't know that ql is the start symbol because it doesn't fit the form of an EOF-terminated rule, e.g., ql: QL_DOC? moduleBody EOF;. The use of an EOF at the end of the rule forces the parser to consume all input, whereas without the EOF, the parse can stop short in the parse due to a parse error but still return "success".

You should either use an EOF-terminated rule (preferably as I mention above), or add <entry-point>ql</entry-point> to the desc.xml. See this as an example.

Yes sir. I'll add EOF at the end of the root rule

The build fails for PHP because of a "symbol conflict". In Antlr, some targets have problems with certain symbol names that other targets don't have a problem with. "CLASS" in the lexer .g4 and "class_" in the parser .g4 are two of those with PHP. These symbols could be renamed to "CLASS_" and "class_" respectively, but after trying that, I see the target "works" for a few inputs quickly (e.g., examples/annotation.ql), but is ridiculously slow for the entire test suite in examples/ (e.g., expressions.ql takes 10s to parse, and then hangs on formula.ql). The build has a 5 minute time limit for any step, so it will still fail.

The grammar is ambiguous for moduleBody, conjunction_formula, and maybe a few other rules. Rather than try to solve that for PHP, let's just not consider PHP instead.

So, please remove the "PHP;" from the desc.xml.

flyinox · 2023-12-21T12:39:33Z

@kaby76

Practically all tests have passed except for those related to PHP... From my limited understanding, when it comes to ANTLR, the generated parsers for different languages should have no discrepancies in correctness. Also, I lack experience with PHP, so I'm wondering if you might be able to offer some guidance.

kaby76 · 2023-12-21T12:52:46Z

@kaby76

Practically all tests have passed except for those related to PHP... From my limited understanding, when it comes to ANTLR, the generated parsers for different languages should have no discrepancies in correctness. Also, I lack experience with PHP, so I'm wondering if you might be able to offer some guidance.

Just remove the "PHP;" from the desc.xml. That tells the tester to just the test the PHP target. The grammar can't work as is for PHP because it has some ambiguity, which isn't a fatal error in itself since that's one of the great joys Antlr can work around. But, it doesn't work well for PHP because the target is very slow.

kaby76 · 2023-12-21T12:57:26Z

There are a few "useless parentheses" in your grammar. For example, "(expr)?" is the same as "expr?". You'll probably be asked to clean up these minor coding issues, so best to just remove the parentheses where they're not needed. See https://github.com/antlr/grammars-v4/actions/runs/7288003947/job/19859775472#step:15:47 for a list of what the build found.

flyinox · 2023-12-21T13:06:57Z

There are a few "useless parentheses" in your grammar. For example, "(expr)?" is the same as "expr?". You'll probably be asked to clean up these minor coding issues, so best to just remove the parentheses where they're not needed. See https://github.com/antlr/grammars-v4/actions/runs/7288003947/job/19859775472#step:15:47 for a list of what the build found.

Okay, I will attempt to clean up all the warnings observed in the tests.

kaby76 · 2023-12-21T15:19:06Z

Looking good so far.

What's the plan with these unused parser symbols?

$ bash /c/Users/Kenne/Documents/GitHub/g4-scripts/unused.sh
No arguments were provided.
Finding unused parser symbols in grammars...
./desc.xml
CodeQLParser.g4:L429: disjunction
                      ^
CodeQLParser.g4:L434: conjunction
                      ^
CodeQLParser.g4:L439: implies
                      ^
CodeQLParser.g4:L593: binop
                      ^
12/21-09:54:23 ~/issues/g4-3906/codeql
$

You might want to add something to say these are for future use?

These rules seem to have a couple of issues:

grammars-v4/codeql/CodeQLLexer.g4

Lines 138 to 145 in d90705f

    
           INT_LITERAL :     [0-9]+; 
        
           FLOAT_LITERAL:      INT_LITERAL ('.' INT_LITERAL)? ; 
        
           BOOL_LITERAL:       TRUE 
        
                       |       FALSE 
        
                       ;

INT_LITERAL shadows FLOAT_LITERAL mostly. So, that means the string 123 would be matched by INT_LITERAL or FLOAT_LITERAL. But, since INT_LITERAL is first listed in the grammar, it matches. On the other hand, '123.0' would be matched only by FLOAT_LITERAL. The question is: why isn't the rule for FLOAT_LITERAL just FLOAT_LITERAL: INT_LITERAL '.' INT_LITERAL;?
BOOL_LITERAL is shadowed by TRUE and FALSE. In other words, there is no way BOOL_LITERAL can match anything. Did you mean a parser rule instead bool_literal: TRUE | FALSE;?

I think it's important to remember that Antlr lexers don't work like normal EBNF. Antlr lexers work by two rules: (1) The longest string is always matched. (2) If two or more rules match a string of the same length, the first one "wins". Lexing is done prior to the parser running at all, and a parser rule does not "guide" how the lexer to chop up the input into tokens.

I ran a check for useless lexer symbols. It also checks whether the string literal is used on the right hand side of any parser rule. These symbols are not used, nor are their corresponding string literals.

$ bash /c/Users/Kenne/Documents/GitHub/g4-scripts/unused-lexer.sh
Finding unused lexer symbols in grammars...
./desc.xml
CodeQLLexer.g4:BOOL_LITERAL
CodeQLLexer.g4:AT
CodeQLLexer.g4:ELLIPSIS
12/21-09:55:11 ~/issues/g4-3906/codeql

Rule AT: '@'; isn't referenced in any of the two .g4's, and the string literal '@' doesn't appear in the parser grammar. What is the purpose of the rule? The same for ELLIPSIS: '...';. What is the plan for these symbols?

kaby76 · 2023-12-21T16:05:10Z

Code coverage looks really good--87%! Great set of tests! Here's what my trcover generated for a "heat map" of the grammar.
cover.html.txt (You'll have to rename to "cover.html", and then open in a browser.)

flyinox · 2023-12-22T02:42:07Z

You might want to add something to say these are for future use?

These rules originated from the specification, but were replaced by related rules due to left recursion and precedence reasons. however, they were not removed from the code. I will remove this part of the code.

INT_LITERAL shadows FLOAT_LITERAL mostly. So, that means the string 123 would be matched by INT_LITERAL or FLOAT_LITERAL. But, since INT_LITERAL is first listed in the grammar, it matches. On the other hand, '123.0' would be matched only by FLOAT_LITERAL. The question is: why isn't the rule for FLOAT_LITERAL just FLOAT_LITERAL: INT_LITERAL '.' INT_LITERAL;?

Yes, my initial thought, which I didn't fully consider, was that a float representation without a '.' is also a valid way to represent a float, without taking into account the issue of prioritizing matches.

BOOL_LITERAL is shadowed by TRUE and FALSE. In other words, there is no way BOOL_LITERAL can match anything. Did you mean a parser rule instead bool_literal: TRUE | FALSE;?

The BOOL_LITERAL has not been used, so I will remove it for now and will carefully consider how to add this rule again when it is needed later on.

Rule AT: '@'; isn't referenced in any of the two .g4's, and the string literal '@' doesn't appear in the parser grammar. What is the purpose of the rule? The same for ELLIPSIS: '...';. What is the plan for these symbols?

The AT symbol is utilized within the ATLOWERID lexer rule; I will replace the '@' character with the rule name AT in the ATLOWERID rule. As for ELLIPSIS, it is indeed not used, and there is no definition for it in the spec either. So, from the perspective of clarity and preventing confusion,would it be best practice not to preemptively reserve some keywords?

flyinox · 2023-12-22T02:54:52Z

Code coverage looks really good--87%! Great set of tests! Here's what my trcover generated for a "heat map" of the grammar. cover.html.txt (You'll have to rename to "cover.html", and then open in a browser.)

Thx, an excellent coverage tool! I'll continue to make optimizations based on these coverage statistics.

teverett · 2023-12-23T18:03:43Z

@flyinox this is a great addition to grammars-v4 and thank-you for the extensive tests! I'll merge this when I hear from @kaby76 and @KvanTTT that it's ready.

kaby76 · 2023-12-23T18:31:21Z

@teverett All set. All the suggestions I made were implemented.

teverett · 2023-12-27T20:47:13Z

@flyinox thanks!

kaby76 reviewed Dec 21, 2023

View reviewed changes

flyinox force-pushed the master branch from a32d08a to e8f9db8 Compare December 21, 2023 12:12

add CodeQL grammar

16adde8

flyinox force-pushed the master branch from e8f9db8 to 16adde8 Compare December 21, 2023 12:17

dijun.ldj added 2 commits December 21, 2023 20:48

fix symbol conflict in PHP

85d373d

remove PHP target for performance reason

302413e

remove useless parentheses

d90705f

remove ununsed rules

067fcf0

teverett merged commit 5320398 into antlr:master Dec 27, 2023
23 checks passed

KvanTTT added new-grammar New grammar issue or pull request example New example of file(s) parsed by grammar-generated parser labels Jan 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add CodeQL grammar #3906

add CodeQL grammar #3906

flyinox commented Dec 21, 2023

kaby76 Dec 21, 2023

flyinox Dec 21, 2023

kaby76 Dec 21, 2023 •

edited

Loading

flyinox commented Dec 21, 2023

kaby76 commented Dec 21, 2023

kaby76 commented Dec 21, 2023

flyinox commented Dec 21, 2023

kaby76 commented Dec 21, 2023

kaby76 commented Dec 21, 2023

flyinox commented Dec 22, 2023 •

edited

Loading

flyinox commented Dec 22, 2023

teverett commented Dec 23, 2023

kaby76 commented Dec 23, 2023 •

edited

Loading

teverett commented Dec 27, 2023

add CodeQL grammar #3906

add CodeQL grammar #3906

Conversation

flyinox commented Dec 21, 2023

kaby76 Dec 21, 2023

Choose a reason for hiding this comment

flyinox Dec 21, 2023

Choose a reason for hiding this comment

kaby76 Dec 21, 2023 • edited Loading

Choose a reason for hiding this comment

flyinox commented Dec 21, 2023

kaby76 commented Dec 21, 2023

kaby76 commented Dec 21, 2023

flyinox commented Dec 21, 2023

kaby76 commented Dec 21, 2023

kaby76 commented Dec 21, 2023

flyinox commented Dec 22, 2023 • edited Loading

flyinox commented Dec 22, 2023

teverett commented Dec 23, 2023

kaby76 commented Dec 23, 2023 • edited Loading

teverett commented Dec 27, 2023

kaby76 Dec 21, 2023 •

edited

Loading

flyinox commented Dec 22, 2023 •

edited

Loading

kaby76 commented Dec 23, 2023 •

edited

Loading