Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add CodeQL grammar #3906

Merged
merged 5 commits into from
Dec 27, 2023
Merged

add CodeQL grammar #3906

merged 5 commits into from
Dec 27, 2023

Conversation

flyinox
Copy link
Contributor

@flyinox flyinox commented Dec 21, 2023

add CodeQL grammar

CodeQL grammar built from the CodeQL Specification.

All test cases have been extracted from the QL Language Reference and slightly modified. It should be noted that some of these cases may contain semantic errors; however, the focus here is solely on verifying the correctness of the lexer and parser.


//ql ::= QL_DOC? moduleBody
ql
: QL_DOC? moduleBody
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flyinox

The reason the build is failing is because the new Trash-based testers don't know that ql is the start symbol because it doesn't fit the form of an EOF-terminated rule, e.g., ql: QL_DOC? moduleBody EOF;. The use of an EOF at the end of the rule forces the parser to consume all input, whereas without the EOF, the parse can stop short in the parse due to a parse error but still return "success".

You should either use an EOF-terminated rule (preferably as I mention above), or add <entry-point>ql</entry-point> to the desc.xml. See this as an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sir. I'll add EOF at the end of the root rule

Copy link
Contributor

@kaby76 kaby76 Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build fails for PHP because of a "symbol conflict". In Antlr, some targets have problems with certain symbol names that other targets don't have a problem with. "CLASS" in the lexer .g4 and "class_" in the parser .g4 are two of those with PHP. These symbols could be renamed to "CLASS_" and "class_" respectively, but after trying that, I see the target "works" for a few inputs quickly (e.g., examples/annotation.ql), but is ridiculously slow for the entire test suite in examples/ (e.g., expressions.ql takes 10s to parse, and then hangs on formula.ql). The build has a 5 minute time limit for any step, so it will still fail.

The grammar is ambiguous for moduleBody, conjunction_formula, and maybe a few other rules. Rather than try to solve that for PHP, let's just not consider PHP instead.

So, please remove the "PHP;" from the desc.xml.

@flyinox
Copy link
Contributor Author

flyinox commented Dec 21, 2023

@kaby76

Practically all tests have passed except for those related to PHP... From my limited understanding, when it comes to ANTLR, the generated parsers for different languages should have no discrepancies in correctness. Also, I lack experience with PHP, so I'm wondering if you might be able to offer some guidance.

@kaby76
Copy link
Contributor

kaby76 commented Dec 21, 2023

@kaby76

Practically all tests have passed except for those related to PHP... From my limited understanding, when it comes to ANTLR, the generated parsers for different languages should have no discrepancies in correctness. Also, I lack experience with PHP, so I'm wondering if you might be able to offer some guidance.

Just remove the "PHP;" from the desc.xml. That tells the tester to just the test the PHP target. The grammar can't work as is for PHP because it has some ambiguity, which isn't a fatal error in itself since that's one of the great joys Antlr can work around. But, it doesn't work well for PHP because the target is very slow.

@kaby76
Copy link
Contributor

kaby76 commented Dec 21, 2023

There are a few "useless parentheses" in your grammar. For example, "(expr)?" is the same as "expr?". You'll probably be asked to clean up these minor coding issues, so best to just remove the parentheses where they're not needed. See https://github.com/antlr/grammars-v4/actions/runs/7288003947/job/19859775472#step:15:47 for a list of what the build found.

@flyinox
Copy link
Contributor Author

flyinox commented Dec 21, 2023

There are a few "useless parentheses" in your grammar. For example, "(expr)?" is the same as "expr?". You'll probably be asked to clean up these minor coding issues, so best to just remove the parentheses where they're not needed. See https://github.com/antlr/grammars-v4/actions/runs/7288003947/job/19859775472#step:15:47 for a list of what the build found.

Okay, I will attempt to clean up all the warnings observed in the tests.

@kaby76
Copy link
Contributor

kaby76 commented Dec 21, 2023

Looking good so far.


What's the plan with these unused parser symbols?

$ bash /c/Users/Kenne/Documents/GitHub/g4-scripts/unused.sh
No arguments were provided.
Finding unused parser symbols in grammars...
./desc.xml
CodeQLParser.g4:L429: disjunction
                      ^
CodeQLParser.g4:L434: conjunction
                      ^
CodeQLParser.g4:L439: implies
                      ^
CodeQLParser.g4:L593: binop
                      ^
12/21-09:54:23 ~/issues/g4-3906/codeql
$

You might want to add something to say these are for future use?


These rules seem to have a couple of issues:

INT_LITERAL : [0-9]+;
FLOAT_LITERAL: INT_LITERAL ('.' INT_LITERAL)? ;
BOOL_LITERAL: TRUE
| FALSE
;

  • INT_LITERAL shadows FLOAT_LITERAL mostly. So, that means the string 123 would be matched by INT_LITERAL or FLOAT_LITERAL. But, since INT_LITERAL is first listed in the grammar, it matches. On the other hand, '123.0' would be matched only by FLOAT_LITERAL. The question is: why isn't the rule for FLOAT_LITERAL just FLOAT_LITERAL: INT_LITERAL '.' INT_LITERAL;?
  • BOOL_LITERAL is shadowed by TRUE and FALSE. In other words, there is no way BOOL_LITERAL can match anything. Did you mean a parser rule instead bool_literal: TRUE | FALSE;?

I think it's important to remember that Antlr lexers don't work like normal EBNF. Antlr lexers work by two rules: (1) The longest string is always matched. (2) If two or more rules match a string of the same length, the first one "wins". Lexing is done prior to the parser running at all, and a parser rule does not "guide" how the lexer to chop up the input into tokens.


I ran a check for useless lexer symbols. It also checks whether the string literal is used on the right hand side of any parser rule. These symbols are not used, nor are their corresponding string literals.

$ bash /c/Users/Kenne/Documents/GitHub/g4-scripts/unused-lexer.sh
Finding unused lexer symbols in grammars...
./desc.xml
CodeQLLexer.g4:BOOL_LITERAL
CodeQLLexer.g4:AT
CodeQLLexer.g4:ELLIPSIS
12/21-09:55:11 ~/issues/g4-3906/codeql

Rule AT: '@'; isn't referenced in any of the two .g4's, and the string literal '@' doesn't appear in the parser grammar. What is the purpose of the rule? The same for ELLIPSIS: '...';. What is the plan for these symbols?

@kaby76
Copy link
Contributor

kaby76 commented Dec 21, 2023

Code coverage looks really good--87%! Great set of tests! Here's what my trcover generated for a "heat map" of the grammar.
cover.html.txt (You'll have to rename to "cover.html", and then open in a browser.)

@flyinox
Copy link
Contributor Author

flyinox commented Dec 22, 2023

You might want to add something to say these are for future use?

These rules originated from the specification, but were replaced by related rules due to left recursion and precedence reasons. however, they were not removed from the code. I will remove this part of the code.

  • INT_LITERAL shadows FLOAT_LITERAL mostly. So, that means the string 123 would be matched by INT_LITERAL or FLOAT_LITERAL. But, since INT_LITERAL is first listed in the grammar, it matches. On the other hand, '123.0' would be matched only by FLOAT_LITERAL. The question is: why isn't the rule for FLOAT_LITERAL just FLOAT_LITERAL: INT_LITERAL '.' INT_LITERAL;?

Yes, my initial thought, which I didn't fully consider, was that a float representation without a '.' is also a valid way to represent a float, without taking into account the issue of prioritizing matches.

  • BOOL_LITERAL is shadowed by TRUE and FALSE. In other words, there is no way BOOL_LITERAL can match anything. Did you mean a parser rule instead bool_literal: TRUE | FALSE;?

The BOOL_LITERAL has not been used, so I will remove it for now and will carefully consider how to add this rule again when it is needed later on.

Rule AT: '@'; isn't referenced in any of the two .g4's, and the string literal '@' doesn't appear in the parser grammar. What is the purpose of the rule? The same for ELLIPSIS: '...';. What is the plan for these symbols?

The AT symbol is utilized within the ATLOWERID lexer rule; I will replace the '@' character with the rule name AT in the ATLOWERID rule. As for ELLIPSIS, it is indeed not used, and there is no definition for it in the spec either. So, from the perspective of clarity and preventing confusion,would it be best practice not to preemptively reserve some keywords?

@flyinox
Copy link
Contributor Author

flyinox commented Dec 22, 2023

Code coverage looks really good--87%! Great set of tests! Here's what my trcover generated for a "heat map" of the grammar. cover.html.txt (You'll have to rename to "cover.html", and then open in a browser.)

Thx, an excellent coverage tool! I'll continue to make optimizations based on these coverage statistics.

@teverett
Copy link
Member

@flyinox this is a great addition to grammars-v4 and thank-you for the extensive tests! I'll merge this when I hear from @kaby76 and @KvanTTT that it's ready.

@kaby76
Copy link
Contributor

kaby76 commented Dec 23, 2023

@teverett All set. All the suggestions I made were implemented.

@teverett
Copy link
Member

@flyinox thanks!

@teverett teverett merged commit 5320398 into antlr:master Dec 27, 2023
23 checks passed
@KvanTTT KvanTTT added new-grammar New grammar issue or pull request example New example of file(s) parsed by grammar-generated parser labels Jan 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
example New example of file(s) parsed by grammar-generated parser new-grammar New grammar issue or pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants