add CodeQL grammar #3906
codeql/CodeQLParser.g4
//ql ::= QL_DOC? moduleBody
ql
: QL_DOC? moduleBody
The build is failing because the new Trash-based testers don't know that ql is the start symbol: it doesn't fit the form of an EOF-terminated rule, e.g., ql: QL_DOC? moduleBody EOF;. With an EOF at the end of the rule, the parser is forced to consume all input; without it, the parser can stop short at a parse error and still return "success".
You should either use an EOF-terminated rule (preferably, as I mention above) or add <entry-point>ql</entry-point> to the desc.xml. See this as an example.
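To make the EOF point concrete, here is a minimal Python sketch. It is not ANTLR or the Trash toolkit; the toy rule `items : 'a'* ;` and the function name `parse_items` are made up for illustration. It only shows why a rule without EOF can report success after consuming just a prefix of the input.

```python
# Toy recognizer for the hypothetical rules
#   items : 'a'* ;        (require_eof=False)
#   items : 'a'* EOF ;    (require_eof=True)
def parse_items(tokens, require_eof):
    """Return (ok, consumed): success flag and number of tokens consumed."""
    pos = 0
    while pos < len(tokens) and tokens[pos] == "a":
        pos += 1
    if require_eof and pos != len(tokens):
        return False, pos  # leftover input is reported as a failure
    return True, pos       # without EOF, stopping early still counts as success


tokens = ["a", "a", "?", "a"]  # "?" is a token the rule cannot match
print(parse_items(tokens, require_eof=False))  # (True, 2)
print(parse_items(tokens, require_eof=True))   # (False, 2)
```

In both calls the recognizer stops at the same place; only the EOF-terminated variant treats the unconsumed input as an error.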
Yes sir. I'll add EOF at the end of the root rule.
The build fails for PHP because of a "symbol conflict". In Antlr, some targets have problems with certain symbol names that other targets don't have a problem with. "CLASS" in the lexer .g4 and "class_" in the parser .g4 are two of those with PHP. These symbols could be renamed to "CLASS_" and "class_" respectively, but after trying that, I see the target "works" quickly for a few inputs (e.g., examples/annotation.ql), but is ridiculously slow for the entire test suite in examples/ (e.g., expressions.ql takes 10s to parse, and then it hangs on formula.ql). The build has a 5-minute time limit for any step, so it will still fail.
The grammar is ambiguous for moduleBody, conjunction_formula, and maybe a few other rules. Rather than try to solve that for PHP, let's just drop PHP from consideration.
So, please remove the "PHP;" from the desc.xml.
Practically all tests have passed except for those related to PHP... From my limited understanding, when it comes to ANTLR, the generated parsers for different languages should have no discrepancies in correctness. Also, I lack experience with PHP, so I'm wondering if you might be able to offer some guidance.
Just remove the "PHP;" from the desc.xml. That tells the tester to just not test the PHP target. The grammar can't work as is for PHP because it has some ambiguity. That isn't a fatal error in itself, since ambiguity is something Antlr can work around, but it doesn't work well for PHP because the target is very slow.
There are a few "useless parentheses" in your grammar. For example, "(expr)?" is the same as "expr?". You'll probably be asked to clean up these minor coding issues, so best to just remove the parentheses where they're not needed. See https://github.com/antlr/grammars-v4/actions/runs/7288003947/job/19859775472#step:15:47 for a list of what the build found.
Okay, I will attempt to clean up all the warnings observed in the tests.
Looking good so far. What's the plan with these unused parser symbols?
You might want to add something to say these are for future use? These rules seem to have a couple of issues: grammars-v4/codeql/CodeQLLexer.g4 Lines 138 to 145 in d90705f
I think it's important to remember that Antlr lexers don't work like normal EBNF. Antlr lexers work by two rules: (1) The longest string is always matched. (2) If two or more rules match a string of the same length, the first one "wins". Lexing is done before the parser runs at all, and a parser rule does not "guide" how the lexer chops up the input into tokens. I ran a check for useless lexer symbols. It also checks whether the string literal is used on the right-hand side of any parser rule. These symbols are not used, nor are their corresponding string literals.
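The two lexer rules above can be sketched with a small Python simulation. This is not ANTLR itself, and the rule names and patterns below (INT, FLOAT, WS) are hypothetical, chosen only to show the effect: when a FLOAT rule also accepts digits without a '.', an earlier INT rule matches the same length and wins the tie, so FLOAT never fires for such input.

```python
import re

# Hypothetical lexer rules, listed in grammar order.
# FLOAT deliberately allows a missing '.' to show the tie with INT.
RULES = [
    ("INT", r"[0-9]+"),
    ("FLOAT", r"[0-9]+(\.[0-9]+)?"),
    ("WS", r"\s+"),
]


def tokenize(text):
    """Split text using longest-match, with ties going to the earliest rule."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # A strictly longer match replaces the current best;
            # on an equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is skipped
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


print(tokenize("42 3.14"))  # [('INT', '42'), ('FLOAT', '3.14')]
```

Note that "42" is tokenized as INT even though FLOAT also matches it, and no parser rule gets a say in that choice.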
Code coverage looks really good--87%! Great set of tests! Here's what my trcover generated for a "heat map" of the grammar.
These rules originated from the specification, but were replaced by related rules for left-recursion and precedence reasons. However, they were never removed from the code. I will remove this part of the code.
Yes. My initial thought, which I hadn't fully worked through, was that a float written without a '.' is still a valid float; I didn't take into account how the lexer prioritizes matches.
The BOOL_LITERAL has not been used, so I will remove it for now and will carefully consider how to add this rule again when it is needed later on.
The AT symbol is used within the ATLOWERID lexer rule; I will replace the '@' character with the rule name AT in the ATLOWERID rule. As for ELLIPSIS, it is indeed not used, and there is no definition for it in the spec either. So, for the sake of clarity and to prevent confusion, would it be best practice not to reserve keywords preemptively?
Thx, an excellent coverage tool! I'll continue to make optimizations based on these coverage statistics.
@teverett All set. All the suggestions I made were implemented.
@flyinox thanks!
add CodeQL grammar
CodeQL grammar built from the CodeQL Specification.
All test cases have been extracted from the QL Language Reference and slightly modified. It should be noted that some of these cases may contain semantic errors; however, the focus here is solely on verifying the correctness of the lexer and parser.