Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CharacterClass: use ICharacterClass where possible #26

Merged

Conversation

mpsijm
Copy link
Contributor

@mpsijm mpsijm commented Jul 5, 2019

Related PRs: metaborg/jsglr#48 and metaborg/spg#1

The main rationale behind this refactoring is that the class CharacterClassSymbol (a subclass of Symbol) was being "abused".
It was used for instance in the lookahead restriction of symbols and reduce actions and in State.doReduces/State.addReduceAction.
Those use sites do not need to know anything related to a grammar symbol, but really only need to know which characters are in the CC.
Therefore, it is better to directly use the ICharacterClass interface at these places. In order to accomplish this, the following changes have been made:

  • union, intersection, and difference used to be called on the CharacterClassFactory, and isEmpty could only be called on CharacterClassSymbol. Now, they can be called directly on ICharacterClass.
  • toAterm used to be a method in the CharacterClassSymbol. It is now also a method in ICharacterClass, where a more specific implementation for the "single" and "range" classes (resolving the FIXME comment).
    • This adds a dependency from tableinterfaces to org.metaborg.terms.

Note that the CharacterClass symbol class has been renamed to CharacterClassSymbol, to reduce the confusion between the grammar symbol and the ICharacterClass data structure.

As a final remark: my autoformatter has sometimes reformatted spacing and import groups, even in classes where I only changed one line. Unfortunately, this does generate some noise in the diff.

The main rationale behind this refactoring is that the class
`CharacterClassSymbol` (a subclass of `Symbol`) was being "abused".
It was used for instance in the lookahead restriction of symbols
and reduce actions and in `State.doReduces`/`State.addReduceAction`.
Those use sites do not need to know anything related to a grammar
symbol, but really only need to know which characters are in the CC.
Therefore, it is better to directly use the `ICharacterClass` interface
at these places. In order to accomplish this, the following changes
have been made:

- `union`, `intersection`, and `difference` used to be called on the
  `CharacterClassFactory`, and `isEmpty` could only be called on
  `CharacterClassSymbol`. Now, they can be called directly on
  `ICharacterClass`.
- `toAterm` used to be a method in the `CharacterClassSymbol`.
  It is now also a method in `ICharacterClass`, where a more specific
  implementation for the "single" and "range" classes (resolving the
  `FIXME` comment).
  - This adds a dependency from `tableinterfaces` to
    `org.metaborg.terms`.

Note that the `CharacterClass` symbol class has been renamed to
`CharacterClassSymbol`, to reduce the confusion between the grammar
symbol and the `ICharacterClass` data structure.

As a final remark: my autoformatter has sometimes reformatted spacing
and import groups, even in classes where I only changed one line.
Unfortunately, this does generate some noise in the diff.
@@ -15,6 +17,8 @@
private boolean containsEOF; // [256]

private int min, max;
// This field is derived from the fields wordX and containsEOF, and is therefore not used in hashCode and equals
private boolean empty;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CharacterClassOptimized() constructor rejects empty character classes, which I think makes sense. We don't need the field and isEmpty can just return false, right?

Copy link
Contributor Author

@mpsijm mpsijm Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CharacterClassOptimized() constructor is useless, now that I look at it, since it always throws an exception 😛 I can remove that one.

However, in the constructor CharacterClassOptimized(long word0, long word1, long word2, long word3, boolean containsEOF, int min, int max), no exceptions are thrown if the character class is empty, so in that case the field is still necessary. Or do you think we should actually throw an exception if it is empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think it makes sense to throw an exception in case an empty optimized character is instantiated. At the point of finalizing character classes, parser generation is done and non-empty character classes are useless. I'm not sure if it could happen at all, but actions for empty character classes could be left out the parse table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed this, and have run the entire build script again to make sure that nothing breaks because of this. I've also added some extra tests.

@jasperdenkers
Copy link
Contributor

Nice work! Besides the minor comment, this looks good to me.

What do you think @udesou?

@jasperdenkers jasperdenkers requested a review from udesou July 5, 2019 15:08
@udesou
Copy link
Contributor

udesou commented Jul 8, 2019

I think that it looks pretty good.
There is indeed a problem with classes that are used for parse table generation, and classes used for parsing. I think @jasperdenkers and I discussed something like that, where the parse table object was filled with noise from parse table generation.
One minor thing that I don't even know if is relevant or not; I think Hendrik found once that the JSGLR1 behaviour for the two productions below A.C1 and A.C2 is different:

A.C1 = 
A.C2 = []

I can't remember exactly was the use case, but if this seems relevant, maybe check with him what he wanted to achieve there, and if your changes do not break such behaviour.
Finally, I'm also changing a few things in the parse table generator to be able to remove some dependencies (from parsers to sdf2table, e.g.). The end goal is to improve the performance of the generator, but hopefully, I can also do some of this decoupling of classes and interfaces in the process.
Again, thanks @mpsijm!

@jasperdenkers
Copy link
Contributor

There is indeed a problem with classes that are used for parse table generation, and classes used for parsing. I think @jasperdenkers and I discussed something like that, where the parse table object was filled with noise from parse table generation.

Yes, the problem is that currently the resulting parse table class (that implements the IParseTable interface and that is an input to the parser) contains data structures from parser generation that are not used anymore after generation. These data structures are expensive to serialize and this slows down e.g. writing/reading parse tables to/from cache in the build system. Ideally, the algorithm to generate the parse table and the data structure to represent the generated parse table are separated.

One minor thing that I don't even know if is relevant or not; I think Hendrik found once that the JSGLR1 behaviour for the two productions below A.C1 and A.C2 is different:
A.C1 =
A.C2 = []
I can't remember exactly was the use case, but if this seems relevant, maybe check with him what he wanted to achieve there, and if your changes do not break such behaviour.

I was not aware of this case. But if it is a problem, I would suspect the problem to be in normalization.

Finally, I'm also changing a few things in the parse table generator to be able to remove some dependencies (from parsers to sdf2table, e.g.). The end goal is to improve the performance of the generator, but hopefully, I can also do some of this decoupling of classes and interfaces in the process.

Great!

@mpsijm
Copy link
Contributor Author

mpsijm commented Jul 8, 2019

One minor thing that I don't even know if is relevant or not; I think Hendrik found once that the JSGLR1 behaviour for the two productions below A.C1 and A.C2 is different:

A.C1 = 
A.C2 = []

I can't remember exactly was the use case, but if this seems relevant, maybe check with him what he wanted to achieve there, and if your changes do not break such behaviour.

I've asked @hendrikvanantwerpen, and he says he does not remember this issue 🙂

@jasperdenkers jasperdenkers merged commit 0c7fe40 into metaborg:develop/jsglr2 Jul 8, 2019
jasperdenkers pushed a commit to metaborg/jsglr that referenced this pull request Jul 8, 2019
jasperdenkers pushed a commit to metaborg/spg that referenced this pull request Jul 8, 2019
See metaborg/sdf@291e625f (PR metaborg/sdf#26).

Also implemented the placeholders using the new character class
representation. Because of this, the `SymbolUtils` class (and its tests)
could be removed.
@mpsijm mpsijm deleted the refactor-characterclass branch July 8, 2019 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants