
Update the TDL parsing API #168

Closed

goodmami opened this issue Aug 27, 2018 · 1 comment

@goodmami (Member)

#81 made some parsing functions non-public (ones which weren't useful as part of the public API), but we left the tdl.lex() and tdl.tokenize() functions alone, even though they probably have limited utility outside of the tdl module. These functions can be deprecated if we adopt some changes to the API such that comments can be retrieved. Here is one proposal:

  • drop tokenize() and its regex
  • make lex() non-public (_lex()) and have it lex every token, not just the top-level constructions
  • let parse() specify which entities it will yield (by default, for example, ('typedef', 'typeaddendum', 'instance')), such that 'comment' can be included and thus yielded

The parse() function would thus behave something like Python's ElementTree.iterparse().
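For reference, this is the ElementTree.iterparse() pattern being alluded to, alongside a rough sketch of what a TDL analog could look like (the `entities` keyword and the yielded pairs are only illustrations of the proposal, not a settled interface):

```python
import xml.etree.ElementTree as ET

from delphin import tdl

# The existing ElementTree pattern: the caller chooses which events to receive.
for event, elem in ET.iterparse('grammar.xml', events=('start', 'end')):
    print(event, elem.tag)

# A hypothetical TDL analog of the proposal: the caller chooses which entity
# kinds parse() yields, so 'comment' entities become retrievable as well.
for kind, obj in tdl.parse('lexicon.tdl',
                           entities=('typedef', 'typeaddendum', 'comment')):
    if kind == 'comment':
        print('comment:', obj)
    else:
        print(kind, obj)
```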

For actual TDL parsing (with unification, etc.), and not just inspection, a load() function could take all the parsed entities (ignoring comments), construct the type hierarchy, and return some kind of compiled namespace object. There might be separate load_types() and load_instances() functions, like in the LKB, that restrict which kinds of entities can be parsed in a file.
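A minimal sketch of how such load functions could be layered on top of the entity iteration, assuming the hypothetical parse() interface sketched above (all names and signatures here are illustrative):

```python
from delphin import tdl

def load(path, entities=('typedef', 'typeaddendum', 'instance')):
    """Hypothetical: collect parsed entities into a compiled namespace object."""
    namespace = {}
    for kind, obj in tdl.parse(path, entities=entities):
        if kind == 'comment':
            continue
        # Hierarchy construction, unification checks, etc. would happen here.
        namespace[obj.identifier] = obj
    return namespace

def load_types(path):
    """Hypothetical: restrict loading to type definitions and addenda."""
    return load(path, entities=('typedef', 'typeaddendum'))

def load_instances(path):
    """Hypothetical: restrict loading to instance definitions."""
    return load(path, entities=('instance',))
```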

goodmami added this to the v0.9.0 milestone Aug 27, 2018
@goodmami (Member, Author)

The underlying parsing logic will change dramatically, so maybe we should leave parse() as-is, deprecate it, and add a new iterparse() function for the new logic. The load() functions can then make use of the new iterparse().
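A common shape for this kind of transition is to keep the old entry point but emit a DeprecationWarning that points at the new one; a sketch, assuming the new function is named iterparse():

```python
import warnings

def iterparse(source):
    """New-style parsing entry point (placeholder for the reworked logic)."""
    ...

def parse(source):
    """Old-style parsing, kept working but marked for removal."""
    warnings.warn(
        'tdl.parse() is deprecated; use tdl.iterparse() instead',
        DeprecationWarning,
        stacklevel=2,
    )
    # existing old-style parsing logic stays here for now
    ...
```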

goodmami added a commit that referenced this issue Sep 21, 2018
This lexer relies on the group identifiers of a regular expression,
and it has support for multiline patterns (comments and docstrings),
which are parsed separately from the regex (the regex then picks up
where the special parser stops). Yielded tokens include the group
identifier of the current and next tokens (helping with lookahead in
parsing), the token text, and its line number.

Addresses #167 and #168
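A lexer of roughly that shape (named regex groups, one-token lookahead, line numbers) can be sketched as follows; the pattern and token kinds below are illustrative and much simpler than the actual implementation:

```python
import re

# Illustrative token pattern using named groups; the real lexer's groups and
# its special handling of multiline comments/docstrings are more involved.
_token_re = re.compile(
    r'(?P<string>"[^"]*")'                     # double-quoted strings
    r'|(?P<punct>[\[\]<>().,&:=])'             # punctuation and operators
    r'|(?P<identifier>[^\s\[\]<>().,&:="]+)'   # identifiers and other symbols
)

def _lex(text):
    """Yield (group_id, next_group_id, token_text, line_no) with one-token lookahead."""
    tokens = []
    for m in _token_re.finditer(text):
        line_no = text.count('\n', 0, m.start()) + 1
        tokens.append((m.lastgroup, m.group(), line_no))
    for i, (gid, tok, line_no) in enumerate(tokens):
        next_gid = tokens[i + 1][0] if i + 1 < len(tokens) else None
        yield gid, next_gid, tok, line_no
```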
goodmami added a commit that referenced this issue Sep 21, 2018
This adds a lot of code to tdl.py, although much of the old stuff will
be removed in a future release. The new-style parsing is ~36% slower
at reading the ERG's lexicon, but it is better able to deal with
malformed TDL, and it handles docstrings and comments in all valid
places.

Addresses #153, #167, #168, and #170