
Update the TDL parsing API #168

Closed

goodmami opened this issue Aug 27, 2018 · 1 comment

@goodmami (Member)

#81 made some parsing functions non-public (ones which weren't useful as part of the public API), but we left the tdl.lex() and tdl.tokenize() functions alone, even though they probably have limited utility outside of the tdl module. These functions can be deprecated if we adopt some changes to the API such that comments can be retrieved. Here is one proposal:

  • drop tokenize() and its regex
  • make lex() non-public (_lex()) and have it lex every token, not just the top-level constructions
  • let parse() specify which entities it will yield (by default, for example, ('typedef', 'typeaddendum', 'instance')), such that 'comment' can be included and thus yielded

The parse() function would thus behave something like Python's ElementTree.iterparse().
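For reference, this is the ElementTree.iterparse() pattern being alluded to, alongside a rough sketch of what a TDL analog could look like (the `entities` keyword and the yielded pairs are only illustrations of the proposal, not a settled interface):

```python
import xml.etree.ElementTree as ET

from delphin import tdl

# The existing ElementTree pattern: the caller chooses which events to receive.
for event, elem in ET.iterparse('grammar.xml', events=('start', 'end')):
    print(event, elem.tag)

# A hypothetical TDL analog of the proposal: the caller chooses which entity
# kinds parse() yields, so 'comment' entities become retrievable as well.
for kind, obj in tdl.parse('lexicon.tdl',
                           entities=('typedef', 'typeaddendum', 'comment')):
    if kind == 'comment':
        print('comment:', obj)
    else:
        print(kind, obj)
```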

For actual TDL parsing (with unification, etc.), and not just inspection, a load() function could take all the parsed entities (ignoring comments), construct the type hierarchy, and return some kind of compiled namespace object. There might be separate load_types() and load_instances() functions, like in the LKB, that restrict which kinds of entities can be parsed in a file.
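A minimal sketch of how such load functions could be layered on top of the entity iteration, assuming the hypothetical parse() interface sketched above (all names and signatures here are illustrative):

```python
from delphin import tdl

def load(path, entities=('typedef', 'typeaddendum', 'instance')):
    """Hypothetical: collect parsed entities into a compiled namespace object."""
    namespace = {}
    for kind, obj in tdl.parse(path, entities=entities):
        if kind == 'comment':
            continue
        # Hierarchy construction, unification checks, etc. would happen here.
        namespace[obj.identifier] = obj
    return namespace

def load_types(path):
    """Hypothetical: restrict loading to type definitions and addenda."""
    return load(path, entities=('typedef', 'typeaddendum'))

def load_instances(path):
    """Hypothetical: restrict loading to instance definitions."""
    return load(path, entities=('instance',))
```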

goodmami added this to the v0.9.0 milestone Aug 27, 2018
@goodmami (Member, Author)

The underlying parsing logic will change dramatically, so maybe we should leave parse() as-is, deprecate it, and add a new iterparse() function for the new logic. The load() functions can then make use of the new iterparse().
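A common shape for this kind of transition is to keep the old entry point but emit a DeprecationWarning that points at the new one; a sketch, assuming the new function is named iterparse():

```python
import warnings

def iterparse(source):
    """New-style parsing entry point (placeholder for the reworked logic)."""
    ...

def parse(source):
    """Old-style parsing, kept working but marked for removal."""
    warnings.warn(
        'tdl.parse() is deprecated; use tdl.iterparse() instead',
        DeprecationWarning,
        stacklevel=2,
    )
    # existing old-style parsing logic stays here for now
    ...
```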

goodmami added a commit that referenced this issue Sep 21, 2018
This lexer relies on the group identifiers of a regular expression,
and it has support for multiline patterns (comments and docstrings),
which are parsed separately from the regex (the regex then picks up
where the special parser stops). Yielded tokens include the group
identifier of the current and next tokens (helping with lookahead in
parsing), the token text, and its line number.

Addresses #167 and #168
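A lexer of roughly that shape (named regex groups, one-token lookahead, line numbers) can be sketched as follows; the pattern and token kinds below are illustrative and much simpler than the actual implementation:

```python
import re

# Illustrative token pattern using named groups; the real lexer's groups and
# its special handling of multiline comments/docstrings are more involved.
_token_re = re.compile(
    r'(?P<string>"[^"]*")'                     # double-quoted strings
    r'|(?P<punct>[\[\]<>().,&:=])'             # punctuation and operators
    r'|(?P<identifier>[^\s\[\]<>().,&:="]+)'   # identifiers and other symbols
)

def _lex(text):
    """Yield (group_id, next_group_id, token_text, line_no) with one-token lookahead."""
    tokens = []
    for m in _token_re.finditer(text):
        line_no = text.count('\n', 0, m.start()) + 1
        tokens.append((m.lastgroup, m.group(), line_no))
    for i, (gid, tok, line_no) in enumerate(tokens):
        next_gid = tokens[i + 1][0] if i + 1 < len(tokens) else None
        yield gid, next_gid, tok, line_no
```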
goodmami added a commit that referenced this issue Sep 21, 2018
This adds a lot of code to tdl.py, although much of the old stuff will
be removed in a future release. The new-style parsing is ~36% slower
at reading the ERG's lexicon, but it is better able to deal with
malformed TDL, and it handles docstrings and comments in all valid
places.

Addresses #153, #167, #168, and #170