Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect start_pos / end_pos in the tree #1406

Open
munching opened this issue Apr 9, 2024 · 8 comments
Open

Incorrect start_pos / end_pos in the tree #1406

munching opened this issue Apr 9, 2024 · 8 comments
Labels
discussion Discuss new features or other changes question

Comments

@munching
Copy link

munching commented Apr 9, 2024

Hi,
I'd like to report a possible bug with token's attributes start_pos, end_pos and line.
Below is an example to reproduce.

Grammar:

start: statement_list
statement_list: statement+
statement: variable "=" variable
variable: ID
ID: LETTER (LETTER | DIGIT)*
LETTER: /[a-zA-Z]/
DIGIT: /[0-9]/
%import common.WS
%ignore WS

Input:
myvar1 = myvar2

Python code:

from lark import Lark

with open('grammar.lark') as f:
    grammar = f.read()

with open('input.txt') as f:
    input = f.read()

parser = Lark(grammar, start='start', parser='lalr')
tree = parser.parse(input)

for node in tree.iter_subtrees_topdown():
    print(node.data + " start: " + str(node.data.start_pos) + " end: " + str(node.data.end_pos) + " line: " + str(node.data.line))

I get the following output which is obviously incorrect:

start start: 0 end: 5 line: 1
statement_list start: 22 end: 36 line: 2
statement start: 49 end: 58 line: 3
variable start: 82 end: 90 line: 4
variable start: 82 end: 90 line: 4

The input has only one line and 15 symbols total. Does this look like a bug or am I misusing attributes?
Thank you

@MegaIng
Copy link
Member

MegaIng commented Apr 9, 2024

node.data isn't part of the input, but of the grammar. Use node.meta instead.

@erezsh We really need to change it back so that Tree doesn't take Token, but just their values. Those being Token's results in many issues.

@munching
Copy link
Author

munching commented Apr 9, 2024

Hi @MegaIng
Thank you for the quick response. I totally misunderstood the "data" thing, now it works fine. Thank you very much for helping!

@erezsh
Copy link
Member

erezsh commented Apr 9, 2024

@MegaIng Yeah, makes sense. The tokens of the parsed grammar aren't relevant to the output tree.

@munching
Copy link
Author

munching commented Apr 9, 2024

Not sure what do you mean by that, I'm relatively new to Lark. But trying to access start_pos / end_pos of tokens is my attempt to bring comments into the tree that were ignored on parsing stage. I've done some research and it looks like there isn't a way to easily do that. I'm parsing a Pascal-like language and must use LALR parser because of its speed: my usual input is roughly 350 mb of source code and Earley works unacceptably slow. Tried to rewrite my grammar to not ignore the code but that's probably not possible with LALR. So in my case having tokens with information on where they were found in the input is the last hope of bringing in the comments.

@MegaIng
Copy link
Member

MegaIng commented Apr 9, 2024

@munching We aren't talking about your issue directly. If lark behaved correctly, you would have gotten an AttributeError from accessing those names on .data, which probably would have clued you in faster. That is what we are talking about fixing.

For collecting comments we normally suggest terminal callbacks, but that doesn't easily bring it into the tree, that is a bit of extra work.

@erezsh
Copy link
Member

erezsh commented Apr 9, 2024

@munching
Copy link
Author

munching commented Apr 9, 2024

@erezsh @MegaIng
Thank you very much for the explanations!
I'm actually already using terminal callbacks to collect comments and then knowing their start/end I go through the tree and try to find the "tightest" token that fully enclose my comment. Then the task is to figure out in between what children to put it to. And that's where I was stuck because data.start_pos was giving me seemingly irrelevant numbers :)

@erezsh
Copy link
Member

erezsh commented Apr 9, 2024

There is some code that I wrote once that did something like it.

Maybe I should clean it up and add it to Lark, as a utility function.

I don't know if it will be helpful for you, but this is the code:

def assign_comments(tree, comments):
    nodes_by_line = classify(tree.iter_subtrees(), lambda t: getattr(t.meta, 'line', None))
    nodes_by_line.pop(None,None)
    rightmost_nodes = {line: max(nodes, key=lambda n: n.meta.column) for line, nodes in nodes_by_line.items()}
    leftmost_nodes  = {line: min(nodes, key=lambda n: (n.meta.column, -(n.meta.end_pos - n.meta.start_pos))) for line, nodes in nodes_by_line.items()}

    for c in comments:
        if c.line == c.end_line:
            n = rightmost_nodes[c.end_line]
            assert not hasattr(n.meta, 'inline_comment')
            n.meta.inline_comment = c
        else:
            if c.end_line not in leftmost_nodes:
                # Probably past the end of the file
                # XXX verify this is the case
                continue

            n = leftmost_nodes[c.end_line]
            header_comments = getattr(n.meta, 'header_comments', [])
            n.meta.header_comments = header_comments + [c]

P.S. classify() is basically like itertools.groupby but it returns a dict.

@erezsh erezsh added question discussion Discuss new features or other changes labels Jun 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discuss new features or other changes question
Projects
None yet
Development

No branches or pull requests

3 participants