Unescape strings at tokenizer level #1885
base: main
Conversation
Force-pushed from 6680c9e to 9f2ceda
The step …
Force-pushed from 27b1579 to 4c266ab
(lp) ubaid@ubaids-MacBook-Pro lpython % cat examples/expr2.py
r"""
Text
123\n\t
"""
(lp) ubaid@ubaids-MacBook-Pro lpython % python -m tokenize examples/expr2.py
0,0-0,0: ENCODING 'utf-8'
1,0-4,3: STRING 'r"""\nText\n123\\n\\t\n"""'
4,3-4,4: NEWLINE '\n'
5,0-5,0: ENDMARKER ''
(lp) ubaid@ubaids-MacBook-Pro lpython % lpython examples/expr2.py --show-ast
(Module
[(Expr
(ConstantStr
"\nText\n123\n\t\n"
()
)
)]
[]
)

In the above example, we have a raw string. For raw strings, escape sequences are treated as regular characters. We see that the current output of `lpython --show-ast` above nevertheless unescapes them, which is incorrect.
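For comparison, CPython itself preserves the backslashes of a raw string through parsing. A minimal check with the standard-library `ast` module (this snippet is an illustration added here, not part of the PR):

```python
import ast

# Parse the raw-string literal r"123\n\t" and inspect the constant.
tree = ast.parse(r'r"123\n\t"')
value = tree.body[0].value.value

# The escape sequences stay as literal characters:
assert value == "123\\n\\t"   # backslash + 'n', backslash + 't'
assert "\n" not in value      # no real newline was produced
print(repr(value))            # -> '123\\n\\t'
```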
Good point. I see two ways forward: …
Force-pushed from 4c266ab to 01633ea
It seems that adding a token for strings is possible, but it leads to many new tokens (raw strings, bytes, raw bytes, formatted strings, raw formatted strings, etc.). For the moment it seems better to proceed with unescaping at the parser level.
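For what it's worth, a minimal Python sketch of the parser-level alternative (the function name and the reduced escape table are hypothetical; the real implementation is C++ and handles Python's full escape set):

```python
# Unescape a string literal's body after tokenization; raw literals
# are returned untouched.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}

def unescape(text: str, is_raw: bool) -> str:
    if is_raw:
        return text  # raw literals keep their backslashes
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert unescape(r"123\n\t", is_raw=False) == "123\n\t"
assert unescape(r"123\n\t", is_raw=True) == r"123\n\t"
```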
I think if you use …
Looks good to me. @Shaikh-Ubaid let me know when this is ready for review.
There is another concern here. For LFortran, when unescaping strings, we need to know the quote used to create the string (…).
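For context on why the quote matters: in Fortran, a quote character is embedded by doubling the delimiter, so unescaping depends on which quote opened the literal. A tiny sketch (the helper name is hypothetical):

```python
# 'it''s' and "it's" denote the same Fortran string; only the
# opening delimiter tells us which doubled quote to collapse.
def fortran_unquote(body: str, quote: str) -> str:
    return body.replace(quote + quote, quote)

assert fortran_unquote("it''s", "'") == "it's"
assert fortran_unquote('say ""hi""', '"') == 'say "hi"'
```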
Force-pushed from ac63787 to 66bf67e
"b'\n\n\\n'" | ||
"b'\n\\n\\\\n'" |
In the example above, the given string is
b'''
\n\\n'''
With respect to this example, it seems that all strings (including raw strings, bytes, raw bytes, etc.) need unescaping at the tokenizer level, and then we need to re-escape them if they are raw strings, bytes, raw bytes, etc. (I think at the parser level).
Why do raw strings have to be escaped?
I think you are right, Sir. I will look into it.
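Indeed, if a raw literal is simply never unescaped, there is nothing to re-escape afterwards. Plain Python shows the two kinds of literal already differ at the source level:

```python
# The cooked literal holds one character (a newline); the raw
# literal holds two characters (backslash + 'n').
assert len("\n") == 1
assert len(r"\n") == 2
assert r"\n" == "\\n"  # a raw literal needs no further processing
```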
Force-pushed from 93afc8b to f1d76f0
@@ -499,7 +499,7 @@
 (Expr
 (JoinedStr
 [(ConstantStr
-    "\\n"
+    "\n"
I think this is a bug.
src/libasr/asdl_cpp.py (outdated)
self.emit("} else {", 2) | ||
self.emit( 's.append("[]");', 3) | ||
self.emit("}", 2) | ||
else: | ||
self.emit('s.append("\\"" + get_escaped_str(x.m_%s) + "\\"");' % field.name, 2) | ||
self.emit('s.append("\\"" + str_escape_c(x.m_%s) + "\\"");' % field.name, 2) |
I would keep this change.
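The actual helper is C++ in libasr, but a rough Python analogue (the escape set shown is an assumption) illustrates what an escape-for-display routine like `str_escape_c` does when the AST is pretty-printed:

```python
# Map control characters back to visible escape sequences so the
# printed AST shows "\n" rather than a literal line break.
_VISIBLE = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", '"': '\\"'}

def str_escape_c(s: str) -> str:
    return "".join(_VISIBLE.get(ch, ch) for ch in s)

assert str_escape_c("a\nb") == "a\\nb"
assert str_escape_c('say "hi"') == 'say \\"hi\\"'
```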
|| token == yytokentype::TK_RAW_STRING
|| token == yytokentype::TK_BYTES
|| token == yytokentype::TK_RAW_BYTES) {
    t = t + " " + "\"" + str_escape_c(yystype.string.str()) + "\"";
I would keep this change (the str_escape_c).
@@ -184,7 +184,7 @@
 (data (;0;) (i32.const 4) "\0c\00\00\00\01\00\00\00")
 (data (;1;) (i32.const 12) " ")
 (data (;2;) (i32.const 16) "\18\00\00\00\01\00\00\00")
-(data (;3;) (i32.const 24) "\0a ")
+(data (;3;) (i32.const 24) "\n ")
Let's keep this change.
See #1902 (comment).
Force-pushed from f1d76f0 to 7e0f33c
Support raw strings token
Support bytes and raw bytes
Add support for unicode, fmt, raw_fmt strings
Force-pushed from 7e0f33c to 20d1db9
This PR currently unescapes at the tokenizer level. (LFortran also currently unescapes at the tokenizer level, since it needs to know the quote used to create the string literal.) The concern with this approach in LPython is that we need to create many new tokens, such as raw strings, unicode strings, formatted strings, raw formatted strings, bytes, and raw bytes, and these also need support at the parser level.
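To make the combinatorics concrete, here is an illustrative inventory of the literal flavours a prefix-aware tokenizer has to name (TK_RAW_STRING, TK_BYTES, and TK_RAW_BYTES appear in this PR's diff; the other names are placeholders):

```python
# One token kind per string-literal flavour (case variants and the
# rb/br spelling differences fold into the same kind):
flavours = {
    "":   "TK_STRING",
    "r":  "TK_RAW_STRING",
    "b":  "TK_BYTES",
    "rb": "TK_RAW_BYTES",
    "f":  "TK_FMT_STRING",
    "rf": "TK_RAW_FMT_STRING",
    "u":  "TK_UNICODE_STRING",
}
print(len(flavours), "token kinds just for string literals")
```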
In the current approach, we unescape all …
Ready.
@@ -129,7 +129,7 @@
 (Expr
 (ConstantStr
 "Text"
-    ()
+    "u"
This change fixes a bug?
It seems that the prefixes "u" and "U" were meaningful in Python 2. In Python 3, all string literals are Unicode by default, so the "u" and "U" prefixes add little or no value (reference: https://chat.openai.com/share/1270fc0a-fb93-4638-822d-3d1619488027).
With respect to the above, I think it might not be a bug fix. Previously, we passed the kind only for the "u" prefix; now we also pass it for the capital "U" prefix.
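A quick check in Python 3 confirms the prefix is accepted but semantically inert:

```python
# Every Python 3 str is Unicode; "u"/"U" survive only for
# Python 2 compatibility and change nothing about the value.
assert u"Text" == "Text" == U"Text"
assert type(u"Text") is str
```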
Thanks for this. I think you might have found and fixed another bug (#1885 (comment)). That bug aside, it does seem this approach is more complicated than doing it in the parser. Let's keep this PR open for a while and think about it. @czgdp1807 let us know your opinion on this approach as well.
What's the status of this? Do we need this?
We need your opinion on it. Please let us know what we should do. We currently handle unescaping at the parser level for LPython (and at the tokenizer level for LFortran). This PR adds support for unescaping at the tokenizer level for LPython. Either approach works for me. Please feel free to close or merge it as required.
If there is no significant benefit to moving un-escaping to the tokenizer level in LPython, then I wouldn't work on it for the time being. There are higher-priority things to do (like bug fixing and performance improvements so that we outperform LPython's competitors), so let's work on those.
Related: lfortran/lfortran#1783