Unescape strings at tokenizer level #1885
base: main
Conversation
Force-pushed from 6680c9e to 9f2ceda
The step …
Force-pushed from 27b1579 to 4c266ab
(lp) ubaid@ubaids-MacBook-Pro lpython % cat examples/expr2.py
r"""
Text
123\n\t
"""
(lp) ubaid@ubaids-MacBook-Pro lpython % python -m tokenize examples/expr2.py
0,0-0,0: ENCODING 'utf-8'
1,0-4,3: STRING 'r"""\nText\n123\\n\\t\n"""'
4,3-4,4: NEWLINE '\n'
5,0-5,0: ENDMARKER ''
(lp) ubaid@ubaids-MacBook-Pro lpython % lpython examples/expr2.py --show-ast
(Module
[(Expr
(ConstantStr
"\nText\n123\n\t\n"
()
)
)]
[]
)

In the above example, we have a raw string. For raw strings, escape sequences are treated as regular characters. We see that the current output of `lpython --show-ast` above nevertheless unescapes them, which is incorrect.
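For comparison, CPython itself preserves the backslashes of a raw string through parsing. A minimal check with the standard-library `ast` module (this snippet is an illustration added here, not part of the PR):

```python
import ast

# Parse the raw-string literal r"123\n\t" and inspect the constant.
tree = ast.parse(r'r"123\n\t"')
value = tree.body[0].value.value

# The escape sequences stay as literal characters:
assert value == "123\\n\\t"   # backslash + 'n', backslash + 't'
assert "\n" not in value      # no real newline was produced
print(repr(value))            # -> '123\\n\\t'
```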
Good point. I see two ways forward: …
Force-pushed from 4c266ab to 01633ea
It seems that adding a token for strings is possible, but it leads to many new tokens (raw strings, bytes, raw bytes, formatted strings, raw formatted strings, etc.). For the moment it seems better to proceed with unescaping at the parser level.
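For what it's worth, a minimal Python sketch of the parser-level alternative (the function name and the reduced escape table are hypothetical; the real implementation is C++ and handles Python's full escape set):

```python
# Unescape a string literal's body after tokenization; raw literals
# are returned untouched.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}

def unescape(text: str, is_raw: bool) -> str:
    if is_raw:
        return text  # raw literals keep their backslashes
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert unescape(r"123\n\t", is_raw=False) == "123\n\t"
assert unescape(r"123\n\t", is_raw=True) == r"123\n\t"
```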
I think if you use …
Looks good to me. @Shaikh-Ubaid let me know when this is ready for review.
There is another concern here. For LFortran, when unescaping strings, we need to know the quote used to create the string (…).
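For context on why the quote matters: in Fortran, a quote character is embedded by doubling the delimiter, so unescaping depends on which quote opened the literal. A tiny sketch (the helper name is hypothetical):

```python
# 'it''s' and "it's" denote the same Fortran string; only the
# opening delimiter tells us which doubled quote to collapse.
def fortran_unquote(body: str, quote: str) -> str:
    return body.replace(quote + quote, quote)

assert fortran_unquote("it''s", "'") == "it's"
assert fortran_unquote('say ""hi""', '"') == 'say "hi"'
```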
Force-pushed from ac63787 to 66bf67e
"b'\n\n\\n'" | ||
"b'\n\\n\\\\n'" |
In the example above, the given string is
b'''
\n\\n'''
With respect to this example, it seems that all strings (including raw strings, bytes, raw bytes, etc.) need unescaping at the tokenizer level, and then we need to re-escape them if they are raw strings, bytes, raw bytes, etc. (I think at the parser level).
Why do raw strings have to be escaped?
I think you are right, Sir. I will look into it.
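Indeed, if a raw literal is simply never unescaped, there is nothing to re-escape afterwards. Plain Python shows the two kinds of literal already differ at the source level:

```python
# The cooked literal holds one character (a newline); the raw
# literal holds two characters (backslash + 'n').
assert len("\n") == 1
assert len(r"\n") == 2
assert r"\n" == "\\n"  # a raw literal needs no further processing
```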
Force-pushed from 93afc8b to f1d76f0
@@ -499,7 +499,7 @@
 (Expr
 (JoinedStr
 [(ConstantStr
-    "\\n"
+    "\n"
I think this is a bug.
src/libasr/asdl_cpp.py (outdated)
self.emit("} else {", 2) | ||
self.emit( 's.append("[]");', 3) | ||
self.emit("}", 2) | ||
else: | ||
self.emit('s.append("\\"" + get_escaped_str(x.m_%s) + "\\"");' % field.name, 2) | ||
self.emit('s.append("\\"" + str_escape_c(x.m_%s) + "\\"");' % field.name, 2) |
I would keep this change.
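The actual helper is C++ in libasr, but a rough Python analogue (the escape set shown is an assumption) illustrates what an escape-for-display routine like `str_escape_c` does when the AST is pretty-printed:

```python
# Map control characters back to visible escape sequences so the
# printed AST shows "\n" rather than a literal line break.
_VISIBLE = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", '"': '\\"'}

def str_escape_c(s: str) -> str:
    return "".join(_VISIBLE.get(ch, ch) for ch in s)

assert str_escape_c("a\nb") == "a\\nb"
assert str_escape_c('say "hi"') == 'say \\"hi\\"'
```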
|| token == yytokentype::TK_RAW_STRING
|| token == yytokentype::TK_BYTES
|| token == yytokentype::TK_RAW_BYTES) {
    t = t + " " + "\"" + str_escape_c(yystype.string.str()) + "\"";
I would keep this change (the str_escape_c).
@@ -184,7 +184,7 @@
 (data (;0;) (i32.const 4) "\0c\00\00\00\01\00\00\00")
 (data (;1;) (i32.const 12) " ")
 (data (;2;) (i32.const 16) "\18\00\00\00\01\00\00\00")
-(data (;3;) (i32.const 24) "\0a ")
+(data (;3;) (i32.const 24) "\n ")
Let's keep this change.
See #1902 (comment).
Force-pushed from f1d76f0 to 7e0f33c
Support raw strings token
Support bytes and raw bytes
Add support for unicode, fmt, raw_fmt strings
Force-pushed from 7e0f33c to 20d1db9
This PR currently unescapes at the tokenizer level. (LFortran also currently unescapes at the tokenizer level, since it needs to know the quote used to create the string literal.) The concern with this approach in LPython is that we need to create many new tokens, such as raw strings, unicode strings, formatted strings, raw formatted strings, bytes, and raw bytes, and these also need support at the parser level.
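To make the combinatorics concrete, here is an illustrative inventory of the literal flavours a prefix-aware tokenizer has to name (TK_RAW_STRING, TK_BYTES, and TK_RAW_BYTES appear in this PR's diff; the other names are placeholders):

```python
# One token kind per string-literal flavour (case variants and the
# rb/br spelling differences fold into the same kind):
flavours = {
    "":   "TK_STRING",
    "r":  "TK_RAW_STRING",
    "b":  "TK_BYTES",
    "rb": "TK_RAW_BYTES",
    "f":  "TK_FMT_STRING",
    "rf": "TK_RAW_FMT_STRING",
    "u":  "TK_UNICODE_STRING",
}
print(len(flavours), "token kinds just for string literals")
```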
In the current approach, we unescape all …
Ready.
@@ -129,7 +129,7 @@
 (Expr
 (ConstantStr
 "Text"
-    ()
+    "u"
This change fixes a bug?
It seems that the prefixes "u" and "U" were meaningful in Python 2. In Python 3, all string literals are Unicode by default, so the "u" and "U" prefixes add little or no value (reference: https://chat.openai.com/share/1270fc0a-fb93-4638-822d-3d1619488027).
With respect to the above, I think it might not be a bug fix. Previously, we passed the kind only for the "u" prefix; now we also pass it for the capital "U" prefix.
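A quick check in Python 3 confirms the prefix is accepted but semantically inert:

```python
# Every Python 3 str is Unicode; "u"/"U" survive only for
# Python 2 compatibility and change nothing about the value.
assert u"Text" == "Text" == U"Text"
assert type(u"Text") is str
```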
Thanks for this. I think you might have found and fixed another bug (#1885 (comment)). That bug aside, it does seem this approach is more complicated than doing it in the parser. Let's keep this PR open for a while and think about it. @czgdp1807 let us know your opinion on this approach as well.
What's the status of this? Do we need this?
We need your opinion on it. Please let us know what we should do. We currently handle unescaping at the parser level for LPython (and at the tokenizer level for LFortran). This PR adds support for unescaping at the tokenizer level for LPython. Either approach works for me. Please feel free to close or merge it as required.
If there is no significant benefit to moving un-escaping to the tokenizer level in LPython, then I wouldn't work on it for the time being. There are higher-priority things to do (like bug fixing and performance improvements so that we outperform LPython's competitors), so let's work on those.
Related: lfortran/lfortran#1783